Mining disjunctive consequent association rules

Ding-An Chiang a, Yi-Fan Wang b, Yi-Hsin Wang c, Zhi-Yang Chen a, Mei-Hua Hsu d

a Department of Computer Science and Information Engineering, Tamkang University, Taiwan, ROC
b Institute of Information Science and Management, National Taipei College of Business, Taiwan, ROC
c Department of Information Management, Chang Gung Institute of Technology, Taiwan, ROC
d Department of General Education, Chang Gung Institute of Technology, 261, Wen-Hwa 1st Road, Kwei-Shan, Taoyuan, Taiwan, ROC

Article history: Received 14 February 2008; received in revised form 29 April 2010; accepted 25 July 2010; available online 11 August 2010.

Keywords: Data mining; Association rule; Disjunctive composite items

Abstract

When the association rules A → B and A → C cannot be discovered from a database, it does not mean that A → B ∨ C cannot be discovered from the same database. In fact, when A, B, or C is a newly marketed product, A → B ∨ C can be a very useful rule in some cases. Since the consequent of this kind of rule is formed by a disjunctive composite item, we call this type of rule a disjunctive consequent association rule. We therefore propose a simple but efficient algorithm to discover rules of this type. Moreover, when we applied our algorithm to insurance policies for cross selling, the usefulness of the results was confirmed by the insurance company.

1. Introduction

Within data mining, association rules are among the most frequently adopted techniques [1]. The earliest association rule algorithms were introduced in [2,3], and general association rules are usually evaluated by support and confidence. Thereafter, numerous algorithms based on the Apriori algorithm were proposed [6,10–14]. For example, Park et al. proposed the DHP algorithm [10,11], which uses a hash table to effectively reduce the number of candidate itemsets generated, especially the candidate 2-itemsets. Furthermore, the DIC algorithm [4] divides the database into several sections of equal size; new candidate itemsets can be added at the beginning of each section, which reduces the time needed to gather the counts. In 2002, Lin and Kedem [8] proposed the Pincer-Search algorithm to quickly discover the maximum frequent itemsets when the frequent itemsets are long. Numerous association rule algorithms adopt the concept of either upward or downward closure to reduce the combinatorial search space [9]. Upward closure means that whenever an itemset violates the frequency requirement (e.g., the minimum support preset by users), every superset of that itemset also violates the frequency requirement. Conversely, downward closure means that whenever an itemset is frequent, every subset of that itemset


is also frequent. We also adopt these concepts in our algorithm to reduce the combinatorial search space.

General association rules are usually evaluated by support and confidence. However, in real life, support and confidence sometimes fail to reflect actual market shares. For example, products are not all marketed at the same time, so products marketed later have relatively much lower support or confidence. Although we may take great interest in these newly marketed products, the association rules involving them cannot be discovered because their support or confidence cannot reach the minimum support and confidence thresholds. Another cause of low support and confidence is an excessive number of products: when purchases are spread over many similar items, support and confidence drop, and the related association rules cannot be found. To address this problem, we can mine association rules with taxonomies over the items [5]. Taking shoes as an example, we can classify shoe products into the categories of sport shoes and leather shoes, and these two categories can be further subdivided into various brands; such classifications are typically handled by multilevel association rules [5]. For the same shoe products, we can draw a tree such as the one in Fig. 1. Under the traditional notion of association rules, if the support of A → Bi cannot reach the minimum support or confidence, where 1 ≤ i ≤ n, the consequent of the association rule regresses to the category of sport shoes. However, the category of sport shoes is much broader than an individual shoe brand; we would therefore prefer association rules that express more detailed features.


Fig. 1. The classification of shoes.

Fig. 2. Apriori-gen function.

Fig. 4. The naive disjunctive consequent association rules algorithm.

Fig. 3. To count the frequency of Ck .

To solve the above problems, Ye and Keane proposed an algorithm for mining association rules with disjunctive composite items [18]. A disjunctive composite item of length i consists of a set of atomic items with the form a1 ∨ · · · ∨ ai, where each aj (j = 1, ..., i) is an atomic item. Their algorithm allows large itemsets to contain composite items and does not require the users to provide a taxonomy; users need only provide the items they are interested in, and the algorithm discovers all possible association rules with composite items. Different applications may need different algorithms; our algorithm was initially developed for an insurance company. When recommending newly marketed merchandise to customers for the first time, we want the number of merchandise types in the consequent of an association rule to be as small as possible. Therefore, our algorithm generates only useful disjunctive consequent association rules. The algorithm merges several items into a disjunctive composite item that serves as the consequent of the rule. Thus, in Fig. 1, even if the support or confidence of A → Bi cannot satisfy the minimum support and confidence, the disjunctive consequent association rule A → Bj ∨ · · · ∨ Bk, whose support and confidence do satisfy the minimum thresholds, may still be available. Consequently, the discovered multilevel association rules become more meaningful.

2. Apriori algorithm

The association rule algorithm is mainly used to find relations among items or features that occur together in the database. For the exploration of association rules, many researchers take the Apriori algorithm [2] as the basic formulation. As shown in Figs. 2 and 3, Apriori is a bottom-up algorithm: it starts from the large 1-itemsets and gradually extends toward the large k-itemsets. Initially, the Apriori algorithm scans the database and counts the appearance frequency of each item, i.e., the candidate 1-itemsets. If the appearance frequency of an item equals or exceeds the specified minimum support, the item becomes a large 1-itemset. At level k − 1, we not only find the large (k − 1)-itemsets but also apply the apriori-gen function to create the candidate k-itemsets Ck. For example, let the large 3-itemsets be {{ABC}, {ABD}, {ACD}, {ACE}, {BCD}}; after the join step of the apriori-gen function, the candidate 4-itemsets are {{ABCD}, {ACDE}}. However, as shown in Fig. 2, the prune step deletes the itemset {ACDE} because the itemset {ADE} does not appear in the large 3-itemsets. Therefore, the only remaining candidate 4-itemset is {ABCD}. After that, as shown in Fig. 3, we scan the database to count the appearance frequency of {ABCD}.
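For illustration, the following is a minimal Python sketch of the apriori-gen candidate generation described above; since the pseudocode of Fig. 2 is not reproduced in the text, the function name apriori_gen and its exact structure are assumptions rather than the authors' implementation.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Generate the candidate k-itemsets C_k from the large (k-1)-itemsets."""
    prev = sorted(tuple(sorted(itemset)) for itemset in large_prev)
    prev_set = {frozenset(itemset) for itemset in large_prev}
    k = len(prev[0]) + 1

    # Join step: combine two (k-1)-itemsets that share their first k-2 items.
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:
                candidates.add(frozenset(a) | {b[-1]})

    # Prune step: drop any candidate with a (k-1)-subset that is not large.
    return {c for c in candidates
            if all(frozenset(s) in prev_set for s in combinations(c, k - 1))}

# The paper's example: large 3-itemsets {ABC, ABD, ACD, ACE, BCD}.
L3 = [set("ABC"), set("ABD"), set("ACD"), set("ACE"), set("BCD")]
print(apriori_gen(L3))  # {frozenset({'A', 'B', 'C', 'D'})}
```

On this example the join step yields {ABCD, ACDE} and the prune step removes ACDE because ADE is not a large 3-itemset, matching the walkthrough above.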

3. The naive disjunctive consequent association rules algorithm

In the commercial environment, commodities are usually introduced into the market over time rather than promoted to consumers all at once. In general, this time factor gives earlier marketed items higher support, while later marketed items obtain correspondingly lower support. Computed over such transaction records, support cannot thoroughly represent the market shares of the items, and because the support does not reach the minimum support threshold, the corresponding association rules cannot be created. Focusing on the later marketed items, we want to make the items independent of the time factor so that their support does not fail to reach the minimum support merely because the items are new. Thus, for each specific item A, we extract from the database the transactions that contain item A. Since all the selected transactions contain the specific item A, the corresponding association rules A → B are not affected by the deficient, time-caused market share of item A. Consequently, according to the definitions of the conventional


support and confidence of association rules, when all the selected transactions contain item A, the support and confidence of the association rule A → B with respect to these selected transactions reduce to:

Support(A → B) = |B| / |All the selected transactions|

Confidence(A → B) = |B| / |All the selected transactions|

where |B| is the number of selected transactions that include product B. In this paper the support formula equals the confidence formula because we extract only the transactions that contain the itemset of interest: both measures are computed over the selected transactions alone and are unaffected by the other transactions, so support and confidence of the rule A → B coincide in this special case.

Our aim is to discover disjunctive consequent association rules with a specific item A as the antecedent. Finding disjunctive consequent association rules behaves quite differently from the concept of downward closure [8]: when A → B is a disjunctive consequent association rule on the selected transactions, every superset of the disjunctive composite item B is also the disjunctive composite item of another disjunctive consequent association rule in the same database. Moreover, from a commercial point of view, we hope to promote package sales of interesting items together to increase consumers' desire to purchase, and we wish to promote the most interesting items using packages with the fewest items. Accordingly, we give the following definition.

Definition 1. Let A → B be a disjunctive consequent association rule and B be a disjunctive composite item. When C is a superset of B, the disjunctive consequent association rule A → C is a useless rule.

For example, when A → (B ∨ C) is a disjunctive consequent association rule, even if item D never occurs in the selected transactions, the support and confidence of A → (B ∨ C ∨ D) must be greater than or equal to those of A → (B ∨ C). Therefore, the support and confidence of A → (B ∨ C ∨ D) must satisfy the minimum support and confidence thresholds; in other words, A → (B ∨ C ∨ D) is also a disjunctive consequent association rule with respect to the same transactions. However, by the above definition we regard A → (B ∨ C ∨ D) as a useless rule. As this example shows, once the rule A → (B ∨ C) has been discovered on the selected transactions, the disjunctive composite item B ∨ C never needs to be combined with other items to create new disjunctive composite items; B ∨ C is therefore called a disqualified item for creating new disjunctive composite items. Let the candidate composite k-itemset Ck be the set of disjunctive composite items of length k. Since the candidate composite k-itemset Ck is generated by Ck−1 × C1, in order to avoid generating useless rules we must first eliminate the disqualified items from Ck−1 before generating the new disjunctive composite items, and only qualified items are used to create them. Consequently, only useful rules are generated by Definition 1. Moreover, since we wish to promote the most interesting items with packages containing the fewest items, we can also limit the length of the disjunctive composite items to avoid composite items consisting of too many items. Accordingly, the naive disjunctive consequent association rules algorithm is shown in Fig. 4.
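To make the procedure concrete, here is a minimal Python sketch of the naive algorithm described above, assuming the usual level-wise search; the pseudocode of Fig. 4 is not reproduced in the text, so the function name mine_naive, the parameter max_len, and the per-candidate superset check are illustrative assumptions rather than the authors' exact implementation.

```python
def mine_naive(transactions, antecedent, atomic_items, min_conf, max_len=3):
    """Naive disjunctive consequent association rule mining (Definition 1).

    transactions : list of sets of items
    antecedent   : the specific item A of interest
    atomic_items : candidate consequent items supplied by the user
    min_conf     : minimum confidence (= minimum support on the selected transactions)
    max_len      : maximum length of the disjunctive composite consequent
    """
    # Keep only the transactions that contain the antecedent A.
    selected = [t for t in transactions if antecedent in t]
    if not selected:
        return []

    def confidence(composite):
        # Fraction of selected transactions containing at least one item of the disjunction.
        return sum(1 for t in selected if t & composite) / len(selected)

    rules = []                                  # (composite consequent, confidence)
    c1 = [frozenset([a]) for a in atomic_items if a != antecedent]
    current = list(c1)                          # candidate composite k-itemsets, k = 1

    for k in range(1, max_len + 1):
        qualified = []                          # composites that may still be extended
        for comp in current:
            conf = confidence(comp)
            if conf >= min_conf:
                rules.append((comp, conf))      # rule found: comp is disqualified (Def. 1)
            else:
                qualified.append(comp)
        if k == max_len:
            break
        # C_{k+1} = qualified members of C_k x C_1, checking every candidate against the
        # rules already found; this per-candidate check is the cost of the naive method.
        next_cands = set()
        for comp in qualified:
            for atom in c1:
                cand = comp | atom
                if len(cand) == k + 1 and not any(rc <= cand for rc, _ in rules):
                    next_cands.add(cand)
        current = list(next_cands)
    return rules
```

For example, mine_naive(db, "HCB20", riders, 0.75) would search for disjunctive consequents for the HCB20 policy; db and riders are hypothetical placeholders for the transaction data and the rider items of interest.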

Fig. 5. The efficient disjunctive consequent association rules algorithm.

4. Improved disjunctive consequent association rules algorithm

Even though Definition 1 reduces the number of item combinations, we still need to spend time checking the items in the candidate composite itemsets to prevent the formation of useless or repeated rules. In addition, when recommending merchandise, we want the number of merchandise types in the consequent of an association rule to be as small as possible. Therefore, if a merchandise item already exists in some rule with a shorter composite consequent, that item should not appear in any rule with a longer composite consequent. We define Definition 2 based on this observation.

Definition 2. Let A → B ∨ C be a composite consequent association rule. Then A → B ∨ X1 ∨ X2 ∨ · · · ∨ Xn and A → C ∨ Y1 ∨ Y2 ∨ · · · ∨ Ym are useless association rules, where Xi and Yj are atomic items, i = 1, ..., n, j = 1, ..., m, n ≥ 1, and m ≥ 1.

According to the above definition, if A → B ∨ C exists, then neither B nor C needs to be combined with other items to form a longer composite consequent association rule; the items B and C are therefore called disqualified items. Let the candidate composite k-itemset Ck be the set of disjunctive composite items of length k. Since the candidate composite k-itemset Ck is generated by Ck−1 × C1, in order to avoid generating useless rules we must first eliminate the disqualified items from both Ck−1 and C1 before generating the new disjunctive composite items. Accordingly, the efficient disjunctive consequent association rules algorithm is defined in Fig. 5.
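As with the naive version, the following is a minimal Python sketch of the improved procedure under Definition 2; Fig. 5 is not reproduced in the text, so the structure and the name mine_improved are illustrative assumptions. The key difference is that atomic items appearing in any discovered rule are removed from C1 (and from the composites still to be extended), so no per-candidate superset check is needed.

```python
def mine_improved(transactions, antecedent, atomic_items, min_conf, max_len=3):
    """Disjunctive consequent rule mining using Definition 2."""
    selected = [t for t in transactions if antecedent in t]
    if not selected:
        return []

    def confidence(composite):
        return sum(1 for t in selected if t & composite) / len(selected)

    rules = []
    c1 = {frozenset([a]) for a in atomic_items if a != antecedent}
    current = set(c1)

    for k in range(1, max_len + 1):
        qualified, used_atoms = set(), set()
        for comp in current:
            conf = confidence(comp)
            if conf >= min_conf:
                rules.append((comp, conf))
                used_atoms |= comp            # Definition 2: these atoms are disqualified
            else:
                qualified.add(comp)
        # Eliminate disqualified atomic items from C1 and from the composites to extend,
        # so they can never reappear in a longer composite consequent.
        c1 = {a for a in c1 if not (a & used_atoms)}
        qualified = {c for c in qualified if not (c & used_atoms)}
        if k == max_len or not c1:
            break
        current = {comp | atom for comp in qualified for atom in c1
                   if len(comp | atom) == k + 1}
    return rules
```

Because disqualified atoms never re-enter C1, no per-candidate check against previously discovered rules is required, which is the source of the speedup reported in the efficiency analysis below.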

5. Experiments

Main insurance policies are usually accompanied by insurance riders for cross selling. Whenever a main insurance policy can be paired with riders that appeal to customers, promotion performance improves. Therefore, our main aim is to explore combinations of main insurance policies and attached riders to enhance the opportunity for cross selling.

Fig. 6. The resultant rules.

In this paper, we use the transaction records of an international insurance company in Taiwan (the company's name and the policy names have been changed under a non-disclosure agreement). From January 1992 to July 2005, the company recorded about 1,200,000 transactions. To focus on the items the company is interested in, we select some newly marketed main insurance policies from specific policy categories as our goal items. These main policies had only recently become available; for example, the earliest case of the variable life insurance policy HCB20 was dated February 2005, with 692 transactions. Therefore, we disregard the support value and set only a minimum confidence threshold of 75%. However, mining with traditional association rules, we cannot find any association rule related to these new products. These new products are the focus products of the company; failing to find useful rules for them could cause the company a great loss, and such a result was quite different from our initial expectation. The reason the association rules for these new products cannot be discovered is that there are too many riders for these products and different salespeople may promote different riders. For example, there are 5 different riders for HCB20 in the transactions, so the rules for HCB20 cannot be discovered because their confidence can hardly reach the threshold value.

To allow a comparison between traditional association rules and our algorithm, we again set the minimum confidence threshold at 75%. When we process the same data with our algorithm, several rules are found, as shown in Fig. 6. The rule HCB20 → MIR00 ∨ QTR20 means that people who purchase the HCB20 main insurance will also purchase MIR00 or QTR20. The confidence is calculated by treating MIR00 and QTR20 as the same product: among the customers who purchased HCB20, 80% also purchased at least one of MIR00 and QTR20. With traditional association rules we would not be able to obtain this important information, and the company would not be able to combine products based on this rule. Using the data from August 2005 to December 2005 as testing data, the confidences of these rules remain over 80%. In fact, when we presented these simple rules to the company participating in this research, they were strongly endorsed by the company's Actuarial and Product Departments. Furthermore, the results led the company to start planning a budget for follow-up requests, using data mining to further analyze these demands.
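To illustrate how the confidence of a disjunctive consequent is computed by treating MIR00 and QTR20 as the same product, consider the following toy calculation; the counts are hypothetical and are not taken from the paper's data.

```python
# Hypothetical counts, for illustration only.
selected   = 100   # transactions containing the main policy HCB20
with_mir00 = 45    # of these, transactions that also contain MIR00
with_qtr20 = 50    # of these, transactions that also contain QTR20
with_both  = 15    # transactions containing both riders

conf_mir00 = with_mir00 / selected                             # 0.45, below the 75% threshold
conf_qtr20 = with_qtr20 / selected                             # 0.50, below the 75% threshold
conf_disj  = (with_mir00 + with_qtr20 - with_both) / selected  # 0.80, passes the threshold
```

Neither single-rider rule reaches the minimum confidence, yet the disjunctive rule HCB20 → MIR00 ∨ QTR20 does, which is exactly the situation reported in Fig. 6.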

6. Efficiency analysis

In this section, we compare the number of candidate composite items and the execution efficiency, for different composite lengths, of the traditional algorithm and our algorithms. The traditional algorithm does not eliminate useless combinations. For the efficiency analysis we use the main insurance policy QNM15, which has 23 riders.

Table 1
The rules produced by the Apriori algorithm.

Rule               Confidence
HCB20 ⇒ HIR00      0.05
HCB20 ⇒ QTR20      0.60
HCB20 ⇒ XAH00      0.10
HCB20 ⇒ MIR00      0.60
HCB20 ⇒ ADR65      0.15
HCB20 ⇒ XHX00      0.75
HCB20 ⇒ XMR00      0.20
HCB20 ⇒ QWC20      0.65

In the worst case, the time complexity of the algorithm based on Definition 2 is C(n, m), the number of composite consequents of length m that can be formed from n atomic items. For the product HCB20, for example, the traditional Apriori algorithm can only produce the rules shown in Table 1; after using Definition 2, we obtain the result shown in Fig. 7. As shown in Fig. 7, the execution time of the traditional algorithm is almost the same as that of the algorithm using Definition 1. Although the difference between the traditional algorithm and the algorithm using Definition 1 in the number of candidates is obvious when the consequent length is three, the algorithm using Definition 1 must check every candidate item to prevent the formation of useless rules. Therefore, even though Definition 1 reduces the number of combinations, the validation takes considerable time. Moreover, when the composite consequent length reaches four, there are no more items to combine, so no new disjunctive composite items are generated.

We next analyze the efficiency of the algorithms derived from Definition 1 and Definition 2, respectively. As before, we do not limit the combinational length of the composite merchandise items, and we collect all the composite consequents and candidate items that do not reach the minimum support. As before, we eliminate disqualified items to prevent the formation of useless rules.
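Assuming the worst case refers to the number of candidate composite consequents of length m that can be formed from n atomic items, i.e. the binomial coefficient C(n, m), a quick calculation for QNM15's 23 riders gives the following candidate counts per composite length:

```python
from math import comb

n_riders = 23                    # atomic rider items for QNM15
for m in range(1, 5):            # composite consequent lengths 1..4
    print(m, comb(n_riders, m))  # 23, 253, 1771, 8855
```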

Fig. 7. The comparison between the QNM15 main insurance using the traditional algorithm and using Definition 1.

Fig. 8. The comparison between the QNM15 main insurance using Definition 1 and using Definition 2.

According to Definitions 1 and 2, the algorithm using Definition 1 eliminates disqualified items only from Ck, whereas the algorithm using Definition 2 eliminates disqualified items from both C1 and Ck. Therefore, as Fig. 8 shows, the algorithm using Definition 2 has better performance.

7. Conclusions

In this research, we adopt the idea of the disjunctive composite item to discover disjunctive consequent association rules. Although the proposed algorithm is not complicated, the resultant rules can support more commercial applications and generate more profit. In the future, we will add the time factor into association rules and sequential patterns so that we can automatically discover past rules that were concealed by the time factor.

References

[1] P. Adriaans, D. Zantinge, Data Mining, Addison Wesley Longman, 1996.
[2] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proc. ACM SIGMOD Int'l Conf. on Management of Data, Washington, 1993.
[3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. Int'l Conf. on Very Large Data Bases, 1994.
[4] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in: Proc. ACM SIGMOD Int'l Conf. on Management of Data, 1997.
[5] J. Han, Y. Fu, Mining multiple-level association rules in large databases, IEEE Transactions on Knowledge and Data Engineering 11 (1999) 798–805.
[6] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proc. ACM SIGMOD Int'l Conf. on Management of Data, 2000.
[8] D.I. Lin, Z.M. Kedem, Pincer-search: an efficient algorithm for discovering the maximum frequent set, IEEE Transactions on Knowledge and Data Engineering 14 (2002) 553–556.
[9] E.R. Omiecinski, Alternative interest measures for mining associations in databases, IEEE Transactions on Knowledge and Data Engineering 15 (2003) 57–69.
[10] J.S. Park, M.S. Chen, P.S. Yu, An effective hash-based algorithm for mining association rules, in: Proc. ACM SIGMOD Int'l Conf. on Management of Data, 1995.
[11] J.S. Park, M.S. Chen, P.S. Yu, Using a hash-based method with transaction trimming for mining association rules, IEEE Transactions on Knowledge and Data Engineering 9 (1997) 813–825.
[12] R.J. Bayardo Jr., Efficiently mining long patterns from databases, in: Proc. ACM SIGMOD Int'l Conf. on Management of Data, 1998.
[13] B. Rozenberg, E. Gudes, Association rules mining in vertically partitioned databases, Data and Knowledge Engineering 59 (2006) 378–396.
[14] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large databases, in: Proc. 21st VLDB, 1995.
[18] X. Ye, J.A. Keane, Mining association rules with composite items, IEEE Transactions on Computational Cybernetics and Simulation 2 (1997) 1367–1372.