An effective tree structure for mining high utility itemsets

Expert Systems with Applications 38 (2011) 7419–7424
doi:10.1016/j.eswa.2010.12.082

Chun-Wei Lin a, Tzung-Pei Hong b,c,*, Wen-Hsiang Lu a

a Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan, ROC
b Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
c Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 804, Taiwan, ROC

* Corresponding author at: Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC. E-mail addresses: [email protected] (C.-W. Lin), [email protected] (T.-P. Hong), [email protected] (W.-H. Lu).

Abstract

In the past, many algorithms were proposed to mine association rules, most of which were based on item frequencies. Since a customer may buy many copies of an item and different items may yield different profits, mining frequent patterns from a traditional database is not suitable for some real-world applications. Utility mining was thus proposed to consider costs, profits and other measures according to user preference. In this paper, the high utility pattern tree (HUP tree) is designed and the HUP-growth mining algorithm is proposed to derive high utility patterns effectively and efficiently. The proposed approach integrates the previous two-phase procedure for utility mining and the FP-tree concept to exploit the downward-closure property and to generate a compressed tree structure. Experimental results show that the proposed approach outperforms Liu et al.'s two-phase algorithm in execution time. Finally, the numbers of tree nodes generated by three different item-ordering methods are compared, and the results show that frequency ordering produces fewer tree nodes than the other two.

Keywords: Utility mining; High utility pattern; HUP-tree; HUP-growth; Two-phase mining; Downward closure

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Mining frequent itemsets from a transaction database is a fundamental task for knowledge discovery. Its goal is to identify the itemsets whose appearing frequencies are above a certain threshold. It usually serves as a basic procedure in finding association rules (Agrawal, Imielinski, & Swami, 1993a; Agrawal, Imielinski, & Swami, 1993b; Agrawal & Srikant, 1994; Chen, Han, & Yu, 1996; Cheung, Lee, & Kao, 1997) and sequential patterns (Agrawal & Srikant, 1995). In the past, numerous methods were proposed to discover frequent itemsets. The approaches can be divided into two categories: level-wise approaches and pattern-growth approaches. The Apriori algorithm (Agrawal et al., 1993a) was first proposed to mine association rules in a level-wise way. The FP-growth algorithm was then proposed to construct a compressed tree structure and to mine rules from it (Han, Pei, & Yin, 2000).

Both the Apriori and the FP-growth approaches treat all the items in a database as binary variables. That is, they only consider whether an item is bought in a transaction or not. In this case, frequent itemsets just reveal the occurrence importance of the itemsets in the transactions, but do not reflect any other implicit factors, such as prices or profits. For example, a sale of diamonds may occur less frequently than a sale of clothing in a department store, but the former gives a much higher profit per unit sold than the latter. Frequency alone is thus not sufficient to identify highly profitable items. Utility mining (Yao & Hamilton, 2006; Yao, Hamilton, & Butz, 2004) was proposed to partially solve this problem. It may be thought of as an extension of frequent-itemset mining in which sold quantities and item profits are considered. The utility of an itemset expresses how "useful" it is. Utility mining usually aims to find high utility itemsets, whose utility values are larger than or equal to a threshold defined by users. In practice, the utility value of an itemset can be measured in terms of costs, profits or other measures of user preference. For example, one user may be interested in finding the itemsets with good profits, while another may focus on the itemsets with low pollution during manufacturing.

Liu et al. presented the two-phase algorithm for fast discovery of all high utility itemsets (Liu, Liao, & Choudhary, 2005) based on the downward-closure property. The property indicates that any superset of a non-frequent itemset is also non-frequent; it is thus also called the anti-monotone property. The property is used to reduce the search space by pruning non-frequent itemsets early. The two-phase algorithm generates candidate high utility itemsets in a level-wise way. The database-scanning time is, however, a bottleneck of the approach.

In this paper, a new utility-mining approach with the aid of a tree structure is proposed.


A new tree structure called the high utility pattern tree (HUP tree) is first designed to keep related information for utility mining. A mining approach called HUP-growth, based on the proposed HUP-tree structure, is then presented to mine high utility itemsets. The whole process of deriving the high utility itemsets from a database thus consists of the construction of the HUP tree and the mining process of HUP-growth. Experimental results show that the proposed approach for mining high utility itemsets has a better performance than the two-phase approach in execution time. Besides, different item-ordering strategies may affect the number of nodes in the HUP tree. Experiments are thus made to analyze this effect.

The remainder of this paper is organized as follows. Related works are reviewed in Section 2. The proposed HUP-tree construction algorithm and an example are given in Section 3. The proposed HUP-growth mining algorithm and an example are described in Section 4. Experimental results showing the performance of the proposed approach are provided in Section 5. Conclusions and discussions are finally given in Section 6.

2. Review of related works

In this section, some related works are briefly reviewed: the FP-growth algorithm and utility mining.

2.1. The FP-growth algorithm

Data mining involves applying specific algorithms to extract patterns or rules from data sets in a particular representation. One common type of data mining is to derive association rules from transaction data, such that the presence of certain items in a transaction implies the presence of some other items. To achieve this purpose, Agrawal et al. proposed several mining algorithms based on level-wise processing to find association rules (Agrawal et al., 1993a; Agrawal et al., 1993b; Agrawal & Srikant, 1994). Han et al. then proposed the frequent-pattern-tree structure (FP tree) and the FP-growth algorithm for efficiently mining frequent itemsets without candidate generation (Han et al., 2000).

The FP-tree mining approach consists of two phases. The first phase constructs the FP tree from a database, and the second phase derives frequent patterns from the FP tree. Three steps are involved in FP-tree construction. The database is first scanned to find all items with their counts. The items whose supports are equal to or larger than a predefined minimum support threshold are then selected as frequent 1-itemsets (items). Next, the frequent items are sorted in descending order of their frequencies. Finally, the database is scanned again to construct the FP tree according to the sorted order of frequent items. The construction process is executed tuple by tuple, from the first transaction to the last one. After all transactions are processed, the FP tree is completely constructed.

After the FP tree is constructed from a database, the FP-growth mining algorithm (Han et al., 2000) is executed to find all frequent itemsets. A conditional FP tree is generated for each frequent item, and the frequent itemsets containing the processed item can be recursively derived from the FP-tree structure. Several other algorithms based on the pattern-growth procedure have also been proposed, and related research is still in progress (Ezeife, 2002; Koh & Shieh, 2004; Qiu, Lan, & Xie, 2004; Zaiane & Mohammed, 2003).
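As an illustration of the construction phase just reviewed, the following is a minimal Python sketch of the standard FP-tree building procedure (not the authors' implementation); the names FPNode, build_fp_tree and min_support are assumptions introduced only for this example.

```python
from collections import defaultdict

class FPNode:
    """One node of an FP tree: an item, its count, and child links."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}                 # item -> FPNode

def build_fp_tree(transactions, min_support):
    # Scan 1: count the support of every item.
    support = defaultdict(int)
    for trans in transactions:
        for item in trans:
            support[item] += 1
    frequent = {i for i, c in support.items() if c >= min_support}

    # Frequent items are kept in descending order of frequency.
    order = lambda item: (-support[item], item)

    root = FPNode()
    header_table = defaultdict(list)       # item -> nodes holding that item
    # Scan 2: insert each transaction along a shared prefix path.
    for trans in transactions:
        kept = sorted((i for i in trans if i in frequent), key=order)
        node = root
        for item in kept:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header_table[item].append(child)
            child.count += 1
            node = child
    return root, header_table

# Example: three small transactions with a minimum support of 2.
tree, header = build_fp_tree([["a", "b"], ["b", "c"], ["a", "b", "c"]], 2)
```

The header table built during the second scan is what FP-growth later follows to locate all occurrences of an item when it builds the conditional trees.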

2.2. Utility mining

In association-rule mining, only binary itemsets are considered in a database. In real-world applications, however, frequent itemsets just reveal the occurrence of itemsets in transactions, but do not reflect any other important factors, such as prices or profits. Highly profitable products with low frequencies may thus not be found by traditional association-rule mining. For example, jewels and diamonds are highly profitable items but may not be frequent when compared with food or drink products in a database. Yao et al. proposed the utility model by considering both quantities and profits of items (Yao & Hamilton, 2006; Yao et al., 2004). Chan et al. proposed the topic of utility mining to discover high utility itemsets (Chan, Yang, & Shen, 2003). Liu et al. then presented a two-phase algorithm for fast discovery of high utility itemsets by adopting the downward-closure property (Liu et al., 2005) and named their approach the transaction-weighted-utilization (TWU) model. It consists of two phases. In the first phase, the transaction utility is used as an effective upper bound of each candidate itemset, such that the "transaction-weighted downward closure" can be kept in the search space to decrease the number of generated candidate itemsets. In the second phase, an additional database scan is performed to find the actual utility values of the candidates and to identify the high utility itemsets. The main idea is thus to reduce the number of candidates in order to decrease the time spent scanning the database. Several other algorithms for utility mining have also been proposed (Chu, Tseng, & Liang, 2008; Tseng, Chu, & Liang, 2006).
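To make the two phases of the TWU model concrete, here is a rough sketch restricted to 1-itemsets (a simplification made for brevity; the actual algorithm proceeds level-wise over larger itemsets as well). The function name two_phase_1_itemsets and the dictionary-based database format are illustrative choices, not Liu et al.'s code.

```python
def two_phase_1_itemsets(database, profit, min_utility):
    """database: list of {item: quantity}; profit: {item: unit profit}.
    Returns the 1-itemsets whose exact utility reaches min_utility."""
    # Phase 1: transaction utility and transaction-weighted utilization.
    twu = {}
    for trans in database:
        tu = sum(qty * profit[item] for item, qty in trans.items())
        for item in trans:
            twu[item] = twu.get(item, 0) + tu
    candidates = {item for item, value in twu.items() if value >= min_utility}

    # Phase 2: one more scan to get the actual utility of each candidate.
    actual = {item: 0 for item in candidates}
    for trans in database:
        for item, qty in trans.items():
            if item in candidates:
                actual[item] += qty * profit[item]
    return {item: u for item, u in actual.items() if u >= min_utility}

# Tiny usage example with two transactions.
profit = {"a": 5, "b": 2}
db = [{"a": 2, "b": 1}, {"b": 3}]
print(two_phase_1_itemsets(db, profit, 10))   # {'a': 10}
```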

The proposed utility mining approach integrates the two-phase approach and the FP-tree concept to efficiently and effectively find high utility patterns. A new tree structure is designed and a tree-construction algorithm is proposed. They are first described below.

3. The proposed HUP-tree construction algorithm

The HUP-tree construction algorithm keeps the high utility items obtained from a database in a tree structure based on the downward-closure property. The proposed algorithm first calculates the transaction utility of each transaction. It then finds the transaction-weighted-utilization values of all the items. If the transaction-weighted utilization of an item is larger than or equal to the predefined minimum utility threshold, the item is considered a high transaction-weighted 1-itemset. The algorithm then keeps only the high transaction-weighted 1-itemsets in the transactions and sorts them according to their transaction frequencies. The updated transactions are then used to build the HUP tree tuple by tuple, from the first transaction to the last one. Each node in the tree stores the transaction-weighted utilization of the item as well as the quantities of its preceding items (including itself) in the path. An array called quan_Ary is attached to each node to keep those values. The HUP-tree construction algorithm is stated as follows.

3.1. The HUP-tree construction algorithm

INPUT:
1. A set I of m items {i1, i2, ..., ij, ..., im}, each item ij with a profit value pj;
2. A transaction database D = {T1, T2, ..., Tk, ..., Tn}, in which each transaction includes a subset of items with quantities;
3. A minimum high utility threshold λ.

OUTPUT: A high utility pattern tree (HUP tree).
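Before the steps are listed, the node layout just described (an accumulated value plus a quan_Ary holding the quantities of the prefix items, including the node's own item) can be sketched as a small data structure. The class name HUPNode and the use of a Python dict in place of a fixed-length array are assumptions of this illustration, not the paper's exact representation.

```python
class HUPNode:
    """A node of the HUP tree as described in Section 3."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.value = 0          # accumulated transaction utility along this path
        self.quan_ary = {}      # prefix items (including this one) -> summed quantity
        self.parent = parent
        self.children = {}      # item -> HUPNode
        self.node_link = None   # link to the next node holding the same item
```

A dict keyed by item simply plays the role of the quan_Ary array; an array indexed by the Header_Table order would be an equally valid choice.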


STEP 1: Calculate the utility value ujk of each item ij in each transaction Tk as ujk = qjk × pj, where qjk is the quantity of ij in Tk and pj is the profit of ij. Accumulate the utility values of the items in each transaction Tk as the transaction utility tuk. That is:


tu_k = \sum_{j=1}^{m} u_{jk}.

STEP 2: While executing the above step, also find the occurrence frequency f(ij) of each item ij in the database.

STEP 3: Calculate the transaction-weighted utilization (abbreviated as twu) of each item ij as the summation of the utilities of the transactions which include ij. That is:

twu(i_j) = \sum_{i_j \in T_k} tu_k.

STEP 4: Check whether the value of twu(ij) is larger than or equal to the minimum high utility value, which is calculated as follows:

\lambda \times \sum_{j=1}^{m} \sum_{T_k \in D} u_{jk}.

Table 1. A quantitative database in the example.

TID   A   B   C   D   E   F
1     3   2   0   3   0   0
2     2   0   0   4   2   0
3     3   0   5   0   0   3
4     1   0   3   0   1   2
5     1   0   0   3   2   0
6     1   2   0   4   0   0
7     2   3   2   0   1   1
8     0   0   0   0   0   2
9     0   0   3   3   0   0
10    3   0   0   4   0   0

If ij satisfies the above condition, put it in the set of candidate high utility 1-itemsets C1. That is:

C_1 = \{ i_j \mid twu(i_j) \ge \lambda \times \sum_{j=1}^{m} \sum_{T_k \in D} u_{jk}, \ 1 \le j \le m \}.

STEP 5: Sort the candidate high utility 1-itemsets (items) in C1 according to their transaction frequencies.

STEP 6: Build the Header_Table by keeping the candidate high utility 1-itemsets in C1 in the sorted order of STEP 5.

STEP 7: Remove the items not existing in C1 from the quantitative transactions and sort the remaining items in each transaction according to the sorted order of STEP 5.

STEP 8: Initially set the root node of the HUP tree as root.

STEP 9: Insert the updated transactions into the HUP tree tuple by tuple by the following substeps:

Substep 9-1: If an item ij in the currently processed kth transaction has appeared at the corresponding path of the HUP tree, add the transaction utility tuk to the node with ij in the path as its accumulated value. Besides, add the quantities of the prefix items of ij (including ij itself) in the transaction to the corresponding elements of the quan_Ary array in the node, which stores the accumulated quantities of the prefix items of ij.

Substep 9-2: Otherwise, add a node with ij to the end of the corresponding path and set the transaction utility tuk of the kth transaction as the value of the node. Besides, set the quantities of the prefix items of ij (including ij itself) in the transaction as the corresponding elements of the array (quan_Ary) in the node. At last, insert a link from the node of ij in the last branch to the current node. If there is no such branch, insert a link from the entry of ij in the Header_Table to the current node.

After STEP 9, the HUP tree is built. Note that in STEP 9, a corresponding path means a path in the tree which corresponds to the items to be processed in a transaction according to the order of items appearing in the Header_Table. Besides, STEPs 7 and 9 can be done in a single database scan.

Below, an example is given to illustrate the proposed HUP-tree construction algorithm. Assume there is a quantitative database shown in Table 1. It consists of 10 transactions and 6 items, denoted A to F. Assume the minimum high utility threshold λ is set at 35%. Also assume that the predefined profit values of the items are given in the utility table shown in Table 2. The proposed tree-construction algorithm proceeds as follows.

STEPs 1 and 2: The utility value of each item occurring in each transaction in Table 1 is calculated. Take the first transaction as an example. The items with quantities in the first transaction are (A:3, B:2, D:3).

Table 2. The utility table.

Item  Profit ($)
A     3
B     150
C     1
D     50
E     100
F     20

The profits of items A, B and D are 3, 150 and 50, respectively, from Table 2. The transaction utility of the first transaction is thus calculated as tu(T1) = (3 × 3) + (2 × 150) + (3 × 50), which is 459. The transaction utilities of the other transactions can be calculated in the same way. Besides, the occurrence frequency of each item is also calculated. Take item A as an example. It appears in transactions 1, 2, 3, 4, 5, 6, 7 and 10. Its transaction frequency f(A) is thus set as 8. After STEPs 1 and 2, the results are shown in Table 3.

STEP 3: The transaction-weighted utilization (abbreviated as twu) of each item is obtained by summing the utilities of the transactions containing it. Take item A as an example. It appears in transactions 1, 2, 3, 4, 5, 6, 7 and 10 in Table 3. Its transaction-weighted utilization twu(A) is then calculated as tu(T1) + tu(T2) + tu(T3) + tu(T4) + tu(T5) + tu(T6) + tu(T7) + tu(T10), which is 459 + 406 + 74 + 146 + 353 + 503 + 578 + 209 (= 2728). The other items are processed in the same way. The results are shown in Table 4.

STEP 4: The twu values of the 1-itemsets are checked against the minimum high utility value, which is calculated as 1022.35, 35% of the total utility. In this example, the four items A, B, D and E satisfy the condition and are recorded in the candidate set of high utility 1-itemsets C1. Thus, C1 = {A:2728, B:1540, D:2083, E:1483}.

STEP 5: The candidate high utility 1-itemsets in C1 are sorted according to their descending occurrence frequencies. The sorted list is (A, D, E, B).
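The numbers in STEPs 1 to 4 can be double-checked with a few lines of Python; the dictionaries below simply transcribe Tables 1 and 2, and the variable names are ad hoc choices for this illustration.

```python
profit = {"A": 3, "B": 150, "C": 1, "D": 50, "E": 100, "F": 20}
db = [
    {"A": 3, "B": 2, "D": 3},
    {"A": 2, "D": 4, "E": 2},
    {"A": 3, "C": 5, "F": 3},
    {"A": 1, "C": 3, "E": 1, "F": 2},
    {"A": 1, "D": 3, "E": 2},
    {"A": 1, "B": 2, "D": 4},
    {"A": 2, "B": 3, "C": 2, "E": 1, "F": 1},
    {"F": 2},
    {"C": 3, "D": 3},
    {"A": 3, "D": 4},
]

tu = [sum(q * profit[i] for i, q in t.items()) for t in db]
print(tu[0])                                        # 459, the utility of T1
print(sum(u for t, u in zip(db, tu) if "A" in t))   # 2728 = twu(A)
print(round(0.35 * sum(tu), 2))                     # 1022.35, the minimum high utility value
```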

Table 3. The utilities and the occurrence frequencies of all items in each transaction.

TID                    A    B    C    D    E    F    tu
1                      9    300  0    150  0    0    459
2                      6    0    0    200  200  0    406
3                      9    0    5    0    0    60   74
4                      3    0    3    0    100  40   146
5                      3    0    0    150  200  0    353
6                      3    300  0    200  0    0    503
7                      6    450  2    0    100  20   578
8                      0    0    0    0    0    40   40
9                      0    0    3    150  0    0    153
10                     9    0    0    200  0    0    209
Occurrence frequency   8    2    4    6    4    4

Table 4. The transaction-weighted utilization of each item.

Item  twu
A     2728
B     1540
C     951
D     2083
E     1483
F     838


STEP 6: The candidate high utility 1-itemsets in C1 are kept in the Header_Table in the sorted order. The results are shown in Table 5.

STEP 7: The items not existing in C1 are removed from the transactions in the quantitative database. The remaining items in each updated transaction are then sorted according to the above order. The updated transactions are shown in Table 6.

STEP 8: The root node of the HUP tree is initially set as the node root.

STEP 9: The updated transactions in Table 6 are used to construct the HUP tree tuple by tuple, from the first transaction to the last one. Each node stores not only the accumulated value of the item within it but also the quantities of the prefix items in the path. Take the first and the second transactions as examples to illustrate the construction process.

At the beginning, there is no corresponding path in the HUP tree for the first updated transaction (A:3, D:3, B:2). Three new nodes are thus sequentially created for the items and are linked together. Each node takes the transaction utility 459 of the first updated transaction as its value and keeps the quantities of its prefix items in the transaction. In this example, the prefix item of item A in the first transaction is null. The quan_Ary array attached to its node then only keeps the quantity of item A itself (A:3). Next, item D is inserted. Its prefix item in the first transaction is A. The quan_Ary array attached to the node with D then keeps (A:3, D:3). Similarly, the quan_Ary array attached to the node with B keeps (A:3, D:3, B:2). After the first transaction is processed, the HUP tree is shown in Fig. 1.


Fig. 1. The HUP tree after the first updated transaction is processed.
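The two substeps of STEP 9 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the class Node restates the node layout sketched earlier, insert_transaction assumes the transaction has already been reduced to the C1 items and sorted in Header_Table order, and node_links is a hypothetical stand-in for the Header_Table node links.

```python
class Node:
    """HUP-tree node as in Section 3: an accumulated value plus a quan_Ary."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.value = 0
        self.quan_ary = {}       # prefix items (including item) -> summed quantity
        self.children = {}
        self.parent = parent

def insert_transaction(root, sorted_items, quantities, trans_utility, node_links):
    """Insert one updated transaction (items already in Header_Table order)."""
    node, prefix = root, []
    for item in sorted_items:
        prefix.append(item)
        child = node.children.get(item)
        if child is not None:
            # Substep 9-1: the path already exists; accumulate value and quantities.
            child.value += trans_utility
            for p in prefix:
                child.quan_ary[p] = child.quan_ary.get(p, 0) + quantities[p]
        else:
            # Substep 9-2: extend the path with a new node and record its node link.
            child = Node(item, parent=node)
            child.value = trans_utility
            child.quan_ary = {p: quantities[p] for p in prefix}
            node.children[item] = child
            node_links.setdefault(item, []).append(child)
        node = child
    return root

# Replaying the first two updated transactions of the example:
root, links = Node(), {}
insert_transaction(root, ["A", "D", "B"], {"A": 3, "D": 3, "B": 2}, 459, links)
insert_transaction(root, ["A", "D", "E"], {"A": 2, "D": 4, "E": 2}, 406, links)
# The node for D now carries value 865 and quan_Ary {A: 5, D: 7}, matching Fig. 2.
```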

After that, the second updated transaction (A:2, D:4, E:2) is processed. There is a corresponding path (A, D) in the HUP tree for the second transaction. The values (accumulated transaction utilities) of the nodes A and D in the path are then accumulated as 459 + 406, which is 865. The quan_Ary arrays in nodes A and D are also updated with the quantities of items A and D in the second updated transaction, becoming {A: 3 + 2 (= 5)} and {A: 3 + 2 (= 5), D: 3 + 4 (= 7)}, respectively. Besides, a new node is created for item E and linked to the node with D. Its quan_Ary array is generated as (A:2, D:4, E:2). After the second updated transaction is inserted into the HUP tree, the results are shown in Fig. 2.

The other transactions are then processed in the same way. After all the ten transactions are processed, the final constructed HUP tree is shown in Fig. 3. After STEP 9, a complete HUP tree has been constructed. The high utility itemsets can then be derived by the proposed HUP-growth algorithm, which is stated below.

4. The HUP-growth mining algorithm

After the HUP tree is constructed, the desired high utility itemsets can be derived by the proposed HUP-growth mining algorithm. The algorithm is stated as follows.

4.1. The HUP-growth mining algorithm

INPUT: The constructed HUP tree, its corresponding Header_Table, the profit value of each item, and the predefined minimum utility threshold λ.

OUTPUT: The high utility itemsets.


Table 5. The constructed Header_Table in the example.

Item  twu
A     2728
D     2083
E     1483
B     1540

Table 6. The updated transactions in the quantitative database.

TID   A   D   B   E
1     3   3   2   0
2     2   4   0   2
3     3   0   0   0
4     1   0   0   1
5     1   3   0   2
6     1   4   2   0
7     2   0   3   1
8     0   0   0   0
9     0   3   0   0
10    3   4   0   0


Fig. 2. The HUP tree after the second updated transaction is processed.



Fig. 3. The final constructed HUP tree.

STEP 1: Process the items in the Header_Table one by one, bottom-up, by the following steps.

STEP 2: Find all the nodes with the currently processed item ij in the HUP tree, extract their quan_Ary arrays to get the quantities of the prefix items in the corresponding paths, and combine the items to form the itemsets which contain the currently processed item ij. If an itemset s is generated from more than one path, merge its quantities from the corresponding arrays by addition.

STEP 3: Calculate the actual utility (abbreviated as au) value of each merged itemset s containing item ij as follows:

au_s = \sum_{s \subseteq T_k} \sum_{i_j \in s} u_{jk},

where ujk is the utility value of item ij in transaction Tk.

STEP 4: Check whether the actual utility value aus of each generated itemset s is larger than or equal to the predefined minimum high utility value. If it is, output s as a high utility itemset.

STEP 5: Repeat STEPs 2 to 4 for the next item until all the items in the Header_Table are processed.

After STEP 5, all the high utility itemsets can be derived from the constructed HUP tree. Below, an example is given to illustrate the algorithm. For the constructed HUP tree in Fig. 3, the proposed HUP-growth mining algorithm finds the high utility itemsets as follows.

STEP 1: The items in the Header_Table are processed one by one, bottom-up. In this example, the items are processed in the order B, E, D and A. Item B is processed first by the following steps.

STEP 2: There are two nodes in the HUP tree containing item B. The preceding items with their quantities are extracted from the quan_Ary arrays of these two nodes. The itemsets containing item B are then generated as B, AB, BD, BE, ABD and ABE. The quantities of the same itemset from different nodes are then summed together. For example, the quantities of the itemset AB in the two extracted nodes are (A:4, B:4) and (A:2, B:3), respectively. The summed quantities from the two extracted nodes are thus {A: (4 + 2) (= 6), B: (4 + 3) (= 7)}. The other itemsets are processed in the same way. The results are shown in Table 7.
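A minimal Python sketch of this per-item pass (STEPs 2 to 4) is given below. It is an illustrative reading rather than the authors' code: mine_item and the tiny _N stand-in class are hypothetical names, and STEP 5 would simply call the function once per Header_Table entry, bottom-up.

```python
from itertools import combinations

def mine_item(nodes, item, profit, min_utility):
    """nodes: the HUP-tree nodes holding `item` (reached via the node links);
    each node carries a quan_ary dict over the items on its path (STEP 2)."""
    merged = {}                                   # frozenset itemset -> {item: qty}
    for node in nodes:
        prefix = [i for i in node.quan_ary if i != item]
        # Every subset of the prefix combined with `item` is a candidate itemset.
        for r in range(len(prefix) + 1):
            for combo in combinations(prefix, r):
                itemset = frozenset(combo) | {item}
                qty = merged.setdefault(itemset, {})
                for i in itemset:
                    qty[i] = qty.get(i, 0) + node.quan_ary[i]
    # STEPs 3 and 4: actual utility of each merged itemset against the threshold.
    high = {}
    for itemset, qty in merged.items():
        au = sum(qty[i] * profit[i] for i in itemset)
        if au >= min_utility:
            high[itemset] = au
    return high

class _N:                                          # a tiny stand-in for a tree node
    def __init__(self, quan_ary): self.quan_ary = quan_ary

# The two B nodes of the final HUP tree in Fig. 3.
profit = {"A": 3, "B": 150, "D": 50, "E": 100}
b_nodes = [_N({"A": 4, "D": 7, "B": 4}), _N({"A": 2, "E": 1, "B": 3})]
high = mine_item(b_nodes, "B", profit, 1022.35)
# high contains itemsets {B} with utility 1050 and {A, B} with 1068, as in Table 9.
```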

Table 7. The itemsets containing item B with their summed quantities.

1-Itemsets   2-Itemsets     3-Itemsets
B: 7         (A: 6, B: 7)   (A: 4, B: 4, D: 7)
             (B: 4, D: 7)   (A: 2, B: 3, E: 1)
             (B: 3, E: 1)

STEP 3: The actual utility values of the itemsets generated in STEP 2 are calculated. Take the itemset AB as an example. The summed quantities of the two items A and B in the itemset AB are 6 and 7, respectively. Its actual utility value is calculated as (6 × 3 + 7 × 150), which is 1068. The actual utilities of the other itemsets are calculated in the same way. The results are shown in Table 8.

STEP 4: The utility value of each itemset is checked against the minimum high utility value, which is 1022.35. In this case, the two itemsets B and AB satisfy the condition and are output as high utility itemsets.

STEP 5: The above steps are repeated for the other items until all the items in the Header_Table are processed. The final high utility itemsets are shown in Table 9.

5. Experimental results

The experiments were performed in Java on an Intel Core 2 Duo with a 2.8 GHz processor and 4 GB of main memory, running the Microsoft Windows XP operating system. A real dataset called chess was used in the experiments (Frequent itemset mining dataset repository, 2003). Quantities and profit values were assigned to the purchased items in the database, drawn uniformly from 1 to 20 for quantities and from 1 to 200 for profit values, respectively.

Experiments were first made to compare the execution time of the two-phase (TP) algorithm and of the proposed HUP-growth mining algorithm, which included both the tree-construction and the mining processes. The minimum utility threshold was set from 70% to 90%, with a 5% increment each time. The results are shown in Fig. 4.

From Fig. 4, it is obvious that the execution time increased as the minimum utility threshold decreased. This is reasonable because when the minimum utility threshold became smaller, more candidate itemsets were processed. Besides, the proposed HUP-growth mining algorithm had a better performance than the two-phase algorithm for all five minimum utility thresholds. Especially when the minimum utility threshold was set lower, the two-phase algorithm took much more execution time than the proposed algorithm.
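The random assignment of quantities and profits just described can be reproduced roughly as follows; the function name attach_synthetic_values, the list-of-itemsets input format and the fixed seed are assumptions made only for this illustration.

```python
import random

def attach_synthetic_values(transactions, seed=0):
    """Give every item a uniform profit in [1, 200] and every purchase a
    uniform quantity in [1, 20], mirroring the experimental setup above."""
    rng = random.Random(seed)          # seeded only so the illustration is repeatable
    profit = {}
    quantified = []
    for trans in transactions:
        row = {}
        for item in trans:
            profit.setdefault(item, rng.randint(1, 200))
            row[item] = rng.randint(1, 20)
        quantified.append(row)
    return quantified, profit

demo_db, demo_profit = attach_synthetic_values([["a", "b"], ["b", "c", "d"]])
```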

Table 8. The actual utility values of the itemsets associated with item B.

1-Itemsets   2-Itemsets   3-Itemsets
B: 1050      AB: 1068     ABD: 962
             BD: 950      ABE: 556
             BE: 550

Table 9. The final high utility itemsets.

1-Itemsets   2-Itemsets
B: 1050      AB: 1068
D: 1050


Fig. 4. The comparison of the execution time.


Fig. 5. The numbers of tree nodes generated by the three different ordering methods.

In the proposed tree-construction algorithm, the high utility 1-itemsets (items) are sorted according to their transaction frequencies. In addition to frequency ordering, however, the 1-itemsets can also be sorted according to their twu values or in lexicographic order. All three ordering methods yield correct mining results, but with different efficiency. Experiments were thus performed to compare their effects. The numbers of nodes in the trees generated by the three ordering methods are shown in Fig. 5. It is obvious from Fig. 5 that fewer tree nodes were generated with the frequency ordering than with the other two. This indicates that the frequency ordering used in the proposed tree-construction algorithm is reasonable and acceptable.

6. Conclusion and discussion

In this paper, a new tree structure called the high utility pattern tree (HUP tree) has been proposed. The tree keeps the related mining information so that the database-scan time can be greatly reduced. The HUP tree is similar to the FP-tree structure except that each node is attached with an array keeping the quantities of its prefix items in the path for utility mining. The HUP-growth mining algorithm has also been proposed to derive high utility itemsets from the proposed HUP-tree structure. Without level-wise generation of candidate itemsets, high utility itemsets can be derived efficiently and effectively from the HUP tree. Experimental results show that the proposed algorithm executes faster than the two-phase algorithm. Besides, three item-ordering methods are compared, and the results show that the frequency ordering produces fewer tree nodes than the other two.

In the HUP-growth mining algorithm, the possible itemsets including a specific item are generated at the same time. As an alternative, a recursive processing way like FP-growth could also be adopted in the proposed approach to generate and handle candidate itemsets. Besides, the arrays attached to the nodes can greatly help the calculation of the actual utility values of the candidate itemsets.

In this paper, we assume the database is static. In real-world applications, data may be dynamically inserted into or deleted from a database. In the future, we will attempt to handle the maintenance problem of utility mining when transactions are inserted, deleted or modified. How to further improve the HUP-tree structure is another interesting topic.

References

Agrawal, R., Imielinski, T., & Swami, A. (1993a). Mining association rules between sets of items in large databases. In The ACM SIGMOD international conference on management of data (pp. 207–216).
Agrawal, R., Imielinski, T., & Swami, A. (1993b). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 914–925.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In The international conference on very large data bases (pp. 487–499).
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In The eleventh IEEE international conference on data engineering (pp. 3–14).
Chan, R., Yang, Q., & Shen, Y. D. (2003). Mining high utility itemsets. In The third IEEE international conference on data mining (pp. 19–26).
Chen, M. S., Han, J., & Yu, P. S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 866–883.
Cheung, D. W., Lee, S. D., & Kao, B. (1997). A general incremental technique for maintaining discovered association rules. In The fifth international conference on database systems for advanced applications (pp. 185–194).
Chu, C. J., Tseng, V. S., & Liang, T. (2008). Mining temporal rare utility itemsets in large databases using relative utility thresholds. International Journal of Innovative Computing, Information and Control, 4(8).
Ezeife, C. I. (2002). Mining incremental association rules with generalized FP-tree. In The fifteenth conference of the Canadian Society for Computational Studies of Intelligence on advances in artificial intelligence (pp. 147–160).
Frequent itemset mining dataset repository (2003).
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In The ACM SIGMOD international conference on management of data (pp. 1–12).
Koh, J. L., & Shieh, S. F. (2004). An efficient approach for maintaining association rules based on adjusting FP-tree structures. In The ninth international conference on database systems for advanced applications (pp. 417–424).
Liu, Y., Liao, W. K., & Choudhary, A. (2005). A fast high utility itemsets mining algorithm. In The first international workshop on utility-based data mining (pp. 90–99).
Qiu, Y., Lan, Y. J., & Xie, Q. S. (2004). An improved algorithm of mining from FP-tree. In The third international conference on machine learning and cybernetics (pp. 26–29).
Tseng, V. S., Chu, C. J., & Liang, T. (2006). Efficient mining of temporal high utility itemsets from data streams. In The ACM KDD workshop on utility-based data mining.
Yao, H., & Hamilton, H. J. (2006). Mining itemset utilities from transaction databases. Data and Knowledge Engineering, 59(3), 603–626.
Yao, H., Hamilton, H. J., & Butz, C. J. (2004). A foundational approach to mining itemset utilities from databases. In The fourth SIAM international conference on data mining (pp. 482–486).
Zaiane, O. R., & Mohammed, E. H. (2003). COFI-tree mining: A new approach to pattern growth with reduced candidacy generation. In The workshop on frequent itemset mining implementations, IEEE international conference on data mining.