Accepted Manuscript
Efficiently mining high utility itemsets with negative unit profits
Srikumar Krishnamoorthy
PII: S0950-7051(17)30613-5
DOI: 10.1016/j.knosys.2017.12.035
Reference: KNOSYS 4171
To appear in:
Knowledge-Based Systems
Received date: 9 March 2017
Revised date: 22 December 2017
Accepted date: 28 December 2017
Please cite this article as: Srikumar Krishnamoorthy, Efficiently mining high utility itemsets with negative unit profits, Knowledge-Based Systems (2017), doi: 10.1016/j.knosys.2017.12.035
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Efficiently mining high utility itemsets with negative unit profits
Srikumar Krishnamoorthy∗ Indian Institute of Management, Ahmedabad, India
Abstract
High Utility Itemset (HUI) mining is an important problem in the data mining literature that considers utilities of items (such as profits and margins) to discover interesting patterns from transactional databases. Several data structures, pruning strategies and algorithms have been proposed in the literature to efficiently mine high utility itemsets. Most of these works, however, do not consider itemsets with negative unit profits, which give a decision maker greater flexibility in determining profitable itemsets. This paper aims to advance the state-of-the-art and presents a generalized high utility mining (GHUM) method that considers both positive and negative unit profits. The proposed method uses a simplified utility-list data structure for storing itemset information during the mining process. The paper also introduces a novel utility based anti-monotonic property to improve the performance of HUI mining. Furthermore, GHUM adapts key pruning strategies from the basic HUI mining literature and presents new pruning strategies to significantly improve the performance of mining. The proposed method is evaluated on a set of benchmark sparse and dense datasets and compared against a state-of-the-art method. Rigorous experimental evaluation is performed and implications of the key findings are also presented. In general, GHUM was found to deliver more than an order of magnitude improvement at a fraction of the memory over the state-of-the-art FHN method.
Keywords: High utility itemset, Anti-monotonic property, Utility list data structure, Negative unit profits, Pruning strategies, Frequent itemset mining
∗ Corresponding author. Tel.: +91 79 6632 4834. Email address: [email protected] (Srikumar Krishnamoorthy)
Preprint submitted to Knowledge-Based Systems
December 28, 2017
1. Introduction
The High Utility Itemset (HUI) mining problem [1, 2, 3, 4, 5, 6] involves the use of item utilities to discover profitable itemsets from a transactional database. It considers both the internal and external utilities of items. The problem has received significant attention in recent years due to its potential applicability in numerous business and scientific applications. The high utility itemset mining problem is a generalization of the frequent itemset mining [7] problem. The frequent itemset mining problem uses the notion of support (or item co-occurrence frequencies) to discover interesting patterns. Numerous algorithms have been proposed in the literature to efficiently mine frequent itemsets. Such algorithms predominantly employ a support based anti-monotonic property to efficiently mine interesting patterns. The utility of an itemset, however, does not satisfy the anti-monotonic property [1, 8]. Hence, the HUI mining problem is considerably harder and computationally challenging [1].

The high utility mining methods in the literature can be broadly categorized as level-wise [1, 8, 9], tree-based [2, 10, 11, 12], utility list based [3, 4, 5, 13], hyperlink based [14] and projection based [6] methods. Most of the current works in the literature support only items with positive unit profits. However, in most real-life situations, there is a need to consider items with both positive and negative unit profits or margins. For example, supermarket firms like Walmart often run hundreds or thousands of cross product promotional campaigns per month. These campaigns often involve offering products at everyday low pricing (EDLP), at a discounted price (that might lead to a negative margin), as free products (negative profit), or as bundled offerings (a mix of discounted and non-discounted products).
The additional costs (or losses) incurred on individual items that are part of a promotion are insignificant if the overall promotional campaign delivers profitable outcomes. In essence, a firm is interested in choosing the bundle of products (or itemsets) that maximizes its overall profitability. The state-of-the-art HUI mining methods like HUI-Miner [3], FHM [4], and EFIM [6] cannot be directly used for handling such problems, which require consideration of items with both positive and negative unit profits to maximize overall profitability.

A few recent works in the literature [5, 15, 16] have made attempts to address the above problem. HUINIV-Mine [15] and FHN [5, 16] are the two key methods that consider both positive and negative unit profits while mining HUIs. HUINIV-Mine [15] uses a level-wise candidate generation and test approach. The more recent and most efficient FHN [5] method uses a utility list based data structure for mining HUIs with negative unit profits. The FHN [5] method is shown to be 2-3 orders of magnitude faster than the HUINIV-Mine [15] method. We argue that the state-of-the-art FHN method uses a relatively complex utility list data structure and does not exploit interesting anti-monotonic properties of items with negative unit profits. This paper introduces a novel anti-monotonic property and suggests a few new pruning strategies to significantly improve the efficiency of mining HUIs with negative unit profits. More specifically, the key novelties and contributions of this paper are as follows:
1. Presents a new method, GHUM, for efficiently mining high utility itemsets with negative unit profits. The presented method uses a simplified utility list based data structure for efficiently storing itemset information.
2. Introduces a novel anti-monotonic property of itemsets (A-Prune) for mining HUIs with negative unit profits. This property has not been explored, to the best of our knowledge, in the HUI mining literature.
3. Several pruning properties have been proposed in the literature for efficiently mining HUIs with positive unit profits. However, most of these pruning properties cannot be directly applied to the new problem that considers both positive and negative unit profits. The proposed method adapts the key pruning properties (namely, U-Prune [3] and LA-Prune [13]) for the new HUI mining problem and demonstrates their effectiveness.
4. A utility list based HUI mining method requires expensive utility list intersections and candidate evaluations during the mining process. We propose a novel pruning strategy (N-Prune) to significantly reduce the total number of evaluations made during the mining process and improve the performance of HUI mining.
5. Explores a few novel optimizations to improve the overall efficiency of HUI mining with negative unit profits. More specifically, the paper considers optimizations based on utility list compaction, support based sorting of items with negative unit profits, and dynamic sorting of items with negative unit profits. The experimental results clearly reveal the usefulness of the proposed optimizations.
6. Substantial experimental evaluation is performed on a variety of benchmark dense and sparse datasets to demonstrate the utility of the proposed ideas. The proposed GHUM method was found to deliver an order of magnitude improvement, at a fraction of the memory, over the state-of-the-art FHN method.
The rest of the paper is organized as follows. Section 2 describes the related work and highlights some of the key gaps in the existing literature. Section 3 formally introduces the problem, and discusses the key definitions and notations used in the paper. The proposed algorithm and its pruning strategies are outlined in Section 4. Section 5 presents the experimental design and evaluation on several benchmark datasets. A comparative evaluation of GHUM against the state-of-the-art FHN method is also made in this section. Finally, Section 6 provides concluding remarks. The limitations and directions for further research are also presented.

2. Related literature

In this section, we review the literature on high utility mining with both positive and negative unit profits. Subsequently, we discuss the key differences of the proposed work from existing works in the literature.
2.1. High utility mining with positive unit profits
One of the earliest works on high utility mining is the two-phase algorithm [1]. The two-phase algorithm used the concept of Transaction Weighted Utility (TWU), which serves as a utility upper bound for efficient mining of HUIs. The algorithm operates in two phases. In the first phase, the anti-monotonic property of the TWU of itemsets is exploited to mine all the high TWU itemsets. Subsequently, in the second phase, the low utility itemsets are pruned by computing the actual utilities of itemsets. The algorithm has been shown to suffer from scalability issues [3, 12]. This is primarily due to the level-wise candidate generation and test approach that it follows. Other works in the literature that use a similar level-wise approach include UMining & UMining H [8], FUM & DCG+ [9] and GPA [17]. The computationally expensive nature of these level-wise mining approaches renders them less useful for large scale high utility mining.

Tree based algorithms that mine high utility itemsets without the expensive candidate generation and test methodology include IHUP [10], HUC-Prune [11], UP-Growth [12] and UP-Growth+ [2]. IHUP [10] is experimentally proven to be better than Two-phase [1], FUM [9], and DCG+ [9].

The more recent works in the literature predominantly use utility-list based data structures. Some of the key utility list based methods include HUI-Miner [3], FHM [4], and HUP-Miner [13]. HUI-Miner [3] is one of the earliest works that employed the utility list based data structure for the HUI mining problem. The core idea of the utility list is similar to the tid list in the Eclat [18] algorithm, a frequent itemset mining algorithm. The authors demonstrate that their method is superior to IHUP [10, 12] and UP-Growth+ [2]. FHM [4] and HUP-Miner [13] extend the basic ideas proposed in HUI-Miner [3] and suggest several pruning strategies to improve the efficiency of utility mining.

d2HUP [14] employs a hyper-link based structure for efficient mining of HUIs.
The algorithm utilizes irrelevant item filtering and look-ahead pruning in order to improve the performance of utility mining. The concept of irrelevant item filtering involves iteratively eliminating the irrelevant items from the utility computation process. EFIM [6] uses database projection and transaction merging techniques for efficient mining. EFIM also uses different heuristics (sub-tree utility, local utility, fast utility counting) to prune the search space, perform fast utility computations and improve the performance of HUI mining. The authors demonstrate that their EFIM algorithm performs significantly better compared to other state-of-the-art methods, especially for highly dense datasets.

2.2. High utility mining with negative unit profits

Chu et al. [15] (HUINIV-Mine) extend the two-phase [1] algorithm, a level-wise candidate generation and test methodology, to mine HUIs with negative unit profits. The authors propose a revised TWU downward closure property to reduce the search space and improve the efficiency of mining. This approach is highly inefficient due to the level-wise candidate generation process. The method
performs very poorly at very low minimum utility threshold values and does not scale well for large datasets. The comparative evaluation of HUINIV-Mine against FHN [5] also shows that the method is two to three orders of magnitude slower and requires two orders of magnitude more memory.

Lin et al. [5] (FHN) extend the basic ideas of HUI-Miner [3] and present a PNU-list data structure. The data structure maintains the positive and negative item utilities separately. The authors adapt several pruning strategies used in the literature to efficiently mine HUIs. The specific pruning strategies adapted include: U-Prune [3, 13], EUCS-Prune [4] and LA-Prune [13]. The authors demonstrate that FHN performs significantly better than HUINIV-Mine [15]. The state-of-the-art FHN method is at least 2 orders of magnitude better than the HUINIV-Mine method in terms of both runtime and memory consumption.
2.3. High utility mining problem variants

Several variants of the basic HUI mining problem have been studied in the recent literature. Some of the key extensions include: on-shelf utility mining [19, 20], top-k HUIs [21, 22], uncertain HUIs [23, 24, 25] and high average utility itemsets [26]. Lan et al. [19] present an on-shelf utility mining algorithm (HOUI) using a level-wise candidate generation approach. The authors design a periodic total transaction utility table and a new pruning strategy based on an on-shelf utility measure. TS-HOUN [20] considers the on-shelf utility of items along with negative unit profits. Although TS-HOUN [20] considers negative unit profits during utility mining, it is not directly relevant to the proposed work, as the proposed work does not apply on-shelf or other temporal factors during the mining process. Another recent work that considers negative unit profits is by Gan et al. [24]. The authors propose a new algorithm (HUPNU) to discover potential high utility itemsets from uncertain or imprecise databases. However, the current work, like most other HUI mining works in the literature, assumes that precise data is available for mining.
2.4. Differences from prior works

It is evident from the foregoing review of the literature that there are only two closely related algorithms, namely, HUINIV-Mine [15] and FHN [5], for mining HUIs with negative unit profits. Our work mines HUIs in a single phase by employing a utility list based data structure. The presented method is, therefore, distinct from the related HUINIV-Mine method that follows a two-phase level-wise candidate generation and test methodology. The proposed work is distinct from FHN [5] on the following key aspects: The PNU-list data structure used in the FHN method, though comprehensive, is highly inefficient for the problem under study. We propose a simplified utility list data structure and demonstrate its usefulness. In addition, we present a new utility based anti-monotonic property. The presented anti-monotonic property enables better utility upper bound estimation and pruning of the search space.
Our experimental results reveal a substantial reduction in the total number of candidates evaluated during the mining process. Furthermore, several new optimizations are explored to significantly improve the efficiency of mining HUIs with negative unit profits. Overall, the paper aims to improve the state-of-the-art and presents a new method (GHUM) to efficiently mine high utility itemsets with negative unit profits.

3. Problem statement, definition and notation
We formally define the key terms in utility mining using the standard conventions followed in the literature [1, 3, 8, 10]. Let I = {i1, i2, ..., im} be a set of distinct items. A set X ⊆ I is referred to as an itemset. A transaction Tj = {xl | l = 1, 2, ..., Nj, xl ∈ I}, where Nj is the number of items in transaction Tj. A transaction database D is a set of transactions, D = {T1, T2, ..., Tn}, where n is the total number of transactions in the database. A sample transaction database D is given in Table 1.

Definition 1. Each item xi ∈ I is assigned an external utility value (e.g. profit), referred to as EU(xi). For example, in Table 2, EU(b) = 2.
Let PI ⊆ I be the set of items whose external utilities are greater than zero. Also, let NI ⊆ I be the set of items whose external utilities are less than zero. Note that PI ∩ NI = ∅ and PI ∪ NI = I. For the example database in Table 1, PI = {b, c, d, e, h} and NI = {a, f, g}.
Definition 2. Each item xi ∈ Tj is assigned an internal utility value (e.g. purchase quantity), referred to as IU(xi, Tj). For example, in Table 1, IU(b, T3) = 2.
Definition 3. The utility of an item xi ∈ Tj, denoted as U(xi, Tj), is computed as the product of its external and internal utilities in the transaction Tj. That is,

$$U(x_i, T_j) = EU(x_i) \times IU(x_i, T_j) \tag{1}$$
For example, in Table 1, U (b, T3 ) = EU (b) ∗ IU (b, T3 ) = 2 ∗ 2 = 4.
Definition 4. The utility of an itemset X in transaction Tj (X ⊆ Tj) is denoted as U(X, Tj).

$$U(X, T_j) = \sum_{x_i \in X} U(x_i, T_j) \tag{2}$$
For example, in Table 1, U (ac, T1 ) = −1 + 1 = 0.
Definition 5. The positive utility of an itemset X in transaction Tj (X ⊆ Tj) is denoted as PU(X, Tj).

$$PU(X, T_j) = \sum_{x_i \in X \text{ and } x_i \in PI} U(x_i, T_j) \tag{3}$$

For example, in Table 1, PU(ac, T1) = 0 + 1 = 1.

Definition 6. The utility of an itemset X in database D is denoted as U(X).

$$U(X) = \sum_{X \subseteq T_j \in D} U(X, T_j) \tag{4}$$
For example, U(ac) = U(ac, T1) + U(ac, T2) + U(ac, T3) + U(ac, T4) = (0) + (4) + (0) + (1) = 5.

Definition 7. The positive utility of an itemset X in database D is denoted as PU(X).

$$PU(X) = \sum_{X \subseteq T_j \in D} PU(X, T_j) \tag{5}$$

For example, PU(ac) = PU(ac, T1) + PU(ac, T2) + PU(ac, T3) + PU(ac, T4) = (1) + (6) + (1) + (3) = 11.
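Definitions 1 to 7 can be checked on the running example with a short script. The following is a minimal sketch, assuming the data of Tables 1 and 2; the names `EU`, `DB`, `u_in`, `pu_in` and `over_db` are illustrative and do not come from the paper.

```python
# External utilities (Table 2) and purchase quantities (Table 1).
EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}

def u_in(itemset, tid):
    """U(X, Tj): utility of itemset X in one transaction that contains it."""
    return sum(EU[x] * DB[tid][x] for x in itemset)

def pu_in(itemset, tid):
    """PU(X, Tj): the same sum restricted to positive-profit items."""
    return sum(EU[x] * DB[tid][x] for x in itemset if EU[x] > 0)

def over_db(per_transaction, itemset):
    """Sum a per-transaction measure over all transactions containing X."""
    return sum(per_transaction(itemset, tid)
               for tid, items in DB.items() if set(itemset) <= set(items))

u_ac = over_db(u_in, "ac")    # U(ac)
pu_ac = over_db(pu_in, "ac")  # PU(ac)
print(u_ac, pu_ac)
```

The two printed values correspond to the worked examples U(ac) = 5 and PU(ac) = 11 given above.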
Definition 8. The transaction utility, TU(Tj), for transaction Tj is defined as

$$TU(T_j) = \sum_{x_i \in T_j} U(x_i, T_j) \tag{6}$$

For example, TU(T5) = U(b, T5) + U(c, T5) + U(e, T5) + U(g, T5) = 2 + 2 + 3 + (−2) = 5.

Definition 9. The positive transaction utility, PTU(Tj), for transaction Tj is defined as

$$PTU(T_j) = \sum_{x_i \in T_j} PU(x_i, T_j) \tag{7}$$
Table 1 Purchase history

| TID | Transaction | Purchase Qty (IU) | Utility (U) | Positive Trans. Utility (PTU) |
|---|---|---|---|---|
| T1 | a, c, d, h | 1, 1, 1, 3 | -1, 1, 2, 15 | 18 |
| T2 | a, c, e, g | 2, 6, 2, 5 | -2, 6, 6, -5 | 12 |
| T3 | a, b, c, d, e, f | 1, 2, 1, 5, 1, 5 | -1, 4, 1, 10, 3, -5 | 18 |
| T4 | a, b, c, d, e | 2, 4, 3, 3, 1 | -2, 8, 3, 6, 3 | 20 |
| T5 | b, c, e, g | 1, 2, 1, 2 | 2, 2, 3, -2 | 7 |

Table 2 Item profits

| Item | a | b | c | d | e | f | g | h |
|---|---|---|---|---|---|---|---|---|
| Profit $ per unit (EU) | -1 | 2 | 1 | 2 | 3 | -1 | -1 | 5 |
For example, P T U (T5 ) = P U (b, T5 ) + P U (c, T5 ) + P U (e, T5 ) + P U (g, T5 ) = 2+2+3+0=7
Definition 10. The user specified minimum utility percentage value is denoted as δ. The absolute minimum utility value, denoted as minutil, is then computed as

$$minutil = \delta \times \sum_{T_j \in D} PTU(T_j) \tag{8}$$
For the illustrative example in Table 1, when δ = 20%, the absolute minimum utility value, minutil = 0.20 ∗ 75 = 15.
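The transaction-level quantities of Definitions 8 to 10 can be reproduced in the same way. This is an illustrative sketch under the assumption that the database is stored as quantity dictionaries; `tu`, `ptu` and `delta_percent` are hypothetical names. The threshold is computed with integer percentages to avoid floating-point rounding in the comparison.

```python
EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}

def tu(tid):
    """TU(Tj): sum of the utilities of all items in the transaction."""
    return sum(EU[x] * q for x, q in DB[tid].items())

def ptu(tid):
    """PTU(Tj): sum of the utilities of the positive-profit items only."""
    return sum(EU[x] * q for x, q in DB[tid].items() if EU[x] > 0)

delta_percent = 20                      # delta = 20%
total_ptu = sum(ptu(tid) for tid in DB)  # sum of PTU over the database
minutil = delta_percent * total_ptu / 100

print(tu("T5"), ptu("T5"), total_ptu, minutil)
```

Note that minutil is derived from the positive transaction utilities, so negative-profit items never lower the threshold.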
Definition 11. The positive transaction-weighted utility of an itemset X, denoted as PTWU(X), is defined by

$$PTWU(X) = \sum_{X \subseteq T_j \in D} PTU(T_j) \tag{9}$$
For the transaction database in Table 1, P T W U (a) = 18 + 12 + 18 + 20 = 68.
Table 3 Transaction weighted utility

| Item | a | b | c | d | e | f | g | h |
|---|---|---|---|---|---|---|---|---|
| PTWU | 68 | 45 | 75 | 56 | 57 | 18 | 19 | 18 |
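The PTWU values of Table 3 follow directly from Definition 11. A small sketch, assuming the same dictionary representation of Tables 1 and 2 (the variable names are illustrative):

```python
EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}

# PTU per transaction: only positive-profit items contribute.
ptu = {tid: sum(EU[x] * q for x, q in items.items() if EU[x] > 0)
       for tid, items in DB.items()}

# PTWU of each single item: add the PTU of every transaction containing it.
ptwu = {}
for tid, items in DB.items():
    for x in items:
        ptwu[x] = ptwu.get(x, 0) + ptu[tid]

minutil = 15
# Items whose PTWU falls below minutil cannot appear in any high utility itemset.
unpromising = sorted(x for x in ptwu if ptwu[x] < minutil)
print(ptwu, unpromising)
```

With minutil = 15 no single item is unpromising in the running example, since the smallest PTWU value is 18.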
Definition 12. The support of an itemset X in database D, denoted as Sup(X), is the number of transactions in D that contain X divided by the total number of transactions (n). For example, Sup(ac) = 4/5 = 80%.
Property 1. (PTWU Pruning) If PTWU(X) < minutil, then ∀X′ ⊇ X, PTWU(X′) ≤ PTWU(X) < minutil.

As per the apriori property [7], Sup(X′) ≤ Sup(X). This implies that PTWU(X′) ≤ PTWU(X) < minutil. This property is an adaptation of the commonly exploited TWU based pruning strategy [1, 10] for the high utility mining problem with negative unit profits. This property can be used to prune items with positive or negative unit profits. The PTWU values for the sample transactional database in Table 1 are provided in Table 3.

Definition 13. (Ordering heuristic) The items in the transaction database are processed using a total order such that (1) negative items always succeed all positive items, (2) positive items are sorted in PTWU ascending order, and (3) negative items are sorted in support ascending order. This ordering is distinct from the earlier FHN method [5] that does not utilize support based ordering of negative items. For the running example, the ordering of items is: h ≺ b ≺ d ≺ e ≺ c ≺ f ≺ g ≺ a. The ordered purchase history for the sample database in Table 1 is given in Table 4 (using minutil = 15).

Table 4 Ordered purchase history

| TID | Transaction | Purchase Qty (IU) | Utility (U) | Positive Trans. Utility (PTU) |
|---|---|---|---|---|
| T1 | h, d, c, a | 3, 1, 1, 1 | 15, 2, 1, -1 | 18 |
| T2 | e, c, g, a | 2, 6, 5, 2 | 6, 6, -5, -2 | 12 |
| T3 | b, d, e, c, f, a | 2, 5, 1, 1, 5, 1 | 4, 10, 3, 1, -5, -1 | 18 |
| T4 | b, d, e, c, a | 4, 3, 1, 3, 2 | 8, 6, 3, 3, -2 | 20 |
| T5 | b, e, c, g | 1, 1, 2, 2 | 2, 3, 2, -2 | 7 |

Definition 14. Estimated Utility Co-occurrence Structure (EUCS) is a triangular matrix that holds the PTWU for a pair of items. More formally,

$$EUCS[i, j] = PTWU(X = \{i, j\})$$
This triangular matrix was first used in FHM [4] and adapted in FHN [5]. The EUCS for the running example is given in Figure 1.
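The ordering heuristic and the EUCS can be sketched as follows. This is an illustrative reading of Definitions 13 and 14, not the paper's implementation: the EUCS is stored as a dictionary keyed by ordered item pairs rather than a triangular matrix, support is kept as an absolute count, and all names are assumptions.

```python
from itertools import combinations

EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}

ptu = {tid: sum(EU[x] * q for x, q in items.items() if EU[x] > 0)
       for tid, items in DB.items()}
ptwu = {x: sum(ptu[tid] for tid, items in DB.items() if x in items) for x in EU}
support = {x: sum(x in items for items in DB.values()) for x in EU}

# Positive items by ascending PTWU, then negative items by ascending support.
positives = sorted((x for x in EU if EU[x] > 0), key=lambda x: ptwu[x])
negatives = sorted((x for x in EU if EU[x] < 0), key=lambda x: support[x])
order = positives + negatives

# EUCS[(i, j)] = PTWU({i, j}), with i preceding j in the total order.
eucs = {}
for tid, items in DB.items():
    for pair in combinations(sorted(items, key=order.index), 2):
        eucs[pair] = eucs.get(pair, 0) + ptu[tid]

print(order)
print(eucs[("b", "g")])
```

The running example has no ties in PTWU among positive items or in support among negative items, so the resulting order is unambiguous here; a tie-breaking rule would be needed in general.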
Definition 15. Let Tj /X denote the set of all items after X in Tj . For example, in Table 4, T1 /d = ca, T4 /bd = eca
Fig. 1. Estimated Utility Co-occurrence Structure (EUCS)
Definition 16. The positive remaining utility of an itemset X in transaction Tj (X ⊆ Tj) is denoted as PRU(X, Tj) and is computed as

$$PRU(X, T_j) = \sum_{x_i \in (T_j / X)} PU(x_i, T_j) \tag{10}$$
For example, in Table 4, PRU(d, T3) = 3 + 1 + 0 + 0 = 4 and PRU(be, T4) = 3 + 0 = 3.

Definition 17. The positive remaining utility of an itemset X in database D is denoted as PRU(X).

$$PRU(X) = \sum_{X \subseteq T_j \in D} PRU(X, T_j) \tag{11}$$
For example, in Table 4, P RU (d) = 1 + 4 + 6 = 11.
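The positive remaining utility can be computed from the ordered transactions as in the sketch below. It assumes the item order h ≺ b ≺ d ≺ e ≺ c ≺ f ≺ g ≺ a of the running example; `ORDER`, `pru_in` and `pru` are illustrative names.

```python
EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}
ORDER = "hbdecfga"  # total order of the running example (Definition 13)

def pru_in(itemset, tid):
    """PRU(X, Tj): positive utility of the items that follow X in Tj."""
    last = max(ORDER.index(x) for x in itemset)
    return sum(EU[x] * q for x, q in DB[tid].items()
               if ORDER.index(x) > last and EU[x] > 0)

def pru(itemset):
    """PRU(X): PRU(X, Tj) summed over the transactions that contain X."""
    return sum(pru_in(itemset, tid)
               for tid, items in DB.items() if set(itemset) <= set(items))

print(pru_in("d", "T3"), pru_in("be", "T4"), pru("d"))
```

Because negative items come last in the order, any itemset that already contains a negative item has a positive remaining utility of zero.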
Problem statement. The problem of high utility mining involves determining all itemsets in D whose utility values are greater than or equal to the user defined minimum utility value minutil. That is,

$$HUI = \{X : U(X) \mid X \subseteq I,\ U(X) \geq minutil\} \tag{12}$$
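For a database as small as the running example, the set in Eq. (12) can be enumerated exhaustively, which gives a useful reference result for checking any pruning-based miner. The sketch below is a brute-force baseline only (exponential in the number of items) and is not the method proposed in this paper; the function names are illustrative.

```python
from itertools import combinations

EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}

def utility(itemset):
    """U(X): utility of X summed over all transactions that contain X."""
    return sum(sum(EU[x] * items[x] for x in itemset)
               for items in DB.values() if set(itemset) <= set(items))

def brute_force_huis(minutil):
    """Enumerate every non-empty itemset and keep those with U(X) >= minutil."""
    items = sorted(EU)
    huis = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            u = utility(candidate)
            if u >= minutil:
                huis["".join(candidate)] = u  # key is alphabetically sorted
    return huis

huis = brute_force_huis(15)
print(len(huis), huis["bcde"], huis["ace"])
```

Keys are written in alphabetical order, so the itemset listed as bdec in the text appears as `bcde` here.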
For the running example, when minutil = 15, HUI = {h : 15, hd : 17, hdc : 18, hdca : 17, hda : 16, hc : 16, hca : 15, bd : 28, bde : 34, bdec : 38, bdeca : 35, bdea : 31, bdc : 32, bdca : 29, bda : 25, be : 23, bec : 29, beca : 19, bea : 15, bc : 20, d : 18, de : 22, dec : 26, deca : 23, dea : 19, dc : 23, dca : 19, e : 15, ec : 27, eca : 17}.

Our main objective in this paper is to mine all high utility itemsets (HUI) from the transaction database (D) at the user specified δ or minutil value. Our method considers items with positive as well as negative unit profits.

4. Our proposed method
In this section, we present a simplified utility list data structure, discuss the proposed pruning strategies and outline the key algorithm steps.
4.1. Utility list data structure

The presented utility list structure is inspired by HUI-Miner [3]. The original data structure was designed for handling items with only positive unit profits. We adapt the structure to handle both positive and negative unit profits. The actual utility list structure used in the proposed work is presented in Figure 2. The header of the data structure has the item (or itemset X), the sum of the total utility of X, and the sum of the positive remaining utility of X. Each entry in the utility list contains transaction level information. More specifically, each entry holds the transaction identifier Tj, the utility of X in the transaction Tj and the positive remaining utility of X in the transaction Tj. Unlike the state-of-the-art FHN algorithm [5], our utility list data structure does not require separation of positive and negative itemset utility values. The 1-item utility lists for the running example are provided in Figure 3. The collection of utility lists (all 1-itemsets) is ordered using the ordering heuristic (refer to Definition 13).
Fig. 2. Utility list structure
Fig. 3. 1-itemset utility list for the sample database
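One possible in-memory representation of the utility list described in Section 4.1 is sketched below. The class and field names (`UtilityList`, `sum_util`, `sum_pru`, `entries`) are assumptions chosen for illustration, and the construction routine is a simplified single pass over the already ordered transactions.

```python
from dataclasses import dataclass, field

EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}
ORDER = "hbdecfga"  # total order of the running example (Definition 13)

@dataclass
class UtilityList:
    itemset: str
    sum_util: int = 0   # header: total utility U(X)
    sum_pru: int = 0    # header: total positive remaining utility PRU(X)
    entries: list = field(default_factory=list)  # (tid, U(X,Tj), PRU(X,Tj))

    def add(self, tid, util, pru):
        self.entries.append((tid, util, pru))
        self.sum_util += util
        self.sum_pru += pru

def build_one_item_lists():
    lists = {x: UtilityList(x) for x in ORDER}
    for tid, items in DB.items():
        ordered = sorted(items, key=ORDER.index)
        # Positive utility still to come; shrinks as positive items are passed.
        remaining = sum(EU[x] * items[x] for x in ordered if EU[x] > 0)
        for x in ordered:
            util = EU[x] * items[x]
            if util > 0:
                remaining -= util
            lists[x].add(tid, util, remaining)
    return lists

uls = build_one_item_lists()
print(uls["d"])
```

A single signed utility per entry is sufficient in this sketch because the remaining utility only accumulates positive items, which reflects the simplification described above.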
4.2. Pruning properties
The GHUM algorithm uses the 1-itemset utility list and explores the search space for mining all high utility itemsets. During the mining process, several pruning strategies are employed to limit the search space. Specific pruning properties used in GHUM are described next.
Property 2. (EUCS-Prune) If the PTWU of a 2-itemset X, as stored in the EUCS, is less than minutil, then neither X nor any of the supersets of X is a high utility itemset.
Proof. This pruning property is borrowed from the existing literature and the proof can be found in FHN [5]. For the running example with minutil = 15, let us consider the itemset X = {bg}. The value EUCS[X = {bg}] = 7 (refer to Figure 1) is less than minutil = 15. Therefore, neither bg nor any of its supersets can be a high utility itemset.
Property 3. (A-Prune) If X′ ⊃ X and X′ − X ⊆ NI and X′ − X ≠ ∅, then U(X′) < U(X). It is also assumed that the ordering of items is done as per the ordering heuristic. We refer to this new property as the utility based anti-monotonic property of itemsets.
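Proof. Let us prove this property in two steps. First, let us assume that y is an individual item. That is, X′ − X = y ∈ NI.

$$
\begin{aligned}
U(Xy) &= \sum_{Xy \subseteq T_j \in D} \sum_{x_i \in Xy} U(x_i, T_j) \\
&= \sum_{Xy \subseteq T_j \in D} \sum_{x_i \in X} U(x_i, T_j) + \sum_{Xy \subseteq T_j \in D} U(y, T_j) \\
&= U(X) - \sum_{Xy \not\subseteq T_j,\ X \subseteq T_j} U(X, T_j) + U(y) - \sum_{Xy \not\subseteq T_j,\ y \subseteq T_j} U(y, T_j) \\
&\leq U(X) + U(y) \\
&< U(X), \quad \text{as } U(y) < 0,\ \forall y \in NI
\end{aligned}
$$

Hence the proof.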
Let us now generalize by relaxing the assumption that y is an individual item. That is, X′ − X = Y ⊆ NI. Then

$$
\begin{aligned}
U(XY) &= \sum_{XY \subseteq T_j \in D} \sum_{x_i \in XY} U(x_i, T_j) \\
&= \sum_{XY \subseteq T_j \in D} \sum_{x_i \in X} U(x_i, T_j) + \sum_{XY \subseteq T_j \in D} \sum_{y_i \in Y} U(y_i, T_j) \\
&= U(X) - \sum_{XY \not\subseteq T_j,\ X \subseteq T_j} U(X, T_j) + \sum_{y_i \in Y} U(y_i) - \sum_{XY \not\subseteq T_j,\ Y \subseteq T_j} U(Y, T_j) \\
&\leq U(X) + \sum_{y_i \in Y} U(y_i) \\
&< U(X), \quad \text{as } U(y_i) < 0,\ \forall y_i \in NI
\end{aligned}
$$

Hence the proof. It is to be noted that this property only holds when the items are ordered as per the ordering heuristic. Primarily, the property requires that all the positive items precede the negative items.
This is one of the important properties proposed in this paper. This property helps significantly speed up the high utility mining process, and is similar to the anti-monotonic property of support commonly used in frequent itemset mining problems.
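The effect of A-Prune can be observed on the running example by extending a few prefixes made of positive items with each negative item. The sketch below is only an empirical illustration on hand-picked prefixes (e, ec and bdec), not a proof; the helper names are assumptions.

```python
EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}

def utility(itemset):
    """U(X): utility of X summed over all transactions that contain X."""
    return sum(sum(EU[x] * items[x] for x in itemset)
               for items in DB.values() if set(itemset) <= set(items))

NEGATIVE = [x for x in EU if EU[x] < 0]  # NI = {a, f, g}

# For each prefix X of positive items and each y in NI, record U(X) and U(Xy).
checks = [(prefix, y, utility(prefix), utility(prefix + y))
          for prefix in ("e", "ec", "bdec") for y in NEGATIVE]

for prefix, y, before, after in checks:
    print(f"U({prefix}) = {before:3d}   U({prefix}{y}) = {after:3d}")

# True when every negative extension strictly lowered the utility.
holds = all(after < before for _, _, before, after in checks)
print(holds)
```

In a search procedure this means that once U(X) drops below minutil on a path where only negative items remain, the path can be abandoned.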
Property 4. (U-Prune) If the total sum of the utility and remaining utility of an itemset X is less than minutil, then none of the supersets of X is a high utility itemset. That is, if U(X) + PRU(X) < minutil, then X′ ∉ HUI ∀X′ ⊇ X. This utility property is an adaptation of the pruning strategy used in HUI-Miner [3]. The proposed property handles both positive and negative unit profit items.
Proof. If X contains only positive items, the property reduces to the remaining utility property introduced in HUI-Miner [3]. The proof for this base case can be found in HUI-Miner [3]. Now, let us generalize and assume that X contains one or more items from NI. When X contains at least one item from NI, the second part of the expression (i.e. PRU(X)) becomes zero. This is always true, as PRU by definition accumulates only the positive remaining utility of items. Besides, the utility lists are ordered as per the ordering heuristic. That is, all the negative unit profit items
are processed only after all the positive unit profit items are processed in the search path. Therefore, the U-Prune property reduces to: if U(X) + 0 < minutil, then X′ ∉ HUI ∀X′ ⊇ X. We also know that the A-Prune property holds when X′ − X ⊆ NI. Hence, U(X′) < U(X) < minutil. It is to be noted that FHN [5] adapts the remaining utility property [3] for only the positive items. Hence, it heavily overestimates the utility upper bound during the mining of items with negative unit profits.

Illustrative example 1. For the running example, let us consider the extensions of itemset e. We compare the efficacy of the U-Prune property for both the FHN and GHUM methods. The results of our analysis are provided in Figure 4. For this illustration, we assume that the EUCS property is not applied for either method. The relaxation of the EUCS property is done primarily to illustrate the core ideas using the sample database. It does not lead to any loss of generality of the presented results. As per the illustration in Figure 4, the itemset ecg will have an upper bound of 17 for the FHN method. This can be easily verified using pruning strategy 1 proposed in the FHN method. On the other hand, the upper bound for the itemset ecg estimated by GHUM will be just 10. Given a minutil value of 15, the FHN method explores the search path further, whereas the GHUM algorithm terminates the search path as the itemset satisfies the pruning property.

Illustrative example 2. For the running example, another illustration is provided in Figure 5. In this illustration, the EUCS property is applied for both the FHN and GHUM methods while exploring the extensions of itemset b. From the figure, it is evident that FHN overestimates the utility values for the itemsets bdef and bdcf and therefore explores the search path further.
Fig. 4. Illustration 1: Application of A-Prune and U-Prune (without EUCS-Prune)
Fig. 5. Illustration 2: Application of A-Prune and U-Prune (with EUCS-Prune)
The above two illustrations demonstrate the utility of the proposed U-Prune property for HUI mining with negative unit profits. The number of nodes pruned is quite low for the small illustrative example. The actual reductions in the number of nodes are quite substantial for the benchmark datasets. Our experimental results provide evidence for the significant candidate size reductions.
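The GHUM upper bound used in Illustrative example 1 can be recomputed from the definitions of U(X) and PRU(X). The following sketch assumes the item order of the running example and evaluates U(X) + PRU(X) directly from the database rather than from utility lists; the function names are illustrative.

```python
EU = {"a": -1, "b": 2, "c": 1, "d": 2, "e": 3, "f": -1, "g": -1, "h": 5}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1, "h": 3},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 5, "e": 1, "f": 5},
    "T4": {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 1, "c": 2, "e": 1, "g": 2},
}
ORDER = "hbdecfga"  # total order of the running example (Definition 13)

def covers(itemset, items):
    return set(itemset) <= set(items)

def utility(itemset):
    """U(X) over all transactions containing X."""
    return sum(sum(EU[x] * items[x] for x in itemset)
               for items in DB.values() if covers(itemset, items))

def pru(itemset):
    """PRU(X): positive utility of items ordered after X, over matching transactions."""
    last = max(ORDER.index(x) for x in itemset)
    return sum(EU[x] * q
               for items in DB.values() if covers(itemset, items)
               for x, q in items.items()
               if EU[x] > 0 and ORDER.index(x) > last)

def u_prune_bound(itemset):
    """Upper bound U(X) + PRU(X) used by U-Prune."""
    return utility(itemset) + pru(itemset)

minutil = 15
for itemset in ("e", "ec", "ecg"):
    bound = u_prune_bound(itemset)
    print(itemset, bound, "prune" if bound < minutil else "explore")
```

For ecg the remaining utility is zero because only the negative item a follows g, so the bound equals U(ecg) = 10 and the branch is pruned at minutil = 15.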
Property 5. (LA-Prune) Given two itemsets X and Y, if

$$\sum_{\forall T_i \in D} \big[ U(X, T_i) + PRU(X, T_i) \big] - \sum_{\forall T_j \in D,\ X \subseteq T_j \text{ and } Y \not\subseteq T_j} \big[ U(X, T_j) + PRU(X, T_j) \big] < minutil,$$

then ∀Y′ ⊇ Y and X′ ⊇ X, X′Y′ ∉ HUI.
Proof. This pruning property is an adaptation of the LA-Prune property used in HUP-Miner [13]. The primary logic of this proof is similar to the one shown above for U-Prune. Let us consider two cases. If X contains only positive items, the property reduces to the LA-Prune property introduced in HUP-Miner [13]. The proof for the base case can be found in HUP-Miner [13].

Now, let us generalize and assume that X contains one or more items from NI. When X contains at least one item from NI, PRU(X, Ti) becomes zero. Mathematically, the pruning condition is

$$\sum_{\forall T_i \in D} \big[ U(X, T_i) + PRU(X, T_i) \big] - \sum_{\forall T_j \in D,\ X \subseteq T_j \text{ and } Y \not\subseteq T_j} \big[ U(X, T_j) + PRU(X, T_j) \big] < minutil.$$

Expanding the first term on the left side of the expression by rewriting the condition (without loss of generality) yields:

$$
\begin{aligned}
& \sum_{\forall T_i \in D,\ X \subseteq T_i \text{ and } Y \subseteq T_i} \big[ U(X, T_i) + PRU(X, T_i) \big] + \sum_{\forall T_i \in D,\ X \subseteq T_i \text{ and } Y \not\subseteq T_i} \big[ U(X, T_i) + PRU(X, T_i) \big] \\
& \qquad - \sum_{\forall T_j,\ X \subseteq T_j \text{ and } Y \not\subseteq T_j} \big[ U(X, T_j) + PRU(X, T_j) \big] < minutil \\
& \sum_{\forall T_i \in D,\ X \subseteq T_i \text{ and } Y \subseteq T_i} \big[ U(X, T_i) + PRU(X, T_i) \big] < minutil, \quad \text{by cancellation of the 2nd and 3rd terms, which are equal} \\
& \sum_{\forall T_i \in D,\ X \subseteq T_i \text{ and } Y \subseteq T_i} U(X, T_i) + 0 < minutil, \quad \text{as } PRU(X, T_i) = 0 \\
& U(X) - \sum_{\forall T_i \in D,\ X \subseteq T_i \text{ and } XY \not\subseteq T_i} U(X, T_i) + 0 < minutil, \quad \text{by expansion}
\end{aligned}
$$

Assuming (without loss of generality) that XY always co-occurs, the above expression reduces to U(X) < minutil. Given that the A-Prune property holds when X′ − X ⊆ NI, U(X′) < U(X) < minutil. Hence, the proof.

It is to be noted that FHN [5] adapts LA-Prune [13] for only the positive items. As a result, the FHN method overestimates the utility upper bound values when items with negative profits are encountered in the search path. Our approach effectively leverages the new utility based anti-monotonic property to limit the number of candidates generated.

Property 6. (N-Prune) If X′ ⊃ X and Y = X′ − X ⊆ NI and X′ − X ≠ ∅, then if

$$U(X) + \sum_{\forall T_j \in D,\ XY \subseteq T_j} U(Y, T_j) < minutil,$$

then U(X′) < minutil and U(X″ ⊃ X′) < minutil.
Proof. $U(X') = U(XY) = \sum_{\forall T_j \in D,\, XY \subseteq T_j} U(X, T_j) + \sum_{\forall T_j \in D,\, XY \subseteq T_j} U(Y, T_j)$

$\le U(X) + \sum_{\forall T_j \in D,\, XY \subseteq T_j} U(Y, T_j)$, since $U(X)$ includes cases where $X$ is present but not $Y$

$< minutil$, as per the original condition.

Given that the A-Prune property holds when $X' - X \subseteq NI$, $U(X'' \supset X') < U(X') < U(X) < minutil$. Hence, the proof.

This property is applied iteratively during the utility computation process to perform early pruning. It helps reduce the number of transaction list intersections performed by GHUM. For the running example, consider the itemset cg that is constructed by intersecting the utility lists of c and g (refer to Figure 4). Let us also assume that minutil = 12 for this example. Initially, the total utility value $U(c)$ is 13. The first transaction match (for c and g) is found for transaction 2. The $U(Y)$ value for this transaction is -5. Applying the pruning condition yields 13 - 5 = 8 < minutil = 12. Hence, the itemset cg and all of its supersets cannot be high utility itemsets. This property allows early pruning of itemsets without traversing the entire utility list.

4.3. GHUM algorithm
The key steps of our algorithm are outlined in this section. The complete pseudo-code for GHUM is given in GHUM (Algorithm 1), Explore-Search-Tree (Algorithm 1B) and ConstructUL (Algorithm 1C).
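As an illustration of how Properties 5 and 6 can act during a utility-list join, the following is a minimal Python sketch, not the SPMF-based implementation used in the paper. It assumes a utility list is a list of `(tid, utility, positive remaining utility)` tuples; the function name `join_with_pruning`, the `prefix_util` argument and the sample data are hypothetical. The sample data is chosen only to reproduce the arithmetic of the running example (13 - 5 = 8 < 12) and is not the content of Figure 4.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout of one utility-list element: (tid, utility, positive remaining utility)
Entry = Tuple[int, int, int]


def join_with_pruning(
    ul_x: List[Entry],
    ul_y: List[Entry],
    minutil: int,
    prefix_util: Optional[Dict[int, int]] = None,
) -> Optional[List[Entry]]:
    """Join the utility lists of Rx and Ry; return None when the join is pruned."""
    prefix_util = prefix_util or {}  # tid -> utility of the shared prefix R (hypothetical helper)
    y_by_tid = {tid: (u, pru) for tid, u, pru in ul_y}
    # LA-Prune bound: starts at U(Rx) + PRU(Rx).
    la_bound = sum(u + pru for _, u, pru in ul_x)
    # N-Prune running value: starts at U(Rx).
    running = sum(u for _, u, _ in ul_x)
    joined: List[Entry] = []
    for tid, u_x, pru_x in ul_x:
        if tid in y_by_tid:
            u_y, pru_y = y_by_tid[tid]
            # Utility that Ry adds beyond the shared prefix R in this transaction.
            delta = u_y - prefix_util.get(tid, 0)
            running += delta
            if pru_x == 0 and running < minutil:
                return None  # N-Prune: only negative items can follow
            joined.append((tid, u_x + delta, pru_y))
        else:
            # Rx occurs without Ry: remove this transaction from the bound.
            la_bound -= u_x + pru_x
            if la_bound < minutil:
                return None  # LA-Prune
    return joined


# Hypothetical lists: U(c) = 6 + 7 = 13, and g contributes -5 in transaction 2.
ul_c = [(2, 6, 0), (3, 7, 0)]
ul_g = [(2, -5, 0)]
print(join_with_pruning(ul_c, ul_g, minutil=12))  # pruned after the first match
print(join_with_pruning(ul_c, ul_g, minutil=5))   # join completes
```

With minutil = 12 the loop stops at the first matching transaction, which is the early termination described for the itemset cg.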
4.3.1. TWU computation & utility list generation
The algorithm first scans the transaction database and computes PTWU values for all 1-itemsets. A second scan of the database is then made to sort the items, recompute $PTU_j$, recompute $PTWU(x_i)$ and iteratively construct 1-itemset ULs. The PTWU values for 1-itemsets are recomputed to remove unpromising 1-itemsets (in Step 17 of Algorithm 1). After applying PTWU pruning in Step 6, the overall PTWU values for items are likely to fall. Hence, we recompute PTWU and apply it as an additional optimization (OPT O1) to collapse the utility lists by removing unpromising 1-itemsets. In Step 18 of Algorithm 1, GHUM sorts ULs based on 1-itemset support values. This sorting is done only for the negative unit profit items. Therefore, it does not impact the utility values (especially PRU, which depends on the ordering) maintained in ULs. We apply this step as an additional optimization (OPT O2) to speed up the high utility mining process. It is also possible to dynamically sort negative unit profit items during the mining process, without impacting the utility values maintained in ULs. We analyze the performance of such optimizations and present our findings in the experimental results section.
Algorithm 1 GHUM: Main
Input: D, the transaction database; minutil, the minimum utility threshold
Output: HUI, the set of all high utility itemsets
1.  Scan D and compute PTWU for all 1-itemsets
2.  for each Tj ∈ D do
3.    Sort xi's in Tj as per the ordering heuristic
4.    PTUj ← 0 //Recompute PTU
5.    for each item xi ∈ Tj do
6.      if PTWU(xi) ≥ minutil then //PTWU Pruning
7.        PTUj = PTUj + PU(xi, Tj)
8.      end if
9.    end for
10.   for each item xi ∈ Tj do
11.     if PTWU(xi) ≥ minutil then
12.       newPTWU(xi) = newPTWU(xi) + PTUj //Recompute PTWU
13.     end if
14.   end for
15.   Iteratively construct 1-itemset ULs
16. end for
17. Remove ULs for item xi if newPTWU(xi) < minutil //OPT O1
18. Sort ULs again, only for negative items, using 1-item support //OPT O2
19. HUI = Explore-Search-Tree({}, ULs, minutil) //Refer Algo 1B
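Steps 1 to 17 of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the paper's implementation. It assumes that each transaction is a mapping from item to that item's utility in the transaction, and that PU(x, T) counts only positive utilities (the definitions of PU, PTU and PTWU are given earlier in the paper and are not restated here). The names `positive_utility`, `ptwu` and `promising_items`, and the sample database, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

Transaction = Dict[str, int]  # item -> utility of that item in the transaction


def positive_utility(u: int) -> int:
    # Assumption: PU(x, T) ignores negative utilities.
    return max(u, 0)


def ptwu(db: List[Transaction], keep: Optional[Set[str]] = None) -> Dict[str, int]:
    """PTWU per item; when `keep` is given, PTU is computed over kept items only."""
    totals: Dict[str, int] = defaultdict(int)
    for t in db:
        items = [x for x in t if keep is None or x in keep]
        ptu = sum(positive_utility(t[x]) for x in items)  # steps 4-9
        for x in items:
            totals[x] += ptu                               # steps 10-14
    return dict(totals)


def promising_items(db: List[Transaction], minutil: int) -> Set[str]:
    first = ptwu(db)                                        # step 1
    keep = {x for x, v in first.items() if v >= minutil}    # PTWU pruning (steps 6, 11)
    second = ptwu(db, keep)                                 # recomputed PTU and PTWU
    return {x for x, v in second.items() if v >= minutil}   # step 17 (OPT O1)


# Hypothetical database with one negative-utility item g.
db = [
    {"a": 6, "b": 3},
    {"b": 2, "c": 5},
    {"a": 5, "g": -2},
]
print(ptwu(db))
print(promising_items(db, minutil=12))
```

In this toy database, item b passes the first PTWU check but is removed after the recomputation, because part of its first-pass PTWU came from the pruned item c. This is the effect that OPT O1 relies on.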
4.3.2. Tree exploration process
The 1-itemset utility lists generated in the earlier step of the algorithm are used to mine all the high utility itemsets. The search tree exploration process is similar to other works in the literature [3, 5]. The actual steps followed by GHUM are outlined in Algorithm 1B. In order to limit the number of candidates generated during the tree exploration process, two key pruning strategies are applied, namely U-Prune and EUCS-Prune. Unlike earlier works in the literature, the pruning strategies adopted in this paper are more generic and apply to both positive and negative unit profit items. The U-Prune strategy reduces to A-Prune for negative unit profit items, as described earlier in Section 3.
Algorithm 1B GHUM: Explore-Search-Tree
Input: R, the UL of itemset R; ULs, the set of ULs of all R's 1-extensions; minutil, the minimum utility threshold
Output: all HUIs with prefix R
1.  for each utility list X in ULs do
2.    if U(X) ≥ minutil then HUI = {HUI ∪ X}
3.    if U(X) + PRU(X) ≥ minutil then //U-Prune
4.      exULs ← {}
5.      for each utility list Y after X in ULs do
6.        if EUCS[X, Y] ≥ minutil then //EUCS-Prune
7.          UL(XY) = ConstructUL(R, X, Y) //Refer Algorithm 1C
8.          if UL(XY) ≠ NULL then exULs = {exULs ∪ UL(XY)}
9.        end if
10.     end for
11.     Explore-Search-Tree(X, exULs, minutil)
12.   end if
13. end for
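The recursion of Algorithm 1B can be sketched in Python as below. This is a simplified illustration under several assumptions, not the paper's code: a utility list is a pair `(itemset, entries)` with entries of the form `(tid, utility, positive remaining utility)`; the helper `join` is a plain tid-intersection that omits LA-Prune and N-Prune; `eucs` is a dictionary keyed by item pairs, and a missing pair is treated as "not pruned". All names and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int, int]               # (tid, utility, positive remaining utility)
UL = Tuple[Tuple[str, ...], List[Entry]]   # (itemset, entries)


def join(prefix: UL, x: UL, y: UL) -> UL:
    """Plain tid-intersection join of two extensions of `prefix` (no LA-Prune / N-Prune)."""
    pre = {tid: u for tid, u, _ in prefix[1]}
    ys = {tid: (u, pru) for tid, u, pru in y[1]}
    entries = [
        (tid, u + ys[tid][0] - pre.get(tid, 0), ys[tid][1])
        for tid, u, _ in x[1]
        if tid in ys
    ]
    return (x[0] + y[0][-1:], entries)


def explore(prefix: UL, uls: List[UL], minutil: int,
            eucs: Dict[Tuple[str, str], int], out: List) -> None:
    for i, x in enumerate(uls):
        u = sum(e[1] for e in x[1])
        pru = sum(e[2] for e in x[1])
        if u >= minutil:
            out.append((x[0], u))                      # x is a high utility itemset
        if u + pru >= minutil:                         # U-Prune
            extensions = []
            for y in uls[i + 1:]:
                pair = (x[0][-1], y[0][-1])
                if eucs.get(pair, minutil) >= minutil:  # EUCS-Prune
                    xy = join(prefix, x, y)
                    if xy[1]:
                        extensions.append(xy)
            explore(x, extensions, minutil, eucs, out)


# Hypothetical 1-itemset utility lists; the negative item g is ordered last.
uls = [
    (("a",), [(1, 5, 2), (2, 4, 3)]),
    (("b",), [(1, 2, 0), (2, 3, 0), (3, 1, 0)]),
    (("g",), [(2, -5, 0), (3, -1, 0)]),
]
out: List = []
explore(((), []), uls, minutil=7, eucs={}, out=out)
print(out)
```

In this toy run, the branch rooted at b is cut by U-Prune because U(b) + PRU(b) = 6 is below the threshold, so no extension of b is ever joined.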
4.3.3. Utility list construction
The tree exploration process in GHUM is run recursively and in each step higher order utility lists are constructed. The utility list construction process involves performing transaction list intersections and computing utility values. The key steps in the utility list construction process are outlined in Algorithm 1C. GHUM limits the expensive transaction list construction process by applying two key pruning strategies, namely LA-Prune and N-Prune. The LA-Prune strategy proposed in this paper is a generalized version of LA-Prune [13] used in the literature. N-Prune is a new pruning strategy presented in this paper. This property is primarily applied to itemsets that contain at least one negative unit profit item. As the utility values monotonically decrease for negative unit profit items, GHUM terminates the expensive transaction list intersections (steps 9, 12, 14-16 of Algorithm 1C) when the utility values drop below minutil.

Algorithm 1C GHUM: ConstructUL
Input: UL of itemset R, Rx and Ry
Output: UL of itemset Rxy
1.  UL(Rxy) = NULL
2.  set TOTAL_UTILITY(Rx) = U(Rx) + PRU(Rx) //LA-Prune
3.  set RxUTILITY = U(Rx) //N-Prune
4.  for each element Ex ∈ Rx do
5.    if ∃ Ey ∈ Ry and Ex.TID == Ey.TID then
6.      if R is not empty then
7.        find E in R such that E.TID == Ex.TID
8.        Exy=
9.        RxUTILITY = RxUTILITY + U(Ry, Ey.TID) - U(R, E.TID)
10.     else
11.       Exy=
12.       RxUTILITY = RxUTILITY + U(Ry, Ey.TID)
13.     end if
14.     if PRU(Rx, Ex.TID) == 0 and RxUTILITY < minutil then
15.       return NULL //N-Prune (OPT O3)
16.     end if
17.     UL(Rxy) = {UL(Rxy) ∪ Exy} //Insert or update tidlist
18.     Create or update utility list of UL(Rxy)
19.   else //Recompute TOTAL_UTILITY and apply LA-Prune
20.     TOTAL_UTILITY(Rx) = TOTAL_UTILITY(Rx) - U(Rx, Ex.TID) - PRU(Rx, Ex.TID)
21.     if TOTAL_UTILITY(Rx) < minutil then return NULL
22.   end if
23. end for
24. return UL(Rxy)

The foregoing descriptions explained the different steps of the proposed GHUM algorithm. The step-by-step descriptions of the algorithm can be found in Algorithms 1, 1B and 1C. The performance of the proposed method is evaluated next on several benchmark datasets to demonstrate its usefulness.

5. Experimental results
We implemented the GHUM algorithm by extending the open-source data mining library SPMF [27]. All our experiments were conducted on an Intel Core i5-2500 machine with a 3.3 GHz CPU and 4 GB of memory, running a Windows OS. In order to ensure robustness of the results, we ran all our experiments five times and reported the average results.

5.1. Experimental Design
We evaluated the performance of GHUM on six real-life datasets. All the datasets were downloaded from the SPMF library [27]. The same datasets were used in earlier work [5] in the literature. The external utilities used in the datasets follow a log-normal distribution and are in the range of -1000 to 1000. The internal utilities of items are randomly generated in the range of 1 to 5. The detailed characteristics of the datasets used in our experiments are provided in Table 5.

Table 5 Dataset characteristics
Dataset     #Trans    #Items (I)   Avg Len (A)   Density (A/I)%
Kosarak     990002    41270        8.1           0.0196
Retail      88162     16470        10.3          0.0625
Pumsb       49046     2113         74            3.5021
Accidents   340183    468          33.8          7.2222
Mushroom    8124      119          23            19.3277
Chess       3196      75           37            49.3333
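The density column of Table 5 follows directly from the other two columns as 100 × A / I. The short Python check below recomputes it from the values listed in the table; the names `TABLE_5` and `density_percent` are ours and only serve this illustration.

```python
# (dataset, number of items I, average transaction length A, density reported in Table 5)
TABLE_5 = [
    ("Kosarak", 41270, 8.1, 0.0196),
    ("Retail", 16470, 10.3, 0.0625),
    ("Pumsb", 2113, 74, 3.5021),
    ("Accidents", 468, 33.8, 7.2222),
    ("Mushroom", 119, 23, 19.3277),
    ("Chess", 75, 37, 49.3333),
]


def density_percent(avg_len: float, n_items: int) -> float:
    """Density (A/I) expressed as a percentage."""
    return 100.0 * avg_len / n_items


for name, n_items, avg_len, reported in TABLE_5:
    computed = density_percent(avg_len, n_items)
    print(f"{name:10s} computed={computed:8.4f} reported={reported:8.4f}")
```

The table is ordered from the sparsest dataset (Kosarak) to the densest (Chess), which is the ordering the later speed-up discussion refers to.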
[Figure 6: six panels (kosarak, retail, pumsb, accidents, mushroom, chess) plotting Time (in sec) against Minimum utility (in 1000s) for FHN and GHUM]
Fig. 6. Runtime performance of GHUM and FHN
5.2. Runtime performance of GHUM
We compare the effectiveness of GHUM against the state-of-the-art FHN [5] algorithm. Figure 6 gives the results of our experiments on six benchmark datasets. It is evident from the charts that the proposed method performs significantly better than FHN. It can be observed that the performance improvement on the chess dataset is more than 2 orders of magnitude. For the retail dataset, FHN performs marginally better at higher minimum utility values. However, at lower minimum utility values GHUM takes significantly less execution time than FHN and exhibits a linear increase in execution time. The results of memory consumption for both algorithms are presented in Figure 7. The memory consumption is also found to be lower for GHUM on all of the datasets except the mushroom dataset. For the mushroom dataset, the improvement was clearly evident as we lowered the minimum utility threshold values.
[Figure 7: six panels (kosarak, retail, pumsb, accidents, mushroom, chess) plotting Memory (in MB) against Minimum utility (in 1000s) for FHN and GHUM]
Fig. 7. Memory consumption of GHUM and FHN
From the foregoing results, it is evident that GHUM performs significantly better compared to the state-of-the-art FHN method for mining high utility itemsets with negative unit profits.

5.3. Candidate size reductions
In the next set of experiments, we analyzed the candidate size reductions. The results of our experiments are presented in Figures 8 and 9. The number of evaluations is computed as the product of total number of candidates and total number of transaction list intersections. The results clearly indicate the efficacy of GHUM over FHN.

[Figure 8: six panels (kosarak, retail, pumsb, accidents, mushroom, chess) plotting # of candidates (in 1000s) against Minimum utility (in 1000s) for FHN and GHUM]
Fig. 8. Performance analysis of candidate sizes
We observed that a huge number of evaluations (billions or even trillions) are made by GHUM at very low minimum utility values. Future work could consider the use of additional pruning strategies at the transaction list intersection level to further improve the performance of utility mining. Table 6 provides a summary of performance improvements of GHUM over FHN. The results are presented for the lowest minimum utility values used in our experiments. Column 3 in Table 6 provides the speedup in performance. The speedup was found to be much higher for datasets with higher density values. The fraction of memory used, fraction of candidates and fraction of evaluations are given in columns 4, 5 and 6 of Table 6. The very low values of fraction of candidates and evaluations demonstrate the efficacy of the pruning strategies used and the performance improvements of GHUM over the state-of-the-art FHN method.
[Figure 9: six panels (kosarak, retail, pumsb, accidents, mushroom, chess) plotting # of evaluations (in Billions) against Minimum utility (in 1000s) for FHN and GHUM]
Fig. 9. Comparative analysis of total number of evaluations
Table 6 Improvements of GHUM over FHN at the lowest minimum utility evaluated
Dataset     Minutil (in 1000s)   Speedup   Fraction of mem used   Fraction of Cand.   Fraction of Evals.
Kosarak     500                  1.6       0.85                   0.999               0.99
Retail      0.80                 4.07      0.4                    0.004               0.00005
Pumsb       5000                 9.71      0.51                   0.694               0.31
Accidents   12000                9.62      0.5                    0.359               0.11
Mushroom    60                   6.63      0.26                   0.525               0.26
Chess       90                   187.99    0.59                   0.025               0.00043
5.4. Performance of optimizations
In the next set of experiments, we assessed the performance of a few optimizations. Optimization 1 (O1) is used to collapse 1-itemset ULs during the initial phase (refer to Step 17 of Algorithm 1). Optimization 2 (O2) is used to sort negative unit profit items based on 1-itemset support values (refer to Step 18 of Algorithm 1). Optimization 3 (O3) is the N-Prune strategy applied at the transaction list level for negative unit profit items. The performance results of each of the optimizations are given in Figures 10 and 11. GHUM: Base indicates the case where all three optimizations are turned off. The results reveal that using all three optimizations yields superior execution time performance. The memory consumption performance also shows promising results when the optimizations are applied.

[Figure 10: four panels for the accidents dataset (Run time analysis, Memory consumption, Number of candidates, Number of evaluations) plotted against Minimum utility (in 1000s) for GHUM: Base, GHUM: O1, GHUM: O1&2 and GHUM: O1,2&3]
Fig. 10. Performance evaluation of different optimizations for Accident dataset

[Figure 11: four panels for the mushroom dataset (Run time analysis, Memory consumption, Number of candidates, Number of evaluations) plotted against Minimum utility (in 1000s) for GHUM: Base, GHUM: O1, GHUM: O1&2 and GHUM: O1,2&3]
Fig. 11. Performance evaluation of different optimizations for Mushroom dataset

We also experimented with applying O2 (i.e. sorting negative unit profit items) dynamically during the search tree exploration process. However, we did not find significant improvements in performance for dynamic sorting. It is to be noted that all our comparisons against the state-of-the-art FHN utilize all three optimizations. In this section, we studied the relative performance of each of these optimizations to evaluate their effectiveness.

5.5. Discussion
We presented several new ideas to improve the state-of-the-art in high utility itemset mining with negative unit profits. The key improvements of GHUM over FHN are in the following aspects. First, GHUM uses a simplified data structure for maintaining utility list information during the mining process. Second, key pruning strategies (U-Prune and LA-Prune) proposed in the literature are adapted to effectively solve the HUI mining problem with negative unit profits. Third, the paper introduced a new utility based anti-monotonic property (A-Prune) of itemsets. Fourth, GHUM utilized a new N-Prune strategy to minimize the number of evaluations made during the transaction list intersection process. Finally, the paper explored a few optimizations hitherto unexplored in the literature. Rigorous experimental evaluation of the algorithm demonstrates the usefulness of the key ideas proposed in this paper. The results are promising: at the lowest minimum utility values evaluated, GHUM achieved speedups over FHN ranging from 1.6 (Kosarak) to 187.99 (Chess), as summarized in Table 6. Overall, the paper makes useful contributions to the theory of high utility mining. The paper also has practical significance since firms often consider both positive and negative unit profits while determining profitable itemsets. The GHUM method offers greater flexibility and performance for a decision maker to determine profitable itemsets from transactional databases.
6. Conclusion and future research directions
This paper presented a new utility mining method (GHUM) for efficiently mining high utility itemsets with negative unit profits. The method used a simplified utility-list data structure to store utility information during the mining process. The method adapts existing pruning strategies (generalized U-Prune and LA-Prune) and proposes two new pruning strategies (A-Prune and N-Prune) for efficient utility mining. The proposed method was found to be superior compared to the state-of-the-art FHN method on several benchmark datasets.

The proposed method considers negative unit profits like other earlier works in the literature (FHN, HUINIV-Mine). However, this makes the utility of such an item negative in every transaction. In practice, the same item can take either positive or negative utilities at individual transaction levels. For example, product discounts can be offered for specific time periods, and are likely to impact only a subset of the transactions in the database. In the current approach, this can be handled by making use of multiple items (i.e. one for positive unit profits and another for negative unit profits). Future work may take an integrated approach and consider positive and negative item utility values at individual transaction levels. One can also investigate the on-shelf utility mining problem, which considers temporal factors. Another interesting area of work is to extend the current work to uncertain or imprecise database environments [23].

In this paper, we have also reported the total number of evaluations in addition to the candidates generated during the mining process. It is evident from the results that billions or even trillions of evaluations are performed for several benchmark datasets at lower minimum utility values. We believe that there is scope for achieving significant reductions in the total number of evaluations made by designing suitable pruning strategies, especially during the transaction list intersection phase. LA-Prune and N-Prune strategies are attempts to address this problem. Future work can consider more such extensions to improve the performance of high utility itemset mining with negative unit profits.

References
[1] Y. Liu, W.-k. Liao, A. Choudhary, A two-phase algorithm for fast discovery of high utility itemsets, in: T. Ho, D. Cheung, H. Liu (Eds.), Advances in Knowledge Discovery and Data Mining, volume 3518 of Lecture Notes in Computer Science, 2005, pp. 689–695.

[2] V. S. Tseng, B.-E. Shie, C.-W. Wu, P. S. Yu, Efficient algorithms for mining high utility itemsets from transactional databases, IEEE Transactions on Knowledge and Data Engineering 25 (2012) 1772–1786.

[3] M. Liu, J. Qu, Mining high utility itemsets without candidate generation, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 55–64.

[4] P. Fournier-Viger, C.-W. Wu, S. Zida, V. S. Tseng, FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning, in: International Symposium on Methodologies for Intelligent Systems, 2014, pp. 83–92.

[5] J. C.-W. Lin, P. Fournier-Viger, W. Gan, FHN: An efficient algorithm for mining high-utility itemsets with negative unit profits, Knowledge-Based Systems 111 (2016) 283–298.

[6] S. Zida, P. Fournier-Viger, J. C.-W. Lin, C.-W. Wu, V. S. Tseng, EFIM: A fast and memory efficient algorithm for high-utility itemset mining, Knowledge and Information Systems 51 (2017) 595–625.

[7] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Databases, VLDB, 1994, pp. 487–499.

[8] H. Yao, H. J. Hamilton, Mining itemset utilities from transaction databases, Data & Knowledge Engineering 59 (2006) 603–626.

[9] Y.-C. Li, J.-S. Yeh, C.-C. Chang, Isolated items discarding strategy for discovering high utility itemsets, Data & Knowledge Engineering 64 (2008) 198–217.

[10] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, Y.-K. Lee, Efficient tree structures for high utility pattern mining in incremental databases, IEEE Transactions on Knowledge and Data Engineering 21 (2009) 1708–1721.

[11] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, Y.-K. Lee, HUC-Prune: An efficient candidate pruning technique to mine high utility patterns, Applied Intelligence 34 (2011) 181–198.

[12] V. S. Tseng, C.-W. Wu, B.-E. Shie, P. S. Yu, UP-Growth: An efficient algorithm for high utility itemset mining, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data mining, 2010, pp. 253–262.

[13] S. Krishnamoorthy, Pruning strategies for mining high utility itemsets, Expert Systems with Applications 42 (2015) 2371–2381.

[14] J. Liu, K. Wang, B. Fung, Direct discovery of high utility itemsets without candidate generation, in: Proceedings of the IEEE 12th International Conference on Data Mining, 2012, pp. 984–989.

[15] C.-J. Chu, V. S. Tseng, T. Liang, An efficient algorithm for mining high utility itemsets with negative item values in large databases, Applied Mathematics and Computation 215 (2009) 767–778.

[16] P. Fournier-Viger, FHN: Efficient mining of high-utility itemsets with negative unit profits, in: International Conference on Advanced Data Mining and Applications, Springer, 2014, pp. 16–29.

[17] G.-C. Lan, T.-P. Hong, V. S. Tseng, et al., An efficient gradual pruning technique for utility mining, International Journal of Innovative Computing Information and Control 8 (2012) 5165–5178.

[18] M. J. Zaki, Scalable algorithms for association mining, IEEE Transactions on Knowledge and Data Engineering 12 (2000) 372–390.

[19] G.-C. Lan, T.-P. Hong, V. S. Tseng, Discovery of high utility itemsets from on-shelf time periods of products, Expert Systems with Applications 38 (2011) 5851–5857.

[20] G.-C. Lan, T.-P. Hong, J.-P. Huang, V. S. Tseng, On-shelf utility mining with negative item values, Expert Systems with Applications 41 (2014) 3450–3459.

[21] V. S. Tseng, C.-W. Wu, P. Fournier-Viger, P. S. Yu, Efficient algorithms for mining top-k high utility itemsets, IEEE Transactions on Knowledge and Data Engineering 28 (2016) 54–67.

[22] Q.-H. Duong, B. Liao, P. Fournier-Viger, T.-L. Dam, An efficient algorithm for mining the top-k high utility itemsets using novel threshold raising and pruning strategies, Knowledge-Based Systems 104 (2016) 106–122.

[23] J. C.-W. Lin, W. Gan, P. Fournier-Viger, T.-P. Hong, V. S. Tseng, Efficient algorithms for mining high-utility itemsets in uncertain databases, Knowledge-Based Systems 96 (2016) 171–187.

[24] W. Gan, J. C.-W. Lin, P. Fournier-Viger, H.-C. Chao, V. S. Tseng, Mining high-utility itemsets with both positive and negative unit profits from uncertain databases, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2017, pp. 434–446.

[25] J. C.-W. Lin, W. Gan, P. Fournier-Viger, T.-P. Hong, V. S. Tseng, Efficiently mining uncertain high-utility itemsets, Soft Computing 21 (2017) 2801–2820.

[26] U. Yun, D. Kim, Mining of high average-utility itemsets using novel list structure and pruning strategy, Future Generation Computer Systems 68 (2017) 346–360.

[27] P. Fournier-Viger, A. Gomariz, A. Soltani, H. Lam, T. Gueniche, SPMF: Open-source data mining platform, http://www.philippe-fournier-viger.com/spmf, 2014.