An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies

An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies

ARTICLE IN PRESS JID: KNOSYS [m5G;April 21, 2016;22:14] Knowledge-Based Systems 0 0 0 (2016) 1–17 Contents lists available at ScienceDirect Knowl...

2MB Sizes 2 Downloads 63 Views

ARTICLE IN PRESS

JID: KNOSYS

[m5G;April 21, 2016;22:14]

Knowledge-Based Systems 0 0 0 (2016) 1–17

Contents lists available at ScienceDirect

Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys

An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies Quang-Huy Duong a,∗, Bo Liao a, Philippe Fournier-Viger b, Thu-Lan Dam a,c a

College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China School of Natural Sciences and Humanities, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, 518055, China c Faculty of Information Technology, Hanoi University of Industry, Hanoi, Vietnam b

a r t i c l e

i n f o

Article history: Received 3 December 2015 Revised 15 April 2016 Accepted 16 April 2016 Available online xxx Keywords: High utility itemset mining Top-k mining Threshold raising strategies Co-occurrence pruning Transitive extension pruning Coverage

a b s t r a c t Top-k high utility itemset mining is the process of discovering the k itemsets having the highest utilities in a transactional database. In recent years, several algorithms have been proposed for this task. However, it remains very expensive both in terms of runtime and memory consumption. The reason is that current algorithms often generate a huge amount of candidate itemsets and are unable to prune the search space effectively. In this paper, we address this issue by proposing a novel algorithm named kHMC to discover the top-k high utility itemsets more efficiently. Unlike several algorithms for top-k high utility itemset mining, kHMC discovers high utility itemsets using a single phase. Furthermore, it employs three strategies named RIU, CUD, and COV to raise its internal minimum utility threshold effectively, and thus reduce the search space. The COV strategy introduces a novel concept of coverage. The concept of coverage can be employed to prune the search space in high utility itemset mining, or to raise the threshold in top-k high utility itemset mining, as proposed in this paper. Furthermore, kHMC relies on a novel co-occurrence pruning technique named EUCPT to avoid performing costly join operations for calculating the utilities of itemsets. Moreover, a novel pruning strategy named TEP is proposed for reducing the search space. To evaluate the performance of the proposed algorithm, extensive experiments have been conducted on six datasets having various characteristics. Results show that the proposed algorithm outperforms the state-of-the-art TKO and REPT algorithms for top-k high utility itemset mining both in terms of memory consumption and runtime. © 2016 Elsevier B.V. All rights reserved.

1. Introduction Frequent itemset mining (FIM) [1–4] is a fundamental research topic in data mining. It consists of discovering the frequent itemsets appearing in a transactional database. A frequent itemset is a group of items having a support (occurrence frequency) that is no less than a user-specified minimum support threshold. Numerous algorithms have been developed to discover frequent itemsets efficiently [5–12]. Most of them are based on two well-known representative algorithms: Apriori [1] and FP-Growth [4]. The Apriori algorithm employs a level-wise candidate generation-and-test approach. A drawback of Apriori-based algorithms is that they scan the database multiple times and can generate a huge amount of candidates. This greatly reduces the performance of Aprioribased algorithms. The FP-Growth algorithm relies on an efficient



Corresponding author. Tel.: +84936240829. E-mail addresses: [email protected] (Q.-H. Duong), [email protected] (B. Liao), [email protected] (P. Fournier-Viger), lanfi[email protected] (T.-L. Dam).

tree structure named FP-Tree, which is a compact representation of the database. To discover frequent itemsets, FP-Growth first builds an FP-Tree, and then uses a pattern-growth approach to discover the frequent itemsets directly from the FP-tree, without generating candidates. FP-Growth-based algorithms generally outperform Apriori-based algorithms. In FIM, the downward closure property [1,4] is a well-known property, used to reduce the search space. This property states that if an itemset is infrequent, all its supersets are also infrequent. It is used by almost all FIM algorithms to prune the search space. FIM has a wide range of applications, and has been applied to various types of databases such as transactional databases [13], streaming databases [14], uncertain databases [15], and time-series databases [16]. Some representative applications of FIM are web analysis [17] and bioinformatics [18]. An important limitation of FIM is that in real-world applications, specifying the minimum support threshold required by FIM algorithms is not an easy task for users, since users often do not know what threshold values are the best for their requirements [19]. But choosing an appropriate threshold value is crucial

http://dx.doi.org/10.1016/j.knosys.2016.04.016 0950-7051/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

JID: KNOSYS 2

ARTICLE IN PRESS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

since it affects the number of frequent itemsets found by FIM algorithms. If the threshold is set too high, few frequent itemsets will be found, and thus many interesting itemsets will be missed. But if the threshold is set too low, a huge number of frequent itemsets will be obtained, which is both time and space consuming, and users may find it difficult to analyze a large amount of frequent itemsets. To find an appropriate threshold value, a user may thus need to run a FIM algorithm several times. To solve this problem, a new research direction was proposed named top-k FIM [6,19–23]. In top-k FIM, instead of specifying a minimum threshold value, the user must specify a parameter k indicating the number of itemsets to be found. A top-k FIM algorithm then returns the k itemsets having the highest supports. A second important limitation of FIM is that it assumes that all items have the same importance (e.g., unit profit or weight) and that items may not appear more than once in each transaction. But these assumptions often do not hold in real applications. For example, items in a transactional database may have different unit profits, and items may have non binary purchase quantities in transactions. Besides, in real-life, retailers are often more interested in finding the itemsets that yield a high profit than those bought frequently. Thus, top-k FIM does not satisfy the requirement of users who want to discover itemsets having high utilities (e.g., generating a high profit). To address these important issues, the problem of top-k FIM was recently redefined as the problem of top-k high utility itemset mining. Mining high utility itemsets (HUIs) [24–30] is an important research problem in data mining, which has a wide range of applications. It consists of discovering itemsets having utilities (e.g., profit) no less than a user-specified minimum utility threshold, that is high utility itemsets. The problem of top-k high utility itemset mining consists of finding the k itemsets having the highest utilities in a transactional database. Top-k high utility itemset mining is an important data mining task that is useful in many domains. For example, it can be used to find the k sets of products that are the most profitable when sold together, in retail stores. Top-k high utility itemset mining is harder than top-k FIM, since the downward closure property used in FIM to prune the search space, does not hold in high utility itemset mining. To circumvent this issue, overestimation methods [24,28] such as the transaction weighted utility (TWU) have been proposed. Although several algorithms have been designed for top-k high utility itemset mining, it remains a very expensive task in terms of runtime and memory consumption. It is thus an important research problem to design more efficient algorithms. Despite that top-k high utility itemset mining is inspired by the traditional problem of top-k FIM, the methods used in top-k FIM cannot be directly applied to solve the problem of top-k high utility itemset mining. To design an efficient top-k high utility itemset mining algorithm, additional issues need to be considered such as designing effective search space pruning techniques by considering information about the utilities of itemsets, and also how to raise the internal minimum utility threshold effectively to reduce the number of candidates. To address the need for a more efficient top-k high utility itemset mining algorithm, this paper proposes a novel algorithm, called top-k High utility itemset Mining using Co-occurrence pruning (kHMC). It introduces several novel ideas to discover top-k high utility itemsets efficiently. The contributions of this paper are summarized as follows: 1. An efficient algorithm is proposed for mining the top-k high utility itemsets in transactional databases. The proposed algorithm relies on two efficient strategies for pruning the search space. The first strategy is an improved co-occurrence pruning strategy that eliminates a large number of join operations, and

thus prunes a large part of the search space. This strategy is implemented using a novel co-occurrence pruning structure with threshold (EUCST). The second strategy is named transitive extension pruning (TEP). It reduces the search space using a novel upper-bound on the utilities of itemsets. 2. The proposed algorithm relies on the utility-list structure [31]. To further improve the performance of discovering high utility itemsets using utility-lists, a low complexity utility-list construction procedure is proposed. Moreover, the proposed algorithm utilizes a strategy for abandoning utility-list construction early. 3. As previously mentioned, a key challenge in top-k high utility itemset mining is how to raise the internal minimum utility threshold effectively while searching for the top-k high utility itemsets, to reduce the search space. To address this challenge, the proposed algorithm utilizes several strategies, named RIU, COV, and CUD, for initializing and dynamically adjusting its internal minimum utility threshold. The COV strategy is based on a novel concept of coverage. The concept of coverage can be employed to prune the search space in high utility itemset mining, or to raise the threshold as introduced in this paper. This article proves that by using these strategies, no top-k HUIs will be missed, and demonstrates that these strategies are effective at raising the internal minimum utility threshold close to the optimal value. 4. Extensive experimental evaluations are conducted on both real and synthetic datasets to evaluate the proposed techniques. Results show that the proposed algorithm is faster and consumes less memory than the state-of-the-art TKU, REPT, and TKO algorithms for top-k high utility itemset mining. The paper is organized as follows. Section 2 briefly reviews related work on (top-k) high utility itemset mining. Section 3 presents preliminaries and formally defines the problem of top-k high utility itemset mining. Section 4 proposes an improved strategy for pruning the search space based on item co-occurrence information. Section 5 presents two techniques for improving the efficiency of utility-list construction. Section 6 introduces a strategy for reducing the search space using a novel concept of transitive extension upper-bound utility. Section 7 proposes the concept of coverage. Section 8 presents the three threshold raising strategies used in the designed algorithm. Then, Section 9 describes the proposed kHMC algorithm, to mine the top-k high utility itemsets, which incorporates all the techniques presented in previous sections. Section 10 presents an extensive experimental evaluation. Finally, Section 11 draws the conclusion and discusses future work. 2. Related work This section briefly reviews studies related to high utility itemset mining and top-k high utility itemset mining. 2.1. High utility itemset mining Several algorithms have been proposed for high utility itemset mining. Two-Phase [26], an Apriori-based algorithm, employs an upper-bound called the Transaction Weighted Utilization (TWU) to restore the downward closure property for mining high utility itemsets. According to the TWU model, an itemset having a TWU lower than the minimum utility threshold is not a high utility itemset, as well as all its supersets. The TWU model allows to reduce the search space. However, a disadvantage of using the TWU model is that the TWU is a loose upper-bound on the utilities of itemsets and thus a huge amount of candidates still need to be considered to discover the high utility itemsets. The IIDS [30]

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

JID: KNOSYS

ARTICLE IN PRESS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

algorithm improves upon Two-Phase by introducing a pruning strategy named Isolated Itemset Discarding Strategy (IIDS). But IIDS still finds HUIs by scanning the database numerous times using a level-wise approach. As a solution to this problem, the IHUP algorithm was proposed based on the FP-Growth algorithm and the TWU model [25]. The Two-Phase, IIDS, and IHUP algorithms, are said to be two-phase algorithms since they discover high utility itemsets in two phases. They first find a set of candidate HUIs in a given database (Phase I). Then, they scan the database to calculate the exact utility of each candidate and keep only those having utilities no less than the minimum utility threshold (Phase II). Despite having introduced new techniques to discover high utility itemsets, these algorithms still suffer from the problem of generating too many candidates due to the use of the TWU model. A solution to the above issue was proposed in the UP-Growth algorithm [24]. This latter reduces the overestimation performed by the TWU model by using four strategies named DGU, DGN, DLU, and DLN. UP-Growth relies on a tree structure called UP-Tree, for mining high utility itemsets. A UP-Tree is constructed by performing two database scans. Then, candidates are generated directly from the tree. But the exact utility of each candidate is calculated by performing an additional database scan. UP-Growth+ [28] is an improved version of UP-Growth that introduces two strategies named DNU and DNN to further decrease the overestimations on utilities. BAHUI [32], PB [33], and MU-Growth [34] are recent two-phase algorithms which provide a slight improvement over Two-Phase or UP-Growth in terms of execution time. However, running these algorithms is still very time consuming. Recently, algorithms have been proposed to mine HUIs in a single phase. HUI-Miner (High Utility Itemset Miner) was proposed by [31]. It introduces a new data structure called utility-list to maintain the information required for calculating the utilities of itemsets and prune the search space. Utility-lists are a vertical structure that is similar to the tid lists used by the Eclat algorithm [35] for mining frequent itemsets. Once HUI-Miner has constructed utility-lists for single items, it can create the utility-list of any itemset without scanning the database, and can derive all HUIs and their exact utilities from these utility-lists. The search space of itemsets is explored by HUI-Miner using a depth-first search rather than a breadth-first search. Furthermore, it avoids repeatedly scanning the database, thanks to the use of utility-lists. Experimental results have shown that HUI-Miner outperforms IHUP [25], UP-Growth [24], and UP-Growth+ [28]. But HUI-Miner still has to perform a costly join operation to construct the utility-list of each itemset considered by its search procedure. To reduce the number of join operations, the FHM [36] algorithm was designed. It integrates a strategy for pruning the search space using information about itemset co-occurrences. It was shown to reduce the number of join operations by up to 95%. 2.2. Top-k high utility itemset mining An important limitation of high utility itemset mining is that in real applications, specifying the minimum utility threshold is not an easy task for users, as they often do not know what threshold value is the best for their requirements [37–39]. But choosing an appropriate threshold value is crucial since it directly influences the number of high utility itemsets found by the algorithms. If the threshold is set too high, few itemsets will be found, and thus many interesting itemsets will be missed. But if the threshold is set too low, a huge number of itemsets will be obtained, which is time and space consuming, and users may find it difficult to analyze a large amount of high utility itemsets. To address this issue, the task of top-k high utility itemset mining was proposed, by combining the concept of top-k pattern mining from FIM with high utility itemset mining. In top-k high utility itemset

3

mining, the user must specify a parameter k, representing the number of patterns to be found, instead of specifying a minimum utility threshold. The task of top-k high utility itemset mining is useful in many domains. For example, it can be used to find the k sets of products that are the most profitable when sold together, in retail stores. Several top-k high utility-itemset mining algorithms have been proposed [37–39]. They typically follow a same general process. A top-k high utility itemset mining algorithm initially sets an internal minimum utility threshold to zero. Then, during the mining process, the algorithm automatically raises the threshold using various strategies, to reduce the search space. But it is important that the algorithm uses appropriate strategies to raise the threshold, to ensure that no top-k high utility itemsets are missed. A top-k high utility itemset mining algorithm searches for itemsets by using a search strategy. When a potential itemset is found, it is inserted into a list of itemsets Rk , ordered by their utilities. The minimum utility threshold is then raised to the kth largest utility value in Rk . Then, itemset(s) in Rk not respecting the minimum utility threshold anymore are removed from Rk . Thereafter, the new threshold value is used to continue the search for more itemsets (by searching for itemsets having utilities no less than the threshold). The algorithm then continues searching for more itemsets until no itemsets are produced by the search strategy. When the algorithm terminates, it has found the top-k high utility itemsets. The final value of the internal minimum utility threshold is called the boundary threshold value. The efficiency of a top-k high utility itemset mining algorithm is not only determined by its data structure(s) and search strategy. But it also largely depends on how fast the algorithm can raise its internal minimum utility threshold to prune the search space. The major challenges in top-k HUI mining are thus to design effective strategies for initializing and raising the internal minimum utility threshold, and effective data structures and strategies for reducing the search space. Several algorithms [37,40,41] have been proposed for top-k high utility itemset mining. The first one is TKU [38], which extends the UP-Growth algorithm [24,28] with several efficient threshold raising strategies for top-k high utility itemset mining. Recently, another algorithm, called Raising threshold with Exact and Precalculated utilities (REPT) [37] was proposed for top-k high utility itemset mining. REPT relies on the UP-Tree [24,28] structure, just like TKU. However, REPT uses different threshold raising strategies. In Phase I, two strategies called Pre-evaluation with Utility Descending order (PUD) and Real Item Utilities (RIU) are used. In Phase 2, two strategies named Raising by Support Descending order (RSD) and Sorting candidates and raising the threshold by Exact and Precalculated utilities of candidates (SEP) are applied. A drawback of REPT is that besides parameter k, users also have to set a value for a novel parameter N, used by the RSD strategy. Choosing a value for N that provides good performance is difficult for users [39]. The strategies proposed in TKU and REPT can raise their internal minimum utility thresholds effectively. But these algorithms still generate a huge amount of candidates because they are two-phase algorithms. They thus repeatedly scan the database to obtain the exact utilities of candidates and identify the actual top-k high utility itemsets. This process is very expensive, especially when working with a huge number of candidates or datasets with long transactions and many items. Recently, a one-phase algorithm named TKO was proposed [39]. It was shown to generally outperform TKU and REPT. This paper proposes a novel algorithm named kHMC to mine the top-k high utility itemsets efficiently. kHMC is a top-k high utility itemset mining algorithm that uses the utility-list structure [31] to extract the top-k high utility itemsets. kHMC is a one-phase algorithm, unlike TKU and REPT. The kHMC algorithm

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS 4

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

TI D T1 T2 T3 T4 T5

Transaction (a,1), (c,1), (d,1) (a,2), (c,6), (e,2), (g,5) (a,1), (b,2), (c,1), (d,6),(e,1),(f,5) (b,4), (c,3), (d,3), (e,1) (b,2), (c,2), (e,1), (g,2)

Item External utility

a 5

b 2

c 1

d 2

e 3

TU 8 27 30 20 11

f 1

g 1

Fig. 1. An example transaction database (top) and the corresponding external utilities of items (bottom).

prunes the search space using two novel strategies called Estimated Utility Co-occurrence Pruning Strategy with Threshold (EUCPT) and Transitive Extension Pruning strategy (TEP). The top-k high utility itemsets are discovered by raising an internal minimum utility threshold, initially set to zero. The key differences between the proposed algorithm and previous algorithms are the following. First, the proposed EUCPT pruning strategy utilizes information about item co-occurrences to prune the search space. Second, the TEP strategy prunes the search space using a novel upper-bound. Third, kHMC also introduces a novel utility-list construction procedure, having a complexity lower than the one used in HUI-Miner and FHM. Fourth, an early abandoning strategy (EA strategy) is used during utility-list construction to stop building a utility-list if it cannot result in a top-k high utility itemset. These novelties considerably reduce runtime and memory consumption for discovering the top-k high utility itemsets. Besides, the kHMC algorithm also relies on three threshold raising strategies. The first one is named raising the threshold by Real Item Utilities (RIU), and is applied during the first database scan. The second and third ones are named raising the threshold based on Co-occurrence with Utility Descending order (CUD) and raising the threshold based on COVerage with utility descending order (COV), and are applied during the second database scan. These strategies are very efficient at raising the internal minimum utility threshold close to the boundary threshold value. Furthermore, the paper introduces a novel concept of coverage. This latter can be employed to prune the search space in high utility itemset mining, or to raise an internal minimum utility threshold in top-k high utility itemset mining as proposed in this paper. 3. Preliminaries and problem definition This section presents preliminaries related to high utility itemset mining, and defines the problem of top-k high utility itemsets mining. Let I = {i1 ; i2 ; ... ; im } be a set of items, such that each item ij ∈ I is associated with a positive number p(ij ), called its external utility, which represents the relative importance (e.g., unit profit) of item ij to the user. An itemset X is a set of l distinct items {i1 ; i2 ; . . . ; il } such that X⊆I. An itemset containing l items is said to be of length l, and to be a l-itemset. In the rest of this paper, for the sake of brevity, each itemset will be denoted by the concatenation of its items. For example, the itemset {x; y; z} will be denoted as xyz. Similarly, the union of two itemsets X and Y will be denoted as XY or X ∪ Y. A transaction Td is a subset of I and has a unique identifier d, called its transaction identifier, or Tid. A transaction database D = {T1 ; T2 ; ... ; Tn } is a set of transactions. Each item ij in a transaction Td (1 ≤ d ≤ n) is associated with a quantity q(i, Td ) called the local utility (e.g., purchase quantity) of item ij in Td . For example, Fig. 1 shows a transaction database D containing five transactions (T1 , T2 , T3 , T4 , and T5 ), which will be used as running example. In this database, the set of items I in D is {a; b; c; d; e;

f; g}. The external utilities of these items are respectively 5, 2, 1, 2, 3, 1, and 1. The itemset ac appears in transactions T1 , T2 , and T3 . The items a, c, e, and g, respectively have internal utilities (e.g., purchase quantities) of 2, 6, 2, and 5, in transaction T2 . Definition 1. The utility of an item i in a transaction Td is defined as u(i, Td ) = q(i, Td ) × p(i ), where q(i, Td ) is the internal utility (purchase quantity) of item i in transaction Td , and p(i) is the external utility (unit profit) of item i. For example, in Fig. 1, u(a, T1 ) = 1 × 5 = 5, u(c, T1 ) = 1 × 1 = 1. Definition 2. The utility of an itemset X in a transaction Td is denoted as u(X, Td ) and defined as u(X, Td ) =  i ∈ X u(i, Td ). For example, in Fig. 1, u(ac, T1 ) = 1 × 5 + 1 × 1 = 6, and u(ac, T2 ) = 2 × 5+ 6 × 1 = 16. Definition 3. The utility of an itemset X in a database D is de fined as u(X ) = X⊆T ∧T ∈D u(X, Td ). For example, in Fig. 1, u(ac ) = d d u(ac, T1 ) + u(ac, T2 ) + u(ac, T3 ) = 6 + 16 + 6 = 28. Definition 4. The utility of a transaction Td is denoted as TU(Td ), and is calculated as u(Td , Td ). For example, in Fig. 1, T U (T1 ) = 8, T U (T2 ) = 27. Definition 5. Let  be any total order on items from I, and X be an itemset. An extension of X is an itemset Xy that is obtained by appending an item y to X, such that yi, ∀i ∈ X. For example, in Fig. 1, assume that items in transactions are ordered by the lexicographical order. The itemset acd is an extension of the itemset ac. Given a user-specified minimum utility threshold min_util, if the utility of an itemset X is no less than min_util, then X is said to be a high utility itemset. Otherwise, X is said to be a low utility itemset. The problem of high utility itemset mining is to discover all high utility itemsets. The problem of top-k high utility itemset mining is to discover the k itemsets having the highest utilities, where k is a parameter set by the user. In FIM, the downward closure property is used to prune the search space. But in high utility itemset mining, the utility measure does not follow this property (the utility is neither monotonic nor anti-monotonic). To restore the downward closure property in utility mining, the transactionweighted downward closure property (TWDC) was proposed [26]. Definition 6. The transaction-weighted utilization (TWU) of an itemset X in a database D is denoted as TWU(X) and defined as  TWU(X)= T ∈D∧X⊆T T U (Td ). d

d

Property 1 (Overestimation [26]). The TWU of an itemset X is no less than its utility, that is TWU(X) ≥ u(X). Property 2 (Transaction-weighted downward closure property [26]). Let there be two itemsets X and Y. If Y⊆X, then TWU(Y) ≥ TWU(X). Thus, the TWU measure is downward close (anti-monotonic). Most HUI mining algorithms use the TWU values of itemsets as an

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

Item Name TWU

a 65

TID TU

1 8

b 61 2 27

c 96 3 30

d 58 4 20

e 88

f 30

5

g 38

5 11

Fig. 2. The transaction utilities and TWU values for the database of Fig. 1.

overestimation on their utilities and the utilities of their supersets, to prune the search space. Example 1. Consider the database of Fig. 1. The utility of itemset c in transaction T1 is u(c, T1 ) = 1 × 1 = 1. The utility of itemset cd in T1 is u(cd, T1 ) = u(c, T1 ) + u(d, T1 ) = 1 × 1 + 1 × 2 = 1 + 2 = 3. Because the itemset cd appears in transactions T1 , T3 , and T4 , its utility is calculated as u(cd ) = u(c ) + u(d ) = u(c, T1 ) +u(c, T3 )+ u(c, T4 )+ u(d, T1 ) + u(d, T3 ) + u(d, T4 ) = 1 + 1 + 3 + 2 + 12 + 6 = 25. The utility of transaction T1 is T U (T1 ) = u(a, T1 ) + u(c, T1 ) + u(d, T1 ) = 5 + 1 + 2 = 8. The transaction-weighted utilization of itemset f is T W U ( f ) = T U (T3 ) = 5 + 4 + 1 + 12 + 3 + 5 = 30. Fig. 2 shows the TU values of all transactions in D, and the TWU values of all items. Suppose that the parameter k is set to 3. The set of top-k high utility itemsets for the running example is {bcde:40, bde:36, bcd:34}, where the number beside each itemset indicates its utility. For k = 3, the boundary threshold value is 34. 4. Effective search space pruning using item co-occurrence information A key challenge in HUI mining is to design effective search space pruning techniques to discover HUIs efficiently. This section proposes an improved search space pruning strategy for top-k high utility itemset mining, which relies on information about item cooccurrences to reduce the search space. This proposed strategy is named Estimated Utility Co-occurrence Pruning Strategy with Threshold (EUCPT). It is an improvement of the EUCP strategy introduced in the FHM algorithm [36]. The following paragraphs first describe the EUCP strategy, and then present the proposed EUCPT strategy. 4.1. The estimated utility co-occurrence pruning strategy (EUCP) The EUCP [36] pruning strategy relies on a structure called the Estimated Utility Co-occurrence Structure (EUCS), which is defined as follows: Definition 7. The Estimated Utility Co-occurrence Structure of a database D is denoted as EUCSD and defined as EUCSD = {(a; b; T W U (ab)) ∈ I∗ × I∗ × R+ }, where I∗ is the set of all items having a TWU no less than min_util in D. The EUCS has been defined for the problem of HUI mining, where the user has to specify a min_util threshold. But in top-k HUI mining, no such threshold is defined by the user. Instead, topk HUI mining algorithms rely on an internal minimal utility threshold that is initially set to zero, and increased during the search for itemsets using threshold raising strategies, as it will be explained in the next section. As other efficient algorithms for HUI mining, the proposed algorithm only scans the database twice to compute the TU and TWU values of all items and 2-itemsets, to construct the EUCS structure. The first database scan calculates the TWU of each single item, as well as the utility of each single item. The TWU values of items are used to establish a total order on items appearing in the transactional database, which is the order of ascending TWU values. The TWU values of items will be used by

the threshold raising strategies presented in Section 8. The second database scan constructs the EUCS. During this scan, items in transaction are reordered using the ascending order of TWU values. For instance, consider the database D of Fig. 1. Items are reordered by ascending order of TWU values, and the EUCS is implemented as a triangular matrix, as shown in Fig. 3. The EUCP strategy employs the following property to reduce the search space: Property 3 (Pruning). Let there be an itemset X. If T W U (X ) < min_util, then for any extension Y of X, u(XY ) < min_util. Proof. By Property 2, TWU(XY) ≤ TWU(X). Because T W U (X ) < min_util, it follows that T W U (XY ) < min_util. Furthermore, u(XY) ≤ TWU(XY) by Property 1. Therefore u(XY ) < min_util.  Property 3 is employed by the EUCP strategy to directly prune an itemset Pxy and its transitive extensions. If there is a tuple (x; y; TWU(xy)) in the EUCS such that TWU(xy) < min_util, then any supersets of xy such as Pxy, and its transitive extensions, have utilities lower than min_util. Thus, Pxy and its transitive extensions do not need to be explored by the search process. The EUCP strategy is efficient because it can reduce the number of join operations performed by utility-list based algorithms, and thus the number of candidates that they generate. It was shown that the number of candidates generated can be pruned by up to 95% using this strategy [36]. Hence, a considerable performance gain can be achieved using this strategy. 4.2. The estimated utility co-occurrence pruning strategy with threshold (EUCPT) A simple optimization of the EUCS was mentioned in the authors’ implementation. By implementing the EUCS using hashmaps instead of a triangular matrix, the EUCS only stores tuples such that TWU = 0. As a result, fewer items are kept in memory, and this implementation is thus more memory efficient than the original implementation. This paper introduces an improvement of the EUCS, named EUCST, which keeps fewer items in memory than the EUCS. The design of the EUCST is based on the observation that only tuples having TWU values greater than min_util are used by the EUCP strategy to prune the search space. Thus, the EUCST only stores tuples for itemsets that have a TWU greater than min_util. As a result, less tuples are stored in the proposed structure, and it is more memory efficient. Moreover, because the EUCST is smaller, the time spent for searching the structure, when using the structure to prune join operations, is also reduced. The proposed pruning strategy, called EUCPT, utilizes the designed EUCST structure to avoid join operations. For example, consider that min_util is set to 31. Using the proposed EUCST, all tuples having a TWU less than 31 will be eliminated, as shown in Fig. 4(a). In this figure, all tuples colored in brown have a TWU that is less than 31, and are thus removed. The resulting EUCST keeps less tuples than the corresponding EUCS, as shown in Fig. 4(b). The construction of the EUCST is performed during the second database scan. For each transaction Td , each pair of items in Td is

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS 6

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

f 30

Item TWU TID T1 T2 T3 T4 T5

g 38

Ordered Transaction (d,1), (a,1), (c,1) (g,5), (a,2), (e,2), (c,6) (f,5), (d,6), (b,2), (a,1), (e,1), (c,1) (d,3), (b,4), (e,1), (c,3) (g,2), (b,2), (e,1), (c,2)

d 58

b a 61 65 Item g d b a e c

e 88 f 0 30 30 30 30 30

c 96 g

d

b

a

e

0 11 27 38 38

50 38 50 58

30 61 61

57 65

88

Fig. 3. Items are reordered by ascending order of TWU values (top). Transactions are sorted according to this order (bottom-left). The EUCS matrix (bottom-right).

Item g d b a e c

f 0 30 30 30 30 30

g

d

b

0 11 27 38 38

50 38 50 58

30 61 61

a

e

57 65

88

Item g g d b a e 38 c 38

d

b

a

e

50 38 50 58

61 61

57 65

88

Fig. 4. The (a) EUCS and (b) EUCST structures.

Item f g d b a e c

g

d

b

a

e

8 8

8

Item f g d b a e c

g

d

27 27 27

8 8

b

a

e

27 35

27

Fig. 5. The construction process of the EUCST

considered. If there is no tuples for that pair of items in the EUCST, a tuple is inserted in the EUCST, where the TWU value of the pair of items is set to the transaction utility of Td , that is TU(Td ). Otherwise, the TWU value of the tuple in the EUCST is increased by TU(Td ). For example, consider the database of Fig. 1. At first, the transaction T1 is processed, and it is found that TU(T1 ) = 8. Items in that transaction are d, a, and c. The algorithm combines each item with each other to create the tuples {(d, a, 8), (d, c, 8), (a, c, 8)}, which are inserted in the EUCST (see Fig. 5, left). The transaction T2 is then processed, and it is found that TU(T2 ) = 27. T2 is composed of four items g, a, e, and c. The tuples that are constructed using these four items are: {(g, a, 27), (g, e, 27), (g, c, 27), (a, e, 27), (a, c, 27), (e, c, 27)}. Because the pair (a, c) already exists in the EUCST with a TWU value of 8, this value is updated to 8 + 27 = 35. The others pairs do not exist in the EUCST. They are thus inserted into the EUCST without updating any tuples. The resulting EUCST after processing T1 and T2 is shown in Fig. 5, right. The same process is repeated for the next transactions. The result is the EUCS presented in Fig. 4(a). Then, all tuples having a TWU less than min_util (colored in brown in Fig. 4(a)) are removed from the EUCS, and the EUCST shown in Fig. 4(b) is obtained. The EUCST is constructed quickly and occupies a small amount of memory. It is thus an efficient data structure for mining high utility itemsets. 5. Efficient utility-list construction In utility-list based algorithms such as HUI-Miner, FHM, and the proposed algorithm, a key operation is the construction of utility-

lists. The next subsection thus proposes an improved utility-list construction procedure that has a lower complexity than the one used in previous algorithms. Moreover, the following subsection presents an early abandoning strategy to avoid completely constructing utility-lists when specific conditions are met. These optimizations greatly reduce the runtime and memory consumption of the proposed algorithm.

5.1. An efficient utility-lists construction procedure The proposed kHMC algorithm utilizes the utility-list structure proposed in the HUI-Miner algorithm [31] to mine high utility itemsets directly, using a single phase. The utility-list structure is used to maintain information about the utilities of itemsets. The utility of an itemset can be quickly calculated by joining utilitylists of smaller itemsets. Utility-lists are briefly introduced thereafter by the following definitions and properties: Definition 8 (Utility-list). Let ≺ be any total order on items from I. The utility-list of an itemset X in a database is denoted as UL(X), and contains a set of tuples having the form (tid; iutil; rutil), such   that UL(X ) = X∈T (tid; u(X, Ttid ), i∈X∧x≺i u(i, Ttid )). tid

Definition 9 (Remaining utility [31]). The set of items appearing after an itemset X in a transaction T is defined as {i|i ∈ T ∧ ∃ j ∈ X, i ≺ j}. The remaining utility of an itemset X in a transaction T is denoted as RU(X, T) and defined as the sum of the utilities of all items appearing after X in T.

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

7

Algorithm 1 Construct utility-list

Algorithm 2 iConstruct Algorithm

Input: UL(P ), the utility-list of itemset P ; UL(P x), the utility-list of itemset P x; UL(P y), the utility-list of itemset P y Output: UL(P xy), the utility-list P xy.

Input: UL(P ), the utility-list of itemset P; UL(P x), the utility-list of itemset Px; UL(P y), the utility-list of itemset Py Output: UL(P xy), the utility-list Pxy.

1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13:

of

itemset

UL(P xy) = NULL; for each (tuple ex ∈ UL(P x)) do if (∃ey ∈ UL(P y) and ex.tid = ey.tid) then if (UL(P ) is not empty) then Search element e ∈ UL(P ) such that e.tid = ex.tid; exy ←− (ex.tid; ex.iutil + ey.iutil - e.iutil; ey.rutil); else exy ←− (ex.tid; ex.iutil + ey.iutil; ey.rutil); end if U L(P xy) ←− U L(P xy) ∪ exy; end if end for return UL(P xy);

1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13:

Property 4 (Sum of iutil values). Let there be an itemset X. X is a  high utility itemset if ul∈UL (X ) ul.iut il ≥ min_ut il. Otherwise, it is a low utility itemset. Property 5 (Sum of iutil and rutil values [31]). Let X be an itemset. If the sum of iutil and rutil values in the utility-list of X is less than min_util, all extensions of X and their transitive extensions are low utility itemsets. As proposed in the HUI-Miner algorithm, the utility-list of an itemset can be constructed by intersecting the utility-lists of some of its subsets. For example, let P, Px, and Py be itemsets, such that Px and Py are extensions of P with items x and y, respectively. The utility-list of Pxy is obtained by applying Algorithm 1. This procedure is the same in FHM [36] and HUI-Miner [31]. The implementation of this procedure in the FHM algorithm can be obtained as part of the SPMF data mining library [42]. The main idea of this procedure is the following. Two utility-lists UL(x) and UL(y) are intersected as follows. For each element in UL(x), a binary search is performed to check whether the element exists in UL(y) or not. Therefore, the complexity of the procedure is O(mlogn), where m and n are |UL(x)| and |UL(y)|, respectively. Because all Tids in a utility-list are ordered, the algorithm for identifying transactions that are common to both utility-lists by comparing the Tids in the two utility-lists can be improved. This paper introduces an intersection procedure for constructing utility-lists having a complexity of O(m + n ) < O(mlogn ). The pseudo-code is presented in Algorithm 2. Note that if UL(P) is not empty then the Tid list in UL(Pxy) is always a subset of the Tid list in UL(P). In this case, the complexity of the intersection procedure is only O(|UL(P)|).

14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27:

of

itemset

UL(P xy) = NULL; Let i, j = 0; Let total = UL(P x).sumIutils + UL(P y).sumIutils + UL(P x).sumRutils + UL(P y).sumRutils; //for the EA strategy while (i < UL(P x).size and j < UL(P y).size) do if ( UL(P x)[i].tid = UL(P y)[j].tid) then U L(P xy) ←− (U L(P x)[i].tid; U L(P x)[i].iutil + UL(P y)[j].iutil; UL(P y)[j].rutil); i = i + 1; j = j + 1; else if (UL(P x)[i].tid < UL(P y)[j].tid) then total = total - (UL(P x)[i].iutil + UL(P x)[i].rutil); i = i + 1; else total = total - (UL(P y)[j].iutil + UL(P y)[j].rutil); j = j + 1; end if if (total < min_util) then return null; //EA strategy end if end while if (UL(P ) = ∅ and UL(P xy) = ∅) then Let i = j = 0; while (i < UL(P xy).size and j < UL(P ).size) do if (UL(P xy )[i].tid = UL(P )[j].tid) then UL(P xy )[i++].iutil -= UL(P )[j].iutil;//Subtract utilities of tuples existing in prefix UL(P ) end if j = j + 1; end while end if return UL(P xy);

Property 6 (LA property). Given two itemsets X and Y, if the value   ∀T ∈D U (X, Ti ) + RU (X, Ti )− ∀T ∈D,X⊆T ∧Y T U (X, T j ) + RU (X, T j ) < i

j

min_util, then ∀X ⊇X∧Y ⊇Y, X Y ∈ HUIs.

j

j

Proof. The proof of this property can be found in [43]. For the convenience of the reader, this proof is provided thereafter. We have

min_util >





U (X, Ti ) + RU (X, Ti ) −

∀Ti ∈D



+RU (X, T j ) = 

+

U (X, T j )

∀T j ∈D,X⊆T j ∧Y T j

U (X, Ti ) + RU (X, Ti )

∀Ti ∈D,X⊆Ti ∧Y ⊆Ti

U (X, Ti ) + RU (X, Ti )

∀Ti ∈D,X⊆Ti ∧Y Ti

Furthermore, an additional strategy called EA (Early Abandoning) is integrated in the proposed kHMC algorithm, to obtain better performance for utility-list construction. This strategy consists of abandoning the construction of a utility-list if a specific condition is met. The kHMC algorithm, just like HUI-Miner, utilizes Property 4 to determine if an itemset X is a high utility itemset (an itemset X  is a high utility itemset if ul∈UL (X ) ul.iut il ≥ min_ut il). Moreover, kHMC utilizes the following property [43]:





5.2. A strategy for abandoning utility-list construction early

U (X, T j ) + RU (X, T j )

∀T j ∈D,X⊆T j ∧Y T j



=

U (X, Ti ) + RU (X, Ti )

∀Ti ∈D,X⊆Ti ∧Y ⊆Ti

On the other hand, we also have

U (X Y  ) =

 X  Y  ⊆Ti ∈D

⇒ U (X Y  ) ≤

U (X Y  , Ti ) ≤ 



U (X Y, Ti ) + RU (X Y, Ti )

X  Y  ⊆Ti ∈D

U (X Y, Ti ) + RU (X Y, Ti )

∀Ti ∈D,X⊆Ti ∧Y ⊆Ti

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS 8

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17



⇒ U ( X Y  ) ≤

U (X, Ti ) + RU (X, Ti ) < min_util

∀Ti ∈D,X⊆Ti ∧Y ⊆Ti

Therefore, X Y ∈ HUIs.



Thus, for any itemset X, extensions of X will be explored if (1)  the pruning condition of Property 5 is satisfied ( ul∈UL(X ) (ul.iutil + ul .rutil ) ≥ min_util), and (2) Property 6 is not satisfied. The proposed algorithm adopts the LA property in its EA strategy. The EA strategy is applied as follows. In the utility-list construction process, if the condition of Property 6 is satisfied, it is concluded that extensions of X should not be explored, as they will be low utility itemsets. Thus, the utility-list construction process is immediately stopped. This saves the time and memory that would be used for creating and storing the utility-list. This process is detailed in Algorithm 2. At first, the variable total is set to the sum of iutil and rutil values in the utility-lists of Px and Py (Line 2). Then, the value of total is decreased for each tuple in UL(Px) that has no corresponding tuple in UL(Py), or vice versa, by the sum of iutil and rutil values in the current tuple (Line 8, 11). Lines (14-16) check if total falls below min_util. If yes, the construction of the utility-list of Pxy is stopped and extensions of Pxy will not be explored. This stopping criterion is employed by the proposed algorithm during the construction of all utility-lists for itemsets of size greater than one. Thus, this strategy reduces the runtime and memory consumption of the proposed algorithm.

 ul ∈ UL(y) ul.rutil and the third term u(y) are precalculated in advance for each item y. The proposed transitive extension upperbound utility is used for pruning the search space based on the following properties and lemma. Property 7. (Overestimation using the transitive extension upperbound utility) Let be an itemset X and an item yi, ∀i ∈ X. The relationship teu(X, y) ≥ u(Xy) holds. Property 8. Let there be an itemset X and an item y, such that yi, ∀i ∈ X. The value teu(X, y) is no less than the sum of iutil and ru til values in the utility-list of Xy, i.e., teu(X, y ) ≥ ul∈UL(Xy ) (ul.iutil + ul .rutil ). Proof. We have:

Definition 10. (Transitive extension upper-bound utility) Let be an itemset X and an item yi, ∀i ∈ X. The transitive extension upper-bound utility of itemset X with respect to item y is de noted as teu(X, y) and is defined as teu(X, y ) = ul∈UL(X ) ul.iutil +  ul∈UL (y ) ul.rutil + u (y ). This upper-bound can be calculated very efficiently. In the proposed kHMC algorithm, utility-lists of single items are created during the initial database scans. Thus, the second term

ul .iutil +

ul∈UL(X )

ul .rutil + u(y ) 

u(X, Td ) +

X⊆Td ∧Td ∈D

u(X, Td ) +

X⊆Td ∧y∈Td ∧Td ∈D

+





RU (y, Td ) +





u(y, Td )

u(Xy, Td ) +

Xy⊆Td ∧Td ∈D









RU (y, Td )

y∈Td ∧Td ∈D

u(Xy, Td ) +

Xy⊆Td ∧Td ∈D

=

RU (y, Td )

y∈Td ∧Td ∈D

y∈Td ∧XTd ∧Td ∈D





u(X, Td ) +

X⊆Td ∧y∈ / Td ∧Td ∈D



u(y, Td )

[u(X, Td ) + u(y, Td )]



+

u(X, Td )

y∈Td ∧Td ∈D

X⊆Td ∧y∈Td ∧Td ∈D

+



X⊆Td ∧y∈ / Td ∧Td ∈D

y∈Td ∧Td ∈D

=

RU (y, Td ) + u(y )

y∈Td ∧Td ∈D



=

 ul∈UL(y )



=

6. Transitive extension pruning strategy (TEP) Liu et al. [31] proposed a pruning strategy based on Property 5 to reduce the search space using the sum of iutil and rutil values in the utility-list of an itemset Y. If this sum is less than min_util, then the itemset Y and its transitive extensions are low utility itemsets. This strategy is efficient at pruning the search space and is thus integrated in the proposed kHMC algorithm. However, a drawback of this pruning strategy is that it can only be applied to an itemset Y after the utility-list of Y has been fully constructed. But constructing a utility-list is an expensive operation. It is thus desirable to design pruning strategies that can prune an itemset Y and its transitive extensions before the utility-list of Y is constructed. The designed EUCPT strategy, presented in Section 4, is one such strategy. It relies on information about item co-occurrences to reduce the search space. In this section, we propose a novel strategy for pruning an itemset Y and its transitive extensions before its utility-list is constructed, by adapting Property 5. This strategy is called Transitive Extension Pruning (TEP). Recall that kHMC mines high utility itemsets using a depth-first search. For any itemset X, kHMC generates larger itemsets by recursively appending items one at a time to X according to the  order. Consider an itemset Y that is an extension of an itemset X, formed by appending an item y to X, that is Y = Xy. The TEP strategy is used to prune the itemset Y and its transitive extensions, when the search procedure is considering itemset X (thus before the utility-list of Y is constructed). This novel strategy introduced in kHMC relies on a new upper-bound on the utility of itemsets and their extensions, named the transitive extension upper-bound utility. This upper-bound is defined as follows.



teu(X, y ) =



RU (y, Td )

Xy⊆Td ∧Td ∈D

(ul .iutil + ul .rutil )

ul∈UL(Xy )

 Lemma 1. (Pruning the search space using the transitive extension upper-bound utility) Let there be an itemset X and an item yi, ∀i ∈ X. If teu(X, y ) < min_util, then the single item extension Xy and its transitive extensions are low utility.  Proof. Since teu(X, y ) ≥ ul∈UL(Xy ) (ul .iutil + ul .rutil ) (by Property  8), and teu(X, y ) < min_util, it follows that ul∈UL (Xy ) (ul.iutil + ul .rutil ) < min_util. Therefore, by Property 5, the itemset Xy and its transitive extensions are low utility.  The above lemma is used by the TEP strategy to prune the search space. For a given itemset X, if teu(X, y) is less than min_util, then Xy and all its transitive extensions are low utility itemsets, and can thus be pruned. The TEP strategy can greatly reduce the number of extensions to be explored for mining high utility itemsets using utility-lists. As it will be shown in the experimental evaluation of this paper, if the proposed kHMC algorithm applies the TEP strategy after applying the EUCPT strategy, kHMC can prune up to 21% more candidates from the search space. 7. The concept of coverage This section introduces a novel concept called coverage, for mining high utility itemsets. In this paper, it is used in the proposed kHMC algorithm in the context of top-k high utility itemset mining but it could also be used in other high utility itemset mining algorithms. It is defined as follows.

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

Definition 11 (Tidset of an itemset [44]). The Tidset of an itemset X is denoted as g(X) and defined as the set of transaction identifiers (Tids) of transactions containing X. Thus, g(X ) = {tid |X ⊆Ttid ∧Ttid ∈ D}. Definition 12 (Coverage). Let i and j be two single items (i, j ∈ I). The item j is said to cover i if g(i)⊆g(j). The coverage of the item i is denoted as Ç(i) and defined as Ç(i ) = { j ∈ I, g(i ) ⊆ g( j )}. Property 9. Let i, j be two single items such that j ∈ Ç(i). The utility of the itemset ij is greater than the utility of itemset i, that is u(i j ) ≥ u(i ) + |g(i )| × p( j ). Proof. If j ∈ Ç(i), then g(i)⊆g(j). Hence, g(i j ) = g(i ) ∩ g( j ) = g(i ).



u (i j ) =



u(i j, Td ) =

=



u(i, Td ) + u( j, Td ) =

Td ∈g(i j )

= u (i ) +



Definition 13 (Coverage itemset). Let there be an itemset X. X is said to be a coverage itemset if and only if X = pp1 p2 ...pr , pl ∈ Ç(p)(1 ≤ l ≤ r). Lemma 3 (Utility of a coverage itemset). Let there be a coverage itemset X = pp1 p2 ...pr . The utility of X can be calculated using the following formula:  u(X ) = 1≤i≤r u( ppi ) − (r − 1 ) × u( p) Proof. Because pi ∈ Ç(p), we have that g(p)⊆g(pi ). Therefore, g( ppi ) = g( p), and g( pp1 p2 ...pr ) = g( p).



= 

u(i, Td ) +

Td ∈g(i )



q( j, Td ) × p( j ) ≥ u(i ) +

Td ∈g(i )



u(X ) = u( pp1 p2 ...pr ) =

u(i j, Td )

Td ∈g(i j )

i j⊆Td ∧Td ∈D

9



u( pp1 p2 ...pr , Td )

Td ∈g( p)

[u( p, Td ) + u( p1 , Td ) + ... + u( pr , Td )]

Td ∈g( p)

=

u( j, Td )



[[(u( p, Td ) + u( p1 , Td )) + ... + (u( p, Td ) + u( pr , Td ))]

Td ∈g( p)

Td ∈g(i )

−(r − 1 ) × u( p, Td )]  = [[(u( p, Td ) + u( p1 , Td )) + ... + (u( p, Td ) + u( pr , Td ))]

p( j )

Td ∈g(i )

Td ∈g( p)

≥ u ( i ) + | g ( i ) | × p( j )

− (r − 1 ) ×

 Example 2. In the example database shown in Fig. 1, item d appears in transactions T1 , T3 , and T4 . Thus, g(d) = {1, 3, 4}. Since the item c appears in all transactions, g(c) = {1, 2, 3, 4, 5}. Furthermore, since g(d)⊆g(c), we have that item c ∈ Ç(d).

=

 



u( p, Td )]

Td ∈g( p)

[u( p, Td ) + u( pi , Td )] − (r − 1 ) × u( p)

1≤i≤r Td ∈g( p)

=



u( ppi ) − (r − 1 ) × u( p)

1≤i≤r

Lemma 2. (Relationship between TWU and coverage) Let i, j be two single items in a database D (i, j ∈ I). We have that j ∈ Ç(i) if and only if T W U (i ) = T W U (i j ). Proof. Let g(xy ) denote the set of tids of transactions containing x, but not y. The TWU of the itemset ij is defined by the following formula.

T W U (i j ) =



=





T U (Td ) +

Td ∈g(i j )

=

= T W U (i ) −

T U (Td ) −

Td ∈g(i j )



T U (Td ) −

Td ∈g(i )

T U (Td )

Td ∈g(i j )

Td ∈D∧i j⊆Td





T U (Td ) =



T U (Td )

Td ∈g(i j )

T U (Td )

Td ∈g(i j )



T U (Td )

Td ∈g(i j )

Therefore,

T W U (i j ) = T W U (i )  ⇔ T U (Td ) = 0 ⇔ g(i j ) = ∅ Td ∈g(i j )

⇔ g( i ) ⊆ g( j ) ⇔ j ∈ Ç ( i )  The above lemma provides a simple way of calculating the coverage of any single item i ∈ I using the TWU of single items, and the EUCST structure, which have been presented in the previous section. Example 3. Consider the database of Fig. 1. Fig. 3 gives the TWU values of all single items in that database, and illustrates the corresponding EUCST. Consider the item g. We have that T W U (g) = 38. In the EUCST, there are two tuples containing g, which are (g, c, 38) and (g, e, 38). The TWU values stored in these tuples are equal to T W U (g) = 38. Hence, the coverage of item g is Ç(g) = {c, e}.

 Example 4. Consider the database of Fig. 1. The coverage of item g is Ç(g) = {e, c}, the utility of coverage itemset gec can be calculated as u(gec ) = u(ge ) + u(gc ) − u(g) = 16 + 15 − 7 = 24. Using the above lemma, the utility of any coverage itemset X can be obtained directly by adding the utilities of 2-itemsets and subtracting the utility of a 1-itemset. Thus, the utility of X can be obtained without performing utility-list intersections or additional database scans. This calculation only requires to precalculate the coverage of each item, and the utilities of 1-itemsets and 2itemsets. This lemma can be applied to effectively prune the search space for the problem of top-k high utility itemset mining, and could be applied in other problems. In this work, this lemma is introduced to be used in a novel and efficient threshold raising strategy. Lemma 4 (Utility of a superset using the coverage). Let there be an item p, and Ç(p) be the coverage of p. Furthermore, let there be an itemset Y ⊂ Ç(p), and an item pi ∈ Ç(p) such that pi ∈ Y. We have that u( p ∪ Y ∪ pi ) = u( p ∪ Y ) + u( ppi ) − u( p). Proof. Assume that the itemset Y is of the form Y = p1 p2 ...pr . Hence, p ∪ Y ∪ pi = pp1 p2 ...pr pi . By Lemma 3, we have that:

u( p ∪ Y ∪ pi ) = u( pp1 p2 ...pr pi )  = u( pp j ) + u( ppi ) − r × u( p) 1≤ j≤r

=(



u( pp j ) − (r − 1 ) × u( p)) + u( ppi ) − u( p)

1≤ j≤r

= u( p ∪ Y ) + u( ppi ) − u( p)  Property 10. Let there be a single item p, and Ç(p) be its coverage. Furthermore, consider an itemset Y⊆ Ç(p), and an item pi ∈ Ç(p) such that pi ∈ Y. We have that u(p ∪ Y) < u(p ∪ Y ∪ pi ).

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS 10

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

Proof. By Lemma 4, we have u( p ∪ Y ∪ pi ) − u( p ∪ Y ) = u( ppi ) − u( p). According to Property 9, u( ppi ) − u( p) ≥ |g( p)| × p( pi ) > 0.  Property 11. Let there be a single item p, and Ç(p) be its coverage. Furthermore, consider an itemset Y⊆ Ç(p), and an itemset X such that X ∩ Y = ∅ and p ∈ / X. If p ∪ X is a high utility itemset, then p ∪ X ∪ Y is also a high utility itemset. Proof. Because Y is a subset of Ç(p), it follows that g( p ∪ X ∪ Y ) = g( p ∪ X ). The utility of the itemset p ∪ X ∪ Y is calculated as:

u( p ∪ X ∪ Y ) =



u( p ∪ X ∪ Y, Td )

Td ∈D

=



[u( p ∪ X, Td ) + u(Y, Td )]

Td ∈g( p∪X )

=



u( p ∪ X, Td ) +

Td ∈g( p∪X )

= u( p ∪ X ) +





u(Y, Td )

Td ∈g( p∪X )

u(Y, Td ) > u( p ∪ X )

Td ∈g( p∪X )

If p ∪ X is a high utility itemset, then u( p ∪ X ) ≥ min_util. Thus, u( p ∪ X ∪ Y ) > u( p ∪ X ) ≥ min_util. Therefore, p ∪ X ∪ Y is a high utility itemset.  Note that a coverage itemset is a closed itemset as defined in FIM. Closed itemsets are a subset of the set of all itemsets. Mining closed itemsets generally reduces the execution time in FIM. Several algorithms were proposed to mine closed itemsets [12]. In this paper, the concept of coverage is proposed to mine the top-k high utility itemsets. Novel definitions and properties are introduced in this paper related to the concept of coverage, which are specific to high utility itemset mining. In particular, the next section introduces a novel threshold raising strategy relying on the concept of coverage, to mine the top-k high utility itemsets. 8. Three effective threshold raising strategies The proposed kHMC algorithm is designed to mine the top-k high utility itemsets. It initially sets an internal min_util threshold to zero, and employs three effective threshold raising strategies to raise the threshold to higher values. The next subsections present these three strategies. 8.1. The real item utilities threshold raising strategy The first threshold raising strategy is named RIU [37]. It is applied immediately after the first database scan. The utilities of all items are calculated during the first database scan. The utility of an item i in a database D is denoted as riu(i), and is calculated by  the formula T ∈D u(i, Td ). The RIU strategy utilizes the utilities of d items to raise the value of the min_util threshold, as follows. Let I={i1 ; i2 ; . . . ; im } be the set of items in database D, and R = {r iu1 , r iu2 , ..., rium } be the list of utilities of items in I, ordered by descending TWU values. Let riuk denote the k-th largest value in R. The RIU strategy consists of raising the min_util threshold to the value riuk (1 ≤ k ≤ m). This new value will then be used as min_util threshold by the mining process, until the threshold is increased to a larger value by another threshold raising strategy. Example 5. Consider the database of Fig. 1, and that k = 3. After the first database scan, the utilities of items in D have been calculated (see Fig. 6). The third largest value in the list R is 16. Therefore, the value of min_util is increased to 16. This new min_util value will then be used by the mining process. A second database scan is then done using the new min_util value, to construct the EUCST.

Item Utility

a 20

b 16

c 13

d 20

e 15

f 5

g 7

Fig. 6. Utilities of items in database D.

8.2. The co-occurrence threshold raising strategy The second threshold raising strategy is named CUD (Cooccurrence Utility Descending order). It is a novel strategy that is applied after the RIU strategy and the second database scan. CUD increases min_util using the utilities of 2-itemsets stored in the EUCST. As mentioned previously, the EUCST structure employed by the proposed algorithm is very efficient for pruning the search space. The EUCST is used to avoid performing join operations for constructing utility lists. It is designed to store and quickly retrieve the TWU values of pairs of items. Each tuple in the EUCST contains a pair of items having a TWU no less than min_util, that may thus be a high utility itemset. Hence, the utility of the 2-itemsets stored in the EUCST can be considered for raising the threshold. The utility of each entry in the EUCST can be calculated easily during the construction of the EUCST, while performing the second database scan. The benefits of the CUD strategy are that: (i) it uses the same structure as the main pruning strategy, (ii) it is easy to calculate the utility of pairs of items during the second database scan, (iii) it is efficient to store this information in terms of memory and runtime using hash maps, and (iv) the utilities of 2-itemsets help raising the value of the min_util threshold closer to the boundary to obtain the top-k high utility itemsets. The structure for storing the utilities of pairs of items is called the CUD utility Matrix (CUDM). It is built at the same time as the EUCST structure. When the algorithm reads a transaction Td , the utility of each 2-itemset xy appearing in the transaction is increased in the CUDM by the value u(x, Td ) + u(y, Td ). Lemma 5. Let HD (k) denote the top-k high utility itemsets in a database D. Let CUDk be the k-th largest utility in the CUDM. We have that u(X) ≥ CUDk , ∀X ∈ HD (k). Proof. Let CUDMk = {X ∈ CUDM, u(X) ≥ CUDk }. Assume that ∃Y ∈ HD (k), u(Y) < CUDk . We have that ∀X ∈ CUDMk , u(X) ≥ CUDk > u(Y). Therefore, CUDMk ⊆HD (k). Furthermore, Y ∈ HD (k). This is in contradiction with the statement that HD (k) is the set of top-k high utility itemsets. Thus, ࢜Y ∈ HD (k), u(Y) < CUDk , which means that u(X) ≥ CUDk , ∀X ∈ HD (k).  This lemma is used in the proposed CUD threshold raising strategy. After the utilities of all pairs of items have been calculated by constructing the CUDM, the min_util threshold is raised to the k-th highest value in the CUDM. Example 6. Consider the example database depicted in Fig. 1, and that k = 3. The CUD strategy first reads the transaction T1 , which contains three items: d, a, and c. Thus, three 2-itemsets are stored in the CUDM, i.e., da, dc, and ac, having the following utilities: u(da, T1 ) = 7, u(dc, T1 ) = 3, and u(ac, T1 ) = 6. Next, the transaction T2 is processed. The 2-itemsets in T2 are: ga, ge, gc, ae, ac, and ec, having the following utilities: 15, 11, 11, 16, 16, and 12. Because the itemset ac is already in the CUDM with a value of 6, its utility in the CUDM is updated to 6 + 16 = 22. All the other transactions are processed in the same way. The result is shown in Table 1. The third largest value in the CUDM is 27. Therefore, the CUD strategy raises min_util to 27. 8.3. The coverage threshold raising strategy An important observation made in this paper is that the coverage relation between single items can be used to raise the min_util

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17 Table 1 The utilities in the CUDM. Item

f

g

d

g d b a e c

17 9 10 8 6

6 15 16 15

30 24 24 25

Algorithm 4 kHMC Algorithm b

a

e

Input: a transaction database D and the value k Output: the top-k high utility sets 1:

9 25 22

2: 24 28

27

3: 4:

threshold using the properties proposed in the previous section. This subsection presents such a threshold raising strategy, based on the concept of coverage. This strategy is named COV. It relies on a structure named COVerage List (COVL), which is a list of utility values. The construction of the COVL is done as follows. Initially, all values stored in the CUDM are inserted in the COVL. Then, for each single item i ∈ I, the COV strategy inserts the combinations of i with all subsets of its coverage Ç(i) in the COVL. The construction of the COVL is completed when all items have been processed. The detailed algorithm for constructing the coverage list is given in Algorithm 3. Let COVk denote the k-th highest value in the COVL. Algorithm 3 CoverageContruct Procedure Output: The COVL. 1: 2: 3: 4: 5: 6: 7: 8:

11

for each (item x ∈ I) do I (x ).COV ←− ∅; for (each item y ∈ I and y  x) do if (EUCST (x, y ) = T W U (x )) then I (x ).COV ←− I (x ).COV ∪ y; end if end for end for

5: 6: 7: 8: 9: 10: 11: 12:

Proof. This lemma can be easily proven in a similar way to the proof of Lemma 5 for the CUD strategy, and presented in Section 8.2. 

Global variables: min_util ←− 0; Rtopk ←− ∅ ; Scan D to calculate the TWU and real item utility (RIU) of each item; Let I be the list of single items sorted by ascending order of TWU; min_util ←− the k-th highest utility value in RIU; //RIU strategy Scan D to build the utility-list of each item i ∈ I, the EUCST structure, and the CUDM structure; cud ←− the k-th highest utility value in the CUDM; if (min_util < cud) then min_util ←− cud; //CUD strategy CoverageContruct(); Raise min_util to COVk ; //COV strategy Remove all entries lower than min_util in the EUCST ; Search (∅, I, EUCST); Output the k itemsets having the highest utilities in Rtopk

Procedure Search() Input : P: an itemset, ExtensionsOfP: a set of extensions of P, and the EUCST structure Output : a set of top-k high utility itemset candidates 1: 2: 3: 4: 5: 6: 7: 8: 9:

Lemma 6. Let HD (k) denote the utilities of the top-k high utility itemsets in a database D. Let COVk be the k-th highest utility value in COVL. We have that u(X) ≥ COVk , ∀X ∈ HD (k).

item-

10: 11: 12: 13: 14: 15: 16:

The proposed COV strategy consists of raising the min_util threshold to COVk .

18:

9. The proposed kHMC algorithm

20:

This section presents the proposed kHMC algorithm, for mining the top-k high utility itemsets. kHMC employs the efficient EUCST structure to prune the search space using the EUCPT strategy. This strategy can avoid numerous join operations for constructing utility-lists. It was shown that this strategy can eliminate up to 95% of the candidates [36]. Moreover, the designed TEP strategy is also applied for reducing the search space. In addition, threshold raising strategies RIU, CUD, and COV are used to raise the min_util threshold. The threshold is also dynamically raised by the kHMC algorithm as the utilities of itemsets in the search space are calculated. In addition, the proposed procedure that constructs utility-lists with a complexity of O(m + n ) and an early abandoning strategy are introduced in the proposed algorithm (described in Section 5). The main procedure of kHMC is shown in Algorithm 4. kHMC first scans the database to obtain the TWU and utilities (RIU) of single items (Line 2). The algorithm then sorts the list of single items in I by ascending order of TWU (Line 3), and raises min_util

22:

17: 19: 21: 23: 24: 25:

for (each itemset P x ∈ ExtensionsO f P ) do if (SU M (U L(P x ).iutils ) ≥ min_util ) then Rtopk ←− Rtopk ∪ P x; if (Rtopk .size ≥ k) then min_util ← the k-th highest utility in Rtopk ; keep only the k itemsets having the highest utilities in Rtopk ; end if end if if (SU M (U L(P x ).iutils ) + SU M (U L(P x ).rutils ) ≥ min_util ) then ExtensionsO f P x ←− ∅; for (each itemset P y ∈ ExtensionsO f P and y  x) do if (EUCST (x, y ) ≥ min_util) then P xy ←− P x ∪ P y; if (teu(P x, y ) < min_util) then continue; end if UL(P xy ) ←− iConstruct(P, P x, P y ); if (UL(P xy ) = NULL) then E xtensionsO f P x ←− E xtensionsO f P x ∪ P xy; end if end if end for Search (P x, ExtensionsO f P x, EUCST ); end if end for

using the RIU strategy (Line 4). kHMC then scans the database again to build the EUCST and the CUDM (Line 5). The threshold raising strategy CUD is then applied (Lines 6-7). The coverage relation is then constructed by applying the CoverageContruct procedure (Line 8). The threshold raising strategy COV is then applied (Line 9). The EUCST is finally created using the current min_util value (Line 10). Then, the Search procedure is applied to search for high utility candidates, to be inserted into Rtopk (Line 11), and the real top-k high utility itemsets from Rtopk are output (Line 12). The main steps of the Search procedure are the following. For each itemset Px in ExtensionsOfP having a utility no less than min_util, Px is added to the set of candidates Rtopk . Then, min_util is raised

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS 12

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17 Table 2 Details of the datasets. Dataset

#Trans

#Items

#Avg

Type

Accidents Mushroom Retail Chess Connect Kosarak

340,183 8124 88,162 3196 67,557 990,002

468 119 16,470 75 129 41,270

33.8 23.0 10.3 37 43.0 8.1

Dense Dense Sparse Dense Dense Large

to the k-th highest utility in Rtopk , and each itemset having a utility lower than that value is removed from Rtopk (Lines 2–8). If this condition is satisfied (Line 9), extensions of Px will be considered by the search procedure. Px will be combined with each other itemset Py in ExtensionsOfP such that yx to form larger itemsets, if there is a tuple in the EUCST for x and y such that this tuple’s utility value is no less than min_util (Lines 12-13) and the extension upper-bound utility of itemset X by extension Xy is no less than min_util (Lines 14-15). The utility-list of the resulting itemset Pxy is constructed by applying the iConstruct procedure (Line 17). If the utility-list is not empty (Line 18), then Pxy and its extensions will be considered by the search procedure (Line 19). Then, the Search procedure is recursively called with Px. The procedure stops if no extensions can be formed. By the proposed lemmas, it can be easily seen that the algorithm is correct and complete for mining the top-k high utility itemsets. 10. Experimental evaluation This section presents an extensive experimental evaluation to evaluate the performance of the proposed kHMC algorithm, including its pruning strategies and threshold raising strategies. The performance of kHMC is compared to the state-of-the-art REPT [37] and TKO [39] algorithms. REPT and TKO were recently proposed for mining top-k high utility itemsets, and they were shown to be more efficient than other state-of-the-art algorithms such as TKU [38], UP-Growth [24], and UP-Growth+ [28]. Experiments were performed on a computer equipped with a 64 bit Core i3, 2.4 GHz Intel Processor and 4 GB of RAM, running the Windows 7 operating system. All of the algorithms were implemented in Java, and both real and synthetic datasets were used in the experiments. The characteristics of the real datasets used in the experiments are shown in Table 2, where #Trans, #Items, and #Avg indicate the number of transactions, the number of distinct items, and the average transaction length, respectively. Accidents, Mushroom, Retail, Chess, Connect, and Kosarak were obtained from the FIMI Repository [45]. These datasets are real datasets with synthetic utility values. The internal utility values have been generated using a uniform distribution in the [1, 10] interval, and the external utility values have been generated using a Gaussian (normal) distribution [42], as in previous work [24,28]. 10.1. Evaluation of the EUCST structure This section presents experimental results related to the proposed EUCST structure. For each dataset, the algorithm was applied with various values of k. The number of item pairs in the EUCST and the memory required to store these pairs were measured after applying the various threshold raising strategies. Table 3–8 show the pair count and the maximum memory usage for each dataset, after applying the RIU strategy, and after applying the CUD and COV strategies. When the parameter k is increased, the internal minimum utility threshold is raised to lower values, and the item pair count and memory usage are slightly larger. It can be seen that the memory usage and pair count for all

Table 3 Pair count and memory usage of the EUCST after applying the RIU, CUD, and COV threshold raising strategies on Accidents. RIU

CUD and COV

k

Memory size (MB)

Pair count

Memory size (MB)

Pair count

1 5 10 50 100 500 10 0 0

0.9 1.08 1.46 2.96 4.32 8.01 8.01

5268 6340 8534 17,341 25,276 46,815 46,815

0.29 0.34 0.41 0.65 0.85 1.52 1.81

1330 1552 1791 2411 2864 4963 6850

Table 4 Pair count and memory usage of the EUCST after applying the RIU, CUD, and COV threshold raising strategies on Mushroom. RIU

CUD and COV

k

Memory size (MB)

Pair count

Memory size (MB)

Pair count

1 5 10 50 100 500 10 0 0

0.31 0.4 0.42 0.56 0.61 0.61 0.61

1789 2312 2432 3242 3527 3527 3527

0.14 0.15 0.16 0.22 0.24 0.36 0.43

702 737 763 1109 1204 1930 2357

Table 5 Pair count and memory usage of the EUCST after applying the RIU, CUD, and COV threshold raising strategies on Retail. RIU

CUD and COV

k

Memory size (MB)

Pair count

Memory size (MB)

Pair count

1 5 10 50 100 500 10 0 0

0.03 1.76 46.13 226.7 344.61 503.27 551.74

171 10,113 266,503 1,309,201 1,989,460 2,904,279 3,183,355

0.0054 0.18 4.92 24.67 37.74 57.14 64.94

3 17 114 2306 4403 17,978 34,022

Table 6 Pair count and memory usage of the EUCST after applying the RIU, CUD, and COV threshold raising strategies on Chess. RIU

CUD and COV

k

Memory size (MB)

Pair count

Memory size (MB)

Pair count

1 5 10 50 100 500 10 0 0

0.31 0.39 0.39 0.42 0.44 0.44 0.44

1791 2276 2276 2445 2582 2582 2582

0.24 0.26 0.27 0.28 0.3 0.36 0.38

1368 1424 1475 1550 1652 2025 2182

Table 7 Pair count and memory usage of the EUCST after applying the RIU, CUD, and COV threshold raising strategies on Connect. RIU

CUD and COV

k

Memory size (MB)

Pair count

Memory size (MB)

Pair count

1 5 10 50 100 500 10 0 0

0.57 0.69 0.75 1.01 1.14 1.17 1.17

3309 4052 4401 5917 6659 6827 6827

0.34 0.38 0.41 0.46 0.5 0.61 0.71

1832 2050 2200 2403 2569 3247 3886

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17 Table 8 Pair count and memory usage of the EUCST after applying the RIU, CUD, and COV threshold raising strategies on Kosarak. RIU

Table 10 Threshold values obtained on Mushroom.

CUD and COV

k

Memory size (MB)

Pair count

Memory size (MB)

Pair count

1 5 10 50 100 500

0.0343 103.65 792.85 2,061.98 2,715.41 3,927.17

210 666,208 5,098,733 13,261,732 17,464,485 27,187,241

0.0 0 047 0.0859 0.712 57.29 119.50 384.15

2 488 4301 366,867 766,746 2,468,287

Table 9 Threshold values obtained on Accidents.

13

REPT(N = 100)

Algorithm

kHMC

k

RIU

CUD

COV

PUD

RIU

RSD

1 5 10 50 100 500 10 0 0

88,312 53,631 40,998 7949 410

129,885 113,298 93,606 68,770 51,340 19,139 8993

160,823 151,424 151,298 141,789 132,567 122,777 113,495

88,979 66,980 50,286 23,638 11,510 193

88,312 53,631 40,998 7949 410

129,885 113,298 93,606 68,770 51,340 19,070 6401

Table 11 Threshold values obtained on Retail. REPT(N = 200)

Algorithm

kHMC

Algorithm

kHMC

k

RIU

CUD

COV

PUD

RIU

RSD

k

RIU

CUD

COV

PUD

RIU

RSD

1 5 10 50 100 500 10 0 0

5589,192 3325,842 1905,782 428,860 142,277

9320,084 7853,805 6969,478 4836,703 3730,529 1326,872 695,739

9320,084 7853,805 6969,478 4836,703 3730,529 1326,872 695,739

7653,891 5681,678 4823,295 1388,712 649,572 12,964 642

5589,192 3325,842 1905,782 428,860 142,277

9320,084 7853,805 6969,478 4836,703 3730,529 1326,872 695,739

1 5 10 50 100 500 10 0 0

278,199 82,271 20,846 8080 5724 2052 1274

320,051 99,222 50,168 13,017 9136 3986 2547

320,051 99,222 50,168 13,017 9136 3986 2547

111,343 34,495 26,733 5109 3374 1158 728

278,199 82,271 20,846 8080 5724 2052 1274

320,051 5,981 15,259 8551 6803 3140 1811

datasets is quite low. From these results, it can be inferred that the EUCST after raising the threshold using the CUD and COV strategies is much more efficient than the EUCS, as it requires less memory and stores less item pairs. Thus, using the designed threshold strategies RIU, CUD, and COV, the EUCST contains fewer item pairs and consumes less memory. This considerably improves the overall performance of the proposed algorithm.

Table 12 Threshold values obtained on Chess.

10.2. Evaluation of the threshold raising strategies The designed kHMC algorithm includes three effective threshold raising strategies named RIU, CUD, and COV. RIU is also used in the REPT algorithm [37], while CUD and COV are two novel strategies introduced in this paper. To evaluate the effectiveness of the threshold raising strategies, an experiment was performed where the RIU, CUD, and COV strategies were compared with the PUD and RSD strategies used by the REPT algorithm. The strategies were applied on each dataset, while varying the parameter k, and the threshold values obtained by each threshold raising strategy were measured. In this experiment, obtaining a higher threshold value is better. Note that to be able to see the influence of each threshold raising strategy, each strategy is run independently in this experiment, unlike in [37] where the threshold values obtained by a strategy depends on prior strategies. Thus, in this experiment, the k-th value obtained by each strategy is displayed. Besides, note that for this experiment, the value of N required by the RSD strategy is respectively set to 200, 100, 1000, 30, 50, and 2000, for the Accidents, Mushroom, Retail, Chess, Connect, and Kosarak datasets as suggested in [37]. Table 9–14 show the threshold values obtained by applying each threshold raising strategy. It can be observed that as the parameter k is increased, the threshold can be raised to larger values. Furthermore, it can be seen that the CUD and COV strategies raise the threshold to very high values. The more the threshold is raised, the more the search space is reduced, and as a result less itemsets need to be considered to find the top-k high utility itemsets. This directly influences the runtime and memory consumption of the algorithms. Besides, it can be observed that the RSD and PUD strategies are special cases of the CUD strategy. The CUD strategy covers all cases of the RSD and PUD strategies. This is the reason why the CUD strategy can raise the threshold higher than

REPT(N = 10 0 0)

REPT(N=30)

Algorithm

kHMC

k

RIU

CUD

COV

PUD

RIU

RSD

1 5 10 50 100 500 10 0 0

52,038 35,290 32,676 6623

91,500 81,247 77,676 61,813 50,759 24,384 12,814

91,500 81,247 77,676 91,197 86,024 75,264 69,714

71,367 58,467 49,314 20,512 6653

52,038 35,290 32,676 6623

91,500 81,247 75,017 50,902 31,380

Table 13 Threshold values obtained on Connect. REPT(N=50)

Algorithm

kHMC

k

RIU

CUD

COV

PUD

RIU

RSD

1 5 10 50 100 500 10 0 0

742,554 737,272 674,164 109,341 7156

1483,020 1474,578 1472,118 1314,648 1075,593 599,088 278,367

1483,020 1474,578 1472,118 1314,648 1075,593 800,558 421,406

717,790 716,340 672,756 295,545 176,082 6942

742,554 737,272 674,164 109,341 7156

1483,020 1474,578 1472,118 1314,648 1070,541 5941

the RSD and PUD strategies. Applying the CUD strategy is simple and inexpensive in terms of runtime and memory, as described in Section 8.2. The three datasets Mushroom, Connect, and Chess are dense datasets. Thus, items in these datasets have large coverage sets. For this reason, the COV strategy can raise the threshold higher than the CUD strategy. Figs. 7–10 compare results of the CUD, COV, PUD, and RSD threshold raising strategies on the Chess, Connect, Retail, and Mushroom datasets. It can be observed that the CUD and COV strategies raise the threshold higher than the PUD and RSD strategies, and by more than 100 times, in some cases. 10.3. Evaluation of the TEP strategy This section presents experimental results related to the evaluation of the proposed TEP pruning strategy. To evaluate the effec-

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS 14

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

Table 14 Threshold values obtained on Kosarak. REPT(N=20 0 0)

Algorithm

kHMC

k

RIU

CUD

COV

PUD

RIU

RSD

1 5 10 50 100 500

6617,618 972,156 396,823 94,648 53,707 16,149

8912,841 2023,210 1290,328 352,209 231,239 87,496

8912,841 2023,210 1290,328 352,209 231,239 87,496

8519,820 1535,670 635,846 185,291 117,103 35,981

6617,618 972,156 396,823 94,648 53,707 16,149

8912,841 2023,210 1290,328 185,291 231,239 87,496 Fig. 10. Threshold values obtained on Mushroom. Table 15 Number of candidates generated by kHMC with/without the TEP strategy. Accidents

Fig. 7. Threshold values obtained on Chess.

Mushroom

Retail

k

kHMCNoTep

kHMCTep

kHMCNoTep

kHMCTep

kHMCNoTep

kHMCTep

10 50 100 200 500 700 10 0 0

10,228 13,451 18,440 22,957 29,919 32,861 38,013

8470 11,388 16,024 20,265 26,933 29,789 34,796

3643 4982 5743 6188 90,74 10,552 11,620

2876 4099 4870 5281 7960 9378 10,524

13 2667 159,264 1136,541 4588,796 6588,211 8963,480

13 2461 158,899 1135,946 4587,425 6586,878 8961,775

Fig. 8. Threshold values obtained on Connect. Fig. 11. Runtimes on Accidents.

Fig. 9. Threshold values obtained on Retail.

tiveness of the TEP strategy, an experiment was performed on the Accidents, Retail, and Mushroom datasets, while varying the parameter k, and measuring the number of candidates generated by applying the TEP strategy. Two versions of kHMC named kHMCTep and kHMCNoTep were prepared, where the TEP strategy is respectively activated and deactivated. Table 15 compares the number of candidates generated by kHMCTep and kHMCNoTep on the Accidents, Retail, and Mushroom datasets. It can be observed that kHMCTep considers up to 21% less candidate than kHMCNoTep . 10.4. Evaluation of the efficiency of kHMC This section compares the efficiency of the proposed kHMC algorithm with REPT [37] and TKO [39], for the Accidents, Retail, Mushroom, and Kosarak datasets. These datasets have been selected because they include dense, sparse, and large datasets, and

Fig. 12. Memory used for candidates on Accidents.

thus represent the main types of datasets used in real-life applications. To discover the top-k high utility itemsets, REPT scans the database three times. In Phase I, REPT performs two database scans to generate a set of candidates. In Phase II, REPT scans the database again to calculate the exact utility of each candidate. This operation is very expensive. In this experiment, REPT is executed with various values for the parameter N of the RSD strategy. The notation REP T (N = x ) represents REPT with N = x. The proposed kHMC algorithm scans the database twice. The utility of an itemset is calculated directly by adding the utilities in its utility-list. In TKO, authors integrate the strategies RUC, RUZ, and EPB [39] for pruning the search space.

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

ARTICLE IN PRESS

JID: KNOSYS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

Fig. 13. Number of potential candidates on Accidents.

Fig. 18. Memory used for candidates on Mushroom.

Fig. 14. Runtimes on Retail.

Fig. 19. Number of potential candidates on Mushroom.

Fig. 15. Memory used for candidates on Retail.

Fig. 16. Number of potential candidates on Retail.

Fig. 17. Runtimes on Mushroom.

15

Fig. 20. Runtimes on Kosarak.

Fig. 21. Memory used for candidates on Kosarak.

Figs. 11–22 show the runtime, the memory used to store potential candidates, and the number of candidates visited in the search space, for each algorithm and dataset. Detailed results for the Accidents dataset are shown in Figs. 11–13. It can be observed that when k is small (1 ≤ k ≤ 10), the runtime and memory consumption are similar for all algorithms, and that for small values of k (k= 1), the REPT algorithm is slightly slower than kHMC and TKO. The reason is that kHMC has to apply the CUD and COV threshold raising strategies, and this takes slightly more time than the PUD and RSD strategies, but the difference is very small. When k is set to larger values (50 ≤ k ≤ 1,0 0 0), the runtime and memory consumption of REPT are much larger than kHMC. In particular, when

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

JID: KNOSYS 16

ARTICLE IN PRESS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

Fig. 23. Synthetic datasets characteristics.

Fig. 22. Number of potential candidates on on Kosarak.

k is set to values greater than 100, REPT fails to provide a result (it is considered that an algorithm fails in this experiment if it takes more than 10,0 0 0 seconds to terminate). It can also be observed that kHMC outperforms TKO. Figs. 14 –16 display results for the Retail dataset. It can be seen that kHMC consumes less memory than REPT. In terms of runtime, kHMC and TKO have similar performance. They perform slightly better than REPT because they generate a small number of candidates. For N = 10,0 0 0, REPT spends more time for enumerating 2-itemsets consisting of promising items from each transaction by using the RSD strategy than with N = 1,0 0 0, so REP T (N = 10, 0 0 0 ) is the worse. Detailed results for the Mushroom dataset are shown in Figs. 17–19. Results for Mushroom are similar to the results for the Accidents dataset. When k is set to small values, the performance of REPT is similar to kHMC, but when k is set to larger values, the difference becomes clear and large. This is explained as follows. For dense datasets such as Mushroom and Accidents, when k is set to small values, the threshold is raised to high values, and as a result, the number of candidates is small. Therefore, algorithms are fast and require a small amount of memory. But when k is set to large values, the threshold cannot be raised as high. In this case, the REPT algorithm generates a huge amount of candidates (millions), while kHMC employs co-occurrence pruning and TEP pruning, and as a result the number of candidates generated by kHMC remains small. Moreover, because REPT is a two phase algorithm, it has to scan the database to get the utilities of all (millions of) candidates. This process is very expensive in terms of runtime. It becomes especially inefficient for large values of k and N. On the other hand, in kHMC, the utilities of itemsets are directly and easily calculated using utility-lists. The performance of TKO is good and similar to kHMC for this dataset. Figs. 20–22 display results for the large Kosarak dataset. When k is set to small values (1 ≤ k ≤ 50), the threshold can be raised high and the number of candidates is small for both kHMC and REPT. On overall, the runtime of kHMC is less than the runtime of REPT for this dataset. When k is set to large values (k > 50), the number of candidates generated by kHMC grows fast (more than a million) because the number of items in Kosarak is huge (approximately 40,0 0 0 items), and hence the threshold cannot be raised to high values. But the runtime of kHMC is still less than the runtime of REPT. This result is explained as follows. Although the number of candidates generated by REPT is less than the number of candidates generated by kHMC, REPT needs to scan the database to calculate the exact utilities of all candidates in Phase II. This is very expensive since Korasak has a huge amount of transactions and a large number of items. Another reason is that kHMC is a one phase algorithm. Thus, it obtains the utilities of itemsets by performing utility-list intersections rather than scanning the database. Therefore, REPT takes more time than kHMC to mine the top-k high utility itemsets. In contrast, TKO cannot be run on Kosarak since it runs out of memory because of its EPB strategy, which requires

Fig. 24. Scalability evaluation of the algorithms.

to maintain multiple possible branches of the search space into memory. The EPB strategy explores the most promising branches of the search space first, to find itemsets having a high utility early. Hence, results show that the number of candidates generated by TKO is slightly less than kHMC but TKO requires additional time to sort and maintain the set of promising branches. Furthermore, the proposed method for intersecting utility-lists has a lower complexity than the one used in TKO. For these reasons, the runtime of kHMC is less than TKO. Besides, experimental results show that the performance of REPT can greatly vary for different values of N. The selection of N has a major influence on the RSD strategy in REPT, especially on large datasets. However, it may be difficult for users to select appropriate values for the parameter N of REPT. This drawback of REPT was also observed by Tseng et al. [39]. In conclusion, the previously described experiments show that kHMC outperforms REPT and TKO. 10.5. Evaluation of the Scalability Finally, the scalability of the algorithms is compared under different parameter settings. In these experiments, k is fixed to 5,0 0 0. Synthetic datasets have been generated using the IBM data generator [39] and the database size is varied from 100K to 500K transactions. Fig. 23 shows the characteristics of the synthetic datasets, where the number of transactions x is varied from 100K to 500K transactions. For this experiment, the parameter N of REPT is set to 10 0 0. Fig. 24 shows the runtime of the algorithms with respect to the database size. Note that both REPT and TKO failed to run on synthetic datasets (it is considered that an algorithm fails if it terminates in more than 10,0 0 0 seconds or run out of memory). Results show that the proposed algorithm has good scalability under different parameter settings. 11. Conclusion This paper proposed an efficient algorithm named kHMC for top-k high utility itemset mining. The kHMC algorithm introduces several novel ideas. First, it includes an improved pruning strategy named EUCPT to reduce the number of join operations and prune the search space. Second, a novel and efficient pruning strategy named TEP was proposed for reducing the search space by pruning transitive extensions. Third, a novel intersection procedure for constructing a utility-list in O(m + n ) time complexity was introduced.

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016

JID: KNOSYS

ARTICLE IN PRESS

[m5G;April 21, 2016;22:14]

Q.-H. Duong et al. / Knowledge-Based Systems 000 (2016) 1–17

Fourth, an early abandoning strategy applied during the construction of utility-lists was also presented. These techniques not only help to find the top-k high utility itemsets effectively, but they also reduce the runtime and memory consumption of the algorithm considerably. But major challenges in top-k high utility itemset mining are also to design effective strategies for initializing and raising the internal minimum utility threshold during the mining process. Thus, in the proposed kHMC algorithm, three additional strategies are employed to raise the threshold close to the optimal boundary threshold value. In particular, a new threshold raising strategy based on a novel concept of coverage was proposed. The concept of coverage was shown to be useful for top-k high utility itemset mining and may be used in other problems related to high utility pattern mining. An extensive experimental evaluation was conducted to confirm the effectiveness and the high efficiency of the proposed algorithm for finding the top-k high utility itemsets. Although the proposed algorithm was shown to outperform previous algorithms in terms of both runtime and memory, there is still room for improvement, and there are several opportunities for future work. In particular, the concept of coverage is employed in the new COV threshold raising strategy. The coverage should be studied and applied to prune the search space in other high utility pattern mining problems such as closed and maximal high utility itemset mining. This is an interesting possibility for future work. Acknowledgements This work was partially funded by the National Natural Science Foundation of China (Grant Nos. 11171369, 61272395, 61370171, 61300128, 61572178). References [1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, VLDB (1994) 487–499. [2] T. Calders, B. Goethals, Mining all non-derivable frequent itemsets, in: Principles of Data Mining and Knowledge Discovery. 6th European Conference, PKDD 2002. Proceedings, 2002. [3] J. Han, H. Cheng, D. Xin, X. Yan, Frequent pattern mining: Current status and future directions, Data Mining Knowl. Discov. 15 (1) (2007) 55–86. [4] J.W. Han, J. Pei, Y.W. Yin, Mining frequent patterns without candidate generation, SIGMOD Rec. 29 (2) (20 0 0) 1–12. [5] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, SIGMOD Rec. 22 (2) (1993).207–16 [6] Y.L. Cheung, A.W.C. Fu, Mining frequent itemsets without support threshold: With and without item constraints, IEEE Trans. Knowl. Data Eng. 16 (9) (2004) 1052–1069. [7] J. Han, J. Pei, Y. Yin, R. Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining Knowl. Discov. 8 (1) (2004) 53–87. [8] M.J. Zaki, K. Gouda, Fast vertical mining using diffsets, in: Proceedings of the 2003 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003. [9] G. Liu, H. Lu, W. Lou, Y. Xu, J. Yu, Efficient mining of frequent patterns using ascending frequency ordered prefix-tree, Data Mining Knowl. Discov. 9 (2) (2004) 249–274. [10] T.-L. Dam, K. Li, P. Fournier-Viger, Q.-H. Duong, An efficient algorithm for mining top-rank-k frequent patterns, Appl. Intell. (2016) 1–16. [11] G. Grahne, J.F. Zhu, Fast algorithms for frequent item-set mining using fp-trees, IEEE Trans. Knowl. Data Eng. 17 (10) (2005) 1347–1362. [12] J.W. Han, J.Y. Wang, Y. Lu, P. Tzvetkov, Mining top-k frequent closed patterns without minimum support, in: 2002 IEEE International Conference on Data Mining, Proceedings, 2002. [13] P. Jian, H. Jiawei, L. Hongjun, S. Nishio, T. Shiwei, Y. Dongqing, H-mine: Fast and space-preserving frequent pattern mining in large databases, IIE Transactions 39 (6) (2007) 593–605. [14] U. Yun, G. Lee, K.H. Ryu, Mining maximal frequent patterns by considering weight conditions over data streams, Knowl. Based Syst. 55 (2014) 49–65.

17

[15] J.C.-W. Lin, W. Gan, P. Fournier-Viger, T.-P. Hong, V.S. Tseng, Efficient algorithms for mining high-utility itemsets in uncertain databases, Knowl. Based Syst. 96 (2016) 171–187. [16] M.A. Nishi, C.F. Ahmed, M. Samiullah, B.-S. Jeong, Effective periodic pattern mining in time series databases, Expert Syst. Appl. 40 (8) (2013) 3015–3027. [17] W. Yao-Te, A.J.T. Lee, Mining web navigation patterns with a path traversal graph, Expert Syst. Appl. 38 (6) (2011).7112–22 [18] A. Deepak, D. Fernandez-Baca, S. Tirthapura, M.J. Sanderson, M.M. McMahon, Evominer: frequent subtree mining in phylogenetic databases, Knowl. Inf. Syst. 41 (3) (2014) 559–590. [19] A. Salam, M.S.H. Khayal, Mining top-k frequent patterns without minimum support threshold, Knowledge and Information Systems 30 (1) (2012) 57–86. [20] J.Y. Wang, J.W. Han, Y. Lu, P. Tzvetkov, TFP: An efficient algorithm for mining top-k frequent closed itemsets, IEEE Trans. Knowl. Data Eng. 17 (5) (2005) 652–664. [21] S.-C. Ngan, T. Lam, R.C.-W. Wong, A.W.-C. Fu, Mining n-most interesting itemsets without support threshold by the cofi-tree, Int. J. Bus. Intell. Data Mining 1 (1) (2005) 88–106. [22] A.W.-c. Fu, R.W.-w. Kwong, J. Tang, Mining n-most interesting itemsets, in: Foundations of Intelligent Systems, Springer, 2010, pp. 59–67. [23] J.-j. Shi, Y.-q. Miao, Top-k frequent closed item set mining in microarray data, Comput. Eng. 37 (2) (2011).60–2 [24] V.S. Tseng, C.-W. Wu, B.-E. Shie, P.S. Yu, Up-growth: An efficient algorithm for high utility itemset mining, 2010, [25] C.F. Ahmed, S.K. Tanbeer, J. Byeong-Soo, L. Young-Koo, Efficient tree structures for high utility pattern mining in incremental databases, IEEE Trans. Knowl. Data Eng., 21 (12) (2009) 1708–1721. [26] Y. Liu, W.K. Liao, A. Choudhary, A two-phase algorithm for fast discovery of high utility itemsets, in: Lecture Notes in Artificial Intelligence, vol. 3518, 2005, pp. 689–695. [27] H. Yao, H.J. Hamilton, C.J. Butz, A foundational approach to mining itemset utilities from databases, in: Proceedings of the Fourth Siam International Conference on Data Mining, 2004. [28] V.S. Tseng, B.-E. Shie, C.-W. Wu, P.S. Yu, Efficient algorithms for mining high utility itemsets from transactional databases, IEEE Trans. Knowl. Data Eng. 25 (8) (2013) 1772–1786. [29] C.-W. Lin, T.-P. Hong, W.-H. Lu, An effective tree structure for mining high utility itemsets, Expert Syst. Appl. 38 (6) (2011) 7419–7424. [30] Y.-C. Li, J.-S. Yeh, C.-C. Chang, Isolated items discarding strategy for discovering high utility itemsets, Data Knowl. Eng. 64 (1) (2008) 198–217. [31] M. Liu, J. Qu, Mining high utility itemsets without candidate generation, in: 21st ACM International Conference on Information and Knowledge Management, CIKM’12, Maui, HI, USA, October 29 - November 02, 2012, 2012, pp. 55–64. [32] W. Song, Y. Liu, J. Li, Bahui: Fast and memory efficient mining of high utility itemsets based on bitmap, Int. J. Data Warehous. Mining 10 (1) (2014) 1–15. [33] G.-C. Lan, T.-P. Hong, V.S. Tseng, An efficient projection-based indexing approach for mining high utility itemsets, Knowl. Inf. Syst. 38 (1) (2014) 85–107. [34] U. Yun, H. Ryang, K.H. Ryu, High utility itemset mining with techniques for reducing overestimated utilities and pruning candidates, Expert Syst. Appl. 41 (8) (2014) 3861–3878. [35] M.J. Zaki, Scalable algorithms for association mining, IEEE Trans. Knowl. Data Eng. 12 (3) (20 0 0) 372–390. [36] P. Fournier-Viger, C. Wu, S. Zida, V.S. Tseng, FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning, in: Foundations of Intelligent Systems - 21st International Symposium, ISMIS 2014, Roskilde, Denmark, June 25–27, 2014. Proceedings, 2014, pp. 83–92. [37] H. Ryang, U. Yun, Top-k high utility pattern mining with effective threshold raising strategies, Knowl. Based Syst. 76 (2015) 109–126. [38] C. Wu, B. Shie, V.S. Tseng, P.S. Yu, Mining top-k high utility itemsets, in: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, August 12-16, 2012, 2012, pp. 78–86. [39] V.S. Tseng, C.-W. Wu, P. Fournier-Viger, P.S. Yu, Efficient algorithms for mining top-k high utility itemsets, IEEE Trans. Knowl. and Data Eng., 28 (1) (2015) 54–67. [40] C. Raymond, Y. Qiang, S. Yi-Dong, Mining high utility itemsets, in: Third IEEE International Conference on Data Mining, 2003. [41] H. Xiong, M. Brodie, S. Ma, TOP-COP: Mining TOP-K strongly correlated pairs in large databases, IEEE International Conference on Data Mining, pp. 1162–1166. [42] P. Fournier-Viger, A.G. Penalver, T. Gueniche, A. Soltani, C.-W. Wu, V.S. Tseng, SPMF: A java open-source pattern mining library, J. Mach. Learn. Res. (JMLR) (2014) 15:3389–3393. [43] S. Krishnamoorthy, Pruning strategies for mining high utility itemsets, Expert Syst. Appl. 42 (5) (2015) 2371–2381. [44] V.S. Tseng, C. Wu, P. Fournier-Viger, P.S. Yu, Efficient algorithms for mining the concise and lossless representation of high utility itemsets, IEEE Trans. Knowl. Data Eng. 27 (3) (2015) 726–739. [45] Frequent itemset mining implementations repository, 2012, URL http://fimi.cs. helsinki.fi

Please cite this article as: Q.-H. Duong et al., An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies, Knowledge-Based Systems (2016), http://dx.doi.org/10.1016/j.knosys.2016.04.016