Mining of frequent patterns with multiple minimum supports

Engineering Applications of Artificial Intelligence 60 (2017) 83–96



Contents lists available at ScienceDirect

Engineering Applications of Artificial Intelligence
Journal homepage: www.elsevier.com/locate/engappai

Wensheng Gan (a), Jerry Chun-Wei Lin (a,⁎), Philippe Fournier-Viger (b), Han-Chieh Chao (a,c), Justin Zhan (d)

a School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), China
b School of Natural Sciences and Humanities, Harbin Institute of Technology (Shenzhen), China
c Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien County, Taiwan
d Department of Computer Science, University of Nevada, Las Vegas, USA

ARTICLE INFO

Keywords: Frequent patterns; Multiple minimum supports; Sorted downward closure property; Set-enumeration-tree; DiffSet

ABSTRACT

Frequent pattern mining (FPM) is an important topic in data mining for discovering implicit but useful information. Many algorithms have been proposed for this task, but most of them suffer from an important limitation: they rely on a single uniform minimum support threshold as the sole criterion for identifying frequent patterns (FPs). Using a single threshold value to assess the usefulness of all items in a database is inadequate and unfair in real-life applications, since items differ and should not all be treated the same. Several algorithms have been developed for mining FPs with multiple minimum supports, but most of them are time-consuming and require a large amount of memory. In this paper, we address this issue by introducing a novel approach named Frequent Pattern mining with Multiple minimum supports from the Enumeration-tree (FP-ME). In the developed Set-Enumeration-tree with Multiple minimum supports (ME-tree) structure, a new sorted downward closure (SDC) property of FPs and the least minimum support (LMS) concept with multiple minimum supports are used to effectively prune the search space. The proposed FP-ME algorithm can directly discover FPs from the ME-tree without candidate generation. Moreover, an improved algorithm, named FP-MEDiffSet, is developed based on the DiffSet concept to further increase mining performance. Substantial experiments on both real-life and synthetic datasets show that the proposed algorithms not only avoid the "rare item problem", but also efficiently and effectively discover the complete set of FPs in transactional databases while considering multiple minimum supports, and that they outperform the state-of-the-art CFP-growth++ algorithm in terms of execution time, memory usage and scalability.

⁎ Corresponding author.
E-mail addresses: [email protected] (W. Gan), [email protected] (J.C.-W. Lin), [email protected] (P. Fournier-Viger), [email protected] (H.-C. Chao), [email protected] (J. Zhan).
http://dx.doi.org/10.1016/j.engappai.2017.01.009
Received 16 June 2016; Received in revised form 13 December 2016; Accepted 12 January 2017
0952-1976/© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

With the rapid development of sensor technology, knowledge discovery in databases (KDD) has become a powerful tool for finding meaningful and valuable information in massive amounts of data. Frequent pattern mining (FPM) and association rule mining (ARM) (Chen et al., 1996; Han et al., 2004; Lin et al., 2009) are fundamental tasks in data mining and have numerous real-world applications. FPM has been extensively studied in many variants, such as incremental mining of FPs (Hong et al., 2008; Lin et al., 2009), constraint-based FPM (Hong et al., 2009; Pei and Han, 2002; Zaki and Hsiao, 2002), weighted frequent pattern mining (Gan et al., 2016; Lin et al., 2015a, 2015d; Vo et al., 2013), and interesting FPM (Geng and Hamilton, 2006; Lin et al., 2015c, 2016), among others (Grahne and Zhu, 2005; Han et al., 2004; Pei et al., 2001; Schlegel et al., 2011). In general, most of these studies focus on developing efficient algorithms to mine FPs in transactional databases (Chen et al., 1996; Han et al., 2004).

These approaches suffer, however, from an important limitation: they rely on a single minimum support threshold (minsup) as the measure for identifying the set of FPs. Using a single support threshold to assess the occurrence frequency of all items in a database is inadequate, since items differ and should not all be treated the same. In retail business, for example, customers may buy some items with a high frequency but buy other items very rarely. In general, necessities, consumables and low-price products are bought frequently, while luxury goods, electric appliances and high-price products are bought rarely. In such situations, if minsup is set too high, all the discovered patterns concern low-price products, which contribute only a small portion of the profit to the business. Conversely, if minsup is set too low, too many meaningless FPs are generated, and decision makers may be confused and misled into making wrong decisions. Thus, a traditional FPM algorithm may discover many itemsets that are frequent but generate a low profit, and fail to discover itemsets that are rarer but generate a high profit. For example, clothes such as {shirt, tie, trousers, suits} occur much more frequently than {diamond} in a supermarket, although {diamond} also contributes positively to the profit. If minsup is set too high, the rule {shirt, tie → trousers} can be found, but the rule {shirt, tie → diamond} will never be found. To find the second rule, minsup must be set very low, which causes many meaningless rules to be found at the same time.

To address this "rare item problem" in FPM (Liu et al., 1999; Lee et al., 2005), the problem of mining frequent patterns with multiple minimum supports (FP-MMS) has been studied. Liu et al. (1999) first introduced the problem of FPM with multiple minimum supports and proposed the MSApriori algorithm by extending the level-wise Apriori algorithm. The goal of FP-MMS is to discover the set of patterns that are "frequent" for the users, i.e., the frequent patterns (FPs); it allows users to freely set multiple minimum supports, instead of a uniform minimum support, to reflect the different natures and frequencies of items. Up to now, several approaches have been designed for the mining task of FP-MMS, such as MSApriori (Liu et al., 1999), MMS_Cumulate and MMS_Stratify (Tseng and Lin, 2007), CFP-growth (Hu and Chen, 2006), CFP-growth++ (Kiran and Reddy, 2011), and so on. As an enhancement of CFP-growth, the state-of-the-art CFP-growth++ extends the well-known FP-growth approach to mine FPs from a condensed CFP-tree structure (Kiran and Reddy, 2011). However, the mining efficiency of these algorithms remains a major problem: they still suffer from long execution times and high memory usage. It is thus quite challenging and critically important to design an efficient algorithm to solve this problem.

In this paper, a novel mining model named FP-ME, which mines FPs from the Set-enumeration-tree with multiple minimum supports, is designed to address this important research gap. In the designed FP-ME model, each item has its own minimum support threshold instead of a single uniform minimum support threshold for all items. This increases the applicability of FPM in real-life situations, since it allows the user to specify multiple minimum supports that reflect the different natures and frequencies of items. The key contributions of this paper are summarized as follows:

1. In contrast to the Apriori-like and FP-growth-based approaches, we propose a novel Frequent Patterns with Multiple minimum supports from the Set-Enumeration-tree (abbreviated as FP-ME) algorithm to directly extract FPs. It allows mining FPs by considering a different minimum support for each item instead of using a single minimum support threshold.
2. Based on the proposed compact tree structure named Set-Enumeration-tree with Multiple minimum supports (ME-tree), a new sorted downward closure (SDC) property of FPs (w.r.t. the conditional anti-monotonicity of FPs) and the least minimum support (LMS) concept (w.r.t. the global anti-monotonicity of FPs with multiple minimum supports) guarantee the correctness and completeness of the derived results. The FP-ME algorithm can directly discover FPs by spanning the ME-tree, without a candidate generation-and-test approach or multiple time-consuming database scans, which can greatly reduce the running time and memory consumption.
3. The DiffSet concept is further extended to prune the huge number of unpromising patterns early, thus speeding up the mining of FPs. The improved FP-MEDiffSet algorithm can discover the complete set of FPs with only two database scans, which greatly decreases the execution time and memory consumption.
4. Extensive experiments were conducted on both real-life and synthetic datasets to evaluate the performance of the proposed algorithms. The results show that the proposed algorithms can efficiently identify all FPs in a transactional database while considering multiple minimum supports, and can avoid the "rare item problem". Both proposed algorithms significantly outperform the state-of-the-art CFP-growth++ algorithm in terms of execution time, memory usage and scalability. In addition, the improved algorithm considerably outperforms the baseline algorithm.

The rest of this paper is organized as follows. Related work is briefly reviewed in Section 2. Preliminaries and the problem statement of frequent pattern mining with multiple minimum supports (FP-MMS) are presented in Section 3. The proposed baseline FP-ME algorithm and the improved FP-MEDiffSet algorithm are presented in Section 4. An experimental evaluation comparing the performance of the proposed approaches is provided in Section 5. Finally, conclusions are drawn in Section 6.

2. Related work

To confront the "rare item problem" presented above, FPM involving rare items using multiple minimum support thresholds has been studied. Up to now, several algorithms have been proposed, such as MSApriori (Liu et al., 1999), MMS_Cumulate and MMS_Stratify (Tseng and Lin, 2007), CFP-growth (Hu and Chen, 2006), CFP-growth++ (Kiran and Reddy, 2011), REMMAR (Liu et al., 2011) and FQSP-MMS (Huang, 2013). The characteristics of the related algorithms are shown in Table 1. These algorithms allow the user to specify multiple minimum supports (MMS) instead of a single minimum support to reflect the nature of the items and their varied frequencies in the database.

MSApriori (Liu et al., 1999) is the first framework to address the FP-MMS problem. In MSApriori, each item is associated with a specific minimum item support (MIS) value, and each pattern must satisfy a minsup that depends on the MIS values of the items within it. MSApriori extends the well-known Apriori algorithm to mine FPs or association rules (ARs) by considering multiple minimum supports, so that rare ARs can be discovered without generating a large number of meaningless rules (Liu et al., 1999). MSApriori uses the sorted closure property to reduce the search space, but it may easily suffer from combinatorial explosion. An improved tree-based algorithm named Conditional Frequent Pattern-growth (CFP-growth) was then proposed (Hu and Chen, 2006). CFP-growth mines FPs with multiple minimum supports using the pattern-growth method based on a new MIS-Tree structure, and recursively creates a series of conditional trees to generate all desired FPs. Tseng and Lin (2007) proposed two algorithms, MMS_Cumulate and MMS_Stratify, to mine ARs in the presence of taxonomies, which allow any form of user-specified multiple minimum supports. In addition, mining ARs with multiple minimum supports using maximum constraints was introduced with an Apriori-like approach, and it was shown that the number of FPs and ARs derived using maximum constraints is less than the number derived using minimum constraints (Lee et al., 2005). An enhanced CFP-growth++ (Kiran and Reddy, 2011) was then proposed, which employs the LMS (least minimum support) instead of MIN, together with three improved strategies, to reduce the search space and improve performance.

The CFP-growth and CFP-growth++ algorithms, however, need to perform an exhaustive search on the constructed conditional trees to discover the complete set of FPs, which causes performance problems. A key challenge for these two pattern-growth approaches is how to reduce the traversal and construction cost of the series of conditional sub-trees, as well as the total number of conditional sub-trees that need to be constructed for deriving FPs. It is, however, hard to reduce both the traversal cost and the construction cost at the same time. Thus, it is quite challenging and critically important to design a more efficient algorithm to solve the FP-MMS problem.

Recently, mining various types of patterns using multiple minimum supports has been extensively studied, as shown in Table 1. This work includes the relational-based multiple minimum supports association rules (REMMAR) (Liu et al., 2011), a fuzzy mining algorithm for discovering useful fuzzy association rules with MMS by using maximum constraints (Lee et al., 2004), a model for mining fuzzy multiple-level ARs under multiple minimum supports (Lee et al., 2006), and the fuzzy quantitative sequential pattern with multiple minimum supports (FQSP-MMS) algorithm (Huang, 2013). Different from the concept of MMS, an algorithm named GCoMine for discovering correlated patterns using multiple minimum all-confidence thresholds (MMC) has been studied (Rage and Kitsuregawa, 2014). Recently, a new mining model named HUIM-MMS was developed, which discovers high-utility itemsets by considering multiple minimum utility thresholds instead of multiple minimum supports (Lin et al., 2015b). Other algorithms related to FPM with MMS are still under development.

Table 1
Characteristics of frequent pattern mining algorithms with multiple minimum supports.

Algorithm | Apriori-like | Tree-based | Data | Note
MSApriori (Liu et al., 1999) | ✓ | ✕ | Itemset-based data | The first framework to address the FP-MMS problem, based on the well-known Apriori algorithm
CFP-growth (Hu and Chen, 2006) | ✕ | ✓ | Itemset-based data | The pattern-growth method based on a MIS-Tree structure
CFP-growth++ (Kiran and Reddy, 2011) | ✕ | ✓ | Itemset-based data | It employs LMS (least minimum support) instead of MIN. Three improved strategies were proposed to reduce the search space and improve performance
MMS_Cumulate and MMS_Stratify (Tseng and Lin, 2007) | ✓ | ✕ | Itemset-based data | It mines ARs in the presence of taxonomies with MMS
Lee et al. (2005) | ✓ | ✕ | Itemset-based data | It derives FPs and ARs using maximum constraints instead of minimum constraints
REMMAR (Liu et al., 2011) | ✓ | ✕ | Bioinformatics microarray datasets | It discovers relational association rules with multiple minimum supports in microarray datasets
Lee et al. (2004) | ✓ | ✕ | Quantitative itemset-based data | It mines fuzzy association rules with multiple minimum supports by using maximum constraints
Lee et al. (2006) | ✓ | ✕ | Quantitative itemset-based data | It mines fuzzy multiple-level ARs under multiple minimum supports
FQSP-MMS (Huang, 2013) | ✓ | ✕ | Quantitative sequential data | It discovers fuzzy quantitative sequential patterns (FQSPs) from quantitative sequential databases
GCoMine (Rage and Kitsuregawa, 2014) | ✕ | ✓ | Itemset-based data | It finds the correlated patterns using multiple minimum all-confidence thresholds (MMC)
HUIM-MMS (Lin et al., 2015b) | ✓ | ✕ | Quantitative itemset-based data | It discovers high-utility itemsets with the consideration of multiple minimum utility thresholds of items instead of MMS


3. Preliminaries and problem statement

This section introduces preliminaries and defines the problem of frequent pattern mining with multiple minimum supports (FP-MMS).

3.1. Preliminaries

Let I = {i1, i2, …, im} be a finite set of m distinct items. A transactional database is a set of transactions D = {T1, T2, …, Tn}, where each transaction Tq ∈ D (1 ≤ q ≤ n) is a subset of I and has a unique identifier, called its TID. A multiple minimum supports table MMS-table = {ms(i1), ms(i2), …, ms(im)} indicates the user-specified minimum support value ms(ij) of each item ij. A set of k distinct items X = {i1, i2, …, ik} such that X ⊆ I is said to be a k-itemset, where k is also called the length of the itemset. An itemset X is said to be contained in a transaction Tq if X ⊆ Tq. An example database is shown in Table 2. It consists of 10 transactions and 5 items, denoted (a) to (e). For example, transaction T1 contains items a, c and d. Table 3 shows the MMS-table, which defines the minimum support value of each item. Henceforth, this database will be used as the running example.

Table 2
A transactional database.

TID    Transaction
T1     a, c, d
T2     a, d, e
T3     b, c
T4     a, c, e
T5     a, b, c, d, e
T6     b, d
T7     a, b, c, e
T8     b, c, d
T9     c, d, e
T10    a, c, d

Definition 1. The number of transactions containing an itemset is known as the occurrence frequency of that itemset, also called its support count. The support of an itemset X, denoted as sup(X), is the number of transactions Tq ∈ D such that X ⊆ Tq.

Definition 2. An itemset X is designated as a frequent pattern if sup(X) is no less than a user-specified support threshold, called minsup, multiplied by the total number of transactions in the database, i.e., sup(X) ≥ minsup × |D|. This shows that the presence of itemset X in the database is statistically significant.

Definition 3. The minimum support threshold of an item ij in a database D, which plays the role of minsup, is denoted as ms(ij) in this paper. A structure called the MMS-table indicates the minimum support threshold of each item in D and is defined as:

MMS-table = {ms(i1), ms(i2), …, ms(im)}.

Assume that the minimum supports of all items for the running example are defined as in the MMS-table shown in Table 3. For example, ms(a) = 40%, and MMS-table = {ms(a), ms(b), ms(c), ms(d), ms(e)} = {40%, 50%, 30%, 60%, 100%}.

Table 3
The MMS-table.

Item    ms
a       40%
b       50%
c       30%
d       60%
e       100%

Table 4
The derived FPs for the running example.

Itemset    MIS × |D|    sup
(a)        4            6
(b)        5            5
(c)        3            8
(d)        6            7
(ac)       3            5
(ad)       4            4
(ae)       4            4
(bc)       3            4
(cd)       3            5
(ce)       3            4
(acd)      3            3
(ace)      3            3

To confront the "rare item problem", Liu et al. (1999) made an effort to mine FPs with multiple minsups. In the MSApriori model, each item in the transactional database is assigned its own support constraint, known as its minimum item support (MIS). In this paper, the MIS of an itemset is the minimal ms value among all items in that itemset.
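As a minimal sketch of Definitions 1 and 2, the support counts of the running example can be computed directly from Table 2. The names `DB`, `sup` and `is_frequent_uniform` are hypothetical, chosen only for this illustration, and the threshold is written as an integer percentage so that the comparison stays in integer arithmetic.

```python
# Running example of Table 2: TID -> items of the transaction.
DB = {
    "T1": {"a", "c", "d"},
    "T2": {"a", "d", "e"},
    "T3": {"b", "c"},
    "T4": {"a", "c", "e"},
    "T5": {"a", "b", "c", "d", "e"},
    "T6": {"b", "d"},
    "T7": {"a", "b", "c", "e"},
    "T8": {"b", "c", "d"},
    "T9": {"c", "d", "e"},
    "T10": {"a", "c", "d"},
}


def sup(itemset, db=DB):
    """Support count (Definition 1): transactions containing every item."""
    items = set(itemset)
    return sum(1 for transaction in db.values() if items <= transaction)


def is_frequent_uniform(itemset, minsup_percent, db=DB):
    """Definition 2 with a single uniform threshold, given in percent."""
    return sup(itemset, db) * 100 >= minsup_percent * len(db)


single_item_supports = {item: sup(item) for item in "abcde"}
print(single_item_supports)
print(is_frequent_uniform("ae", 40), is_frequent_uniform("ae", 50))
```

With a uniform threshold the outcome for (ae) flips as soon as the threshold moves from 40% to 50%, which is the sensitivity that per-item thresholds are meant to relieve.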

Definition 4. The minimum item support of a k-itemset X = {i1, i2, …, ik} in D is denoted as MIS(X), and defined as the smallest ms value among all items in X, that is:

MIS(X) = min{ms(ij) | ij ∈ X}.

For example, MIS(a) = min{ms(a)} = 40%, MIS(ae) = min{ms(a), ms(e)} = min{40%, 100%} = 40%, and MIS(ace) = min{ms(a), ms(c), ms(e)} = min{40%, 30%, 100%} = 30%.

Definition 5. An itemset X in a database D is called a FP if and only if its support count is no less than the minimum item support count of X:

FP ← {X | sup(X) ≥ MIS(X) × |D|}.

The extended concept of MIS enables the user to simultaneously specify a high minsup and a low minsup for desired patterns containing both frequent and rare items. Thus, it can efficiently address the "rare item problem". The significance of the MIS concept is illustrated by the following example. Assume that a database contains three items, bread, shoes and clothes, and has 1000 transactions, and that the user-specified MIS values are MIS(bread) = 3%, MIS(shoes) = 1% and MIS(clothes) = 0.15%. If the support of itemset {clothes, bread} is 160, then this itemset is frequent: its MIS value is min{MIS(clothes), MIS(bread)} = min{0.15%, 3%} = 0.15%, which corresponds to a minimum support count of 0.15% × 1000 = 1.5, and 160 ≥ 1.5.

3.2. Problem statement

Based on the above definitions, we define the problem of FPM with multiple minimum supports as follows. Given a transactional database D (|D| = n) and an MMS-table, which defines the minimum support thresholds {ms(i1), ms(i2), …, ms(im)} of the items in D, the problem of mining frequent patterns from D with multiple minimum supports (FP-MMS) is to find the set of itemsets in which the support of each itemset X is no less than MIS(X) × |D|. Hence, the goal of FP-MMS is to efficiently find the complete set of FPs in a database, while considering multiple minimum supports instead of a single uniform minimum support. For the running example, assume that the minimum support thresholds of the items are set as in Table 3. The itemset (e) is not a FP since its support is sup(e) = 5, which is less than MIS(e) × |D| = 100% × 10 = 10. But the itemset (ae) is a FP since its support is sup(ae) = 4, which is no less than the minimum support count MIS(ae) × |D| = min{ms(a), ms(e)} × |D| = min{40%, 100%} × 10 = 4. For the running example, the complete set of FPs with multiple minimum supports is shown in Table 4.

4. Proposed FP-ME algorithm for FP-MMS

Based on the addressed FP-MMS problem, an efficient algorithm for mining Frequent Patterns with Multiple minimum supports based on the Enumeration-tree, named FP-ME, is proposed in this section. A new tree structure, named the Set-Enumeration-tree with Multiple minimum supports (the ME-tree), is designed for FP-MMS. It is quite different from the CFP-tree, which is an extended version of the FP-tree structure. Thereafter, an improved algorithm based on the DiffSet concept is introduced to further increase mining efficiency.

4.1. Proposed ME-tree

Based on previous studies, the search space of mining FPs with multiple minimum supports can be represented as a lattice structure (Pasquier et al., 1998) or a Set-enumeration tree (Rymon, 1992). In an Apriori-like algorithm, the traversal space of the itemsets in the lattice can also be characterized by a Set-enumeration tree (Rymon, 1992). The well-known downward closure property of traditional FPM algorithms can be used to prune the search space. However, it does not hold for the multiple minimum supports measure used in FP-MMS. In other words, the MIS of an itemset may be lower than the MIS of some of its subsets, so an infrequent itemset can have frequent supersets. For example, the itemset (e) is not a FP, but its supersets (ae), (ce) and (ace) are desired FPs, as can be observed in Table 4. To solve this problem, Liu et al. proposed a concept called the sorted downward closure (SDC) property, which assumes that all items within an itemset are sorted in ascending order of their minimum supports (Liu et al., 1999). Since this important property has not been clearly defined in previous studies, we provide a formalization. Hereafter, to distinguish it from the traditional itemset, a sorted k-itemset is denoted as X = {i1, i2, …, ik}.
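The definitions of MIS(X) and of a FP can be checked on the running example with a brute-force enumeration of all 31 non-empty itemsets. This is only an illustrative sketch, not the proposed FP-ME algorithm; the names `MS`, `mis`, `is_fp` and `fps` are assumptions made for the example, and thresholds are stored as integer percentages.

```python
from itertools import combinations

# Transactions of Table 2 and thresholds of Table 3 (in percent).
DB = [
    {"a", "c", "d"}, {"a", "d", "e"}, {"b", "c"}, {"a", "c", "e"},
    {"a", "b", "c", "d", "e"}, {"b", "d"}, {"a", "b", "c", "e"},
    {"b", "c", "d"}, {"c", "d", "e"}, {"a", "c", "d"},
]
MS = {"a": 40, "b": 50, "c": 30, "d": 60, "e": 100}


def sup(itemset):
    """Support count of an itemset."""
    items = set(itemset)
    return sum(1 for transaction in DB if items <= transaction)


def mis(itemset):
    """Minimum item support of an itemset: smallest ms among its items."""
    return min(MS[item] for item in itemset)


def is_fp(itemset):
    """FP test: sup(X) >= MIS(X) * |D|, kept in integer arithmetic."""
    return sup(itemset) * 100 >= mis(itemset) * len(DB)


# Exhaustively test every non-empty itemset over the five items.
items = sorted(MS)
fps = {
    "".join(combo): sup(combo)
    for k in range(1, len(items) + 1)
    for combo in combinations(items, k)
    if is_fp(combo)
}
print(fps)

# The classical downward closure property fails here:
# (e) is not a FP although its supersets (ae) and (ace) are.
print(is_fp("e"), is_fp("ae"), is_fp("ace"))
```

The twelve itemsets found this way are the ones listed in Table 4. Exhaustive enumeration grows exponentially with the number of items, which is why a pruning property is needed.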

Property 1 (Sorted downward closure property). If a sorted k-itemset {i1, i2, …, ik}, for k ≥ 2 and MIS(i1) ≤ MIS(i2) ≤ … ≤ MIS(ik), is frequent, then all of its sorted subsets with k−1 items are also frequent, with the possible exception of the subset {i2, i3, …, ik}.

Proof. The k-itemset {i1, i2, …, ik} has k subsets with k−1 items, which can be divided into two groups according to whether i1 is included: group 1 consists of {i1, i2, …, ik−1}, {i1, i2, …, ik−2, ik}, …, {i1, i3, …, ik}, and group 2 consists of {i2, i3, …, ik}. All of the itemsets in group 1 have the same lowest minimum item support as {i1, i2, …, ik}, i.e., ms(i1), and their supports are no less than that of {i1, i2, …, ik}, so they are frequent. The itemset {i2, i3, …, ik} instead has the lowest minimum item support ms(i2). Since ms(i2) ≥ ms(i1), its support may fall below its own threshold; hence the property holds.
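Property 1 can be verified exhaustively on the running example. The sketch below is an assumption-laden illustration rather than part of the original method: the helper names (`sorted_by_ms`, `violations`, `exceptions`) are hypothetical. For every frequent itemset it checks the (k−1)-subsets that keep i1 and records separately whether the subset without i1 is frequent.

```python
from itertools import combinations

DB = [
    {"a", "c", "d"}, {"a", "d", "e"}, {"b", "c"}, {"a", "c", "e"},
    {"a", "b", "c", "d", "e"}, {"b", "d"}, {"a", "b", "c", "e"},
    {"b", "c", "d"}, {"c", "d", "e"}, {"a", "c", "d"},
]
MS = {"a": 40, "b": 50, "c": 30, "d": 60, "e": 100}  # percent


def sup(itemset):
    items = set(itemset)
    return sum(1 for transaction in DB if items <= transaction)


def is_fp(itemset):
    threshold = min(MS[item] for item in itemset)
    return sup(itemset) * 100 >= threshold * len(DB)


def sorted_by_ms(itemset):
    """Sort items in ascending order of ms (ties broken alphabetically)."""
    return sorted(itemset, key=lambda item: (MS[item], item))


violations = []  # group-1 subsets that are not frequent (expected: none)
exceptions = []  # group-2 subsets {i2, ..., ik} that are not frequent

for k in range(2, len(MS) + 1):
    for combo in combinations(sorted(MS), k):
        if not is_fp(combo):
            continue
        x = sorted_by_ms(combo)
        # Group 1: drop any single item except i1 (position 0).
        for drop in range(1, k):
            subset = x[:drop] + x[drop + 1:]
            if not is_fp(subset):
                violations.append(("".join(x), "".join(subset)))
        # Group 2: the one subset that loses i1.
        rest = x[1:]
        if not is_fp(rest):
            exceptions.append(("".join(x), "".join(rest)))

print(violations)
print(exceptions)
```

On this database the only non-frequent (k−1)-subsets of frequent itemsets are of the group-2 kind, which is the case the property leaves open.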


The level-wise generate-and-test mechanism for candidates in the MSApriori algorithm (Liu et al., 1999) is, in fact, similar to exploring FPs in the lattice structure by using the sorted downward closure (SDC) property. The generation mechanisms of the lattice structure (Pasquier et al., 1998) and the Set-enumeration tree (Rymon, 1992) are quite different. The lattice structure is an Apriori-like approach: each itemset is determined by all of its subsets, and each of these sub-lattices has to be traversed to determine whether their super-lattices satisfy the mining condition. In candidate generation, the lattice structure is used as an Apriori-like approach with the SDC property. The Set-enumeration tree, in contrast, is a prefix-tree-based method in which each child node is generated and determined by its prefix node. To utilize the SDC property of the desired FPs, we develop a new tree structure named ME-tree as follows.

Definition 6 (Total order ≺ on items). Without loss of generality, assume that items in the transactions are stored according to the lexicographic order. Furthermore, assume that the total order ≺ on items in the designed ME-tree is the ascending order of the MIS values of the items.

Definition 7 (Set-enumeration tree with multiple minimum supports, ME-tree). A ME-tree is a sorted Set-enumeration tree using the defined total order ≺ on items.

For example, the ascending order of the MIS values of the items in the running example is MIS(c) < MIS(a) < MIS(b) < MIS(d) < MIS(e). Thus, the total order ≺ on items in the ME-tree is c ≺ a ≺ b ≺ d ≺ e. For the running example, the designed ME-tree for the FP-MMS problem is illustrated in Fig. 1.

Definition 8 (Extension nodes in the ME-tree). In the designed ME-tree and according to the total order ≺, all descendant nodes of a tree node are called its extension nodes, and the first-generation children of a tree node are called its 1-extensions.

For example, in Fig. 1, the extension nodes of node (cb) are (cbd), (cbe) and (cbde). Note that the supersets of node (cb) are (cbd), (cbe), (cab), (cbde), (cabd), (cabe) and (cabde). Hence, the extension nodes of a node are a subset of the supersets of that node. The ME-tree systematically enumerates the itemsets of the extension nodes using a pre-imposed total order ≺ on the underlying set of items. Based on the above definitions, the ME-tree is concretely defined as follows:

(1) It consists of one root labeled as "null" or "{ }", and a set of item subtrees as the children of the root. It is important to note that the set LMS_FP1 forms the nodes in level 1; the reason will be given in the next subsection.
(2) Each node N in the ME-tree consists of three fields: N.Name, N.TidSet (w.r.t. the support count) and N.MIS. N.Name registers the itemset of the node; N.TidSet is the set of TIDs of the transactions containing itemset N (the size of N.TidSet is equal to the support count of N, i.e., the number of transactions containing N in the database); and N.MIS is the pre-defined interestingness threshold of the node.
(3) All the level-1 items/nodes in the set LMS_FP1 are sorted in ascending order of their MIS values.

The ME-tree structure may contain more nodes than the CFP-tree, since the ascending support (w.r.t. frequency) ordering method reduces the chances for node sharing among different transactions. However, both the CFP-growth (Hu and Chen, 2006) and CFP-growth++ (Kiran and Reddy, 2011) algorithms need to maintain parent links and node-links at each node, and also need to build a series of conditional trees, which incurs additional memory cost. To further save space, we use vertical TID-lists (Zaki and Gouda, 2003) to store the necessary information for every node in the ME-tree, which reduces the space and construction costs. A number of vertical mining algorithms have recently been proposed for ARM; they have been shown to be very effective and usually outperform horizontal approaches (Zaki and Gouda, 2003). The main advantage of the vertical format is that the occurrence frequency can be calculated quickly via intersection operations on TIDs, and irrelevant data is pruned automatically.

Therefore, a ME-tree is an ordered tree structure which can represent a transaction database in a highly compressed form. It is constructed by reading transactions one at a time with a predefined item order ≺ and mapping each transaction into a path in the prefix-sharing ME-tree. Since different transactions can have several common items, their paths may overlap. The more the paths overlap with one another, the more compression can be achieved using a prefix-tree structure. The ME-tree improves the possibility of prefix sharing among all the patterns in the database with one database scan. That is, more frequently occurring items are more likely to be shared. This item ordering enhances the compactness of the ME-tree structure. The frequency-descending tree (Chen et al., 1996; Han et al., 2004) may, however, not always provide maximum compactness. Sometimes, the insertion order of transactions may affect the tree size. Thus, the ME-tree provides better compactness, and can achieve the "mining during constructing" property with only one database scan. Like the FP-tree, the size of a ME-tree is bounded by the size of the database itself, because each transaction contributes at most one path of its size to the ME-tree. Since many prefix patterns are common among the transactions, the size of the ME-tree is normally much smaller than its original database. Based on the constructed ME-tree, the following lemmas can be obtained.
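The total order ≺ and the notion of extension nodes can be made concrete with a short sketch. It assumes the MMS-table of the running example and represents a node simply by the string of its items in ≺ order; `ORDER`, `children` and `extensions` are hypothetical names used only for this illustration.

```python
MS = {"a": 40, "b": 50, "c": 30, "d": 60, "e": 100}  # percent, Table 3

# Total order on items: ascending ms value (ties broken alphabetically).
ORDER = sorted(MS, key=lambda item: (MS[item], item))


def children(node):
    """1-extensions: append one item that follows the node's last item."""
    start = ORDER.index(node[-1]) + 1 if node else 0
    return [node + item for item in ORDER[start:]]


def extensions(node):
    """All descendants of a node in the Set-enumeration tree."""
    found = []
    for child in children(node):
        found.append(child)
        found.extend(extensions(child))
    return found


print(ORDER)             # order of the level-1 nodes
print(children("cb"))    # 1-extensions of (cb)
print(extensions("cb"))  # all extension nodes of (cb)
print(len(extensions("")))  # number of nodes below the root
```

Because a child is always formed by appending a later item, (cab) is a superset of (cb) but never one of its extension nodes, in line with the remark after Definition 8.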

Fig. 1. A constructed ME-tree, built with the item order MIS(c) < MIS(a) < MIS(b) < MIS(d) < MIS(e). [Figure: a Set-enumeration tree rooted at {}, with level-1 nodes c, a, b, d, e and one node for each of the 31 non-empty itemsets, down to cabde.]

Lemma 1. The support of a node in the ME-tree is no less than the support of any of its child nodes (1-extension nodes), or of any of its other extension nodes.

Proof. Let Xk be a node in the ME-tree containing k items, and let Xk−1 be the parent node of Xk, containing k−1 items. Since Xk−1 ⊂ Xk, the well-known downward closure property gives sup(Xk) ≤ sup(Xk−1). Applying this relation along the path to any descendant shows that the support of any node in the ME-tree is no less than that of any of its extension nodes under the total order ≺.

Lemma 2. The MIS of a node in the ME-tree is equal to the MIS of any of its child nodes (1-extension nodes), as well as of all its extension nodes.

Proof. Items are appended in ascending order of their MIS values, so the first item of a node has the smallest ms value in the node, and every extension node keeps that first item. Hence the lemma holds.

For example, in Fig. 1, the ME-tree is built using the ascending order of the MIS values of the items. The support of (cb) is no less than that of any of its extension nodes (cbd), (cbe) and (cbde), as the following values show: sup(cb) = 4, MIS(cb) = 30%; sup(cbd) = 2, MIS(cbd) = 30%; sup(cbe) = 2, MIS(cbe) = 30%; and sup(cbde) = 1, MIS(cbde) = 30%. From these results, it can be seen that the MIS values of (cbd), (cbe) and (cbde) are all equal to the MIS of (cb).
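Taken together, Lemmas 1 and 2 imply that once a node fails its threshold, none of its extension nodes can pass it, since their supports can only shrink while the threshold stays the same. The sketch below illustrates this with a depth-first traversal that builds each child's TID-set by intersecting the parent's TID-set with the TID-set of the appended item. It is a simplified illustration under stated assumptions, not the authors' FP-ME pseudocode: the function `mine` and the variables `TIDSET`, `ORDER`, `fps` and `visited` are hypothetical, and no LMS-based filtering is applied.

```python
# Running example: TID -> items (Table 2); thresholds in percent (Table 3).
DB = {1: "acd", 2: "ade", 3: "bc", 4: "ace", 5: "abcde",
      6: "bd", 7: "abce", 8: "bcd", 9: "cde", 10: "acd"}
MS = {"a": 40, "b": 50, "c": 30, "d": 60, "e": 100}

# Total order on items: ascending ms value.
ORDER = sorted(MS, key=lambda item: (MS[item], item))

# Vertical format: item -> set of TIDs of the transactions containing it.
TIDSET = {item: {tid for tid, t in DB.items() if item in t} for item in ORDER}


def mine():
    """Depth-first traversal of the enumeration tree with pruning."""
    fps = {}
    visited = 0

    def dfs(name, tids, mis, start):
        nonlocal visited
        visited += 1
        # Node fails its threshold: by Lemmas 1 and 2 no extension can pass.
        if len(tids) * 100 < mis * len(DB):
            return
        fps[name] = len(tids)
        for pos in range(start, len(ORDER)):
            item = ORDER[pos]
            # Child TID-set = parent TID-set intersected with the item's.
            dfs(name + item, tids & TIDSET[item], mis, pos + 1)

    for pos, item in enumerate(ORDER):
        # The MIS of a whole subtree is the ms of its level-1 item.
        dfs(item, TIDSET[item], MS[item], pos + 1)
    return fps, visited


fps, visited = mine()
print(fps)
print(visited, "of 31 nodes evaluated")
```

The itemsets returned are those of Table 4, written in ≺ order (for example "ca" for (ac)). Only part of the 31-node tree is evaluated, because every failing node cuts off its whole subtree.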

For example, the LMS of Table 3 is calculated as LMS = min{ms(a), ms(b), ms(c), ms(d), ms(e)} = min{40%, 50%, 30%, 60%, 100%} =30%.

Theorem 1. (Sorted downward closure property of FPs in the ME-tree). In the designed ME-tree, if a tree node is a FP, its parent node is also a FP. Let Xk be a k-itemset (node) in the ME-tree and its parent node (w.r.t. a (k-1)-itemset) is denoted as Xk−1. The relationships sup(Xk) ≤ sup(Xk−1) and MIS(Xk) = MIS(Xk−1) hold.

Proposition 1. If X = {i1, i2, ···, ik} ⊆ I, where 1 ≤ k ≤ n, is a pattern such that sup(X) < LMS, then X cannot be a frequent pattern.

Proof. Since the least minimum support (LMS) is the lowest minsup of all FPs in the processed database, X cannot be a frequent pattern when sup(X) < LMS.

Proposition 2. If X and Y are two patterns such that X⊂Y and sup(X) < LMS, then sup(Y) < LMS.

Proof. According to Lemmas 1 and 2, this theorem holds. From Theorem 1, it can be concluded that if a node in the ME-tree is a FP, its parent node must also be a FP. Thus, a novel sorted downward closure (SDC) property of frequent patterns in the designed ME-tree is obtained.

The ME-tree can be traversed using the following principle. Each node X in the ME-tree is annotated with a structure called a MMS-table. If the support count of X is no less than its minimum support count (= MIS(X) × |D|), the extensions of node X need to be explored since they may be desired FPs. Otherwise, the depth-first search is terminated at X since no extension of node X can be a FP. This procedure efficiently reduces the search space of the ME-tree. Moreover, based on the developed ME-tree, the actual frequent patterns can be directly discovered during the construction of the ME-tree without performing multiple database scans. The SDC property of FPs can thus be used to speed up the mining process by pruning unpromising extension nodes.

Proof. Since X ⊂ Y, Y is a superset of X, and according to the Apriori property, sup(Y) ≤ sup(X). With sup(Y) ≤ sup(X) and sup(X) < LMS, the relationship sup(Y) ≤ sup(X) < LMS always holds.

This indicates that the concept of LMS guarantees the global anti-monotonicity of frequent patterns with multiple minimum supports. Let the set of 1-itemsets having sup(X) ≥ LMS be denoted as LMS_FP1 (LMS_FP1 is regarded as a subset of the final set of FPs, i.e., LMS_FP1 ⊆ FPs). The following theorem can then be obtained.

Theorem 2. Assume that 1-itemsets having a MIS lower than LMS are discarded and that the sorted downward closure (SDC) property is applied. If an itemset is not a LMS_FP1, then it is not a FP, and neither are any of its supersets.

Proof. Let Xk−1 = {i1, i2, ···, ik−1} ⊆ I be a (k−1)-itemset and let Xk denote a k-itemset that is a superset of it. Since Xk−1 ⊆ Xk,

4.2. Proposed pruning strategies

(1) For a LMS_FP1, it holds that sup(LMS_FP1) ≥ LMS. In other words, if Xk−1 is not a LMS_FP1, then sup(Xk−1) < LMS and MIS(Xk−1) < LMS. (2) Since items are sorted in ascending order of MIS values, according to Theorem 1, it holds that sup(Xk−1) ≥ sup(Xk) and MIS(Xk−1) = MIS(Xk) = min{MIS(i1), MIS(i2), …, MIS(im)} = MIS(i1).

It is important to note that the sorted downward closure (SDC) property of FPs in the ME-tree can only guarantee partial anti-monotonicity for FPs, not general anti-monotonicity. In other words, the SDC property holds for any extension node of a given node, but it may fail for other supersets of that node. Thus, if the SDC property of FPs is used to decide whether all supersets of an itemset should be explored, some FPs may not be found. For example, let X be an itemset having a support value less than the user-specified minimum support count MIS(X) × |D|, so that X is not a FP. Suppose that every itemset Y that is a superset of X is therefore not considered as a FP. This would be wrong for the following reason. Let ij be an item that has a lower MIS value than all items in X, i.e., MIS(ij) < MIS(X). The itemset Y = ij ∪ X would have a lower MIS value than X, since MIS(Y) = MIS(ij) < MIS(X), and Y would be a FP if sup(Y) ≥ MIS(Y) × |D|. Thus, if the SDC property is used to prune the supersets of itemset X, its superset Y would be discarded because X is not a FP (sup(X) < MIS(X) × |D|). As a result, Y, which is produced by adding ij to X, would be missing from the final set of FPs. For the running example, the item (e) is not a FP since sup(e) = 5 (< 10), while its supersets (ae), (ce) and (ace) are FPs, as shown in Table 4. It is thus incorrect to determine the FPs based only on the proposed SDC property.

For the MSApriori (Liu et al., 1999) and CFP-growth++ (Kiran and Reddy, 2011) algorithms, it was shown that the MIN/LMS concept can guarantee the global anti-monotonicity of frequent patterns with multiple minimum supports and ensure the completeness of the set of derived FPs. To address this problem, we further adopt the MIN/LMS concept in the proposed FP-ME algorithm. This guarantees the completeness of the FPs derived from the ME-tree with the proposed SDC property.
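The failure of general anti-monotonicity described above can be reproduced on a toy instance. This is a minimal sketch, assuming a hypothetical 10-transaction database built to match the supports quoted in this section (it is not the paper's Table 2); `is_fp` is an illustrative helper name.

```python
# Hypothetical database (an assumption, NOT the paper's Table 2).
DB = [
    {"c", "b", "d", "e"}, {"c", "b", "d"}, {"c", "b", "e", "a"}, {"c", "b", "a"},
    {"b", "d"}, {"a", "c", "e"}, {"a", "c", "e", "d"}, {"a", "e", "d"},
    {"a", "c", "d"}, {"c", "d"},
]
MIS = {"a": 40, "b": 50, "c": 30, "d": 60, "e": 100}  # percent


def sup(itemset):
    """Support count of an itemset."""
    return sum(1 for t in DB if set(itemset) <= t)


def is_fp(itemset):
    """Frequent if sup(X) >= MIS(X) x |D|, with MIS(X) the lowest item MIS."""
    threshold_percent = min(MIS[i] for i in itemset)
    return sup(itemset) * 100 >= threshold_percent * len(DB)


# (e) alone misses its own threshold (5 < 10) ...
print("e", sup("e"), is_fp("e"))
# ... but supersets that gain an item with a lower MIS can still be frequent.
for y in ["ae", "ce", "ace"]:
    print(y, sup(y), is_fp(y))
```

Pruning every superset of (e) as soon as (e) fails would therefore lose these three patterns, which is why the LMS test is applied to single items instead.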

Thus, if Xk−1 is not a LMS_FP1, any of its supersets Xk has sup(Xk) ≤ sup(Xk−1) < LMS and MIS(Xk) < LMS. Based on the concept of LMS in Definition 9, Xk cannot be a desired frequent pattern (LMS_FP1 ⊆ FPs). It can be concluded that Xk−1 is not a FP, and neither is any of its supersets.

The above definitions and theorems ensure that all FPs are included in the extensions of the set of LMS_FP1, and thus that the other itemsets can be safely discarded. Thus, the designed sorted downward closure (SDC) property can guarantee the completeness and correctness of the FP-ME algorithm for the addressed FP-MMS problem. These properties make it possible to use LMS as a constraint to reduce the search space. In particular, LMS, which has the global anti-monotone property, can be used to prune the items (or patterns) that cannot generate any FP at a higher order. Based on the above analysis, the LMS plays a significant role in the FP-ME algorithm: it not only ensures the completeness of the set of derived FPs, but can also be used to efficiently prune the search space.

Pruning Strategy 1. In the designed ME-tree, which adopts the MIS-ascending order of items, if a tree node X has a support value less than the LMS, then any node that contains X, i.e., any superset of X, can be directly pruned and need not be explored in the ME-tree.

For the running example, the set of LMS_FP1 in the transactional database of Table 2 is {sup(a) = 6; sup(b) = 5; sup(c) = 8; sup(d) = 7; sup(e) = 5}. Based on Theorem 1, it can be concluded that any FP mined from this database will have one or more items having various MIS values. Thus, the lowest minsup in the database that a FP has to satisfy is the lowest MIS value among all these frequent items. If we assume that the LMS is set to 6, the support of (e) is sup(e) = 5, which is less than 6, so no superset containing (e) can have a support of 6 or more. Thus, it is guaranteed that itemset (e) cannot generate any FP at

Definition 9. (Least minimum support, LMS). The least minimum support (LMS) refers to the lowest minsup of all FPs. Therefore, the LMS in a database is always equal to the lowest MIS value among all frequent items. Based on the definition, the LMS is equal to the lowest value in the MMS-table and is defined as min{ms(i1), ms(i2), …, ms(im)}, where m is the total number of items in a database.


scanning the ME-tree, and finishes after all promising paths have been traversed. We use the following example to illustrate the “mining when structuring” property in detail. Notice that the ME-tree construction process is performed during the mining phase. Details of the proposed FP-ME algorithm and the FP-Spanning procedure are described in Algorithm 1 and Algorithm 2, respectively. In addition, a top-down traversal strategy is adopted in the ME-tree to mine the FPs.

higher level in the ME-tree.

Lemma 3. Given a transactional database D and the minimum support threshold MIS(ik) of each item ik, the constructed ME-tree contains the complete information about FPs in D.

Proof. In the construction process of the ME-tree, each transaction in D can be mapped onto one path in the ME-tree whenever necessary. According to Theorems 1 and 2, the information in each transaction about all promising itemsets, i.e., the LMS_FP1, is completely stored in the ME-tree based on the total order ≺. Notice that the items in LMS_FP1 that are infrequent but have supports no less than LMS are retained in the ME-tree since the supersets of those items may be frequent.

Algorithm 1. FP-ME algorithm.

Input: D (n = |D|), a transactional database; MMS-table, a table containing the minimum item support values of all items.
Output: The set of frequent patterns (FPs).
1. scan MMS-table to calculate the MIS value of each single item and put it into MISArray, and calculate the LMS among them; // calculate MISArray and the LMS
2. scan D to find I* ← {ij ∈ I | sup(ij) ≥ LMS}, i.e., the LMS_FP1; // apply pruning strategy 1
3. sort the set I* in MIS-ascending order ≺; // Definitions 6–8
4. scan D once again to construct the TID-Set of each item ij ∈ I*, denoted ij.TidSet;
5. call FP-Spanning(Ø, I*, MISArray);
6. return FPs

Pruning Strategy 2. Consider the designed ME-tree using the MIS-ascending order of items. If a node X has a support value less than its MIS, i.e., the MIS of its prefix level-2 node, any extension of X, i.e., the whole subtree of X, can be directly pruned.

For example, the ME-tree of the running example is shown in Fig. 3. To illustrate the conditional anti-monotone property of FPs based on the ME-tree, consider the itemset (cab). Since sup(cab) = 2 < MIS(cab) × |D|, (cab) is uninteresting. By applying pruning strategy 2, none of the extension nodes of itemset (cab) is considered as a FP since their support values are always no greater than that of (cab). Hence, all its extensions (cabd), (cabe) and (cabde) (the shaded nodes in Fig. 3) can be regarded as non-interesting patterns and can be safely pruned. However, it is also important to notice the difference between conditional anti-monotonicity and pure anti-monotonicity. That is, if the itemset (ad) is a non-interesting pattern, we can only be confident that the extension nodes with itemset (ad) as their prefix are non-interesting, but cannot infer the interestingness of the itemsets (cad) and (abd). In contrast, global anti-monotonicity can be used to safely filter all the supersets of an unpromising itemset, as described in Fig. 3. This implies that the traditional candidate generation method, i.e., the MSApriori algorithm (Liu et al., 1999), is not suitable for mining FPs under strict conditional anti-monotonicity, unlike the SDC property in the ME-tree.

As shown in Algorithm 1, the FP-ME algorithm first scans the database once to calculate the MIS value of each single item, as well as the LMS, and then discovers the set I*, i.e., the LMS_FP1 (Lines 1–2). It is important to notice that the candidate patterns derived here are the set of LMS_FP1, not the set of FP1, for the reason given above. The discovered LMS_FP1 are then sorted in MIS-ascending order, and their TID-Sets are constructed by scanning the database again, thus forming the final set of TID-Sets as D.TidSets ← ij.TidSet (Lines 3–4). Afterwards, each 1-itemset in I* is processed in the designed order ≺ to find FPs from the ME-tree, using the constructed TID-Sets without rescanning the database multiple times (Line 5, FP-Spanning procedure). Details of the FP-Spanning procedure are described below.
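The preprocessing steps of Algorithm 1 (Lines 1–4) can be sketched as follows. This is an illustrative reading of the steps, not the authors' Java implementation: the function name `preprocess`, the use of integer percentages, the tie-break by item name, and the hypothetical database are all assumptions.

```python
def preprocess(db, mis_percent):
    """Sketch of Algorithm 1, Lines 1-4: LMS, I*, MIS-ascending sort, TID-sets."""
    n = len(db)
    # Line 1: the LMS is the lowest value in the MMS-table.
    lms = min(mis_percent.values())
    # Line 2: first scan, keep items whose support reaches the LMS (LMS_FP1).
    support = {}
    for t in db:
        for item in t:
            support[item] = support.get(item, 0) + 1
    kept = [i for i in support if support[i] * 100 >= lms * n]
    # Line 3: sort in MIS-ascending order (ties broken by name: an assumption).
    kept.sort(key=lambda i: (mis_percent[i], i))
    # Line 4: second scan, build the TID-set of every kept item.
    tidsets = {i: {tid for tid, t in enumerate(db) if i in t} for i in kept}
    return lms, kept, tidsets


# Hypothetical database (NOT the paper's Table 2) and the quoted MIS values.
DB = [
    {"c", "b", "d", "e"}, {"c", "b", "d"}, {"c", "b", "e", "a"}, {"c", "b", "a"},
    {"b", "d"}, {"a", "c", "e"}, {"a", "c", "e", "d"}, {"a", "e", "d"},
    {"a", "c", "d"}, {"c", "d"},
]
MIS = {"a": 40, "b": 50, "c": 30, "d": 60, "e": 100}

lms, order, tidsets = preprocess(DB, MIS)
print(lms, order, {i: len(s) for i, s in tidsets.items()})
```

Note that item (e) is kept even though it is not frequent on its own, since its support reaches the LMS.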

4.3. Proposed FP-ME algorithm

Algorithm 2. FP-Spanning procedure.

Similar to the FP-tree (Geng and Hamilton, 2006), the CFP-growth++ algorithm uses a compact data structure, the CFP-tree, to generate the conditional trees; it is a combination of a prefix-tree structure and node-links. To improve the compactness of prefix sharing, the items in the CFP-tree are sorted in frequency-descending order. The CFP-tree is traversed from bottom to top along the node-links. When processing a conditional CFP-tree, the construction and mining time of a visited node N depends on the number of its descendants. However, it is hard to reduce the traversal and construction costs at the same time since saving construction cost usually incurs more traversal cost, and vice versa. Another drawback is that each node needs to maintain a pointer to its parent as well as a node-link. CFP-growth++ is not efficient on sparse databases due to its high tree construction cost. Thus, it is challenging and critically important to design more efficient algorithms to solve this problem.

The ME-tree has the “mining when structuring” property. In detail, the complete ME-tree does not need to be built; only the potential frequent patterns and their extensions are constructed and determined during the mining phase. Therefore, unpromising patterns and their extensions can be directly pruned, which greatly reduces the construction time, mining cost and memory usage. In summary, the FP-ME algorithm follows a depth-first search to mine the expected patterns. For example, in Fig. 3, one of the search paths is 〈(c)〉 → 〈(ca)〉 → 〈(cab)〉 → 〈(cabd)〉 → 〈(cabde)〉. Once this path is traversed, the algorithm recursively searches the other branches until no more promising patterns are left. This search phase is dynamically executed in an alternating fashion, starting with the first phase by structuring and

Input: X, an itemset; extensionsOfX, the set of all 1-extensions of X; MISArray, an array containing the minimum item supports of all items.
Output: The set of frequent patterns (FPs).
1. for each itemset Xa ∈ extensionsOfX do
2.   calculate sup(Xa) and MIS(Xa) from the built structure of Xa;
3.   if sup(Xa) ≥ MIS(Xa) × |D| then
4.     FPs ← FPs ∪ {Xa}; // update the final derived FPs
5.     extensionsOfXa ← Ø; // initialize the set of all extensions of Xa
6.     for each itemset Xb ∈ extensionsOfX such that b comes after a in ≺ do
7.       Xab ← Xa ∪ Xb; // obtain the extension of Xa
8.       calculate Xab.TidSet by merging (intersecting) Xa.TidSet and Xb.TidSet;
9.       extensionsOfXa ← extensionsOfXa ∪ {Xab}; // update the derived set
10.     end for
11.     call FP-Spanning(Xa, extensionsOfXa, MISArray);
12.   end if
13. end for
14. return FPs

As mentioned before, the spanning miner mechanism in the FP-ME


Lemma 4. The complete search space of the addressed FP-MMS problem can be represented by a ME-tree where items are sorted according to the ascending order of the MIS value on items.


Proof. According to the definition of a set-enumeration tree (Rymon, 1992), the complete search space of I* (where m is the number of items in I*) contains (2^m − 1) patterns, obtained by systematically enumerating all subsets of I. For example, the ME-tree shown in Fig. 1 depicts all subsets of I = {a, b, c, d, e}, that is, all the possible patterns. Thus, all the supersets of the root node can be enumerated according to the ascending order of the MIS values of items. This representation is complete; the developed ME-tree can thus be used to represent the whole search space of the addressed FP-MMS problem.

Completeness and correctness. The results derived from the ME-tree by the FP-ME algorithm are complete and correct, that is, the algorithm discovers the complete set of FPs under the multiple minimum supports constraint. The evidence is as follows: (1) From Lemma 4, the ME-tree guarantees that the search space is complete, i.e., every promising itemset can be traversed. As shown in Fig. 2, the worst case of the search space is 2^n − 1 = 2^5 − 1 = 31. (2) The time complexity of the construction and mining phases is proportional to the size of the ME-tree in the worst case, and is much smaller than the size of the complete ME-tree in the average case. (3) The promising pattern generation and pruning strategies. According to Lemma 3, the necessary information of the processed database is completely mapped into the nodes of the ME-tree, so the necessary information of each promising pattern is complete and no desired pattern is missed. (4) The pruning operations are safe owing to the anti-monotone property of the support measure, the conditional anti-monotonicity of the SDC property, and the global anti-monotone property of the LMS constraint. Therefore, the results derived by the FP-ME algorithm are complete and correct. Each itemset found by FP-ME is guaranteed to have a support value no less than its pre-defined MIS threshold, and every promising itemset in the ME-tree can be successfully explored.
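The size of the search space and the notion of extension nodes can be illustrated with a short enumeration. This is a minimal sketch of a set-enumeration tree over the running example's order c ≺ a ≺ b ≺ d ≺ e; representing a node as a string of item letters and the function name `extensions` are assumptions made for brevity.

```python
ORDER = ["c", "a", "b", "d", "e"]  # MIS-ascending order of the running example


def extensions(node):
    """Yield every extension node (descendant) of `node`, depth-first.

    A node is a string of items written in the total order; a child appends
    one item that comes later in the order than the node's last item.
    """
    start = ORDER.index(node[-1]) + 1 if node else 0
    for k in range(start, len(ORDER)):
        child = node + ORDER[k]
        yield child
        yield from extensions(child)


all_nodes = list(extensions(""))          # every non-empty itemset, once
one_ext_c = [n for n in extensions("c") if len(n) == 2]
ext_cb = list(extensions("cb"))

print(len(all_nodes), one_ext_c, ext_cb)
```

The extensions of (cb) found this way are (cbd), (cbde) and (cbe); supersets such as (cab) are reached through (ca) instead, so each itemset appears exactly once in the tree.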

Fig. 2. All the visited nodes in the ME-tree, with MIS(c) < MIS(a) < MIS(b) < MIS(d) < MIS(e). Legend: visited nodes; visited and pruned nodes; skipped nodes.


Example. Consider the transaction database shown in Table 2, where the MIS value of each item is shown in Table 3. As discussed above, the items in the ME-tree are arranged according to their MIS values in ascending order, i.e., c ≺ a ≺ b ≺ d ≺ e. To create the ME-tree, the set of LMS_FP1 needs to be discovered first: LMS_FP1 = {sup(a) = 6; sup(b) = 5; sup(c) = 8; sup(d) = 7; sup(e) = 5}. After sorting LMS_FP1 in the designed order, we first create the root of the tree, labeled as “null”, and then perform the depth-first search of the item (c). As shown in Fig. 3, the first search path is 〈(c)〉 → 〈(ca)〉 → 〈(cab)〉 → 〈(cabd)〉 → 〈(cabde)〉. During the traversal, the support of each tree node is obtained from its TID-set. Besides, each node is combined with its sibling nodes that share the same prefix items to construct a series of 1-extensions. For node (c), the set of 1-extensions is 〈(ca), (cb), (cd), (ce)〉. The join operation is performed to combine (ca) and (cb) for generating (cab). The remaining itemsets are processed in the same way. Notice that during the join process, only one new node is created each time, no duplicate nodes are created, and unpromising nodes are disposed of at the same time.
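The join step in this example can be sketched with plain set intersections. The database below is a hypothetical one built to match the supports quoted in this section (it is not the paper's Table 2), and the variable names are illustrative.

```python
# Hypothetical database (an assumption, NOT the paper's Table 2).
DB = [
    {"c", "b", "d", "e"}, {"c", "b", "d"}, {"c", "b", "e", "a"}, {"c", "b", "a"},
    {"b", "d"}, {"a", "c", "e"}, {"a", "c", "e", "d"}, {"a", "e", "d"},
    {"a", "c", "d"}, {"c", "d"},
]

# TID-set of every single item (transaction ids start at 0 here).
tid = {i: frozenset(k for k, t in enumerate(DB) if i in t) for i in "cabde"}

# 1-extensions of node (c): join (c) with each later sibling a, b, d, e.
ext_c = {"c" + i: tid["c"] & tid[i] for i in "abde"}

# Joining the siblings (ca) and (cb) yields the single new node (cab).
cab = ext_c["ca"] & ext_c["cb"]

print({name: len(t) for name, t in ext_c.items()}, len(cab))
```

With a minimum support count of 3 for the prefix (c) (30% of 10 transactions), (cab) has support 2 in this toy database and would be disposed of immediately.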

Fig. 3. Pruning operation with the SDC property in the ME-tree, with MIS(c) < MIS(a) < MIS(b) < MIS(d) < MIS(e). Legend: pruned node; skipped node.

is different from the generate-and-test and the pattern-growth approaches. The main idea of the FP-Spanning procedure (cf. Algorithm 2) is that, for each 1-itemset Xa, the extensions of Xa are recursively explored using a depth-first search. Moreover, each itemset encountered during that search is evaluated to determine whether it is a FP (Lines 2–4). Notice that the depth-first search is only performed for an itemset Xa if the support of Xa (which is directly obtained from the relevant TID-Sets by counting its TIDs) is larger than or equal to the related minimum support count, i.e., MIS(Xa) × |D| (Lines 3–4). Simultaneously, the TID-Set construction procedure is used to construct the series of 1-extensions of Xa (Lines 5–10). The above process is recursively executed until all the 1-itemsets in LMS_FP1 have been processed (Line 11). After that, the final set of FPs under the multiple minimum supports constraint is discovered and output. Since the proposed FP-ME algorithm discovers FPs directly from the constructed TID-Sets, it not only avoids repeatedly performing time-consuming database scans, but also does not need to keep the entire database in memory.
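Putting the pieces together, the spanning search can be sketched in a few lines of Python. This is an illustrative reading of Algorithms 1 and 2 under stated assumptions, not the authors' Java code: itemsets are tuples in MIS-ascending order, MIS values are integer percentages, and the database is a hypothetical one (not the paper's Table 2).

```python
def fp_me(db, mis_percent):
    """Sketch of FP-ME: depth-first spanning over TID-sets with LMS/SDC pruning."""
    n = len(db)
    lms = min(mis_percent.values())
    tid = {}
    for k, t in enumerate(db):
        for item in t:
            tid.setdefault(item, set()).add(k)
    # LMS_FP1 in MIS-ascending order (ties broken by name: an assumption).
    items = sorted((i for i in tid if len(tid[i]) * 100 >= lms * n),
                   key=lambda i: (mis_percent[i], i))
    fps = {}

    def spanning(extensions):
        # `extensions`: list of (itemset, tidset) sharing a prefix, in order.
        for idx, (xa, ta) in enumerate(extensions):
            # MIS(Xa) is the MIS of its first item because items are sorted.
            if len(ta) * 100 >= mis_percent[xa[0]] * n:
                fps[xa] = len(ta)
                # Join Xa with every later sibling to build its 1-extensions.
                children = [(xa + xb[-1:], ta & tb)
                            for xb, tb in extensions[idx + 1:]]
                spanning(children)
            # Otherwise the whole subtree of Xa is skipped (SDC property).

    spanning([((i,), tid[i]) for i in items])
    return fps


DB = [
    {"c", "b", "d", "e"}, {"c", "b", "d"}, {"c", "b", "e", "a"}, {"c", "b", "a"},
    {"b", "d"}, {"a", "c", "e"}, {"a", "c", "e", "d"}, {"a", "e", "d"},
    {"a", "c", "d"}, {"c", "d"},
]
MIS = {"a": 40, "b": 50, "c": 30, "d": 60, "e": 100}

result = fp_me(DB, MIS)
for itemset, count in result.items():
    print("".join(itemset), count)
```

On this toy database the sketch reports (ae), (ce) and (cae) as frequent even though (e) itself is not, which matches the behaviour described for the running example.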

Fig. 4. Runtime w.r.t. a fixed β under various LMS.

4.4. Improved algorithm with the DiffSet strategy

Although the LMS concept and the SDC property used in the FP-ME algorithm can prune the search space, the FP-ME algorithm may still suffer from a combinatorial explosion of the number of join operations when there is a large number of LMS_FP1. When the intermediate results of vertical TID-lists become too large, scalability is affected. To increase the mining performance while maintaining the completeness and correctness of the algorithm, an additional strategy named DiffSet is applied as follows.

In (Zaki and Gouda, 2003), a novel vertical data representation called DiffSet was presented, which only keeps track of the differences between the TIDs of a candidate pattern and those of its generating FPs. Previous studies have shown that DiffSet drastically cuts down the amount of memory required to store intermediate results. Next, we show how DiffSet is incorporated into the previous vertical mining method and used in FP-ME to significantly increase the mining performance.

Consider a pattern with prefix P. Let t(X) denote the TidSet of element X, let d(X) denote the DiffSet of element X with respect to the prefix TidSet, and let PX and PY be the combined patterns with prefix P. The following relationships can be obtained (Zaki and Gouda, 2003):

d(PX) = t(P) − t(X);
d(PXY) = t(PX) − t(PY) = t(PX) − t(PY) + t(P) − t(P) = (t(P) − t(PY)) − (t(P) − t(PX)) = d(PY) − d(PX);
sup(PX) = sup(P) − |d(PX)|.

Thus, the difference between the prefix TidSet and the DiffSet can be used to quickly calculate the support of an itemset. Details of the DiffSet concept and its construction process can be found in (Zaki and Gouda, 2003). The enhanced algorithm that uses DiffSet to mine FPs quickly is named FP-MEDiffSet.

The major differences between the FP-ME and FP-MEDiffSet algorithms are threefold. (1) The latter needs to construct the DiffSets in the second database scan (Algorithm 1, Line 4). (2) It uses the DiffSet strategy to quickly obtain the support of the processed node (based on sup(PX) = sup(P) − |d(PX)| (Zaki and Gouda, 2003)), and then determines whether this pattern and its extension nodes should be explored (Algorithm 2, Line 2). (3) In addition, FP-MEDiffSet calculates Xab.DiffSet by merging Xa.DiffSet and Xb.DiffSet. Thanks to the use of the DiffSet structure, it is not necessary to scan the database to calculate the support of itemsets (Algorithm 2, Line 8). Thus, the FP-MEDiffSet algorithm has better performance than the baseline FP-ME algorithm. Experimental comparisons on both dense and sparse databases showed that DiffSets deliver order-of-magnitude performance improvements over the state-of-the-art methods, which will be discussed in the next section.
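The DiffSet relationships can be verified on concrete TID-sets. The sketch below uses hypothetical TID-sets for the prefix P = (c) and the items X = a, Y = b (taken from a toy database, not from the paper's Table 2) and checks each identity against the plain TID-set computation.

```python
# Hypothetical TID-sets (assumptions for illustration only).
t_P = {0, 1, 2, 3, 5, 6, 8, 9}   # t(P), prefix P = (c)
t_X = {2, 3, 5, 6, 7, 8}         # t(X), X = a
t_Y = {0, 1, 2, 3, 4}            # t(Y), Y = b

# TID-sets of the combined patterns, for reference.
t_PX = t_P & t_X
t_PY = t_P & t_Y
t_PXY = t_PX & t_PY

# DiffSets with respect to the prefix: d(PX) = t(P) - t(X).
d_PX = t_P - t_X
d_PY = t_P - t_Y
# d(PXY) = d(PY) - d(PX), which equals t(PX) - t(PY).
d_PXY = d_PY - d_PX

# Supports recovered from DiffSet sizes only.
sup_PX = len(t_P) - len(d_PX)      # sup(PX)  = sup(P)  - |d(PX)|
sup_PXY = sup_PX - len(d_PXY)      # sup(PXY) = sup(PX) - |d(PXY)|

print(sorted(d_PX), sorted(d_PXY), sup_PX, sup_PXY)
```

Only the differences are stored at each level, so on dense data the sets being intersected tend to be much smaller than the corresponding TID-sets.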

Fig. 5. Runtime w.r.t. a fixed LMS under various β.

5. Experimental results

In this section, substantial experiments were conducted to verify the effectiveness and efficiency of the proposed baseline FP-ME algorithm (FP-MEbaseline for short) and the improved FP-MEDiffSet algorithm. Several studies have addressed the topic of mining FPs with multiple minimum supports. Since it has been shown that the CFP-growth++ algorithm (Kiran and Reddy, 2011) significantly outperforms MSApriori (Liu et al., 1999) and CFP-growth (Hu and Chen, 2006), the state-of-the-art CFP-growth++ algorithm was executed to derive FPs, which provides a benchmark to verify the efficiency of the two proposed ME-tree-based algorithms.

5.1. Test environment and datasets

All algorithms were implemented in Java and run on a computer with an Intel Core2 Duo 2.8 GHz processor and 4 GB of main memory, running the 64-bit Microsoft Windows 7 operating system. Both real-life (Frequent itemset mining dataset repository, 2012) and synthetic (Agrawal and Srikant, 1994) datasets were used in the experiments to validate the effectiveness of the proposed algorithms. Three real-life datasets, kosarak, BMSPOS and chess, were obtained from the public FIMI Repository (http://fimi.cs.helsinki.fi) (Frequent itemset mining dataset repository, 2012). The kosarak dataset is a very sparse dataset containing 990,002 sequences of click-stream data from a Hungarian news portal. The BMSPOS dataset contains several years of point-of-sale data from a large electronics retailer and is a large sparse dataset. The chess dataset is derived from the UCI chess dataset and is a very dense dataset. The T10I4D100K dataset is a synthetic database generated by the IBM Quest dataset generator (Agrawal and Srikant, 1994). The parameters and characteristics of the used datasets are shown in Table 5 and Table 6, respectively. The source code of the CFP-growth and CFP-growth++ algorithms, as well as the tested datasets, can be downloaded from the SPMF data mining library (http://www.philippe-fournier-viger.com/spmf/) (Fournier-Viger et al., 2014).

Furthermore, the method discussed in MSApriori (Liu et al., 1999) is adopted in the proposed FP-ME algorithm to automatically assign multiple item supports to items. The methodology is as follows:

MIS(ij) = max[β × f(ij), LMS],

where β is the constant used to set the MIS values of items as a function of their frequency (or support). To ensure randomness and diversity, β was set in the [0.0, 1.0] interval for the datasets. The parameter LMS is the user-specified least minimum itemset support allowed, and f(ij) is the frequency (or support) of an item ij. Notice that if β is set to zero, then a single minimum item support value LMS is used for all items, which is equivalent to traditional FPM. If β = 1 and f(ij) ≥ LMS, then MIS(ij) = f(ij).
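The assignment rule can be written as a one-line function. The sketch below is only an illustration of the formula; the function name `assign_mis` and the two item frequencies are hypothetical values chosen to show both branches of the maximum.

```python
def assign_mis(freq, beta, lms):
    """MIS(i) = max(beta * f(i), LMS) for every item i.

    freq maps each item to its relative frequency (support) in [0, 1].
    """
    return {item: max(beta * f, lms) for item, f in freq.items()}


# Hypothetical item frequencies: one common item and one rare item.
freq = {"x": 0.50, "y": 0.02}
lms = 0.05

mis_half = assign_mis(freq, beta=0.5, lms=lms)   # x scaled, y clamped to LMS
mis_zero = assign_mis(freq, beta=0.0, lms=lms)   # uniform threshold: classic FPM
mis_one = assign_mis(freq, beta=1.0, lms=lms)    # MIS(x) equals f(x)

print(mis_half, mis_zero, mis_one)
```

A larger β raises the thresholds of common items while rare items stay at the LMS floor, which is how the rule addresses the “rare item problem”.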

Table 5 Parameters of used datasets.
#|D|      Total number of transactions
AvgLen    Average transaction length
MaxLen    Maximal transaction length
#|I|      Number of distinct items

Table 6 Characteristics of used datasets.
Dataset       #|D|       AvgLen   MaxLen   #|I|
kosarak       990,002    8.1      2498     41,270
BMSPOS        515,597    6.5      164      1657
chess         3196       23       23       75
T10I4D100K    100,000    10.1     29       870

5.2. Runtime

In this section, we compare the runtime of the two proposed algorithms and the CFP-growth++ algorithm. For each tested dataset, the parameter β was randomly set as a fixed number for each item, and the runtime was evaluated w.r.t. a fixed LMS under various β, and w.r.t. a fixed β under various LMS. Fig. 4 shows the runtime of the algorithms under various LMS on the four datasets. From Fig. 4, it can be observed that both the FP-MEbaseline and the improved FP-MEDiffSet algorithms outperform the CFP-growth++ algorithm on the four datasets under various LMS with a fixed β. Moreover, FP-MEDiffSet has, in general, the best performance among them. For example, Fig. 4(b) shows the runtime of the three algorithms on the BMSPOS dataset. It can be clearly observed that FP-MEDiffSet is about one to two orders of magnitude faster than the CFP-growth++ algorithm. The reason is that the pattern-growth approach CFP-growth++ consists of three phases. It first scans the database once to construct a global MIS-Tree. Then, the MIS-Tree is restructured to reduce the search space using four pruning techniques. Finally, CFP-growth++ recursively mines the tree by creating projected trees to generate all desired itemsets. This process is time-consuming, so it performs worse than the two proposed FP-ME algorithms. As shown in Fig. 4(d), when LMS is set to a quite low value, the CFP-growth++ algorithm has to generate a large number of conditional CFP-trees for deriving FPs, which is very time-consuming. The ME-tree-based algorithms, which have the “mining when structuring” property, rely on the SDC property and the LMS pruning strategy to prune a huge number of unpromising patterns. They avoid the costly join operations of those unpromising patterns, thus considerably improving the performance of the mining process. Thanks to the use of TID-Sets, the two proposed algorithms avoid multiple database scans by directly calculating the related supports from their TID-Set structures. Besides, the compact vertical data structure DiffSet can be used to quickly calculate the supports of the processed patterns. Hence, FP-MEbaseline and FP-MEDiffSet are considerably faster than the state-of-the-art CFP-growth++ algorithm, and the improved FP-MEDiffSet requires considerably fewer computations than the baseline algorithm.

The runtime results under various β with a fixed LMS for the different datasets are shown in Fig. 5. Fig. 5 also shows that both FP-MEbaseline and FP-MEDiffSet outperform the CFP-growth++ algorithm on the four datasets under various β with a fixed LMS, and that FP-MEDiffSet has the best performance among them. For example, when LMS was set to 0.0008% and β was set to 0.5 for the BMSPOS dataset, the runtimes of the three algorithms, CFP-growth++, FP-MEbaseline and FP-MEDiffSet, were respectively 366 s, 26 s and 22 s, as can be observed in Fig. 5(b). This result is reasonable, and the reasons are the same as stated before. Furthermore, when β is increased, all the compared algorithms take less time to find FPs. The reason is that when β is set to a large value, the actual minimum support threshold of each item is also set to a larger value based on the presented equation. Hence, less execution time is required by each algorithm and fewer FPs are produced. A very interesting result is that the performance gap between the proposed algorithms and CFP-growth++ becomes larger when β is increased, as shown in Figs. 4(b) and 4(d). This is reasonable since the formula MIS(ij) = max[β × f(ij), LMS] ensures that the MIS values of items in the experimental task are closer to real-world situations with a larger β. It indicates that β plays a significant role when the SDC property and the LMS strategy are used to prune the search space.

Fig. 6. Memory usage w.r.t. a fixed β with various LMS.

5.3. Memory Usage

With the same test parameters, we also assessed the memory consumption of the compared algorithms. Memory measurements were made using the Java API, and the peak memory usage of each algorithm was recorded for all datasets. The results under various LMS values with a fixed β and under various β values with a fixed LMS are shown in Figs. 6 and 7, respectively. From Figs. 6 and 7, it can be seen that the proposed FP-MEbaseline and the improved FP-MEDiffSet algorithms require less memory than the state-of-the-art CFP-growth++ algorithm under all parameter settings on all tested datasets, by up to a factor of 8, as can be observed in Fig. 7(a). Specifically, it can be seen that the two proposed ME-tree-based algorithms require nearly constant memory under various parameters. The memory usage of the CFP-growth++ algorithm, which adopts the candidate generation-and-test mechanism, increases dramatically as LMS or β decreases, while the memory usage of the two proposed algorithms remains stable. For example, consider the kosarak-100k dataset with β set to 0.5 and LMS varied in the range [0.084%, 0.124%]. As shown in Fig. 6(a), when LMS was set to 0.084%, the three algorithms CFP-growth++, FP-MEbaseline and FP-MEDiffSet consumed 159 MB, 28 MB and 59 MB of memory, respectively; when LMS was set to 0.124%, their memory usage was 81 MB, 29 MB and 62 MB, respectively.

The reason is that the FP-MEbaseline and FP-MEDiffSet algorithms adopt the compact vertical data structures TID-Set and DiffSet to store the necessary information from the databases, and they utilize the “mining during constructing” property to directly discover the FPs without generating a huge number of unpromising patterns. The pattern-growth CFP-growth++ approach, however, needs to perform an exhaustive search on the constructed conditional trees to mine the complete set of FPs, which is costly in both time and memory. In addition, the FP-MEDiffSet algorithm requires slightly more memory than the baseline FP-MEbaseline algorithm. This result is reasonable since the vertical data structure DiffSet is adopted to keep track of the differences in the TIDs of a prefix pattern from its generating FPs. From the above results, it can be concluded that the memory requirements of the two proposed ME-tree-based algorithms are acceptable for real-world applications.

Fig. 7. Memory usage w.r.t. a fixed LMS with various β.

5.4. Scalability Test

The scalability of the proposed methods was further evaluated by performing experiments on the real-life dataset kosarak, with its size varied from 100k to 500k in increments of 100k. Fig. 8 shows the runtime, memory consumption and number of derived patterns for the three compared algorithms when LMS and β were set to 0.001% and 0.5, respectively. Note that the number of FPs derived by the traditional FPM algorithm with a single minimum support is denoted as FPs*, whereas CFP-growth++ and the two proposed algorithms derive FPs with multiple minimum supports. From Fig. 8, it can be observed that the runtime and memory usage of all the compared algorithms increase approximately linearly with the dataset size |X|. FP-MEbaseline and the improved FP-MEDiffSet scale significantly better than CFP-growth++ in terms of runtime and memory usage. As shown in Figs. 8(a) and 8(b), FP-MEDiffSet has the smallest runtime and FP-MEbaseline has the lowest memory usage among the three compared algorithms when the dataset size is varied from 100k to 500k. The reasons have been discussed above. When |X| increases, both the transaction length and the possibility of having larger itemsets in the transaction database also increase. Therefore, the three algorithms must recursively generate more conditional patterns (conditional FPs in CFP-growth++ and visited nodes in the proposed approaches) to discover FPs, and the execution time increases significantly, as shown in Fig. 8(a). In particular, the larger the database size |X|, the larger the gap between CFP-growth++ and the two proposed algorithms, while the runtime of FP-MEbaseline remains close to that of FP-MEDiffSet.
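The runtime and memory trade-off between the TID-set and DiffSet representations can be illustrated with a minimal Python sketch of the diffset idea of Zaki and Gouda (2003). It is not the authors' FP-MEDiffSet implementation: the toy database and the helper names `build_tidsets`, `diffset` and `extend` are assumptions made purely for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy database: transaction ID (TID) -> items in that transaction.
DB: Dict[int, Set[str]] = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"a", "b", "c"},
    5: {"b", "c"},
}


def build_tidsets(db: Dict[int, Set[str]]) -> Dict[str, Set[int]]:
    """Vertical layout: map each item to the set of TIDs that contain it."""
    tidsets: Dict[str, Set[int]] = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def diffset(t_prefix: Set[int], t_item: Set[int]) -> Set[int]:
    """d(PX) = t(P) - t(X): TIDs that contain P but not the extension X."""
    return t_prefix - t_item


def extend(sup_px: int, d_px: Set[int], d_py: Set[int]) -> Tuple[int, Set[int]]:
    """d(PXY) = d(PY) - d(PX) and sup(PXY) = sup(PX) - |d(PXY)|."""
    d_pxy = d_py - d_px
    return sup_px - len(d_pxy), d_pxy


T = build_tidsets(DB)

# Prefix P = {a}; extensions X = b and Y = c.
d_ab = diffset(T["a"], T["b"])
d_ac = diffset(T["a"], T["c"])
sup_ab = len(T["a"]) - len(d_ab)

# Support of {a, b, c} computed from diffsets only.
sup_abc, d_abc = extend(sup_ab, d_ab, d_ac)

# Reference value obtained by plain TID-set intersection.
sup_abc_tidset = len(T["a"] & T["b"] & T["c"])
```

In this toy example both routes give a support of 2 for {a, b, c}. The diffset route only manipulates the small difference sets once the first level has been built, which is the kind of saving in support counting that the DiffSet concept is intended to provide.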


Fig. 8. Scalability of the compared algorithms w.r.t. different dataset sizes.

Besides, it can be clearly seen that the two proposed algorithms require less memory than the state-of-the-art CFP-growth++ algorithm over a wide range of dataset sizes. From Fig. 8(c), it can be observed that the number of FPs derived for the addressed FP-MMS problem is always less than FPs* for the different dataset sizes of kosarak. This means that the traditional FPM algorithm with a single uniform minimum support may easily produce a considerable number of redundant results. With the proposed FP-ME algorithm, the “rare item problem” can be effectively avoided, and fewer but more useful and interesting FPs can be discovered. From the above results, it can be concluded that the proposed algorithm, in both its baseline and enhanced versions, avoids the “rare item problem” and is more scalable than the previous algorithms.

Based on the above extensive experiments and comprehensive analysis, it can be seen that the proposed FP-ME algorithm is an efficient way of mining frequent patterns with MMS, but it also has several limitations. First, it is designed for handling itemset-based data, but not quantitative itemset-based data (Huang, 2013). Second, the itemset-based approach is not suitable for processing sequence databases that contain time-sequential information. Finally, mining frequent patterns with multiple minimum supports from dynamic databases is more challenging than the addressed FP-MMS problem in static databases.

6. Conclusions

In this paper, we proposed a novel FP-MMS algorithm, named FP-ME, for mining FPs with multiple minimum supports. Based on the designed Set-enumeration-tree with multiple minimum supports (ME-tree), a novel sorted downward closure (SDC) property was proposed. In order to prune the search space with the global anti-monotonicity of FPs and to guarantee the completeness and correctness of the derived results, the least minimum support (LMS) concept was extended in the ME-tree to mine FPs. Unlike the candidate generate-and-test approach, the FP-ME algorithm can directly discover FPs by spanning the ME-tree using two pruning strategies, without multiple database scans. In addition, an improved algorithm that adopts the DiffSet concept was developed to speed up the mining process by reducing the cost of database scans and pruning the search space. Experiments were conducted to evaluate the effectiveness and the efficiency of the proposed algorithms for deriving FPs with multiple minimum supports. The results show that the two proposed algorithms significantly outperform the state-of-the-art CFP-growth++ algorithm in terms of execution time, memory usage and scalability. Specifically, the improved algorithm outperforms the baseline one in execution time, while requiring slightly more memory. With the proposed FP-ME algorithm, the “rare item problem” can be effectively avoided. Fewer but more useful and interesting FPs can be discovered, which can be used to aid managers or be applied in expert and intelligent systems for making more efficient decisions.
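As a small illustration of how item-specific thresholds avoid the “rare item problem” while still returning fewer patterns than a single low threshold, the following Python sketch assigns each item its own minimum support and tests every itemset against the lowest threshold among its items. The assignment rule MIS(i) = max(β × sup(i), LMS) follows Liu et al. (1999) and is assumed here. The toy database, the helper names and the parameter values are hypothetical, and the brute-force enumeration is deliberately naive rather than the ME-tree search of FP-ME.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set

# Hypothetical toy database; "caviar" is the rare item.
DB: List[Set[str]] = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk", "caviar"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"bread", "caviar"},
    {"eggs"},
    {"bread", "milk", "eggs"},
    {"eggs"},
]


def support(itemset: Set[str], db: List[Set[str]]) -> float:
    """Relative support: fraction of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)


def assign_mis(db: List[Set[str]], beta: float, lms: float) -> Dict[str, float]:
    """Assumed rule (Liu et al., 1999): MIS(i) = max(beta * sup(i), LMS)."""
    items = set().union(*db)
    return {i: max(beta * support({i}, db), lms) for i in items}


def frequent_itemsets(
    db: List[Set[str]], threshold_of: Callable[[Set[str]], float]
) -> List[FrozenSet[str]]:
    """Naive enumeration of all itemsets; keeps those meeting their threshold."""
    items = sorted(set().union(*db))
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = set(combo)
            if support(candidate, db) >= threshold_of(candidate):
                found.append(frozenset(candidate))
    return found


mis = assign_mis(DB, beta=0.5, lms=0.15)

# Multiple minimum supports: threshold is the lowest MIS among the items.
fps_mms = frequent_itemsets(DB, lambda s: min(mis[i] for i in s))
# Single uniform thresholds for comparison.
fps_low = frequent_itemsets(DB, lambda s: 0.15)
fps_high = frequent_itemsets(DB, lambda s: 0.30)
```

On this toy data, the uniform threshold of 0.30 yields 4 patterns and loses every pattern containing the rare item, the uniform threshold of 0.15 yields 8 patterns, and the multiple-threshold setting yields 6 patterns that still include {bread, caviar}.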

As mentioned before, the proposed FP-ME algorithm has some limitations. In future work, extending the proposed concepts and techniques to handle quantitative data (Hong et al., 1999) and sequence data (Srikant and Agrawal, 1996; Zhang et al., 2014) is an interesting and challenging topic. We would also like to extend our work to mine frequent patterns from dynamic databases (Hong et al., 2008).

Acknowledgement

This research was partially supported by the National Natural Science Foundation of China (NSFC) under grant No. 61503092 and by the Tencent Project under grant CCF-Tencent IAGR20160115.

References

Agrawal, R., Srikant, R., 1994. Quest synthetic data generator. Available: 〈http://www.Almaden.ibm.com/cs/quest/syndata.html〉.
Chen, M.S., Han, J., Yu, P.S., 1996. Data mining: an overview from a database perspective. IEEE Trans. Knowl. Data Eng. 8 (6), 866–883.
Fournier-Viger, P., Gomariz, A., Gueniche, T., Soltani, A., Wu, C.W., Tseng, V.S., 2014. SPMF: a Java open-source pattern mining library. J. Mach. Learn. Res. 15 (1), 3389–3393.
Frequent itemset mining dataset repository, 2012. Available: 〈http://fimi.ua.ac.be/data/〉.
Gan, W., Lin, J.C.W., Fournier-Viger, P., Chao, H.C., 2016. Mining recent high expected weighted itemsets from uncertain databases. Asia-Pacific Web Conference, pp. 581–593.
Geng, L., Hamilton, H.J., 2006. Interestingness measures for data mining: a survey. ACM Comput. Surv. 38 (3), Article 9.
Grahne, G., Zhu, J., 2005. Fast algorithms for frequent itemset mining using FP-trees. IEEE Trans. Knowl. Data Eng. 17 (10), 1347–1362.
Han, J., Pei, J., Yin, Y., Mao, R., 2004. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min. Knowl. Discov. 8 (1), 53–87.
Hong, T.P., Kuo, C.S., Chi, S.C., 1999. Mining association rules from quantitative data. Intell. Data Anal. 3 (5), 363–376.
Hong, T.P., Lin, C.W., Wu, Y.L., 2008. Incrementally fast updated frequent pattern trees. Expert Syst. Appl. 34 (4), 2424–2435.
Hong, T.P., Wu, Y.Y., Wang, S.L., 2009. An effective mining approach for up-to-date patterns. Expert Syst. Appl. 36 (6), 9747–9752.
Hu, Y.H., Chen, Y.L., 2006. Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism. Decis. Support Syst. 42 (1), 1–24.
Huang, T.C.K., 2013. Discovery of fuzzy quantitative sequential patterns with multiple minimum supports and adjustable membership functions. Inf. Sci. 222, 126–146.
Kiran, R.U., Reddy, P.K., 2011. Novel techniques to reduce search space in multiple minimum supports-based frequent pattern mining algorithms. In: Proceedings of the ACM 14th International Conference on Extending Database Technology, pp. 11–20.
Lee, Y.C., Hong, T.P., Lin, W.Y., 2005. Mining association rules with multiple minimum supports using maximum constraints. Int. J. Approx. Reason. 40 (1), 44–54.
Lee, Y.C., Hong, T.P., Wang, T.C., 2006. Mining fuzzy multiple-level association rules under multiple minimum supports. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 5, pp. 4112–4117.
Lee, Y.C., Hong, T.P., Lin, W.Y., 2004. Mining fuzzy association rules with multiple minimum supports using maximum constraints. Knowl.-Based Intell. Inf. Eng. Systems, pp. 1283–1290.
Lin, C.W., Hong, T.P., Lu, W.H., 2009. The pre-FUFP algorithm for incremental mining. Expert Syst. Appl. 36 (5), 9498–9505.
Lin, J.C.W., Gan, W., Fournier-Viger, P., Hong, T.P., 2015a. RWFIM: Recent weighted-frequent itemsets mining. Eng. Appl. Artif. Intell. 45, 18–32.
Lin, J.C.W., Gan, W., Fournier-Viger, P., Hong, T.P., 2015b. Mining high-utility itemsets with multiple minimum utility thresholds. In: Proceedings of the ACM International C* Conference on Computer Science & Software Engineering, pp. 9–17.
Lin, J.C.W., Gan, W., Hong, T.P., Tseng, V.S., 2015c. Efficient algorithms for mining up-to-date high-utility patterns. Adv. Eng. Inform. 29 (3), 648–661.
Lin, J.C.W., Gan, W., Fournier-Viger, P., Hong, T.P., Tseng, V.S., 2015d. Weighted frequent itemset mining over uncertain databases. Appl. Intell. 44 (1), 232–250.
Lin, J.C.W., Gan, W., Fournier-Viger, P., Hong, T.P., 2016. Mining discriminative high utility patterns. Asian Conf. Intell. Inf. Database Syst. 9622, 219–229.
Liu, Y.C., Cheng, C.P., Tseng, V.S., 2011. Discovering relational-based association rules with multiple minimum supports on microarray datasets. Bioinformatics 27 (22), 3142–3148.
Liu, B., Hsu, W., Ma, Y., 1999. Mining association rules with multiple minimum supports. In: Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337–341.
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1998. Pruning closed itemset lattices for association rules. In: Proceedings of the International Conference on Advanced Databases, pp. 177–196.
Pei, J., Han, J., 2002. Constrained frequent pattern mining: a pattern-growth view. ACM SIGKDD Explor. Newsl. 4 (1), 31–39.
Pei, J., Han, J., Lu, H., Nishio, S., Tang, S., 2001. H-Mine: Hyper-structure mining of frequent patterns in large databases. In: Proceedings of the IEEE International Conference on Data Mining, pp. 441–448.
Rage, U.K., Kitsuregawa, M., 2014. Efficient discovery of correlated patterns using multiple minimum all-confidence thresholds. J. Intell. Inf. Syst. 45 (3), 357–377.
Rymon, R., 1992. Search through systematic set enumeration. In: Proceedings of the International Conference Principles of Knowledge Representation and Reasoning, pp. 539–550.
Schlegel, B., Gemulla, R., Lehner, W., 2011. Memory-efficient frequent-itemset mining. In: Proceedings of the ACM International Conference on Extending Database Technology, pp. 461–472.
Srikant, R., Agrawal, R., 1996. Mining sequential patterns: Generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology: Advances in Database Technology, pp. 3–17.
Tseng, M.C., Lin, W.Y., 2007. Efficient mining of generalized association rules with nonuniform minimum support. Data Knowl. Eng. 62 (1), 41–64.
Vo, B., Coenen, F., Le, B., 2013. A new method for mining frequent weighted itemsets based on wit-trees. Expert Syst. Appl. 40 (4), 1256–1264.
Zaki, M., Hsiao, C., 2002. CHARM: An efficient algorithm for closed itemset mining. In: Proceedings of the SIAM International Conference on Data Mining, vol. 2, pp. 457–473.
Zaki, M.J., Gouda, K., 2003. Fast vertical mining using diffsets. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 326–335.
Zhang, B., Lin, C.W., Gan, W., Hong, T.P., 2014. Maintaining the discovered sequential patterns for sequence insertion in dynamic databases. Eng. Appl. Artif. Intell. 35, 131–142.
