FPO tree and DP3 algorithm for distributed parallel Frequent Itemsets Mining

Van Quoc Phuong Huynh*, Josef Küng

Institute for Application Oriented Knowledge Processing (FAW), Johannes Kepler University, Science Park 3, Altenberger Straße 66b, 4040 Linz, Austria
Article history: Received 1 February 2019; Revised 12 June 2019; Accepted 14 August 2019; Available online 31 August 2019

Keywords: Frequent Itemsets Mining; Parallel; Distributed; Data Mining; Big Data; Prefix tree
Abstract

Frequent Itemsets Mining is a fundamental mining model in Data Mining. It supports a vast range of application fields and can be employed as a key calculation phase in many other mining models such as Association Rules, Correlations, Classifications, etc. Many distributed parallel algorithms have been introduced to confront the very large-scale datasets of Big Data. However, the problems of running time and memory scalability still lack adequate solutions for very large and hard-to-mine datasets. In this paper, we propose a distributed parallel algorithm named DP3 (Distributed PrePostPlus) which parallelizes the state-of-the-art algorithm PrePost+ and operates in the Master-Slaves model. Slave machines mine and send local frequent itemsets and support counts to the Master for aggregation. With tremendous numbers of itemsets transferred between the Slaves and the Master, the computational load at the Master would be extremely heavy without the support of our complete FPO tree (Frequent Patterns Organization), which provides optimal compactness for light data transfers and highly efficient aggregations with pruning ability. The processing phases of the Slaves and the Master are designed for memory scalability and are parallelized in the shared-memory Work-Pool model so as to utilize the computational power of multi-core CPUs. We conducted experiments on both synthetic and real datasets, and the empirical results show that our algorithm far outperforms the well-known PFP algorithm as well as three other recent high-performance algorithms: Dist-Eclat, BigFIM, and MapFIM.
1. Introduction

Since being introduced by Agrawal and Srikant (1994), Frequent Itemsets Mining (FIM) has become popular and has remained a fundamental and important mining model in Data Mining. It can be employed as a primary calculation phase in many other mining models and applied in a wide range of fields, from the social to the natural sciences. FIM counts the frequencies of co-occurring items, called itemsets, in the transactions of distinct items of a dataset. The mining model reports only the itemsets whose frequencies are no less than a given threshold. Formally, the mining model can be briefly described as follows. Given a dataset of n transactions D = {T1, T2, …, Tn} over a set of m unique items I = {i1, i2, …, im}, with Tj ⊆ I (1 ≤ j ≤ n). A k-itemset (or itemset for short), which is a set of k items (1 ≤ k ≤ m), possesses an attribute, the support count, which is the number of transactions containing the itemset. For a given support threshold ε, expressed as a fraction of the transactions in the whole dataset D, an itemset is called a frequent itemset iff its support count ≥ ε ∗ n.
* Corresponding author. E-mail addresses: [email protected] (V.Q.P. Huynh), [email protected] (J. Küng). https://doi.org/10.1016/j.eswa.2019.112874
The objective of FIM is to discover all frequent itemsets existing in D. The first FIM algorithm, named Apriori, was proposed by Agrawal and Srikant (1994), and it inspired many subsequent pieces of research (Park, Chen & Yu, 1997; Perego, Orlando & Palmerini, 2001; Savasere, Omiecinski & Navathe, 1995) aiming at reducing the I/O cost and enhancing the performance. However, Apriori-like algorithms generally suffer from two drawbacks: a deluge of generated candidate itemsets and/or I/O overhead caused by repeatedly scanning the dataset. Three other approaches, which are much more efficient than the Apriori-like algorithms, were also proposed to enhance the performance: 1. Frequent Pattern Growth: The FP-Growth algorithm (Han, Pei & Yin, 2000) avoids repeated dataset scans and generation-and-test by adopting a compact prefix tree named FP-Tree to store the transactions and by utilizing a divide-and-conquer strategy to discover frequent itemsets directly, which makes FP-Growth much more efficient than the Apriori-like algorithms. Many researchers have been inspired by this approach. Sun and Zambreno (2008) proposed a hardware implementation of FP-Growth through a new hardware architecture based on a systolic tree structure adapted from the FP-Tree. FP-Growth*
(Grahne & Zhu, 2005) enhances FP-Growth by using an array-based structure named FP-Array in combination with the FP-Tree to reduce the need for tree traversals. Liu et al. (2004) use a tree structure, AFOPT, with an ascending frequency order of items to represent the conditional databases, and a top-down traversal (FP-Growth uses a bottom-up one) to reduce the total number of conditional trees and the traversal cost. The drawbacks of algorithms in this approach are that they require more memory for the tree structures and the many recursive conditional trees, and that they become less efficient when datasets are sparse. 2. Vertical Data Format: A featured algorithm of this approach is Eclat (Zaki, 2000), in which each distinct item is associated with the set of transaction identifiers (Tids) of the transactions containing the item. This approach avoids scanning the dataset repeatedly, but the running time increases quickly and huge memory is expended for the Tids of large numbers of frequent itemsets, especially with large and/or dense datasets. 3. Hybrid Approach: Recently, two state-of-the-art algorithms, FIN (Deng & Lv, 2014) and PrePost+ (Deng & Lv, 2015), were introduced. These two algorithms use the FP-Tree-like structures POC-Tree and PPC-Tree to generate and then mine on compact Tids-like data structures called Nodeset and N-list, respectively. The two algorithms possess very high performance and have beaten many previously well-known ones. Inspired by these algorithms, Huynh, Küng and Dang (2017a) proposed the IPPC-Tree and the IFIN algorithm, which provide incremental tree construction and mining for FIN. The authors continued with a shared-memory parallel version of IFIN, called IFIN+ (Huynh, Küng, Jäger & Dang, 2017b). Thanks to the throughput increase and computation saving in scenarios of incremental data accumulation, especially when mining at different support thresholds on an unchanged dataset, the execution times of IFIN and IFIN+ outperform those of FP-Growth, FIN, and PrePost+. Hard-to-mine is a term used in this article to indicate very large-scale datasets which have huge numbers of long frequent itemsets and may cause some algorithms to discover frequent itemsets on highly unbalanced subspaces because of the heavily uneven distribution of frequent items at the demanded support thresholds. In the Big Data era, sequential algorithms face problems of running time and memory scalability on very large-scale and hard-to-mine datasets such as Webdocs (Lucchese et al., 2004), a real-life huge dataset which possesses more than 165.7 million frequent itemsets at support threshold 5%. Many researchers have expended much effort on distributed parallel approaches to solve these problems. However, the hardware cost and achieved performance have not yet satisfied end-users and researchers, and the problems remain a challenge to the community. In this paper, we propose DP3 (Distributed PrePostPlus), a distributed and shared-memory parallel algorithm which employs the state-of-the-art serial algorithm PrePost+ to mine local frequent itemsets. The features of the algorithm are summarized briefly as follows. 1. DP3 operates in the Master-Slaves model.
Here, each slave machine executes our shared-memory parallelization of the PrePost+ algorithm to discover local frequent itemsets from a separate data partition, calculates support counts for locally infrequent itemsets, and sends the support counts and local frequent itemsets to the Master for aggregation. 2. In cases of immense numbers of itemsets exchanged between the Slaves and the Master, the computational load at the Master would be extremely heavy without the support of our complete FPO tree (Frequent Patterns Organization), which can
provide optimal compactness for light data transfers and highly efficient aggregations with pruning ability. 3. Like the shared-memory parallelization of PrePost+, the other processing parts (such as the FPO tree operations and the support-count calculation for locally infrequent itemsets) at the Slaves and the Master are also parallelized in the shared-memory Work-Pool model to exploit the computational power of multi-core CPUs as fully as possible. 4. Memory scalability and load balance, for both the distributed and the shared-memory parallelization, are thoroughly considered to ensure successful, high-performance executions.

The rest of the paper is organized as follows. Section 2 reports related work. The FPO tree and the DP3 algorithm are presented in Sections 3 and 4, respectively. Section 5 discusses load balance and memory scalability and is followed by the experiments in Section 6. Finally, Section 7 finishes the paper with some conclusions.

2. Related work

The Big Data era comes with very large-scale datasets and brings challenges of running time and memory scalability to the available data analysis algorithms, and FIM algorithms are not outside this influence. As a potential approach to tackling these challenges, many distributed parallel FIM algorithms have been proposed. Besides some framework-independent algorithms (Pramudiono & Kitsuregawa, 2003; Zaki et al., 1997) and others (Agrawal & Shafer, 1996; Yu & Zhou, 2010) implemented on the MPI (Message Passing Interface) framework, most recent parallel FIM algorithms (Duong et al., 2017; Li et al., 2008; Lin & Gu, 2016; Lin, Lee & Hsueh, 2012; Liu et al., 2015; Makanju et al., 2016; Moens, Aksehirli & Goethals, 2013; Xun, Zhang & Qin, 2016; Zhou et al., 2010) have been designed on the MapReduce programming model (Dean & Ghemawat, 2008) with the two open-source frameworks Apache Hadoop (2011) and Apache Spark (2014). Spark provides higher performance than Hadoop since it keeps intermediate results in memory for the next processing phases instead of writing them to hard disks as Hadoop does; because of this, applications running on Spark often require huge memory. Regarding parallelization based on serial FIM algorithms, we divide the available distributed parallel FIM algorithms into the following categories.

Apriori-based: distributed parallel solutions for the serial Apriori-like algorithms. Agrawal and Shafer (1996) proposed Count Distribution (CD), in which each process develops local support counts for all candidate k-itemsets with respect to its local data partition and then receives the local support counts from the other processes to filter the frequent k-itemsets, which are then used to generate all candidate (k+1)-itemsets for the next iteration. Inspired by the CD algorithm, Lin, Lee and Hsueh (2012) introduced three algorithms on MapReduce: Single Pass Counting (SPC), Fixed Passes Combined-counting (FPC), and Dynamic Passes Combined-counting (DPC). The PApriori algorithm by Li, Zeng, He and Shi (2012) is similar to SPC with a minor difference in implementation, and Qiu, Gu, Yuan and Huang (2014) provided another implementation of Apriori on Spark, called YAFIM. Overall, the algorithms in this category still inherit the drawbacks of the serial Apriori-like algorithms, repetitive dataset scans and numerous generated candidate itemsets, which cause very long running times when mining big datasets.

FP-Growth-based: distributed parallel solutions for the serial FP-Growth algorithm family. The first algorithm was proposed by Pramudiono and Kitsuregawa (2003).
Here, each process builds a local FP-Tree based on the frequent item list and its local data partition. From the local
FP-Trees, local Conditional Pattern Bases of each frequent item are generated and exchanged among the processes. Each group of global Conditional Pattern Bases is collected by one process, which then mines the corresponding global Conditional FP-Tree constructed from the group. Based on this key idea, many variant algorithms have been introduced. An algorithm named PFP (Li et al., 2008) on the MapReduce model was introduced and has become popular. Yu and Zhou (2010) proposed two algorithms to improve the PFP algorithm: the TPFP-tree, where each process builds a transaction identification set (Tidset) to directly select and send transactions to other processes instead of scanning the local dataset; and the BTP-tree, an improved version which balances the loads according to the computational abilities of the nodes in heterogeneous environments. However, BTP-tree only achieves balance in the ideal case where the computational loads of the global Conditional FP-Trees are roughly equal to each other. Zhou et al. (2010) proposed the BPFP algorithm on MapReduce, which improves the load balance among the reducers in the second phase of PFP. BPFP estimates the calculation loads of the global Conditional FP-Trees based on the heights of the trees, which in fact is not enough to reflect the true loads. The algorithm also cannot handle the case of one huge global Conditional FP-Tree, since the load of that tree is not shared by more than one Reducer. To improve the load balance, Makanju et al. (2016) introduced a Parent-Child MapReduce model which allows MapReduce tasks to be created dynamically and synchronized in a hierarchical parent-child fashion, and they applied this programming model to the PFP algorithm. The authors showed that the model can significantly improve the performance of PFP. However, their method requires predicting the computational loads of the global Conditional FP-Trees, which is a serious challenge. In a common view, the algorithms in this approach suffer from three inherently crucial drawbacks: (1) much overlap among the global Conditional FP-Trees, which requires extreme network traffic for exchanging Conditional Pattern Bases among the processes; (2) highly unbalanced global Conditional FP-Trees, often caused by hard-to-mine datasets; (3) many recursive Conditional FP-Trees generated when mining huge global Conditional FP-Trees, which causes out-of-memory failures.

Eclat-based: distributed parallel solutions for the serial Eclat algorithm family. Zaki et al. (1997) proposed a parallelization, named ParEclat, of their serial algorithm Eclat. The algorithm's idea comprises two phases. First, Eclat is used to discover all frequent itemsets of max length k = 2. The k-itemsets are then grouped into equivalence classes (an equivalence class is a set of k-itemsets sharing the same (k-1)-itemset prefix). In the second phase, each process uses Eclat to mine frequent k-itemsets (k > 2) based on the equivalence classes and all related global Tid-lists of k-itemsets (k = 1, 2) distributed to the process. With a similar idea, Moens, Aksehirli and Goethals (2013) proposed the algorithm Dist-Eclat on MapReduce, in which the value of k used in the first phase (3 is the recommended value) is provided as an input parameter. Generally, the algorithms in this category deal with the load balance and memory scalability problems somewhat better than the FP-Growth-based algorithms, since the subspaces of itemsets (the global Conditional FP-Trees of FP-Growth-based algorithms) are divided into smaller parts with longer itemset prefixes in Eclat-based algorithms.
However, the extreme network traffic in their first phases still cannot be avoided.

Compound-based: distributed parallel solutions which utilize more than one serial FIM algorithm. In the same paper as the Dist-Eclat algorithm, Moens, Aksehirli and Goethals (2013) proposed the BigFIM algorithm on MapReduce. The idea is the same as Dist-Eclat, except that in the first phase, BigFIM employs an implementation of Apriori to mine all
frequent itemsets of max length k. The purpose is to reduce the consumed memory, but BigFIM needs a much longer time to complete its mining. Duong et al. (2017) proposed the MapFIM algorithm based on BigFIM, which does not require the input parameter k and automatically switches from the first phase to the second phase. The authors argued that the selection of the parameter k by users is difficult: if the k value is too low, BigFIM cannot complete its execution since the conditional databases may not fit in main memory; conversely, if the k value is too high, the first phase of BigFIM will be time-consuming. However, MapFIM requires another parameter β (ε ≤ β ≤ 1) to indicate over-frequent itemsets, whose supports are greater than or equal to β. When there are no longer any over-frequent k-itemsets, MapFIM switches to the second phase. The higher the value of β, the more efficient the algorithm. MapFIM showed high performance when mining the hard-to-mine dataset Webdocs at the small support threshold ε = 5%; as far as we know, previous algorithms experimented on Webdocs only with ε ≥ 7%, or on other datasets.

Hybrid-based: distributed parallel solutions for the serial hybrid algorithm family. Recently, Lin and Gu (2016) proposed the PFIN algorithm on the Spark framework, which parallelizes the state-of-the-art algorithm FIN. PFIN employs the same idea as the algorithm PFP, except for the method of discovering frequent itemsets, which comes from the serial algorithm FIN. The experiments were conducted on powerful hardware, a cluster of 7 nodes, each with 32 CPU cores and 192 GB of memory. For the Webdocs dataset, the algorithm was benchmarked only with ε ≥ 10%, and it took approximately 130 s to mine at ε = 10%.

3. FPO tree

3.1. Definitions and properties

Definition 1. An FPO tree (Frequent Patterns Organization) is a prefix tree structure with the following properties:
(1) It is constructed from a set of itemsets JS. Each itemset in JS comes with its support count value, and the order of the items in the itemset obeys a given order O of the total item list.
(2) The FPO tree includes one root node labeled "root" and a set of prefix subtrees. Each node in the subtrees registers an item, with or without the support count of an itemset in JS. A node which contains support count information is a complete node. The set of all complete nodes in the tree is denoted CN.
(3) Under the same parent node, no two different child nodes register the same item, and the child nodes are ordered based on the order O of their respective items.
(4) Construction Function F: A k-itemset ∈ JS, with its items ordered by O as e1e2…ek, contributes to the tree a node chain starting from the root node N0, in which node Ni registers item ei and Ni is a child node of Ni-1 (1 ≤ i ≤ k). The last node Nk receives the itemset's support count.

Definition 2. A complete FPO tree is an FPO tree in which all nodes (except the root) are complete nodes.

Table 1 shows an example dataset and all the frequent itemsets with their support counts mined at the support threshold 0.5. The frequent itemsets are presented in two forms: item names, or item codes, which are the indices of the items in the list {e, d, f, c} sorted by an order O of increasing support counts, the first item e having index 0. The number at the end of each itemset indicates its support count.
Table 1. An example dataset with the mining result of frequent itemsets at ε = 0.5.

Dataset (one transaction per line):
c d e f
b c d f g h
c e f
a c d

Mining result of frequent itemsets (ε = 0.5), in item-name form:
e f c:2, d f c:2, e f:2, e c:2, f c:3, d f:2, d c:3, e:2, d:3, f:3, c:4

The same itemsets in item-code form (the total item list with the order O: {e, d, f, c}):
0 2 3:2, 1 2 3:2, 0 2:2, 0 3:2, 2 3:3, 1 2:2, 1 3:3, 0:2, 1:3, 2:3, 3:4
Fig. 1 demonstrates the steps of constructing the complete FPO tree from the frequent itemsets in item-code form in Table 1. Fig. 1(a) presents the tree after inserting the first two frequent itemsets, 0 2 3:2 and 1 2 3:2. The two itemsets are inserted into the tree as they are, and the support count attributes of the two last nodes are set to the itemsets' respective support counts. Next, in Fig. 1(b), the two frequent itemsets 0 2:2 and 0 3:2 are inserted. For the itemset 0 2:2, the necessary nodes 0 and 2 are already available in the tree; therefore, just the support count attribute of the last node 2 is updated with the itemset's support count. For the itemset 0 3:2, node 0 for item 0 is found, but a last node 3 for the last item 3 is not; a new node 3 is added as a child node of node 0, with its support count attribute set to the itemset's support count. Proceeding likewise with the remaining frequent itemsets, the complete FPO tree is shown in Fig. 1(c).
Fig. 1. Construction of the complete FPO tree from frequent itemsets in Table 1.
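To make the construction function F of Definition 1 concrete, the following is a minimal Java sketch, assuming items are already encoded as their indices under the order O; the class and method names (FPONode, insert, childFor) are illustrative and not taken from the DP3 source code.

    import java.util.ArrayList;
    import java.util.List;

    // A minimal sketch of an FPO tree node and the construction function F.
    class FPONode {
        int item = -1;             // item code under the order O; -1 marks the root
        long support = -1;         // -1 marks a node that is not (yet) complete
        final List<FPONode> children = new ArrayList<>();  // kept sorted by item

        // Insert an itemset (items already sorted by the order O) with its support count.
        void insert(int[] itemset, long supportCount) {
            FPONode node = this;
            for (int item : itemset) {
                node = node.childFor(item);
            }
            node.support = supportCount;   // the last node becomes a complete node
        }

        // Find the child registering 'item', or create it at the position
        // that keeps the sibling list ordered (point (3) of Definition 1).
        private FPONode childFor(int item) {
            int lo = 0, hi = children.size();
            while (lo < hi) {              // binary search over the sorted siblings
                int mid = (lo + hi) >>> 1;
                if (children.get(mid).item < item) lo = mid + 1; else hi = mid;
            }
            if (lo < children.size() && children.get(lo).item == item) {
                return children.get(lo);
            }
            FPONode child = new FPONode();
            child.item = item;
            children.add(lo, child);
            return child;
        }
    }

Inserting the eleven coded itemsets of Table 1, e.g. root.insert(new int[]{0, 2, 3}, 2), reproduces the tree of Fig. 1(c), with exactly one node per frequent itemset.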
Fig. 2 visualizes the differences between the FPO prefix tree and two other prefix trees, the FP tree of FP-Growth and the PPC tree of PrePost+. The same frequent itemsets from Table 1 are used to construct the three trees separately. The differences are enumerated as follows.
• The purpose of the FPO tree is to organize the mining result of frequent itemsets, aiming at high compactness for light transfer and storage and at efficient aggregation, while the other trees hold the transactions of a dataset in order to discover frequent itemsets.
• With any given order O of items, e.g. the increasing order of support counts for better load balance (discussed in Section 5.2), the FPO tree maintains its optimal compactness for a full mining result of frequent itemsets (Proposition 1), while the other trees apply the decreasing order for high compression of datasets.
• A node of the FPO tree may not contain support count information, which is the support count of a frequent itemset used to construct the tree, while a node of the other trees always contains the local support count of the item in the node.
• In the FPO tree, sibling nodes must obey the order O, while the other trees do not require this constraint.
• The FPO tree does not maintain the Header table of the FP tree or the pairs of pre-order and post-order codes at the nodes of the PPC tree.
Fig. 2. The FPO tree compared with two other prefix trees, the FP tree and the PPC tree.
Definition 3 (Recover Function F−1). For a node Nk (at height k) in a built FPO tree, an itemset e1e2…ek is recovered by traversing the node chain NkNk-1…N0 from Nk to the root node N0, in which Ni-1 is the parent node of Ni and item ei is collected at node Ni (1 ≤ i ≤ k). The itemset's support count is set to the support count of Nk, if present.

Property 1. The two last nodes N1, N2 that are respectively generated by applying the function F on two different itemsets IS1, IS2 ∈ JS are different from each other.

Proof. By contradiction, assume the two nodes N1 and N2 are the same. Since a child node has only one parent node, the two node chains from N1 and N2 to the root node are the same. This means that IS1 and IS2 are the same, which is a contradiction. Therefore, Property 1 holds.

Property 2. For a given FPO tree, let an itemset be generated by the function F−1 with an input node N. The itemset belongs to JS if and only if N is a complete node.

Proof. Applying the function F to an itemset IS ∈ JS, we get a node chain from the root node to a complete node N which contains the support count of IS. Then, applying the function F−1 to N, we get IS' = IS. According to Property 1, the support count of N is not overridden by any other itemset; in other words, the support counts of IS' and IS are the same. Therefore IS' ≡ IS, which means that IS' belongs to JS. (∗) Applying the function F−1 to a node which is not a complete node, the output itemset has no support count. However, every itemset ∈ JS must possess a support count, so the output itemset does not belong to JS. (∗') From (∗) and (∗'), Property 2 holds.

Lemma 1. For a given FPO tree, the construction function F mapping the set of itemsets JS to the set of complete nodes CN is a bijective mapping.

Proof. According to Property 1 and Property 2, the function F is simultaneously an injective and a surjective mapping from the set JS to the set CN. Therefore, the function F is a bijective mapping from JS to CN.

Lemma 2. A complete FPO tree constructed from a set of itemsets JS is an optimal and lossless compaction of JS.

Proof. According to Definition 2 and Lemma 1, each node in a complete FPO tree can be used to recover exactly one distinct itemset in JS with its support count, and the number of itemsets in JS equals the number of nodes in the complete FPO tree. Therefore, the complete FPO tree is a lossless compaction of JS. By using the complete FPO tree, each k-itemset in JS is encoded by the item of the corresponding complete node. Viewing the distinct items as atomic units, JS is a set of different code words made up of these units, and the size of each code word is reduced from k units to one unit. This means that the complete FPO tree is an optimal compaction of JS. Therefore, Lemma 2 holds. For the example in Table 1, the 11 frequent itemsets (20 item occurrences in total) are represented by just 11 nodes in Fig. 1(c).
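As a companion to the construction sketch above, a minimal sketch of the recover function F−1 of Definition 3 follows; it assumes the illustrative FPONode class is extended with a parent reference set at insertion time.

    import java.util.LinkedList;
    import java.util.List;

    // Sketch of the recover function F^-1: walk from a node up to the root,
    // prepending items so they come out in the order O.
    static List<Integer> recover(FPONode node) {
        LinkedList<Integer> itemset = new LinkedList<>();
        for (FPONode n = node; n.item != -1; n = n.parent) {
            itemset.addFirst(n.item);
        }
        return itemset;   // its support count is node.support, if node is complete
    }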
Table 2. Serial data for the complete FPO tree in Fig. 1(c).

Index:        0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Number Array: 4  0  1  2  3  2  2  3  2  2  3  1  3  1  3  1  3
Bit Array:    1  1  1  1  0  1  0  1  0  0  0  0
Definition 4. A set of itemsets JS is called a closed set of itemsets if all sub-itemsets of every itemset in JS belong to JS.

Lemma 3. An FPO tree constructed from a closed set of itemsets JS is a complete FPO tree.

Proof. According to the function F, a leaf node Nk is a complete node since it contains the support count of a k-itemset IS in JS. Assume that IS, with its items in the order O as e1e2…ek-1ek, contributed to the FPO tree the node chain N0N1…Nk-1Nk. Applying the function F−1 on the parent node Nk-1 of node Nk, we travel the node chain Nk-1…N1N0 and collect the corresponding itemset IS' = e1e2…ek-1, which is a sub-itemset of IS. Hence IS' also belongs to JS and contributes to the FPO tree a node chain which must be N0N1…Nk-1, for the following two reasons: (1) the items of IS', when inserted into the FPO tree, must be in the order e1e2…ek-1 (since this unique order was defined in IS by the above hypothesis); (2) no two sibling nodes share the same item, according to (3) of Definition 1. Therefore, node Nk-1 contains the support count of IS', which means it is a complete node. Repeating this argument for the other ancestor nodes Nk-2, …, N2, N1 and for the remaining leaf nodes, we conclude that the tree is a complete FPO tree.

Proposition 1. An FPO tree built from a set of frequent itemsets JSε mined from a dataset at a support threshold ε is a complete FPO tree, which is an optimal and lossless compaction of JSε.

Proof. Based on Definition 4, a set of frequent itemsets JSε is a closed set of itemsets (every sub-itemset of a frequent itemset is frequent). Therefore, according to Lemma 3, the FPO tree built from JSε is a complete FPO tree (e.g. Fig. 1 and Table 1), and by Lemma 2, this FPO tree is an optimal and lossless compaction of JSε.

Subtrees of a complete FPO tree inherit its properties since they are also complete FPO trees (Definition 2). In DP3, subtrees of complete FPO trees constructed from the sets JSε of local frequent itemsets are exchanged between the slave and master processes. The master process aggregates the local frequent itemsets from the slave processes by merging all the local complete FPO subtrees. The rest of this section presents the merging operation and a serialization method for transferring or storing an FPO tree.

3.2. Operations

3.2.1. FPO tree serialization
In order to transfer or store an FPO tree efficiently, the tree is converted into an array of numbers. The items of the nodes in the tree are represented by their respective indices in the item list with order O (Definition 1). With a Breadth-First traversal of the tree, each node (except the root node) becomes a number, or a pair of numbers when the support count information is required. Each group of nodes belonging to the same parent node is preceded by a delimiter which is the number of nodes in the group. For writing to hard disk, the delimiters can be new-line characters. The volume overhead for the delimiters cannot be avoided even when transferring or storing the itemsets in other data representations (e.g. numeric arrays, string arrays, or strings with delimiters between items). A bit array, with bit 1 for inner (parent) nodes and bit 0 for leaf nodes, is generated in the same Breadth-First traversal as the number array. The bit array conveys the information about the parent nodes of the node groups that is needed to reconstruct the FPO tree: the node groups are sequentially adopted by the parent nodes located at the indices of the 1 bits in the bit array. Table 2 shows an example of the number array (without the support counts) and the bit array for the complete FPO tree in Fig. 1(c). The first four nodes {0, 1, 2, 3} have the root node, at index 0, as their parent node. The next two groups of two nodes, {2, 3} and {2, 3}, are respectively adopted by the parent node 0 at index 1 and the parent node 1 at index 2. And so forth: the last three groups of one node, {3}, {3}, and {3}, sequentially have their parent nodes at indices 3, 5, and 7.
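The following Java sketch illustrates the Breadth-First serialization just described, producing the number array and bit array of Table 2 (support counts omitted, as in the table); it reuses the illustrative FPONode class from Section 3.1, and the method name is an assumption.

    import java.util.ArrayDeque;
    import java.util.List;
    import java.util.Queue;

    // Sketch: serialize an FPO tree into the number array and the bit array.
    static void serialize(FPONode root, List<Integer> numbers, List<Integer> bits) {
        Queue<FPONode> queue = new ArrayDeque<>();
        queue.add(root);
        bits.add(1);                                   // the root is always an inner node
        while (!queue.isEmpty()) {
            FPONode node = queue.poll();
            if (node.children.isEmpty()) continue;     // leaf nodes emit no group
            numbers.add(node.children.size());         // group delimiter: the group size
            for (FPONode child : node.children) {
                numbers.add(child.item);               // (a support count could follow here)
                bits.add(child.children.isEmpty() ? 0 : 1);
                queue.add(child);
            }
        }
    }

Applied to the tree of Fig. 1(c), this yields exactly the two arrays of Table 2; reconstruction reverses the process, letting each 1 bit, in sequence, identify the parent node that adopts the next node group.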
3.2.2. FPO tree merge
The procedure RecursiveMerge recursively merges a sub-FPO tree with root node N2 into another sub-FPO tree with root node N1. Lines 1–4 handle the case where at least one of the two child node lists is empty, and lines 5–35 handle the remaining case of two non-empty child node lists. If child nodes c1 and c2 of the respective nodes N1 and N2 register the same item, the subtree rooted at c2 is recursively merged into the subtree rooted at c1. Line 21, which aggregates the support counts of c1 and c2, can be eliminated if the support information is not required by an algorithm. When child nodes c1 and c2 register different items, say c1.item < c2.item, the deeper merging for all descendant nodes of c1 is pruned.

Algorithm FPOTreeMerge
Input: Two FPO trees with the root nodes: R1, R2
Output: The merged FPO tree with the root node R1
1. subNodeList ← ∅;
2. Merge(R1.children, R2.children, subNodeList);
3. index ← 0;
4. For t From 1 To ThreadCount
5.   Start MergeThread(subNodeList, index);

MergeThread(subNodeList, index)
1. While(index < subNodeList.length)
2.   Mutually-exclusive-region {
3.     N1 ← subNodeList[index]; index++;
4.     N2 ← subNodeList[index]; index++;
5.   }
6.   RecursiveMerge(N1, N2);
7. End While
In addition to the pruning, merging two FPO trees can be parallelized with the Work-Pool model for load balance and higher performance, as represented in the FPOTreeMerge algorithm. The procedure Merge merges the child node list of R2 into the child node list of R1. This procedure is the same as the procedure RecursiveMerge except at line 23: the two child nodes c1 and c2 are added to subNodeList for later recursive merging by the threads instead of immediately calling RecursiveMerge(c1, c2). In lines 4 and 5, the worker threads start and continuously take and merge the two subtrees corresponding to two consecutive nodes in subNodeList until no unmerged subtrees are left. Since sibling nodes are ordered based on the order O, the computational complexity of merging two FPO trees is O(M + N), where M and N are the numbers of nodes of the two trees; the worst case is merging two identical FPO trees.
Procedure RecursiveMerge(N1, N2)
1. If(N1.children = ∅)
2.   If(N2.children = ∅) return;
3.   Else N1 adopts child nodes of N2;
4. Else If(N2.children = ∅) return;
5. mergedList ← ∅; i1 ← 0; i2 ← 0;
6. S1 ← N1.children.size;
7. S2 ← N2.children.size;
8. c1 ← N1.children.get(i1);
9. c2 ← N2.children.get(i2);
10. While(true)
11.   If(c1.item < c2.item)
12.     mergedList ← mergedList ∪ c1; i1++;
13.     If(i1 < S1) c1 ← N1.children.get(i1);
14.     Else break;
15.   Else If(c1.item > c2.item)
16.     mergedList ← mergedList ∪ c2; i2++;
17.     c2.parent ← N1;
18.     If(i2 < S2) c2 ← N2.children.get(i2);
19.     Else break;
20.   Else
21.     c1.support ← c1.support + c2.support;
22.     mergedList ← mergedList ∪ c1;
23.     RecursiveMerge(c1, c2);
24.     i1++; i2++;
25.     If(i1 < S1 and i2 < S2)
26.       c1 ← N1.children.get(i1);
27.       c2 ← N2.children.get(i2);
28.     Else break;
29.   End If
30. End While
31. For Each remaining child node N of N1 or N2
32.   mergedList ← mergedList ∪ N;
33. End For
34. N1.children ← mergedList;
35. N2.children ← ∅;
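A sketch of the Work-Pool dispatch of FPOTreeMerge with plain Java threads follows; an AtomicInteger plays the role of the mutually exclusive index (pairs are claimed two slots at a time), and recursiveMerge is assumed to implement the RecursiveMerge procedure above.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of MergeThread: workers repeatedly claim an (N1, N2) pair from
    // subNodeList and merge the subtree rooted at N2 into the subtree rooted at N1.
    static void parallelMerge(List<FPONode> subNodeList, int threadCount)
            throws InterruptedException {
        AtomicInteger index = new AtomicInteger(0);
        Thread[] workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            workers[t] = new Thread(() -> {
                while (true) {
                    int i = index.getAndAdd(2);            // claim two consecutive slots
                    if (i + 1 >= subNodeList.size()) break;
                    recursiveMerge(subNodeList.get(i), subNodeList.get(i + 1));
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();                 // wait until the pool is drained
    }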
4. DP3 algorithm

This section presents DP3, a distributed parallel algorithm working in the Master-Slaves model. The state-of-the-art algorithm PrePost+ for mining local frequent itemsets, as well as the other processing, is parallelized in the shared-memory Work-Pool model and executed at the local machines to unlock as much performance as possible.
4.1. PrePost+ in brief
This subsection presents an overview of the steps of the PrePost+ algorithm, which is necessary background for the rest of the section. The detailed algorithm can be found in Deng and Lv (2015).

Algorithm PrePost+
Input: Dataset D, support threshold ε
Output: All frequent itemsets
1. Scan D to collect L, a list of frequent items (at ε) sorted in the increasing order of their support counts.
2. Scan D to build a PPC tree T from the frequent items of the transactions in D.
3. Traverse T to generate N-list structures for the frequent items.
4. Generate the list of frequent 2-itemsets L2:
   4.1. Traverse each node N in tree T
   4.2.   IN ← N.item-name;
   4.3.   For Each ancestor node A of N
   4.4.     IA ← A.item-name;
   4.5.     C2.add(IN IA, IN IA.support + N.support);
   4.6.   End For
   4.7. Filter L2 from C2 based on ε
5. Generate the N-lists of frequent 2-itemsets from the N-lists of frequent items.
6. Recursively expand each frequent 2-itemset by appending item after item, following the order of items in L, to generate k-itemsets (k > 2) and their N-lists, and filter the frequent ones based on ε and the support counts computed from the N-lists.
4.2. Shared-memory parallelization for PrePost+
The last three steps 4, 5, and 6 of PrePost+ take most of the algorithm's processing time. Fortunately, we have found a solution to parallelize the three steps in the shared-memory Work-Pool model for load balance and higher performance.

4.2.1. Step 4 parallelization
In this step, the work pool is the PPC tree T, and the tasks in the work pool are the direct subtrees of T. Each worker thread continuously takes a subtree and executes the job defined by lines 4.1 to 4.6 until there are no subtrees left in the work pool. Since a 2-itemset IN IA exists in many subtrees, updating the support count of IN IA would need synchronization among the worker threads, which may prevent the running time from improving, or even worsen it. To provide independence among the worker threads and achieve higher efficiency, the collection C2 is eliminated; instead, each worker thread t has its own |L| × |L| matrix of numbers Mt for storing the support counts of 2-itemsets. Mt(i, j) holds the local support count of the 2-itemset comprising the two items at positions i and j in the frequent item list L. After all threads have completed their work, an aggregation is performed over the matrices Mt to obtain the global support counts of all 2-itemsets, and the frequent 2-itemsets are then filtered based on ε.

Parallel Step 4
1. Initialize matrices Mt, t = [1, ThreadCount];
2. index ← 0;
3. For t From 1 To ThreadCount
4.   Start Step4Thread(R, index, Mt);
5. Aggregate Mt and filter frequent 2-itemsets;

Step4Thread(R, index, Mt)
1. While(index < R.childList.length)
2.   Mutually-exclusive-region {
3.     subTree ← R.childList[index];
4.     index++;
5.   }
6.   Traverse each node N of subTree
7.     i ← mapToIndex(N.item-name);
8.     For Each ancestor node A of N
9.       j ← mapToIndex(A.item-name);
10.      Mt[i,j] ← Mt[i,j] + N.support;
11.    End For
12. End While
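A minimal Java sketch of the thread-local counting of Parallel Step 4 follows: each worker accumulates 2-itemset counts into its own matrix over the PPC subtrees it takes, and the matrices are summed once all workers finish. The PPCNode type and its field names are assumptions for illustration.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Illustrative PPC tree node: item index in L, node count, child list.
    class PPCNode {
        int itemIndex;
        int count;
        final List<PPCNode> children = new ArrayList<>();
    }

    // Per-thread counting over one PPC subtree (lines 4.1-4.6), Depth-First;
    // 'ancestors' holds the item indices on the path above 'node'.
    static void count(PPCNode node, Deque<Integer> ancestors, long[][] Mt) {
        for (int a : ancestors) {
            Mt[node.itemIndex][a] += node.count;   // 2-itemset (node's item, ancestor's item)
        }
        ancestors.push(node.itemIndex);
        for (PPCNode child : node.children) count(child, ancestors, Mt);
        ancestors.pop();
    }

    // Synchronization-free aggregation of the thread-local matrices.
    static long[][] aggregate(long[][][] perThread, int L) {
        long[][] global = new long[L][L];
        for (long[][] Mt : perThread)
            for (int i = 0; i < L; i++)
                for (int j = 0; j < L; j++)
                    global[i][j] += Mt[i][j];
        return global;   // frequent 2-itemsets are then filtered against the threshold
    }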
4.2.2. Step 5 parallelization
The N-lists of all frequent items are available at this step. The N-list structure of a 2-itemset IJ is generated from the two N-lists of the items I and J with read-only accesses. Therefore, generating the N-list of a frequent 2-itemset is independent of that of the others, and the Work-Pool parallelization of this step is simple. The work pool is the list of frequent 2-itemsets, and its tasks are the 2-itemsets. Worker threads continuously take frequent 2-itemsets and generate the respective N-lists until there are no tasks left in the pool.

4.2.3. Step 6 parallelization
This step mines all remaining frequent itemsets by recursively extending each frequent 2-itemset. The N-list of an extended k-itemset (k > 2) is generated from the two N-lists of two frequent (k-1)-itemsets having the same (k-2)-itemset prefix. The support count of the extended k-itemset is calculated from its N-list to determine whether it is a frequent itemset. The details can be found in Deng and Lv (2015). Frequent k-itemsets (k > 2) discovered from a certain frequent 2-itemset adopt that 2-itemset as their prefix. Consequently, based on the frequent 2-itemsets, all frequent k-itemsets (k ≥ 2) can be divided into subspaces of frequent itemsets which are
separate from each other. Therefore, calculating the N-list of a k-itemset in a subspace refers only to two generated N-lists of two (k-1)-itemsets in the same subspace. In other words, the subspaces of frequent itemsets are discovered totally separately and independently of each other. The Work-Pool parallelization comes with a pool of frequent 2-itemsets as the tasks. Worker threads successively take frequent 2-itemsets and discover the corresponding subspaces until there are no tasks left in the pool.

4.3. Distributed parallelization

Lemma 4. For a given dataset D partitioned into k local datasets Di, with Di ∩ Dj = ∅ (i ≠ j) and ∪i Di = D, and a given support threshold ε: if an itemset X is a frequent itemset of D, then X must be a frequent itemset of at least one local dataset.

Proof. By contradiction, assume that there exists a frequent itemset X of D which is not frequent in any local dataset. Then (1) holds, where si is the support count of X on the local dataset Di:

  si / |Di| < ε,  i = 1..k    (1)

Let s be the support count of X on the global dataset D. We show that (2) holds:

  s / |D| < ε    (2)

Applying (1) to the left side of (2), we have:

  s / |D| = (Σi=1..k si) / |D| < (Σi=1..k ε·|Di|) / |D| = ε·(Σi=1..k |Di|) / |D| = ε·|D| / |D| = ε

Since (2) holds, X is not a frequent itemset of the global dataset, which is a contradiction. Therefore, Lemma 4 holds. According to Lemma 4, it can be inferred that every frequent itemset mined from a local dataset Di is a candidate to become a frequent itemset of the global dataset D.
The DP3 algorithm operates in the Master-Slaves model with three phases. Each slave process (Slave for short) is assigned a separate partition Di of the dataset D. After finishing the simple Phase 1, all Slaves and the Master have obtained the (global) frequent item list L1, whose items are sorted in the increasing order of their support counts. This order is used as the order O of every FPO tree in the DP3 algorithm (Definition 1). In Phase 2, the first three steps 2.1, 2.2, and 2.3 at the Slaves, which generate the local frequent 2-itemsets list L2i, correspond respectively to steps 2, 3, and 4 of the PrePost+ algorithm. The differences are that the Slaves use the global frequent item list L1 in steps 2.1 and 2.2 instead of the local frequent item list of Di, and that step 2.3 is executed in shared-memory parallel (Step 4 parallelization, Section 4.2). After completing step 2.3, the local PPC tree Ti is no longer used, so its memory can be freed. Step 2.4 builds a local FPO tree T2i from L2i and sends T2i to the Master (FPO tree serialization, Section 3); the support count information is not sent to the Master. All the trees T2i are received and merged in parallel into an FPO tree T2, which the Master then sends back to the Slaves. After receiving T2, each Slave converts T2 into a list of 2-itemsets L2 which contains all candidate 2-itemsets that can become global frequent 2-itemsets. In Phase 3, at the Slave side, the list L2 is partitioned into ℓ parts L2^j in step 3.1; then, in step 3.2, each part L2^j is used in a loop which cooperates with a respective loop in step 3.1 at the Master side to discover the corresponding subspaces of frequent k-itemsets (k ≥ 2). The reason for the partition and the loops is to provide memory scalability for DP3. Very large and hard-to-mine datasets often possess huge numbers of frequent itemsets, which would require huge memory for the itemsets and the related data structures if all frequent itemsets were discovered in just one processing loop. The foundation supporting the partition is that a subspace of itemsets expanded from a 2-itemset in L2 is separate and can be mined independently of the other ones. The details of how to partition L2 are reported in the next section. For the loop processing part L2^j, the first two steps 3.2.1 and 3.2.2 of Slave i employ the respective parallel methods of Section 4.2: Step 5 parallelization and Step 6 parallelization.
Algorithm DP3
Input: Local dataset Di for Slave i, support threshold ε, the number of loops ℓ (≥ 1)
Output: All frequent itemsets

Slave i
1. Phase 1 (Achieve frequent item list L1)
   1.1. Scan Di to get the local frequent item list L1i, whose items are sorted in the order of their item names
   1.2. Send L1i to Master
   1.3. Wait and receive L1 from Master
2. Phase 2 (Achieve candidate 2-itemsets L2)
   2.1. Scan Di to build a local PPC tree Ti with the items in L1
   2.2. Traverse Ti to generate N-lists for the items in L1
   2.3. Discover the local frequent 2-itemsets list L2i
   2.4. Send the FPO tree T2i built from L2i to Master (without support counts)
   2.5. Wait and receive T2 from Master; convert T2 into L2
3. Phase 3 (Achieve frequent k-itemsets, k ≥ 2)
   3.1. Partition L2 into parts L2^j (j = 1…ℓ)
   3.2. For each L2^j
        3.2.1. Generate N-lists for the candidate 2-itemsets in L2^j
        3.2.2. Discover a set of local frequent k-itemsets Li^j from the candidate 2-itemsets in L2^j
        3.2.3. Build an FPO tree Ti^j from Li^j
        3.2.4. Send Ti^j to Master (without support counts)
        3.2.5. Wait and receive the FPO tree T^j from Master
        3.2.6. Update the local support counts for all nodes of T^j
        3.2.7. With a Breadth-First traversal of T^j, extract the support counts of the nodes to fill a numeric array Ni^j
        3.2.8. Send Ni^j to Master

Master
1. Phase 1 (Achieve frequent item list L1)
   1.1. Wait and receive all L1i lists from the Slaves
   1.2. Aggregate all L1i lists and filter to achieve L1, whose items are in the increasing order of their support counts
   1.3. Send L1 to all Slaves
2. Phase 2 (Achieve candidate 2-itemsets L2)
   2.1. Wait and receive all FPO trees T2i from the Slaves
   2.2. Merge all trees T2i into an FPO tree T2
   2.3. Send T2 to all Slaves
3. Phase 3 (Achieve frequent k-itemsets, k ≥ 2)
   3.1. For each loop j
        3.1.1. Wait and receive all FPO trees Ti^j from the Slaves
        3.1.2. Merge all trees Ti^j into an FPO tree T^j
        3.1.3. Send T^j to all Slaves
        3.1.4. Wait and receive all Ni^j from the Slaves
        3.1.5. Summarize all Ni^j into an array of global support counts N^j
        3.1.6. With a Breadth-First traversal of tree T^j, fill the nodes' support counts with the respective numbers in N^j
        3.1.7. Traverse tree T^j to filter and write the frequent k-itemsets
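Steps 3.2.7 and 3.1.6 work because every machine holds an identical copy of T^j, so a Breadth-First traversal enumerates its nodes in the same sequence everywhere and the support counts can be shipped as a bare numeric array. A hedged Java sketch, reusing the illustrative FPONode class of Section 3.1:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    // The shared node order: a Breadth-First enumeration of T^j (root excluded).
    static List<FPONode> breadthFirst(FPONode root) {
        List<FPONode> order = new ArrayList<>();
        Queue<FPONode> queue = new ArrayDeque<>(root.children);
        while (!queue.isEmpty()) {
            FPONode n = queue.poll();
            order.add(n);
            queue.addAll(n.children);
        }
        return order;
    }

    // Slave side, step 3.2.7: dump the local support counts into an array Ni^j.
    static long[] extractSupports(FPONode root) {
        List<FPONode> order = breadthFirst(root);
        long[] supports = new long[order.size()];
        for (int i = 0; i < supports.length; i++) supports[i] = order.get(i).support;
        return supports;
    }

    // Master side, step 3.1.6: write the summed global counts N^j back into T^j.
    static void fillSupports(FPONode root, long[] supports) {
        List<FPONode> order = breadthFirst(root);
        for (int i = 0; i < supports.length; i++) order.get(i).support = supports[i];
    }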
These parallel methods mine a set of subspaces of local frequent itemsets Li^j, which is the input for building in parallel a local FPO tree Ti^j in step 3.2.3. Worker threads consecutively take the subspaces and build the corresponding subtrees of Ti^j until there are no subspaces left. Tree Ti^j is then sent to the Master in step 3.2.4 by applying the FPO tree serialization of Section 3. The support counts are not needed yet, so they are not sent to the Master at this point.

At the Master side, after receiving all local FPO trees Ti^j in step 3.1.1, the Master merges all trees Ti^j in parallel into an FPO tree T^j in step 3.1.2 (algorithm FPOTreeMerge, Section 3) and sends T^j to all Slaves. The Master's operations of receiving from or sending to the Slaves, such as 3.1.1, 3.1.3, and 3.1.4, are also parallelized, so that each thread is responsible for communicating with one Slave. After receiving the FPO tree T^j in step 3.2.5, Slave i updates the local support counts for all nodes of T^j with the parallel procedure SupportCountUpdate in step 3.2.6. With a Breadth-First traversal of T^j, the nodes' support counts sequentially fill a numeric array Ni^j in step 3.2.7, which is sent to the Master in step 3.2.8.

At the Master side, all numeric arrays Ni^j are received in step 3.1.4 and then summarized into a global support count array N^j in step 3.1.5. Again with a Breadth-First traversal of tree T^j in step 3.1.6, the numbers of N^j in sequence fill exactly the support counts of the corresponding nodes of T^j. Finally, in step 3.1.7, the Master traverses T^j in parallel to filter and write the frequent itemsets. If a node's support count is less than the minimum support count ε·|D|, the node's descendants are pruned; otherwise, the node's itemset is globally frequent, and the traversal continues with the node's child nodes. Worker threads continuously take and traverse subtrees of T^j until there are no subtrees left; the subtrees' root nodes are at height 2 (the root node of T^j is at height 0).

After sending the local FPO tree Ti^j at step 3.2.4, the Slaves must wait to receive the global tree T^j at step 3.2.5. However, at step 3.2.8, after sending the local support count array Ni^j, a Slave can immediately execute steps 3.2.1, 3.2.2, and 3.2.3 of the next loop while the Master is completing its last steps 3.1.5, 3.1.6, and 3.1.7. When the Slaves or the Master finish a loop, they free the memory allocated for the data structures of that loop and instantly start a new one.

The same Work-Pool model is applied for the parallel procedure SupportCountUpdate. Worker threads successively take subspaces (subSpace) in Li^j and the corresponding subtrees (subTree) of T^j, then update the support counts of all nodes of subTree with a Depth-First traversal. The support count of a node is updated with the support count of the corresponding itemset, which is derived from the N-list structure of the itemset. If the N-list of the itemset is not available, it is calculated by the function CalculateN-list and maintained in subSpace. At the beginning, N-list structures exist in subSpace for the local frequent itemsets but not for the locally infrequent itemsets. Since a Depth-First traversal is employed to scan subTree, when the N-list of IS = I1I2…Ik-2Ik-1Ik is being calculated, the N-lists of all its prefix itemsets, such as I1I2…Ik-2Ik-1 and I1I2…Ik-2, are already available. The N-list of IS is generated from the N-lists of its two sub-itemsets I1I2…Ik-2Ik-1 and I1I2…Ik-2Ik. If the N-list of I1I2…Ik-2Ik is not available, it is calculated recursively.

Procedure SupportCountUpdate(L2^j, Li^j, T^j)
1. index ← 0;
2. For t From 1 To ThreadCount
3.   Start SupportCountUpdateThread(index, L2^j, Li^j, T^j);

SupportCountUpdateThread(index, L2^j, Li^j, T^j)
1. While(index < L2^j.length)
2.   Mutually-exclusive-region {
3.     subSpace ← Li^j.getSubSpace(L2^j[index]);
4.     subTree ← T^j.getSubTree(L2^j[index]);
5.     index++;
6.   }
7.   Depth-First traversal on each node N of subTree
8.     IS ← The corresponding itemset of N;
9.     NL ← subSpace.getN-list(IS);
10.    If (NL != null) N.support ← NL.getSupport();
11.    Else
12.      NL ← CalculateN-list(subSpace, IS);
13.      N.support ← NL.getSupport();
14.    End If
15. End While

Function CalculateN-list(subSpace, I1I2…Ik-2Ik-1Ik)
1. If (k = 2)
2.   NL ← Generate an N-list from the N-lists of items I1, I2;
3.   subSpace.addN-list(NL);
4.   Return NL;
5. End If
6. NL1 ← subSpace.getN-list(I1I2…Ik-2Ik-1);
7. NL2 ← subSpace.getN-list(I1I2…Ik-2Ik);
8. If (NL2 != null)
9.   NL ← Generate an N-list structure from NL1 and NL2;
10.  subSpace.addN-list(NL);
11.  Return NL;
12. End If
13. NL2 ← CalculateN-list(subSpace, I1I2…Ik-2Ik);
14. NL ← Generate an N-list structure from NL1 and NL2;
15. subSpace.addN-list(NL);
16. Return NL;
5. Discussions

The time at which a parallel algorithm completes its work is decided by the latest-finishing members taking part in the work. A parallel algorithm that runs in a shared-memory environment, or in a distributed one with relatively uniform computational power across computers, requires the workload to be divided as equally as possible among the executing members to achieve the best load balance. The load balance in these two environments and the memory scalability are considered in the DP3 algorithm as follows.

5.1. Load balance for distributed parallelization
The workload of FIM is decided not only by the dataset size or the number of transactions but is also heavily influenced by the characteristics of the dataset. Real datasets are often collected by appending section after section, and the sections may differ from each other in volume and characteristics. Therefore, algorithms employing a sequential division of such datasets among machines will often fail to achieve load balance, as an inevitable result. The concept of our partition method can be depicted through a simple problem: how to divide three buckets of bananas, apples, and lemons among three people as equally as possible in both weight and characteristics. A division in which each person takes one of the buckets, even under a constraint of equal weight, reveals a serious imbalance of characteristics. Instead, each person in turn takes one banana until the banana bucket is empty, and so forth for the two remaining buckets; with this division, balance in both characteristics and weight is guaranteed. With the above idea, a global dataset D is horizontally partitioned into N local datasets Di (0 ≤ i < N) following Eq. (3), each of which is assigned to one slave machine; Tj is the jth transaction of D in sequence.
  Di = {Tj | Tj ∈ D ∧ j mod N = i}    (3)
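A minimal sketch of the round-robin partitioning of Eq. (3): slave i keeps every N-th transaction starting from offset i, so every section of the dataset is interleaved across all slaves. (The file path and method name are illustrative; the same modulo scheme is reused for partitioning L2 in Eq. (4) of Section 5.3.)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of Eq. (3): slave i keeps transaction j iff j mod N == i.
    static List<String> localPartition(String datasetPath, int N, int i) throws IOException {
        List<String> Di = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(datasetPath))) {
            String transaction;
            long j = 0;
            while ((transaction = reader.readLine()) != null) {
                if (j % N == i) Di.add(transaction);   // round-robin assignment
                j++;
            }
        }
        return Di;
    }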
5.2. Load balance for shared-memory parallelization
As presented in Section 4, most processing of both the Slaves and the Master is executed in shared-memory parallel. The Slaves' and the Master's jobs are vertically divided into independently performed tasks based on the division of the itemset space into separate subspaces. Thanks to that, the Work-Pool model, well known for its high load balance, can be applied. However, the load-balance property of the Work-Pool model will not take effect if the subspaces are extremely unbalanced, for instance when there is one extraordinarily large subspace while the others are very small. In fact, the balance among the subspaces is primarily affected by the means of dividing the space of frequent itemsets of a dataset, not just by the dataset itself. As mentioned in previous subsections, the order of items in itemsets is the increasing order of their support counts, defined in the ordered list of frequent items L1 = {I1, I2, …, Im}. The division method for the space of frequent itemsets used in DP3 is based on the 2-item prefixes IiIj of the itemsets. The subspace with prefix IiIj (i < j) is contributed by suffixes which are ordered combinations of items Ik (j < k ≤ m). Therefore, the smaller the support count of item Ij is, the higher the priority with which the subspaces with prefix IiIj take the frequent itemsets containing Ii and Ij. In general, this makes the distribution of frequent itemsets among the subspaces tend toward equilibrium; consequently, the Work-Pool model can contribute its very high load balance to the DP3 algorithm. Conversely, it is likely that the bigger the support counts of items Ii and Ij are, the greater the number of frequent itemsets containing Ii and Ij is. With the reverse order of items in itemsets, the subspaces with prefixes IiIj of larger-support-count items Ij have higher priority to take the frequent itemsets containing Ii and Ij; this can hurt the balance on some datasets. For example, FP-Growth-based parallel algorithms and others apply this reverse order for high compression of transactions, but they get stuck in extremely unbalanced loads when mining the Webdocs dataset. Fig. 3 illustrates the impact of the two division strategies on the distribution of frequent itemsets among subspaces: Fig. 3(a) for the increasing order of items and Fig. 3(b) for the reverse order. The mining result of frequent itemsets is from Table 1, and the subspaces in this example are defined by prefixes of one item. When changing from the decreasing order to the increasing order of items, the biggest subspace (item 3) in Fig. 3(b) gives all its itemsets to other subspaces, and the subspace of item 2 sends away two itemsets and receives one. Obviously, the balance among the subspaces in Fig. 3(a) is better than that in Fig. 3(b).

Fig. 3. The impact of division methods on the balance among subspaces.
5.3. Solutions for memory scalability

The disjoint partitioning of large datasets in DP3 is inherently a measure for the memory scalability problem. After finishing Phase 2, the local PPC trees are no longer needed, and their memory is released. However, these measures are not sufficient in the case of immense numbers of frequent itemsets, since large memory would be required to maintain all N-lists and the total FPO tree at each machine. Fortunately, the subspaces of frequent itemsets, i.e. the division based on 2-itemset prefixes, are separate from each other and can be mined independently with the Depth-First strategy. All the subspaces are grouped into ℓ groups, and each group is discovered in one loop of Phase 3. The memory occupied in a finished loop is released for use by the next loop. The larger the value ℓ is, the smaller the groups are, and therefore the smaller the memory required in each loop at each machine. Experiments have proved the effectiveness of this measure. The following is the detail of the subspace grouping. The candidate 2-itemsets of L2 are in the same order as their respective nodes in the FPO tree T2, and sibling nodes are in the increasing order of the items' support counts. With the subspace division presented in Section 5.2, in an overview, larger subspaces come from 2-itemsets in the middle of L2, while smaller ones come from either end. A sequential division of L2 into parts to be processed in loops could therefore create an imbalance in computation load and memory. Instead, we apply the same partition method expressed in (3); the memory balance becomes much better, and the memory scalability can be easily controlled with the value ℓ. Each partition L2^j of L2 (0 ≤ j < ℓ) is expressed in (4).
  L2^j = {ISi | ISi ∈ L2 ∧ i mod ℓ = j}    (4)
In the case of many large FPO subtrees from a large number of Slaves, the Master can decide to store some of the received subtrees on disk to avoid memory shortage; they are loaded into memory only when they are merged.

6. Experiments

All experiments were performed on a cluster of 5 machines, each of which has 46 GB of memory and 8 cores from two Xeon Quad-core X5570 CPUs @ 2.93 GHz. The cluster had Hadoop 2.7.7 installed, and we configured the Hadoop environment so that a Hadoop application can use up to 7 cores and 42 GB of memory at each machine. We conducted experiments on the dataset Webdocs (Lucchese et al., 2004), a real-life huge dataset commonly used in Frequent Itemsets Mining. We also experimented on a synthetic dataset generated with an alternative implementation (Synthetic data generator, 2016) of the IBM Almaden Quest generator (the IBM generator can no longer be downloaded). The command to generate the synthetic dataset is: ./gen lit -ntrans 10,000 -tlen 50 -nitems 10 -npats 20,000 -patlen 4 -fname synthetic -ascii. Besides the two big datasets, three other small ones, Chess, T40I10D100K, and T10I4D100K, were used to add data diversity for better evaluating our algorithm. T40I10D100K and T10I4D100K are synthetic, and Chess is real. These three datasets and Webdocs can be obtained from http://fimi.uantwerpen.be/data/. The properties of the five datasets are reported in Table 3. We have implemented the DP3 algorithm in Java 1.8 and shared the source code of DP3 at https://gitlab.com/vqphuynh/dp3-algorithm. To evaluate the performance, our algorithm is compared with the well-known algorithm PFP (Li et al., 2008) and three other highly efficient ones: Dist-Eclat, BigFIM (Moens et al., 2013), and MapFIM (Duong et al., 2017). The source code of Dist-
Table 3
Characteristics of the datasets.

Dataset     | # Transac. | # Distinct Items | Avg Length | File Size
Webdocs     | 1,692,082  | 5,267,656        | 177        | 1.4 GB
Synthetic   | 10,000,000 | 10,000           | 50         | 2.4 GB
Chess       | 3196       | 75               | 37         | 334 KB
T40I10D100K | 100,000    | 942              | 40         | 14.7 MB
T10I4D100K  | 100,000    | 870              | 10         | 3.83 MB
The source code of Dist-Eclat and BigFIM, written on Hadoop 1.2.1, is provided by the authors at https://gitlab.com/adrem/bigfim-sa, and the implementation of MapFIM on Hadoop 2.7.3 is shared by the authors at https://github.com/chuongdk/MapFIMv2. We applied some minor framework-related adaptations to the source code of BigFIM and Dist-Eclat so that these two algorithms can execute on the Hadoop 2.7.x framework. The implementation of PFP can be found in the library Apache Mahout 0.8, which outputs the Top-k frequent itemsets using a hash structure. For fair comparisons, some authors had forced this program to discover all frequent itemsets by setting the k value equal to the total number of frequent itemsets when comparing it with their algorithms. However, this causes the implementation of PFP to perform less efficiently and to overflow memory when mining the Webdocs dataset, even when executing on machines with larger memory (Duong et al., 2017; Moens et al., 2013). Therefore, we have implemented PFP on Hadoop 2.7.7 to give the PFP algorithm a fairer competition. In all experiments, the execution time was limited to 48 h and measured in seconds. Dist-Eclat and BigFIM were configured with the parameter p = 3, the length of the prefixes to mine before distributing the search space further, as suggested by the authors for the best performance. The parameter β of MapFIM was set to 100% to achieve the best performance, as reported in the authors' paper.

6.1. Comparison with other algorithms

Distributed parallel FIM algorithms can be categorized into two general strategies, temporarily called S1 and S2.
• S1: Worker machines mine on separate data partitions to discover overlapping spaces of frequent itemsets.
• S2: Worker machines mine on overlapping conditional data partitions to discover separate spaces of frequent itemsets.
DP3 belongs to S1 while the others fall into S2. S1 copes with big volumes (from the numbers and lengths of transactions) better than S2, while S2 has an advantage over S1 for large numbers of frequent itemsets. This principle can be used to broadly explain the performance of DP3 compared with that of the others. Fig. 4 depicts the performance of the five algorithms mining the synthetic dataset. DP3 was configured to run with λ = 1 and 5 slave processes (one machine runs both the master and a slave process). DP3 executes much faster than the four remaining algorithms: its running time ranges between 70 and 112 s as ε varies from 0.6% to 0.1%.

Fig. 4. The execution time of the algorithms with the Synthetic dataset.

The performance of BigFIM is superior to that of MapFIM, PFP, and Dist-Eclat down to ε = 0.2%. However, at ε = 0.1%, BigFIM runs much slower than the other algorithms, taking 8414 s to complete its mining, since the number of frequent itemsets increases significantly to 153,614 from just 33,686 at ε = 0.2%. Table 4 reports the performance of the algorithms with the Webdocs dataset. The rightmost column shows the numbers of frequent itemsets mined from the dataset with the corresponding support threshold values in the range [0.05, 0.2]. Webdocs possesses a tremendous number of frequent itemsets that increases dramatically for the last three ε values, and large portions of these frequent itemsets are long ones. Webdocs forces PFP and similar algorithms to discover frequent itemsets on highly unbalanced subspaces, and the same problem arises to mitigated degrees for Dist-Eclat, BigFIM, and MapFIM; for the synthetic dataset, the subspaces are somewhat uniform. Therefore, Webdocs is a hard-to-mine dataset, much harder to mine than the synthetic dataset despite its smaller number of transactions. Mining Webdocs, DP3 runs with 4 slave processes, λ = 1 for ε ≥ 0.06 and λ = 6 for ε = 0.05. As shown in Table 4, DP3 is obviously the fastest algorithm and far outperforms MapFIM, the second fastest. DP3 takes just a few minutes to discover 165.714 million frequent itemsets at ε = 0.05. MapFIM is somewhat slow when mining the large synthetic dataset; with Webdocs, however, it exhibits very high performance, much better than that of Dist-Eclat, BigFIM, and PFP, especially for the lower ε values. As shown in Table 4, both MapFIM and Dist-Eclat can perform well for large or small numbers of frequent itemsets, and they are much superior to BigFIM and PFP when discovering large numbers of frequent itemsets. BigFIM and PFP work efficiently when the number of frequent itemsets is small and most of them are not long; when this condition is not satisfied, their performance decreases sharply, and their running time becomes too long for ε ≤ 0.1, exceeding 2 days for ε ≤ 0.07.
Table 4
The execution time (seconds) of the algorithms with the Webdocs dataset.

ε    | DP3 | MapFIM | Dist-Eclat | BigFIM      | PFP         | #Frequent Itemsets
0.2  | 24  | 172    | 415        | 463         | 266         | 1616
0.15 | 25  | 240    | 556        | 2839        | 346         | 10,388
0.1  | 33  | 448    | 1504       | 21,093      | 13,924      | 217,774
0.09 | 36  | 548    | 2009       | 34,720      | 86,111      | 527,096
0.08 | 43  | 682    | 2642       | 61,103      | 169,328     | 1.484 M
0.07 | 59  | 931    | 3972       | Out of time | Out of time | 5.139 M
0.06 | 128 | 1854   | 6056       | Out of time | Out of time | 23.672 M
0.05 | 500 | 3235   | 9689       | Out of time | Out of time | 165.714 M
Table 5
The execution time (seconds) of the algorithms with the Chess dataset.

ε    | DP3-λ     | MapFIM | Dist-Eclat | BigFIM | PFP  | #Frequent Itemsets
0.4  | 16 - 1    | 85     | 68         | 253    | 56   | 6.472 M
0.35 | 37 - 1    | 93     | 70         | 295    | 65   | 15.192 M
0.3  | 75 - 2    | 131    | 76         | 381    | 103  | 37.501 M
0.25 | 202 - 4   | 258    | 82         | 463    | 257  | 99.022 M
0.2  | 708 - 12  | 713    | 91         | 597    | 567  | 291.225 M
0.15 | 3884 - 40 | 2205   | 131        | 830    | 4064 | 1007.117 M
Table 6
The execution time (seconds) of the algorithms with the T40I10D100K dataset.

ε     | DP3-λ   | MapFIM | Dist-Eclat | BigFIM      | PFP | #Frequent Itemsets
0.006 | 9 - 1   | 119    | 84         | 9072        | 120 | 1.024 M
0.005 | 10 - 1  | 120    | 89         | 11,306      | 123 | 1.286 M
0.004 | 13 - 1  | 122    | 96         | 15,308      | 133 | 1.973 M
0.003 | 21 - 1  | 126    | 116        | 22,140      | 143 | 5.058 M
0.002 | 37 - 1  | 134    | 152        | 40,401      | 167 | 12.447 M
0.001 | 213 - 3 | 251    | 272        | Out of time | 368 | 70.470 M
Table 7
Average execution time (seconds) of FPO tree operations.

ε    | # Itemsets | Construction Time | Send/Receive Time | Merge Time (4 local FPO trees)
0.08 | 1.51 M     | 0.17              | 0.4 / 0.43        | 0.27
0.07 | 5.28 M     | 0.51              | 1.6 / 1.64        | 0.67
0.06 | 24.45 M    | 3.37              | 7.14 / 7.29       | 1.4
0.05 | 28.76 M    | 4.93              | 7.74 / 8.86       | 3.5
The reasons are that BigFIM employs the Apriori method, which is heavily affected by the combined effect of the numbers of transactions and frequent itemsets, while PFP incurs a great deal of memory and computational overhead for many big and deeply recursive conditional FP-trees because of heavily unbalanced subspaces.

Table 5 reports the performance of the algorithms with the Chess dataset. The column "DP3-λ" shows the running time of DP3 together with the number of loops in Phase 3. The Chess dataset is very small but contains immense numbers of frequent itemsets; it therefore gives a double advantage to strategy S2. In general, Dist-Eclat is the most efficient algorithm on Chess. For ε ≥ 0.3, DP3 runs fastest, but it becomes slower than some of the remaining algorithms for the other ε values. At ε = 0.15, with the explosion of the mining result, PFP endures many deeply recursive conditional FP-trees that significantly slow down its execution, and it is interesting that BigFIM behaves quite well since the number of transactions is so small. Table 6 shows the running time of the algorithms with the small dataset T40I10D100K, which contains very large numbers of frequent itemsets. Like Chess, T40I10D100K also gives a double advantage to S2, but to mitigated degrees. In the experiments with this dataset, DP3 is the fastest algorithm. Although BigFIM belongs to S2, the advantages from T40I10D100K do not help its performance, for the same reason as in the experiments with Webdocs. Fig. 5 visualizes the performance of the algorithms mining T10I4D100K. The dataset is small and contains small numbers of frequent itemsets, just 27,532 at ε = 0.1%. DP3 runs with λ = 1 for all ε values. Obviously, DP3 executes much faster than the remaining algorithms, taking approximately 1 s for all ε values. PFP is the second fastest algorithm, while BigFIM is the slowest, taking 952 s at ε = 0.1%. From the experiments with the five datasets, we can see that DP3 executes more stably than the remaining four algorithms, whose performance ranks change from dataset to dataset. DP3 outperforms the other algorithms in most experimental conditions. For very small datasets that nevertheless yield extremely large numbers of frequent itemsets (the Chess dataset with ε ≤ 0.2), some algorithms of strategy S2 exploit this double advantage and run faster than DP3. However,
small datasets are not the target that distributed algorithms are designed for.

6.2. Extensive benchmarking

Table 7 reports the average performance of the FPO tree operations, benchmarked while mining the Webdocs dataset at different ε values from 0.08 to 0.05. The column "# Itemsets" shows the average numbers of itemsets from which the corresponding FPO trees are constructed, sent, received, and merged. The merging operation is performed by the Master, which aggregates 4 local FPO trees from the 4 Slaves. For example, at ε = 0.05, an FPO tree of 28.76 million itemsets is built, sent, and received in 4.93, 7.74, and 8.86 s respectively; and with the FPO tree, the Master aggregates 28.76 ∗ 4 = 115.04 million itemsets in just 3.5 s (a sketch of the merge idea follows this paragraph). In addition to the basic theory introduced in Section 3, this benchmark empirically confirms the especially high efficiency of the FPO tree in manipulating tremendous numbers of itemsets. Fig. 6 and Fig. 7 respectively show the variation of the running time and consumed memory at both sides of DP3 while mining the Webdocs dataset at ε = 0.05 with different numbers of loops (λ) in Phase 3 of the algorithm.
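To make the merge step concrete, the following minimal Java sketch shows one plausible way a Master could merge prefix trees with support aggregation and threshold pruning. It is a sketch under assumptions, not the FPO tree implementation itself: the actual node layout is the one defined in Section 3, and all names here (FpoSketchNode, merge, prune, minSupportCount) are hypothetical.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * A simplified prefix-tree node standing in for an FPO tree node.
 * Each path from the root encodes one itemset; a node stores the
 * aggregated support count of the itemset its path represents.
 */
class FpoSketchNode {
    long support;                                   // aggregated support count
    final Map<Integer, FpoSketchNode> children = new HashMap<>();

    /** Merge a local tree received from a Slave into this (Master) tree.
     *  Shared prefixes are walked once and their supports are summed;
     *  prefixes absent on one side are simply linked in. */
    void merge(FpoSketchNode local) {
        support += local.support;
        for (Map.Entry<Integer, FpoSketchNode> e : local.children.entrySet()) {
            children.computeIfAbsent(e.getKey(), k -> new FpoSketchNode())
                    .merge(e.getValue());
        }
    }

    /** After all local trees are merged, drop every itemset whose total
     *  support is below the global minimum support count. Removing a node
     *  discards its whole subtree, since supersets of an infrequent
     *  itemset cannot be frequent. */
    void prune(long minSupportCount) {
        Iterator<FpoSketchNode> it = children.values().iterator();
        while (it.hasNext()) {
            FpoSketchNode child = it.next();
            if (child.support < minSupportCount) {
                it.remove();
            } else {
                child.prune(minSupportCount);
            }
        }
    }
}
```

Since shared prefixes are visited exactly once, the merge cost grows with the size of the merged tree rather than with the total number of transferred itemsets; this is consistent with Table 7, where merging the four local trees stays within a few seconds even at ε = 0.05.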
Fig. 5. The execution time of the algorithms with T10I4D100K dataset.
Fig. 6. The execution time at both sides of DP3 with different values of λ.

Fig. 7. The consumed memory at both sides of DP3 with different values of λ.

The running time and consumed memory at the Slave side are presented as average values with the corresponding sample standard deviations calculated over the individual slave machines. The small sample standard deviations show that DP3 achieves a high balance in computational load and memory among the slave machines. The running time at the Master side, which is also the total running time of DP3, increases from 500 to 513 s as λ varies from 6 to 18. The trade-off between performance and consumed memory at both sides is thus controlled by the number of loops; however, only a slight reduction in performance is paid for a remarkable memory benefit as λ increases. Fig. 8 depicts the speed-up of DP3 mining the Webdocs dataset at ε = 0.07 and λ = 1. Each machine performs a slave process, and one machine additionally executes the master process. For ε ≤ 0.06, the single-machine PrePost+ algorithm cannot complete its mining because of memory overflow. The speed-up is benchmarked against the performance of the shared-memory parallel PrePost+ presented in Section 4.2. As shown in Fig. 8, DP3 achieves very high speed-up, approximately twice the standard speed-up; compared with the original serial PrePost+, the speed-up is even much higher.

Fig. 8. Speed-up of DP3 algorithm mining on Webdocs at ε = 0.07, λ = 1.

7. Conclusions

In this paper, we proposed DP3, a distributed parallel algorithm for Frequent Itemsets Mining operating in the Master-Slaves model. Itemsets are exchanged between the Slaves and the Master in the form of complete FPO trees, which provide optimal compactness for light data transfers and highly efficient aggregations with pruning ability. The processing phases of the Slaves and Master are sped up with shared-memory parallelization in the Work-Pool model so as to exploit the computational power of multi-core CPUs. The high load balance of DP3 is achieved by considering two parallelization levels: shared memory within a single machine with the Work-Pool model, and the distributed environment with a data partition strategy aiming at balancing volume and data characteristics among partitions. The memory scalability problem can be managed with loops that discover balanced groups of independent subspaces of frequent itemsets; the performance of DP3 remains quite stable while a considerable memory benefit is attained as the number of loops increases. Based on the fundamental theory and the experimental results, DP3 has shown very high efficiency and speed-up. Its performance is much more stable than, and far superior to, that of recent high-performance algorithms in the literature when mining datasets with different characteristics. By substituting counterpart methods for the current method of discovering frequent itemsets, which is based on the state-of-the-art serial algorithm PrePost+, DP3 with the FPO tree can be developed into a general algorithm, offering more options for many kinds of datasets and for users' hardware conditions and demands in practice.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Credit authorship contribution statement
Van Quoc Phuong Huynh: Methodology, Conceptualization, Formal analysis, Investigation, Writing - original draft. Josef Küng: Resources, Supervision, Writing - review & editing.

Acknowledgments

We would like to express our thanks to Mr. Kujundžić, Systems Administrator, Scientific Computing Information Management, JKU, for his support and collaboration in installing and configuring the Hadoop framework for stable execution and optimal utilization of hardware resources.

References

Agrawal, R., & Shafer, J. C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962–969.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th international conference on very large data bases, VLDB: 1215 (pp. 487–499).
Apache Hadoop (2011). http://hadoop.apache.org/ Accessed July 2018.
Apache Spark (2014). http://spark.apache.org/ Accessed July 2018.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Deng, Z. H., & Lv, S. L. (2014). Fast mining frequent itemsets using Nodesets. Expert Systems with Applications, 41(10), 4505–4512.
Deng, Z. H., & Lv, S. L. (2015). PrePost+: An efficient N-lists-based algorithm for mining frequent itemsets via Children–Parent Equivalence pruning. Expert Systems with Applications, 42(13), 5424–5432.
Duong, K. C., Bamha, M., Giacometti, A., Li, D., Soulet, A., & Vrain, C. (2017). MapFIM: Memory aware parallelized frequent itemset mining in very large datasets. In Proceedings of the international conference on database and expert systems applications (pp. 478–495). Springer.
Grahne, G., & Zhu, J. (2005). Fast algorithms for frequent itemset mining using FP-trees. IEEE Transactions on Knowledge and Data Engineering, 17(10), 1347–1362.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD record: 29 (pp. 1–12). ACM.
Huynh, V. Q. P., Küng, J., & Dang, T. K. (2017a). Incremental frequent itemsets mining with IPPC tree. In Proceedings of the international conference on database and expert systems applications (pp. 463–477). Springer.
Huynh, V. Q. P., Küng, J., Jäger, M., & Dang, T. K. (2017b). IFIN+: A parallel incremental frequent itemsets mining in shared-memory environment. In Proceedings of the international conference on future data and security engineering (pp. 121–138). Springer.
Li, H., Wang, Y., Zhang, D., Zhang, M., & Chang, E. Y. (2008). PFP: Parallel FP-growth for query recommendation. In Proceedings of the ACM conference on recommender systems (pp. 107–114). ACM.
Li, N., Zeng, L., He, Q., & Shi, Z. (2012). Parallel implementation of Apriori algorithm based on MapReduce. In Proceedings of the 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing (pp. 236–241). IEEE.
Lin, C., & Gu, J. (2016). PFIN: A parallel frequent itemset mining algorithm using nodesets. International Journal of Database Theory and Application, 9(6), 81–92.
Lin, M. Y., Lee, P. Y., & Hsueh, S. C. (2012). Apriori-based frequent itemset mining algorithms on MapReduce. In Proceedings of the 6th international conference on ubiquitous information management and communication (p. 76). ACM.
Liu, G., Lu, H., Lou, W., Xu, Y., & Yu, J. X. (2004). Efficient mining of frequent patterns using ascending frequency ordered prefix-tree. Data Mining and Knowledge Discovery, 9(2), 249–274.
Liu, J., Wu, Y., Zhou, Q., Fung, B. C., Chen, F., & Yu, B. (2015). Parallel Eclat for opportunistic mining of frequent itemsets. In Proceedings of the international conference on database and expert systems applications (pp. 401–415). Springer.
Lucchese, C., Orlando, S., Perego, R., & Silvestri, F. (2004). WebDocs: A real-life huge transactional dataset. In Proceedings of the FIMI: 126. http://fimi.ua.ac.be/data/webdocs.dat.gz Accessed August 2018.
Makanju, A., Farzanyar, Z., An, A., Cercone, N., Hu, Z. Z., & Hu, Y. (2016). Deep parallelization of parallel FP-growth using parent-child MapReduce. In Proceedings of the IEEE international conference on big data (Big Data) (pp. 1422–1431). IEEE.
Moens, S., Aksehirli, E., & Goethals, B. (2013). Frequent itemset mining for big data. In Proceedings of the IEEE international conference on big data (pp. 111–118).
Park, J. S., Chen, M. S., & Yu, P. S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering, 9(5), 813–825.
Perego, R., Orlando, S., & Palmerini, P. (2001). Enhancing the Apriori algorithm for frequent set counting. In Proceedings of the international conference on data warehousing and knowledge discovery (pp. 71–82). Springer.
Pramudiono, I., & Kitsuregawa, M. (2003). Parallel FP-growth on PC cluster. In Proceedings of the Pacific-Asia conference on knowledge discovery and data mining (pp. 467–473). Springer.
Qiu, H., Gu, R., Yuan, C., & Huang, Y. (2014). YAFIM: A parallel frequent itemset mining algorithm with Spark. In Proceedings of the IEEE international parallel & distributed processing symposium workshops (IPDPSW) (pp. 1664–1671). IEEE.
Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. In Proceedings of the 20th international conference on very large data bases, VLDB (pp. 432–444).
Sun, S., & Zambreno, J. (2008). Mining association rules with systolic trees. In Proceedings of the international conference on field programmable logic and applications (pp. 143–148). IEEE.
Synthetic data generator (2016). https://github.com/zakimjz/IBMGenerator Accessed August 2018.
Xun, Y., Zhang, J., & Qin, X. (2016). FiDoop: Parallel mining of frequent itemsets using MapReduce. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 46(3), 313–325.
Yu, K. M., & Zhou, J. (2010). Parallel TID-based frequent pattern mining algorithm on a PC cluster and grid computing system. Expert Systems with Applications, 37(3), 2486–2494.
Zaki, M. J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3), 372–390.
Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1(4), 343–373.
Zhou, L., Zhong, Z., Chang, J., Li, J., Huang, J. Z., & Feng, S. (2010). Balanced parallel FP-growth with MapReduce. In Proceedings of the IEEE youth conference on information computing and telecommunications (YC-ICT) (pp. 243–246). IEEE.