Scalable out-of-core itemset mining


Information Sciences 293 (2015) 146–162


Elena Baralis, Tania Cerquitelli*, Silvia Chiusano, Alberto Grand
Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy

Article history: Received 3 December 2013; Received in revised form 12 June 2014; Accepted 26 August 2014; Available online 18 September 2014

Keywords: Itemset mining; Data mining

Abstract

Itemset mining looks for correlations among data items in large transactional datasets. Traditional in-core mining algorithms do not scale well with huge data volumes: they are hindered by critical issues such as long execution times due to massive memory swap and main-memory exhaustion. This work aims at overcoming the scalability issues of existing in-core algorithms by improving their memory usage. A persistent structure, VLDBMine, is proposed to compactly store huge transactional datasets on disk and efficiently support large-scale itemset mining. VLDBMine provides a compact and complete representation of the data by exploiting two different data structures suited to diverse data distributions, and includes an appropriate indexing structure allowing selective data retrieval. Experimental validation, performed on both real and synthetic datasets, shows the compactness of the VLDBMine data structure and the efficiency and scalability on large datasets of the mining algorithms it supports.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Itemset mining is an exploratory data mining technique, widely employed to discover valuable, non-trivial correlations between data items in a transactional dataset. The first attempt to perform itemset mining [3] focused on discovering frequent itemsets, i.e., patterns whose observed frequency of occurrence in the source data is above a given threshold. Frequent itemsets find application in a number of real-life contexts, such as market basket analysis [3], recommendation systems [19], and telecommunication networks [20].

Frequent itemset mining algorithms have traditionally addressed time scalability, with increasingly efficient solutions that limit the combinatorial complexity of the problem by effectively pruning the search space. To extract knowledge efficiently, most algorithms exploit ad hoc data structures that rely heavily on the available physical memory. However, while the size of real-world databases steadily grows at an exponential rate, mining algorithms are still lagging behind, yielding poor CPU utilization and massive memory swap, thus significantly increasing execution time and facing the serious bottleneck of main memory. In spite of the increasing availability of physical memory in modern systems, the continuous growth in the amount of analyzed data calls for novel strategies to speed up and scale data mining algorithms, in particular methods that exploit secondary storage in the mining process.

Recently, disk-based extraction algorithms have received increasing interest. These approaches rely on disk-based data structures to represent the transactional dataset. However, the proposed structures support specific mining algorithms, typically address specific data distributions, and often provide only limited scalability.


This motivates the work described in this paper. The proposed framework, named VLDBMine, includes a persistent transactional data representation and a set of creation and access primitives to efficiently support large-scale itemset mining. The challenge of this work is to effectively support existing in-core algorithms by enhancing their memory usage, thus overcoming scalability issues. The VLDBMine data structure can be profitably exploited to support a variety of state-of-the-art in-core itemset extraction algorithms (e.g., for maximal and/or closed itemsets) when the latter outstrip the available memory. Two strategies (loosely- and tightly-coupled) have been proposed to integrate VLDBMine into such mining algorithms, enhancing their scalability. In particular, the tightly-coupled strategy offers the best scalability, by loading, in each mining step, only the data locally required.

VLDBMine is based on a compact disk-based representation, called Hybrid-Tree (HY-Tree), of the whole transactional dataset. The HY-Tree exploits two different array-based node structures to adapt its data representation to diverse data distributions. Both structures are variable-length arrays that store different information to compactly represent the dense and the sparse portions of the dataset, respectively. The selection of the node types is automatically driven by the data distribution. VLDBMine also includes an indexing structure, named the Item-Index, which supports selective access to the HY-Tree portion needed for the extraction task.

The VLDBMine performance has been evaluated by means of a wide range of experiments with datasets characterized by different sizes and data distributions. As a representative example, VLDBMine has been integrated with LCM v.2 [31], an efficient state-of-the-art algorithm for itemset extraction. The run time of frequent itemset extraction based on VLDBMine is always comparable to or better than that of LCM v.2 [31] accessing data on a flat file. VLDBMine-based frequent itemset extraction also exhibits good scalability on large datasets.

The paper is organized as follows. Section 2 introduces the VLDBMine data structure, while Section 3 describes its physical organization. Section 4 presents the proposed technique to build VLDBMine on disk. Section 5 discusses the loosely-coupled and tightly-coupled integration strategies, and describes data retrieval techniques to support the data loading phase. Section 6 presents the integration of VLDBMine in the LCM v.2 algorithm. Section 7 discusses how to address the main issues in incrementally updating VLDBMine. The experiments evaluating the effectiveness of the proposed data structure are presented in Section 8. Section 9 reviews existing work in the wide area of frequent itemset mining, focusing on the different disk-based solutions proposed in the literature. Finally, Section 10 draws conclusions and presents future developments of the proposed approach.

2. The VLDBMine data structure

The VLDBMine persistent representation of the dataset is based on the HY-Tree data structure. The HY-Tree is a prefix-tree-based structure, which encodes the entire dataset and all the information needed to support transaction data retrieval. This tree is hybrid because two different array-based node structures coexist in it to represent tree nodes and, thus, to reduce the tree size by adapting the data structure to the data distribution.
VLDBMine also includes the Item-Index, an auxiliary structure providing selective access to the HY-Tree portion needed for the current extraction task. VLDBMine has been designed to efficiently scale up the itemset mining process on very large transactional datasets. According to the standard definition of transactional datasets, an itemset represents a co-occurrence of items without any temporal ordering of events [18]. A dataset, used as a running example, is reported in Table 1. The corresponding HY-Tree and Item-Index are shown in Figs. 1 and 2, respectively.

2.1. The HY-Tree data structure

The HY-Tree has a prefix-tree-like structure. Each transaction is represented by a single path, but a prefix path may represent the common prefix of multiple transactions. In the paths, items are sorted by decreasing values of their global support, given by the number of dataset transactions including each item, and by increasing lexicographic order in case of equal support.

Table 1
Example dataset.

TID   Items                   TID   Items
T1    a, b, e, r, x           T12   c, e, f, i, o, p, x
T2    h, l, o                 T13   b, d, g, p, x
T3    a, c, g, i, k           T14   d, p
T4    b, d, e, g, p, v        T15   b, h
T5    d, j, p                 T16   b, h, l, q
T6    b, i, n, r, s, u        T17   a, e, j, k, w, x
T7    c, h, z                 T18   a, b, e, r, t, x
T8    a, i, s, t              T19   b, d, e, m, n, x
T9    h, i                    T20   b, h, q
T10   a, b, i, n, r, z        T21   h, l, v, w
T11   b, d, e, n, x
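For concreteness, the total order imposed on items along HY-Tree paths can be expressed as a comparator. The following C++ sketch is illustrative only; the Item representation is an assumption, not the authors' implementation.

#include <cstdint>
#include <string>

// Total order over items along HY-Tree paths: decreasing global support,
// with ties broken by increasing lexicographic order (hypothetical struct).
struct Item {
    std::string   label;
    std::uint64_t globalSupport;
};

bool precedesInPath(const Item& a, const Item& b) {
    if (a.globalSupport != b.globalSupport)
        return a.globalSupport > b.globalSupport;  // higher support first
    return a.label < b.label;                      // lexicographic tie-break
}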


Fig. 1. HY-Tree for the example dataset.

Fig. 2. Item-Index for the example dataset.

Since the HY-Tree is a disk-based data representation, the mining performance is significantly affected by the number of blocks that are read from disk to retrieve the HY-Tree portion needed for the analysis. Hence, the HY-Tree size (i.e., the number of disk blocks storing the tree) becomes critical to reducing the I/O cost. The following strategies have been adopted to reduce the HY-Tree size: (i) each tree node compactly represents a set of items, and (ii) two data structures coexist to represent tree nodes, providing compact storage for different data distributions.

The HY-Tree is characterized by two array-based node structures, called Horizontal Nodes (HNodes) and Vertical Nodes (VNodes). Both compactly represent a set of items as a variable-length array. Each cell in the array corresponds to an item, and items are sorted within the node by decreasing value of their global support. HNodes are usually located in the upper part of the HY-Tree, a branched part of the tree representing the dense portion of the dataset; hence, HNodes include items with rather high support. VNodes are located in the lower part of the HY-Tree, representing the sparse portion of the dataset; thus, VNodes include items with low support. In Fig. 1, HNodes are represented as horizontal rectangles with bold borders, and VNodes as vertical rectangles.


Horizontal Nodes (HNodes). HNodes represent a set of item cells sharing the same transaction prefix path. For example, HNodes N1 and N5 include 5 (b, e, h, a, and d) and 1 (r) item cells, respectively. All cells in an HNode are siblings, i.e., direct descendants of the same cell in the parent node, and are sorted by decreasing global support. A subpath from the tree root to an HNode N represents the common prefix path of multiple transactions. These transactions include all items found along the direct subpath reaching node N (i.e., disregarding items in sibling cells along the path), together with one of the items in N. Within the HNode, each item is associated with a local support value, representing the number of transactions sharing the prefix path reaching the item. Each item cell in an HNode is thus a pair <item : local support>. Consider the subpath from N1 to N3. The subpath <N1:[b:11], N2:[e:5], N3:[x:4]> reaches cell [x:4] in N3 with item x. The subpath <N1:[b:11], N2:[e:5], N3:[d:1]> reaches cell [d:1] in N3 with item d. The two subpaths share the same prefix path <N1:[b:11], N2:[e:5]> and differ in the last item, x and d respectively. In the dataset, four transactions (TID 1, 11, 18, and 19) share prefix <b, e, x>. Accordingly, the local support of item x in node N3 is 4. Subpath <N1:[b:11], N2:[e:5], N3:[d:1]> represents a single transaction (TID 4), because the local support of d is 1. Each cell in an HNode is linked to its child node (if any) through a child node pointer. This pointer stores the physical location of the corresponding child node. A parent node pointer also links each HNode to the corresponding parent cell (in the parent node). Child and parent pointers allow top-down and bottom-up tree traversal, respectively.

Vertical Nodes (VNodes). VNodes represent transaction suffix paths that are unique, i.e., not shared among multiple transactions (according to the total order defined over HY-Tree paths). Items in these subpaths are characterized by a local support equal to 1. A VNode stores all items in the unique transaction suffix path except the first one. Although this item has a local support equal to 1, its parent cell is located in an HNode and has a local support larger than 1; hence, it may have sibling item cells. For this reason, the first item of an exclusive suffix path is stored in an HNode. The cell including this item is called the sentinel cell. The only child node (if any) of a sentinel cell is the VNode storing all remaining items in the suffix path. In the VNode, the local support is omitted, being unitary for all items. Each sentinel cell is linked to its child VNode through a child node pointer, storing the physical location of the VNode. For example, VNode N6 in Fig. 1 includes items p, g, and v. Cell [d:1] in HNode N3 is the sentinel cell of N6. Subpath <N3:[d:1], N6:[p, g, v]> from sentinel cell N3:[d:1] through VNode N6 represents the non-shared suffix path of transaction 4 in the example dataset. Sentinel cells are shaded in Fig. 1.
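As a concrete reference, the two node layouts can be sketched as follows in C++. This is a minimal sketch under stated assumptions: field names, pointer widths, and the use of std::vector are illustrative, not the authors' on-disk format.

#include <cstdint>
#include <vector>

using DiskPtr = std::uint64_t;     // physical location of a node on disk

struct HNodeCell {
    std::uint32_t item;            // item identifier
    std::uint32_t localSupport;    // transactions sharing the prefix path
    DiskPtr       child;           // child HNode or, for sentinel cells, VNode
};

struct HNode {
    DiskPtr parentCell;               // one back-pointer for the whole node
    std::vector<HNodeCell> cells;     // sibling items, by decreasing support
};

struct VNode {
    std::vector<std::uint32_t> items; // unique suffix; local support is
                                      // implicitly 1, so no per-item support
                                      // or pointers are stored
};

Note how the layout mirrors the compression argument above: a single parent pointer is shared by all sibling cells of an HNode, and VNode cells carry neither supports nor pointers.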
2.2. The Item-Index data structure

The Item-Index is an additional array-based structure supporting the selective retrieval of the HY-Tree paths including a given item. Similar to the FP-tree header table, the Item-Index lists all items, sorted as in the HY-Tree paths, together with their global support value. In addition, the Item-Index includes extra information to support selective access to the HY-Tree and enable the proposed prefetching strategy (see Section 5.1.2). Specifically, the Item-Index stores, for each item, the size of the corresponding support- and frequent item-based dataset projections, and a Pointer-Array containing the information needed to retrieve all tree paths including the item. Each Pointer-Array has as many entries as the number of HY-Tree node cells including the item, and stores pointers to such HNode cells. For each occurrence of an item in a VNode, it contains the pointer to the sentinel cell of the VNode. Each pointer in the array is a physical pointer, and allows selectively loading the corresponding disk block. For example, in Fig. 2 the Pointer-Array for item d has four entries. Three entries contain pointers to the cells including d in HNodes N1, N3, and N4. One entry stores the pointer to the sentinel cell N2:[x:1] of VNode N5.
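The per-item information held by the Item-Index can be summarized by the following hypothetical C++ record; all field names are assumptions for illustration.

#include <cstdint>
#include <vector>

using DiskPtr = std::uint64_t;

struct ItemIndexEntry {
    std::uint32_t item;             // item identifier
    std::uint64_t globalSupport;    // number of transactions containing it
    std::uint64_t supportProjSize;  // size of the support-based projection
    std::uint64_t itemProjSize;     // size of the frequent item-based projection
    std::vector<DiskPtr> pointerArray;  // pointers to HNode cells containing
                                        // the item, or to sentinel cells of
                                        // VNodes where the item occurs
};

The two projection sizes are what later enables the prefetching strategy of Section 5.1.2 to decide how many item projections fit in memory before any path is loaded.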

2.3. HY-Tree comparison with other approaches

This section compares the HY-Tree with structures proposed in the literature to represent transactional data and mine frequent itemsets from it. The HY-Tree has been designed to be a disk-resident representation of the transactional database, complemented with additional information to effectively drive the data retrieval. Hence, it is built once, and then exploited for multiple analysis sessions. By contrast, in-core approaches exploit memory-resident prefix trees, which are generated from scratch in every analysis session. In addition, the HY-Tree is based on different node structures than the ones exploited in other approaches. Below, the HY-Tree is compared with main-memory structures such as the FP-tree [18], Patricia trie [26], FP-array [17], and nonordFP [27], and with disk-resident structures such as the prefix-tree in [15] and the DRFP-tree [1].

The FP-tree [18] exploits the same node structure for the entire dataset, while the HY-Tree relies on two different array-based node structures (i.e., HNodes and VNodes) to compactly represent both dense and sparse portions of the dataset. In the FP-tree, each node stores a parent pointer. To reach child nodes, either an additional set of child pointers (one for each child node), or a single child pointer to the first child node and a sibling pointer to the next sibling node, can be exploited. In both cases, this results in a large number of pointers (about 3 times the number of nodes) being used. In the HY-Tree, each HNode cell stores a child pointer to its descendant HNode. However, since every HNode includes all sibling items, i.e., all direct descendants of the same cell in the parent node, a single parent node pointer is stored for the whole HNode, while pointers among sibling cells are not needed at all. Furthermore, cells in VNodes include neither the local support value nor any pointers, while both pieces of information are stored in every FP-tree node. Thus, the structure of the HY-Tree nodes significantly differs from that of traditional FP-tree nodes, and their size is much smaller.

The Patricia trie [26] is a prefix-tree-like structure which collapses each maximal chain of nodes with the same local item support and a single child into a single node. Thus, a Patricia node includes the local support and the corresponding sequence of items. The VNode structure can be considered a less general variant of the Patricia node. A VNode stores a sequence of items representing a unique transaction suffix. The VNode does not include the local support information and, hence, its size is smaller than that of the Patricia node. In the upper part of the Patricia trie, the traditional FP-tree node is exploited.

The FP-array [17] is an auxiliary data structure designed to speed up FP-tree-based algorithms. It is a symmetric matrix where each element contains the frequency of an ordered pair of items (i.e., the frequencies of the candidate 2-itemsets). An instance of the FP-array is built at every recursive call of the mining algorithm to reduce the number of FP-tree traversals (i.e., the FP-tree data structure is scanned only once at each recursive call). Thus, the FP-array is in no way similar to the proposed array-based HY-Tree.

NonordFP [27] is an array-based representation of a prefix-tree structure. It is based on four arrays, named node-counters, node-parents, item-starts, and item-counters, which include all information needed during the mining process. The data structure can be efficiently used during the mining phase because it is accessed by means of sequential reads of array cells in main memory. The first two arrays have a number of cells equal to the number of nodes in the corresponding FP-tree representation. For each equivalent FP-tree node, the node-counters array stores its local support, while the node-parents array stores the index of its parent cell. Within both arrays, cells associated with the same item are written in consecutive positions. The item-starts and item-counters arrays contain a number of cells equal to the number of frequent items, and store, for each frequent item, the index of the first cell labeled with that item within the node-counters and node-parents arrays, and the global item support, respectively. The HY-Tree, instead, exploits array-based structures to compactly represent FP-tree nodes by preserving the tree representation and exploiting pointers to link nodes with one another.

The tree-like structures proposed in [15,1] are persistent FP-tree-like structures. The approach in [15] has been designed to support out-of-core frequent pattern mining. The node size includes both the item identifier and the parent pointer, while the local item support and the node link pointer are stored in a separate structure. A single node structure is used to represent the entire dataset. Conversely, the HY-Tree has a hybrid tree structure that relies on two array-based node types to represent diverse data distributions. The data structure in [1], named DRFP-tree, is a slight variation of the FP-tree, in which some subpaths (i.e., consecutive sequences of nodes such that every node has a single child) are represented by means of a single node. For such subpaths, a list of items with the corresponding supports, the length of the subpath, and the parent node pointer are stored on disk. The other portion of the tree is modeled by means of the traditional FP-tree node. The VNode structure of the HY-Tree can be considered a special case of a DRFP-tree node. A VNode stores a sequence of items representing a unique transaction suffix.
However, since the VNode does not include the local support information, its size is smaller than that of the DRFP-tree node. Furthermore, in the HY-Tree the HNode structure has been exploited to model the upper part of the tree, whose representation is more compact than the traditional FP-tree node exploited in the DRFP-tree.

3. VLDBMine physical organization

The itemset extraction process requires traversing HY-Tree paths, either top-down or bottom-up, to retrieve the dataset portion relevant for the analysis. Hence, the HY-Tree physical organization should (a) provide efficient data transfer from secondary storage to main memory during this process and (b) reduce multiple reads of the same disk page. To address these issues, an ad hoc physical organization of the HY-Tree and Item-Index data structures has been devised. In addition, the disk pages storing the HY-Tree and Item-Index are internally organized as sets of disk blocks placed in contiguous physical positions.

3.1. HY-Tree physical organization

To minimize the number of disk page reads for path loading, the HY-Tree physical organization addresses two issues: (i) splitting each path across few disk pages, and (ii) clustering on the same disk page subpaths that tend to be accessed together during the extraction task. The HY-Tree data structure is physically organized as a collection of (almost) disjoint chunk-trees. Each chunk-tree includes a subset of adjacent HY-Tree paths, and compactly represents a portion of the dataset. Two adjacent chunk-trees may overlap on at most a single prefix path. This occurs when two paths share the same prefix at the border between two adjacent chunks. The overlapping paths only include HNodes, since VNodes always encode unique transaction suffixes. The example HY-Tree in Fig. 1 is organized in three chunk-trees in Fig. 3(a).

Chunk-trees are stored on disk by splitting each path between upper and lower pages, based on the type of nodes appearing in the path. An upper page can only store HNodes, while a lower page can only include VNodes. With this physical organization, an entire path is stored on at most two disk pages (see Appendix A for an informal discussion of this issue). Each chunk-tree completely fills the space available in the upper page. However, it might occupy only part of the lower page, or it may need more lower pages. To minimize the amount of unused space in each page, each upper page is owned by a single chunk-tree, while lower pages are shared among adjacent chunk-trees. In each upper (lower) page, cells within each HNode (VNode) are stored in contiguous locations. Fig. 3(b) shows how the example HY-Tree is split across pages (assuming small-sized upper and lower pages for the sake of the example). HNodes are stored on a separate upper page for each of the three chunk-trees (P1, P2, and P3, respectively), while the same lower page (i.e., P4), storing VNodes, is shared among all chunk-trees.

Fig. 3. Physical organization for the example HY-Tree: (a) a collection of chunk-trees; (b) disk page occupation (chunk-tree 1: upper page P1, lower page P4; chunk-tree 2: upper page P2, lower page P4; chunk-tree 3: upper page P3, lower page P4).

The physical pointers included in every cell along the paths always link the cell with (parent and child) nodes located in the same page. The only exception is the sentinel cell: although included in an HNode (and thus stored in an upper page), this cell references a VNode (located in a lower page). To guarantee that all paths are always represented on at most two disk pages, prefix paths shared between two adjacent chunk-trees are replicated in both trees. For example, prefix path <N1:[b:11], N2:[h:3]> in Fig. 1 is shared between chunk-trees 1 and 2, and prefix path <N1:[h:4]> between chunk-trees 2 and 3. Cells along the shared prefix paths are replicated in both trees, and the item support in these cells is properly distributed among the two copies. These cells are depicted in gray in Fig. 3(a).

The proposed approach allows significantly reducing the I/O costs for data retrieval. (i) Being split across at most two disk pages, each path is loaded with at most two disk read operations. (ii) Since a page stores multiple subpaths, reading one disk page loads a tree portion in memory; subsequent data retrieval operations may already find the needed data in main memory. (iii) Upper and lower pages contain items with high-to-medium or low support values, respectively. Based on the support threshold, either both upper and lower pages, or upper pages only, are read. The HY-Tree physical organization allows some redundancy to optimize the I/O performance. The following scenarios may occur: (a) duplication of prefix paths at the border between two adjacent chunk-trees and (b) duplication of VNodes either (b.i) in two different lower pages or (b.ii) in the same lower page (for more detail, see Appendix B).

3.2. Item-Index physical organization

The Item-Index references every HNode cell in the HY-Tree. For each item, the Pointer-Array contains pointers to the HNode cells including either the item or the sentinel cell of the VNode where the item occurs. Two compression techniques have been adopted to reduce the size of the pointers stored on disk. (i) Delta coding scheme. Within each Pointer-Array, pointers are sorted by increasing value. For each pointer, the gap with respect to the preceding pointer in the array is stored instead of its actual value. The rationale is that the gaps between pointers are small, requiring much less space to store than the pointers themselves. For infrequent items, which only occur in a small number of HY-Tree nodes, the delta representation is not advantageous, because the gaps between their occurrences are typically large; in this case, the actual pointer value is stored. (ii) Normalized pointers. By definition, pointers in the Pointer-Array always point to the starting byte of an HNode. Since the HNode size is a multiple of the cell size (which is fixed for all HNodes), pointers can be better expressed as multiples of the HNode cell size. Thus, shorter pointer types can be used, requiring less disk space. A sketch of the delta coding scheme is given below.
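The following C++ fragment sketches the gap encoding and decoding of a Pointer-Array, assuming pointers have already been normalized to multiples of the HNode cell size; function names and integer widths are illustrative assumptions, not the actual on-disk format.

#include <cstdint>
#include <vector>

// Encode a Pointer-Array (sorted by increasing value) as gaps between
// consecutive normalized pointers; small gaps fit in a narrower type.
std::vector<std::uint32_t> deltaEncode(const std::vector<std::uint64_t>& ptrs) {
    std::vector<std::uint32_t> gaps;
    std::uint64_t prev = 0;
    for (std::uint64_t p : ptrs) {
        gaps.push_back(static_cast<std::uint32_t>(p - prev));  // store the gap
        prev = p;
    }
    return gaps;
}

// Decode by cumulative sum, restoring the original pointer values.
std::vector<std::uint64_t> deltaDecode(const std::vector<std::uint32_t>& gaps) {
    std::vector<std::uint64_t> ptrs;
    std::uint64_t value = 0;
    for (std::uint32_t g : gaps) {
        value += g;
        ptrs.push_back(value);
    }
    return ptrs;
}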
3.3. Comparison with path tiling techniques

Path tiling [9,15] is a technique aimed at retrieving data stored at a given level of the memory hierarchy (e.g., disk) and loading it into a higher level (e.g., main memory) in such a way that accesses to the latter enjoy improved temporal locality. The technique was first proposed in [15] for cache hints optimization, while in [9] the approach has been exploited in the context of secondary/virtual memory. Both works focus on reading an information tile (i.e., a portion of data) from memory at level n + 1 in the memory hierarchy, to maximize the reuse of data fetched into memory at level n. In [15], an FP-tree is built in main memory in such a way that entire paths can be easily cached. Conditional pattern bases for several items are extracted from each tile brought into the cache. The same approach holds in [9], where a large FP-tree is built resorting to virtual memory. Given portions ("blocks") of the FP-tree are then fetched into main memory from virtual memory (i.e., disk) by a


"page blocking" mechanism, and then reused several times. "Tiles" (respectively, "blocks") do not exist separately in main memory (respectively, on disk); rather, they overlap and are "virtually" delimited by an initial and a final address.

The VLDBMine framework exploits an approach similar to the one presented in [9], but in a different context (i.e., to define the physical data layout on secondary memory) and for a different aim. More specifically, the chunk-tree approach addresses the following issues. (i) Materialization of very large datasets. The chunk-tree partitioning approach overcomes the memory bottleneck by constructing a collection of independent and manageable chunk-trees. One chunk-tree at a time is created in main memory, allowing a relevant memory saving that enables the HY-Tree materialization on disk even for very large datasets. (ii) I/O-efficient data layout. The chunk-tree approach also serves as a disk data layout to improve the spatial locality of data accesses. With massive disk read operations, head seek time becomes a crucial aspect. As a consequence, each path should span the lowest possible number of contiguous disk blocks, so that it can be read without seeking back and forth. This result is achieved by embedding a self-contained portion of data (i.e., a complete chunk-tree) into a single upper page (and possibly a small set of lower pages). With this layout, each path can be loaded by accessing at most two pages (depending on the mining support threshold), each one stored in contiguous disk blocks, significantly decreasing I/O time. The I/O cost is further reduced by the hybrid HY-Tree structure, which adapts its node type to the dense and sparse portions of the dataset. Once a chunk-tree has been loaded in main memory, it is used to extract all relevant paths encoded in it, thus improving the temporal locality. The prefetching technique (described in Section 5.1.2) enhances the benefits of this approach by extracting the relevant paths for a number of different items at the same time. Finally, both the chunk-tree and the path tiling approaches aim to improve the locality of data accesses once the data has been loaded into main memory. The latter does not consider the efficiency of the data loading process, which the former addresses with an ad hoc I/O-efficient data layout.

4. The VLDBMine creation process

The definition of the HY-Tree physical structure is driven by the knowledge of the tree topology and node support values. The tree topology drives tree partitioning into chunk-trees, and defines the cells within each node. Nodes along paths are represented as HNodes or VNodes according to the local support of their entries. Paths within each chunk-tree are stored in upper and lower pages based on their node type. However, all this information is available only once the transactional dataset has been represented as a prefix-tree. To overcome this issue, we devised a technique that incrementally builds the HY-Tree on disk, avoiding the creation of the whole tree structure in main memory. Specifically, since the HY-Tree is a collection of independent chunk-trees, chunk-trees are built one at a time. The HY-Tree creation process is organized as follows. The transactional dataset is first partitioned into disjoint transaction chunks, which are processed independently. For each chunk, a temporary prefix-tree-like data structure is created in main memory, and is then converted into the final chunk-tree representation and stored on disk.
The dataset is preprocessed before partitioning to guarantee that each chunk includes only transactions belonging to the same chunk-tree. During the HY-Tree creation process, the Item-Index structure is progressively updated with the values of the physical pointers to chunk-tree nodes. The creation process of the HY-Tree and Item-Index is described in the next subsections, while the materialization details of the complete structure on disk are presented in Appendix C.

4.1. Dataset preprocessing

The following operations are sequentially performed. (i) Item support counting. The dataset is read to compute global item supports. If no support threshold is being enforced, all items are retained to generate a complete representation of the dataset; otherwise, infrequent items are discarded. (ii) Transaction remapping. Each transaction is individually remapped by sorting its items in the same order as in the HY-Tree paths. Optionally, the minimum support threshold is enforced by discarding infrequent items from each transaction. (iii) Sorting the remapped transactions. Remapped transactions are sorted according to their prefix. Transactions that will be represented in HY-Tree paths sharing a common prefix are thus placed in contiguous positions in the sorted transaction set.

A disk-based algorithm has been devised to efficiently sort large transaction sets (a sketch of its bucketing step is given at the end of this subsection). The algorithm is a slight variation of a bucket sort algorithm [30]. After the dataset remapping phase, remapped transactions are assigned a target bucket based on their prefix, in such a way that the overall length of the transactions in the bucket does not exceed a maximum bucket size. This value is defined based on the available physical memory, so that all sorting operations within a bucket can be performed in main memory. During the dataset remapping phase, statistics are collected about the number of transactions starting with each 1-item prefix, avoiding an additional dataset scan in the subsequent steps. After partitioning with respect to the 1-item prefix,


buckets that still exceed the maximum bucket size are recursively split based on increasingly long prefixes until they become small enough. In most cases, splitting on the 1-item prefix is sufficient. Transactions in each bucket are then sorted lexicographically in main memory with an MSD radix sort algorithm [30]. With this technique, each bucket can be sorted independently, thus avoiding the merging of intermediate results, which usually causes multiple costly disk accesses.
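The following C++ sketch illustrates the bucketing step under simplifying assumptions: transactions are kept in memory as vectors, buckets are split only on the 1-item prefix, and std::sort stands in for the MSD radix sort; none of this reflects the authors' disk-based implementation.

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using Transaction = std::vector<std::uint32_t>;  // items already remapped
                                                 // to HY-Tree path order

// Partition remapped transactions into buckets by 1-item prefix, then sort
// each bucket independently (lexicographically). No merge step is needed:
// concatenating the buckets in prefix order yields the sorted transaction set.
std::vector<Transaction> bucketSort(const std::vector<Transaction>& input) {
    std::map<std::uint32_t, std::vector<Transaction>> buckets;
    for (const Transaction& t : input)
        if (!t.empty()) buckets[t.front()].push_back(t);

    std::vector<Transaction> sorted;
    for (auto& entry : buckets) {                 // visited in prefix order
        std::vector<Transaction>& bucket = entry.second;
        std::sort(bucket.begin(), bucket.end());  // stand-in for MSD radix sort
        sorted.insert(sorted.end(), bucket.begin(), bucket.end());
    }
    return sorted;
}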
4.2. Building the HY-Tree on disk

The sorted transaction set generated in the dataset preprocessing phase is fed to the HY-Tree construction procedure. This algorithm virtually partitions the transaction set into non-overlapping transaction chunks. Each chunk contains a subset of contiguous transactions from the sorted transaction set and is processed independently to build its corresponding chunk-tree on disk. For each chunk, a temporary tree is first built in main memory, to collect the information needed to define the structure of the final chunk-tree. This temporary tree is an FP-tree-like data structure. It has a larger size than the final chunk-tree, because all its nodes are represented with the same data structure, similar to that of a traditional FP-tree. Then, the temporary tree is visited in depth-first order and converted into the final chunk-tree by selecting the proper structure for each node. Tree subpaths are stored in either an upper or a lower disk page according to the node type. Finally, the temporary tree is discarded from main memory.

The chunk size (i.e., the number of transactions it contains) is dynamically adjusted during the HY-Tree creation process to address the following issues. (a) To avoid memory swap during the creation process, each temporary chunk-tree separately fits in main memory. (b) To minimize the final number of HY-Tree disk pages, the algorithm tries to insert as many chunk-tree subpaths as possible in the upper disk page of each chunk. An upper page is owned exclusively by a chunk-tree; in contrast, a lower page is shared among chunk-trees and is almost completely filled with contributions from different contiguous chunk-trees. The chunk-tree creation procedure starts with an empty temporary chunk-tree and inserts transactions into it from the sorted transaction set. In parallel, the procedure keeps track of the size of the resulting (final) chunk-tree, in terms of bytes occupied in the upper and lower pages. Whenever a node is created or modified in the temporary chunk-tree, space occupation in the upper and lower pages is recomputed accordingly. As soon as the upper page becomes full (i.e., the available space in the upper page is insufficient to insert a new HNode and/or to append a new entry to an existing HNode), the temporary chunk-tree is converted into the final chunk-tree representation and its upper page is written to disk. Since the lower page is shared among multiple adjacent chunk-trees, it is written to disk only when full. If the last transaction in the current chunk could not be fully inserted in the temporary chunk-tree due to insufficient space in the upper page, it is removed and inserted in the next chunk-tree.

4.3. Building the Item-Index on disk

A temporary Item-Index disk-based data structure is preliminarily created, similar to the final Item-Index but with a larger size. The temporary Item-Index is progressively updated during the HY-Tree creation process. Each time a new chunk-tree is stored on disk, the physical pointers to all its nodes are inserted into the temporary Item-Index. A portion of the temporary Item-Index resides in main memory throughout this process. The Item-List, which lists all items in the HY-Tree with their global support value, is kept in main memory. For the Pointer-Arrays, recording pointers to the newly created HY-Tree cells, a main-memory slot is reserved to store only the latest entries. A Pointer-Array is stored in secondary memory when it exceeds the available slot size: its cells are appended to the Pointer-Array cells previously stored for the same item, and the slot is then emptied. To reduce the number of sparse disk write operations, a single memory slot is used to store the (potentially interleaved) Pointer-Array cells of several contiguous items (in global support order). When the slot is full, its content is written to disk. After completing the temporary Item-Index, Pointer-Arrays are reconstructed for each item, compressed as described in Section 3.2, and then stored on disk.
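A minimal sketch of the slot-buffered Pointer-Array construction follows; the flushing granularity and the helper appendToDisk() are assumptions for illustration, not the actual implementation.

#include <cstdint>
#include <cstdio>
#include <vector>

// Assumed primitive: append pointer cells for one item to its on-disk array.
static void appendToDisk(std::uint32_t item,
                         const std::vector<std::uint64_t>& cells) {
    std::printf("flushing %zu cells for item %u\n", cells.size(), item);
}

// A main-memory slot holding only the latest Pointer-Array entries of one
// item; when the slot overflows, its content is appended to disk and emptied.
struct PointerArraySlot {
    std::uint32_t item = 0;
    std::size_t   capacity = 1024;      // slot size, from available memory
    std::vector<std::uint64_t> latest;  // most recent entries only

    void add(std::uint64_t cellPtr) {
        latest.push_back(cellPtr);
        if (latest.size() >= capacity) {
            appendToDisk(item, latest);
            latest.clear();
        }
    }
};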

5. Mining algorithm integration

Various algorithms for itemset extraction have been proposed in the literature, characterized by different in-memory data representations and different solutions to explore the search space. These algorithms are usually organized in two main phases. First, the data loading phase accesses the flat dataset representation on disk, and loads all the data needed for the extraction process into memory. Then, the extraction phase mines itemsets from the selected data. Since VLDBMine provides a compact representation of the transactional dataset, it can be profitably exploited to support the data loading phase, by selectively retrieving only the dataset portion analyzed in the extraction phase. If no support threshold has been enforced at creation time, VLDBMine can be exploited as a data source for itemset extraction with any type of constraint, thus easily enabling the re-use of the same VLDBMine structure. Otherwise, the structure can support any analysis session whose support constraint is no lower than the minimum support threshold. Two strategies have been devised to integrate the VLDBMine-driven data loading phase into state-of-the-art mining algorithms.


The loosely-coupled integration strategy completely decouples the data loading from the extraction phase. It supports the first level of recursion of the mining algorithm, i.e., when the subset of the transactional data including frequent items is loaded into main memory. Thus, the whole dataset projection relevant for itemset extraction is selectively loaded from the HY-Tree, and made available to the mining algorithm. Since the HY-Tree is visited only once, the I/O cost is limited. In addition, the loosely-coupled integration is easier to implement, because it requires less knowledge of the internal procedures and data structures used by the mining algorithm. However, this strategy offers limited scalability, since the entire dataset projection relevant for the analysis is loaded into memory.

The tightly-coupled integration strategy supports the next recursion levels, in which the dataset projection with respect to a given frequent item is computed. More specifically, the tightly-coupled strategy interleaves the data loading and extraction phases, so that the complete HY-Tree representation of the original dataset is never resident in memory. Only the dataset projection locally required at each mining step is loaded from disk to main memory, to support the extraction of all itemsets from the corresponding item-based projected dataset. Once this projection has been processed, it is discarded, and a new one is loaded from the HY-Tree. This strategy can achieve a significant memory saving, providing better scalability for the mining algorithm, counterbalanced by a larger number of I/O calls and a more complex integration.

Even if the tightly-coupled strategy provides a relevant memory saving, memory exhaustion may still occur with truly huge datasets. To overcome this problem, the integration strategy can be modified to support further recursion levels, i.e., to load the dataset projection with respect to a set of frequent items (instead of a single one). This approach can recursively project the dataset until it becomes small enough for main memory, thus enhancing algorithm scalability. The appropriate integration strategy can be selected based on a trade-off between integration cost and algorithm scalability, given by main-memory requirements and I/O access cost. Both strategies optimize the data loading phase, as they avoid reading the entire dataset by selectively loading from the HY-Tree only the portions of interest.

5.1. Data retrieval techniques

Three data retrieval techniques have been developed to support the integration of state-of-the-art algorithms for frequent itemset extraction. The support-based and the item-based projection methods allow a loosely- and a tightly-coupled integration of algorithms, respectively. The item-based projection method exploits a prefetching strategy to reduce redundant disk reads and to bound the I/O cost. An additional access method has been devised to support the integration of algorithms enforcing constraints during the extraction process. These data access methods are described in the following.

5.1.1. Support-based projection method

The support-based projected dataset is obtained by removing infrequent items from the original dataset. Since items in the HY-Tree paths are sorted by descending support value, the projection is given by the subpaths between the tree root and the first cells with infrequent items. First, frequent items are selected through the Item-Index. Then, the HY-Tree is visited top-down to read the projection.
From each root node cell, paths are traversed depth-first by following the child node pointers. The visit of a path is interrupted when a cell with an infrequent item, or with no child node, is reached. Each retrieved subpath contains one item from each encountered HNode and, if a VNode is reached, the subset of its cells including frequent items. This technique reduces the cost of data retrieval. Only HY-Tree pages including at least one frequent item are visited, and these pages are read from disk once. Based on the support threshold, for each chunk-tree either the upper page only, or both the upper and lower pages, are accessed. These pages reside in memory until all their paths have been visited, and are then discarded. This sequence of page retrievals holds because paths are visited in the same order used to save them on disk during the creation procedure.
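A simplified in-memory rendition of this top-down visit is sketched below. The node layout follows the hypothetical structs of Section 2.1, memory pointers stand in for disk pointers, and, for brevity, only maximal subpaths ending at a leaf are emitted; a full implementation would also emit frequent prefixes cut at an infrequent item, and would handle VNode cells.

#include <cstdint>
#include <functional>
#include <vector>

struct PNode;  // a projection-visit view of an HNode

struct PCell {
    std::uint32_t item;
    std::uint32_t localSupport;
    const PNode*  child;  // nullptr for leaf cells
};

struct PNode { std::vector<PCell> cells; };

using Emit = std::function<void(const std::vector<std::uint32_t>&, std::uint32_t)>;

// Depth-first visit cutting each path at the first infrequent item; since
// items along a path are sorted by descending support, deeper items on the
// same path are also infrequent.
void supportProject(const PNode& n, std::vector<std::uint32_t>& prefix,
                    const std::function<bool(std::uint32_t)>& isFrequent,
                    const Emit& emit) {
    for (const PCell& c : n.cells) {
        if (!isFrequent(c.item)) continue;  // cut the path here
        prefix.push_back(c.item);
        if (c.child != nullptr)
            supportProject(*c.child, prefix, isFrequent, emit);
        else
            emit(prefix, c.localSupport);   // complete frequent subpath
        prefix.pop_back();
    }
}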

5.1.2. Item-based projection method

The frequent projected dataset for an arbitrary item a is given by the subset of transactions including a together with the items having higher support than a, or equal support but preceding a in lexicographic order. Since items in HY-Tree paths are sorted by descending support and ascending lexicographic order, this projection is given by the HY-Tree prefix paths of item a (i.e., the subpaths from the root node to the cells including a). To identify the HY-Tree paths including item a, the Pointer-Array of a is read from the Item-Index. Then, the prefix path of each cell with a is traversed bottom-up by following the parent node pointers. Any retrieved prefix path contains a single cell from each crossed HNode. For any occurrence of a in a VNode, the retrieved subpath includes (i) all items in the VNode between a and the VNode sentinel cell and (ii) the prefix path of the sentinel cell. The VNode is read by following the unique child node pointer in the sentinel cell. In all retrieved prefix paths, item local supports are normalized to the local support of a, to only consider transactions including this item. This data retrieval technique accesses only the HY-Tree pages including the prefix paths of the target item. One chunk-tree at a time is considered, and all relevant prefix paths are traversed bottom-up before moving to the next chunk-tree.

A straightforward implementation of this data retrieval technique is to load the projection of a single item at a time. This solution minimizes main-memory usage, but may cause high I/O costs, since the same HY-Tree page may need to be reloaded multiple times from disk (if no longer available in memory) when retrieving the projections of different items.

To avoid multiple loadings of the same page, the following prefetching strategy has been devised. At each step, prefix paths are retrieved (prefetched) for a subset of frequent items (rather than for a single item) and stored in a memory area. To form each subset, frequent items are chosen contiguously according to the item order in the HY-Tree paths. For a given subset, as many items are selected as can be accommodated, with their projections, in the memory area available for prefetching. This information can be known before loading the HY-Tree paths, since the size of each item projection is computed during the creation process and stored in the Item-Index. With this prefetching strategy, the number of times each page has to be read is given by the number of distinct item subsets (as opposed to the number of distinct items) whose prefix paths are stored in the page. A page is read exactly once when it only stores prefix paths of frequent items in the same subset. Due to the HY-Tree creation procedure, prefix paths of contiguous items are expected to be located in the same disk page or in neighboring pages. Since the pages storing the prefix paths of items in the same subset are accessed by consecutive disk reads, the prefetching strategy also bounds the I/O cost. Its effectiveness is discussed in Section 8.5.
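Forming the item subsets amounts to a simple greedy grouping driven by the projection sizes stored in the Item-Index. The sketch below assumes items are numbered 0..n-1 in HY-Tree path order; the function name and size units are illustrative.

#include <cstdint>
#include <vector>

// Group contiguous items (in HY-Tree order) so that the projections of
// each group together fit in the memory area reserved for prefetching.
std::vector<std::vector<std::uint32_t>> buildPrefetchGroups(
        const std::vector<std::uint64_t>& projSize,  // per item, from Item-Index
        std::uint64_t areaBytes) {
    std::vector<std::vector<std::uint32_t>> groups;
    std::vector<std::uint32_t> current;
    std::uint64_t used = 0;
    for (std::uint32_t item = 0; item < projSize.size(); ++item) {
        if (!current.empty() && used + projSize[item] > areaBytes) {
            groups.push_back(current);   // close the group: area exhausted
            current.clear();
            used = 0;
        }
        current.push_back(item);
        used += projSize[item];
    }
    if (!current.empty()) groups.push_back(current);
    return groups;
}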
5.1.3. Item-based access method

This access method supports the enforcement of constraints during the extraction process. In this case, the complete transactions including an item a, represented by the HY-Tree paths containing a, may be needed. The item-based access method is structured in two main steps. Node cells including item a are identified by means of the Item-Index, and the corresponding HY-Tree prefix paths are read by visiting paths bottom-up, as described for the item-based projection method. For each prefix path, its subtree is then read by visiting the HY-Tree top-down with a procedure similar to the support-based projection method.

5.2. Tuning the VLDBMine configuration

The VLDBMine data structure can provide a complete representation of a transactional dataset, without discarding any of its original information, and supplementing it with an indexing structure (i.e., the Item-Index). The complete VLDBMine materialization is also usually significantly smaller than the original data, which can thus be safely deleted. Given a transactional dataset, the corresponding VLDBMine representation needs to be created only once, and can then be re-used multiple times for different itemset extraction sessions, possibly with different item and/or support constraints. If preserving the full transactional dataset is not a requirement (e.g., the analyst can foresee that she will not be interested in extracting itemsets below a given support threshold), a pruned version of the VLDBMine representation can be obtained by enforcing a minimum support threshold σM upon materialization. The resulting structure can still be used for all itemset extraction sessions enforcing a minimum support constraint σ ≥ σM. Moreover, since the sparsest part of the original dataset (i.e., the infrequent items) has been discarded, the size of the structure can be significantly smaller, thus also improving its data retrieval performance.

In its standard configuration, VLDBMine supports both the loosely- and the tightly-coupled integration of algorithms for frequent itemset mining. An ad hoc lightened VLDBMine configuration can be tuned to support only one of the two integrations. The resulting structure is characterized by a higher degree of compactness, thus further reducing the cost of data retrieval.

6. Itemset mining

Several algorithms have been proposed for itemset extraction. These algorithms mainly differ from each other in the adopted main-memory data structures and in the strategy used to visit the search space. VLDBMine can support most state-of-the-art algorithms for itemset extraction according to the integration strategies presented in Section 5. In the following, the integration of VLDBMine into mining algorithms is described using LCM v.2 as a reference example. In the data loading phase, LCM v.2 first scans the dataset from a flat file to count item supports. Then, the dataset is read again and the support-based projection is loaded into main memory in an array-based data structure. Finally, the extraction takes place.

The loosely-coupled integration replaces the data loading phase of LCM v.2. The set of frequent items is identified by accessing the Item-Index and reading the corresponding global item supports. The support-based dataset projection is retrieved from the HY-Tree by means of the support-based projection method and is loaded into the array-based structure. The original extraction algorithm is then run on it.

The tightly-coupled integration strategy stems from the observation that the extraction process in LCM v.2 actually focuses on a single frequent item at a time and mines itemsets from the corresponding item-based projected dataset. Thus, only one portion of the dataset needs to be in memory at a time. For each frequent item, its dataset projection is retrieved from the HY-Tree by means of the item-based projection method and is stored in the array-based representation. Then, the original extraction algorithm is invoked on it. This solution allows frequent itemset extraction from larger datasets and with lower support thresholds, because it only requires that the largest item projection fit in main memory (rather than the entire support-based dataset projection, as for the loosely-coupled integration strategy). The prefetching strategy, which loads the item-based projections for a set of items at once, can effectively reduce the I/O costs. The portion of memory reserved for prefetching can be set by trading off the performance of the extraction process against better scalability.
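The tightly-coupled driver can be pictured as the following loop. All types and helpers here are hypothetical stand-ins, not LCM v.2's actual API; the point is only that a single item projection is resident at a time.

#include <cstdint>
#include <vector>

struct Projection { /* array-based item-based projected dataset */ };

// Assumed primitives: retrieve an item-based projection from the HY-Tree
// and run the unmodified extraction code on it (bodies stubbed for brevity).
Projection loadItemProjection(std::uint32_t item) { (void)item; return Projection{}; }
void mineProjection(const Projection& p, std::uint64_t minsup) { (void)p; (void)minsup; }

void tightlyCoupledMining(const std::vector<std::uint32_t>& frequentItems,
                          std::uint64_t minsup) {
    for (std::uint32_t item : frequentItems) {
        Projection p = loadItemProjection(item);  // item-based projection method
        mineProjection(p, minsup);                // original LCM v.2 extraction
    }  // p goes out of scope here, freeing memory before the next item
}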


7. Perspectives on incremental update of VLDBMine

The incremental update of the VLDBMine structure will be addressed in the future as a further improvement of this work. In particular, the following issues should be considered.

(i) Minimum support threshold. If no support threshold has been enforced during the creation phase, the incremental update is feasible without accessing the original transactional dataset, as the compact VLDBMine already includes all the required information. Conversely, the original dataset might be necessary if previously infrequent items become frequent after the new transactional data is merged with the old one. To this aim, the header table might be extended to keep track of all items (both frequent and infrequent), so that the updated item supports can be easily computed as new data become available. Only in the case that one or more previously infrequent items exceed the minimum support threshold will the old data be scanned to retrieve the transactions where such items occur, and the VLDBMine paths updated accordingly.

(ii) Physical layout. As the VLDBMine structure is updated with more transactional data, its size on disk may need to grow. Some empty space may be reserved in each page to account for this. In addition, new VLDBMine nodes may be appended at the end of the structure, by relaxing the constraint that each pointer should only reference nodes in the same (upper or lower) page. However, this might result in a sub-optimal data layout on disk, and possibly in a loss of I/O efficiency.

(iii) Path ordering. As described in the previous sections, items along each VLDBMine path are sorted by decreasing global support (and lexicographic order in case of equal support). However, the relative order of items might change after new transactions are introduced (i.e., one item might become more frequent than one or more items that precede it in the global order). In this case, the order of items in the VLDBMine paths should be changed accordingly, as this is an essential prerequisite for many support-based traversal algorithms (e.g., the support- and item-based projection methods).

8. Experimental results

To validate our approach, a large set of experiments has been performed to address the following issues: (i) the characteristics of the VLDBMine structure, in terms of both size and creation time (see Section 8.1), (ii) the performance of frequent itemset extraction (see Section 8.2), (iii) the analysis of I/O access time and memory consumption (see Section 8.3), (iv) the scalability of the proposed approach (see Section 8.4), and (v) the impact of the prefetching strategy on performance (see Section 8.5). We ran the experiments for both dense and sparse data distributions. We report experiments on 22 representative datasets whose characteristics (i.e., transaction and item cardinality, average transaction size (AvgTrSz), and dataset size) are given in Table 2.

The first block (datasets 1-4) includes real datasets. Connect and Pumsb [14] are dense, medium-size datasets, while Kosarak [14] and Wikipedia(1) (available at [11]) are sparse datasets. The latter is characterized by a high degree of sparseness, due to the large number of distinct items (i.e., words) with respect to the number of transactions. The following three blocks of datasets (i.e., TxPyIiCwDz) reported in Table 2 have been synthetically generated by means of the IBM generator [2] by setting different parameters (i.e., T average transaction size, I number of different items, P average length of maximal patterns, C correlation grade between patterns, and D number of transactions).(2) In particular, block two (datasets 5-10) includes a selection of heterogeneous datasets that have been used to preliminarily assess the performance of the VLDBMine structure. These datasets differ from one another with respect to various parameters (e.g., number of transactions, correlation, item cardinality) so as to provide a wide enough coverage of possible cases. For instance, low correlation values and high item cardinality are intended to characterize sparse datasets, while the reverse should result in dense datasets. In addition, longer average transaction/maximal pattern lengths should result in an increasingly branched tree (as more item permutations are possible), thus boosting the occurrence of unitary-support chains, which can benefit from the VNode representation. The datasets in blocks three and four are designed to assess the scalability of our approach, with respect to both dataset size (datasets 11-16) and average transaction/maximal pattern length (datasets 17-22).

The disk-based creation procedure and data access methods have been developed in C++. Experiments have been performed on two different hardware configurations: (i) a 2.8 GHz Pentium IV PC with 2.5 GB of main memory running Linux kernel v. 2.6.20-16-generic, and (ii) a 2.66 GHz quad-core Intel Core 2 Quad Q9400 PC with 8 GB of main memory running Linux kernel v. 2.6.32-24-server.(3) In particular, the experiments on datasets labeled with a star in Table 2 have been performed on architecture (i), while all other experiments have been run on architecture (ii). The choice of two different hardware configurations (i.e., a low-end machine and an average commodity PC) is motivated by our intent to show that appreciable scalability can be attained by our approach on medium-sized datasets even with a limited memory budget. On the other hand, high scalability to large data volumes has been demonstrated on the second architecture.
In the experiments, all reported execution times are real times, including both system and user time, obtained from the Unix time command as in [14]. The mining process has been performed with a default value of 1 GB for the prefetching area (unless stated otherwise). The impact of this parameter on performance is further discussed in Section 8.5.

Note 1. Wikipedia is obtained from a 2009 dump of the English version of the online encyclopedia [33], including 3,038,075 articles. Wikipedia articles have been pre-processed by removing stop-words and applying Porter stemming. Then, each article sentence has been converted into a transaction where each item represents a word.
Note 2. The script to synthetically generate the datasets is available at [11].
Note 3. The current implementation of HY-Tree does not exploit multithreading processors.
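As an aside, the TxPyIiCwDz naming convention can be decoded mechanically. The following sketch (the struct, field, and function names are ours, purely illustrative) maps a synthetic dataset name from Table 2 to its generator parameters.

```cpp
#include <cstdio>
#include <string>

// Generator parameters encoded in a name such as "T22P22I100kC1D500M".
struct GenParams {
    double avgTrSz;      // T: average transaction size
    double maxPatLen;    // P: average length of maximal patterns
    long   nItems;       // I: number of different items ("k" = 1,000)
    double correlation;  // C: correlation grade between patterns
    long   nTrans;       // D: number of transactions ("M" = 1,000,000)
};

// e.g. parse("T22P22I100kC1D500M") yields avgTrSz = 22, maxPatLen = 22,
// nItems = 100,000, correlation = 1, nTrans = 500,000,000.
GenParams parse(const std::string& name) {
    GenParams p{};
    long itemsK = 0, transM = 0;
    std::sscanf(name.c_str(), "T%lfP%lfI%ldkC%lfD%ldM",
                &p.avgTrSz, &p.maxPatLen, &itemsK, &p.correlation, &transM);
    p.nItems = itemsK * 1000;
    p.nTrans = transM * 1000000;
    return p;
}
```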

Table 2
Transactional dataset and corresponding HY-Tree characteristics.

| ID# | Dataset name | Flat size (GB) | # Trans | # Items | AvgTrSz | HY-Tree size (GB) | w/HN (GB) (a) | w/VN (GB) (b) | I.I. size (GB) | CFH (%) | CFV (%) | Time (min) |
|-----|--------------|----------------|---------|---------|---------|-------------------|---------------|---------------|----------------|---------|---------|------------|
| 1 | CONNECT* | 0.009 | 67,557 | 129 | 43.00 | 0.004 | 0.008 | 0.011 | 0.002 | 48.81 | 25.51 | 0.07 |
| 2 | PUMSB* | 0.016 | 49,046 | 2,113 | 74.00 | 0.007 | 0.027 | 0.013 | 0.006 | 51.94 | 11.44 | 0.1 |
| 3 | KOSARAK* | 0.030 | 990,002 | 41,271 | 8.10 | 0.027 | 0.117 | 0.034 | 0.028 | 8.29 | -86.54 | 0.33 |
| 4 | WIKIPEDIA* | 4.09 | 53,991,313 | 4,061,422 | 10.44 | 2.36 | 9.30 | 2.30 | 2.25 | 42.29 | -12.67 | 36.6 |
| 5 | T20P20I100kC0.25D8M* | 1.12 | 8 M | 60,277 | 25.43 | 0.45 | 1.94 | 0.79 | 0.45 | 59.91 | 19.62 | 4 |
| 6 | T30P30I50kC0.75D15M* | 3.11 | 15 M | 37,183 | 38.32 | 1.38 | 5.37 | 2.20 | 1.25 | 55.43 | 15.22 | 15 |
| 7 | T22P20I150kC0.75D30M* | 4.69 | 30 M | 59,121 | 26.70 | 2.24 | 7.19 | 3.10 | 1.68 | 52.25 | 16.44 | 24 |
| 8 | T24P24I300kC1D45M* | 8.54 | 45 M | 75,196 | 30.60 | 3.31 | 9.95 | 5.30 | 2.32 | 61.22 | 34.00 | 40 |
| 9 | T22P20I300kC0.75D60M* | 9.95 | 60 M | 73,657 | 26.70 | 4.37 | 12.26 | 6.19 | 2.87 | 56.10 | 27.25 | 48 |
| 10 | T18P15I100kC1D80M* | 9.24 | 80 M | 39,157 | 20.88 | 4.75 | 12.76 | 6.52 | 3.00 | 48.58 | 16.09 | 53 |
| 11 | T22P22I100kC1D10M | 1.54 | 10 M | 48,081 | 27.93 | 0.65 | 2.58 | 1.08 | 0.60 | 58.09 | 19.03 | 3 |
| 12 | T22P22I100kC1D70M | 10.79 | 70 M | 48,087 | 27.94 | 4.75 | 13.38 | 7.55 | 3.13 | 56.00 | 26.98 | 32 |
| 13 | T22P22I100kC1D100M | 15.41 | 100 M | 48,087 | 27.94 | 6.58 | 17.61 | 10.78 | 4.12 | 57.33 | 30.60 | 46 |
| 14 | T22P22I100kC1D500M | 77.06 | 500 M | 48,087 | 27.94 | 23.12 | 57.25 | 53.90 | 13.42 | 69.99 | 52.58 | 210 |
| 15 | T22P22I100kC1D750M | 115.58 | 750 M | 48,087 | 27.94 | 30.73 | 76.90 | 80.85 | 18.03 | 73.41 | 57.81 | 302 |
| 16 | T22P22I100kC1D1000M | 154.11 | 1000 M | 48,087 | 27.94 | 37.62 | 94.95 | 107.80 | 22.27 | 75.59 | 61.13 | 440 |
| 17 | T5P5I100kC1D500M | 18.18 | 500 M | 19,359 | 6.46 | 4.63 | 9.87 | 13.90 | 2.47 | 74.52 | 60.93 | 100 |
| 18 | T10P10I100kC1D500M | 35.32 | 500 M | 30,938 | 12.72 | 9.72 | 23.02 | 25.56 | 5.51 | 72.47 | 56.87 | 126 |
| 19 | T20P20I100kC1D500M | 70.18 | 500 M | 45,968 | 25.42 | 20.83 | 51.22 | 49.22 | 12.02 | 70.31 | 53.18 | 198 |
| 20 | T30P30I100kC1D500M | 105.68 | 500 M | 55,415 | 38.37 | 32.66 | 82.06 | 73.34 | 19.15 | 69.10 | 50.98 | 323 |
| 21 | T40P40I100kC1D500M | 141.10 | 500 M | 62,193 | 51.29 | 44.88 | 114.44 | 97.41 | 26.28 | 68.19 | 49.32 | 422 |
| 22 | T50P50I100kC1D500M | 176.86 | 500 M | 66,896 | 64.32 | 57.29 | 147.19 | 121.66 | 34.19 | 67.61 | 48.28 | 486 |

(a) Size of a tree built with HNodes only.
(b) Size of a tree built with VNodes only.
(*) Experiments on these datasets were run on hardware configuration (i) (see Section 8).

8.1. Characteristics of the VLDBMine structure

Table 2 reports both the HY-Tree and Item-Index size for the considered datasets. Recall that the HY-Tree representation is complete. Hence, for all datasets, it has been generated without enforcing any support threshold.

To evaluate the compactness of the persistent structures, we measured the compression factors achieved by (i) the HY-Tree only, which simply encodes the transactional data, and (ii) the full VLDBMine structure. These factors, defined as

\[ CF_{HY\text{-}Tree} = \left( 1 - \frac{size(HY\text{-}Tree)}{size(Dataset)} \right) \% \]

\[ CF_{VLDBMine} = \left( 1 - \frac{size(HY\text{-}Tree) + size(Item\text{-}Index)}{size(Dataset)} \right) \% \]

compare either the HY-Tree or the VLDBMine size with the size of the transactional dataset.

The HY-Tree structure is always smaller than the original flat-file dataset. In particular, the tree compression factor CF_HY-Tree is always larger than 40% for all the considered datasets, except Kosarak. This result is mainly due to the hybrid structure of the HY-Tree, which exploits HNodes and VNodes to provide a compact representation capable of adapting to different item distributions within the dataset. If all transactions in the dataset are characterized by the same item distribution, the proposed structure exploits either mainly HNodes or mainly VNodes. The compression factor yielded by HNodes is usually higher than that yielded by VNodes, since the latter represent unique (suffix) transactions.

As indicated by CF_VLDBMine, the data compression achieved by the full VLDBMine structure is lower, because storing both the HY-Tree and the Item-Index requires more disk blocks. The Item-Index contains pointers to all entries in the HY-Tree nodes. Hence, its size is proportional to the total number of HY-Tree entries. The VLDBMine structure is still almost always smaller than the flat-file dataset. The corresponding CF_VLDBMine ranges from 11.44% to 61.13% for the considered datasets. For the sparsest datasets (i.e., Kosarak and Wikipedia) this compression factor is negative, indicating that the complete materialized structure is larger than the flat-file dataset. However, the VLDBMine structure supplements the raw transactional data with additional information to support top-down and bottom-up data retrieval. Thus, overall, the proposed structure provides good data compression for large datasets characterized by rather diverse data distributions.

To further investigate the effectiveness of our hybrid structure, we computed the size of the materialized tree built with only one type of node (either HNodes or VNodes). The results are reported in Table 2, in columns w/HN and w/VN. In general, they show that neither data structure alone is suitable to compactly represent the transactional dataset. The only exception is the Wikipedia dataset, for which the size of a tree built with VNodes only is slightly (i.e., 60 Mbytes) smaller than the HY-Tree. However, in this case a significant performance penalty is incurred during the mining process.
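To make the definitions concrete, the following sketch (ours, not part of the VLDBMine code) computes both factors from the sizes in Table 2; applied to the largest dataset (T22P22I100kC1D1000M), it reproduces the reported values up to rounding of the tabulated sizes.

```cpp
#include <cstdio>

// Compression factors as defined above, with sizes expressed in GB.
double cfHyTree(double tree, double dataset) {
    return (1.0 - tree / dataset) * 100.0;
}
double cfVLDBMine(double tree, double index, double dataset) {
    return (1.0 - (tree + index) / dataset) * 100.0;
}

int main() {
    // Sizes (GB) from Table 2, dataset T22P22I100kC1D1000M.
    double dataset = 154.11, hyTree = 37.62, itemIndex = 22.27;
    // Prints 75.59%, matching the CFH column.
    std::printf("CF_HY-Tree  = %.2f%%\n", cfHyTree(hyTree, dataset));
    // Prints 61.14%; Table 2 reports 61.13%, computed from unrounded sizes.
    std::printf("CF_VLDBMine = %.2f%%\n",
                cfVLDBMine(hyTree, itemIndex, dataset));
}
```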


The two data representations are most effective when used in combination, because datasets are often characterized by rather diverse data distributions.

Table 2 also shows the VLDBMine creation time. Most of the creation time is devoted to sorting, a computationally intensive operation whose cost increases with both the number of transactions and the average transaction length. Once transactions have been sorted, the creation of the HY-Tree is very efficient. The VLDBMine creation time increases with the dataset cardinality. It ranges from less than one minute to about 8 h for the largest dataset. However, the creation process is executed only once to materialize the VLDBMine structure, which can then support different mining processes with any support threshold.

8.2. Performance of frequent itemset extraction

To evaluate the effectiveness of our approach in supporting itemset mining, we analyzed the run time of LCM v.2 [31] adapted to our structure. Both the support-based and the frequent item-based projection methods have been assessed in these experiments, and are denoted in the following as LCM-support-based and LCM-item-based, respectively. Two real datasets (Pumsb and Wikipedia) and two synthetic datasets (T24P24I300kC1D45M and T18P15I100kC1D80M) are discussed as representative datasets. The VLDBMine-based algorithms are contrasted with the corresponding original implementation of the algorithm [14], denoted as LCM-memory-based. The experiments analyze the run time (including both I/O and CPU times) for frequent itemset extraction with various support thresholds.

The VLDBMine-based algorithms always exhibit better or comparable performance with respect to the memory-based algorithm, as reported in Fig. 4. On the smallest dataset (Fig. 4(a)) the disk-based algorithms achieve performance comparable with LCM-memory-based. Due to the small size of this dataset, pre-processing activities and I/O do not significantly impact the overall run time. In addition, all the data required by the mining process can be comfortably accommodated in main memory. Hence, the advantage provided by VLDBMine is not relevant. However, this experiment shows how VLDBMine may support frequent itemset extraction even from small-sized datasets without any performance penalty.

As the size of the input data grows, the VLDBMine-based algorithms gain a significant performance advantage, as shown in the experiments with larger datasets (Fig. 4(b)–(d)). With high support thresholds, the run time of both LCM-support-based and LCM-item-based is always lower than that of LCM-memory-based. Since the HY-Tree compactly represents the transactional dataset, only smaller amounts of data (in comparison with the flat-file transactional dataset) are fetched from disk for the mining process. In addition, HY-Tree data can be directly supplied to the mining algorithm, avoiding the (memory- and time-consuming) pre-processing performed by the memory-based algorithm.

Lowering the support threshold gradually makes the mining process unmanageable by the memory-based algorithm. This algorithm reads the flat-file dataset and loads its support projection in main memory by removing infrequent items from each transaction, as sketched below.
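The following sketch is our illustration of the standard two-pass support projection, not LCM's actual code: a first pass counts item supports, a second pass drops infrequent items from every transaction, mirroring the two dataset scans discussed in Section 8.3.

```cpp
#include <unordered_map>
#include <vector>

using Transaction = std::vector<int>;  // items encoded as integer ids

// Two-pass support projection (illustrative): keep only the items whose
// support reaches minSupCount, dropping everything else from each transaction.
std::vector<Transaction> supportProject(const std::vector<Transaction>& db,
                                        long minSupCount) {
    std::unordered_map<int, long> support;
    for (const auto& t : db)                  // pass 1: count item supports
        for (int item : t) ++support[item];

    std::vector<Transaction> projected;
    projected.reserve(db.size());
    for (const auto& t : db) {                // pass 2: filter transactions
        Transaction pt;
        for (int item : t)
            if (support[item] >= minSupCount) pt.push_back(item);
        if (!pt.empty()) projected.push_back(std::move(pt));
    }
    return projected;
}
```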

[Fig. 4. Frequent itemset extraction time (execution time in seconds vs. MinSup %; curves: LCM-memory, LCM-supp-based, LCM-item-based): (a) Pumsb, (b) Wikipedia, (c) T24P24I300kC1D45M, (d) T18P15I100kC1D80M.]

However, this operation requires a larger amount of memory than the support-projected dataset itself and, thus, it may fail even if the projected dataset does not exceed the available physical memory. Hence, the memory-based algorithm fails to mine all of the above datasets for the lowest support thresholds enforced. In contrast, the support-projected dataset read from the HY-Tree does not require further pre-processing and can be directly loaded in main memory, allowing LCM-support-based to complete the mining tasks at lower support thresholds. LCM-support-based eventually fails on the largest datasets for low support thresholds, on account of memory exhaustion. LCM-item-based can successfully terminate all experiments. Because of its data loading strategy, which interleaves data loading and mining phases, LCM-item-based scales to high data volumes and low support thresholds otherwise unattainable, at the expense of increased disk reads. For this reason, the item-based approach is often outperformed by LCM-support-based, which instead loads the entire support-based projection by accessing the HY-Tree only once (e.g., Fig. 4(c) and (d), at a support threshold of 0.1%).

8.3. Analysis of I/O time and memory consumption

We analyzed (i) the impact of our approach on the I/O process and (ii) the overall amount of main memory required by the two VLDBMine-based algorithms during itemset extraction. Dataset T18P15I100kC1D80M is discussed as a representative example.

Fig. 5(a) reports the VLDBMine I/O cost compared with the I/O cost of the LCM-memory algorithm. For the LCM-support-based algorithm, the I/O cost is the time to perform a single top-down, depth-first visit of the relevant HY-Tree portion. For the LCM-item-based algorithm, the I/O cost is given by the time for loading all item projections. In contrast, the LCM-memory algorithm reads the entire dataset twice to load in main memory the support-projected dataset for the considered support threshold. LCM-support-based typically requires lower I/O access time than LCM-item-based, since the HY-Tree is accessed only once, but fails for the lowest support threshold because the support-projected dataset exceeds the available main memory. In contrast, the loading strategy exploited by LCM-item-based always allows the successful termination of the mining activity. The I/O cost required by the LCM-memory algorithm, when the process is correctly completed, is always higher than that required by the VLDBMine-based algorithms. This is mainly due to the larger amount of data read from the transactional dataset. Since the HY-Tree compactly represents the original dataset, it provides more efficient data loading.

Fig. 5(b) shows the peak amount of main memory required by each algorithm during itemset extraction. For the LCM-item-based algorithm the main memory includes the prefetching area reserved for the mining process. With high support thresholds, LCM-support-based always requires an amount of memory comparable to that of LCM-item-based. In contrast, LCM-memory needs much more memory to load the relevant dataset portion. This wide gap is due to the memory-hungry pre-processing activity performed by the memory-based algorithm. With support thresholds smaller than 0.25% and high data volumes, the memory-based algorithm is not able to correctly complete the mining task, because the space required to load the dataset support projection exceeds the available physical memory.
As the support threshold is lowered, LCM-item-based requires less memory than LCM-support-based, at the expense of increased disk reads (see Fig. 5(a)). Thus, LCM-item-based yields better scalability to high data volumes and low support thresholds. Its memory consumption remains stable when decreasing the support threshold, because the memory area reserved for prefetching has been kept constant. The interleaved loading strategy is sketched below.
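The item-based strategy can be pictured as the following control flow. This is a hedged sketch: the Projection type and the loadItemProjection/mineProjection functions are placeholders of ours, not the actual VLDBMine interface.

```cpp
#include <cstddef>
#include <vector>

// Placeholder types and functions, purely illustrative.
struct Projection {
    void release() { /* free the projection's memory */ }
};
Projection loadItemProjection(int item, std::size_t prefetchBudget) {
    // Would fetch from the HY-Tree, via the Item-Index, only the paths in
    // which `item` occurs, possibly re-reading pages already read for
    // previously processed items (hence the extra disk reads).
    (void)item; (void)prefetchBudget;
    return Projection{};
}
void mineProjection(const Projection&) { /* extract itemsets for the item */ }

// Interleaved loading and mining: only one item projection (plus the
// prefetching area) must fit in memory at any time, which bounds the peak
// memory consumption at the cost of repeated disk accesses.
void mineItemBased(const std::vector<int>& frequentItems,
                   std::size_t prefetchBudgetBytes) {
    for (int item : frequentItems) {
        Projection proj = loadItemProjection(item, prefetchBudgetBytes);
        mineProjection(proj);
        proj.release();  // free memory before loading the next projection
    }
}
```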

[Fig. 5. I/O cost and peak memory consumption on the T18P15I100kC1D80M synthetic dataset (vs. MinSup %; curves: LCM-memory, LCM-supp-based, LCM-item-based): (a) I/O cost in seconds, (b) peak memory consumption in MB.]

8.4. Scalability

The scalability of the VLDBMine-based algorithms has been studied by analyzing the execution time on different datasets generated by varying (i) the number of transactions and (ii) the pattern length. For (i), we considered large datasets ranging from 10 M (million) transactions (database size 1.54 Gbyte) to 1000 M transactions (154.11 Gbyte), reported in Table 2. For the largest dataset, the compression factor is 61.13%.

Fig. 6 plots the extraction time for the VLDBMine-based algorithms at different support thresholds. For the LCM-item-based algorithm, a larger prefetching area (4 Gbyte) than in previous experiments is allocated. For datasets larger than 70 M transactions and for the considered support range, LCM-memory does not correctly terminate the extraction task. For larger datasets with low support thresholds, the mining process performance is significantly affected by the I/O activity (see Section 8.3). However, the efficient VLDBMine data retrieval techniques limit the I/O cost, thus outperforming the original memory-based algorithm. For very large datasets (i.e., 500 M transactions or more), the mining process could not be completed for the lowest support threshold(s) by the LCM-support-based approach, which runs out of memory while loading the support-projected dataset. In contrast, the LCM-item-based approach partitions the dataset portion of interest into a number of smaller subsets, thus reducing memory consumption and successfully carrying out the extraction task for the considered support range on all datasets. The run time reported in Fig. 6 includes both CPU and I/O time, the latter being the most significant contribution. However, the prefetching strategy can effectively reduce the I/O cost, as further discussed in Section 8.5.

To analyze the scalability with respect to the pattern length, we considered transactional databases with 500 M transactions and pattern lengths ranging from 5 (size 18.18 Gbytes) to 50 (176.86 Gbytes), with 100,000 items and correlation grade 1. The results are plotted in Fig. 7 and show a behavior similar to that described for scalability with respect to the number of transactions.

8.5. Effect of the prefetching strategy

To analyze the effect of the prefetching strategy on performance, we evaluated the variation in the execution time of the LCM-item-based algorithm when reserving increasingly large areas of main memory for its prefetching strategy. Dataset T22P22I100kC1D500M is discussed as a representative example. Fig. 8 shows the frequent itemset extraction time for a prefetching area size ranging from 1 Gbyte to 6 Gbytes. When the prefetching area size is increased, the execution time significantly decreases, because the item projections of a larger set of frequent and contiguous items can be loaded at the same time, thus avoiding repeated disk reads of the same page. For example, when the prefetching area is increased from 1 Gbyte to 4 Gbytes, the execution time decreases by roughly 72% for the lowest considered support. This caching behavior is modeled in the sketch below.
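The benefit of a larger prefetching area can be emulated with a simple page-cache model. This is a toy sketch of ours (not the actual implementation, whose replacement policy we do not detail here): pages of contiguous item projections stay resident until the byte budget is exhausted, so repeated accesses to the same page hit memory instead of disk.

```cpp
#include <cstddef>
#include <deque>
#include <unordered_set>

// Toy model of the prefetching area: a FIFO cache of disk pages bounded by
// a byte budget. A larger budget keeps the pages of more contiguous item
// projections resident, turning repeated page reads into memory hits.
class PrefetchArea {
public:
    PrefetchArea(std::size_t budgetBytes, std::size_t pageBytes)
        : capacity_(budgetBytes / pageBytes) {}

    // Returns true on a cache hit; false means one additional disk read.
    bool access(long pageId) {
        if (capacity_ == 0) return false;            // no room to cache
        if (resident_.count(pageId) > 0) return true;  // hit: no disk I/O
        if (fifo_.size() >= capacity_) {             // budget exhausted:
            resident_.erase(fifo_.front());          // evict oldest page
            fifo_.pop_front();
        }
        fifo_.push_back(pageId);
        resident_.insert(pageId);
        return false;                                // miss: one disk read
    }

private:
    std::size_t capacity_;
    std::deque<long> fifo_;
    std::unordered_set<long> resident_;
};
```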

[Fig. 6. Scalability on T22P22I100kC1DxM datasets (x is the number of transactions, in millions): execution time in seconds vs. number of records, for minsup from 0.10% to 0.3%: (a) LCM-supp-based, (b) LCM-item-based.]

[Fig. 7. Scalability on TxPxI100kC1D500M datasets (x is the average transaction and pattern length): execution time in seconds vs. average transaction and pattern length, for minsup from 0.225% to 0.45%: (a) LCM-supp-based, (b) LCM-item-based.]

[Fig. 8. Effect of prefetching area size on performance: execution time in seconds vs. MinSup % on T22P22I100kC1D500M, for prefetching areas of 1 GB to 6 GB.]

Further increasing the prefetching area size only slightly affects performance, because the total number of I/O calls does not significantly decrease anymore.

9. Related work

The research activity in the domain of association rules has initially focused on defining efficient algorithms to perform the computationally intensive frequent itemset mining task. In-memory extraction algorithms (e.g., APRIORI [3], FP-growth [18], Item-covers [34], COFI-Tree [12], Patricia-Trie [26], nonordfp [27], AFOPT [21], LCM [31,32], CGAT [8]) rely on ad hoc compact structures to represent data in main memory and on effective techniques to mine itemsets from them. Mainly three types of data structures (and corresponding algorithms) have been proposed: array-based (e.g., LCM v.2 [31]), prefix-tree-based (e.g., FP-tree [18]), and bitmap-based (e.g., MAFIA [10]). Each data structure type is typically suited for a given data distribution. Hybrid structures have also been proposed (e.g., LCM v.3 [32], Patricia-Trie [26]) to deal with mixed data distributions. However, since most of the above in-core approaches load the complete (support-projected) dataset in main memory, they suffer from significant memory issues as soon as they are applied to larger datasets.

A first step towards efficient mining on large datasets has been proposed in [29], while Lucchese et al. [23] presented a hybrid in-core/out-of-core approach. Even though the disk is exploited as an auxiliary means to extend scalability, the mining process is still mainly memory-based [23].

Recently, fully disk-based mining algorithms have been proposed to support the extraction of knowledge from large datasets (e.g., B+tree-based indices [28], Inverted Matrix [13], Diskmine [16], I/O conscious optimizations [9], DRFP-tree [1]). Ramesh et al. [28] proposed B+tree-based indices to access data stored by means of either a vertical (e.g., ECLAT-based [34]) or a horizontal (e.g., APRIORI-based [3]) data representation. However, performance is often worse than, or comparable to, flat-file mining. El-Hajj and Zaïane [13] presented a disk-based data structure, called Inverted Matrix, to store the transactional dataset in an inverted matrix layout. This structure is specifically suited for very sparse datasets, characterized by a significant number of items with unitary support. In [13], the COFI-Tree (Co-Occurrence Frequent Item Tree) algorithm [12] is exploited for itemset extraction. Grahne and Zhu [16] proposed the Diskmine approach, an effective memory saving mechanism that efficiently maximizes memory use. However, Diskmine may need significant disk space to store all projections and thus may incur a non-negligible I/O cost. An I/O conscious optimization to efficiently mine itemsets has been proposed by Buehrer et al. [9,15] by exploiting the path tiling approach (see Section 3.3 for a further discussion). Like the path tiling technique [9,15], the chunk-tree partitioning presented in this paper aims to improve the locality of data accesses once the data has been loaded into main memory. However, unlike path tiling, the chunk-tree approach (i) defines a physical data layout on secondary memory and (ii) serves as a disk data layout to improve the spatial locality of data accesses.

The idea of exploiting a persistent compact data structure to support (large-scale) itemset mining has been preliminarily introduced in [4], while a first study towards its parallelization on a multi-core commodity PC has been presented in [7].
The VLDBMine framework significantly enhances the approach presented in [4], by proposing (i) an improved and completely renewed data structure, named VLDBMine, (ii) a set of data retrieval techniques, (iii) a novel physical data organization on disk, and (iv) a new integration strategy for itemset mining algorithms that significantly improves their scalability.

A parallel effort has been devoted to a related problem, i.e., frequent closed itemset mining on large datasets, using out-of-core techniques. In [24] the input dataset is initially scanned twice and split into a set of disjoint partitions, based on a strategy similar to the one adopted in [16]. Each partition is chosen in such a way that it can be loaded and mined in main memory. The DCI_Closed algorithm [22] is exploited to this aim because of its tight memory bounds. Since the closedness property of an itemset cannot be evaluated based on a single partition, a merging step is necessary at the end of the mining process to combine partial results. Drawing upon a theoretical result, the merge is reduced to an external sorting and performed on secondary memory.


Finally, a research study has also been devoted to fully integrating into the PostgreSQL DBMS kernel a persistent data structure, named the IMine index [5,6]. IMine is a persistent data structure based on an FP-tree that provides a complete representation of transactional data, supporting itemset extraction from a relational DBMS. IMine exploits the same node structure for the entire dataset, while the HY-Tree relies on two different array-based node structures to compactly represent both dense and sparse portions of the dataset and efficiently support itemset mining on very large datasets.

10. Conclusions and future work

This paper describes VLDBMine, a persistent data structure comprising the HY-Tree and the Item-Index. As further developments of this work, the following issues will be addressed: (i) extension of the proposed approach to a distributed implementation, such as in a cloud computing environment, (ii) incremental update of the persistent structure, (iii) automated intelligent selection of the most appropriate data retrieval technique, and (iv) extension of VLDBMine to store sequence data and efficiently support sequence and temporal pattern mining [25].

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ins.2014.08.073.

References

[1] M. Adnan, R. Alhajj, DRFP-tree: disk-resident frequent pattern tree, Appl. Intell. 30 (2) (2009) 84–97.
[2] R. Agrawal, T. Imielinski, A. Swami, Database mining: a performance perspective, IEEE Trans. Knowl. Data Eng. 5 (6) (1993).
[3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: VLDB '94, pp. 487–499.
[4] E. Baralis, T. Cerquitelli, S. Chiusano, A persistent HY-Tree to efficiently support itemset mining on large datasets, in: SAC '10, pp. 1060–1064.
[5] E. Baralis, T. Cerquitelli, S. Chiusano, Index support for frequent itemset mining in a relational DBMS, in: ICDE, 2005, pp. 754–765.
[6] E. Baralis, T. Cerquitelli, S. Chiusano, IMine: index support for item set mining, IEEE Trans. Knowl. Data Eng. 21 (4) (2009) 493–506.
[7] E. Baralis, T. Cerquitelli, S. Chiusano, A. Grand, P-Mine: parallel itemset mining on large datasets, in: ICDE Workshops, 2013, pp. 266–271.
[8] S. Bashir, A. Rauf Baig, Performance analysis of frequent itemset mining using hybrid database representation approach, in: IEEE INMIC '06, pp. 237–243.
[9] G. Buehrer, S. Parthasarathy, A. Ghoting, Out-of-core frequent pattern mining on a commodity PC, in: KDD '06, pp. 86–95.
[10] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, T. Yiu, MAFIA: a maximal frequent itemset algorithm, IEEE Trans. Knowl. Data Eng. 17 (11) (2005) 1490–1504.
[11] DBDMGroup at Politecnico di Torino.
[12] M. El-Hajj, O.R. Zaïane, COFI approach for mining frequent itemsets revisited, in: DMKD '04, pp. 70–75.
[13] M. El-Hajj, O.R. Zaïane, Inverted matrix: efficient discovery of frequent items in large datasets in the context of interactive mining, in: KDD '03, pp. 109–118.
[14] FIMI.
[15] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y.-K. Chen, P. Dubey, Cache-conscious frequent pattern mining on modern and emerging processors, VLDB J. 16 (1) (2007) 77–96.
[16] G. Grahne, J. Zhu, Mining frequent itemsets from secondary memory, in: ICDM '04, pp. 91–98.
[17] G. Grahne, J. Zhu, Fast algorithms for frequent itemset mining using FP-trees, IEEE Trans. Knowl. Data Eng. 17 (10) (2005) 1347–1362.
[18] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: SIGMOD '00, pp. 1–12.
[19] A.A. Kardan, M. Ebrahimi, A novel approach to hybrid recommendation systems based on association rules mining for content recommendation in asynchronous discussion groups, Inform. Sci. 219 (2013) 93–110.
[20] T. Li, X. Li, Novel alarm correlation analysis system based on association rules mining in telecommunication networks, Inform. Sci. 180 (16) (2010) 2960–2978.
[21] G. Liu, H. Lu, J.X. Yu, W. Wang, X. Xiao, AFOPT: an efficient implementation of pattern growth approach, in: FIMI '03.
[22] C. Lucchese, S. Orlando, R. Perego, DCI_Closed: a fast and memory efficient algorithm to mine frequent closed itemsets, in: FIMI, 2004.
[23] C. Lucchese, S. Orlando, R. Perego, kDCI: on using direct count up to the third iteration, in: FIMI, 2004.
[24] C. Lucchese, S. Orlando, R. Perego, Mining frequent closed itemsets out of core, in: 6th SIAM International Conference on Data Mining, 2006, pp. 419–429.
[25] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, M.-C. Hsu, Mining sequential patterns by pattern-growth: the PrefixSpan approach, IEEE Trans. Knowl. Data Eng. 16 (11) (2004) 1424–1440.
[26] A. Pietracaprina, D. Zandolin, Mining frequent itemsets using Patricia tries, in: FIMI '03.
[27] B. Rácz, nonordfp: an FP-growth variation without rebuilding the FP-tree, in: FIMI '04.
[28] G. Ramesh, W. Maniatty, M.J. Zaki, Indexing and data access methods for database mining, in: DMKD '02.
[29] A. Savasere, E. Omiecinski, S.B. Navathe, An efficient algorithm for mining association rules in large databases, in: VLDB, 1995, pp. 432–444.
[30] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, third ed., MIT Press and McGraw-Hill, 2009.
[31] T. Uno, M. Kiyomi, H. Arimura, LCM ver. 2: efficient mining algorithms for frequent/closed/maximal itemsets, in: FIMI '04.
[32] T. Uno, M. Kiyomi, H. Arimura, LCM ver. 3: collaboration of array, bitmap and prefix tree for frequent itemset mining, in: OSDM '05, pp. 77–86.
[33] Wikipedia.
[34] M.J. Zaki, Scalable algorithms for association mining, IEEE Trans. Knowl. Data Eng. 12 (3) (2000) 372–390.