An Uncertainty-based Approach: Frequent Itemset Mining from Uncertain Data with Different Item Importance

Gangin Lee, Unil Yun1 and Heungmo Ryang
Department of Computer Engineering, Sejong University, Seoul, Republic of Korea
E-mail: [email protected], [email protected], [email protected]
Abstract
Since itemset mining was proposed, various approaches have been devised, ranging from processing simple item-based databases to dealing with more complex databases including sequence, utility, or graph information. In particular, in contrast to mining approaches that process databases containing exact presence or absence information of items, uncertain pattern mining finds meaningful patterns from uncertain databases with items' existential probability information. However, traditional uncertain mining methods have a problem in that they cannot apply the importance of each item obtained from the real world to the mining process. In this paper, to solve this problem and perform uncertain itemset mining operations more efficiently, we propose a new uncertain itemset mining algorithm that additionally considers item importance in the form of weight constraints. In our algorithm, both items' existential probabilities and weight factors are considered; as a result, we can selectively obtain more meaningful itemsets with high importance and existential probabilities. In addition, the algorithm can operate more quickly with less memory by efficiently reducing the number of calculations that cause useless itemset generations. Experimental results in this paper show that the proposed algorithm is more efficient and scalable than state-of-the-art methods.
Keywords: Data mining, Existential probability, Frequent pattern mining, Uncertain pattern, Weight constraint
1. Introduction
Research on data mining started from the necessity to discover hidden, useful information from large databases; in particular, frequent itemset mining [2, 12, 23] has been actively studied as one of the important areas in data mining and utilized in various application fields such as traffic data analysis [9], biomedical data analysis [28],
1 Corresponding author. E-mail: [email protected]
network data analysis [29], and association rule analysis in a mobile computing environment [3]. The main goal of frequent itemset mining is to find all of the possible itemsets satisfying a user-specified threshold from a given database. Such mining results are used for automated data analysis in the aforementioned areas. Since the Apriori algorithm [2] was proposed, frequent itemset mining has continually developed through methods for improving performance [11, 21, 23, 24] and approaches for extracting more useful pattern information such as weighted itemsets [14, 36, 40], high utility itemsets [25, 26, 27, 38, 39], erasable itemsets [15], privacy preserving itemsets [37], stream itemsets [14, 40], sequential itemsets [4, 28, 43], and trajectory patterns [34]. Such methods focus on databases in which each item clearly either exists or does not. However, many real world applications may have not only such certain data but also various types of uncertain data such as personal identification data [45], sensor data [44, 46], and spatiotemporal query data of moving objects [6, 35]. That is, given an uncertain database, items within each transaction of the database have their own probability values instead of exact existence or nonexistence information. Hence, previous traditional approaches face the limitation that they cannot find valid mining results from such uncertain databases. For this reason, the concept of uncertain itemset mining was presented, and a variety of related works have been proposed [5, 8, 20, 31, 46]. Pattern results obtained from uncertain itemset mining have support information considering existential probabilities of items.

Table 1. Uncertain database with weather and accuracy information

City   Severe heat   Deluge   Dense fog   Windstorm   Thunder and lightning
A      70%           10%      20%         10%         5%
B      30%           80%      50%         10%         60%
C      40%           70%      30%         90%         70%
D      50%           20%      60%         10%         10%

Weather                  Accuracy
Severe heat              0.8
Deluge                   0.6
Dense fog                0.7
Windstorm                0.4
Thunder and lightning    0.3
For more in-depth consideration of the characteristics of data obtained from the real world, we need to take account of not only uncertain data processing but also the following factor. In real world applications, items have their own importance or weight, different from one another. Therefore, although an itemset may be regarded as a valid uncertain pattern, its actual value can differ according to the weights of the items composing the pattern. Let us consider an example of weather data. Table 1 is an example database with weather information of a certain state, where each row (or transaction) signifies the weather information of a city. Based on the probability values of the weather information, previous uncertain itemset miners extract itemsets with existential probabilities higher than or equal to a given minimum support threshold. After that, the mining results can be used as weather prediction information for the current state. However, as mentioned above, each item (i.e., weather event) can have importance different from the others; thus, additional considerations are needed.
Assume that the accuracy values of the weather items become their weights, as shown in the right side of Table 1, and that these values are derived from weather prediction data accumulated from the past. Then, we can obtain more valuable itemsets considering both existential probabilities and weights by applying these values in the mining process. Motivated by this challenging issue, in this paper, we propose a new approach, Uncertain Mining of Weighted Frequent Itemsets (U-WFI). In brief, the main contributions of this paper are summarized as follows:
1) Proposing a tree structure that can efficiently store a given uncertain database and weight information by maximizing the node sharing effect among the nodes of the tree.
2) Devising a list structure that can prevent any losses or incorrect calculations of items' own existential probability values in the mining process.
3) Suggesting a new tree-based algorithm, U-WFI, which can mine uncertain frequent itemsets considering item weights, called Uncertain Weighted Frequent Itemsets (UWFIs), from a given uncertain database.
4) Proposing an overestimation-based pattern pruning method that can prevent pattern losses caused by the weight factor by maintaining the anti-monotone property.
Note that the uncertainty of each item represents the existential probability of the item. That is, through these values, we can estimate whether or not items within each transaction (or record) are likely to exist in uncertain databases. Meanwhile, items can have their own importance or weight information. Items within data obtained from the real world can have weight values different from one another according to their own characteristics such as price and profit. Therefore, we can mine uncertain weighted frequent patterns in uncertain database environments by considering both of these characteristics. These patterns are results that have weighted expected support values larger than or equal to a given threshold, where the weighted expected support of a pattern is a support value that considers both the weight and the existential probability values of the pattern. Consequently, these two factors, uncertainty and weight, are distinct concepts, and they are applied jointly in our mining process. The remainder of this paper is organized as follows. In Section 2, we introduce related work, including previous uncertain pattern mining research and the characteristics of its techniques. In Section 3, we describe details of the proposed method and its efficient mining and pruning techniques. In Section 4, we show performance evaluation results that demonstrate the superior performance of the suggested algorithm compared to state-of-the-art ones; finally, we conclude this paper in Section 5.
2. Related Work
In this section, background knowledge of uncertain itemset mining is introduced and previous studies on uncertain itemset mining are briefly described. After that, the concept of and related work on weighted itemset mining are introduced.
2.1. Uncertain frequent pattern mining
Uncertain itemset mining is a series of processes for finding valid itemsets from uncertain databases such as the one in Table 1. In an uncertain database, items within each transaction have existential probability values of their own. As in the case of traditional frequent itemset mining, uncertain itemset mining is mainly divided into two categories, level-wise approaches and pattern growth approaches.
Level-wise uncertain itemset mining methods [7, 30, 33] are based on the framework of the Apriori algorithm [2]. U-Apriori [7] is the first algorithm devised for extracting frequent itemsets from uncertain databases. This classical algorithm performs a mining process similar to that of Apriori. That is, the method searches for valid itemsets with length k and then generates candidate patterns with length k+1 using the found itemsets. After that, U-Apriori scans a given database once again and selects valid itemsets from the generated candidates. Hence, the algorithm has the same limitations as Apriori: it shows drastic performance degradation as the lengths of transactions in a given uncertain database become longer and the user-specified minimum support threshold becomes lower. Another level-wise method, MBP [33], is an algorithm that applies statistical techniques to the mining process. By applying the Poisson cumulative distribution function, the algorithm extracts valid pattern results from uncertain databases in an approximate manner. Although this is not an exact algorithm, it guarantees relatively high precision. IMBP [30], a variation of MBP, has been proposed to enhance the runtime and memory performance of MBP at the cost of losing accuracy. However, the degree of its accuracy loss is severe compared to exact algorithms; in particular, the algorithm has lower accuracy on dense databases and does not guarantee stable accuracy.
Uncertain itemset mining algorithms based on the pattern growth manner [16, 17, 19, 32] follow the basic framework of FP-Growth [12], which is the first pattern growth approach. Therefore, they perform their mining operations within two database scans and, unlike level-wise methods, do not generate any candidate patterns during the mining process. UF-Growth [17] mines uncertain itemsets, employing a UF-tree that is a variation of an FP-tree. However, in the tree, the algorithm only allows sharing nodes with the same item name and existential probability value in order to prevent losses of existential probability information of items with the same name but different values. For this reason, UF-Growth constructs a much less compact tree than an FP-tree, which allows sharing nodes with the same item name; such a tree causes both delays in tree search time and inefficient memory use. After UF-Growth, its advanced version [16] was devised to improve the performance of UF-Growth. In that algorithm, the k-digit measure is used to enhance the node sharing effect. That is, once k is set by a user, the algorithm ignores more than k decimal places in the existential probability values of nodes, thereby allowing more nodes to be merged although the existential probability values of the nodes are not exactly the same. However, the algorithm still has limitations: 1) it causes pattern losses because of disregarded decimal values at smaller k settings, and 2) it does not improve the mining performance at larger k settings. UH-Mine [1] is an uncertain mining method utilizing H-struct, a data structure used in the H-Mine algorithm [21]. The algorithm shows good performance on small databases but does not guarantee outstanding performance for relatively large databases because H-struct does not have any node sharing effect. CUFP-Mine [19], which is an approach proposed to maximize the node sharing effect, generates a tree structure as compact as an FP-tree (called a CUFP-tree) by allowing all of the possible node sharing operations regardless of the existential probability values of items. Each node of the tree has an array list that contains information of all the possible uncertain itemsets combined from the path with the node, and through such an array list, CUFP-Mine can find a complete set of valid uncertain itemsets without any recursive manner for generating conditional trees. However, this approach shows drastic performance degradation as the size of a given uncertain database becomes larger and the lengths of itemsets become longer. As another tree-based algorithm, AT-Mine [32] uses its own tree and list structures, AT-Tree and ProArr, to perform exact uncertain itemset mining operations without any pattern loss by node sharing. Two types of nodes, normal and tail nodes, compose the tree, and the list includes existential probability information of items within a given database. However, the above algorithms do not consider weight conditions in their uncertain itemset mining operations. Table 2 shows the characteristics of the aforementioned uncertain itemset mining algorithms. Each column of the table presents whether or not the algorithms consider the uncertainty and weight factors of items, extract a complete set of pattern results, and follow tree-based mining approaches. The reason why we denote "Exact mining method" of UF-Growth as Y/N is that the accuracy of the algorithm depends on its k-digit measure technique. Notice that our algorithm is compared to the state-of-the-art methods, CUFP-Mine and AT-Mine, for reasonable performance evaluation.
Table 2. Characteristics of uncertain itemset mining algorithms

Algorithm           Consideration    Exact mining   Consideration of different   Tree-based   Note
                    of uncertainty   method         item importance              method
U-Apriori [7]       Y                Y              N                            N            Apriori-based method
MBP [33]            Y                N              N                            N            Approximation method
IMBP [30]           Y                N              N                            N            Lower accuracy than that of MBP
UF-Growth [14, 17]  Y                Y/N            N                            Y            FP-growth-based method
UH-Mine [1]         Y                Y              N                            N            H-mine-based method
CUFP-Mine [19]      Y                Y              N                            Y            Non-growth method
AT-Mine [32]        Y                Y              N                            Y            FP-growth-based method
U-WFI               Y                Y              Y                            Y            The proposed algorithm
2.2. Weighted frequent pattern mining
Weighted frequent pattern mining [3, 14, 40, 41] is a concept proposed to solve the weight problem, one of the drawbacks of traditional frequent pattern mining. In other words, in contrast to the traditional approaches that apply the same level of importance or weight to all items within databases, weighted frequent pattern mining methods consider the various levels of weight values obtained from the real world. WFIM [42] is the first weighted frequent pattern mining algorithm based on FP-Growth. The algorithm constructs its own tree structure through two database scans and performs its mining operations, where it also uses a pattern pruning technique that can prevent violations of the anti-monotone property caused by applying weight conditions in the mining process. Unlike FP-Growth, WFIM builds trees according to a weight ascending order; therefore, less compact trees are generated and memory is consumed inefficiently. In addition, there are other various weight-based approaches such as WMFP-SW [14] and MWS [40] for dynamic data streams, MCWP [41] for considering correlations of patterns, and WEP [15] for erasable pattern mining. Table 3 shows the overall features of the aforementioned weight-based algorithms, where we have denoted "Exact mining method" of the maximal pattern mining methods as Y/N because they may cause pattern losses in their pattern converting processes. Note that, since the above algorithms cannot consider the uncertainty factor of items, they are not directly compared to our approach in the performance evaluation section.

Table 3. Characteristics of weight-based itemset mining algorithms

Algorithm      Consideration    Exact mining   Consideration of different   Tree-based   Note
               of uncertainty   method         item importance              method
WFIM [42]      N                Y              Y                            Y            FP-growth-based method
MWS [40]       N                Y/N            Y                            Y            Stream mining method
WMFP-SW [14]   N                Y/N            Y                            Y            Sliding window-based method
MCWP [41]      N                Y/N            Y                            Y            Correlation processing method
WEP [15]       N                Y              Y                            Y            Erasable pattern mining method
U-WFI          Y                Y              Y                            Y            The proposed algorithm
3. Mining Weighted Frequent Itemsets from Uncertain Databases
In this section, we describe details of the proposed algorithm, U-WFI, including the suggested data structures and its mining and pruning techniques that avoid any pattern loss. In addition, worked examples are provided to make the contents easier to follow. The main difference between our approach and traditional uncertain itemset mining is whether or not weight conditions of items are considered in the mining process. By considering the importance of each item within a given uncertain database, we can obtain more valuable mining results compared to the previous approaches, and we can also improve the efficiency of the overall mining operations by preventing the algorithm from generating useless patterns. Fig. 1 shows the overall structure and procedure of the proposed algorithm. Most uncertain pattern mining works can be classified as approximation methods using statistical techniques or exact ones that do not employ any estimation techniques. Although the statistics-based methods can improve their mining efficiency using approximation techniques, they have the limitation of causing pattern losses and failing to obtain exact mining results. For this reason, we focus on the latter approach. The main difference between the target databases in uncertain pattern mining and those of traditional frequent pattern mining is that each item within an uncertain database has its own existential probability value. Hence, many additional considerations are needed to mine uncertain patterns from such uncertain databases. Moreover, we also need special techniques, such as an overestimation method, to maintain the anti-monotone property during the process of uncertain weighted frequent pattern mining.
Fig. 1. Mining procedure of U-WFI
3.1. Preliminaries
Most of the data in uncertain itemset mining can be expressed with possible world semantics to show uncertain relations among items [7, 19, 32]. That is, given a transaction, T, item i in T can belong to two possible worlds, W1 (existence) and W2 (non-existence), denoted as i ∈ T in W1 and i ∉ T in W2. Let P(i, T) be the existential probability of i in T. Then, the probability that i belongs to W1 or W2 becomes P(i, T) and 1 – P(i, T), respectively. Similarly, the probability that both items, i and j, exist in W1 or W2 can be denoted as P(i, T) * P(j, T) and (1 – P(i, T)) * (1 – P(j, T)), respectively. In other words, W1 and W2 signify the two types of worlds that each item i within transaction T can belong to. Therefore, for the transaction, T, the relationship between W1 and W2 can be denoted as W1 ∪ W2 = T. In addition, if we expand this concept with respect to an uncertain database with multiple Ts, UDB, the relationship can be expressed as W1 ∪ W2 = UDB.
Definition 1. (Expected support of an itemset) Let UDB = {T1, T2, …, Tn} be a given uncertain database, each T be a transaction belonging to UDB, and I = {i1, i2, …, im} be a set of distinct items in UDB. Each T has a set of items such that each of them is included in I, denoted as T = {i1, i2, …, ik}, where each item, i, has its own existential probability between 0 (0%) and 1 (100%), and the value can be different from that of the same item in
another T, as shown in Table 1. Then, given an itemset, X, its existential probability in each T, P(X, T), is calculated as follows:

P(X, T) = ∏_{i_k ∈ X} Tprob(i_k)    (1)

An expected support of X, ExpSup(X), is computed as follows:

ExpSup(X) = Σ_{T ∈ UDB} P(X, T)    (2)

In Equation (1), Tprob(ik) means the existential probability of item ik belonging to a certain transaction, T. While itemsets resulting from traditional frequent pattern mining have actual support values, those from uncertain itemset mining have expected support values calculated by the above equations. If the expected support of an uncertain itemset is higher than or equal to a given minimum support threshold, it is considered an uncertain frequent itemset [7, 19, 32], where the threshold is given by a user as a percentage, and itemsets' expected supports are compared to the multiplication of this percentage by the number of transactions in UDB, called MinSup. Consequently, the main purpose of uncertain frequent itemset mining is to find all of the possible uncertain itemsets with expected supports higher than or equal to MinSup.
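To make Definition 1 concrete, the following C++ sketch computes Equations (1) and (2) over an in-memory uncertain database; the type names and the representation of a transaction as an item-to-probability map are our own illustrative choices, not structures from this paper.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// One uncertain transaction: item -> existential probability (Tprob).
using Transaction = std::map<std::string, double>;

// Equation (1): P(X, T) = product of Tprob(i_k) over all i_k in X,
// or 0 if some item of X does not appear in T at all.
double itemsetProb(const std::vector<std::string>& X, const Transaction& T) {
    double p = 1.0;
    for (const auto& item : X) {
        auto it = T.find(item);
        if (it == T.end()) return 0.0;  // X cannot occur in T
        p *= it->second;
    }
    return p;
}

// Equation (2): ExpSup(X) = sum of P(X, T) over all transactions in UDB.
double expSup(const std::vector<std::string>& X, const std::vector<Transaction>& UDB) {
    double s = 0.0;
    for (const auto& T : UDB) s += itemsetProb(X, T);
    return s;
}

int main() {
    // Items B and D of TIDs 010 and 030 from Table 4.
    std::vector<Transaction> UDB = {{{"B", 0.7}, {"D", 0.9}},
                                    {{"B", 0.9}, {"D", 0.4}}};
    std::cout << expSup({"B", "D"}, UDB) << "\n";  // 0.7*0.9 + 0.9*0.4 = 0.99
}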
3.2. Storing an uncertain database and weight information with a tree structure
Table 4 presents an example uncertain database and its item weight information. As shown in the table, transactions within the uncertain database have parts that overlap with one another. Therefore, a tree structure with a node sharing effect is suitable for expressing such information compactly. However, items in uncertain databases can have existential probability values different from one another, unlike those in general itemset databases. Thus, if we construct a tree without any additional consideration for them, losses of uncertain information can occur. Early tree-based uncertain itemset mining algorithms used a method that allowed sharing only nodes with the same item name and existential probability. However, since the number of nodes satisfying these conditions is small, the constructed tree inevitably has a very sparse and complicated form; in the worst case, using such a tree structure may be more inefficient than not using one at all. In order to overcome this problem and store a given uncertain database including weight information more efficiently, we propose a tree structure with the following characteristics.
Table 4. Example of an uncertain database with weight information

TID   A     B     C     D     E     F     G
010   0.5   0.7   -     0.9   0.9   0.6   -
020   -     0.6   0.9   0.6   -     0.7   0.5
030   0.9   0.9   -     0.4   0.4   0.6   -
040   0.5   0.3   0.7   0.5   0.8   0.9   0.7
050   0.7   0.5   0.9   0.8   0.9   0.5   0.3
060   -     -     -     0.6   -     0.4   0.3
070   0.4   0.7   0.6   0.8   0.9   0.9   0.9
080   0.9   0.7   0.9   0.5   0.4   -     0.3
090   0.5   0.9   -     -     0.5   -     0.7
100   0.9   0.7   0.8   0.6   0.6   0.5   0.4

(Item H occurs in seven of the above transactions with probabilities 0.1, 0.2, 0.7, 0.2, 0.5, 0.4, and 0.1; since H is pruned at the beginning of Example 1, its exact placement does not affect the running example.)

Item     A     B     C     D     E     F     G     H
Weight   0.9   0.8   0.6   0.6   0.7   0.8   0.5   0.3
Definition 2. (Global U-WFI-tree) Given an uncertain database, UDB = {T1, T2, …, Tn}, and its weight information, W = {w1, w2, …, wm}, they are stored into a Global U-WFI-tree. This tree is composed of a header table storing essential data for the mining operations of U-WFI and a rooted tree structure storing actual item information in UDB. The header table consists of the following columns: Item, ExpSup, Support, Weight, and Node link. The rooted tree is composed of a root node and multiple normal nodes, where each normal node has an item label (Notice that the definition of Conditional U-WFI-tree for uncertain weighted frequent itemset mining is described in the next section in detail). Information in UDB and W is stored in the Global U-WFI-tree in the following manner. 1) Scan UDB once to know ExpSup and Support information of items in UDB. 2) Delete items such that multiplying each of their ExpSup by MaxW is smaller than MinSup (MaxW means the largest value among weights in W, and its details are described in the next section). 3) Generate a header table of the Global U-WFI-tree on the basis of the items’ support descending order. 4) Scan UDB once again, where for each T, its items are sorted in the support descending order and invalid items are removed. Each of the processed Ts is inserted into the rooted tree of the Global U-WFI-tree in sequence. Finally, 5) connect an appropriate node link whenever each item in T is entered into the tree. There is a need to sort the proposed Global U-WFI-tree in a support descending order that leads to the best node sharing effect in order to store information of UDB and W in the most compact form. However, additional considerations are needed in this case because we cannot know whether nodes shared in the tree have different probability values or not. In the tree structure of traditional frequent itemset mining, each shared node has a
support of 1. Meanwhile, in our U-WFI-tree, shared nodes may have ExpSup values different from one another, apart from the support of 1. For this reason, such characteristics of uncertain itemset mining should be considered in the tree. A naïve manner is to include all of the corresponding ExpSup values in each node of the tree, but in this case, there is no advantage in using the tree structure because all of the information in UDB is expressed in the tree without any compression. Hence, we propose an advanced method that can express all the information in UDB more effectively by inserting special information into the last node for each inserted transaction, called a tail node (different from the concept of the leaf node).
Definition 3. (Global U-WFI-code) Each tail node of the Global U-WFI-tree links a Global U-WFI-code that stores one or more TID data. That is, given a certain tail node, N, and the number of transactions in UDB, k, the Global U-WFI-code of N stores a set of TIDs for the transactions with N, N.L_TID, which is denoted as N ~ N.L_TID = {TID1, TID2, …, TIDi} (1 ≤ # of TIDs ≤ k). By referring to the transactions corresponding to each TID, we can determine probability information of items without storing all of the information in the tree. The reason why we use the tail node approach to efficiently distinguish probability values of nodes is as follows.
Lemma 1. A tail node is an identifier that can classify the probability values of the U-WFI-tree's nodes with the smallest number of operations.
Proof. Let us consider distinguishing a path, Path, from a certain U-WFI-tree, Tree. Since Path is equal to the corresponding processed transaction of UDB inserted into Tree, we can know the probability values with respect to all the nodes, including shared nodes, if each Path can be classified from Tree. To do that, we can consider four types of criteria: 1) root node, 2) leaf node, 3) any middle node except for the root and leaf ones, and 4) tail node. In the first case, since the root node has multiple child nodes, additional information and computations are needed to know which child node belongs to Path. In the second case, we can easily see the path including the corresponding leaf node by following the parent node links of the leaf, but we still cannot determine whether the path is just a simple path corresponding to one transaction or a complex path sharing multiple transaction data. Hence, additional work is also necessary. The third case is the worst case, which has to consider both parent and child nodes of the middle node. In the last case, since each tail node is a node with the last item in each processed transaction inserted into Tree, all of its ancestor nodes become members of the corresponding Path. In addition, computational overheads are also the smallest of the four cases because we have only to follow the parent node links of the tail node. ■
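As a concrete illustration of Definitions 2 and 3 and Lemma 1, a Global U-WFI-tree node might be laid out as in the following C++ sketch; the field names and containers are our own assumptions, not the paper's implementation.

#include <map>
#include <memory>
#include <string>
#include <vector>

// A possible node layout for the Global U-WFI-tree (Definitions 2-3). Shared
// nodes carry only an item label; a tail node additionally owns a Global
// U-WFI-code (L_TID), and its path is recovered by following parent pointers
// as described in Lemma 1.
struct UWFINode {
    std::string item;                                        // item label
    UWFINode* parent = nullptr;                              // for path recovery from a tail node
    std::map<std::string, std::unique_ptr<UWFINode>> children;
    UWFINode* nodeLink = nullptr;                            // header-table node link (next node with the same item)
    std::vector<int> lTid;                                   // Global U-WFI-code; non-empty only at tail nodes
};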
In order to efficiently refer to items’ probability values in transactions, information of UDB has to be loaded into the main memory with the Global U-WFI-tree during the mining process. Therefore, we compactly refine the
information using a list structure, called a TID-list, and maintain it during the mining process. Recall that, after the first UDB scan, we know the support descending order and the invalid item information. Hence, we construct the TID-list by deleting the invalid items from each transaction and storing the transactions sorted in the support descending order. This list is created at the same time as the tree construction during the second scan. The list is composed of the minimum information needed to classify the probability value of each node within the Global U-WFI-tree. In addition, it also includes index numbers of the items corresponding to the probability values in each TID, where the sequence of the index numbers follows the item order of the header table in the Global U-WFI-tree.

Table 5. TID-list generated from Table 4 (MinSup = 2)

TID   Probabilities
010   1:0.7  2:0.9  3:0.5  4:0.9  5:0.6
020   1:0.6  2:0.6  5:0.7  6:0.5  7:0.9
030   1:0.9  2:0.4  3:0.9  4:0.4  5:0.6
040   1:0.3  2:0.5  3:0.5  4:0.8  5:0.9  6:0.7  7:0.7
050   1:0.5  2:0.8  3:0.7  4:0.9  5:0.5  6:0.3  7:0.9
060   2:0.6  5:0.4  6:0.3
070   1:0.7  2:0.8  3:0.4  4:0.9  5:0.9  6:0.9  7:0.6
080   1:0.7  2:0.5  3:0.9  4:0.4  6:0.3  7:0.9
090   1:0.9  3:0.5  4:0.5  6:0.7
100   1:0.7  2:0.6  3:0.9  4:0.6  5:0.5  6:0.4  7:0.8
Example 1. Consider generating a Global U-WFI-tree and TID-list from the example UDB and W in Table 4 when MinSup is given as 2. After the first database scan, the ExpSup and Support of the items are calculated as follows: {A: 5.3, 8}, {B: 6.0, 9}, {C: 4.8, 6}, {D: 5.7, 9}, {E: 5.4, 8}, {F: 5.1, 8}, {G: 4.1, 8}, and {H: 2.2, 7}. Then, the items' support descending order and index numbers are {1:B, 2:D, 3:A, 4:E, 5:F, 6:G, 7:C}. Note that item H becomes meaningless because multiplying the ExpSup of H, 2.2, by MaxW, 0.9, gives a value smaller than MinSup. Therefore, H is pruned. Thereafter, a header table of the Global U-WFI-tree is created as shown in Fig. 2. In the second database scan, each transaction, with its invalid items removed and the remainder sorted in the support descending order, is inserted into the tree in sequence. In this process, the TID-list is also generated as shown in Table 5, where each element of the list is a pair of an item index and its corresponding probability. Fig. 2(a) presents the tree after insertion up to the 5th transaction, where, as shown in the header table, items have been sorted in their support descending order. In Fig. 2(a), the inserted transactions are TID:010 {B, D, A, E, F}, TID:020 {B, D, F, G, C}, TID:030 {B, D, A, E, F}, TID:040 {B, D, A, E, F, G, C}, and TID:050 {B, D, A, E, F, G, C}. When the second transaction is inserted into the tree after the first transaction, items B and D can be merged; similarly, parts of the other transactions are also merged as shown in the figure. The tail nodes corresponding to these transactions are F, C, F, C, and C, respectively; however, three tail nodes, F~{010, 030}, C~{020}, and C~{040, 050}, finally remain by the node sharing effect among them. After that, with respect to the shared nodes in the global tree, B, D, A, E, G, and C, we can determine their exact probability values by using the Global U-WFI-codes included in these tail nodes. Fig. 2(b) shows the tree after all the transactions have been inserted into the global tree.
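The lookup that Example 1 describes, recovering the exact per-transaction probabilities of a shared node through a tail node's Global U-WFI-code and the TID-list, can be sketched as follows; the TID-list representation (a TID mapped to an index-to-probability map) is an illustrative assumption.

#include <map>
#include <vector>

// Recover the per-transaction probabilities of the item with index `idx`
// on the path of a tail node: look the index up in the TID-list rows named
// by the node's Global U-WFI-code, e.g. lTid = {10, 30} for F~{010, 030}.
std::vector<double> probsOnPath(const std::vector<int>& lTid, int idx,
                                const std::map<int, std::map<int, double>>& tidList) {
    std::vector<double> out;
    for (int tid : lTid)
        out.push_back(tidList.at(tid).at(idx));  // exact probability in that transaction
    return out;
}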
Fig. 2. Global U-WFI-tree constructed from Table 4 (MinSup = 2)
3.3. Mining UWFIs from an uncertain database with weight information
In this section, we define Uncertain Weighted Frequent Itemsets (UWFIs) and describe how the U-WFI algorithm mines UWFIs from the Global U-WFI-tree constructed in Section 3.2. In addition, we also propose an effective pruning method that does not violate the correctness of the suggested algorithm.
3.3.1. Uncertain weighted frequent itemsets
The main goal of the algorithm proposed in this paper is to extract all of the possible UWFIs from a given uncertain database, where a UWFI is defined as follows.
Definition 4. (Uncertain Weighted Frequent Itemset (UWFI)) Recall that uncertain itemset mining considers the ExpSup of each itemset, instead of its Support, as shown in Definition 1. Then, Uncertain Weighted Frequent Itemsets (UWFIs) can be obtained in the following manner. Given an itemset composed of k items, P = {i1, i2, …,
ik}, a set of item weights in P, WP, is denoted as WP = {w1, w2, …, wk}. Then, the representative weight of P is expressed as the average of all the w values and computed as follows:

Avg(WP) = (Σ_{w ∈ WP} w) / |WP|    (3)

In this equation, |WP| means the number of elements in WP. Then, the Weighted ExpSup of P, WES(P), is derived by Equation (4):

WES(P) = Avg(WP) * ExpSup(P)    (4)

If WES(P) is not lower than the user-given MinSup, P is considered a valid UWFI.
Example 2. From Equations (3) and (4), we can calculate the WES of an itemset within Table 4, P = {B, D}, as follows. P exists in TID:010, 020, 030, 040, 050, 070, 080, and 100. Thus, from Equations (1) and (2), ExpSup(P) = {(0.7 * 0.9) + (0.6 * 0.6) + (0.9 * 0.4) + (0.3 * 0.5) + (0.5 * 0.8) + (0.7 * 0.8) + (0.7 * 0.5) + (0.7 * 0.6)} = 3.23, where each multiplied value means an existential probability of P in the corresponding transaction; e.g., in 0.7 * 0.9, the first value, 0.7, is B's probability in TID:010 and the second one, 0.9, is D's value in the same transaction. From Equations (3) and (4), Avg(WP) = (0.8 + 0.6) / 2 = 0.7 and then WES(P) = 0.7 * 3.23 = 2.261. Let us assume that MinSup = 2, as in the previous example. Then, P becomes a UWFI because WES(P) > MinSup. Next, let us consider calculating the WES of a certain super pattern of P, P' = {A, B, D}. Since P' exists in TID:010, 030, 040, 050, 070, 080, and 100, ExpSup(P') = {(0.5 * 0.7 * 0.9) + (0.9 * 0.9 * 0.4) + (0.5 * 0.3 * 0.5) + (0.7 * 0.5 * 0.8) + (0.4 * 0.7 * 0.8) + (0.9 * 0.7 * 0.5) + (0.9 * 0.7 * 0.6)} = 1.911, Avg(WP') = (0.9 + 0.8 + 0.6) / 3 = 0.7667 (rounded to four decimal places), and then WES(P') = 0.7667 * 1.911 = 1.4652 (rounded to four decimal places). In this case, P' becomes a useless pattern.
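Continuing the earlier sketch from Section 3.1, Equations (3) and (4) translate directly into code; `weights` is an illustrative weight table keyed by item name, not a structure from the paper.

#include <map>
#include <string>
#include <vector>

// Equations (3)-(4): WES(P) = Avg(W_P) * ExpSup(P). Reuses the Transaction
// type and expSup() from the sketch in Section 3.1.
double wes(const std::vector<std::string>& P,
           const std::vector<Transaction>& UDB,
           const std::map<std::string, double>& weights) {
    double wsum = 0.0;
    for (const auto& item : P) wsum += weights.at(item);   // sum over W_P
    double avgW = wsum / static_cast<double>(P.size());    // Equation (3)
    return avgW * expSup(P, UDB);                          // Equation (4)
}

For P = {B, D} in Table 4, avgW evaluates to (0.8 + 0.6) / 2 = 0.7, matching Example 2.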
3.3.2. Pruning techniques based on an overestimation method Recall that MaxW was briefly mentioned in Section 3.2 to delete meaningless items from a given UDB. Pattern pruning by the anti-monotone property [2] is one of the most important factors in the frequent itemset mining area. In addition, this property is also known to be satisfied in the uncertain pattern mining field. In frequent pattern mining, the anti-monotone property (also called the downward closure property) signifies that, if an itemset, X, has a smaller Support than a given MinSup (i.e., X is infrequent), all of the possible supersets of X have values, which are also lower than MinSup. Similarly, in uncertain frequent pattern mining, if ExpSup of an uncertain itemset, Y, is smaller than MinSup, those of all the possible supersets of Y are also lower than the threshold. Through these cases, we can determine that, as the length of a pattern becomes longer, its Support or ExpSup must remain the same state or become smaller to maintain the anti-monotone property. However, through the literature dealing with weight conditions [14, 15, 40], we already know that simply applying weight
factors into the mining process without any additional considerations does not satisfy this property. We can also observe this from the example in this paper. In Example 2, the ExpSup values of P and P' are 3.23 and 1.911, respectively, which satisfy the anti-monotone property; meanwhile, their Avg values are 0.7 and 0.7667, respectively, which violate this property. That is, the weight factor also causes a problem in uncertain itemset mining. Pattern pruning that does not satisfy this property can cause fatal pattern losses. For this reason, we propose a weight applying method suitable for uncertain itemset mining in order to solve this problem.
Definition 5. (Maximum weight (MaxW)) Given a set of weights belonging to UDB, W = {w1, w2, …, wk}, the weight value corresponding to Maximum({w1, w2, …, wk}) is assigned as the Maximum weight, MaxW.
Definition 6. (WES with an overestimated weight (WESover)) Since Avg does not satisfy the anti-monotone property, we employ an overestimation factor, WESover, replacing Avg with MaxW. That is, given an itemset, P, WESover(P) is computed as follows:

WESover(P) = MaxW * ExpSup(P)    (5)

Lemma 2. Pattern pruning by the WESover technique does not cause any loss of UWFIs.
Proof. Let P be a given itemset and P' be a superset of P. In the case of traditional frequent pattern mining or uncertain pattern mining, if Support(P) or ExpSup(P) < MinSup, then Support(P') or ExpSup(P') < MinSup because Support(P) ≥ Support(P') and ExpSup(P) ≥ ExpSup(P'). Therefore, in this case, deleting P does not cause any problem since all the possible super patterns of P also have smaller values than MinSup. On the other hand, let us consider the case of WES(P) and WESover(P). The value of WES(P) is decided by Avg(WP) and ExpSup(P), where Avg(WP) cannot guarantee that the changes of its value obey the anti-monotone property. Hence, pruning P by the WES value can lead to fatal pattern losses because it cannot be guaranteed that WES(P) ≥ WES(P'). Meanwhile, in the case of WESover(P), MaxW is used instead of the Avg factor, where MaxW does not have any effect on the change between WESover(P) and WESover(P') because it is always the same value. Consequently, WESover always satisfies the anti-monotone property, and pattern pruning based on it allows us to achieve a complete UWFI mining process without any pattern loss.
■
Example 3. We can observe from Example 2 that Avg(WP) and Avg(WP') are 0.7 and 0.7667, respectively, which means that the Avg value did not decrease or stay unchanged but rather increased during the pattern growth step. In other words, Avg affects the change between WES(P) and WES(P'). On the other hand, since the same MaxW value, 0.9, is used in WESover(P) and WESover(P') instead of their own Avg values, it plays no part in the change between WESover(P) and WESover(P'). That is, we can see that the anti-monotone property is maintained by the overestimated value, MaxW.
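In code, the pruning bound of Definition 6 and the final candidate check of Definition 4 reduce to two comparisons; this is a minimal sketch with illustrative names.

// Equation (5): WESover(P) = MaxW * ExpSup(P). Pruning P when this
// overestimated bound is below MinSup is safe, because MaxW is constant
// while ExpSup can only shrink as P grows (Lemma 2).
bool prunable(double expSupP, double maxW, double minSup) {
    return maxW * expSupP < minSup;
}

// A candidate that survives pruning is a UWFI only if its real weighted
// expected support reaches MinSup (Definition 4).
bool isUWFI(double expSupP, double avgWeightP, double minSup) {
    return avgWeightP * expSupP >= minSup;
}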
Note that this overestimation method guarantees the correctness of the proposed algorithm by preventing pattern losses, but we cannot yet know whether the results obtained from the method are actually valid UWFIs because their WES values are overestimated. That is, all the resulting patterns from this overestimation method are candidate UWFIs; hence, it is essential to calculate their real WES values and compare them with MinSup.
3.3.3. Pattern growth of U-WFI
Once a complete Global U-WFI-tree is constructed, the proposed algorithm mines UWFIs using the tree in a divide-and-conquer, recursive manner. To do this, we generate multiple smaller-scale Conditional U-WFI-trees from the global tree and conduct UWFI mining works from them. In other words, U-WFI generates multiple partial tasks (conditional trees) from the main task (the global tree) and processes each of the partial ones (divide-and-conquer); thereafter, the algorithm mines patterns by performing an itemset growth technique for each partial task (Depth-First Search (DFS)). These partial works are further divided into smaller tasks (conditional trees of a conditional tree), and such a process is recursively conducted until no portion remains that can be split from each task.
Definition 7. (Conditional U-WFI-tree) Given a Global U-WFI-tree, Tree, the algorithm constructs a Conditional U-WFI-tree, Tree', with respect to each item of the header table in Tree, where each selected item serves as a reference called the prefix (the size of the prefix is gradually increased one item at a time according to the recursively performed mining process; the details are described in the subsequent part of this section). If the number of items in Tree's header table is k, the conditional trees obtained from Tree are Tree'1, Tree'2, …, Tree'k. Each of the conditional trees also has one header table and one rooted tree in common with the global tree, but the difference between them is that conditional trees can continue to be created in a recursive manner, unlike the global one. That is, on the basis of each Tree'k, a number of conditional trees, Tree''s, can be constructed. The following are the definite differences between the Global and Conditional U-WFI-trees: 1) the global tree is a data structure for efficiently storing UDB and W, while the conditional trees are data structures generated only for the UWFI mining step, and 2) the global tree is created only once and maintained until the whole mining process is finished, while the conditional trees are recursively generated many times if necessary (each of them is immediately deleted after use). Since such conditional trees are data structures constructed for mining UWFIs from the global tree, we need additional considerations beyond the Global U-WFI-code information used in the global tree.
Definition 8. (Conditional U-WFI-code) Recall that a Global U-WFI-code stores a TID set, called L_TID,
which contains TID information of one or more transactions that are included in a certain path with a corresponding tail node. In addition, a U-WFI-code connected to each tail node, N, in a Conditional U-WFI-tree (called Conditional U-WFI-code) stores a set of item indexes included in a path with N, called L_ItemIdx, and a set of accumulated probability values for the prefix selected so far, called L_PrePro, where the item indexes follow the sequence of the global tree’s header table. That is, a Conditional U-WFI-code of tail node N has the three types of information: N.L_TID, N.L_ItemIdx, and N.L_PrePro. Through such information of the Conditional U-WFI-code, our algorithm efficiently performs the UWFI mining operations.
Fig. 3. General expression and characteristics of the conditional U-WFI-code

Fig. 3 shows a general expression and characteristics of the Conditional U-WFI-code, where N is a given tail node and NA1 to NAk denote the ancestor tail nodes of N; NA1 is the ancestor tail node closest to N, while NAk is the one nearest to the root. The Conditional U-WFI-code information of these tail nodes has the characteristics shown in Fig. 3, which are employed in calculations among tail nodes (see the subsequent part of this section for details). From the constructed global tree, U-WFI conducts its mining operations as follows (readers who understand the overall concept of FP-growth will find the following steps familiar, because the procedure of the proposed approach is similar to the basic framework of FP-growth):
1) Select the bottom item in the Global U-WFI-tree's header table and input it into the prefix. (Items are continually added into the prefix one by one whenever a recursive call is performed; after each recursive call finishes, the item added in the corresponding step is removed from the prefix again. Since the prefix is also a candidate UWFI, its real WES is computed whenever the length of the prefix grows, and the prefix is output as a valid pattern if the value is not smaller than MinSup.)
2) Generate a Conditional U-WFI-tree for the current prefix.
2-1) Find the upper nodes of each node labeled with the recently selected item, i.e., the last item in the prefix, by traversing the node links of the global tree, and calculate WESover values with respect to the items of the found nodes (the first tree scan).
2-2) Remove invalid items from the found ones and sort the remainders in their support descending order (note that the order may be different from that of the global tree).
2-3) Construct a Conditional U-WFI-tree by traversing the global tree once again on the basis of the information obtained from Step 2-2 (the second tree scan).
2-4) Generate Conditional U-WFI-codes for tail nodes and insert the necessary information into the codes during the conditional tree construction phase.
3) Iterate Step 2 with respect to each item in the current conditional tree's header table (this task is conducted recursively). If the growth process for a given item terminates, all of the nodes containing the item are eliminated and their tail node information is transferred to their own parent nodes. Conditional trees with no further task are immediately removed.
4) Repeat all of the above steps until the processes have been conducted with respect to all the items in the global tree's header table. After finishing these tasks, we can obtain a complete set of UWFIs.
Recall that we have proposed an overestimation method using MaxW, as shown in Definitions 5 and 6. Although this method guarantees the correctness of our algorithm by preventing any pattern loss, it generates a number of candidate patterns larger than the number of real UWFIs. Therefore, we need to minimize MaxW, to the extent that no pattern is lost by the weight factor, in order to reduce the number of candidates as much as possible. From the mining procedure above, we can see that once an item is selected as the prefix at the beginning of the pattern growth step, all of the items participating in the mining process for the current prefix are limited to the items of the nodes comprising the corresponding conditional tree. That is, we use the maximum weight among all the elements in W when constructing the global tree, and then use another maximum weight factor for each item in the global tree's header table, namely the maximum value among the weights of the items participating in the current mining process, i.e., the UWFI growth for the current prefix (a sketch of this computation follows Example 4 below). From such considerations, we can enhance the efficiency of the proposed U-WFI algorithm.
Example 4. When the Global U-WFI-tree in Fig. 2(b) is constructed, MaxW is set to 0.9 because the maximum weight among all the elements in W is 0.9. If C is set as the prefix, the items related to the corresponding mining process are B, D, A, E, F, and G; therefore, MaxW remains 0.9 in this case. Meanwhile, for prefix D, MaxW decreases to 0.8 because the maximum value among the weights of the participating items is 0.8. That is, in this case, B is the only item used in the pattern growth process for D; therefore, the maximum weight becomes 0.8 among the weight values of B and D, 0.8 and 0.6.
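The per-prefix weight bound just described can be computed in a few lines; this sketch assumes the weight table and the set of participating items are available, and the function name is our own.

#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Local MaxW for the current prefix: the largest weight among the items
// that actually participate in growing this prefix, rather than the
// global maximum over all of W.
double localMaxW(const std::vector<std::string>& participatingItems,
                 const std::map<std::string, double>& weights) {
    double maxW = 0.0;
    for (const auto& item : participatingItems)
        maxW = std::max(maxW, weights.at(item));
    return maxW;
}

For prefix D in Example 4, participatingItems is {B, D}, so the function returns max(0.8, 0.6) = 0.8.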
Fig. 4. Conditional U-WFI-trees constructed from the Global U-WFI-tree in Fig. 2(b) (MinSup = 2)

Example 5. Figs. 4(a), (b), (c), (d), (e), and (f) show the Conditional U-WFI-trees that can be created from the Global U-WFI-tree in Fig. 2(b), where the conditional tree for item B has a null state; therefore, it has been excluded from the figure. The conditional tree for item C is constructed as follows. After the first tree scan of the global tree, we can obtain the item information related to C as follows: {B:4, D:4, A:4, E:4, F:4, G:4}, {B:1, D:1, A:1, E:1, G:1}, and {B:1, D:1, F:1, G:1}, where each number is a support value of the corresponding item. Then, we can calculate the ExpSup values for B, D, A, E, F, and G, the items appearing together with C, as follows: ExpSup(C→B) = 0.21 (= 0.7 * 0.3) + 0.45 (= 0.9 * 0.5) + 0.42 (= 0.7 * 0.6) + 0.56 (= 0.8 * 0.7) + 0.63 (= 0.9 * 0.7) + 0.54 (= 0.9 * 0.6) = 2.81; ExpSup(C→D) = 0.35 (= 0.7 * 0.5) + 0.72 (= 0.9 * 0.8) + 0.48 (= 0.6 * 0.8) + 0.48 (= 0.8 * 0.6) + 0.45 (= 0.9 * 0.5) + 0.54 (= 0.9 * 0.6) = 3.02; ExpSup(C→A) = 0.35 (= 0.7 * 0.5) + 0.63 (= 0.9 * 0.7) + 0.24 (= 0.6 * 0.4) + 0.72 (= 0.8 * 0.9) + 0.81 (= 0.9 * 0.9) = 2.75; ExpSup(C→E) = 0.56 (= 0.7 * 0.8) + 0.81 (= 0.9 * 0.9) + 0.54 (= 0.6 * 0.9) + 0.48 (= 0.8 * 0.6) + 0.36 (= 0.9 * 0.4) = 2.75; ExpSup(C→F) = 0.63 (= 0.7 * 0.9) + 0.45 (= 0.9 * 0.5) + 0.54 (= 0.6 * 0.9) + 0.40 (= 0.8 * 0.5) + 0.63 (= 0.9 * 0.7) = 2.65; ExpSup(C→G) = 0.49 (= 0.7 * 0.7) + 0.27 (= 0.9 * 0.3) + 0.54 (= 0.6 * 0.9) + 0.32 (= 0.8 * 0.4) + 0.27 (= 0.9 * 0.3) + 0.45 (= 0.9 * 0.5) = 2.34. Since there is no item such that ExpSup * MaxW (0.9) < MinSup (2), none of the items are pruned. However, the current item order is different from that of the global tree, as shown in Fig. 4(a). With this information, we make a header table for C. Thereafter, during the second global tree scan phase, the valid items are inserted into the conditional tree according to the current support descending order, where tail node information is also generated for each path. For instance, in Fig. 4(a), the Conditional U-WFI-code of tail node E shows that the path including the current tail node is a transaction with TID:080, the item indexes comprising the path are 1, 2, 6, 3, and 4, and the prefix probability value accumulated with respect to this path is 0.9, where the item indexes follow the sequence of the global tree's header table. After finishing the conditional tree construction for C, the U-WFI algorithm recursively creates a new, smaller conditional tree for the bottom item in the header table again. Such recursive works are iterated until a tree with a null state is generated. Once the divide-and-conquer-based recursive call process has been performed for every conditional tree in Fig. 4, the U-WFI growth procedure for mining UWFIs is completed. Recall that after a series of tasks for an item is finished in a global or conditional U-WFI-tree, all the nodes with the item are deleted from the tree, and the tail node information of the removed nodes is transferred to the corresponding parent nodes. If the parent is also a tail node, we handle this case as follows.
Fig. 5. General expression of the ⊕ operations

Definition 9. (⊕ operation) Let N be a tail node to be deleted and NP be a parent of N. If NP is also a tail node, we process them as shown in Fig. 5, denoted as N ⊕ NP. That is, N ⊕ NP = {(N ⊕ NP).L_TID = N.L_TID ∪ NP.L_TID}, {(N ⊕ NP).L_ItemIdx = N.L_ItemIdx ∪ NP.L_ItemIdx}, and {(N ⊕ NP).L_PrePro = N.L_PrePro ∪ NP.L_PrePro}. Meanwhile, if NP is not a tail node, N ⊕ NP = {(N ⊕ NP).L_TID = N.L_TID}, {(N ⊕ NP).L_ItemIdx = N.L_ItemIdx − in}, and {(N ⊕ NP).L_PrePro = N.L_PrePro}, where in is the index of the deleted item. Note that if they have Global U-WFI-codes, not conditional ones, N ⊕ NP = {(N ⊕ NP).L_TID = N.L_TID ∪ NP.L_TID}, which is much simpler than the above case.
Example 6. Fig. 6 shows how to process the ⊕ operations in the Conditional U-WFI-tree for item C in Fig. 4(a). After completing a series of tasks for item F, the algorithm removes the two nodes with item F from this conditional tree. In the case of the first node, the ⊕ operations are conducted between these two Conditional U-WFI-codes because the parent of this node is also a tail node. On the other hand, the parent of the second node is a normal node; therefore, the Conditional U-WFI-code information of this second one moves to its parent, G. Note that after the transfer, the index of F, 5, is deleted from G.L_ItemIdx to reflect the elimination of F. After the process for item E, the tail node information for E moves to its parent node, A. After item A, the ⊕ operations occur between nodes A and G. If the algorithm conducts such works up to item B, all the mining operations for the current prefix, C, are finished.

Fig. 6. Processing ⊕ operations from the Conditional U-WFI-tree for "C" in Fig. 4(a)
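The two cases of Definition 9 can be sketched as follows; the CondCode structure and function names are our own, and L_PrePro is flattened into a plain set here for brevity, whereas the paper keeps one accumulated value per TID.

#include <set>

// Simplified stand-ins for the three components of a Conditional
// U-WFI-code (Definition 8).
struct CondCode {
    std::set<int>    lTid;      // L_TID
    std::set<int>    lItemIdx;  // L_ItemIdx
    std::set<double> lPrePro;   // L_PrePro (simplified representation)
};

// Definition 9, first case: the deleted tail node N merges into its
// tail-node parent NP by set union on all three components.
void mergeTailIntoTail(const CondCode& n, CondCode& np) {
    np.lTid.insert(n.lTid.begin(), n.lTid.end());
    np.lItemIdx.insert(n.lItemIdx.begin(), n.lItemIdx.end());
    np.lPrePro.insert(n.lPrePro.begin(), n.lPrePro.end());
}

// Second case: the parent is a normal node, so N's code moves up unchanged
// except that the deleted item's index leaves L_ItemIdx (cf. Example 6,
// where index 5 of item F is removed after the transfer).
CondCode moveToNormalParent(CondCode n, int deletedItemIdx) {
    n.lItemIdx.erase(deletedItemIdx);
    return n;
}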
3.4. Algorithm description: U-WFI
In the previous sections, we have described the details of the proposed algorithm, U-WFI, and its various techniques for efficiently mining UWFIs without any problem. In this section, we present the overall mining procedure of U-WFI covering all of the proposed contents. Fig. 7 shows how U-WFI conducts its mining operations to extract UWFIs. In lines 1-2 of Fig. 7, the algorithm makes preparations for the UWFI mining and computes the MaxW necessary for constructing a Global U-WFI-tree. After that, in lines 3-4 of Fig. 7, the ExpSup and Support values of all the items in UDB are calculated through the first database scan. In line 5 of Fig. 7, the algorithm marks every item such that WESover (= ExpSup * MaxW) < MinSup, and in line 6 of Fig. 7, a support descending order is obtained with respect to the remaining valid items. Thereafter, a header table for the global tree is created in line 7 of Fig. 7. In lines 8-15 of Fig. 7, the global tree is constructed through the second database scan, and its procedure is as follows. In lines 9-10 of Fig. 7, the invalid items of each transaction in UDB are deleted and the remaining ones are sorted in the support descending order. In lines 11-12 of Fig. 7, the sorted items are inserted into the global tree; at the same time, appropriate node links are connected to the created nodes. In lines 13-14 of Fig. 7, a tail node is generated for the last item of each transaction, and the corresponding Global U-WFI-code is attached to the tail node. In line 15 of Fig. 7, L is updated with the probability values of the sorted items. After these works are iterated with respect to all the transactions, a complete Global U-WFI-
tree, Tree, is obtained. After that, the algorithm calls a function, U-WFI_Growth, and performs the UWFI mining operations as shown in Fig. 8.

Input: an uncertain database, UDB = {T1, T2, …, Tn}
       a set of weights in UDB, W = {w1, w2, …, wm}
       a given minimum support threshold, MinSup
Output: a set of UWFIs, S
Procedure: U-WFI
01. a global U-WFI-tree, Tree ← null; a current prefix, pref ← null; a TID-list, L ← null;
02. MaxW ← maximum(W = {w1, w2, …, wm}); // used for the global tree construction
03. for each transaction, Ti in UDB // the first database scan
04.     calculate each item's ExpSup and Support in Ti;
05. mark invalid items such that each one's WESover < MinSup;
06. calculate a support descending order of the remaining items;
07. generate Tree.HeaderTable;
08. for each transaction, Ti in UDB // the second database scan
09.     delete invalid items;
10.     sort the remaining items in the support descending order;
11.     insert the sorted items into Tree in sequence;
12.     connect a node link for each inserted item;
13.     generate a tail node with respect to the last item;
14.     attach the corresponding global U-WFI-code to the tail node;
15.     update L with the probability values of the sorted items;
16. call U-WFI_Growth(Tree, L, pref);
17. return S;
Fig. 7. Procedure: U-WFI

Sub-procedure: U-WFI_Growth(Tree, L, pref)
01. for each item, xi in Tree.HeaderTable // a bottom-up order
02.     add xi into pref;
03.     if WES(pref) ≥ MinSup, then
04.         S = S ∪ pref;
05.     if Tree is a global U-WFI-tree, then
06.         calculate MaxW for xi; // used for the conditional tree construction
07.     a conditional U-WFI-tree, Tree' ← null;
08.     for each path with pref, Pk, in Tree // the first tree scan
09.         calculate each item's ExpSup and Support in Pk; // values when each item occurs together with pref
10.     mark invalid items such that each one's WESover < MinSup;
11.     calculate a support descending order of the remaining items; // the current order may differ from the global or other previous orders
12.     generate Tree'.HeaderTable;
13.     for each path with pref, Pk, in Tree // the second tree scan
14.         ignore invalid items;
15.         sort valid items in the support descending order;
16.         insert the sorted items into Tree' in sequence;
17.         connect a node link for each inserted item;
18.         generate a tail node with respect to the last item;
19.         attach the corresponding Conditional U-WFI-code to the tail node; // using the item index information of the global U-WFI-tree and L
20.     call U-WFI_Growth(Tree', L, pref);
21.     delete xi from pref;
22. return S;
Fig. 8. Sub-procedure: U-WFI_Growth

After the U-WFI_Growth function is called, for each item in the current tree's header table, the algorithm adds the item to the prefix and outputs the current prefix as a result if its real WES value is not smaller than MinSup. In lines 5-6 of Fig. 8, the algorithm computes MaxW for the current item, xi, again if the current tree is a global tree. In lines 8-9 of Fig. 8, U-WFI calculates the ExpSup and Support values of the items appearing together with the current prefix through the first tree scan. In lines 10-11 of Fig. 8, the invalid items among them are marked for the subsequent pattern pruning, and a support descending order for the valid items is computed. In line 12 of Fig. 8, a header table for a conditional tree, Tree', is created on the basis of this information. In lines 13-19 of Fig. 8,
22
Tree’ is completed through the second tree scan, which is similar to the contents in lines 8-15 of Fig. 7. The difference between them is that, in the U-WFI-Growth function, Conditional U-WFI-codes are generated using accumulated probability information and L information generated in the global tree construction step. After that, in line 20 of Fig. 8, the growth process of the algorithm is performed in a divide-and-conquer manner for the constructed conditional tree. In line 21 of Fig. 8, if all of the tasks for an item are finished, the item is removed from prefix. After the algorithm processes all of the necessary works, we can obtain a complete set of UWFIs, S. Example 7. Table 6 is the mining results of the proposed algorithm performed according to the sequence shown in Figs. 7 and 8 when the example uncertain database in Table 4 is given and MinSup is set to 2. Here, the Prefix column shows the types of prefixes that can be selected from the global tree in Fig. 2(b). The Conditional transactions column presents conditional data information generated from the corresponding prefixes. The UWFIs column shows all the possible UWFIs extracted through the recursive call manner for each prefix. After finishing all the mining steps for the given example database and threshold, we can obtain a complete set of UWFIs as shown in Table 6. Table 6. Mining result of U-WFI for the example database in Table 4 (MinSup = 2) Prefix
Conditional transactions
UWFIs
B D A E
ø {B} (9 times) {B, D} (7 times), {B} {B, D,A} (7times), {B, A} {B, D, A, E} (6 times), {B, D}, {D} {B, D, A, E, F} (4 times), {B, D, A, E}, {B, D, F}, {B, A, E}, {D, F} {B, D, A, E, F, G} (4 times), {B, D, A, E, G}, {B, D, F, G}
{B} {D}, {D, B} {A}, {A, B}, {A, D} {E}, {E, B}, {E, A}, {E, D} {F}, {F, D}, {F, B}, {F, A}, {F, E}
F G C
{G}, {G, B}, {G, E}, {G, F} {C}, {C, B}, {C, D}, {C, G}, {C, A}, {C, E}, {C, F},
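To make the pruning condition used in line 05 of Fig. 7 and line 10 of Fig. 8 concrete, the following C++ fragment sketches the two core computations: the expected support of an itemset, accumulated as a sum of probability products over the transactions, and the overestimation check WESover = ExpSup * MaxW against MinSup. The container types and function names are illustrative assumptions rather than the authors' implementation, and the WES definition shown (expected support multiplied by the average member weight, a common choice in weighted pattern mining) is likewise an assumption, since the formal definition appears earlier in the paper.

```cpp
// A minimal sketch of the WESover-based pruning used by U-WFI.
// Types, names, and the exact WES definition are our assumptions.
#include <map>
#include <numeric>
#include <string>
#include <vector>

// One uncertain transaction: item -> existential probability.
using UncertainTx = std::map<std::string, double>;

// Expected support of an itemset: the sum, over all transactions
// containing the itemset, of the members' probability product.
double expSup(const std::vector<UncertainTx>& db,
              const std::vector<std::string>& itemset) {
    double sum = 0.0;
    for (const auto& tx : db) {
        double prod = 1.0;
        bool present = true;
        for (const auto& item : itemset) {
            auto it = tx.find(item);
            if (it == tx.end()) { present = false; break; }
            prod *= it->second;
        }
        if (present) sum += prod;
    }
    return sum;
}

// Overestimated weighted expected support check: since MaxW bounds
// every pattern weight from above, WESover >= WES, so pruning by
// WESover never discards a valid UWFI.
bool survivesPruning(double expSupValue, double maxW, double minSup) {
    return expSupValue * maxW >= minSup;
}

// Assumed WES definition for illustration: ExpSup times the average
// weight of the itemset's members.
double wes(double expSupValue, const std::vector<double>& memberWeights) {
    double avgW = std::accumulate(memberWeights.begin(),
                                  memberWeights.end(), 0.0)
                  / memberWeights.size();
    return expSupValue * avgW;
}
```

Because MaxW bounds every pattern weight from above, any itemset failing survivesPruning can be discarded together with its entire growth process without losing a valid UWFI.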
4. Performance evaluation

In this section, we provide the results of the performance evaluation and analysis of the proposed algorithm. For comprehensive, extensive tests, we used various real datasets for the runtime and memory usage experiments, and synthetic datasets with specific characteristics for the scalability experiments.
4.1. Environmental settings

Our algorithm, U-WFI, is compared with two state-of-the-art uncertain pattern mining methods, CUFP-mine [19] and AT-mine [32]. Both CUFP-mine and AT-mine are tree-based algorithms; the main difference between them is that CUFP-mine discovers uncertain frequent patterns without any pattern growth method, while AT-mine finds them in its own pattern growth manner. All of the algorithms were implemented by us in C++ and executed on a PC with a 4GHz CPU, 16GB of RAM, and the Windows 7 OS. Tables 7 and 8 show the information of the datasets used for our performance evaluation.

Table 7. Characteristics of real datasets

| Dataset | Num. of Trans. | Num. of Items | Avg. Trans. Size | Data size |
|---|---|---|---|---|
| Chain-store | 1,112,949 | 46,086 | 7.2 | 45.5MB |
| Connect | 67,557 | 129 | 43 | 8.82MB |
| Kosarak | 990,002 | 41,270 | 8.1 | 31.4MB |
| Mushroom | 8,124 | 120 | 23 | 0.545MB |
| Pumsb | 49,046 | 2,113 | 74 | 15.9MB |
| Retail | 88,162 | 16,470 | 10.306 | 3.97MB |
Table 8. Characteristics of synthetic datasets

| Dataset | Num. of Trans. | Num. of Items | Avg. Trans. Size | Data size |
|---|---|---|---|---|
| T10I4DxK | 100,000 - 1,000,000 | 1,000 (fixed) | 10 (fixed) | 3.83MB - 38.3MB |
| Tx1Lx2Nx3 | 100,000 (fixed) | 10,000 - 40,000 | 10 - 40 | 4.81MB - 21.9MB |
The real datasets in Table 7 are employed to evaluate the runtime and memory usage of the algorithms. Chain-store is a dataset obtained from NU-MineBench 2.0 [22] that includes real weight information; the others, Connect, Kosarak, Mushroom, Pumsb, and Retail, are available at the FIMI Repository (http://fimi.cs.helsinki.fi) and do not contain their own real weight information. The synthetic datasets in Table 8 are used for the scalability evaluation of the algorithms and were created with the data generator [2]. T10I4DxK is a group of datasets featuring a fixed number of attributes and an increasing number of transactions, where x denotes the number of transactions. Meanwhile, Tx1Lx2Nx3 is a dataset group featuring a fixed number of transactions and an increasing number of attributes, where x1, x2, and x3 are the average transaction size, the maximum number of discoverable frequent itemsets, and the number of items, respectively. The probability values of the items within each dataset are assigned randomly between 0 and 1, as in the previous approaches [7, 19, 33].
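For illustration, the following C++ sketch mirrors this experimental setup: each item occurrence receives a random existential probability, and item weights are min-max normalized into a dataset-specific range such as 0.3-0.6. The function names and the exact normalization scheme are our assumptions; the paper does not publish the generator code.

```cpp
// A small sketch of the assumed experimental setup: random existential
// probabilities in [0, 1) per item occurrence, and weights normalized
// into a target range [lo, hi]. Names are illustrative only.
#include <algorithm>
#include <random>
#include <vector>

// Draw an existential probability for one item occurrence.
double randomProbability(std::mt19937& gen) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(gen);
}

// Min-max normalize raw weights (real or random) into [lo, hi],
// e.g., lo = 0.3 and hi = 0.6 for the Chain-store experiments.
std::vector<double> normalizeWeights(const std::vector<double>& raw,
                                     double lo, double hi) {
    double mn = raw[0], mx = raw[0];
    for (double w : raw) { mn = std::min(mn, w); mx = std::max(mx, w); }
    std::vector<double> out;
    out.reserve(raw.size());
    for (double w : raw) {
        double scaled = (mx > mn) ? (w - mn) / (mx - mn) : 0.5;
        out.push_back(lo + scaled * (hi - lo));
    }
    return out;
}
```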
Fig. 9. Runtime results (Connect)
Fig. 10. Runtime results (Chain-store)
4.2. Runtime experiment

In this section, we evaluate and analyze the runtime of the algorithms on the real datasets in Table 7: Connect, Chain-store, Kosarak, Mushroom, Pumsb, and Retail. The corresponding results are shown in Figs. 9-14, where we randomly set the weight values of the items in each dataset by normalizing them within various weight ranges. The proposed algorithm guarantees the best performance in every case. Note that CUFP-mine does not operate normally for any of the given threshold settings on the Connect, Mushroom, and Pumsb datasets because of its heavy memory consumption (memory overflow), as shown in Figs. 9, 12, and 13. Since CUFP-mine stores all of the subset combinations in its tree structure, its efficiency is the worst among the compared algorithms. Meanwhile, the proposed algorithm extracts the intended pattern results without any problem in all cases. In Fig. 9, the weight range of Connect has been set to 0.3-0.6. In the figure, the runtime of both AT-mine and U-WFI gradually increases, but that of AT-mine grows faster in every case; especially when the threshold is lower, the gap between them becomes much larger. For example, the runtime results of AT-mine and U-WFI are 85.289 and 7.409 seconds when the threshold is 15%, and 1698.704 and 447.356 seconds when it is 5%, respectively. Fig. 10 shows the results for the Chain-store dataset with real weight information, where its weight range has been set to 0.3-0.6; in particular, the weight values of the dataset's items have been normalized on the basis of the real weight information. Notice that, since Chain-store is very sparse, the runtime results of the algorithms are relatively fast compared to those on dense datasets. In the figure, CUFP-mine fails to mine uncertain frequent patterns when the threshold is 0.3% or less because of its memory overflow problem; the reason for this bad performance is described in detail in the next section. In Fig. 10, our algorithm has the best results in all cases because it can selectively mine a smaller number of more important patterns by considering both the uncertainty and the weight of patterns in the mining process.
Fig. 11. Runtime results (Kosarak)
Fig. 12. Runtime results (Mushroom)
The results for the Kosarak dataset in Fig. 11 show a tendency similar to that of Chain-store, where CUFP-mine also fails to operate normally when the threshold is lower than 1.5%. The weight range of this dataset is 0.1-0.3. In the figure, the runtime gap between AT-mine and U-WFI is not large at higher threshold settings, but it gradually grows as the threshold decreases. For example, AT-mine and U-WFI consume 12.556 and 10.909 seconds at the threshold of 1.9%, which are somewhat similar to each other, but they spend 312.2 and 65.609 seconds at 0.01%, a significant gap. In Fig. 12, the weight range of Mushroom has been set to 0.1-0.2. As shown in the figure, the proposed algorithm again outperforms the competitors. Note that Mushroom is a small dataset, and therefore the runtime results of the algorithms are relatively fast despite its dense feature.
Fig. 13. Runtime results (Pumsb)
Fig. 14. Runtime results (Retail)
Fig. 13 shows the runtime results of the algorithms for the Pumsb dataset, where the weight range is 0.2-0.4. From these results, we can see that the effect of the weight factor is more significant than on the other datasets. For example, when the threshold is 10%, the runtime results of AT-mine and ours are 137.993 and 7.321 seconds, respectively, but when it is 5%, they are 2580.521 and 119.895 seconds, respectively. Fig. 14 shows the results for Retail, which has the weight range of 0.1-0.3. The runtime performance of CUFP-mine is much worse than that of the others, as shown in the figure, and its runtime efficiency decreases sharply as the threshold becomes lower. Overall, the proposed algorithm guarantees the best runtime result in every case, although the performance gaps among the algorithms differ according to the dataset. The main reason for this result is the pattern pruning effect obtained by additionally considering the weight factors. At the beginning of the mining process, 1-length uncertain itemsets with low weights and expected supports can be deleted in advance by the proposed algorithm, which reduces the search space and improves the mining performance. In addition, by pruning not only the invalid uncertain patterns but also the growth processes for them during the mining operations, which are based on recursive calls in a divide-and-conquer manner, the proposed algorithm can conduct the mining tasks faster than the others. As a result, we can obtain uncertain pattern results with higher importance more quickly.
Fig. 15. Memory usage results (Connect)
Fig. 16. Memory usage results (Chain-store)
4.3. Memory usage experiment

Figs. 15 to 20 show the memory usage of the algorithms under the same experimental settings as in the runtime tests. In Fig. 15, no result of CUFP-mine appears in the graph because of its memory overflow: although the algorithm works well until the threshold reaches 46%, it fails to mine patterns at lower threshold settings. Since Chain-store is a sparse dataset, CUFP-mine has no problem at relatively low threshold values; meanwhile, at much lower threshold settings, it shows a limitation because of its heavy memory consumption. On the other hand, our U-WFI and AT-mine operate normally in all cases. In particular, the proposed algorithm guarantees the best memory efficiency in every case, and U-WFI also shows the best memory performance in the other figures. In Fig. 16, the memory usage of CUFP-mine increases sharply as the threshold becomes lower, and the algorithm eventually fails to extract patterns because of the memory overflow. When the threshold is 0.5%, it consumes 290.059MB of memory, but when the threshold is decreased to 0.4%, its memory usage becomes 1088.605MB. The reason why CUFP-mine has the worst performance in memory usage as well as runtime is as follows. Recall that the Apriori-based uncertain mining approach generates k-length candidates from (k-1)-length patterns when mining k-length patterns and, iterating this process, discards candidate information after use. A tree-based method using a pattern growth manner does not need candidate information to mine uncertain frequent patterns. However, CUFP-mine has to keep the information of all candidate patterns of length k or less; in the end, the algorithm stores the information of all of the candidate uncertain patterns in its own tree, the CUFP-tree, regardless of their lengths. Hence, the algorithm consumes enormous runtime and memory resources performing this work. The performance degradation of CUFP-mine is intensified as the transactions in a given dataset become longer or the given threshold becomes lower.
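To convey the scale of this difference, the toy C++ program below contrasts the 2^k - 1 nonempty subsets that a k-item transaction can contribute to a subset-storing structure with the single k-node path inserted by a pattern-growth tree. This is only an illustration of the combinatorial growth, not a model of the actual CUFP-tree layout.

```cpp
// Illustration of candidate blowup: storing every nonempty subset of a
// k-item transaction needs 2^k - 1 entries, while a pattern-growth
// tree stores the transaction as one k-node path (often shared).
#include <cstdint>
#include <cstdio>

int main() {
    for (int k = 10; k <= 40; k += 10) {
        std::uint64_t subsets = (1ULL << k) - 1;  // 2^k - 1, k < 64
        std::printf("k=%2d: subset entries = %20llu, tree path nodes = %d\n",
                    k, static_cast<unsigned long long>(subsets), k);
    }
    return 0;  // at k = 40 the subset count already exceeds 10^12
}
```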
Fig. 17. Memory usage results (Kosarak)
Fig. 18. Memory usage results (Mushroom)
Fig. 17 shows the memory results for Kosarak. As in the case of Chain-store, the proposed algorithm guarantees the most efficient memory performance, while CUFP-mine has the worst results. Although CUFP-mine uses the smallest memory when the threshold is 1.9%, its memory consumption increases sharply, as shown in the figure. AT-mine performs better than CUFP-mine, but it falls behind ours in every case. In Fig. 18, since Mushroom is a very small dataset, the necessary memory usage is also not large for AT-mine and our U-WFI. However, CUFP-mine still requires huge memory space to mine uncertain frequent patterns; for example, the algorithm does not operate normally when the threshold is less than 20%. In Fig. 19, the three algorithms show a tendency similar to that of Fig. 18, although their absolute memory consumption is different. In this case, CUFP-mine fails to extract pattern results when the threshold is less than 44%. The proposed method also provides the best results on Retail, as shown in Fig. 20. CUFP-mine consumes much more memory than the others; for example, 340.609MB and 890.309MB when the threshold is 0.5% and 0.4%, respectively. Recall that a significant number of patterns can be extracted from dense datasets compared to sparse ones even when the threshold is relatively high. This also signifies that CUFP-mine has to store the information of numerous candidate patterns in its own tree structure. For this reason, it consumes much more memory mining patterns in almost all cases. We can determine from these experimental results that the memory efficiency of the proposed algorithm is the most outstanding in most cases.
Fig. 19. Memory usage results (Pumsb)
Fig. 20. Memory usage results (Retail)
As mentioned above, we can remove itemsets with low weights and expected supports in advance by applying the weight factors to our uncertain mining process, and we can also omit the growth operations corresponding to such patterns because of the anti-monotone property. These pre-pruning steps make a significant contribution to reducing the search space. Consequently, the proposed algorithm has the best performance in general, as shown in the figures of the memory performance evaluation.
Fig. 21. Results of runtime and memory scalability (T10I4DxK)
Fig. 22. Results of runtime and memory scalability (Tx1Lx2Nx3)
4.4. Scalability experiment

The next tests evaluate the runtime and memory scalability of the algorithms on the datasets in Table 8, where the thresholds for the two dataset groups have been set to 0.01% and 0.001%, and their weight ranges are 0.3-0.5 and 0.5-0.8, respectively. Notice that we cannot compare the performance results of CUFP-mine because it causes memory overflow under the environmental settings of the scalability tests; however, it is obvious that the scalability of the algorithm falls behind the others because of the disadvantages mentioned in the previous sections. For the dataset group with fixed attributes and gradually increasing transactions, T10I4DxK (100 ≤ x ≤ 1000), the proposed algorithm guarantees the most outstanding runtime and memory scalability, as shown in Fig. 21. Similarly, our algorithm also has the best results for the dataset group with fixed transactions and gradually increasing attributes, Tx1Lx2Nx3 (10 ≤ x1 ≤ 40, 1000 ≤ x2 ≤ 4000, and 10000 ≤ x3 ≤ 40000), as shown in Fig. 22. Since both AT-mine and our U-WFI are tree-based approaches employing their own pattern growth methods, they inevitably use more runtime and memory resources as the number of transactions or attributes increases. However, the growth of runtime and memory in our method is smaller than that of AT-mine in all cases because the U-WFI algorithm can reduce the number of mining operations and the necessary memory space by considering the weight factor of items.
4.5. Significance test

In this section, we perform significance tests [13, 18] with our U-WFI algorithm and the state-of-the-art method, AT-Mine, in order to show that the proposed method can extract a smaller number of more meaningful pattern results. Notice that CUFP-Mine and AT-Mine provide the same mining results in every case because they are exact mining methods based on the traditional uncertain mining framework; moreover, CUFP-Mine does not operate normally at relatively low threshold settings, as shown in the above figures. For this reason, we compare the proposed algorithm only with AT-Mine. With the real datasets used in the above performance evaluation tests, we test whether statistically significant differences in the generated pattern results are caused by the weight conditions and the corresponding techniques employed in the proposed algorithm. In other words, we verify that U-WFI has a statistically significant effect on reducing the number of mined patterns by pruning the ones with less importance. The significance tests are conducted as follows: 1) establish a null hypothesis stating that the proposed method has no effect in terms of pattern reduction, 2) generate 30 sampled datasets for each real dataset, and 3) reject the established null hypothesis by verifying, through the mining results on the sampled datasets created in Step 2, that our approach is statistically significant. The detailed procedure is as follows. Let UDB be a given uncertain dataset. From UDB, we generate 30 sampled datasets, {UDB1, UDB2, …, UDB30}, on the basis of Swap Randomization [10], one of the well-known data sampling techniques. After that, we run the two algorithms, U-WFI and AT-Mine, on each sampled dataset, UDBk (1 ≤ k ≤ 30), and denote the results of each algorithm as {R1, R2, …, R30} and {R'1, R'2, …, R'30}, respectively. Note that R and R' are independent results because they are generated by each algorithm without any mutual influence. Therefore, we execute the z-test for two independent samples, where the significance level, α, is given as 0.05, and we assume that the sampled results follow a normal distribution because of the Central Limit Theorem. Through the mean and variance values of the mining results obtained by each algorithm, we can calculate the z-score and finally obtain the one-tailed p-value. If this p-value is smaller than or equal to the given α, we can reject the null hypothesis, which means that the pattern results of the proposed algorithm have statistically significant differences from those of the competitor. Let H0 be the null hypothesis stating that there is no difference between our algorithm and the competitor, and let H1 be the alternative hypothesis that rejects H0. Then, the p-value signifies the probability of observing results at least as extreme as the obtained ones under H0. Therefore, if the p-value is lower than or equal to the significance level, α, we can determine that H0 is rejected under the current significance level; that is, H1 is true in this case. In order to compute the p-value, the corresponding z-score should be calculated in advance. The z-score is obtained by the following equation:
$$ z = \frac{\bar{X}_{AT} - \bar{X}_{U}}{SE}, \qquad SE = \sqrt{\frac{s_{U}^{2}}{n_{U}} + \frac{s_{AT}^{2}}{n_{AT}}} \qquad (6) $$

In Equation (6), SE is the standard error; $s_{U}^{2}$, $s_{AT}^{2}$, $n_{U}$, and $n_{AT}$ are the variance values and the numbers of mining runs (30 each) on the sampled datasets for U-WFI and AT-Mine, respectively; and $\bar{X}_{U}$ and $\bar{X}_{AT}$ are the mean numbers of mined patterns.

Table 9. Results of significance tests on the real datasets

| Dataset (MinSup) | Algorithm | $s^2$ | $\bar{X}$ | SE | z-score | p-value |
|---|---|---|---|---|---|---|
| Connect (11%) | U-WFI | 1540.478 | 61.2667 | 7.214 | 288.6484 | 0.0000 |
|  | AT-Mine | 20.861 | 2143.6333 |  |  |  |
| Chain-store (0.01%) | U-WFI | 6.852 | 2255.9 | 2.192 | 3039.785 | 0.0000 |
|  | AT-Mine | 137.357 | 8920.5667 |  |  |  |
| Kosarak (0.1%) | U-WFI | 2.317 | 80.4 | 0.838 | 1448.3969 | 0.0000 |
|  | AT-Mine | 18.731 | 1293.6 |  |  |  |
| Mushroom (1%) | U-WFI | 33.926 | 316.9333 | 5.458 | 1684.0927 | 0.0000 |
|  | AT-Mine | 859.661 | 9508.1666 |  |  |  |
| Pumsb (8%) | U-WFI | 0.372 | 90.2 | 2.009 | 3412.6586 | 0.0000 |
|  | AT-Mine | 120.654 | 6944.6333 |  |  |  |
| Retail (0.01%) | U-WFI | 225.082 | 3718.5667 | 10.965 | 1636.9078 | 0.0000 |
|  | AT-Mine | 3381.857 | 21667.267 |  |  |  |
Table 9 shows the results of the significance tests on the various real datasets. Notice that the parameter settings of this experiment are the same as those of our previous tests. As shown in the table, every p-value is 0. Recall that the significance level, α, has been set to 0.05; α also signifies the maximum p-value that can reject H0. In the tests, the p-value results for all the datasets are 0 and therefore lower than α, so we can reject the established null hypothesis, H0, in all cases. This means that AT-Mine generates a significantly larger number of patterns than our U-WFI. For this reason, we can conclude that the proposed method mines a smaller number of more meaningful patterns by applying the weight conditions to the uncertain itemset mining process.
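For reference, the short C++ program below reproduces the Connect row of Table 9 from Equation (6): it computes SE and the z-score from the reported sample variances and means and derives the one-tailed p-value via the standard normal CDF. The variable names are ours; only the input numbers come from Table 9.

```cpp
// Sketch of the z-test in Equation (6) on the Connect row of Table 9.
#include <cmath>
#include <cstdio>

int main() {
    // Connect (MinSup = 11%): sample variances, sizes, and means.
    double s2_uwfi = 1540.478, s2_at = 20.861;
    double n_uwfi = 30.0, n_at = 30.0;          // 30 sampled datasets each
    double mean_uwfi = 61.2667, mean_at = 2143.6333;

    double se = std::sqrt(s2_uwfi / n_uwfi + s2_at / n_at);
    double z  = (mean_at - mean_uwfi) / se;
    // One-tailed p-value: P(Z > z) = 1 - Phi(z) = 0.5 * erfc(z / sqrt(2)).
    double p  = 0.5 * std::erfc(z / std::sqrt(2.0));

    std::printf("SE = %.3f, z = %.4f, p = %.4f\n", se, z, p);
    // Prints SE ~= 7.214 and z ~= 288.65, matching Table 9; p underflows
    // to 0, so H0 is rejected at alpha = 0.05.
    return 0;
}
```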
5. Discussion

The main goals of the proposed algorithm are as follows: 1) efficient mining operations requiring less runtime and memory resources and 2) extracting a smaller number of more valuable uncertain patterns. These two goals have been achieved by the proposed data structures and techniques, and the results of our extensive experiments support our contributions. Meanwhile, the U-WFI algorithm has the following limitation. Recall that the proposed algorithm employs its own tree and list data structures to mine UWFIs, and these data structures are loaded into main memory during the mining process. However, if the size of a given uncertain database becomes extremely large, it may be impossible to load the corresponding data into main memory at once. To solve this issue, we can consider two types of solutions. The first is to devise brand-new data structures and mining techniques instead of the previous tree structures and tree-based pattern growth approaches. We additionally use an ancillary list structure to maximize the node sharing effect of our tree structure, which stores uncertain data without any information loss. Hence, if we can devise new types of efficient data structures and techniques for extracting UFIs or UWFIs without any pattern loss, we can remarkably reduce the memory resources necessary for our mining operations. The second is to incorporate a distributed processing system into the mining process; that is, after dividing large-scale uncertain data into smaller volumes that can be loaded into the main memory of general computers, we can make the proposed algorithm conduct the mining process on each portion of the data and then merge the partial results (a counting sketch of this idea appears at the end of this section). Through these two solutions, we can overcome the limitation of the proposed method, and we plan to develop them in future work. Meanwhile, the proposed algorithm is a mining method focusing on static uncertain databases. However, uncertain data obtained from the real world can have not only such static characteristics but also the features of dynamic data streams. For this reason, we plan to study how to apply the proposed data structures and techniques to various data stream models, such as the damped window, landmark window, and sliding window models. Moreover, in order to cope with the enormous number of pattern results generated as the data stream grows, the concept of a representative pattern mining framework, such as closed or maximal pattern mining, can be applied to our UWFI mining process.
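The counting sketch referenced above: because an itemset's expected support is a sum of per-transaction probability products, per-partition totals can be merged by simple addition. The C++ fragment below illustrates this for 1-length itemsets under our own type and function names; a complete distributed UWFI miner would additionally have to coordinate candidate generation and weight information across partitions.

```cpp
// A hedged sketch of the partition-and-merge idea for expected
// supports; only a counting skeleton, not a full distributed miner.
#include <map>
#include <string>
#include <vector>

using UncertainTx = std::map<std::string, double>;
using ExpSupTable = std::map<std::string, double>;  // item -> ExpSup

// Local pass: accumulate each single item's expected support in one
// partition that fits in main memory.
ExpSupTable minePartition(const std::vector<UncertainTx>& part) {
    ExpSupTable local;
    for (const auto& tx : part)
        for (const auto& [item, prob] : tx)
            local[item] += prob;
    return local;
}

// Merge step: expected supports are additive across partitions, so the
// global table is the element-wise sum of the partial tables.
ExpSupTable merge(const std::vector<ExpSupTable>& partials) {
    ExpSupTable global;
    for (const auto& t : partials)
        for (const auto& [item, es] : t)
            global[item] += es;
    return global;
}
```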
6. Conclusions

In this paper, we proposed an uncertain itemset mining algorithm that finds meaningful pattern information from uncertain databases with existential probabilities of items by considering the items' own importance. Moreover, we devised a tree structure for maximizing the node sharing effect and mining techniques using this tree. In addition, by suggesting and employing an overestimation method that solves the problems caused by applying weight factors to uncertain itemset mining, we guaranteed the correctness of the proposed algorithm. The comprehensive empirical results of the performance evaluation provided in this paper showed that our approach outperforms the previous state-of-the-art uncertain mining methods in terms of runtime, memory usage, scalability, and statistical significance. The concepts and techniques in this paper are worthy of further research in that they can be effectively integrated with other advanced mining areas, such as stream pattern mining and representative pattern mining. For this reason, we plan to perform such studies in our future work.
Acknowledgements

This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF No. 2013005682), the Business for Cooperative R&D between Industry, Academy, and Research Institute funded by the Korea Small and Medium Business Administration in 2015 (Grant No. C0232102), and the Business for Academic-Industrial Cooperative Establishments funded by the Korea Small and Medium Business Administration in 2015 (Grant No. C0261068).
References

[1] C.C. Aggarwal, Y. Li, J. Wang, and J. Wang, "Frequent pattern mining with uncertain data", 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 29-37, 2009.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", 20th International Conference on Very Large Data Bases, pp. 487-499, 1994.
[3] J. Cai, X. Zhao, and Y. Xun, "Association rule mining method based on weighted frequent pattern tree in mobile computing environment", International Journal of Wireless and Mobile Computing, vol. 6, no. 2, pp. 193-199, 2013.
[4] L. Chang, T. Wang, D. Yang, H. Luan, and S. Tang, "Efficient algorithms for incremental maintenance of closed sequential patterns in large databases", Data & Knowledge Engineering, vol. 68, pp. 68-106, 2009.
[5] J. Chen and P. Chen, "Sequential Pattern Mining for Uncertain Data Streams using Sequential Sketch", Journal of Networks, vol. 9, no. 2, pp. 252-258, 2014.
[6] R. Cheng, D.V. Kalashnikov, and S. Prabhakar, "Querying Imprecise Data in Moving Object Environments", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1112-1127, 2004.
[7] C. Chui, B. Kao, and E. Hung, "Mining Frequent Itemsets from Uncertain Data", 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pp. 47-58, 2007.
[8] A. Cuzzocrea, C. Leung, and R. MacKinnon, "Mining constrained frequent itemsets from distributed uncertain data", Future Generation Computer Systems, vol. 37, pp. 117-126, 2014.
[9] G. Fang, Z. Deng, and H. Ma, "Network Traffic Monitoring Based on Mining Frequent Patterns", Fuzzy Systems and Knowledge Discovery, vol. 7, pp. 571-575, 2009.
[10] A. Gionis, H. Mannila, T. Mielikäinen, and P. Tsaparas, "Assessing data mining results via swap randomization", ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 3, 2007.
[11] G. Grahne and J. Zhu, "Fast Algorithms for Frequent Itemset Mining Using FP-Trees", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 10, pp. 1347-1362, 2005.
[12] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: a frequent-pattern tree approach", Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.
[13] W. Hämäläinen and M. Nykänen, "Efficient Discovery of Statistically Significant Association Rules", IEEE International Conference on Data Mining (ICDM), pp. 203-212, 2008.
[14] G. Lee, U. Yun, and K.H. Ryu, "Sliding Window based Weighted Maximal Frequent Pattern Mining over Data Streams", Expert Systems with Applications, vol. 41, no. 2, pp. 694-708, 2014.
[15] G. Lee, U. Yun, and H. Ryang, "Mining Weighted Erasable Patterns by using Underestimated Constraint-based Pruning Technique", Journal of Intelligent and Fuzzy Systems, vol. 28, no. 3, pp. 1145-1157, 2015.
[16] C.K. Leung, M.A.F. Mateo, and D.A. Brajczuk, "A tree-based approach for frequent pattern mining from uncertain data", 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 653-661, 2008.
[17] C.K. Leung, C.L. Carmichael, and B. Hao, "Efficient mining of frequent patterns from uncertain data", International Conference on Data Mining Workshops, pp. 489-494, 2007.
[18] J. Lijffijt, P. Papapetrou, and K. Puolamäki, "A statistical significance testing approach to mining the most informative set of patterns", Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 238-263, 2014.
[19] C. Lin and T. Hong, "A new mining approach for uncertain databases using CUFP trees", Expert Systems with Applications, vol. 39, no. 4, pp. 4084-4093, 2012.
[20] Y. Liu, "Mining maximal frequent patterns from univariate uncertain data", Intelligent Data Analysis, vol. 18, no. 4, pp. 653-676, 2014.
[21] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Fast and space-preserving frequent pattern mining in large databases", IIE Transactions (Institute of Industrial Engineers), vol. 39, no. 6, pp. 593-605, 2007.
[22] J. Pisharath, Y. Liu, B. Ozisikyilmaz, R. Narayanan, W.K. Liao, A. Choudhary, and G. Memik, "NU-MineBench version 2.0 dataset and technical report", http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html.
[23] G. Pyun, U. Yun, and K.H. Ryu, "Efficient frequent pattern mining based on Linear Prefix Tree", Knowledge-Based Systems, vol. 55, pp. 125-139, 2014.
[24] G. Pyun and U. Yun, "Mining top-k frequent patterns with combination reducing techniques", Applied Intelligence, vol. 41, no. 1, pp. 76-98, 2014.
[25] H. Ryang and U. Yun, "Top-k high utility pattern mining with effective threshold raising strategies", Knowledge-Based Systems, vol. 76, pp. 109-126, 2015.
[26] H. Ryang, U. Yun, and K. Ryu, "Discovering high utility itemsets with multiple minimum supports", Intelligent Data Analysis, vol. 18, no. 6, pp. 1027-1047, 2014.
[27] H. Ryang and U. Yun, "Fast algorithm for high utility pattern mining with the sum of item quantities", Intelligent Data Analysis, 2014, in press.
[28] A. Sallaberry, N. Pecheur, S. Bringay, M. Roche, and M. Teisseire, "Sequential patterns mining and gene sequence visualization to discover novelty from microarray data", Journal of Biomedical Informatics, vol. 44, pp. 760-774, 2011.
[29] M.Y. Su, G.J. Yu, and C.Y. Lin, "A real-time network intrusion detection system for large-scale attacks based on an incremental mining approach", Computers & Security, vol. 28, no. 5, pp. 301-309, 2009.
[30] X. Sun, L. Lim, and S. Wang, "An approximation algorithm of mining frequent itemsets from uncertain dataset", International Journal of Advancements in Computing Technology, vol. 4, no. 3, pp. 42-49, 2012.
[31] L. Wang, L. Feng, and M. Wu, "UDS-FIM: An Efficient Algorithm of Frequent Itemsets Mining over Uncertain Transaction Data Streams", Journal of Software, vol. 9, no. 1, pp. 44-56, 2014.
[32] L. Wang, L. Feng, and M. Wu, "AT-Mine: An Efficient Algorithm of Frequent Itemset Mining on Uncertain Dataset", Journal of Computers, vol. 8, no. 6, pp. 1417-1426, 2013.
[33] L. Wang, D.W. Cheung, R. Cheng, S. Lee, and X. Yang, "Efficient Mining of Frequent Itemsets on Large Uncertain Databases", IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 12, pp. 2170-2183, 2012.
[34] L. Wang, K. Hu, T. Ku, and X. Yan, "Mining frequent trajectory pattern based on vague space partition", Knowledge-Based Systems, vol. 50, pp. 100-111, 2013.
[35] M. Yiu, N. Mamoulis, X. Dai, Y. Tao, and M. Vaitis, "Efficient Evaluation of Probabilistic Advanced Spatial Queries on Existentially Uncertain Data", IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 1, pp. 108-122, 2009.
[36] U. Yun, G. Pyun, and E. Yoon, "Efficient Mining of Robust Closed Weighted Sequential Patterns Without Information Loss", International Journal on Artificial Intelligence Tools, vol. 24, no. 1, 2015.
[37] U. Yun and J. Kim, "A fast perturbation algorithm using tree structure for privacy preserving utility mining", Expert Systems with Applications, vol. 42, no. 3, pp. 1149-1165, 2015.
[38] U. Yun, H. Ryang, and K. Ryu, "High Utility Itemset Mining with Techniques for Reducing Overestimated Utilities and Pruning Candidates", Expert Systems with Applications, vol. 41, no. 8, pp. 3861-3878, 2014.
[39] U. Yun and H. Ryang, "Incremental High Utility Pattern Mining with Static and Dynamic Databases", Applied Intelligence, vol. 42, no. 2, pp. 323-352, 2015.
[40] U. Yun, G. Lee, and K.H. Ryu, "Mining maximal frequent patterns by considering weight conditions over data streams", Knowledge-Based Systems, vol. 55, pp. 49-65, 2014.
[41] U. Yun and K.H. Ryu, "Efficient Mining of Maximal Correlated Weight Frequent Patterns", Intelligent Data Analysis, vol. 17, no. 5, 2013.
[42] U. Yun, "On pushing weight constraints deeply into frequent itemset mining", Intelligent Data Analysis, vol. 13, no. 2, pp. 359-383, 2009.
[43] J. Zhang, Y. Wang, and D. Yang, "CCSpan: Mining closed contiguous sequential patterns", Knowledge-Based Systems, in press, 2015.
[44] Y. Zhang, R. Cheng, and J. Chen, "Evaluating Continuous Probabilistic Queries Over Imprecise Sensor Data", 15th International Conference on Database Systems for Advanced Applications, pp. 535-549, 2010.
[45] W. Zhao, R. Chellappa, P.J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey", ACM Computing Surveys, vol. 35, pp. 399-458, 2003.
[46] Z. Zhao, D. Yan, and W. Ng, "Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases", IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 5, pp. 1171-1184, 2014.