Engineering Applications of Artificial Intelligence 65 (2017) 119–136
Efficient frequent itemsets mining through sampling and information granulation

Zhongjie Zhang a, Witold Pedrycz b, Jian Huang a,*

a College of Mechatronics Engineering and Automation, National University of Defense Technology, Changsha, 410073, China
b Department of Electrical and Computer Engineering, University of Alberta, Edmonton, T6R2G7, AB, Canada

Keywords: Frequent itemsets mining; Sampling; Information granulation
Abstract: In this study, we propose an algorithm that produces high-quality approximate frequent itemsets from datasets with a large number of transactions. With high probability, the results contain all frequent itemsets, no itemset whose support is much lower than the minimum support is included, and the supports reported by the algorithm are close to their real values. To avoid an over-estimated sample size and a significant computing overhead, the task of reducing the data is decomposed into three subproblems, which are solved one by one through sampling and information granulation. First, the algorithm obtains a rough support for every item by sampling and removes the infrequent items, which simplifies the data. Then, another sample is taken from the simplified data and clustered into information granules. After this data reduction, the granules are mined by an improved Apriori. A tight guarantee for the quality of the final results is provided, and the performance of the approach is quantified through a series of experiments.
1. Introduction

The key objective of frequent itemsets (FIs) mining is to find itemsets whose supports are higher than a given minimum support threshold (Agrawal et al., 1993). FIs are fundamental to association rule (AR) mining and can be applied in many areas, and many studies have been devoted to this topic. Agrawal et al. proposed the Apriori algorithm (Agrawal et al., 1994), which mines itemsets of different lengths in different loops and has many improved versions. To avoid the repeated scanning of Apriori-based algorithms, some algorithms change the form of the dataset before mining; there are two main techniques, tree based methods and bitmap based methods. Tree based algorithms, whose representative is FP-growth (Han et al., 2004), transform the dataset into a tree structure and scan the tree rather than the original data (Sucahyo and Gopalan, 2004; Deng et al., 2012; Pyun et al., 2014; Alavi and Hashemi, 2015; Deng and Lv, 2015; Deng, 2016; Vo et al., 2016). Bitmap based algorithms, like BitTableFI (Dong and Han, 2007), transform the data into a binary matrix or binary numbers, and use logical or matrix operations instead of scanning the dataset (Song et al., 2008; Vo et al., 2012; Mohamed et al., 2011). Moreover, some algorithms use heuristic methods to obtain ARs directly without mining FIs (Yan et al., 2005, 2009; Sarath and Ravi, 2013).
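For readers unfamiliar with the level-wise search mentioned above, the following is a minimal Apriori-style sketch in Python (an illustration only, not the authors' code or the reference implementations cited above; the dataset, function names and the simple join step are assumptions):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: length-k candidates are built
    from frequent (k-1)-itemsets and their supports are checked on the data."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {fs: support(fs) for fs in current}

    k = 2
    while current:
        # Join step: candidate k-itemsets from pairs of frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "e"}]
    print(apriori(data, min_support=0.5))
```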
However, for the tree based and bitmap based algorithms, building and operating on the tree or matrix require a large number of memory allocations and de-allocations, and for the heuristic algorithms, evaluating particles or chromosomes requires scanning the whole dataset. When the data size is huge, their speed cannot be ensured. Therefore, many algorithms based on sampling and information granulation (Bargiela and Pedrycz, 2012; Hu et al., 2015) have been proposed (Toivonen et al., 1996; Parthasarathy, 2002; Scheffer and Wrobel, 2002; Chen et al., 2002; Zhang et al., 2003; Brönnimann et al., 2003; Li and Gopalan, 2004; Jia and Lu, 2005; Jia and Gao, 2005; Chuang et al., 2005; Hwang and Kim, 2006; Hu and Yu, 2006; Zhao et al., 2006; Chuang et al., 2008; Chakaravarthy et al., 2009; Mahafzah et al., 2009; Pietracaprina et al., 2010; Chandra and Bhaskar, 2011; Chen et al., 2011; Riondato and Upfal, 2014, 2015; Zhang et al., 2015), which reduce the data by taking a sample of the dataset or by compressing the dataset into information granules. Their difficulty is the tradeoff between runtime and the quality of the results: most methods either cost too much time or give no tight guarantee for the quality of the results. In this study, a new algorithm called SG (Sampling and Granulation) is proposed, which achieves a good tradeoff between runtime and the quality of the final results when dealing with datasets with large numbers of transactions. SG not only runs fast but also strictly keeps the errors within
a given range with high probability. In contrast to other algorithms based on sampling and information granulation, the speed and accuracy of SG come from the following innovations:
(a) To avoid an over-estimated sample size, the task of reducing the data is split into three subproblems: reducing the number of items, reducing the number of transactions, and performing information granulation.
(b) To control the error, the normal approximation to the binomial distribution and a probability inequality called the Union bound are combined to study the deviation of supports.
(c) To speed up the subsequent mining, the bitmap technique is used to obtain the FIs quickly.

The paper is organized as follows. Section 2 reviews studies on FIs mining through sampling and information granulation, Section 3 covers the definitions and background used in the paper, Section 4 describes SG in detail, Section 5 reports and discusses the experimental results, and Section 6 contains the conclusions.

2. Related studies

Sampling and information granulation based FI mining algorithms can be classified into three categories: algorithms without any guarantee, algorithms with a loose guarantee, and algorithms with a tight guarantee.

Algorithms without any guarantee run a high risk of obtaining wrong frequent itemsets. Parthasarathy (2002) keeps changing the sample size until a criterion is satisfied. Similar works are done by Chen et al. (2002), Brönnimann et al. (2003) and Chuang et al. (2005). Hwang et al. improve Chen's work (Hwang and Kim, 2006), and Hu et al. combine weighted sampling with progressive sampling (Hu and Yu, 2006). Chuang et al. also design a progressive sampling algorithm based on the probability distribution of the itemsets' supports (Chuang et al., 2008). Chandra et al. propose an algorithm that samples the dataset based on the supports of items, where the Hub-Averaging technique is applied (Chandra and Bhaskar, 2011). Chen et al. use locality sensitive hashing to cluster the initial sample of the dataset and to remove the outliers of the sample (Chen et al., 2011). Zhang et al. propose an algorithm that compresses transactions into information granules (Zhang et al., 2015).

Algorithms offering a loose guarantee ensure that the deviation of a single support is limited to a given range with high probability, which is not good enough for datasets with a large number of items. Toivonen builds an algorithm by building candidates and checking the candidates (Toivonen et al., 1996), where the sample size is set through the Chernoff bound. Zhang et al. (2003) set the sample bound based on the central limit theorem, and a similar study has been reported by Li (Li and Gopalan, 2004). Jia et al. show that even when the sample size is low, the accuracy of the results can be increased by integrating the results of many individual samples (Jia and Lu, 2005). They also use a progressive sampling method for this job (Jia and Gao, 2005), which relies on the Hoeffding bound to ensure the accuracy of the estimated error of the result. Furthermore, Zhao et al. use hybrid bounds to control the sample size (Zhao et al., 2006).

Algorithms with a tight guarantee can extract all FIs with high probability, and no itemset with support much lower than the threshold is extracted. However, they either over-estimate the sample size or cost too much time and memory. Tobias Scheffer et al. propose an algorithm called GSS (Scheffer and Wrobel, 2002), which mines the approximate k most frequent itemsets by progressive sampling; itemsets performing poorly are removed one by one as the scale of the sample increases. However, the algorithm has to store all the possible itemsets first, which is unfeasible when the number of items is large. Chakaravarthy et al. analyze the smallest sample size that can ensure the tight guarantee (Chakaravarthy et al., 2009), which is linear in the longest transaction length; the algorithm has to scan the whole dataset at least once to get this parameter, and when the number of transactions is extremely large, even scanning the whole dataset once is unfeasible. Pietracaprina et al. propose an algorithm to mine the approximate top-k FIs based on the Chernoff and Union bounds by progressive sampling (Pietracaprina et al., 2010), but the maximum length of the FIs should be fixed. Matteo Riondato et al. set the smallest sample size that ensures the tight guarantee of the results based on the VC-dimension (Riondato and Upfal, 2014), but the algorithm has to scan the dataset at least once, which is unfeasible when the dataset is extremely large. Riondato et al. later improved their algorithm by progressive sampling, with a stop condition based on Rademacher Averages (Riondato and Upfal, 2015).

3. Preliminaries

3.1. The (ε, δ)-approximate FI

Given a dataset D with transactions {t_1, t_2, …, t_m}, I = {a_1, a_2, …, a_|I|} is the universe of items appearing in D, and t_i ⊆ I for every t_i ∈ D. An itemset, denoted by x, is a subset of I. The support of x, denoted by f_D(x), is the fraction of the transactions in D which contain x. Given a minimum support θ, the set of FIs of D under θ is FI_D = {x | f_D(x) ≥ θ}, and FI_S is the set of FIs mined from a sample S.

Definition 1. Given S, a sample of a dataset D, if FI_S satisfies

(a) for every x ∈ FI_S, |f_S(x) − f_D(x)| ≤ ε/2,
(b) for every x with f_D(x) ≥ θ, x ∈ FI_S,
(c) and for every x with f_D(x) < θ − ε, x ∉ FI_S,

then FI_S is the set of ε-approximate FIs of D under θ. If FI_S is the set of ε-approximate FIs with probability at least 1 − δ, FI_S is the set of (ε, δ)-approximate FIs, where (ε, δ) is set by the user depending on the required accuracy (Riondato and Upfal, 2015). The aim of our algorithm is to produce (ε, δ)-approximate FIs in an efficient manner.

3.2. The Union bound

Given two events e_A and e_B with Pr(e_A) ≥ 1 − p_A and Pr(e_B) ≥ 1 − p_B, because Pr(e_A ∪ e_B) = Pr(e_A) + Pr(e_B) − Pr(e_A e_B), i.e., Pr(e_A e_B) = Pr(e_A) + Pr(e_B) − Pr(e_A ∪ e_B), and Pr(e_A ∪ e_B) ≤ 1, we have Pr(e_A e_B) ≥ 1 − p_A − p_B. This probability inequality is called the Union bound. Moreover, if Pr(e_A1) ≥ 1 − p_A1, Pr(e_A2) ≥ 1 − p_A2, …, and Pr(e_An) ≥ 1 − p_An, then

Pr(e_A1 e_A2 … e_An) ≥ 1 − Σ_{i=1}^{n} p_Ai.    (1)
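As a quick illustration of how Eq. (1) is used later to budget the failure probability across many events, consider the following minimal sketch (the function name and the example numbers are assumptions made for illustration):

```python
def union_bound_lower(per_event_success):
    """Pr(all events hold jointly) >= 1 - sum_i (1 - Pr(e_i)), as in Eq. (1)."""
    return max(0.0, 1.0 - sum(1.0 - p for p in per_event_success))

# Example: 200 items, each item's support estimated within eps_1 of its true
# value with probability at least 0.9999 (illustrative numbers only).
print(union_bound_lower([0.9999] * 200))  # -> 0.98
```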
3.3. Agglomerative hierarchical clustering

Agglomerative hierarchical clustering is a bottom-up clustering algorithm in which each element starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. The algorithm relies on two kinds of distances, the distance between two elements and the distance between two clusters. When clustering, the algorithm keeps merging the two clusters with the shortest distance until a stopping condition is satisfied (Akbari et al., 2015).

4. The sampling and granulation in SG

4.1. The outline of SG

Fig. 1 visualizes the essence of SG. The algorithm reduces the original dataset step by step. First, items with a low probability of belonging to FIs are removed, which simplifies the problem. Then, SG takes another sample S_2, whose size is set quickly based on the remaining items and their supports in S_1. Furthermore, the algorithm granulates S_2, where identical and similar transactions are compressed into a granule, so the scale of the data is reduced further. SG ensures that the final results are the (ε, δ)-approximate FIs; the detailed steps are presented in the following subsections.
Fig. 1. Outline of SG.
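To make the outline in Fig. 1 concrete, the following high-level sketch mirrors the three reduction steps. It is only a skeleton under assumptions: the sample sizes, the trivial granulation and all function names here are placeholders standing in for the procedures detailed in Sections 4.3–4.6, not the authors' implementation.

```python
import random

def sg_outline(dataset, theta, eps, delta):
    """High-level skeleton of SG: two sampling phases, one granulation
    phase, then bitmap-based Apriori on the granules (placeholder logic)."""
    # Phase 1: sample S1, estimate item supports, keep items with
    # f_S1({a}) >= theta - eps1 (eps1 is simply initialised to eps here).
    eps1 = eps
    s1 = random.sample(dataset, min(len(dataset), 10_000))   # size set by Theorem 3 in the paper
    item_count = {}
    for t in s1:
        for a in t:
            item_count[a] = item_count.get(a, 0) + 1
    kept_items = {a for a, c in item_count.items() if c / len(s1) >= theta - eps1}

    # Phase 2: sample S2 and drop the items rejected in phase 1.
    s2 = random.sample(dataset, min(len(dataset), 20_000))   # size set by Theorem 4 in the paper
    s2 = [set(t) & kept_items for t in s2]

    # Phase 3: granulation -- here trivially merging identical transactions;
    # SG actually uses agglomerative clustering (Section 4.5).
    granules = {}
    for t in s2:
        granules[frozenset(t)] = granules.get(frozenset(t), 0) + 1

    # The final step (Section 4.6) would run bitmap-based Apriori on the granules.
    return kept_items, granules
```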
4.2. Problem transformation and decomposition

First, the problem is transformed into a more suitable form. Given ε_1, ε, S_1 and S_2, the following two events are considered:

e_1: every a ∈ I satisfies |f_S1({a}) − f_D({a})| ≤ ε_1, where I(ε_1) = {a | f_S1({a}) ≥ θ − ε_1},
e_2: every x ⊆ I(ε_1) satisfies |f_S2(x) − f_D(x)| ≤ ε/2.

The following theorem holds.

Theorem 1. If e_1 and e_2 are true at the same time, FI_S2 = {x | f_S2(x) ≥ θ − ε/2 and x ⊆ I(ε_1)} is the set of ε-approximate FIs under θ.

Proof. First, because e_1 is true, |f_S1({a}) − f_D({a})| ≤ ε_1, so for any a ∈ I, if f_S1({a}) < θ − ε_1 then f_D({a}) < θ. Therefore, an x not satisfying x ⊆ I(ε_1) cannot be an FI. Then, because e_2 is true, for any x with f_D(x) ≥ θ we get f_S2(x) ≥ θ − ε/2, so FI_S2 satisfies (b) in Definition 1. Meanwhile, if f_D(x) < θ − ε, then f_S2(x) < θ − ε/2, so FI_S2 satisfies (c) in Definition 1. Finally, e_2 itself contains (a) in Definition 1. Therefore, FI_S2 is the set of ε-approximate FIs under θ. The proof is completed.

Now, given a set of granules G built from S_2, consider another event

e_3: every x ⊆ I(ε_1) satisfies f_G(x) = f_S2(x), where f_G(x) is the support calculated through G, which will be defined in Section 4.5.

It is easy to see that if e_1, e_2 and e_3 are all true, FI_G = {x | f_G(x) ≥ θ − ε/2 and x ⊆ I(ε_1)} is the set of ε-approximate FIs under θ.

Theorem 2. If Pr(e_1) ≥ (1 − δ)^{1/3}, Pr(e_2|e_1) ≥ (1 − δ)^{1/3} and Pr(e_3|e_1 e_2) ≥ (1 − δ)^{1/3}, then FI_G is the set of (ε, δ)-approximate FIs under θ.

Proof. Because Pr(e_1 e_2 e_3) = Pr(e_1) Pr(e_2|e_1) Pr(e_3|e_1 e_2), if Pr(e_1) ≥ (1 − δ)^{1/3}, Pr(e_2|e_1) ≥ (1 − δ)^{1/3} and Pr(e_3|e_1 e_2) ≥ (1 − δ)^{1/3}, then Pr(e_1 e_2 e_3) ≥ 1 − δ. Therefore, FI_G is the set of (ε, δ)-approximate FIs under θ. The proof is completed.

The proposed algorithm is therefore called SG, with two sampling phases and one information granulation phase: the first sampling phase ensures Pr(e_1) ≥ (1 − δ)^{1/3}, the second sampling phase ensures Pr(e_2|e_1) ≥ (1 − δ)^{1/3}, and the information granulation ensures Pr(e_3|e_1 e_2) ≥ (1 − δ)^{1/3}. Finally, FI_G is mined by Apriori improved with the bitmap technique.

4.3. The first sampling phase

The first sampling phase has two tasks. First, it should ensure Pr(e_1) ≥ (1 − δ)^{1/3}, which guarantees the accuracy of the final results. Second, it should make I(ε_1) as small as possible, so that less time is consumed by phase two and many infrequent itemsets can be rejected directly. For the first task, the following theorem and its proof can be given.

Theorem 3. Pr(e_1) ≥ (1 − δ)^{1/3} is true if the sample size of S_1, denoted by n_1, satisfies

1 − |I| (1 − P(ε_1, n_1, 0.5)) ≥ (1 − δ)^{1/3}    (2)

and

n_1 ≥ max( 5/ε_1, (√(25 + 20Nε_1) − 5) / (2ε_1) ),    (3)
where

P(ε, n, p) = φ(ε, 0, p(1 − p)/n) − φ(−ε, 0, p(1 − p)/n),    (4)

and φ(λ, μ, σ²) denotes Pr(X ≤ λ) for X ~ N(μ, σ²).

Proof. First, when N, the number of transactions in the dataset, is very large, the support of an itemset in the dataset can be seen as its probability of appearing, and the support of an itemset in a sample S of size n follows the binomial distribution, so Pr(f_S(x) = f) = C(n, fn) f_D(x)^{fn} (1 − f_D(x))^{(1−f)n}. If n is large enough, Pr(f_S(x) = f) can be calculated through the normal approximation f_S(x) ~ N(f_D(x), f_D(x)(1 − f_D(x))/n). Therefore, when n is large enough,

f_S(x) − f_D(x) ~ N(0, f_D(x)(1 − f_D(x))/n).    (5)

Then, the following formula holds as well:

Pr(|f_S(x) − f_D(x)| ≤ ε) = φ(ε, 0, f_D(x)(1 − f_D(x))/n) − φ(−ε, 0, f_D(x)(1 − f_D(x))/n).    (6)

Furthermore, according to the formula of φ(λ, μ, σ²), given two itemsets x_1 and x_2 satisfying f_D(x_1)(1 − f_D(x_1)) ≥ f_D(x_2)(1 − f_D(x_2)), we obtain

Pr(|f_S(x_1) − f_D(x_1)| ≤ ε) ≤ Pr(|f_S(x_2) − f_D(x_2)| ≤ ε).    (7)

According to (4) and (6), Pr(|f_S(x) − f_D(x)| ≤ ε) = P(ε, n, f_D(x)) when n is large enough. Then, according to the Union bound, Pr(e_1) ≥ 1 − Σ_{a∈I} (1 − P(ε_1, n_1, f_D({a}))). Because f_D(x)(1 − f_D(x)) ≤ 0.5(1 − 0.5), according to (7), 1 − P(ε_1, n_1, f_D({a})) ≤ 1 − P(ε_1, n_1, 0.5), and the inequality

Pr(e_1) ≥ 1 − |I| (1 − P(ε_1, n_1, 0.5))    (8)

holds. Therefore, to make Pr(e_1) ≥ (1 − δ)^{1/3}, n_1 should satisfy (2).

Note that all the mathematical derivations above rest on the assumption that n_1 is large enough to apply the normal approximation. For a probability p estimated from a binomial sample, a commonly used rule to decide whether the sample size n is large enough for the normal approximation is whether (Berengut, 2012)

5/n ≤ p ≤ 1 − 5/n.    (9)

Therefore, n_1 and every item should satisfy (9). However, some items may not satisfy (9) when n_1 is set only through (2). The items not satisfying (9) can be classified into two types: items whose supports are lower than 5/n_1 and items whose supports are higher than 1 − 5/n_1. Consider first the items with supports lower than 5/n_1. Given an item a with p = f_D({a}) < 5/n_1, the lowest possible value of f_S1({a}) is zero and the highest possible value of f_S1({a}) is pN/n_1. Therefore, if

[0, pN/n_1] ⊆ [p − ε_1, p + ε_1]    (10)

is satisfied, Pr(|f_S1({a}) − p| ≤ ε_1) = 1 holds. To ensure (10), we need p − ε_1 ≤ 0 and pN/n_1 ≤ p + ε_1, so p ≤ ε_1 and p ≤ ε_1 n_1/(N − n_1) should be true. Considering p < 5/n_1, if 5/n_1 ≤ ε_1 and 5/n_1 ≤ ε_1 n_1/(N − n_1) are satisfied, (10) is true, which means that n_1 should satisfy (3). For the items whose supports are higher than 1 − 5/n_1, given an item a with p = f_D({a}) > 1 − 5/n_1, let q = 1 − f_D({a}) and q_S1 = 1 − f_S1({a}). Obviously q < 5/n_1, and if Pr(|q_S1 − q| ≤ ε_1) = 1, then Pr(|f_S1({a}) − p| ≤ ε_1) = 1 as well. In the same way as above, to make sure that Pr(|q_S1 − q| ≤ ε_1) = 1, n_1 should also satisfy (3). Meanwhile, the number of items satisfying (9) is less than |I|, so when (10) is satisfied, (8) is still established according to the Union bound. Thus, when n_1 satisfies (2) and (3) at the same time, Pr(e_1) ≥ (1 − δ)^{1/3} is true. The proof is completed.

The second task is to make I(ε_1) as small as possible, so ε_1 should be as small as possible. In SG, ε_1 initially equals ε and keeps being halved (ε_1 = ε_1/2) until

|I(ε_1)| |S_1(ε_1)| > |I(2ε_1)| |S_1(2ε_1)|,    (11)

where S_1(ε_1) is the S_1 generated through ε_1 and Theorem 3. As the data scale, which is the number of items times the number of transactions, becomes larger, further reducing ε_1 is no longer cost-effective, because building I(ε_1) also takes time, and the larger S_1 is, the more time is consumed. The flowchart of the first sampling phase is shown in Fig. 2.

Fig. 2. The flowchart of the first sampling phase.
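A small numerical sketch of how a sample size satisfying (2)–(4) could be searched for is given below. It is an illustration using SciPy's normal CDF, not the authors' code; the doubling-plus-bisection search and the example values of |I|, N, ε_1 and δ are assumptions.

```python
from math import sqrt
from scipy.stats import norm

def P(eps, n, p):
    """P(eps, n, p) from Eq. (4): mass of N(0, p(1-p)/n) inside [-eps, eps]."""
    sigma = sqrt(p * (1 - p) / n)
    return norm.cdf(eps, 0, sigma) - norm.cdf(-eps, 0, sigma)

def smallest_n1(num_items, N, eps1, delta):
    """Search for the smallest n1 satisfying (2) and (3) by doubling, then bisection."""
    target = (1 - delta) ** (1 / 3)
    n_min = max(5 / eps1, (sqrt(25 + 20 * N * eps1) - 5) / (2 * eps1))   # Eq. (3)
    ok = lambda n: 1 - num_items * (1 - P(eps1, n, 0.5)) >= target        # Eq. (2)
    lo = hi = max(1, int(n_min))
    while not ok(hi):
        hi *= 2
    while lo < hi:
        mid = (lo + hi) // 2
        if ok(mid) and mid >= n_min:
            hi = mid
        else:
            lo = mid + 1
    return lo

print(smallest_n1(num_items=1000, N=1_000_000, eps1=0.01, delta=0.01))
```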
4.4. The second sampling phase

The second sampling phase should ensure Pr(e_2|e_1) ≥ (1 − δ)^{1/3}, and the following theorem and its proof can be given.

Theorem 4. Pr(e_2|e_1) ≥ (1 − δ)^{1/3} is true if the sample size of S_2, denoted by n_2, satisfies

1 − Σ_{i=1}^{|v_1|} 2^{|v_1|+|v_2|−i} (1 − P(ε/2, n_2, f_S1({a_i}) + ε)) − 2^{|v_2|} (1 − P(ε/2, n_2, 0.5)) − (1 − δ)^{1/3} ≥ 0    (12)

and

n_2 ≥ max( 10/ε, (√(25 + 20Nε) − 5) / ε ),    (13)

where v_1 is the set of items a in I(ε_1) with f_S1({a}) ≤ 0.5 − ε, sorted by f_S1({a}) from low to high, and v_2 is the set of items a in I(ε_1) with f_S1({a}) > 0.5 − ε, sorted by f_S1({a}) from low to high.

Proof. For any a ∈ v_1, because f_S1({a}) ≤ 0.5 − ε, e_1 is true and ε_1 ≤ ε, any x containing a satisfies f_D(x) ≤ f_D({a}) ≤ f_S1({a}) + ε_1 ≤ f_S1({a}) + ε ≤ 0.5, and through (7), when n_2 is large enough,
Pr(|f_S2(x) − f_D(x)| ≤ ε/2) ≥ P(ε/2, n_2, f_D(x)) ≥ P(ε/2, n_2, f_D({a})) ≥ P(ε/2, n_2, f_S1({a}) + ε).

Let e_{x,a} be the event that an itemset x containing a satisfies |f_S2(x) − f_D(x)| ≤ ε/2; then Pr(e_{x,a}) = Pr(|f_S2(x) − f_D(x)| ≤ ε/2) ≥ P(ε/2, n_2, f_S1({a}) + ε). The itemsets contained in I(ε_1) can be classified into two groups: itemsets containing at least one item in v_1, and itemsets containing at least one item in v_2 but no item in v_1. Furthermore, writing the items in v_1 as a_1 … a_{|v_1|}, the itemsets containing at least one item in v_1 can be further divided into the categories {x | a_1 ∈ x}, {x | a_2 ∈ x, a_1 ∉ x}, {x | a_3 ∈ x, a_1 ∉ x, a_2 ∉ x}, …, {x | a_{|v_1|} ∈ x, a_1 ∉ x, …, a_{|v_1|−1} ∉ x}. The size of the i-th category is 2^{|v_1|+|v_2|−i}. According to the Union bound, the probability that every x containing at least one item in v_1 satisfies |f_S2(x) − f_D(x)| ≤ ε/2 meets

Pr( ⋂_{a_i∈v_1} ⋂_{a_i∈x, a_1,…,a_{i−1}∉x} e_{x,a_i} ) ≥ 1 − Σ_{a_i∈v_1} Σ_{a_i∈x, a_1,…,a_{i−1}∉x} (1 − Pr(e_{x,a_i})) ≥ 1 − Σ_{i=1}^{|v_1|} 2^{|v_1|+|v_2|−i} (1 − P(ε/2, n_2, f_S1({a_i}) + ε)).

In a similar way, the probability that every x containing at least one item in v_2 but no item in v_1 satisfies |f_S2(x) − f_D(x)| ≤ ε/2 meets

Pr( ⋂_{a∈v_2} ⋂_{a∈x, x∩v_1=∅} e_{x,a} ) ≥ 1 − 2^{|v_2|} (1 − P(ε/2, n_2, 0.5)).

According to the Union bound,

Pr(e_2|e_1) ≥ 1 − Σ_{i=1}^{|v_1|} 2^{|v_1|+|v_2|−i} (1 − P(ε/2, n_2, f_S1({a_i}) + ε)) − 2^{|v_2|} (1 − P(ε/2, n_2, 0.5)).    (14)

Therefore, if (12) is satisfied, we obtain Pr(e_2|e_1) ≥ (1 − δ)^{1/3}. However, the derivations above also have a prerequisite, namely that n_2 is large enough to make all the itemsets contained in I(ε_1) satisfy (9). Therefore, as in the first sampling phase, for any itemset x not satisfying (9) we want to ensure that Pr(|f_S2(x) − f_D(x)| ≤ ε/2) = 1, so that x does not affect whether Pr(e_2|e_1) ≥ (1 − δ)^{1/3} is established. In the same way as in the first sampling phase, n_2 should also satisfy (13). Meanwhile, when (13) is satisfied, if we do not consider the itemsets not satisfying (9), the number of itemsets in every category becomes smaller, so (14) is still true according to the Union bound. Thus, if n_2 satisfies (12) and (13), Pr(e_2|e_1) ≥ (1 − δ)^{1/3} is true. The proof is completed.

The flowchart of the second sampling phase is displayed in Fig. 3.

Fig. 3. The flowchart of the second sampling phase.

4.5. Information granulation

Information granulation further reduces S_2 and ensures Pr(e_3|e_1 e_2) ≥ (1 − δ)^{1/3}. Information granulation puts elements into groups and uses the groups in the computation, and many methods have been proposed (Yan et al., 2005). In SG, agglomerative hierarchical clustering is used to cluster S_2 and build granules. Because every transaction is a set of discrete elements, the number of differing items between two transactions is used as their distance. Given two transactions t_1 and t_2, the distance between them is defined as

d(t_1, t_2) = |t_1 ∪ t_2| − |t_1 ∩ t_2|.    (15)

Given two clusters c_1 and c_2, the distance between them is defined as

Dis(c_1, c_2) = max{ d(t_i, t_j) | t_i ∈ c_1 and t_j ∈ c_2 }.    (16)

Agglomerative hierarchical clustering is chosen because, according to (15) and (16), once the distance between every pair of transactions is calculated at the beginning of the process and stored in a matrix, say

[ 0   d(t_1, t_2)   ⋯   d(t_1, t_{n_2})       ]
[ 0   0             ⋯   ⋯                     ]
[ ⋯   ⋯             ⋯   d(t_{n_2−1}, t_{n_2}) ]
[ 0   0             ⋯   0                     ]

the distance between any two clusters can be obtained quickly from this matrix. When the number of clusters needs to change, the algorithm does not have to rescan every transaction repeatedly, as k-means and other algorithms do. After the information granulation, for the i-th cluster c_i, α_i and w_i are respectively the union and the number of all the transactions in c_i, and the granule representing c_i is defined as g_i = (α_i, w_i). For any itemset x, f_G(x) based on G = {g_1, g_2, …, g_{|G|}} is defined as

f_G(x) = ( Σ_{i=1}^{|G|} w_i Bl(x, α_i) ) / ( Σ_{i=1}^{|G|} w_i ),    (17)

where Bl(x, α_i) = 1 when x ⊆ α_i, and 0 otherwise. Then, the following theorem holds.

Theorem 5. Pr(e_3|e_1 e_2) ≥ (1 − δ)^{1/3} is true if G = {g_1, g_2, …, g_{|G|}} satisfies

1 − Σ_{a∈I_1} Σ_{i=1}^{|G|} (1 − |β_i| / |α_i|) ≥ (1 − δ)^{1/3},    (18)

where β_i is the intersection of all the transactions located in the i-th cluster.

Proof. Let e_{g_i a} be the event that, for a random item a ∈ I_1, a ∈ α_i and a ∈ β_i are satisfied at the same time; then Pr(e_{g_i a}) = |β_i|/|α_i|, where |β_i| and |α_i| are the numbers of items contained in β_i and α_i, respectively. Therefore, by the Union bound, the probability that a ∈ α_i and a ∈ β_i are satisfied at the same time for every granule is Pr(e_{g_1 a} e_{g_2 a} … e_{g_{|G|} a}) ≥ 1 − Σ_{i=1}^{|G|} (1 − |β_i|/|α_i|). If, for every a ∈ I_1 and every granule, a ∈ α_i and a ∈ β_i are satisfied at the same time, then for every x ⊆ I_1, f_G(x) = f_S2(x). Therefore, when (18) is satisfied, Pr(e_3|e_1 e_2) ≥ (1 − δ)^{1/3} is true. The proof is completed.

The flowchart of the information granulation is shown in Fig. 4.
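The following sketch shows one way the granulation step could be realized with complete-linkage agglomerative clustering over the distance (15) and the granule-based support (17). It is an illustrative reimplementation under assumptions, not the authors' code: in particular, the stopping rule here simply fixes the number of clusters instead of checking condition (18).

```python
from itertools import combinations

def granulate(transactions, n_clusters):
    """Complete-linkage agglomerative clustering of transactions using
    d(t1, t2) = |t1 ∪ t2| - |t1 ∩ t2| (Eq. (15)); returns granules (alpha_i, w_i)."""
    ts = [frozenset(t) for t in transactions]
    clusters = [[i] for i in range(len(ts))]
    # Pairwise transaction distances, computed once and stored (Eq. (15)).
    d = {(i, j): len(ts[i] | ts[j]) - len(ts[i] & ts[j])
         for i, j in combinations(range(len(ts)), 2)}
    dist = lambda i, j: d[(i, j)] if i < j else d[(j, i)]

    while len(clusters) > n_clusters:
        # Complete linkage: cluster distance is the maximum pairwise distance (Eq. (16)).
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: max(dist(i, j)
                                      for i in clusters[ab[0]] for j in clusters[ab[1]]))
        clusters[a].extend(clusters[b])
        del clusters[b]

    # A granule stores the union of its transactions and their count.
    return [(frozenset().union(*(ts[i] for i in c)), len(c)) for c in clusters]

def granule_support(itemset, granules):
    """f_G(x) from Eq. (17): weighted containment divided by the total weight."""
    total = sum(w for _, w in granules)
    return sum(w for alpha, w in granules if set(itemset) <= alpha) / total

granules = granulate([{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c", "d"}], n_clusters=2)
print(granules, granule_support({"a"}, granules))
```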
4.6. Mining FIs by binary matrix

After the information granulation, Apriori improved by the bitmap technique is used to mine the FIs, where matrix operations replace the scanning of the data. Because the dataset has been greatly reduced, operating on a small binary matrix through a high-performance tool such as MATLAB or MKL (Wang et al., 2014; Release, 2013), with a θ that is not too low, is faster than operating on a complex tree based data structure. Fig. 5 shows the runtime of PrePost+, a high-performance tree based algorithm, and the runtime of bitmap based Apriori on a small dataset with 8124 instances and 120 items; the environment of this example is the same as described in Section 5.3.
Fig. 4. The flowchart of information granulation.

Fig. 5. The runtime of PrePost+ and bitmap based Apriori on a small dataset.
In detail, every granule is represented by a binary vector and a weight. The length of the vector is |I(ε_1)|; items contained in the granule are denoted by 1, and the others by 0. For example, if I(ε_1) = {a_1, a_2, a_3, a_4, a_5}, a granule g = (α, ω) with α = {a_1, a_3, a_4} is represented by [1 0 1 1 0]. After the granulation, the reduced dataset is represented by a binary matrix

G_b = [ 1 0 1 1 0 ]
      [    ⋯      ]
      [ 1 1 0 1 1 ]

whose rows are granules, and whose weights are stored in a vector such as

W = [ 10 5 ⋯ ⋯ 7 ].

Candidate itemsets are represented by another binary matrix, such as

CI = [ 1 1 0 0 0 ]
     [    ⋯      ]
     [ 0 1 1 1 0 ]

where every row represents an itemset. When the supports of the candidate itemsets of length k are calculated in the k-th step, the following steps are applied. First, an intermediate matrix F is obtained by F = CI × G_b^T, and every element of F smaller than k is set to 0, while the others are set to 1. Then, the support of every candidate itemset is (1/|G|) · (F × W^T).
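A compact sketch of this matrix-based support computation is given below, using NumPy in place of MATLAB/MKL. It is an illustration under assumptions, not the authors' implementation; note that the sketch normalizes by the total granule weight, which matches the definition of f_G in Eq. (17).

```python
import numpy as np

# Binary granule matrix Gb: rows are granules, columns are items of I(eps1);
# W holds the granule weights (numbers of merged transactions).
Gb = np.array([[1, 0, 1, 1, 0],
               [1, 1, 0, 1, 0],
               [1, 1, 1, 1, 1]])
W = np.array([10, 5, 7])

# Candidate itemsets, one per row.
CI = np.array([[1, 1, 0, 0, 0],
               [1, 0, 1, 1, 0],
               [0, 0, 0, 1, 1]])
k = CI.sum(axis=1)                       # length of each candidate

# F[i, j] = 1 iff candidate i is fully contained in granule j.
F = (CI @ Gb.T >= k[:, None]).astype(int)

# Granule-based supports f_G (Eq. (17)): weighted containment / total weight.
supports = (F @ W) / W.sum()
print(supports)
```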
5. Experimental studies

5.1. A basis of comparative study

In the experiments, the algorithms shown in Table 1 are chosen for comparison with SG. These algorithms reduce the runtime of mining through several different methods and are briefly introduced below.

(a) CT-PRO (Sucahyo and Gopalan, 2004): CT-PRO applies the compressed FP-Tree (CFP-Tree), which has far fewer nodes than the FP-Tree. CT-PRO is much faster than FP-growth.
(b) PrePost (Deng et al., 2012): PrePost uses the N-list, built on the PPC-tree, to represent the data. The properties of the PPC-tree ensure that the N-list is compact. PrePost computes the supports of itemsets through efficient intersections of N-lists, so it is faster than FP-growth.
(c) PrePost+ (Deng and Lv, 2015): PrePost+ is an extension of PrePost. Through Children–Parent Equivalence pruning, the search space is greatly reduced, so PrePost+ is theoretically faster than PrePost.
(d) BitTableFI (Dong and Han, 2007): BitTableFI is an extension of Apriori in which the dataset is compressed into binary numbers, and the logical AND operator is used to quickly calculate the supports of itemsets and to quickly generate candidate itemsets. Its performance is better than many Apriori-like algorithms.
(e) Algorithm based on the central limit theorem (CLT) (Li and Gopalan, 2004): This algorithm speeds up the mining process through sampling. A loose guarantee for the accuracy of the final results is given: the deviation of a single support can be limited to a given range with a given probability. The sample size is chosen through the CLT.
(f) Algorithm based on hybrid bounds (HB) (Li and Gopalan, 2004): This algorithm also reduces the data by sampling and gives a loose guarantee for the accuracy of the final results, where the sample size is chosen by combining the multiplicative and additive Chernoff bounds.
(g) FI-GF (Zhang et al., 2015): FI-GF reduces the data by the principle of justifiable granularity, where similar and adjacent transactions are compressed into a granule. FI-GF offers no guarantee for the final results.
(h) Algorithm proposed by Venkatesan T. Chakaravarthy (Chakaravarthy et al., 2009), denoted by VTC: This algorithm reduces the data scale through sampling and offers a tight guarantee satisfying Definition 1. The sample size is chosen through the Chernoff bound.
(i) Algorithm based on Rademacher Averages (Riondato and Upfal, 2015): This algorithm reduces the data scale by progressive sampling. A tight guarantee satisfying Definition 1 is given, where the stop condition is designed based on Rademacher Averages. In the rest of this paper, this algorithm is called RA.
(j) ARMGA (Yan et al., 2005): ARMGA is built on a genetic algorithm, generates ARs directly and does not need a minimum support or minimum confidence. It evaluates a rule X → Y by (Sup(X ∪ Y) − Sup(X)Sup(Y)) / (Sup(X)(1 − Sup(Y))). Considering that ARMGA can only search rules of a fixed length, we run it 10 times per mining to extract rules with lengths from 2 to 11.
(k) BPSO (Sarath and Ravi, 2013): BPSO extracts rules through binary particle swarm optimization. It does not need a minimum support or minimum confidence, and evaluates a rule X → Y by Sup(X ∪ Y) · Sup(X ∪ Y)/Sup(X).
(l) QAR-CIP-NSGA-II (Martín et al., 2014): This algorithm applies a multi-objective genetic algorithm called NSGA-II to mining ARs, where a restarting process is added. The original version takes support, confidence, interestingness and comprehensibility into account, so it evaluates rules more strictly than ARMGA and BPSO.
Fig. 6. The probability density functions of the runtime of the algorithms on (a) BMS-POS, (b) RecordLink, (c) kosarak, (d) USCensus, (e) kddcup99, (f) PAMAP, (g) POWERC, (h) webdocs, and (i) SUSY.
In our experiments, only Sup(X ∪ Y) and Sup(X ∪ Y)/Sup(X) are used to evaluate a rule X → Y. In our implementation, the method in Section 4.6 is used to mine the FIs after data reduction. If the sample size obtained by a sampling based method exceeds the original data size, the algorithm directly goes to the mining phase. The binary matrix is also applied to evaluate chromosomes, where all the transactions with the same items are compressed into a granule whose weight is the number of transactions in that granule.

5.2. Selecting parameters

Before the experiments, some parameters need to be set: the minimum support θ, the maximum tolerable error ε caused by the deviation of supports, and the probability 1 − δ with which this maximum error is respected. This section offers some guidance on setting these parameters and gives the values used in the experiments. First, selecting the minimum support θ depends on the specific application; there are two common ways, trial and error, and replacing θ with another parameter, such as mining the k most interesting FIs.
Table 1. The information of the algorithms compared with SG.

Algorithm         Style               Guarantee of results
CT-PRO            Tree based          Exact results
PrePost           Tree based          Exact results
PrePost+          Tree based          Exact results
BitTableFI        Bitmap based        Exact results
CLT               Sampling based      Loose
HB                Sampling based      Loose
FI-GF             Granulation based   No
VTC               Sampling based      Tight
RA                Sampling based      Tight
ARMGA             GA based            No
BPSO              PSO based           No
QAR-CIP-NSGA-II   MOGA based          No
The second method does not solve the problem completely, because the substitute parameter, such as k, also needs to be set. Therefore, even when the user wants to obtain the exact results, SG can make this trial and error fast and reliable. The θ used for each dataset in our experiments is shown in Table 2. To fully test the performance of the algorithms, all values are set relatively low.
Fig. 7. The distribution of deviations of FIs' supports on (a) BMS-POS, (b) RecordLink, (c) kosarak, (d) USCensus, (e) kddcup99, (f) PAMAP, (g) POWERC, (h) webdocs, and (i) SUSY.
Selecting ε and 1 − δ also depends on the specific application. High values of ε and δ make the algorithm faster but generate more errors, while low values slow the algorithm down but enhance the accuracy. To fully test the performance of SG, ε and δ are both given low values; they are both set to 0.01. To be fair, for RA, VTC and the algorithms based on CLT and HB, ε and δ are also set to 0.01. Furthermore, if the user wants to mine ARs, a minimum confidence threshold γ is also needed. γ is an estimate of the minimum conditional probability of the consequent given the antecedent, which again depends on the specific application. In research on three-way decisions (Yao, 2007), an often used conditional probability which ensures the establishment of a rule is 0.75, so γ in our experiments is set to 0.75.
5.3. The environment of the experiment

Table 2 shows the datasets used in the experiments. C++ is used to implement the sampling and data conversion, and Matlab R2013A is used to implement the information granulation and the matrix operations. The code of CT-PRO is obtained from http://fimi.ua.ac.be/src/, and the code of PrePost and PrePost+ is obtained from http://www.voidcn.com/blog/pku_sigma/. The platform is a PC with an i5-2450 CPU, 6 GB of DDR3 memory and 64-bit Windows 8.1.

The datasets in Table 2 have at least 500,000 transactions, for the following reasons. SG is a sampling based algorithm that offers a tight guarantee for the final results, so it is proposed for mining datasets with a large number of transactions; there is no need to sample and offer approximate results when the number of transactions is small. We therefore first chose three datasets from http://fimi.ua.ac.be/data/ and www.kdd.org/kdd-cup/view/kdd-cup-2000, namely BMS-POS, kosarak and webdocs. However, few open source datasets with a large number of transactions are available for ARs mining, and three datasets are not enough, so we transformed RecordLink, USCensus, kddcup99, PAMAP, POWERC and SUSY, whose original versions can be found at http://archive.ics.uci.edu/ml/datasets.html, into the form required for ARs mining and donated them to http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.
5.4. The runtime of FIs mining

First, the speed of the algorithms in Table 1 is compared; the heuristic algorithms are not involved here because they do not generate FIs. Every algorithm is run 50 times, and a probability density function of the runtime is estimated; it is assumed to follow the normal distribution and includes the time cost of all the operations in the mining.
Fig. 8. The distribution of real supports of FIs obtained by every algorithm on (a) BMS-POS, (b) RecordLink, (c) kosarak, (d) USCensus, (e) kddcup99, (f) PAMAP, (g) POWERC, (h) webdocs, and (i) SUSY.
Table 2. The information of the datasets.

Dataset      Scale of transactions   Scale of items   θ     Brief introduction
BMS-POS      515,597                 1,657            5%    Clickstream data from an e-commerce
RecordLink   547,913                 29               70%   Records representing individual data
kosarak      990,002                 41,270           3%    Clickstream data from a news portal
USCensus     1,000,000               396              70%   Individual information from a census
kddcup99     1,000,000               135              70%   Characteristics of connections to a website
PAMAP        1,000,000               141              80%   Physical activity monitoring dataset
POWERC       1,040,000               140              40%   Electric power consumption dataset
webdocs      1,692,082               5,267,656        35%   A collection of web documents
SUSY         5,000,000               190              80%   Properties of signal processes
The results are shown in Fig. 6, where o.o.m. means that the algorithm ran out of memory. Then, Welch's t-test is applied, whose null hypothesis is that the expected value of SG's runtime is smaller than that of the other algorithm. The level of significance is set to 0.05, a commonly used value. The results are shown in Table 3, where the p-value is given first, followed by the decision of whether the null hypothesis is accepted or rejected, denoted by A and R. Moreover, Table 4 shows the average number of instances in every dataset after being reduced by the algorithms based on data reduction.
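For clarity, the runtime comparison can be reproduced with a few lines of SciPy (an illustration with made-up runtime arrays; the one-sided handling of the hypothesis follows our reading of the description above, not the authors' code):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
runtime_sg = rng.normal(loc=12.0, scale=1.0, size=50)     # 50 runs (illustrative values)
runtime_other = rng.normal(loc=15.0, scale=2.5, size=50)

# Welch's t-test (unequal variances). With alternative="greater" the null
# hypothesis is that SG's expected runtime is not larger than the other
# algorithm's; a small p-value would reject it (SG would be slower).
t_stat, p_value = ttest_ind(runtime_sg, runtime_other, equal_var=False,
                            alternative="greater")
print(t_stat, p_value, "reject H0" if p_value < 0.05 else "accept H0")
```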
According to Fig. 6 and Tables 3 and 4, we observe the following.

(a) When the scale of the data becomes large, SG runs faster than CT-PRO, PrePost, PrePost+ and BitTableFI. Furthermore, SG's speed advantage grows as the size of the data increases.
(b) Compared with FI-GF, VTC and RA, SG also runs faster.
(c) SG is slower than the algorithms based on CLT and HB; in the following two experiments, we will show that SG is more reliable than they are.
(d) The algorithms taking more time have larger variances of runtime.
Fig. 9. The distribution of lengths of FIs obtained by every algorithm on (a) BMS-POS, (b) RecordLink, (c) kosarak, (d) USCensus, (e) kddcup99, (f) PAMAP, (g) POWERC, (h) webdocs, and (i) SUSY.
Table 3. Welch's t-test on the runtime of SG and the other algorithms (A: acceptance, R: rejection).

Dataset      CT-PRO          PrePost          PrePost+         BitTableFI   CLT
BMS-POS      2.5 × 10−8 R    3.0 × 10−25 R    4.9 × 10−25 R    1 A          2.1 × 10−34 R
RecordLink   0.6914 A        4.5 × 10−4 R     5.3 × 10−4 R     1 A          3.6 × 10−20 R
kosarak      1 A             0.9238 A         0.8606 A         1 A          1.5 × 10−10 R
USCensus     1 A             0.9646 A         0.9618 A         1 A          7.8 × 10−16 R
kddcup99     1 A             1 A              1 A              1 A          5.6 × 10−12 R
PAMAP        1 A             1 A              1 A              1 A          4.8 × 10−16 R
POWERC       1 A             0.9981 A         0.9968 A         1 A          3.2 × 10−22 R
webdocs      /               /                /                1 A          5.9 × 10−6 R
SUSY         /               1 A              1 A              1 A          7.7 × 10−102 R

Dataset      HB               FI-GF   VTC   RA
BMS-POS      7.8 × 10−43 R    1 A     1 A   1 A
RecordLink   8.5 × 10−24 R    1 A     1 A   1 A
kosarak      1.4 × 10−11 R    1 A     1 A   1 A
USCensus     2.8 × 10−17 R    1 A     1 A   1 A
kddcup99     5.2 × 10−21 R    1 A     1 A   1 A
PAMAP        0.0395 R         1 A     1 A   1 A
POWERC       7.9 × 10−22 R    1 A     1 A   1 A
webdocs      4.5 × 10−6 R     1 A     1 A   /
SUSY         1.8 × 10−105 R   1 A     1 A   1 A
Fig. 10. The distribution of deviations of rules' confidences on (a) BMS-POS, (b) RecordLink, (c) kosarak, (d) USCensus, (e) kddcup99, (f) PAMAP, (g) POWERC, (h) webdocs, and (i) SUSY.
Table 4. The average number of instances in the reduced data.

Dataset      SG         CLT      HB     FI-GF     VTC         RA
BMS-POS      26,306.8   16,513   3179   1946.3    515,597     447,785.44
RecordLink   67.1       16,513   3179   834       547,913     439,927.16
kosarak      919.62     16,513   3179   2952.14   990,002     990,002
USCensus     2072.46    16,513   3179   3878.56   1,000,000   724,275.6
kddcup99     67.52      16,513   3179   281.8     1,000,000   454,126
PAMAP        1729.88    16,513   3179   588.76    1,000,000   471,842.96
POWERC       40.02      16,513   3179   667.6     1,040,000   472,287.04
webdocs      13,505.4   16,513   3179   8328.84   1,692,082   /
SUSY         231        16,513   3179   8448.1    5,000,000   493,775.3
Table 5. F-test on deviations of supports (A: acceptance, R: rejection).

       BMS-POS   RecordLink   kosarak   USCensus   kddcup99   PAMAP   POWERC   webdocs   SUSY
CLT    1 A       1 A          1 A       1 A        1 A        1 A     1 A      1 A       1 A
HB     1 A       1 A          1 A       1 A        1 A        1 A     1 A      1 A       1 A
These results can be explained as follows.

(a) SG evidently reduces the dataset and thus saves much time.
(b) FI-GF (Zhang et al., 2015) needs to scan the dataset at least twice and to compute the Hamming distances between every pair of adjacent transactions, which costs a lot of time. To give a tight guarantee on the final results, both VTC and RA over-estimate the sample size. Furthermore, RA (Riondato and Upfal, 2015) has to operate on two huge matrices, whose sizes are proportional to the scale of items, which slows RA down and makes it run out of memory when processing webdocs.
Fig. 11. The distribution of real confidences of rules obtained by every algorithm on (a) BMS-POS, (b) RecordLink, (c) kosarak, (d) USCensus, (e) kddcup99, (f) PAMAP, (g) POWERC, (h) webdocs, and (i) SUSY.
(c) The algorithms based on CLT and HB can obtain the sample size directly from ε and δ without any further operation, so they are very fast. However, the following experiments show that their reliability is worse than SG's.
(d) The algorithms taking longer contain more unit operations of the computer, whose variances accumulate.
5.5. The accuracy of FIs

The algorithms running faster than SG, namely those based on CLT and HB, are chosen for further comparison. For every dataset, each algorithm is run 50 times to generate the FIs and their supports. The real supports of those FIs are also obtained by scanning the original dataset. The deviations of the supports obtained by every algorithm are computed, and Fig. 7 shows their fitted probability densities, which are assumed to follow the normal distribution. Table 5 displays the results of the F-test, whose null hypothesis is that the variance of the deviations obtained by SG is less than that of the other two algorithms, with the level of significance set to 0.05. In Table 5, the p-value is given first, followed by the decision of whether the null hypothesis is accepted (A) or rejected (R). Then, Table 6 shows the range of the deviations, the average number of FIs generated by every algorithm, and the precision and recall of the FIs generated by every algorithm, where precision is the fraction of real FIs among the itemsets obtained by the algorithm, and recall is the fraction of all the real FIs that are obtained by the algorithm. The real FIs are generated by BitTableFI. To further show the accuracy of the FIs obtained by every algorithm, the fitted distributions of the real supports of the FIs generated by every algorithm are shown in Fig. 8, and the distributions of the lengths of the FIs obtained by every algorithm are shown in Fig. 9.

According to Fig. 7, Tables 5 and 6, and Figs. 8 and 9, the following can be observed.

(a) The supports generated by SG have much smaller deviations than those of the CLT and HB based algorithms, and the error of SG is strictly controlled according to Definition 1.
(b) Although the algorithms based on CLT and HB are supposed to limit the error to [−0.01, 0.01] with high probability, the errors often fall outside this range.
(c) The recall of SG is higher than the recalls of the algorithms based on CLT and HB, and the precision of SG is a little lower than theirs.
(d) The distributions of the FIs' supports and lengths generated by all the algorithms are similar to those obtained by BitTableFI.
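The variance comparison used above can be sketched as follows (illustrative arrays of support deviations; the one-sided F statistic and the use of the F distribution's survival function are our assumptions about how such a test is typically carried out):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
dev_sg = rng.normal(0.0, 0.002, size=200)    # support deviations of SG (illustrative)
dev_clt = rng.normal(0.0, 0.008, size=200)   # support deviations of a baseline

# One-sided F-test: H0 is that SG's deviation variance is not larger.
F_stat = np.var(dev_sg, ddof=1) / np.var(dev_clt, ddof=1)
p_value = f.sf(F_stat, len(dev_sg) - 1, len(dev_clt) - 1)   # dfn, dfd
print(F_stat, p_value, "accept H0" if p_value >= 0.05 else "reject H0")
```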
Fig. 12. The distribution of real leverages of rules obtained by every algorithm on (a) BMS-POS, (b) RecordLink, (c) kosarak, (d) USCensus, (e) kddcup99, (f) PAMAP, (g) POWERC, (h) webdocs, and (i) SUSY.
These phenomena can be explained as follows.

(a) SG offers the tight guarantee introduced in Definition 1, so the accuracy of its results is higher than that of the algorithms with a loose guarantee.
(b) The algorithms based on CLT and HB do not take any action to avoid losing FIs. Since some supports become lower after sampling, they are bound to lose some real FIs and get lower recalls.
(c) To avoid losing real FIs, SG actually uses θ − ε/2 in place of θ, so it obtains more itemsets, a higher recall and a lower precision. However, given its highly accurate supports, the supports of those extra itemsets are very close to their real values and to θ, so they can also offer useful information to the user.
(d) Nevertheless, SG and the algorithms based on CLT and HB all offer some guarantee on the final results, so they all reflect the true picture of the real FIs to some degree.

5.6. The accuracy of ARs

Then, ARs are generated from the FIs, where every algorithm is run 50 times. The deviations of the rules' confidences obtained by every algorithm are computed, and their fitted probability densities, assumed to follow the normal distribution, are shown in Fig. 10. Table 7 displays the results of the F-test, whose null hypothesis is that the variance of the deviations obtained by SG is less than that of the other two algorithms, with the level of significance set to 0.05. In Table 7, the p-value is given first, followed by the decision of whether the null hypothesis is accepted (A) or rejected (R). Table 8 shows the range of the deviations and the average number, precision and recall of the rules obtained by every algorithm, where precision is the fraction of real rules among the rules obtained by the algorithm, and recall is the fraction of all the real rules that are obtained by the algorithm. The real rules are generated through BitTableFI. Furthermore, Figs. 11 and 12 respectively show the fitted distributions of the real confidences and real leverages of the rules obtained by every algorithm, compared with what BitTableFI generates.

According to Fig. 10, Tables 7 and 8, and Figs. 11 and 12, the following can be observed.
Table 6. Error range, average precision, recall and number of FIs generated by the algorithms.

Dataset      Measure       SG                  CLT                 HB                  BitTableFI
BMS-POS      Error range   [−0.0025, 0.0027]   [−0.0141, 0.0075]   [−0.0285, 0.0168]   [0, 0]
             Precision     0.8487              0.9964              0.9555              1
             Recall        1                   0.9017              0.8356              1
             Number        69.6                53.4                51.82               59
RecordLink   Error range   [−0.0017, 0.0016]   [−0.0111, 0.0061]   [−0.0173, 0.0229]   [0, 0]
             Precision     0.9695              0.9874              0.9827              1
             Recall        1                   0.9870              0.9837              1
             Number        127                 123                 123.02              123
kosarak      Error range   [−0.0031, 0.0028]   [−0.0074, 0.0081]   [−0.0181, 0.0141]   [0, 0]
             Precision     0.7892              0.9730              0.9552              1
             Recall        1                   0.9354              0.9092              1
             Number        82.4                62.54               61.9                65
USCensus     Error range   [−0.0019, 0.0020]   [−0.0109, 0.0093]   [−0.0177, 0.0208]   [0, 0]
             Precision     0.9110              0.9842              0.9552              1
             Recall        1                   0.9038              0.8981              1
             Number        57.1                47.8                49.1                52
kddcup99     Error range   [−0.0018, 0.0034]   [−0.0122, 0.0231]   [−0.0178, 0.0354]   [0, 0]
             Precision     1                   1                   1                   1
             Recall        1                   0.9761              0.8672              1
             Number        134                 130.8               116.2               134
PAMAP        Error range   [−0.0025, 0.0027]   [−0.0141, 0.0075]   [−0.0285, 0.0168]   [0, 0]
             Precision     0.9279              0.9890              0.9474              1
             Recall        1                   0.9870              0.9685              1
             Number        116.5               107.8               110.7               108
POWERC       Error range   [−0.0025, 0.0017]   [−0.0091, 0.0080]   [−0.0219, 0.0213]   [0, 0]
             Precision     1                   1                   1                   1
             Recall        1                   1                   0.995               1
             Number        40                  40                  39.8                40
webdocs      Error range   [−0.0025, 0.0025]   [−0.0124, 0.0099]   [−0.0308, 0.0206]   [0, 0]
             Precision     0.9648              0.9868              0.9742              1
             Recall        1                   0.9779              0.9382              1
             Number        70.5                67.4                65.6                68
SUSY         Error range   [−0.0021, 0.0023]   [−0.0074, 0.0062]   [−0.0158, 0.0128]   [0, 0]
             Precision     0.9406              0.9906              0.9809              1
             Recall        1                   0.9803              0.9657              1
             Number        334.9               311.9               310.4               315
Table 7. F-test on deviations of confidences (A: acceptance, R: rejection).

       BMS-POS   RecordLink   kosarak    USCensus   kddcup99   PAMAP   POWERC   webdocs   SUSY
CLT    1 A       1 A          0.9328 A   1 A        1 A        1 A     1 A      1 A       1 A
HB     1 A       1 A          1 A        1 A        1 A        1 A     1 A      1 A       1 A
(a) The confidences of the rules obtained by SG are evidently more accurate than those obtained by the algorithms based on CLT and HB.
(b) SG finds more rules than BitTableFI and the algorithms based on CLT and HB do. In some cases the precision of SG is a little lower than the others', but on most datasets the recall of SG is higher.
(c) The algorithms based on CLT and HB, especially the latter, always lose some real rules, so although their precisions are high on some datasets, their recalls are always lower than SG's.
(d) On most datasets, the distributions of the real confidences and leverages of the rules generated by every algorithm are similar to those obtained by BitTableFI, but in some cases the distributions produced by the HB based algorithm differ noticeably from BitTableFI's.

These phenomena can be explained as follows.

(a) The high accuracy of the supports obtained by SG also ensures the high accuracy of the confidences of its rules, so it has smaller deviations of confidences and loses fewer rules.
(b) To avoid losing FIs, SG obtains additional itemsets whose supports are very close to θ, so some extra ARs are also obtained, which makes the precision of SG lower than that of the other two algorithms on some datasets. Because θ is usually set by trial and error, we cannot arbitrarily conclude that an itemset with support only a little lower than θ is useless. Considering that the deviations of the supports obtained by SG are very low, these extra FIs and the rules built on them also provide useful information for the user.
(c) The number of itemsets grows exponentially as the support decreases (Chuang et al., 2008), so when θ decreases, a small deviation of support can generate more extra FIs, and more rules are obtained from these extra FIs. The θ values used in this paper are relatively low in order to test the speed of SG, so the numbers of FIs, rules and extra rules obtained by SG are relatively high. In many real applications, θ will not be set so low. Choosing θ is another classical and large problem, but it is not the focus of this paper.
(d) The algorithms based on CLT and HB lose many real FIs, and the rules based on these lost FIs are lost as well. Their higher deviations of confidences also make them lose further real rules.
(e) The sample size of the algorithm based on HB is relatively small, so its accuracy is also lower than that of the algorithm based on CLT.
(f) SG and the algorithms based on CLT and HB all offer some guarantee for the FIs, so their results can reflect the real rules to some degree. However, the algorithm based on HB loses more ARs and obtains higher deviations of supports, confidences and leverages, so in some cases the distributions of those indexes differ evidently from those of BitTableFI.
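Since the rule measures discussed above are all derived from itemset supports, a tiny helper makes explicit how support errors propagate to confidence and leverage (illustrative code with made-up numbers; the measure definitions used here are the standard ones):

```python
def confidence(sup_xy, sup_x):
    """Confidence of X -> Y: Sup(X ∪ Y) / Sup(X)."""
    return sup_xy / sup_x

def leverage(sup_xy, sup_x, sup_y):
    """Leverage of X -> Y: Sup(X ∪ Y) - Sup(X) * Sup(Y)."""
    return sup_xy - sup_x * sup_y

# Exact supports versus supports carrying a small estimation error.
print(confidence(0.40, 0.50), confidence(0.41, 0.49))   # 0.8 vs ~0.8367
print(leverage(0.40, 0.50, 0.60), leverage(0.41, 0.49, 0.61))
```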
Table 8. Error range, average precision, recall and number of ARs generated by the algorithms.

Dataset      Measure       SG                        Central limit            HB                  BitTableFI
BMS-POS      Error range   [−0.0099, 0.0038]         [−0.0121, 0.0095]        [−0.0550, 0.0372]   [0, 0]
             Precision     0.81                      1                        0.9333              1
             Recall        1                         0.6308                   0.8477              1
             Number        29.7                      15.14                    21.8                24
RecordLink   Error range   [−0.0011, 0.0011]         [−0.0033, 0.0027]        [−0.0067, 0.0172]   [0, 0]
             Precision     0.9446                    1                        1                   1
             Recall        1                         0.9763                   0.9453              1
             Number        1281.64                   1180.3                   1142.92             1209
kosarak      Error range   [−0.0090, 0.0081]         [−0.0154, 0.0036]        [−0.0152, 0.0719]   [0, 0]
             Precision     0.7869                    1                        0.7778              1
             Recall        1                         0.9050                   0.8850              1
             Number        61.2                      43.44                    54.72               48
USCensus     Error range   [−9.28 × 10−4, 0.0010]    [−0.0029, 0.0020]        [−0.0074, 0.0153]   [0, 0]
             Precision     0.9374                    1                        0.5777              1
             Recall        1                         0.8                      1                   1
             Number        124.08                    92                       206.7               115
kddcup99     Error range   [−0.0018, 0.0034]         [−0.0122, 0.0231]        [−0.0178, 0.0354]   [0, 0]
             Precision     1                         0.9865                   1                   1
             Recall        0.9737                    0.9784                   0.4530              1
             Number        1890.88                   1926                     879.74              1942
PAMAP        Error range   [−0.0012, 9.99 × 10−4]    [−0.0038, 0.0018]        [−0.0093, 0.0121]   [0, 0]
             Precision     0.8711                    1                        0.9573              1
             Recall        1                         0.9659                   0.9453              1
             Number        462.02                    388.3                    398.16              402
POWERC       Error range   [−0.0021, 7.8 × 10−4]     [−0.0031, 8.86 × 10−4]   [−0.0059, 0.0067]   [0, 0]
             Precision     1                         1                        1                   1
             Recall        1                         0.9444                   0.9146              1
             Number        144                       136                      131.7               144
webdocs      Error range   [−0.0025, 0.0025]         [−0.0124, 0.0099]        [−0.0308, 0.0206]   [0, 0]
             Precision     0.9618                    0.9563                   0.9628              1
             Recall        1                         0.9791                   1                   1
             Number        132.66                    129.38                   131.9               126
SUSY         Error range   [−0.0016, 0.0015]         [−0.0015, 0.0026]        [−0.0158, 0.0027]   [0, 0]
             Precision     0.9695                    0.9998                   1                   1
             Recall        1                         0.9884                   0.9183              1
             Number        6912.72                   6581.8                   6114.24             6658
Table 9. Parameters of the heuristic algorithms used in the experiments.

Algorithm         Parameters                   Values
BPSO              Inertia weight               {0.5, 1, 1.5}
                  Acceleration coefficient 1   {0.5, 1, 1.5}
                  Acceleration coefficient 2   {0.5, 1, 1.5}
                  Population size              {30, 40, 50}
                  Iteration number             {30, 90, 150}
ARMGA             Selection probability        0.95 (recommended by the literature)
                  Crossover probability        0.85 (recommended by the literature)
                  Mutation probability         0.01 (recommended by the literature)
                  Population size              {30, 40, 50}
                  Iteration number             {30, 90, 150}
QAR-CIP-NSGA-II   Difference threshold         0.05 (recommended by the literature)
                  Mutation probability         0.1 (recommended by the literature)
                  Factor of amplitude          2 (recommended by the literature)
                  Population size              {30, 40, 50}
                  Iteration number             {30, 90, 150}
Table 10. Welch's t-test on the runtime (A: acceptance, R: rejection).

                  BMS-POS   RecordLink   kosarak   USCensus   kddcup99   PAMAP   POWERC   webdocs   SUSY
ARMGA             1 A       1 A          1 A       1 A        1 A        1 A     1 A      1 A       1 A
BPSO              1 A       1 A          1 A       1 A        1 A        1 A     1 A      1 A       1 A
QAR-CIP-NSGA-II   1 A       1 A          1 A       1 A        1 A        1 A     1 A      1 A       1 A
5.7. Comparing with heuristic algorithms

In this section, SG is compared with BPSO, ARMGA and QAR-CIP-NSGA-II. The parameters of these heuristic algorithms are shown in Table 9. The performance of heuristic algorithms depends on how their parameters are adjusted. Therefore, for all the parameters except those with values recommended in their literature (Yan et al., 2005; Sarath and Ravi, 2013; Martín et al., 2014), we use different values to test their performance. The choices of parameters are shown in Table 9, where the iteration numbers of ARMGA and QAR-CIP-NSGA-II are the total iteration numbers. First, the runtime of every algorithm, which includes forming the FIs and generating the ARs, is compared; every algorithm is run 50 times, and the fitted probability densities are shown in Fig. 13.
Fig. 13. Distribution of the time cost by every algorithm with the lowest expectation, on (a) BMS-POS, (b) RecordLink, (c) kosarak, (d) USCensus, (e) kddcup99, (f) PAMAP, (g) POWERC, (h) webdocs, and (i) SUSY.
For each heuristic algorithm, considering that there are too many parameter combinations, only the probability density with the lowest expectation is shown, together with the corresponding parameters. The runtime of SG comes from the experiment in Section 5.6. Table 10 shows the results of Welch's t-test, whose null hypothesis is that the time cost by SG is less than that of the heuristic algorithms. In Table 10, the p-value is given first, followed by the decision of whether the null hypothesis is accepted (A) or rejected (R). The significance level is set to 0.05, a commonly used value. According to Fig. 13 and Table 10, we observe the following.

(a) SG runs faster than BPSO, ARMGA and QAR-CIP-NSGA-II.
(b) The heuristic algorithms are fastest when their population size is 30 and their iteration number is 30.

These phenomena can be explained as follows.

(a) SG saves much time by data reduction, whereas the heuristic algorithms evaluate every particle or chromosome in every loop; with the parameters set by us, they perform this operation at least 30 times.
(b) A smaller iteration number and a smaller population size reduce the number and the complexity of chromosome evaluations. However, reducing them also reduces the search ability, and a population size of 30 and an iteration number of 30 are already two relatively low values.

Then, the ARs obtained by SG and by the heuristic algorithms are compared. Considering that the heuristic algorithms extract rules without setting a minimum support (Yan et al., 2005; Sarath and Ravi, 2013; Yan et al., 2009), comparing SG with them is not straightforward, so two experiments are done. First, all the algorithms extract rules under θ and γ, and the results are compared to show the reliability of SG. Then, the rules obtained by the heuristic algorithms without θ are compared with the rules obtained by SG to show the advantages of the heuristic algorithms.

In the first experiment, because the heuristic algorithms do not originally form rules under θ and γ (Yan et al., 2005; Sarath and Ravi, 2013; Yan et al., 2009), it would not be fair to use their original versions directly, so some parts are modified. After every loop, the heuristic algorithms put the rules satisfying θ and γ into the results, and duplicate rules are removed. Selection, crossover, mutation, particle movement and the other important steps of the heuristic algorithms are not changed, so this modification does not change their ability to extract rules under θ and γ. Every algorithm is run 50 times.
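The modification described above, collecting after every loop the rules that satisfy θ and γ and removing duplicates, can be sketched as follows; the rule representation and the per-loop rule generator are illustrative assumptions, not part of the original heuristic algorithms.

```python
def collect_rules(loops, theta, gamma):
    """Accumulate rules (antecedent, consequent, support, confidence) over the
    loops of a heuristic miner, keeping only those meeting the minimum support
    theta and minimum confidence gamma, without duplicates."""
    seen = set()
    results = []
    for rules_in_loop in loops:
        for antecedent, consequent, support, conf in rules_in_loop:
            key = (frozenset(antecedent), frozenset(consequent))
            if support >= theta and conf >= gamma and key not in seen:
                seen.add(key)
                results.append((key[0], key[1], support, conf))
    return results

loops = [
    [({"a"}, {"b"}, 0.12, 0.80), ({"c"}, {"d"}, 0.02, 0.90)],
    [({"a"}, {"b"}, 0.12, 0.80), ({"b"}, {"e"}, 0.15, 0.60)],
]
print(collect_rules(loops, theta=0.05, gamma=0.75))
```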
Table 11
Average precision, average recall and average number of ARs generated by the algorithms.

Dataset      Measure     SG        ARMGA                  BPSO                      QAR-CIP-NSGA-II      BitTableFI
BMS-POS      Precision   0.81      1                      0                         1                    1
             Recall      1         0.4275                 0                         0.3892               1
             Number      29.7      10.26                  0                         9.34                 24
             Parameter   /         (1.5,1.5,1,50,150)     (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
RecordLink   Precision   0.9446    1                      1                         1                    1
             Recall      1         0.1850                 0.5772                    0.0785               1
             Number      1281.64   233.7                  744.58                    94.9                 1209
             Parameter   /         (1.5,1.5,1.5,50,150)   (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
kosarak      Precision   0.7869    1                      1                         1                    1
             Recall      1         0.3100                 8.3333 × 10⁻⁴             0.2567               1
             Number      61.2      14.88                  0.04                      12.32                48
             Parameter   /         (1.5,1.5,1.5,50,150)   (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
USCensus     Precision   0.9374    1                      1                         1                    1
             Recall      1         0.2243                 6.9565 × 10⁻⁴             0.1840               1
             Number      124.08    25.8                   0.08                      21.16                115
             Parameter   /         (1,1.5,1,50,150)       (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
kddcup99     Precision   1         1                      1                         1                    1
             Recall      0.9737    0.0727                 0.1260                    0.0305               1
             Number      1890.88   141.28                 244.7                     59.14                1942
             Parameter   /         (1.5,1.5,1,50,150)     (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
PAMAP        Precision   0.8711    1                      1                         1                    1
             Recall      1         0.18                   0.0089                    0.0473               1
             Number      462.02    72.36                  3.56                      19                   402
             Parameter   /         (1.5,1,1,50,150)       (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
POWERC       Precision   1         1                      1                         1                    1
             Recall      1         0.4049                 0.9161                    0.2422               1
             Number      144       58.3                   131.92                    34.88                144
             Parameter   /         (1.5,1,1.5,50,150)     (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
webdocs      Precision   0.9618    1                      1                         1                    1
             Recall      1         0.2257                 3.1746 × 10⁻⁴             0.1276               1
             Number      132.66    28.44                  0.04                      16.08                126
             Parameter   /         (1.5,1.5,1.5,50,150)   (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
SUSY         Precision   0.9695    1                      1                         1                    1
             Recall      1         0.0578                 0.1107                    0.0122               1
             Number      6912.72   385.12                 737                       81.5                 6658
             Parameter   /         (1,1.5,1,50,150)       (0.95,0.85,0.01,50,150)   (0.05,0.1,2,50,150)  /
Fig. 14. Number of rules obtained by every algorithm from kosarak.
Fig. 15. Number of rules obtained by every algorithm from webdocs.
Table 11 reports the average number of real rules and the average precision and recall of the rules obtained by every algorithm (precision and recall are computed over rule sets, as sketched below). The real rules are obtained by BitTableFI under the parameters shown in Table 2 and 𝛾 = 0.75. For any heuristic algorithm, only the result of the parameter setting generating the largest number of rules is shown. In the second experiment, considering that heuristic algorithms return only the final population and a few highly ranked rules (Yan et al., 2005; Sarath and Ravi, 2013; Yan et al., 2009), the rules they return are not enough to fully reflect their real search ability. Therefore, they are modified in a similar way: after every loop, the rules satisfying 𝛾 are put into the result set, and duplicate rules are removed. Every algorithm runs 50 times. The datasets kosarak and webdocs are considered, whose large numbers of items can fully show the advantages of the heuristic algorithms. The histograms of the average numbers of rules with supports from 0 to 1 and confidences from 0.75 to 1 are shown in Figs. 14 and 15. For any heuristic algorithm, only the result of the parameter setting generating the largest number of rules is shown.
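The precision and recall in Table 11 are naturally read as set-based measures of the mined rules against the real rules; the following sketch reflects that reading. The rule encoding and the helper `precision_recall` are ours, not the authors' code.

```python
from typing import Set, Tuple

def precision_recall(mined: Set[str], real: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall of a mined rule set against the real rules
    (here, those produced by BitTableFI). Rules only need to be hashable;
    plain strings are used below purely for illustration."""
    if not mined or not real:
        return 0.0, 0.0
    hits = len(mined & real)
    return hits / len(mined), hits / len(real)

# Hypothetical example: 3 of the 4 mined rules are real, and 3 of the 5 real rules are found.
mined = {"a->b", "b->c", "c->d", "d->e"}
real = {"a->b", "b->c", "c->d", "x->y", "y->z"}
print(precision_recall(mined, real))   # (0.75, 0.6)
```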
According to Table 11 and Figs. 14–15, the following can be observed.
(a) The recalls of the heuristic algorithms are low, so their ability to find rules satisfying 𝜃 and 𝛾 is limited. On the other hand, heuristic algorithms are good at forming rules with high confidences but low supports.
(b) The rules obtained by different heuristic algorithms also differ from each other.
These phenomena can be explained as follows.
(a) Many rules with high confidences have low supports; such rules are local optima that can easily trap heuristic algorithms. Because 𝜃 is hard to set and rules may be missed under an inappropriate 𝜃, heuristic algorithms can serve as a supplement to find rules with supports lower than 𝜃.
(b) Different heuristic algorithms have different fitness functions and evolution mechanisms, so the rules they obtain are different.
6. Conclusions

The SG algorithm has been proposed in this study; it forms the (𝜀, 𝛿)-approximate FIs by decomposing the mining task into three subproblems and exploiting sampling and information granulation. Because of this decomposition, the algorithm avoids over-estimated sample sizes and large computing overheads. The experiments demonstrated the accuracy and efficiency of SG. Essentially, we showed that SG is faster than most sampling and information granulation based algorithms thanks to its effective data reduction. Furthermore, compared with sampling methods offering only loose guarantees, the results of SG are more reliable. Finally, compared with heuristic based algorithms, SG also has an advantage in speed; SG focuses on extracting rules with supports and confidences higher than the given 𝜃 and 𝛾, whereas heuristic algorithms focus on obtaining rules with low supports and high confidences.

Acknowledgments

Zhongjie Zhang is supported by the China Scholarship Council under Grant no. 201503170285.

References

Agrawal, R., Imieliński, T., Swami, A., 1993. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record, Vol. 22 (2). ACM, pp. 207–216.
Agrawal, R., Srikant, R., et al., 1994. Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, Vol. 1215, pp. 487–499.
Akbari, E., Dahlan, H.M., Ibrahim, R., Alizadeh, H., 2015. Hierarchical cluster ensemble selection. Eng. Appl. Artif. Intell. 39, 146–156.
Alavi, F., Hashemi, S., 2015. DFP-SEPSF: A dynamic frequent pattern tree to mine strong emerging patterns in streamwise features. Eng. Appl. Artif. Intell. 37, 54–70.
Bargiela, A., Pedrycz, W., 2012. Granular Computing: An Introduction, Vol. 717. Springer Science & Business Media.
Berengut, D., 2012. Statistics for experimenters: Design, innovation, and discovery. The American Statistician.
Brönnimann, H., Chen, B., Dash, M., Haas, P., Scheuermann, P., 2003. Efficient data reduction with EASE. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 59–68.
Chakaravarthy, V.T., Pandit, V., Sabharwal, Y., 2009. Analysis of sampling techniques for association rule mining. In: Proceedings of the 12th International Conference on Database Theory. ACM, pp. 276–283.
Chandra, B., Bhaskar, S., 2011. A new approach for generating efficient sample from market basket data. Expert Syst. Appl. 38 (3), 1321–1325.
Chen, B., Haas, P., Scheuermann, P., 2002. A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 462–468.
Chen, C., Horng, S.-J., Huang, C.-P., 2011. Locality sensitive hashing for sampling-based algorithms in association rule mining. Expert Syst. Appl. 38 (10), 12388–12397.
Chuang, K.-T., Chen, M.-S., Yang, W.-C., 2005. Progressive sampling for association rules based on sampling error estimation. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp. 505–515.
Chuang, K.-T., Huang, J.-L., Chen, M.-S., 2008. Power-law relationship and self-similarity in the itemset support distribution: analysis and applications. VLDB J. 17 (5), 1121–1141.
Deng, Z., 2016. DiffNodesets: An efficient structure for fast mining frequent itemsets. Appl. Soft Comput. 41, 214–223.
Deng, Z.-H., Lv, S.-L., 2015. PrePost+: An efficient N-lists-based algorithm for mining frequent itemsets via Children–Parent Equivalence pruning. Expert Syst. Appl. 42 (13), 5424–5432.
Deng, Z., Wang, Z., Jiang, J., 2012. A new algorithm for fast mining frequent itemsets using N-lists. Sci. China Inf. Sci. 55 (9), 2008–2030.
Dong, J., Han, M., 2007. BitTableFI: An efficient mining frequent itemsets algorithm. Knowl.-Based Syst. 20 (4), 329–335.
Han, J., Pei, J., Yin, Y., Mao, R., 2004. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Min. Knowl. Discov. 8 (1), 53–87.
Hu, X., Pedrycz, W., Wang, X., 2015. Comparative analysis of logic operators: A perspective of statistical testing and granular computing. Internat. J. Approx. Reason. 66, 73–90.
Hu, X., Yu, H., 2006. The research of sampling for mining frequent itemsets. In: International Conference on Rough Sets and Knowledge Technology. Springer, pp. 496–501.
Hwang, W., Kim, D., 2006. Improved association rule mining by modified trimming. In: The Sixth IEEE International Conference on Computer and Information Technology (CIT'06). IEEE, pp. 24–28.
Jia, C.-Y., Gao, X.-P., 2005. Multi-scaling sampling: an adaptive sampling method for discovering approximate association rules. J. Comput. Sci. Tech. 20 (3), 309–318.
Jia, C., Lu, R., 2005. Sampling ensembles for frequent patterns. In: International Conference on Fuzzy Systems and Knowledge Discovery. Springer, pp. 1197–1206.
Li, Y., Gopalan, R.P., 2004. Effective sampling for mining association rules. In: Australasian Joint Conference on Artificial Intelligence. Springer, pp. 391–401.
Mahafzah, B.A., Al-Badarneh, A.F., Zakaria, M.Z., 2009. A new sampling technique for association rule mining. J. Inf. Sci. 35 (3), 358–376.
Martín, D., Rosete, A., Alcalá-Fdez, J., Herrera, F., 2014. QAR-CIP-NSGA-II: A new multiobjective evolutionary algorithm to mine quantitative association rules. Inform. Sci. 258, 1–28.
Mohamed, M.H., Darwieesh, M.M., Ali, A.S., 2011. Advanced Matrix Algorithm (AMA): reducing number of scans for association rule generation. Int. J. Bus. Intell. Data Min. 6 (2), 202–214.
Parthasarathy, S., 2002. Efficient progressive sampling for association rules. In: 2002 IEEE International Conference on Data Mining (ICDM 2002), Proceedings. IEEE, pp. 354–361.
Pietracaprina, A., Riondato, M., Upfal, E., Vandin, F., 2010. Mining top-K frequent itemsets through progressive sampling. Data Min. Knowl. Discov. 21 (2), 310–326.
Pyun, G., Yun, U., Ryu, K.H., 2014. Efficient frequent pattern mining based on linear prefix tree. Knowl.-Based Syst. 55, 125–139.
Release, M., 2013. The MathWorks, Inc., Natick, Massachusetts, United States, 488.
Riondato, M., Upfal, E., 2014. Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ACM Trans. Knowl. Discov. Data (TKDD) 8 (4), 20.
Riondato, M., Upfal, E., 2015. Mining frequent itemsets through progressive sampling with Rademacher averages. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 1005–1014.
Sarath, K., Ravi, V., 2013. Association rule mining using binary particle swarm optimization. Eng. Appl. Artif. Intell. 26 (8), 1832–1840.
Scheffer, T., Wrobel, S., 2002. Finding the most interesting patterns in a database quickly by using sequential sampling. J. Mach. Learn. Res. 3 (Dec), 833–862.
Song, W., Yang, B., Xu, Z., 2008. Index-BitTableFI: An improved algorithm for mining frequent itemsets. Knowl.-Based Syst. 21 (6), 507–513.
Sucahyo, Y.G., Gopalan, R.P., 2004. CT-PRO: A bottom-up non recursive frequent itemset mining algorithm using compressed FP-tree data structure. In: FIMI, Vol. 4, pp. 212–223.
Toivonen, H., et al., 1996. Sampling large databases for association rules. In: VLDB, Vol. 96, pp. 134–145.
Vo, B., Hong, T.-P., Le, B., 2012. DBV-Miner: A Dynamic Bit-Vector approach for fast mining frequent closed itemsets. Expert Syst. Appl. 39 (8), 7196–7206.
Vo, B., Le, T., Coenen, F., Hong, T., 2016. Mining frequent itemsets using the N-list and subsume concepts. Int. J. Mach. Learn. Cybern. 7, 253–265.
Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y., 2014. Intel math kernel library. In: High-Performance Computing on the Intel® Xeon Phi. Springer, pp. 167–188.
Yan, X., Zhang, C., Zhang, S., 2005. ARMGA: identifying interesting association rules with genetic algorithms. Appl. Artif. Intell. 19 (7), 677–689.
Yan, X., Zhang, C., Zhang, S., 2009. Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support. Expert Syst. Appl. 36 (2), 3066–3076.
Yao, Y., 2007. Decision-theoretic rough set models. In: International Conference on Rough Sets and Knowledge Technology. Springer, pp. 1–12.
Zhang, C., Zhang, S., Webb, G.I., 2003. Identifying approximate itemsets of interest in large databases. Appl. Intell. 18 (1), 91–104.
Zhang, Z.-j., Huang, J., Wei, Y., 2015. FI-FG: Frequent item sets mining from datasets with high number of transactions by granular computing and fuzzy set theory. Math. Probl. Eng. 2015.
Zhao, Y., Zhang, C., Zhang, S., 2006. Efficient frequent itemsets mining by sampling. In: AMT, pp. 112–117.