Information Processing Letters 90 (2004) 65–72 www.elsevier.com/locate/ipl

Efficient automatic discovery of ‘hot’ itemsets

Ioannis N. Kouris a,b,∗, Christos H. Makris a,b, Athanasios K. Tsakalidis a,b

a University of Patras, School of Engineering, Department of Computer Engineering and Informatics, 26500 Patras, Greece
b Computer Technology Institute, PO Box 1192, 26110 Patras, Greece

Received 23 July 2003; communicated by F. Dehne
doi:10.1016/j.ipl.2004.01.013

Abstract

In real-life applications the dominant single-support model, which assumes all itemsets to be of the same nature and importance, has proved defective. The non-homogeneity of the itemsets on one hand, and the non-uniformity of their numbers of appearances on the other, require different approaches. Some techniques have been proposed so far to address these inefficiencies, but new, more demanding questions then arose: which itemsets are more interesting than others, what distinguishes them and how should they be identified, and how should they be handled effectively? Furthermore, one common drawback of all approaches is that they have a tremendous lag in discovering new relationships and work only with long-existing relationships or patterns. We propose a method that finds what we define as ‘hot’ itemsets in our database, deals with all the problems described above, and yet proves very efficient.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Data mining; Association rules; Hot itemsets; Databases; Algorithms

1. Introduction

Since its first introduction in [1], the task of association rule mining has been one of the most popular and well-studied applications in data mining. A formal description of the problem is as follows. Let I = {i1, i2, ..., im} be a set of items. Let T be a set of transactions (the database), where each transaction t is a set of items such that t ⊆ I. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X → Y holds in the transaction set T with confidence c if c% of the transactions in T that support X also support Y. The rule has support s in T if s% of the transactions in T contain X ∪ Y. Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf). An association mining algorithm works in two steps:

1. Generate all large itemsets that satisfy minsup.
2. Generate all association rules that satisfy minconf, using the large itemsets.
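To make these definitions concrete, the following short sketch (our illustration, not part of the original paper; the transactions, item names and thresholds are invented) computes the support and confidence of a candidate rule in Python:

# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def support(itemset, T):
    # Fraction of transactions in T that contain every item of itemset.
    return sum(1 for t in T if itemset <= t) / len(T)

def confidence(X, Y, T):
    # Fraction of transactions containing X that also contain Y.
    return support(X | Y, T) / support(X, T)

# Rule {bread} -> {milk}: support 0.5 and confidence 2/3 on the data above.
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))

A rule would be reported only if both values exceed the user-specified minsup and minconf.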


Clearly the most demanding step is to find all frequent itemsets, since once this step has been completed the generation of all association rules is fairly straightforward. The prototypical application of this task has been market basket analysis, but the model is not limited to it, since it can be applied to many other domains and data types such as text documents, census data, telecommunication data, etc. In fact any data set consisting of “baskets” containing multiple “items” fits this model. All subsequent works (e.g., [3,11,12,14]) tried to improve various aspects of the best-known strategy for the task of association rule mining, called Apriori [1]. However, most works thus far assumed all itemsets to be of the same importance and relied on the model of one single support for all itemsets in the database. In real-life applications, though, this model proves inadequate, since the frequencies of appearance of the itemsets in a database, as well as the itemsets themselves, are far from uniform. For example, there are many cases where non-frequent itemsets are more interesting than frequent ones (e.g., products that are considered loss leaders or are in the process of market testing [6]), or the nature and characteristics of the items themselves are too diverse to be dealt with as a whole. Some approaches have been proposed in order to solve these problems, but they ignored some very important parameters. More specifically, they left unanswered the question of which itemsets are more important or more interesting than others and what practically distinguishes them, they bypassed the most important task of identifying all those interesting itemsets, and finally they ignored the problem of how to handle the interesting itemsets effectively. We propose an approach that solves all these problems and yet proves very efficient.

This paper is organized as follows. In Section 2 we give a brief overview of previous work. In Section 3 we introduce a new problem in the task of association rule mining, and in Section 4 we sketch our approach. Finally, in Section 5 we present some experimental results.

2. Previous work

A situation similar to the one presented above was described in [10], formally known as the rare itemset dilemma. More specifically, suppose we have the case where some itemsets hardly appear in our database, but cannot nevertheless be considered useless (e.g., they generate more profit per item, or are more durable and thus are bought less frequently). On the other hand, let there be itemsets that appear very frequently in the data, at a very high percentage. How can one include those infrequent itemsets with the frequent ones, especially if overall the frequencies vary a lot? One is then confronted with the following dilemma: if the overall minsup is set too high, we eliminate all those itemsets that are infrequent; if we set the overall minsup too low, in order to also find the infrequent itemsets, we will almost certainly cause a combinatorial explosion in the number of candidate and large itemsets generated.

Various approaches and algorithms have been proposed to deal with this problem. One is to split the data into a few blocks according to the frequencies of the items and then mine for rules in each distinct block with a different minimum support [8]. A second approach is to group a number of related rare items together into a higher-order item, so that this higher-order item is more frequent [8,4,13]; a similar idea was used in [5]. However, the first approach fails to produce, in a straightforward manner, rules with items across different blocks, while the second fails to produce rules involving the individual rare items that form the higher-order itemset. Liu et al. [9] proposed an algorithm called MSApriori, which was based on Apriori and could find rules among items with different supports without falling into the pitfall described in [10]. The intuition behind this algorithm is that not all itemsets are of the same importance, and so they should receive different support values, called MIS values, according to how important they are considered. Thus an important but rare itemset should receive a lower support value, in order to finally be included in the large itemsets. Having a different support value for every itemset, however, means that the downward closure property, which in essence holds the key to pruning candidate itemsets that are bound not to be large, no longer holds. This problem was overcome by introducing a new property, called the sorted closure property, where all itemsets are sorted in ascending order according to their MIS values (see [9] for more details). The performance of this work was improved in [7], but the main idea remained the same. In our opinion, the work in [9] addresses this problem more effectively than all the other approaches. However, this approach too suffers from some serious drawbacks, which make it rather impractical.

2.1. Algorithm MSApriori

The first problem with the approach in [9] is that it assigns MIS values to all itemsets in a rather arbitrary way, and not by taking into account the real significance of an itemset. What it actually does is use the following formula in order to assign MIS values to all the itemsets:

    MIS(i) = M(i)   if M(i) > LS,
    MIS(i) = LS     otherwise,      where M(i) = β · f(i),

where f(i) is the actual frequency of an item in the dataset, LS is a user-defined lowest minimum item support allowed, and β is a parameter that controls how the MIS values should be related to the item frequencies. If β is set equal to 0, algorithm MSApriori reduces to any algorithm using a single support value. If on the other hand β is set close to 1, every itemset with a number of appearances above the user-specified lowest minimum support LS is considered significant and is taken into account, with all other itemsets practically discarded (since they cannot possibly be large). Consequently, in both cases the algorithm works like any algorithm using a single support, the only difference in the latter case being that every itemset with count above the lowest minimum support has its own MIS value. Practically all itemsets are considered of the same importance, with the only thing distinguishing them being their number of appearances. Whether an itemset should receive some extra attention is not taken into account in any way. Secondly, with this approach the only way one can predefine an MIS value for an itemset without using the formula above is simply by guessing it. So if at the end of the algorithm we are not satisfied with the final output (i.e., the specific itemsets are not among the large ones), we have to run the algorithm again and again, guessing new MIS values each time until we are finally satisfied, thus wasting both time and resources. Last but not least, the specific approach assumed that we know in advance which the interesting itemsets are. The user is expected, in some unknown way, to know which itemsets interest him most and which he should take into special consideration. It did not propose a concrete method for discovering the interesting itemsets, thus leaving the users completely unassisted.
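To make the formula above concrete, the following small sketch (our own illustration, not code from [9]; the frequencies and parameter values are invented) shows how MIS values would be derived from item frequencies given β and LS:

def assign_mis(frequencies, beta, LS):
    # MIS(i) = beta * f(i) if that value exceeds LS, otherwise LS.
    mis = {}
    for item, f in frequencies.items():
        m = beta * f
        mis[item] = m if m > LS else LS
    return mis

freqs = {"A": 0.20, "B": 0.05, "C": 0.005}   # hypothetical item frequencies
print(assign_mis(freqs, beta=0.5, LS=0.01))  # {'A': 0.1, 'B': 0.025, 'C': 0.01}

With beta = 0 every item receives LS, and with beta close to 1 every item above LS essentially keeps its own (scaled) frequency, which is exactly the behaviour criticized above.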

3. Discovering emerging trends

The idea behind association rule mining is to search a considerable amount of data collected over a long period, and to apply various techniques to discover some more or less unseen knowledge hidden inside the data. Another big problem of the approaches used up to now is that they discover long-existing relations rather than emerging ones. All approaches up to now followed, rather than kept up with, the sales or, more generally, the appearances of the itemsets. We need an approach that finds emerging trends in the bud, along with the long-established ones. This situation is best explained with the example below. Imagine a product that has been on sale for over a year with moderate sales, and a product that just entered the market (e.g., has been on sale for about a month) but has tremendous sales. If we apply the classical statistical model of association rule mining, where we simply measure the number of appearances of every itemset and consider it large if this number is above a user-specified threshold, then we will certainly miss the new product. A product that has been on sale for so little time cannot practically come even near the sales of a product that has been on sale for so long. So we must either wait long enough for the new product to accumulate enough sales, which could well take months, or find a way to effectively take it into consideration right from the beginning. The same situation can occur with products that were on sale with very low sales and began to present abnormally high sales because they are currently under heavy promotion, they suddenly came into fashion, some external circumstances or factors (e.g., weather conditions) promoted their sales or, more generally, they present a highly seasonal behavior. This situation is very common, especially at retail stores, where the itemsets on sale present such behaviors. After all, an itemset having that many sales over so short a period must indeed be very interesting. In the next section we see how these itemsets are handled.


3.1. ‘Hot’ itemsets

We call the itemsets that present very high sales in a certain period of time ‘hot’ itemsets. More formally, for every 1-itemset we calculate what we call the interest ratio, defined as the number of sales of the itemset in the last period of sales (i.e., since the last time we ran our algorithm) divided by the mean number of sales of all itemsets in the same period:

    ir_i = sales_i / mean_sales,

where sales_i is the number of sales of itemset i in the period and mean_sales is the mean number of sales of all itemsets in the same period. Every itemset whose interest ratio is above a user-defined threshold, called the minimum interest threshold, is considered hot for the specific period. Of course, if an itemset has a number of sales above the support threshold, it is treated as a large itemset. In essence we are searching for itemsets that have sales below the support threshold but whose interest ratio is above the minimum interest threshold. The user has of course the option of giving that threshold any value, depending on the desired output.

A logical question is what happens with a product that was hot in the previous period but is no longer hot or large in the next one. One option would be to treat these itemsets as small itemsets in the new period, since they are obviously no longer interesting for the users. Another would be to give these itemsets a grace period and treat them as hot, to see if they will come back to high sales. Either one is possible and acceptable, and the choice depends solely on the needs of the data miner. One could claim, though, that we wrongly considered such itemsets as interesting, since their subsequent trend showed that they are no longer ‘hot’. Nevertheless, we managed to identify immediately the period in which they became interesting, took them into consideration and promoted their sales while they were actually very interesting, and this was exactly our goal. If on the other hand a hot itemset becomes large in the next period, then we have managed to predict its future performance early enough and to take it into consideration in advance, rather than having to wait for it to actually become large.

The final question is how often we have to run our algorithm over the data. As we understand from the definitions above, the period at which we run our algorithm influences both the quantity and the nature of the itemsets found. According to the work in [11], in order to draw some meaningful conclusions from the data in our database we have to wait a considerable time until enough data is gathered. This time can vary from enterprise to enterprise and from sector to sector. For example, a multinational company assembling data from various stores in many countries, a very popular web store (e.g., Amazon or eBay) or some industry sectors such as telecommunications could very well run a data mining algorithm once every one or two days. Others, not so popular, may need a period of a month or so. So the period at which we run our algorithm is a completely subjective decision based on the application and the experience and needs of the data miner, and is no different than if we were using any other classical algorithm. Therefore the automatic choice of this period is not viable.
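Under the definitions of Section 3.1, the classification of 1-itemsets could be sketched as follows (a rough Python illustration under our own assumptions; the sales counts and thresholds are invented, and the support threshold is expressed as an absolute count):

def classify(period_sales, min_sup_count, min_interest):
    # period_sales: number of sales of each item in the current period,
    # i.e., since the last run of the algorithm.
    mean_sales = sum(period_sales.values()) / len(period_sales)
    labels = {}
    for item, s in period_sales.items():
        if s >= min_sup_count:
            labels[item] = "large"
        elif s / mean_sales >= min_interest:   # interest ratio ir_i
            labels[item] = "hot"
        else:
            labels[item] = "small"
    return labels

sales = {"new_product": 90, "staple": 400, "slow_item": 10}
print(classify(sales, min_sup_count=300, min_interest=0.5))
# {'new_product': 'hot', 'staple': 'large', 'slow_item': 'small'}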

4. The proposed approach

The main idea of our approach is to monitor all itemsets for abnormal behavior from the last time we ran our algorithm until the next run. We begin by counting the number of appearances of the 1-items (Fig. 1, lines 4–8).

1)  Database = set of transactions; Items = set of items;
2)  Transaction = (TID, {x | x ∈ Items});
3)
4)  foreach transaction t ∈ D do begin
5)      foreach item x in t do begin
6)          x.count++;
7)      end
8)  end
9)  // Categorize all 1-items according to their number of appearances
10) foreach item x ∈ Items do
11)     if x.count ≥ min_sup then x = large;
12)     elseif x.count ≥ ir then x = hot;
13)     else x = small;
14)     end
15) end
16) // Assign MIS values to all 1-items
17) foreach 1-item x ∈ large_or_hot do
18)     if x ∈ large then MIS(x) = min_sup
19)     else MIS(x) = f(x)
20) end
21) H1 = {large and hot 1-items};
22) M = sort(H1, MIS);
23) L1 = {sorted large and hot 1-items};
24) for (k = 2; L_{k-1} ≠ ∅; k++) do
25)     Ck = candidate_gen(L_{k-1});   // create new candidates
26)     foreach transaction t ∈ D do begin
27)         Ct = subset(Ck, t);
28)         foreach candidate x ∈ Ct do x.count++;
29)         end
30)     end
31)     Lk = {x ∈ Ck | x.count ≥ MIS(c[1])};
32) end
33) Answer = ∪k Lk;

Fig. 1. Main program of our algorithm.

After all 1-items have been counted, we can identify three types of itemsets according to their number of appearances.

• Large 1-itemsets: itemsets with a number of appearances above the user-specified minimum support threshold.
• Hot 1-itemsets: itemsets with a number of appearances below the support threshold, but with an interest ratio above the minimum interest threshold.
• Small 1-itemsets: itemsets with a number of appearances below the minimum support threshold and with an interest ratio below the minimum interest threshold.

First we identify all large 1-itemsets, by comparing their appearances to the support threshold. Then we find all hot 1-itemsets, by taking the difference between the previous and the new sales to obtain the number of appearances in the current period. The remaining ones are automatically characterized as small 1-itemsets (Fig. 1, lines 10–15). The next step is the assignment of MIS values to all 1-itemsets, according to the following formulas:

    MIS(i) = f(i)        if i is hot,
    MIS(i) = min_sup     if i is large.

Every large itemset is assigned the support threshold as its MIS value. The hot itemsets take their exact number of appearances as their MIS value. The remaining itemsets are simply discarded (Fig. 1, lines 17–20). Having the MIS values of all 1-itemsets, we can sort them in ascending order according to these values, so that they satisfy the sorted closure property (Fig. 1, line 22). We put first the hot 1-item with the lowest MIS value, and subsequently every hot 1-item with a higher MIS value. After all hot 1-items have been sorted, we put the large 1-items. These items constitute the set L1, the sorted set of large and hot 1-items (Fig. 1, line 23). As one can easily see, all ‘hot’ 1-itemsets come first in the MIS ordering, since their corresponding MIS values are always below the minimum support, and all large itemsets follow. From that point on, our approach performs three operations in each subsequent pass over the data (Fig. 1, lines 24–32). First the candidate itemsets Ck are generated, using the large itemsets found in the previous pass (k − 1). Then the database is scanned again and the support of every itemset in Ck is found. Finally the large k-itemsets are identified and used to form the set Lk.


The algorithm terminates upon the absence of any candidate itemsets. We do not search for ‘hot’ itemsets of higher order (i.e., k-itemsets with k ≥ 2), but only for large ones, since any k-itemset has all of its subsets large or hot. This means that a k-itemset can have at most as many sales as the least frequent itemset constituting it, and the MIS value assigned to it (the lowest MIS value among all the itemsets constituting it) is the lowest possible MIS value it can receive. So there is no way that a k-itemset not found large could be hot.
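The first pass of Fig. 1 (counting, categorization, MIS assignment and sorting, lines 4–23) could be rendered in Python roughly as follows; this is our own sketch under simplifying assumptions (in-memory transactions, the support threshold given as an absolute count), and the level-wise candidate generation of lines 24–32 is omitted for brevity:

from collections import Counter

def first_pass(transactions, min_sup_count, min_interest):
    # Lines 4-8: count the appearances of every 1-item.
    counts = Counter()
    for t in transactions:
        counts.update(t)

    # Lines 10-20: categorize the 1-items and assign MIS values.
    mean_sales = sum(counts.values()) / len(counts)
    mis = {}
    for item, c in counts.items():
        if c >= min_sup_count:
            mis[item] = min_sup_count          # large items get the support threshold
        elif c / mean_sales >= min_interest:
            mis[item] = c                      # hot items get their own count
        # all other items are small and are discarded

    # Lines 21-23: sort by MIS value; hot items come first,
    # since their MIS values are below the minimum support.
    L1 = sorted(mis, key=lambda x: (mis[x], x))
    return L1, mis

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"d"}]
L1, mis = first_pass(transactions, min_sup_count=3, min_interest=0.8)
print(L1)   # ['b', 'c', 'a'] -- the hot items b and c precede the large item a

The subsequent passes then proceed as in an Apriori-style level-wise search, with the count of each candidate k-itemset compared against the MIS value of its first (lowest-MIS) item.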

5. Experimental results

In this section we present the results of the experiments we performed in order to test the performance of our approach. The datasets used were synthetic, created with the synthetic data generator available from IBM (http://www.almaden.ibm.com/cs/quest/syndata.html). This generator, which is said to simulate the buying behavior of customers in retail business, has become the de facto benchmark for approaches in the field of association rule mining.

5.1. Synthetic data

The generation of synthetic data and the operation of the generator are well documented and explained in [2]. As an example of the naming convention, dataset “T15.I4.D100K” refers to a dataset with an average transaction size of 15 items, an average size of the maximal potentially frequent itemsets of 4, and a total of 100 000 transactions. The parameters used are shown in Table 1.

Table 1
Parameters used for generating synthetic datasets
|D|   Number of transactions
|T|   Average size of the transactions
|I|   Average size of maximal potentially frequent itemsets
|L|   Number of maximal potentially frequent itemsets
N     Number of items

The datasets created this way look like text files, where each line contains some itemsets and represents a transaction made by a customer. Nevertheless, as discussed earlier, our algorithm tries to find the large as well as the hot itemsets since the last time it was run. So if a dataset created with this generator represents the transactions made during a specific period of time, we must find a way to represent the sales in the new period. Consequently, we use the following data generation process. First we create a dataset with the synthetic data generator and suppose that it represents the sales of all itemsets up to now. To simulate the sales of all itemsets in the new period (i.e., when we run our algorithm again), we employ a random number generator that assigns a positive integer to every itemset, representing its new sales. Since we know the number of sales of every itemset in the alleged previous periods, we limit the possible number of sales of every itemset between the numbers of sales of the least and most frequent itemsets, reduced by a factor of 10. In other words, if the most popular itemset in the initial dataset had 1000 sales, then no itemset in the new period can have more than 100 sales, and none fewer than 10. We generated 5 datasets in total, which are shown in Table 2.

Table 2
Synthetic datasets
Name            |T|   |I|   |D|     Size in megabytes
T10.I2.D100K    10    2     100K    4.4
T10.I4.D100K    10    4     100K
T20.I2.D100K    20    2     100K    8.4
T20.I4.D100K    20    4     100K
T20.I6.D100K    20    6     100K
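The new-period simulation described above might be implemented as in the following sketch (our own illustration; it simply applies the stated bounds, i.e., the previous minimum and maximum sales divided by 10):

import random

def simulate_new_period(previous_sales, factor=10):
    # Assign every item a random sales count for the new period, bounded by the
    # sales of the previously least and most frequent items divided by `factor`.
    lo = max(min(previous_sales.values()) // factor, 1)
    hi = max(max(previous_sales.values()) // factor, 1)
    return {item: random.randint(lo, hi) for item in previous_sales}

prev = {"item1": 1000, "item2": 100, "item3": 450}   # hypothetical previous-period totals
print(simulate_new_period(prev))                     # every value lies between 10 and 100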

5.2. Execution times

Our algorithm presented only slightly higher execution times than algorithm MSApriori. This was mainly because, unlike MSApriori, which has all itemsets sorted according to their support values before it actually starts, our approach does not sort the 1-items until it has actually counted them, and this procedure adds to the total execution time. Of course, if we excluded this step from the calculation of the total execution time, in order to give our algorithm a fairer comparison, the total time would have been almost equal to the time needed by the other algorithms. Another thing that increases the total execution time is the additional comparisons that have to be made after the 1-itemsets have been counted. All algorithms up to now checked only whether the support of every 1-item was above the minimum support threshold, and consequently characterized it as large or small. In our case, even if an itemset has a number of sales below the minimum support threshold, we still have to check its interest ratio. Since this happens only with the 1-items, which usually range up to a few thousand in a typical retail store, the cost is minor. Finally, in its quest for hot itemsets our algorithm generates slightly more candidate and large itemsets than other algorithms. The relative times of the two algorithms are shown in Fig. 2.

Fig. 2. Comparison of execution times.

6. Conclusions and future work

In this paper we dealt with a tantalizing open problem, namely the automatic discovery and handling of important itemsets. Furthermore, we gave a different dimension to the mining process, in which we kept up with the tendencies of the various itemsets rather than simply following their past behavior. Our final proposition not only addresses a series of problems, but manages to present only slightly larger execution times than methods that do not address these problems at all.

References

[1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data, 1993, pp. 207–216.
[2] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proc. of the 20th Internat. Conf. on Very Large Data Bases, 1994, pp. 487–499.
[3] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data, 1997, pp. 255–264.
[4] J. Han, Y. Fu, Discovery of multiple-level association rules from large databases, in: Proc. of Internat. Conf. on Very Large Databases, 1995.
[5] W. Kim, Introduction to Object-Oriented Databases, MIT Press, Cambridge, MA, 1990.
[6] P. Kotler, Marketing Management, tenth ed., Prentice-Hall, Upper Saddle River, NJ, 2000.


[7] I.N. Kouris, C.H. Makris, A.K. Tsakalidis, An improved algorithm for mining association rules using multiple support values, in: Proc. of FLAIRS Internat. Conf., St. Augustine, FL, 2003.
[8] W. Lee, S.J. Stolfo, K.W. Mok, Mining audit data to build intrusion detection models, in: Proc. ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 1998.
[9] B. Liu, W. Hsu, Y. Ma, Mining association rules with multiple minimum supports, in: Proc. ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 1999.
[10] H. Mannila, Database methods for data mining, in: Proc. ACM SIGKDD Conference on Knowledge Discovery & Data Mining, New York, 1998 (tutorial).

[11] J.S. Park, M.S. Chen, P.S. Yu, An effective hash based algorithm for mining association rules, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data, 1995, pp. 175–186.
[12] A. Savasere, E. Omiecinski, S.B. Navathe, An efficient algorithm for mining association rules in large databases, in: Proc. of 21st Internat. Conf. on Very Large Databases, Zurich, Switzerland, 1995, pp. 432–444.
[13] R. Srikant, R. Agrawal, Mining generalized association rules, in: Proc. of 21st Internat. Conf. on Very Large Databases, 1995, pp. 407–419.
[14] H. Toivonen, Sampling large databases for association rules, in: Proc. of 22nd Internat. Conf. on Very Large Data Bases, Mumbai (Bombay), India, 1996, pp. 134–145.