Data & Knowledge Engineering 52 (2005) 353–383 www.elsevier.com/locate/datak
Using Information Retrieval techniques for supporting data mining

Ioannis N. Kouris a,b,*, Christos H. Makris a,c, Athanasios K. Tsakalidis a,b

a Department of Computer Engineering and Informatics, School of Engineering, University of Patras, 26500 Patras, Hellas, Greece
b Computer Technology Institute, P.O. Box 1192, 26110 Patras, Hellas, Greece
c Department of Applied Informatics in Management and Finance, Technological Educational Institute of Mesolonghi, Hellas, Greece

Received 5 May 2004; accepted 21 July 2004; available online 21 August 2004
Abstract

The classic two-step approach of the Apriori algorithm and its descendants, which consists of finding all large itemsets and then using these itemsets to generate all association rules, has worked well for certain categories of data. Nevertheless, for many other data types this approach shows highly degraded performance and proves rather inefficient. We argue that we need not search the whole space of candidate itemsets, but should rather let the database unveil its secrets as the customers use it. We propose a system that does not merely scan all possible combinations of the itemsets, but rather acts like a search engine specifically implemented for making recommendations to the customers, using techniques borrowed from Information Retrieval.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Knowledge discovery; E-commerce; Itemsets recommendations; Indexing; Boolean-ranked queries
* Corresponding author. Tel.: +30 6945165529; fax: +30 2610429297.
E-mail addresses: [email protected] (I.N. Kouris), [email protected] (C.H. Makris), tsak@ceid.upatras.gr (A.K. Tsakalidis).
0169-023X/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2004.07.004
1. Introduction

One of the most well studied tasks in KDD is the discovery of association rules. The prototypical application of this task, first introduced in [4], was the analysis of supermarket sales or basket data, where it is analyzed how the items purchased by customers are associated. An example of an association rule is the following: bread → cheese [sup = 20%, conf = 70%]. This rule says that 20% of customers buy bread and cheese together, and those who buy bread also buy cheese 70% of the time.

A formal description of the problem is as follows. Let I = {i1, i2, ..., im} be a set of items. Let T be a set of transactions (the database), where each transaction t is a set of items such that t ⊆ I. An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The rule X → Y holds in the transaction set T with confidence c if c% of the transactions in T that support X also support Y. The rule has support s in T if s% of the transactions in T contain X ∪ Y. Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf). An association mining algorithm works in two steps:

1. Generate all large itemsets that satisfy minsup.
2. Generate all association rules that satisfy minconf using the large itemsets.

The task of association rule mining can be applied to various other data types, such as text documents, census data, telecommunication data etc. In fact any data set consisting of "baskets" containing multiple "items" fits this model.

In the many works that followed [10,27,33,35], researchers have tried to improve various aspects (like the number of passes made over the data or the efficiency of those passes) of the best-known strategy for the task of association rule mining, called Apriori [5]. However, the two-step model described above has always remained more or less the same, i.e., finding all rules that satisfy user-specified minimum support and minimum confidence constraints. In real life applications though, and especially for data other than that used for classic market basket analysis (e.g. highly correlated data like census data, or data with multiple non-binary attributes), going through the whole database multiple times and searching and counting all possible combinations of items may not be as practical as it seems. As an example, the Cover Type dataset from the UCI repository (http://kdd.ics.uci.edu/), with only 120 1-itemsets but with most of them appearing almost all the time, results in about 15 million large itemsets. Using algorithm Apriori just to find all these large itemsets requires more than 96 CPU hours [37]. So we see that with the classic model the identification alone of large itemsets is rather impractical. If one also takes into consideration the repetitive nature of the knowledge discovery process, where many runs of the same algorithm would have to be made before fine-tuning the required parameters (minimum support and minimum confidence) and getting the desired output, then the whole situation becomes dramatic.
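As a concrete illustration of the support and confidence measures defined above, the following minimal Python sketch computes them for a rule over a toy basket list. The data and function names are ours, invented purely for illustration, and are not part of the system described in this paper.

```python
# Minimal sketch (toy data) of the definitions above:
# sup(X -> Y) = fraction of transactions containing X u Y,
# conf(X -> Y) = sup(X u Y) / sup(X).
transactions = [
    {"bread", "cheese"}, {"bread", "milk"}, {"bread", "cheese", "wine"},
    {"milk"}, {"bread", "cheese"}, {"wine", "cheese"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

X, Y = {"bread"}, {"cheese"}
print(support(X | Y, transactions))   # support of the rule bread -> cheese
print(confidence(X, Y, transactions)) # confidence of the rule bread -> cheese
```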
On the other hand, such big numbers of large itemsets result in a degraded and problematic rule generation process (the second step). First of all, handling and using all those itemsets becomes a complicated and especially resource consuming task. But even after all the association rules have been generated we still face problems with the number of discovered rules. They are practically so many that we would have to mine the discovered rules themselves, or find ways to reduce them and be left with the most interesting ones [3,8,26,36].

Another major drawback of the traditional algorithms lies in the pruning technique. According to this technique, if an itemset is found to be non-frequent then all the higher order itemsets that include it are bound to be non-frequent too, so we simply forget everything that has to do with the specific itemset. But what happens if a customer begins a transaction with, or puts in his basket very early, a non-frequent itemset? Then this customer, with this specific transaction, is doomed not to get any suggestions at all: whatever he or she buys from the moment a non-frequent itemset was chosen is going to be non-frequent too. But we would most certainly prefer to propose to a customer a not-so-frequent item, even if with the traditional algorithms it falls below the specified minimum support, than not to propose anything at all. Our proposed algorithm overcomes this problem in that it does not use any measure for pruning itemsets that are not so frequent. All it does is take input from the customers and create its own suggestions on the fly, ignoring any pruning technique.

Finally, this two-step model is correct if considered in a strict statistical manner, but fails to approach and address the semantics of the appearances of the items in the various transactions as well as their inherent natures. In other words, it is one thing if an itemset appears ten times in the same transaction and another if it only appears once or twice. Also, it is not the same if a very important itemset appears in a transaction and if a not so important one appears in the same transaction. As a consequence, apart from the total number of appearances of the itemsets, some other very important parameters must also be taken into consideration. All these requirements and implications are taken into consideration and addressed in the complete system we present in this work.

The rest of this paper is organized as follows. We begin by giving some related work as well as an overall description of our proposed system in Sections 2 and 3 respectively. In Section 4 we give some preliminaries regarding our solution. Then in Section 5 we give a very brief overview of the first function of our system, as a classic data mining algorithm. In Section 6 we present in detail the second function of our system, as a recommendation engine using either Boolean or ranked queries, and present all possible services provided by it. In Section 7 we report some evaluation results from the tests we performed, and we conclude with Section 8 where we report our findings as well as future work.
2. Related works

The system that we propose in the current work utilizes methods and techniques from Information Retrieval in order to assist data mining functions. To our knowledge no work has been published that proposes a similar system, and most of the techniques and functions proposed here are novel even for classic data mining. Only some functions of our system present similarities to collaborative filtering systems.
Collaborative filtering systems attempt to offer personalized recommendations of items (usually news articles, movies and music) based on information about similarities among users' tastes. Some of the best known collaborative filtering systems up to now are Tapestry [19], the GroupLens research system [24,30], the Ringo system [34] and Video Recommender [22] (for a more comprehensive overview of such systems readers are referred to [31]). A collaborative filtering process can be divided into three sub-tasks, namely representation, neighborhood formation and recommendation. In the representation task we try to model the items associated with each user (i.e. practically to create a user profile for every existing user). In the neighborhood formation task we try to find the users that appear as the best matches for every new user. Finally, the recommendation task deals with proposing the most relevant items from the users that best match our new user.

Usually collaborative filtering systems are categorized, according to how feedback is obtained from the users, into implicit and explicit voting. In explicit voting the system asks the users to provide a score or a rating about what they have just been presented with, whereas in implicit voting the system tries to observe or to discover some evidence about the users' interests based on various factors. Explicit voting is thus far more accurate than implicit voting, since it requires a conscious rating from the users, whereas implicit voting is based on factors that might not be that accurate. On the other hand, the fact that explicit voting requires ratings about everything a user is presented with can be very annoying and can have various unwanted "side effects". Instead of the explicit–implicit categorization we prefer an alternative one, based on whether systems are memory-based or model-based. Memory-based systems use the entire available user database to make predictions, whereas model-based ones use the user database only to construct a model and then use this model, rather than the entire database, to make predictions. In a memory-based system the predictions about a user's preferences are made using various measures, such as the Pearson correlation coefficient or the cosine measure. Various enhancements have been made to the memory-based methods, the best known of which are default voting, case amplification [9] and inverse user frequency [32]. Model-based techniques on the other hand use techniques such as Bayesian networks or singular value decomposition to build the model which will then be used for making predictions. Both kinds of systems present more or less the same predictive accuracy in their proposals, but as expected have their corresponding advantages and disadvantages. Memory-based systems manage to be always up to date and to react immediately to any changes to the user database, at the cost though of requiring immense memory and CPU resources, sometimes resulting in a rather sluggish system. Model-based systems on the contrary have a more guaranteed behavior irrespective of the database used, but have a tremendous lag in taking into consideration any changes made to the user database. Last but not least are the systems that have the same function as collaborative filtering systems but use quite different approaches in achieving their goals [2,6,14,15,21]. For a more detailed look at collaborative filtering systems in general readers are referred to [1,7,21].
As noted above, some functions of our system resemble a collaborative filtering system, and more specifically a memory-based, implicit voting one. Nevertheless these functions present fundamental differences to all collaborative filtering approaches proposed up to now, mainly due to the way they try to find the items that will constitute the best possible proposal as well as due to the logic they use. First of all, unlike collaborative filtering systems, where the system generates user profiles from the existing data and then tries to match the profile of any new customer to the existing ones, our system merely tries to match any new transaction to the existing ones.
That way the problem of ever increasing user profiles, which sometimes also do not reflect the current trends of the users, is avoided. Also, little or no work has been done on systems and environments selling a wide range of dissimilar products, such as retail stores; the collaborative systems studied up to now were mostly focused on systems with homogeneous items. Our work also takes into consideration the special variables (in our case mostly microeconomic) of the environment our system is applied to. Finally, by making use of well studied and established methods borrowed from Information Retrieval, our system manages to overcome major problems faced by collaborative systems, such as the scalability, performance and memory bottlenecks faced by memory-based systems as well as the inherent static nature of model-based systems.
3. The proposed system

The system we introduce takes as input the data from our database, generates the corresponding index and subsequently is ready to be used in a variety of ways. The specific system is a complete solution that has essentially two modules (Fig. 1). The first one, which we will refer to as the batch module, works exactly like the classic level-wise algorithms (e.g. Apriori) by finding all frequent itemsets and producing all possible rules. The second one, which we will refer to as the interactive module, has a more interactive and on-line behavior (see Section 6). Both modules have as their base an index and work using queries (Boolean or ranked), techniques widely used in Information Retrieval. Especially the second module offers services that deviate from the classic scenario used by all other approaches. More specifically, instead of going through all the items and their possible combinations and evaluating them using one measure or another, it makes recommendations based on an approach similar to that used when issuing queries to search engines. In practice it tries to answer the following question: suppose we have a database containing past transactions, and a customer who makes a new transaction. Which transaction(s) best approach the transaction of the specific customer, based on the itemsets already in his basket?

Fig. 1. The function of our system.
4. Preliminaries

Before continuing with the presentation of our solution we must first give some preliminaries required by the approaches we propose. In this section we present the indexing techniques we used for our system, the way Boolean queries are resolved, and finally the function of the vector space model.

4.1. Indexing

In order to address the issue of how the data inside a database should be organized so that the various queries can be resolved both efficiently and effectively, especially in an online approach, we adopted techniques used in classic Information Retrieval theory. More specifically we used the notion of an index and the associated indexing techniques. When referring to indexes and indexing techniques we are referring to a mechanism for locating a given itemset inside the data (in a classic Information Retrieval system we would search for a term inside the text). This can be achieved in various ways, using methods like signature files, bitmaps and inverted files. We used inverted files, as for most cases they are the best way of achieving this. For more details readers are referred to [38].

4.1.1. Inverted file indexing

In order to create an inverted file index we first have to create the corresponding lexicon for our dataset. The lexicon is nothing else but a list of all items that appear in the database. In our case the lexicon is already there, since we know in advance all the products that are on sale, and there is no way that an item not contained in the lexicon is present in the database. What the lexicon actually does is map items to their corresponding inverted lists, and in its simplest form it is a list, probably ordered, containing strings and disk or main memory addresses. An inverted file index also contains, for each item in the lexicon, an inverted list that stores a list of pointers to all occurrences of that item in the database, where each pointer is, in effect, the number of a transaction in which that item appears.

Consider the following example. We have a sample database containing seven transactions and eight items, which are shown below in Table 1. With each transaction we associate a unique number called transaction identifier (TID). Using this database as input we create the corresponding inverted file index, which would look like the one shown in Table 2. The list for each item is of the form {f_t; tid_1, tid_2, tid_3, ..., tid_ft}, where f_t is the number of transactions in which the item appears, followed by the identifiers (the TIDs) of those transactions, denoted here by tid_k. For example, item 2 appears in 3 transactions, namely transactions 2, 4 and 7.
Table 1
Sample database containing seven transactions, where each line is a transaction

TID    Items
1      1, 3, 6
2      2, 4, 6, 7
3      1, 4, 5, 8
4      2, 4
5      3, 4
6      1, 4, 5
7      2, 3, 6, 8
Table 2
Inverted file for the sample database in Table 1

Itemset    Transactions
1          {3; 1, 3, 6}
2          {3; 2, 4, 7}
3          {3; 1, 5, 7}
4          {5; 2, 3, 4, 5, 6}
5          {2; 3, 6}
6          {3; 1, 2, 7}
7          {1; 2}
8          {2; 3, 7}
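The mapping from Table 1 to Table 2 can be reproduced in a few lines of code. The Python sketch below hard-codes the sample database of Table 1 as its input (our own toy encoding, not part of the system) and prints an inverted list of the form {f_t; tid_1, ..., tid_ft} for every item.

```python
from collections import defaultdict

# Sample database of Table 1: TID -> list of items.
database = {
    1: [1, 3, 6], 2: [2, 4, 6, 7], 3: [1, 4, 5, 8], 4: [2, 4],
    5: [3, 4], 6: [1, 4, 5], 7: [2, 3, 6, 8],
}

# Build the inverted file: item -> ascending list of TIDs it appears in.
inverted = defaultdict(list)
for tid in sorted(database):
    for item in database[tid]:
        inverted[item].append(tid)

for item in sorted(inverted):
    tids = inverted[item]
    # Prints e.g. "2: {3; 2, 4, 7}", matching the entries of Table 2.
    print(str(item) + ": {" + str(len(tids)) + "; " + ", ".join(map(str, tids)) + "}")
```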
4.1.2. Inverted file compression

Once the inverted file has been constructed, it is very probable that it needs to be compressed so that it can be kept and processed in main memory. Keeping the inverted file in main memory makes it faster to access, and makes the processing of any queries practically instantaneous. Of course this is not always necessary: if the size of the index is small enough or the available main memory is big enough, then there is no need to compress it. On the other hand, compressing the index means some extra time to perform the compression, and some extra effort and time to process each query. Nevertheless these costs are negligible compared to the gains we accomplish by keeping the whole index in main memory.

An inverted list can be stored as an ascending sequence of integers. Suppose that we have a database with some transactions in it, and a specific item appears in seven of them, say transactions number 5, 8, 9, 10, 53, 54, and 55. Then the specific item is described in the inverted file by a list like the one below:

{7; 5, 8, 9, 10, 53, 54, 55}

All the TIDs in the inverted list of an item are stored in ascending order, so that tid_k < tid_k+1. Because of this ordering, and because the processing is done sequentially from the beginning of each list, every list can be stored as an initial position followed by a list of tid-gaps, where a gap is the difference tid_k+1 − tid_k. The above list could be stored as

{7; 5, 3, 1, 1, 43, 1, 1}

We can always obtain the original TIDs simply by calculating at each step the sums of the gaps.
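A minimal sketch of the tid-gap transformation just described is given below (plain Python; no particular compression codec is assumed, although in practice the gaps would additionally be coded with a variable-length scheme).

```python
def to_gaps(tids):
    """Encode an ascending TID list as the first TID followed by gaps tid_{k+1} - tid_k."""
    return [tids[0]] + [b - a for a, b in zip(tids, tids[1:])]

def from_gaps(gaps):
    """Decode by accumulating the running sums of the gaps back into absolute TIDs."""
    tids, total = [], 0
    for g in gaps:
        total += g
        tids.append(total)
    return tids

tids = [5, 8, 9, 10, 53, 54, 55]
print(to_gaps(tids))                       # [5, 3, 1, 1, 43, 1, 1]
print(from_gaps(to_gaps(tids)) == tids)    # True
```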
4.2. Resolving Boolean queries

When a query that combines any of the three main Boolean operators is issued to our system, the procedure followed in order to answer it is as follows: first the lexicon is searched for each term. Then, according to where the lexicon points for each term, the corresponding inverted lists are accessed, retrieved and decoded. We are then ready to merge the lists, by taking the intersection (AND case), the union (OR case) or the complement (NOT case). If the user also wishes to see the exact transactions that contain the specific itemsets, then the transactions themselves are retrieved and displayed to the user in a suitable form. Normally, though, we would only want to know in how many transactions an itemset appears.

Let us see how a query involving the AND operator and some itemsets is resolved; apart from being the most common case, it is maybe the most complicated, interesting and demanding one. Suppose we wish to find in how many transactions itemsets A, C, F appear, or in other words we issue the following query:

A AND C AND F

First we locate each itemset in the lexicon, which as said before resides in main memory for fast lookup and access. Next, after all terms have been found, we sort them by increasing frequency. The sorting is straightforward, as in the lexicon we have already stored the number of appearances of every itemset (f_t). Next, the inverted list of the least frequent item is decompressed and read into memory. This list establishes a set of candidate transactions (more specifically the corresponding TIDs), which are checked against the lists of the other itemsets (always in increasing frequency order). The list that constitutes the answer can be at most the same size as the first candidate list, and normally it will be much smaller or sometimes even empty. So we see that the operation that dominates the resolution of a query involving ANDs is looking up rather than merging lists.

The OR case is implemented in a much simpler way. What we need to do is again locate each itemset in the lexicon, and process their corresponding inverted lists by merging them rather than looking them up. After all the lists have been merged we remove any duplicates and present the result. Finally, the NOT case is implemented in much the same way as the AND case; the only difference is that while checking the lists of all the other itemsets against the list of candidates, we eliminate those transactions that appear in both the candidate list and any other list.
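The AND case described above can be sketched as follows. The code is ours and assumes a toy lexicon that maps each item directly to its (already decoded) inverted list; the point it illustrates is the frequency-ordered intersection, with the shortest list establishing the candidates.

```python
def resolve_and(query_items, lexicon):
    """Intersect inverted lists, starting from the least frequent item.

    `lexicon` is assumed to map item -> sorted list of TIDs (its inverted list).
    """
    # Sort the query terms by increasing frequency, as stored in the lexicon.
    lists = sorted((lexicon[i] for i in query_items), key=len)
    # The shortest list establishes the set of candidate TIDs ...
    candidates = set(lists[0])
    # ... which are then looked up against the remaining lists.
    for lst in lists[1:]:
        candidates &= set(lst)
        if not candidates:
            break
    return sorted(candidates)

lexicon = {"A": [1, 3, 6, 9], "C": [1, 5, 6, 9], "F": [6, 9, 11]}
print(resolve_and(["A", "C", "F"], lexicon))   # [6, 9]
```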
4.3. The vector space model

The index created as described above can only give answers to Boolean queries. When using an index for making ranked queries we employ a model called the vector space model. We will describe the functioning of this model through its most frequent application, text collections. In the vector space model, documents are represented as vectors in a multidimensional Euclidean space; each axis in this space corresponds to a term. The functioning of the vector space model can be divided into three logical phases. The first is the generation of an inverted index based on a collection of files (usually documents). The second is the assignment of a weight to all items in the index, and the third and final phase consists of using some similarity measure to rank the documents that constitute the answer to a user query. Of course these phases, and especially the first and the second ones, are not really separate, but we consider them so for simplicity. Let us see how this model works.

In the indexing phase the collection is read from the database and an index is created. This collection can be practically anything, from a single text file to several thousands of files with any content and format. This phase is the same as that presented in Section 4.1.

In the weighting phase all terms are assigned a weight [41]. A common weighting scheme is what is known as the TFIDF scheme. According to this, every term is assigned a weight according to the number of times the term occurs in a document, scaled in any of a variety of ways to normalize document length (such as the sum of the number of terms in the document). This weight is called the Term Frequency (TF) factor. The TFIDF scheme also tries to scale down the importance of terms that occur in too many documents, or are of little interest to the users. This is accomplished by the Inverse Document Frequency (IDF) factor. A widely used method for determining the IDF factor is Zipf's principle [40], according to which the frequency of a term tends to be inversely proportional to its rank: w_t = 1/f_t. The TF and IDF factors are combined into the complete vector space model like this: each term corresponds to a dimension in the vector space, and documents as well as queries are represented as vectors in this space. The coordinate of a document d in the direction of axis t is given by d_t = TF(d, t) · IDF(t). Every query q is interpreted as a document and is also transformed to a vector in the same vector space. This scheme has been used extensively, with very good results especially for text collections.

The final and most important phase is to measure the proximity between any query q and all relevant documents in the collection, sort the answers and present the user with the most relevant ones. Similarity between a query and the documents in our collection is determined using various similarity measures, such as the inner product, the Jaccard or Dice coefficients, or the most popular cosine measure [41]. The results are sorted and finally the user is presented with the most relevant documents in descending order of importance.
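The three phases can be condensed into the following compact sketch. It uses a toy document collection of our own, a plain TF weight, the Zipf-style IDF (w_t = 1/f_t) mentioned above, and the cosine measure for ranking; it is an illustration of the general model, not of our system's weighting (which is described in Section 6.2.2).

```python
import math
from collections import Counter

docs = {                      # toy collection: doc id -> list of terms
    "d1": ["wine", "cheese", "bread"],
    "d2": ["bread", "milk", "bread"],
    "d3": ["wine", "wine", "cheese"],
}

# Zipf-style IDF: weight inversely proportional to the number of documents
# a term appears in (w_t = 1 / f_t).
df = Counter(t for terms in docs.values() for t in set(terms))
idf = {t: 1.0 / f for t, f in df.items()}

def vector(terms):
    tf = Counter(terms)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

query = vector(["wine", "cheese"])
ranking = sorted(docs, key=lambda d: cosine(query, vector(docs[d])), reverse=True)
print(ranking)   # documents in descending order of similarity to the query
```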
5. Using our approach like classic data mining algorithms (batch module)

The batch module of our system works like all classic association rules approaches, by finding all frequent itemsets and delivering to the user all corresponding association rules. In this section we only give a very brief overview of how this module works, since our main goal was not to give an implementation of yet another classic data mining algorithm that would generate exactly the same results, but rather to give a new perspective on the association rules problem and on the approaches that recommend itemsets to users (i.e. the interactive module presented in Section 6). Also, despite the fact that we have strong indications that the performance of this approach would be quite enhanced compared to previous techniques, we did not test it, since this was outside the scope of this work.

Suppose now that we have the sample database shown in Table 3. First the database is scanned once and an index, using inverted files, is created. Then this index is compressed and stored in main memory for fast retrieval. This step is equivalent to the first step of the traditional algorithms, where the database is scanned once in order to find the candidate 1-itemsets and to count their appearances.
Table 3
An example transaction database for data mining (database D)

TID    Items
100    A, C, D
101    B, C, E
102    A, B, C, E
103    B, E
The candidate 1-itemsets are already identified by our approach too, since the index stores every distinct itemset along with information about the number of transactions it appeared in and also which those transactions are. After the first scan, the database does not have to be scanned again: each consecutive scan of the database is replaced by a scan of the inverted file index, which is stored in main memory (Fig. 2).

Suppose now that the user sets his support threshold to 2, meaning that in order for an item to be considered large it must appear in at least two transactions. We can instantly generate the set of large itemsets L1 by choosing from the compressed index those itemsets for which the number of appearances is at least two. In order to generate the set of candidate 2-itemsets we combine the large 1-itemsets in L1 and we generate C2. The support of each 2-itemset in C2 is found by querying the index repeatedly. For example, for itemset AB we query the index like this: A AND B. The two lists of A and B respectively are intersected and the support (the number of common TIDs in the two lists) is returned.
Fig. 2. Use of our system as a classic level-wise data mining algorithm.
This is done until every itemset in C2 has been queried and its support returned, thus creating L2, the set containing all large 2-itemsets. Combining the large 2-itemsets we create C3, the candidate set of 3-itemsets, and so on.

An advantage of our approach is that the creation of the index is completely decoupled from any statistical measure. The index is created irrespective of the support or any other measure used in other methods, and does not change if we alter the support threshold. So if we are not satisfied with the output, for example if too many or too few large itemsets are generated, we are able at any step and at any time to restart the whole process. The cost is significantly smaller compared to all other methods, since everything is in main memory and the database does not have to be read again; all we have to do is re-query the index step by step using the new threshold. Using a traditional algorithm, like e.g. Apriori [5], we would have to scan the whole database multiple times again.
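The level-wise loop of the batch module can be sketched as follows. This is our own simplified illustration over the index of database D (Table 3): candidate generation is reduced to unioning pairs of large itemsets from the previous level (rather than the full Apriori join and prune), and each "query" is a plain intersection of TID lists.

```python
from itertools import combinations

index = {                      # inverted file for database D of Table 3
    "A": {100, 102}, "B": {101, 102, 103},
    "C": {100, 101, 102}, "D": {100}, "E": {101, 102, 103},
}
minsup = 2

def support(itemset):
    """Answer the Boolean AND query for `itemset` by intersecting TID lists."""
    return len(set.intersection(*(index[i] for i in itemset)))

# L1: large 1-itemsets, read directly off the (compressed) index.
large = [frozenset([i]) for i in index if len(index[i]) >= minsup]
level = large
while level:
    # Next candidate level: unions of pairs of large itemsets of the previous level.
    k = len(next(iter(level))) + 1
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
    # Keep the candidates whose queried support reaches the threshold.
    level = [c for c in candidates if support(c) >= minsup]
    large += level

for itemset in large:
    print(sorted(itemset), support(itemset))   # reproduces L1, L2 and L3 of Fig. 2
```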
6. Interactive module

The cornerstone of the interactive module is the same index as in the batch module, which we use though in a different way. The specific module is based either on Boolean or on ranked queries, and all services except one have as their function to propose to users the itemsets they are most probable to buy, based on the itemsets already in their basket. We first give the services provided by the Boolean queries and then we present the services provided by the ranked queries.

6.1. Services based on Boolean queries

6.1.1. Find all itemsets containing one specific itemset

The first service that proposes itemsets to users functions by trying to find all the combinations of a specific itemset. Suppose we have a company selling products through some web site on the Internet. A customer browses through the products and at some point he puts product B in his basket. We now wish to find all the combinations of B with the other products and present him the most frequent ones. If there are N distinct itemsets in our dataset, there may be at most N − 1 combinations of B with all the other itemsets, which we find and present to him in descending order of frequency. Suppose the same customer now puts product F in his basket. We then find all the combinations of itemset BF with all the other itemsets, which are at most N − 2. As the customer continues to put products in his basket we continue to find all the new combinations. In Fig. 3 we show all the combinations that have to be performed in each step for the sequence of buys B–F–D–C, and for the set of products (A, B, C, D, E, and F).

Of course the procedure proposed above has one obvious drawback. If the number of distinct itemsets is very big (say hundreds of thousands) then, as we can easily understand, the number of combinations that would have to be made every time an itemset is added to the basket is very big, and so the time and resources needed to process each itemset would be excessive, making it rather impractical. Therefore the procedure described above is ideal for a small number of distinct itemsets. Below we propose a technique that deals with this disadvantage to a great extent and improves the performance considerably.
Fig. 3. The combinations for the sequence B–F–D–C.
Instead of going through all itemsets and combining them with our itemset, we stop at the point where we have found the most frequent ones. Suppose again that a customer puts a product in his basket. Then we begin combining the specific itemset first with the itemsets that are the most frequent ones, since they are the most probable to contain the specific itemset in a high percentage, and then with the other itemsets. In order to do this we keep another list of the itemsets, apart from the one used as a lexicon, where all itemsets are sorted according to their appearance frequencies in descending order (alternatively the lexicon itself could be stored in sorted form).

Normally a user is interested in three or at most four itemset proposals. Most popular web sites that propose products to their customers (e.g. Amazon [43], Egghead [45], Ebay [44] etc.) also stop at four itemset proposals. If we present a customer with more than three products as suggestions then this becomes information overload and confuses rather than helps. Of course the user is free to adjust the system to his needs and define the number of itemsets that will be proposed to a customer; for our purposes we have also limited the specific number to four. So we begin combining itemsets and stop when we have gathered the four most frequent ones.

The question now is how we know that we have gathered the most frequent ones. If we find four itemsets with support equal to the support of our itemset and confidence 100%, then we are certain that we have found at least some of the most frequent ones and there is no need to find any more. There may exist more frequent itemsets, but they can have support and confidence at most equal to those of the itemsets found up to now, and so they would not add any new information to the itemset proposals. In case we have not gathered four such itemsets, what we do is check every time the support of the last itemset we have generated (by combining the itemsets already in our basket with the remaining ones) against the support of the next itemset in the list. If that support is larger than the support of the next itemset, and consequently of all remaining itemsets, then we have found all the most frequent itemsets and we can stop at that point. So the condition below must hold in order for our algorithm to stop:

support(I_x, I_k) > support(I_n)    (1)

where I_x are the itemsets already in our basket, I_k is the itemset in our sorted lexicon which we are currently considering and I_n is the next itemset in the lexicon. The algorithm for finding the k most frequent itemsets among all itemset combinations is illustrated in Fig. 4. The algorithm Find_k_TopItemsets begins by combining the itemsets already in our basket with the first k top itemsets in the frequency-sorted list of itemsets that are not also included in our basket. The itemsets generated by these combinations are stored in an array, which we sort according to their support in descending order.
Fig. 4. Finding the k most frequent itemsets.
Subsequently we proceed with the next itemsets in the frequency-sorted list. The support of every new itemset generated is found and compared to that of the last itemset in the array. If the support of the last itemset in the array is found smaller than that of the new itemset, then the last itemset is replaced by the new one and the array is re-sorted. The algorithm stops when the last itemset inserted in our array is found to have bigger support than the next itemset in the frequency-sorted list.
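Since Fig. 4 is not reproduced here, the following Python sketch follows the description above; the names, data structures and toy data are ours, not the paper's, and the stopping rule implements the condition of Eq. (1).

```python
def find_k_top_itemsets(basket, freq_sorted, item_freq, support, k=4):
    """Sketch of the top-k search described above.

    freq_sorted : lexicon items sorted by descending frequency
    item_freq   : item -> number of transactions it appears in
    support(s)  : support of itemset s, answered as a Boolean AND query
    """
    candidates = [i for i in freq_sorted if i not in basket]
    top = []                                  # (support, item) pairs, kept sorted descending
    for pos, item in enumerate(candidates):
        s = support(basket | {item})
        top.append((s, item))
        top.sort(reverse=True)
        top = top[:k]
        # Stopping rule (Eq. (1)): the k-th best support already exceeds the
        # frequency of the next item in the list, hence of all remaining items.
        nxt = candidates[pos + 1] if pos + 1 < len(candidates) else None
        if len(top) == k and (nxt is None or top[-1][0] > item_freq[nxt]):
            break
    return [item for _, item in top]

# Toy usage with a tiny inverted index (assumed data):
index = {"W": {1, 2, 3, 4, 5}, "B": {1, 2, 3, 4}, "C": {6, 7}, "D": {1, 2, 3}}
freq_sorted = sorted(index, key=lambda i: len(index[i]), reverse=True)
item_freq = {i: len(index[i]) for i in index}
support = lambda s: len(set.intersection(*(index[i] for i in s)))
print(find_k_top_itemsets({"D"}, freq_sorted, item_freq, support, k=2))   # ['W', 'B']
```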
The function of this algorithm can be seen in the example below.

Example 1. Suppose we have the following list of itemsets in descending order of appearance (Table 4) and a customer puts product D in his basket.

Table 4
An ordered list of itemsets

Product    Support
W          10
B          8
C          6
A          5
D          4
G          3
Z          2
We begin by combining itemset D with product W, then with product B etc. Suppose now that itemsets DW and DB both have support and confidence 100%, itemset DC has support 0, and product A appears with product D in 4 transactions. Then there is no need to search and combine any more itemsets, since there is no way that the remaining itemsets are more frequent (itemset G, as well as all the remaining itemsets, can appear in at most as many transactions as their own number of appearances), and we have already gathered the three most frequent ones.

Despite the improvement proposed above, the specific approach unfortunately has no guaranteed performance. We might after all need to search all items in the list, which would require as many combinations as the number of distinct items, until we find the most frequent ones. Below we propose another approach that deals with that inefficiency effectively and also has guaranteed behavior.

6.1.2. Suggesting itemsets based on similar transactions

The second service provided also tries to make recommendations to customers, but uses a different approach for evaluating the itemsets that are most probable to be bought. Suppose again that a customer puts product B in his basket. This is passed as a Boolean query to our system, which returns all transactions that constitute the inverted list of the specific itemset. All transactions are read, and subsequently we can follow two different techniques with them.

One would be to count the appearances of every itemset in the specific transactions and accordingly suggest the ones with the greatest numbers of appearances. For example, suppose that itemset B appears in transactions 4, 7, 10, and 16, which are shown in Table 5.

Table 5
Itemsets contained in transactions 4, 7, 10 and 16

TID    Itemsets
4      B, C, E
7      A, B, E, F
10     A, B, E
16     B, D

As we can see, itemset E is the most frequently appearing itemset in all transactions containing B (it appears in 3 of them) and so it is recommended first to the customer. Itemset A is the second most frequently appearing itemset (it appears twice) and so it is recommended to the customer right after itemset E.
The rest of the itemsets (C, F, and D) all appear only once in the transactions and could be recommended in any order.

Proposing first the itemsets that appear most of the time is statistically and logically correct, since these itemsets present the greatest probabilities of finally getting bought by a customer. But what happens when all, or even some, of the itemsets share the same number of appearances, as happened before? Suggesting all itemsets in plain lexicographic order would be the simplest and most trivial solution, but we wish to make the best possible suggestion. Since all such itemsets appear exactly the same number of times, they all share the same probability of being bought. Consequently we must propose the itemsets in the order we consider them more important for us. So we could first propose to the customer the most profitable itemsets, or more generally the ones with the largest weights. Of course this presupposes that all itemsets have been given some weight according to how important they are considered. Another possibility is to use what we define as the correlation coefficient. The correlation coefficient is a measure that shows how closely related some itemsets are, and is defined as in Definition 1.

Definition 1 (Correlation coefficient). Let I = {i1, i2, ..., im} be the set of all items in our database. Suppose that we have a set of items T = {i1, i2, ..., ik} which constitute the itemsets bought by a customer up to now, such that T ⊆ I, and that sup(T) is the support of the specific itemset. The correlation coefficient between the itemset T and every candidate itemset ic which is also present in the transactions where T is present is defined as the ratio of the support of itemset T to the support of the candidate itemset in the whole database (denoted suptotal(ic)):

cc(i1, i2, ..., ik → ic) = sup(i1, i2, ..., ik) / suptotal(ic)    (2)
The higher the value of this coefficient, the greater the relation between the itemsets considered. Let us see how this coefficient is used through an example.

Example 2. Suppose we have a database consisting of 100 transactions and a customer chooses product E, which appears in the four transactions shown in Table 6.

Table 6
Transactions containing itemset E

TID    Itemsets
3      A, B, E, F
9      B, E, F
77     A, E, F, G
106    E, G

By checking these transactions we find that itemsets A, B, F, and G are also included, where itemset F appears three times and the rest twice each. Itemset F will be proposed first as it appears the most, and we then have to propose the rest of the items. Suppose now that the total supports for these itemsets are:

suptotal(A) = 10, suptotal(B) = 50, suptotal(G) = 3
The total support of every itemset is readily available, since either in the lexicon entry of every itemset or in its corresponding inverted list we also store its number of appearances. So the correlation coefficients are

cc(E → A) = sup(E)/suptotal(A) = 4/10 = 0.4, cc(E → B) = 4/50 = 0.08 and cc(E → G) = 4/3 = 1.33
We see that despite the fact that all three itemsets share the same number of appearances, itemset G is most closely related to itemset E, since all of its appearances occur together with itemset E, and consequently it should be proposed first. Itemset B on the other hand appears the same number of times with itemset E, but also appears in half the transactions in our database, so it is not so closely related to itemset E and is therefore proposed last. The same procedure is followed each time a new itemset is added to a customer's basket, but with an ever decreasing number of common transactions (see Section 4.2). The recommendations stop either when the customer completes his transaction or when there are no more common transactions for the specific combination of itemsets.
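A sketch of this recommendation step, combining appearance counts with the correlation coefficient of Definition 1 as the tie-breaker, is given below. The toy data loosely reproduce Example 2; the total support of F (here set to 3) is not given in the example and is purely an assumption.

```python
from collections import Counter

def recommend(basket, index, database, total_support, top_n=3):
    """Rank co-occurring items by appearance count, breaking ties with the
    correlation coefficient of Definition 1: cc = sup(basket) / sup_total(candidate)."""
    # Boolean AND query: transactions containing everything already in the basket.
    tids = set.intersection(*(index[i] for i in basket))
    counts = Counter(i for t in tids for i in database[t] if i not in basket)
    sup_basket = len(tids)
    ranked = sorted(counts,
                    key=lambda i: (counts[i], sup_basket / total_support[i]),
                    reverse=True)
    return ranked[:top_n]

# Toy data after Example 2 (F's total support is an assumption).
database = {3: {"A", "B", "E", "F"}, 9: {"B", "E", "F"},
            77: {"A", "E", "F", "G"}, 106: {"E", "G"}}
index = {}
for tid, items in database.items():
    for i in items:
        index.setdefault(i, set()).add(tid)
total_support = {"A": 10, "B": 50, "F": 3, "G": 3, "E": 4}

print(recommend({"E"}, index, database, total_support))   # ['F', 'G', 'A']
```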
6.1.3. Using Boolean queries complementary to classic algorithms

A final possible use of our approach would be as a complement to classic association rule mining algorithms (i.e. Apriori [5], DHP [27], DIC [10] etc.), rather than for making suggestions to the customers as before. More specifically, we could run an association rule algorithm, get the suggested rules and subsequently perform a further investigation on the discovered rules using Boolean queries. Our proposed system is first of all able to give answers to any kind of query involving any of the three main Boolean operators AND–OR–NOT. For example it can answer queries such as how many people bought products A and B, or how many people bought product C or D. Additionally, our system is capable of performing any queries that involve combinations of any form of the three operators, for example how many people bought products C and A but not product B. This gives absolute freedom to the data miner, since it allows him to get specific results for specific itemsets in a way that is more natural and simple to him and resembles a search engine. Also, using an index allows the resolution of all kinds of queries practically immediately and, most importantly, with minimum resources.

6.2. Services based on ranked queries

We begin by giving the rationale for our second solution, which is also based on an index but on ranked queries instead of Boolean ones, and then present the solution itself. Finally we present some aspects regarding a better index implementation for the specific approach.

6.2.1. Importance of items

Most approaches thus far have relied on the support measure for finding all large itemsets, and assumed all itemsets to be of the same binary nature, i.e., either they appear in a transaction or not. They did not give any other special measure of importance to the itemsets, nor did they take into consideration the number of times an itemset appears in a transaction. We argue that the total number of appearances of an itemset alone should not be the only judge of its significance. If we take the supermarket case for example, the individual sales of an itemset tell us only half the truth about it; the other half is the total profit we have from it as well as the number of times it appears in each transaction. Products such as chewing gums are sold in almost every single transaction, but if we take a look at their total profits we will see that they are not that big. On the other hand, products such as wines are not sold that often, but their profit margins are far bigger and consequently their total profits significant. Also, some products (e.g. beers) tend to be bought in large quantities in a single transaction and not in a timely manner like, for example, milk. Taking into consideration only whether items appear in a transaction or not, and discarding their real significance, whatever that might be, as well as their quantities, might lead us to false suggestions.

Various works have tried to address the problems presented by using a single support measure for all itemsets in a database. A very successful one has been the work in [25], where every itemset was assigned a different support value in order to reflect the natures and the widely varied frequencies of the items in the database. So if an itemset was considered more important than others, it would be assigned a lower support value, and the same would happen with all interesting/important itemsets. However, this approach too did not measure the number of times an itemset appeared in a transaction but considered appearances as binary, and most importantly it had a rather arbitrary and biased way of assigning supports to the items. In [13] another alternative to the widely used support measure was proposed, which took into consideration the special importance an itemset might have. According to this approach every itemset was assigned a different weight that reflected its importance, and the multiplication of the support of an itemset with its assigned weight gave the weighted support measure. If the weighted support of an itemset was above a user-specified threshold (called weighted minimum support, wminsup), then this itemset was considered large. The specific approach successfully addressed the problem of assigning importance to all itemsets, but suffered from various drawbacks, such as the procedure for generating and pruning candidate itemsets, that unavoidably had a great impact on the final performance. Also, this approach too did not measure the number of times an item appears in a transaction.

In our approach we manage to take into consideration both the importance of an itemset, as expressed in the form of some weight, and the number of times the itemsets appear in each transaction, and present the users with a system that acts like a recommendation engine. What is most important is that with our approach these parameters are taken into consideration both for the existing transactions and for any new transaction made by a user. Our proposal is especially suitable when working with weighted items or with items with more than one attribute.

6.2.2. Our approach

Our approach, which also works in three steps, resembles, as noted before, the vector space model used in Information Retrieval. First a weight is assigned to every itemset, all existing transactions are read from our database and an index is created.
Then every new transaction is interpreted as a ranked query and is passed to our system. Finally the system finds the most relevant transactions to any new transaction and the customers are presented with some itemsets to choose from. Let us see how it works.
6.2.2.1. Existing transactions weighting. Suppose that we have a set of transactions T = {t1, t2, t3, ..., tn}, which constitute our database, and a set of items I = {i1, i2, i3, ..., im}. Each transaction t is a set of items such that t ⊆ I. We model these 1-items as an m-dimensional Euclidean space where each dimension corresponds to a 1-item. Consequently every transaction is represented as a vector in this space. The coordinate of a transaction t in the direction corresponding to an item i is determined by two quantities.

Itemset weight: This weight tries to capture the importance the data miner gives to an itemset i. An important itemset will enjoy a large weight whereas an unimportant one a very small weight. The determination of this weight is application specific and depends on various parameters, such as the desired output or the nature of our data. In the few works that assigned weights to itemsets [13], both their determination and their assignment were arbitrary or were not addressed at all. There does not exist a general formula or method for determining these values, while using techniques borrowed from other settings might lead us to wrong suggestions. For example, if we borrowed the Zipf principle from text databases, it would be completely wrong in our case, since under no circumstances should a very frequently appearing itemset be considered unimportant. This weight has a similar function to the IDF factor in the TFIDF scheme, in other words to scale down the effect of unimportant items, or to boost the effect of the important ones. If we consider again the supermarket database, where our first priority is the total profit we make from the products, we could use the following formula:

w_i = ln(1.718 + (f_i · PM_i) / max TP_k)

where max TP_k is the maximum total profit we have from an itemset in our database, f_i is the number of appearances of itemset i and PM_i is the profit margin of itemset i. Practically this means that one appearance of a very profitable itemset counts much more than, say, 10 appearances of a not so profitable itemset.

Intra-transaction itemset frequency: The intra-transaction itemset frequency component itf_{t,i} in essence controls how the appearances of the itemsets within the transactions are going to be evaluated. This component has a similar function to the TF component of the TFIDF scheme. As said before, with our approach we want to take into consideration the number of times an itemset appears in every transaction. One simple solution would be to set this variable equal to the number of times the specific itemset appears in a transaction (itf_{t,i} = f_{t,i}). Nevertheless this has the obvious drawback that transactions with many appearances of a very important term, or more generally very long transactions, would always be ranked first. This is not necessarily completely wrong, but we wish to reduce this effect. Practically we want to take into consideration the quantities of the itemsets bought in a transaction, but also not to favor these appearances too much over the other itemsets. For that reason we use the following formula, where the first appearance of an itemset in a transaction contributes much more than the subsequent ones:

itf_{t,i} = K + (1 − K) · f_{t,i} / max_c f_{t,c}    (3)

Variable K is a tuning constant that controls the balance between the first and all later appearances of an itemset in a transaction, with reported optimums between 0.3 and 0.5 [18].
In our experiments we used K equal to 0.4. The factor max_c f_{t,c} is the maximum frequency of any item in transaction t. The coordinate of a transaction t in the direction of an itemset i is then calculated as follows:

w_{t,i} = itf_{t,i} · w_i

The procedure described above is in essence the generation of an index from a collection of data (in our case transactions), but with weights assigned to all the itemsets. This index can then be used for making multiple types of queries, as we will see in the following section.

6.2.2.2. New transactions evaluation. Up to now we have managed to represent all transactions that already exist in our database as vectors in a multidimensional space. The question now is how we use that information and what happens with every new transaction. When a customer buys a product, or more generally when a user of our collection focuses on an object inside that collection, this is passed as a query to our system. Suppose for example that we have again a supermarket database, and a customer puts a product in his basket. Then we would be interested in the transaction or transactions that best approach the subsequent buys of this customer. Of course it is rather difficult to predict this precisely from just one product, but as the same customer continues to put more products in his basket the answers will become more and more accurate. Every new transaction is interpreted as a query and transformed to a vector in the same space. The vector of a query q in the direction of an itemset i is calculated as follows:

w_{q,i} = itf_{q,i} · w_i

The itf_{q,i} component is given by the same formula as the itf_{t,i} component above (Eq. (3)). By using this formula we manage to take into consideration the number of times every item appears in a new transaction as well as its weight, the same way it is done for the transactions already in our database, and so we manage to find the most relevant transactions more accurately.

The final step is to measure the proximity between the query vector q and the vector of every transaction t ∈ T. Various techniques and heuristics have been proposed in order to give answers to ranked queries. In our case we use the widely used cosine measure, which is given by the following formula:

cos(Q, D_d) = (1 / (w_d · w_q)) · Σ_{t ∈ Q ∩ D_d} (1 + log_e f_{d,t}) · w_{d,t}

According to the cosine measure we try to find the cosine of the angle between the query vector and every relevant transaction vector; the larger the cosine value, the greater the similarity. The classical vector space models used for collections of text files stop at this point, sort all the answers according to the similarity score they received and present them to the user. However, as we will see in Section 6.2.2.3 below, in our case one more additional step is introduced.
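The weighting and ranking just described can be condensed into the following sketch. The profit-based weight and the itf component follow the formulas above with K = 0.4, while the per-item frequencies, profit margins and past transactions are invented for illustration; the cosine here is the plain cosine of the two vectors, a simplification of the formula given above.

```python
import math
from collections import Counter

K = 0.4                                      # tuning constant of Eq. (3)

def item_weight(freq, margin, max_total_profit):
    # w_i = ln(1.718 + f_i * PM_i / max TP_k), the profit-oriented weight above
    return math.log(1.718 + freq * margin / max_total_profit)

def itf(counts):
    # itf_{t,i} = K + (1 - K) * f_{t,i} / max_c f_{t,c}
    m = max(counts.values())
    return {i: K + (1 - K) * c / m for i, c in counts.items()}

def vector(items, weights):
    counts = Counter(items)
    scaled = itf(counts)
    return {i: scaled[i] * weights[i] for i in counts}   # w_{t,i} = itf_{t,i} * w_i

def cosine(q, d):
    dot = sum(q[t] * d[t] for t in q if t in d)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Invented data: per-item (frequency, profit margin) and past transactions.
stats = {"wine": (30, 5.0), "gum": (400, 0.1), "cheese": (120, 1.5)}
max_tp = max(f * m for f, m in stats.values())
weights = {i: item_weight(f, m, max_tp) for i, (f, m) in stats.items()}

past = {1: ["wine", "cheese"], 2: ["gum", "gum", "cheese"], 3: ["wine", "wine", "gum"]}
new_basket = ["wine"]                        # the new, partially filled transaction
query = vector(new_basket, weights)
ranked = sorted(past, key=lambda t: cosine(query, vector(past[t], weights)), reverse=True)
print(ranked)                                # most relevant past transactions first
```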
6.2.2.3. Recommendation phase. In a text database, when the users issue a query they are presented with the most relevant documents to choose from. In our case though, the users are not interested in transaction proposals (as would be the case in a text collection), but rather in itemset proposals. In practice we might be searching for the most relevant transaction, but ultimately we want to propose to the users the itemsets that they are most likely to buy. In order to achieve this we could follow two alternative strategies. Suppose that a user has put some itemsets in his basket, and our system has concluded on the transactions that are most relevant for the specific user up to now.

The first option would be to propose all itemsets from the most important transaction, i.e. the one that was finally ranked first. In this case all itemsets within this transaction would also be ranked according to the product of their weight and the number of times they appear in the transaction, and presented to the customer in descending order of importance. For example, if the most relevant transaction to some new transaction contained the following itemsets (every tuple contains first the itemset and then the number of times it appears in the specific transaction) ⟨(1, 3), (4, 4), (7, 1), (9, 6)⟩, and the weights of these itemsets are respectively w_1 = 0.3, w_4 = 0.5, w_7 = 0.8 and w_9 = 0.1, then multiplying the weight of every itemset by its number of appearances gives the values shown in Table 7. According to the final weights the recommendation order would be [4, 1, 7, 9].

The second strategy would be to also evaluate some of the subsequent transactions; for example, we could also check the two or three immediately following transactions and from each of them present the users with the first or the second most important itemsets. Subsequently, any time a customer puts more products (itemsets) in his basket, the enlarged basket is passed as another ranked query to our system, now consisting of all its itemsets; the highest ranked transactions are returned, and so on until the customer has finished his/her buys.

Both strategies could be coupled with a technique known as Rocchio's relevance feedback [11,12], widely used in Information Retrieval tasks, in order to produce better and more accurate predictions. According to this technique the responses to a user query are presented together with a simple rating form, and the user indicates which of the returned documents are useful, i.e. relevant. The system then reformulates the query and feeds it back, by adding to the original query vector q a weighted sum of the vectors of the relevant documents D⁺ and subtracting a weighted sum of the vectors of the irrelevant documents D⁻, according to the following formula:

q' = a·q + b·Σ_{d ∈ D⁺} d − c·Σ_{d ∈ D⁻} d
The values for the constants a, b and c are usually set equal to 8, 16 and 4 respectively, as these values were also found to be reasonable in the TREC experiments. Of course, requiring that the users manually rate everything they are recommended would create the same problems as those appearing in a collaborative filtering system that uses explicit voting (see Section 2). So instead of asking the users to provide this feedback manually, a method called pseudo-relevance feedback (PRF) can be used. With PRF we essentially assume that a number of the first ranked documents are relevant (i.e. they form D⁺) and set c to 0 (that is, the irrelevant documents D⁻ are not used, unless we happen to have that information). That way the system gets its feedback without bothering, or even being noticed by, the user. Despite the fact that both these enhancements would result in more accurate suggestions, as we will see later (Section 7.5) they have certain drawbacks that render them rather impractical in our case.
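As an illustration, here is a minimal sketch of the Rocchio update and its pseudo-relevance variant applied to the sparse transaction vectors used above. The function names and the dictionary representation are ours; the default constants follow the a = 8, b = 16, c = 4 setting mentioned in the text, and dropping negative components is a common simplification rather than part of the original formulation.

```python
def rocchio(query_vec, relevant, irrelevant, a=8.0, b=16.0, c=4.0):
    """Rocchio reformulation q' = a*q + b*sum(D+) - c*sum(D-);
    vectors are sparse dicts mapping itemset id -> weight."""
    new_q = {i: a * w for i, w in query_vec.items()}
    for vec in relevant:
        for i, w in vec.items():
            new_q[i] = new_q.get(i, 0.0) + b * w
    for vec in irrelevant:
        for i, w in vec.items():
            new_q[i] = new_q.get(i, 0.0) - c * w
    # Negative components carry little signal for recommendation; drop them.
    return {i: w for i, w in new_q.items() if w > 0}

def pseudo_relevance_feedback(query_vec, ranked_vecs, top_n=3):
    """PRF: treat the top_n ranked transactions as relevant (D+) and set c = 0,
    so the user is never asked for an explicit rating."""
    return rocchio(query_vec, relevant=ranked_vecs[:top_n], irrelevant=[], c=0.0)
```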
Table 7
Finding the most important itemsets within a transaction

Itemset    No. of appearances    Weight    Final weight
1          3                     0.3       0.9
4          4                     0.5       2.0
7          1                     0.8       0.8
9          6                     0.1       0.6
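For completeness, a short sketch of the first recommendation strategy, reproducing the ranking of Table 7; the helper name and literal data are ours, and the weight × appearances scoring is exactly the one described above.

```python
def recommend_from_transaction(transaction, weights):
    """Rank the itemsets of the top-ranked transaction by weight * appearances.
    `transaction` is a list of (itemset, appearances) pairs."""
    scored = [(item, count * weights[item]) for item, count in transaction]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored]

top_transaction = [(1, 3), (4, 4), (7, 1), (9, 6)]
weights = {1: 0.3, 4: 0.5, 7: 0.8, 9: 0.1}
print(recommend_from_transaction(top_transaction, weights))  # -> [4, 1, 7, 9]
```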
6.2.3. Special index construction for ranked queries

If we decide to use our system only for the second approach rather than the first one (i.e. only for making ranked queries and not Boolean queries), then we would do better to use an alternative index structure especially designed for this purpose [28]. An index created using the classic inverted file technique stores, for every itemset, a list ordered by transaction number, with every pair holding a transaction number and the corresponding within-transaction frequency f_{d,t}. Suppose that we have an itemset that appears in four transactions, and more specifically twice in transaction 100, four times in transaction 105, 10 times in transaction 130 and twice in transaction 199. The list of this itemset would be stored as

⟨4; (100, 2); (105, 4); (130, 10); (199, 2)⟩

In order to speed up the whole ranking process we sort every list by decreasing f_{d,t} value. The list above would then be written as

⟨4; (130, 10); (105, 4); (100, 2); (199, 2)⟩

That way the transactions with the greatest f_{d,t} values can be found very quickly. However, the list must now be represented in another way, because the transaction number pointers in each list are no longer in ascending order and we can no longer compress the list by taking d-gaps. To overcome this we break every list into chunks within which all the f_{d,t} values are the same, so that d-gaps can be taken inside each chunk. The list above would be written as

⟨4; (10, 1: 130); (4, 1: 105); (2, 2: 100, 199)⟩

Every chunk is prefixed by the common f_{d,t} value shared by all pointers within the chunk, followed by a counter showing how many pointers the chunk contains; finally we store in each chunk the exact transaction numbers that share that f_{d,t} value. In the list above the first and second chunks contain one pointer each, and the third one two. The final frequency-sorted index is rarely larger than one generated the classic way [29].
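The following is a small sketch, in Python, of how such a frequency-sorted, chunked list could be derived from an ordinary postings list; it keeps a plain nested structure with d-gaps inside each chunk rather than the compressed bit-level encoding of [28,29], and all names are ours.

```python
from itertools import groupby

def frequency_sorted_chunks(postings):
    """Turn a postings list [(tid, f), ...] sorted by tid into
    frequency-sorted chunks: [(f, count, [d-gaps of the tids sharing f]), ...]."""
    # Sort by decreasing frequency, breaking ties by ascending transaction id.
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    chunks = []
    for f, group in groupby(ordered, key=lambda p: p[1]):
        tids = [tid for tid, _ in group]
        # Within a chunk the tids are ascending again, so d-gaps can be taken.
        gaps = [tids[0]] + [b - a for a, b in zip(tids, tids[1:])]
        chunks.append((f, len(gaps), gaps))
    return chunks

postings = [(100, 2), (105, 4), (130, 10), (199, 2)]
print(frequency_sorted_chunks(postings))
# -> [(10, 1, [130]), (4, 1, [105]), (2, 2, [100, 99])]
```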
7. Evaluation

We tested our approaches in two different directions, namely qualitatively and quantitatively. More specifically, we wanted to check the results they produced (i.e. the quality factor) as well as their overall performance (i.e. the quantity factor). The performance of our approach in terms of time and space requirements was tested using data from a synthetic data generator, while the quality and the usefulness of the results were verified using real data.
7.1. Synthetic data

In order to test the performance of our approach we used datasets similar to the synthetic datasets used in [5] and in most works thereafter. The data produced by this synthetic generator are said to successfully simulate the buying patterns in a retail environment. A brief explanation of the parameters used is given in Table 8. We evaluated our algorithm on 20 different databases: eight of them in order to test the behavior for varying transaction length and average size of maximal potentially frequent itemsets, and the rest keeping all other parameters fixed while scaling up the number of transactions. For the first eight we set |D| = 100K, N = 1000 and |L| = 2000. The values of |T| are set to 5, 10 and 20, and the average size of maximal potentially frequent itemsets to 2, 4, 6 and 8. In the transaction generation process we introduced one additional step in order to represent, and later take into consideration, the number of times every itemset appears in a transaction. More specifically, we used a random number generator to determine the number of times every itemset appears in each transaction. We limited the generated values to at most 20, since it would be highly abnormal for a customer to buy the same product more than 20 times in a single transaction. The final datasets had double the size of those generated in [5]. In Table 9 we show the parameter settings for the databases used. For the generation of weights we used a method similar to that of [13]. More specifically, every itemset is assigned a weight according to an exponential distribution (Fig. 5) based on its frequency of appearance in the database.
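As an illustration of the extra generation step and of the weight assignment, here is a minimal sketch written under the assumptions stated above (per-transaction appearance counts drawn uniformly between 1 and 20, weights drawn from an exponential distribution). The simple frequency-independent draw of the weights is our own simplification of the scheme of [13], and all names are hypothetical.

```python
import random

def augment_with_counts(transactions, max_count=20, seed=0):
    """Attach to every item of every transaction the number of times it is bought."""
    rng = random.Random(seed)
    return [[(item, rng.randint(1, max_count)) for item in t] for t in transactions]

def assign_weights(transactions, lam=5.0, seed=0):
    """Draw a weight for every distinct item from an exponential distribution,
    so that few items receive large weights and most receive small ones."""
    rng = random.Random(seed)
    items = {item for t in transactions for item in t}
    return {item: rng.expovariate(lam) for item in items}

raw = [[1, 4, 7], [4, 9], [1, 9, 7, 4]]
print(augment_with_counts(raw))
print(assign_weights(raw))
```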
Table 8
Explanation of parameters used for generating synthetic datasets

|D|    Number of transactions
|T|    Average size of transactions
|I|    Average size of maximal potentially frequent itemsets
|L|    Number of maximal potentially frequent itemsets
N      Number of items
Table 9
Parameter settings for the databases used

Name            |T|    |I|    Size in MB
T5.I2.D100K      5      2     4.8
T5.I4.D100K      5      4     4.8
T10.I2.D100K    10      2     8.8
T10.I4.D100K    10      4     8.8
T10.I6.D100K    10      6     8.8
T20.I2.D100K    20      2     16.8
T20.I4.D100K    20      4     16.8
T20.I6.D100K    20      6     16.8
T20.I8.D100K    20      8     16.8
Fig. 5. Distribution of weights over one dataset.
7.2. Time and space requirements for index construction

The database is read once and an index is created using inverted files. In order to compress the index we used the local Bernoulli model [38] with the Golomb code [20] to represent the gaps between consecutive transactions (in our case denoted by their TIDs). Each inverted list is prefixed by an integer f_t to allow the calculation of the Golomb parameter b; these integers f_t are represented using γ coding. The final indexes had a size of about 20% of the original file, so, given the memory available, the index can quite easily be held in and accessed from main memory. Since our goal was not the level of compression achieved but rather how fast the data could be indexed and compressed, we preferred a less complex method. There is certainly a tradeoff between the complexity of the method and the final size of the inverted file, but since we already obtain very small inverted files we saw little incentive in investing in more elaborate compression methods. In Fig. 6 we show the time needed to build and compress the indexes for the datasets of Table 9, all with 100,000 transactions. In Fig. 7 we show the time needed for building and compressing the index for three datasets under transaction scale-up (from 100,000 up to 1,000,000 transactions). These datasets had sizes ranging from 15.4 MB for the smallest one (T5.I2.D250K) to 224 MB for the largest one (T20.I6.D1000K).

7.3. Response times

Since we wish our system to function similarly to a search engine and to exhibit on-line behavior, the time needed to evaluate each query should be as low as possible; the users cannot wait too long in order to get a recommendation. Our system has essentially the same performance as any web search engine, so the answer to any query is almost instantaneous or is given within acceptable times.
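For illustration, here is a minimal sketch of the gap compression described in Section 7.2: Golomb coding of the d-gaps, with the parameter b chosen per list by the local Bernoulli model (b ≈ 0.69 · N / f_t, following [20,38]). The bit strings are kept as Python strings purely for readability, and the helper names are ours.

```python
from math import ceil, log2

def golomb_parameter(num_transactions, f_t):
    """Local Bernoulli model: b is roughly 0.69 * N / f_t for a list of length f_t."""
    return max(1, int(0.69 * num_transactions / f_t))

def golomb_encode(gap, b):
    """Golomb code of a positive gap: unary quotient, then truncated-binary remainder."""
    q, r = (gap - 1) // b, (gap - 1) % b
    unary = "1" * q + "0"
    if b == 1:
        return unary                      # no remainder bits needed
    k = ceil(log2(b))
    cutoff = (1 << k) - b                 # remainders below this use k-1 bits
    if r < cutoff:
        return unary + format(r, "b").zfill(k - 1)
    return unary + format(r + cutoff, "b").zfill(k)

def compress_list(tids, num_transactions):
    """Compress an ascending TID list as Golomb-coded d-gaps."""
    b = golomb_parameter(num_transactions, len(tids))
    gaps = [tids[0]] + [y - x for x, y in zip(tids, tids[1:])]
    return b, "".join(golomb_encode(g, b) for g in gaps)

print(compress_list([100, 105, 130, 199], num_transactions=1000))
```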
Fig. 6. Time requirements for index construction and compression.
7.4. Evaluating quality

The qualitative evaluation of our approach proved a very tricky task, since no technique used in the literature up to now could be applied in our case. More specifically, one option would be to evaluate it against some traditional approaches (e.g. Apriori, DIC, DHP etc.) in terms of running time, resources required and output produced. However this was immediately rejected, since we would be comparing two completely different logics with different functioning and different outcomes. The second option that came to mind was evaluation using measures from Information Retrieval such as recall or precision. However this was also rejected, since in our case it presented insurmountable problems.
Fig. 7. Time needed to build and compress the index for transactions scale-up.
The final thought was to use a test audience to evaluate the recommendations made by our system. However, as noted before, this would unavoidably involve a great amount of subjectivity, since plain itemsets provide no clear yardstick of what is relevant and what is not. Also, since data mining is by definition the discovery of hidden knowledge, it would be illogical to ask users to evaluate the recommendations made. They could probably judge and evaluate the obvious answers, but would certainly be helpless with the more valuable unexpected ones. It is fairly easy to judge a recommendation that suggests milk and cereals together, but how can one decide whether a recommendation that puts together diapers and beer is correct or not? The solution was finally given by using a test audience after all, but one with a completely different role: we used the test audience only to make transactions rather than to evaluate them. For the specific test we used two sets of people of 50 persons each, composed of equal numbers of females and males. Both sets consisted of people aged 18–22, all studying at the same university and in the same department. We did not impose any other restrictions when forming the groups (for example on financial status), since we wanted them to be as representative as possible of the actual audience of a real store. All of them were technology savvy, quite familiar with computers and the Internet. We chose to build a fictitious web store selling cell phones and accessories. The products, their names and the prices used in our web store were all real. The reason we chose this specific product category was that almost 100% of the people in our test audience owned a cell phone and/or had visited a store selling them. The test was implemented as follows. We first let both sets of people make some buys from the store so that we could gather enough data; these buys were made without any product recommendations to the users. Then we had one set of people making buys with recommendations produced by a classic algorithm (e.g. Apriori) and the second set with recommendations produced by our approach. The users were unaware of the system used, as well as of the subject of our experiment, so that they would make all buys uninfluenced. They only knew that they would have to make some buys, and that they had a specific amount of money to spend every time as well as in total. The amount of money that every user had to spend was not determined arbitrarily, but came from actually asking the users what amount of money they would be willing to spend, or had spent in the past, in order to buy such products. From their answers we calculated an average amount of money which we "granted" to the users to spend. The users were free to browse through all available products at leisure, visiting each product page as many times as desired, before deciding to actually buy a product. The goal was to check, after having accumulated a number of sales, which solution would generate the largest net profits. The outcome was that the solution that suggested itemsets to the users using Boolean queries increased the net profits by almost 10%, while the solution that suggested itemsets using ranked queries increased the net profits by a little over 15%.
So we can conclude that our system indeed provides better suggestions to its users, since our main objective of increasing the net profits was accomplished.

7.5. Discussion on evaluation

In our evaluation we decided not to use the relevance feedback or the pseudo-relevance feedback optimizations, not because they are unsuccessful in capturing user preferences, but because
as noted before they present specific drawbacks that in practice render them rather impractical (even in web search engines relevance feedback is a rarely offered feature). First of all, users usually do not wish to be bothered by feedback systems; they just want to get the right answers, and as soon as they do they simply focus on using those answers rather than evaluating them. Even if the answer returned is no good, they are either too annoyed by it or too anxious to find the right answer to waste any time providing the system with feedback [14]. Of course this problem could be tackled with pseudo-relevance feedback (PRF), but either of the two methods raises a new, more serious problem: a system using any form of relevance feedback adds considerably to its complexity, leading to better answers but to overall sluggish performance [14]. In this kind of application we cannot sacrifice performance for quality, since the users are not as interested in getting better answers as they are in a web search engine or a text database. In commercial applications the use of such methods would most probably also raise serious privacy concerns on the customers' side. Unlike a search at a web search engine or at a public library, where users feel much more concealed, while visiting a retail store they feel more vulnerable and have strong inhibitions about their personal data. So despite the fact that the use of such data for data mining purposes is not illegal, it nevertheless implies a high degree of risk for the reputation of a company. As the managers and IT directors of various retail firms we came into contact with put it: we might be practising data mining methods, but there is no need to remind the customers of it all the time. Finally, as noted in Section 7.4, data mining is by definition the discovery of hidden, implicit and potentially useful knowledge. Moreover, unlike web pages or documents returned by classic Information Retrieval systems, products, or more generally itemsets, provide no clear yardstick of what is relevant and what is not. It is illogical to ask a user who has just bought beer, and to whom the system has suggested diapers, to rate how relevant this suggestion was.

As far as the final form of the evaluation technique is concerned (i.e. the net profits produced), apart from being the best solution we could think of, it also fulfilled another important requirement. According to Kleinberg et al. [23], a pattern in the data is interesting only to the extent to which it can be used in the decision-making process of an enterprise to increase utility; so the utility of extracted patterns (such as association rules and correlations) in the decision-making process of an enterprise can only be addressed within the microeconomic framework of the enterprise. In our opinion there is no better way of confirming this than the actual net profits each solution manages to generate. The way current algorithms treat itemsets and extracted patterns as mere statistical probabilities and Boolean variables is rather a simplification of a complex problem, which we managed to address differently. While trying to determine the net profit each product would generate we came up against various problems. Finding the sale price of a product was a task that could easily be addressed with a simple visit to a store (physical or electronic) selling such products.
However, the sale price of a product is one thing and the profit we make from it is another. Also, as noted before, an expensive product is not always a profitable product and a cheap product is not always an unprofitable one. So we had to make some assumptions. First we assumed that companies buy the products they sell on average 20% cheaper than they actually sell them. We also assumed an average 4% cost that includes product handling and inventory costs. Product handling costs refer to the costs associated with the physical handling of products, while inventory costs include the financial costs of stocking the items and the costs of re-stocking, which are a function of replenishment
frequency and the lead-time of the orders. We therefore settled on a 16% net profit on the sale price of every item (e.g. for a product sold at 100, the purchase cost is 80 and the handling and inventory costs amount to 4, leaving a net profit of 16). We found that most of the people actually made all of their buys in a "one stop" fashion (i.e. they spent all of their available money at once). We also observed that almost all people followed the same buying pattern, choosing a very expensive main product (i.e. the cell phone) and the cheapest possible remaining products (i.e. the accessories).
8. Conclusions and future work

In most cases, adopting a technique from one field and using it in another leads to sub-optimal solutions. In our case, however, the techniques we borrow from the field of Information Retrieval not only lead us to an overall good solution but also overcome some of the problems they presented when traditionally used with text files. The main reasons for this are the following. In text collections, and especially for ranked queries, one serious problem lies in the nature of the words themselves: we face problems like polysemy, synonymy etc., which make it difficult both to formulate the right query and to return the most relevant answers. In our case these problems do not exist, since the separation of the items as well as their identification is 100% accurate (e.g. by their unique bar code). So when we issue a query regarding a specific itemset we are sure that we are referring to that specific itemset. The answers to any given query are thus much more robust and unaffected by the queries used, even when we have very few terms (itemsets). Also, the number of 1-itemsets in a transaction database is far smaller than the number of distinct words in a typical text database; fewer 1-itemsets means fewer itemset combinations and in general less complexity.

Another problem solved concerns the case where more than one person wants to access and mine the collected data, especially if they are not at the same location. One solution would be to replicate the data where needed and let each user mine it with his own parameters, or to let each user set his parameters remotely on one instance of the data and wait for the result. However, the sizes of today's databases as well as legal implications make any thought of distributing them or allowing access to them prohibitive. If, on the other hand, every interested user runs a classic algorithm with his own parameters on the original database, then we would unavoidably place a large burden on our database system and drastically degrade its performance. The size of an index created from a database, as well as its ease of use, make it possible either to distribute it or to use it for remote access by the various users who wish to mine the database, without having to access the database itself.

Managing updates in the database is yet another very important and challenging task. According to [27] it is essential to collect a sufficient amount of sales data (say, over the last 30 days) before we can draw meaningful conclusions from them. Of course this period depends on various parameters, such as the size of a company or organization and the sector where it operates, as well as the nature of the application implemented. For example, in sectors such as telecommunications, at companies such as Wal-Mart where incredible amounts of data are generated and collected every day, or in applications such as credit card fraud detection where a possible fraud must be identified as fast as possible, it might be essential to mine the data very frequently. Using any of the classic data mining algorithms we would have to run through the whole
data, old and new, from the beginning. If one also takes into consideration that we might need to make multiple runs of the same algorithm in order to tune the required parameters (minimum support and minimum confidence) and get the desired output, then the resources required become excessive. Some solutions have been proposed so that not all the data have to be mined from the beginning but only the new data, such as the work in [16], its generalization in [17], or the more recent work in [39]. Our solution to this problem is simple yet very efficient, and is implemented using either a stop-press file for handling queries over the newly arrived data or a more effective method such as the one proposed in [42].

In the future we would like to compare the behavior of our approach against previous approaches. The system that we propose recommends itemsets to its users by taking into consideration valuable information that no other system did. The final solution is very efficient, requiring minimal resources for its operation. Another possibility would be the proposal and evaluation of a different relevance measure, apart from the cosine measure, specially designed for the case of market basket data. It might also be of interest to verify experimentally whether the performance of our approach as a classic data mining algorithm (i.e. the batch module in Section 5) is better than that of previous algorithms. Finally, we would like to address more formally the procedure for assigning weights to all the itemsets in a database.
Acknowledgments

Research of the first author (I.N. Kouris) is supported by the European Social Fund (ESF), Operational Program for Educational and Vocational Training II (EPEAEK II), HERAKLEITOS Program. We acknowledge the reviewers for their valuable comments.

References

[1] K. Aas, A survey on personalized information filtering systems for the world wide web, Report no. 922, Norwegian Computing Center, December 1997.
[2] D. Achlioptas, F. McSherry, Fast computation of low rank matrix approximations, in: Proceedings of ACM STOC 2001, pp. 611–618.
[3] C.C. Aggarwal, P.S. Yu, Online generation of association rules, in: Proceedings of the 14th International Conference on Data Engineering (ICDE98), Orlando, Florida, USA, February 1998, pp. 402–411.
[4] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, May 1993, pp. 207–216.
[5] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Databases (VLDB94), Santiago, Chile, September 1994, pp. 487–499.
[6] Y. Azar, A. Fiat, A. Karlin, F. McSherry, J. Saia, Spectral analysis of data, in: Proceedings of ACM STOC 2001, pp. 619–626.
[7] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
[8] R.J. Bayardo, R. Agrawal, Mining the most interesting rules, in: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 1999, pp. 145–154.
[9] J. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, Madison, WI, Morgan Kaufmann, July 1998, pp. 43–52.
[10] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, May 1997, pp. 255–264.
[11] C. Buckley, G. Salton, J. Allan, The effect of adding relevance information in a relevance feedback environment, in: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 292–300.
[12] C. Buckley, G. Salton, J. Allan, A. Singhal, Automatic query expansion using SMART: TREC 3, in: Proceedings of the Third Text Retrieval Conference, NIST, 1994.
[13] C.H. Cai, A.W.-C. Fu, C.H. Cheng, W.W. Kwong, Mining association rules with weighted items, in: Proceedings of the 1998 International Database Engineering and Applications Symposium (IDEAS98), Cardiff, Wales, UK, July 1998, pp. 68–77.
[14] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2001.
[15] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, R. Raghavan, S. Rajagopalan, A. Tomkins, Spectral filtering for resource discovery, in: ACM SIGIR 98 Workshop on Hypertext Analysis.
[16] D.W. Cheung, J. Han, V. Ng, C.Y. Wong, Maintenance of discovered association rules in large databases: an incremental updating technique, in: Proceedings of the 1996 International Conference on Data Engineering, New Orleans, Louisiana, February 1996, pp. 106–114.
[17] W.L. Cheung, S.D. Lee, B. Kao, A general incremental technique for maintaining discovered association rules, in: Proceedings of the Fifth International Conference on Database Systems for Advanced Applications (DASFAA97), Melbourne, Australia, March 1997, pp. 185–194.
[18] W. Frakes, R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, New Jersey, 1992.
[19] D. Goldberg, D. Nichols, B.M. Oki, D. Terry, Using collaborative filtering to weave an information tapestry, Communications of the ACM 35 (1992) 61–70.
[20] S.W. Golomb, Run-length encodings, IEEE Transactions on Information Theory 12 (3) (1966) 399–401.
[21] U. Hanani, B. Shapira, P. Shoval, Information filtering: overview of issues, research and systems, User Modeling and User-Adapted Interaction 11 (2001) 203–259.
[22] W. Hill, L. Stead, M. Rosenstein, G. Furnas, Recommending and evaluating choices in a virtual community of use, in: Proceedings of CHI95, 1995, pp. 194–201.
[23] J. Kleinberg, C. Papadimitriou, P. Raghavan, A microeconomic view of data mining, Data Mining and Knowledge Discovery 2 (4) (1998) 311–324.
[24] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, J. Riedl, GroupLens: applying collaborative filtering to Usenet news, Communications of the ACM 40 (3) (1997) 77–87.
[25] B. Liu, W. Hsu, Y. Ma, Mining association rules with multiple minimum supports, in: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 1999, pp. 337–341.
[26] B. Liu, W. Hsu, Y. Ma, Pruning and summarizing the discovered associations, in: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 1999, pp. 125–134.
[27] J.-S. Park, M.-S. Chen, P.S. Yu, An effective hash based algorithm for mining association rules, in: Proceedings of the ACM SIGMOD Conference on Management of Data, San Jose, CA, May 1995, pp. 175–186.
[28] M. Persin, Document filtering for fast ranking, in: W. Croft, C. van Rijsbergen (Eds.), Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, Dublin, Ireland, July 1994, pp. 339–348.
[29] M. Persin, J. Zobel, R. Sacks-Davis, Filtered document retrieval with frequency-sorted indexes, Journal of the American Society for Information Science 47 (10) (1996) 749–764.
[30] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: an open architecture for collaborative filtering of netnews, in: Proceedings of CSCW 94, Chapel Hill, NC, 1994, pp. 175–186.
[31] P. Resnick, H.R. Varian, Recommender systems, Special issue of Communications of the ACM 40 (3) (1997) 56–58.
[32] G. Salton, M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[33] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large databases, in: Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995, pp. 432–443.
[34] U. Shardanand, P. Maes, Social information filtering: algorithms for automating word of mouth, in: Proceedings of CHI 95, Denver, CO, 1995, pp. 210–217.
[35] H. Toivonen, Sampling large databases for finding association rules, in: Proceedings of the 22nd International Conference on Very Large Databases (VLDB96), Mumbai, India, September 1996, pp. 134–145.
[36] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen, H. Mannila, Pruning and grouping of discovered association rules, in: Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases, Heraklion, Crete, Greece, April 1995, pp. 47–52.
[37] G.I. Webb, Efficient search for association rules, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 99–107.
[38] I. Witten, A. Moffat, T. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, second ed., Morgan Kaufmann, San Francisco, 1999.
[39] S. Zhang, C. Zhang, X. Yan, PostMining: maintenance of association rules by weighting, Information Systems 28 (7) (2003) 691–707.
[40] G.K. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley, Boston, MA, 1949.
[41] J. Zobel, A. Moffat, Exploring the similarity space, ACM SIGIR Forum 32 (1) (1998) 18–34.
[42] J. Zobel, A. Moffat, R. Sacks-Davis, Storage management for files of dynamic records, in: M.E. Orlowska, M. Papazoglou (Eds.), Proceedings of the Fourth Australian Database Conference, Brisbane, World Scientific, Singapore, February 1993, pp. 26–38.
[43] http://www.amazon.com
[44] http://www.ebay.com
[45] http://www.egghead.com
Ioannis N. Kouris received a 5-year B.E. from the University of Patras, Department of Electrical Engineering and Computer Technology, in 1999 and an M.Sc. in Decision Sciences with specialization in E-commerce from the Athens University of Economics and Business in 2000. Currently he is a Ph.D. candidate at the University of Patras, Department of Computer Engineering and Informatics. He is a member of the research staff of the Department of Computer Engineering and Informatics at the University of Patras, and a member of Research Unit 5 at the Computer Technology Institute (CTI). He teaches computer programming at the Technological Educational Institute of Patras. His major research interests include data mining, web mining, collaborative filtering and information retrieval.
Christos H. Makris was born in Greece, 1971. He graduated from the Department of Computer Engineering and Informatics, School of Engineering, University of Patras, in December 1993. He received his Ph.D. degree from the Department of Computer Engineering and Informatics, in 1997. Today he works as an Assistant Professor in the Department of Applied Informatics in Management and Finance, Technological Educational Institute of Mesolonghi, and as a research assistant in the Computer Technology Institute. His research interests include Data Structures, Computational Geometry, Data Bases and Information Retrieval. He has published over 30 papers in various journals and refereed conferences.
Athanasios K. Tsakalidis is a computer scientist and Professor at the University of Patras. Born 27.6.1950 in Katerini, Greece. Studies: Diploma of Mathematics, University of Thessaloniki, 1973; Diploma of Informatics, 1980, and Ph.D. in Informatics, 1983, University of Saarland, Germany. Career: 1983–1989, researcher at the University of Saarland; he has been a student and collaborator (for 12 years) of Prof. Kurt Mehlhorn (Director of the Max-Planck Institute of Informatics in Germany). 1989–1993 Associate Professor and since 1993 Professor in the Department of Computer Engineering and Informatics of the University of Patras. 1993–1997, Chairman of the same Department. 1993–today, member of the Board of Directors of the Computer Technology Institute (CTI); 1997–today, Coordinator of Research and Development of CTI. He is one of the contributors to the writing of the "Handbook of Theoretical Computer Science" (Elsevier and MIT Press, 1990). 2001–today, Chairman of the above mentioned Department. He has published many scientific articles, with a special contribution to the solution of elementary problems in the area of data structures. Scientific interests: Data Structures, Computational Geometry, Information Retrieval, Computer Graphics, Data Bases, and Bio-Informatics.