Exploratory data analysis leading towards the most interesting simple association rules

Computational Statistics & Data Analysis 52 (2008) 3269 – 3281 www.elsevier.com/locate/csda Exploratory data analysis leading towards the most intere...

Download PDF

3MB Sizes 0 Downloads 58 Views

Report

PDF Reader
Full Text

Computational Statistics & Data Analysis 52 (2008) 3269 – 3281 www.elsevier.com/locate/csda

Exploratory data analysis leading towards the most interesting simple association rules Alfonso Iodice D’Enzaa , Francesco Palumbob,∗ , Michael Greenacrec a Dipartimento di Matematica e Statistica, Università di Napoli Federico II, Italy b Dipartimento di Istituzioni Economiche e Finanziarie, Università di Macerata, Via Crescimbeni, 20 I-62100 Macerata, Italy c Department of Economics and Business, Universitat Pompeu Fabra, Barcelona, Spain

Available online 16 October 2007

Abstract Association rules (AR) represent one of the most powerful and largely used approaches to detect the presence of regularities and paths in large databases. Rules express the relations (in terms of co-occurrence) between pairs of items and are deﬁned in two measures: support and conﬁdence. Most techniques for ﬁnding AR scan the whole data set, evaluate all possible rules and retain only rules that have support and conﬁdence greater than thresholds, which should be ﬁxed in order to avoid both that only trivial rules are retained and also that interesting rules are not discarded. A multistep approach aims to the identiﬁcation of potentially interesting items exploiting well-known techniques of multidimensional data analysis. In particular, interesting pairs of items have a well-deﬁned degree of association: an item pair is well deﬁned if its degree of co-occurrence is very high with respect to one or more subsets of the considered set of transactions. © 2007 Elsevier B.V. All rights reserved. Keywords: Association rules; Multidimensional data analysis; Binary variables

1. Introduction A multidimensional data analysis (MDA)-based strategy for the study of association in transactional data bases is proposed. In particular, the MDA-based strategy aims to detect the interesting patterns of co-occurrence among the considered items, in order to increase the useful applicability of association rules (AR) mining. Elder and Pregibon (1996) indicated the guidelines for statistical perspective in knowledge discovery. They stated that a great deal of work goes into identifying, gathering, cleaning, and labeling the data, into specifying the question(s) to be asked of it, and into ﬁnding the right way to view it (literally and ﬁguratively) to discover useful patterns. An AR in its simplest form involves a pair of items playing different roles: the antecedent part of the rule (body), and the consequent part of the rule (head). The relation characterizing the considered items is usually expressed in two different measures: support and conﬁdence. The support is the intensity of the association between the considered items; the conﬁdence measures the strength of the logical dependence expressed by the rule. An example illustrates an AR in its simplest form: let A and B be a pair of items, beer and chips for example, the following notation represents the achieved AR: A → B = {support = 20%, conﬁdence = 80%}. The rule shows that 20% of customers buy both beer and chips, and if a customer buys beer, in 80% of cases buys chips too. Simple rules do not represent the whole output of a mining process, ∗ Corresponding author.

E-mail address: [email protected] (F. Palumbo). 0167-9473/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2007.10.006

3270

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

since complex AR are extracted too. In its most general deﬁnition a complex AR is a rule in which there are one or more antecedent items and one or more consequents. Since the proposal by Agrawal et al. (1993) most contributions in the AR framework have been oriented to implementing more efﬁcient algorithms, even in case of massive amounts of data. Today the computational effort to mine rules represents a minor “bottleneck”. However, there is a large amount of generated rules that makes any selection criterion inefﬁcient. The risk is that the truly interesting information is not detected and that the presence of trivial and redundant rules cancels out the effort produced in the AR generations. Many of contributions aiming to select interesting AR have come from the statistics and computer science domains and are characterized by different approaches: selection and visualization of minedAR, as well as deﬁnitions of measures of AR interestingness. In order to ﬁnd interesting patterns and regularities hidden by the redundant and uninteresting rules, the proposed methods of AR grouping and pruning operate an a posteriori reduction in the number of generated rules. The general aim of the present proposal is to detect the best deﬁned co-occurrence relations. In other words, the aim is to detect item pairs that strongly characterize even subsets of the whole set of considered transactions. The strategy for achieving this task consists of three main phases: (1) determine homogeneous groups of binary sequences; (2) select interesting items through the comparison of global and local structures; and (3) assign AR roles to the selected items. The method does not produce AR but it provides the user with a priori information about the most potentially interesting items. Focusing the attention only on those items that the procedure indicated as interesting, to obtain the AR set, it is necessary to apply an AR mining algorithm. Reducing the number of considered items, the user can deﬁne looser thresholds, avoiding as well as the risk of a huge amount of rules. In addition, information about the items help the user to pay greater attention to the rules containing the previously identiﬁed items. The proposed strategy exploits the graphical and analytical capabilities of MDA, in particular the paper will focus on the computational aspects when n and p are large. The present paper is structured as follows: in Section 1 AR are deﬁned, then the proposed strategy is introduced. In Section 2 data structures and their formalism are introduced; in Section 3 the usage of MDA techniques in AR context is motivated, and the steps and the main features of the exploratory strategy are described; clustering of transaction and the item heuristics selection criteria are presented and described in Section 4, while the last phase of the strategy and the graphical representations are described in Section 5. In particular, we propose a strategy based on the correspondence analysis (CA) of the conﬁdence matrices of the K determined groups. The last two sections of the paper are characterized, respectively, by an application of the procedure to a real world data set and by conclusion and perspectives. The reference point among the AR mining algorithms is the Apriori algorithm. Introduced by Agrawal and Srikant (1994), this algorithm is oriented to the mining of Boolean AR. The Apriori procedure consists of two phases: the itemset mining and the properly deﬁned association rules mining. The aim of the ﬁrst phase is to mine all the frequent itemsets: a set of items is frequent if its support exceeds a global minimum threshold. In the second phase those rules are mined that are related to the frequent itemsets characterized by a conﬁdence exceeding a global minimum threshold. A further important feature of the Apriori algorithm is the exploitation of the downward-closed property of the support: any subset of a frequent itemset is frequent too and, similarly, any superset of infrequent itemset is also infrequent. This property reduces the search space of the rules at each data base scan. Several Apriori optimizations have been proposed, such as the AprioriTid and the AprioriHybrid by (Agrawal and Srikant, 1994). 2. Basic notation and deﬁnitions A transaction database can be represented, through an algebraic formalization, as a binary matrix having transactions on rows, items on columns. The starting data structure Z of transactions is an n×p presence/absence matrix characterized by n binary vectors considered with respect to p Boolean variables (e.g.: buy or not buy). The support matrix S=n−1 ZT Z is a symmetric contingency table, whose general extra-diagonal term sjj = sj j represents the support with respect to the items j and j . Indexes j and j vary in 1, . . . , p. The main diagonal of S carries the supports related to each single item. The conﬁdence matrix C is deﬁned as C = n−1 ZT ZD−1 , where D is a diagonal matrix such that the general term djj = sj (j = 1, . . . , p). It is square and asymmetric as the general extra-diagonal term cjj corresponds to the conﬁdence of the rule {Aj → Aj }, with j = 1, . . . , p. As a consequence, all terms on the main diagonal of C are equal

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

3271

to one. Since in the AR context the number n of statistical units is often very large, it is possible to interpret both sjj and cjj in terms of probability estimation, according to the frequentist deﬁnition of probability. The general element ˆ j ∩ Aj ) that is the approximation Pˆ of the probability that the of S, the support, can be thus interpreted: sjj = P(A events Aj andAj occur together. In the same way, it is possible to interpret cjj , the element of the conﬁdence matrix, ˆ j |Aj ) expressing the estimation of the probability that the item j occurs, given the occurrence of the as cjj = P(A item j. Since cjj represents a conditional probability, the table C is square and asymmetric. The clustering phase, the ﬁrst one of the strategy, produces a row-wise partition of Z in Z1 , Z2 , . . . , ZK , where K is the number of chosen clusters. For each Zk , the corresponding Sk and Ck matrices are computed in the same way as S and C. In the MDA context, different proposals in the literature extend well-known methods like CA (Greenacre, 2000, 2007) and multidimensional scaling (MDS) to square asymmetric tables (Bove, 1992). A common aspect of these methods is in the decomposition of the asymmetric table into two components, as proposed by Gower (1977): symmetric and skew-symmetric. Applying this decomposition to a table C, it results: C = Cs + Csk , where Cs = 1/2(C + CT ) represents the symmetric component of C, and Csk = 1/2(C − CT ) represents the skew-symmetric component of C. The separate analyses of the two components lead to representations of the symmetric and skew-symmetric characteristics of table C. 3. MDA-based exploratory strategy MDA comprises a set of techniques aiming to describe the relations characterizing sets of statistical units, which are observed on several variables of any nature. It is possible to roughly separate MDA techniques into factorial analysis and cluster analysis: in both cases, the n statistical units described by p variables can be considered as points of a p-dimensional (or n-dimensional) space. The factorial methods aim to ﬁnd a projection of the points on an orthogonal subspace, where each dimension is a linear combination of the original variables. Cluster analysis is an unsupervised procedure aiming to partition the statistical units in groups, maximizing the homogeneity within the groups, as well as the separation between each group the other. Use of MDA techniques on the data structures Z, S and C is straightforward, but it is not free from complications due to the dimensions of the structures (n and p can be very large), the squareness of S and C and the asymmetry of C. The MDA-based strategy performs both a row-wise and a columnwise reduction of the data structures. In particular, the clustering of transactions determines a partition of Z into K submatrices with a reduced number of rows. The factorial analysis of matrices C aims to represent the co-occurrence structure underlying data in a low-dimensional space: the starting variables are linearly combined to achieve summarization. Further aspects to take into account are due to transaction data. Dealing with large, high-dimensional and sparse data sets, classic clustering techniques like K-means algorithm (MacQueen, 1967) and agglomerative algorithm require very high computational costs and they do not guarantee reliable solutions. In the literature there are many contributions proposing algorithms optimized for massive amounts of binary data. In particular, two non-hierarchical clustering algorithms for binary data are light weight clustering (LWC) and incremental K-means proposed by Gaber et al. (2004) and Ordonez (2003), respectively: the main aim of both these procedures is clustering of binary data streams, which are ﬂows of data. The procedures are then single pass: they require a unique data scan to achieve the clustering solution. A fast clustering algorithm is useful and effective in determining homogeneous groups of transactions. After the ﬁrst phase (clustering), the attention is focused on the support and conﬁdence matrices of order p × p related to each group. Recalling that p is the number of considered items, in the real world transaction databases it is always the case that p>n. In the literature, other approaches exploit clustering techniques to discover non-obvious AR (Plasse et al., 2007). The idea is that the non-trivial information regarding those itemsets whose support is non-signiﬁcant with respect to the whole database, but it expresses a high co-occurrence relation with respect to subgroups of transactions. In other words, a pattern is considered interesting if it represents a well-deﬁned association between items. If a combination of items occurs in almost all the considered transactions, it is more likely to contain trivial information; if a co-occurrence pattern strongly characterize an homogeneous group of transactions, then it may reveal previously unknown and potentially useful information. After the selection of item pairs, the last task to achieve in the proposed strategy is to assign the roles of antecedent and consequent to the selected interesting items, as will be described in Section 5. This phase is based on a suitable factorial analysis of group conﬁdences matrices, and it requires short computing time and low memory usage.

3272

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

4. Statistical criteria for the selection of interesting items The ﬁrst step of the proposed strategy is implemented exploiting the features of incremental K-means clustering, that is a non-hierarchical algorithm characterized by doing a single data scan to get the partition of the rows of the binary matrix Z in K disjoint classes. The logical dissimilarity of the binary rows of Z is measured by the Jaccard coefﬁcient. The incremental K-means takes as input the number of clusters, which is user-deﬁned. The initialization phase of incremental K-means differs from the standard K-means by using global statistics (mean and variance) of the input indicator matrix to obtain the starting centers. This is not the only distinguishing feature of incremental K-means, the reader is referred to Ordonez (2003) for further details about the procedure. In the second step it is necessary to identify the interesting items through a global/local comparison of the item co-occurrence structure. This is a crucial step since item selection involves the risk to select trivial items as well as discarding interesting items. Item selection is based on a double criterion: in particular, we look for items having a different degree of occurrence in the groups and among them, items characterized by a different support in the groups and in the whole database. The ﬁrst aspect is based on the concept of conditional independence: pairs of items having a different degree of association in the K different groups are to be considered interesting. This is because items having a constant degree of association in each of the considered groups determine trivial and/or redundant relations. The problem is, in other words, to study the interaction of the K-level factor “group” on the co-occurrence relation of each pair of items. As proposed by Goodman (1964), the hypothesis of no interaction can be formalized as H0

1 = k with k = 2, 3, . . . , K,

(1)

where k = 11k 22k /12k 21k . Table 1 shows the kth stratum of this 3-way matrix: the element s11k corresponding to sjj k , represents the frequency of the combination of outcomes 1 for the binary variables j and j (items). The different strata of the table correspond to the previously determined K groups: then the marginal totals in each table are ﬁxed. With ﬁxed marginal totals, all the cells except sjj k are not interesting, as they represent the remaining combinations of outcomes and can be computed as the differences between sjj k and the ﬁxed margins. The condition in (1) can be veriﬁed according to the following statistic: 2 K 2 K d w − d w k k k k=1 k k=1 (2) X2 = K k=1 wk estimating k by dk = s11k s22k /s12k s21k and wk = (dk2 ∗ j,j sjj k )−1 . Under the null hypothesis (1) the statistic X 2 above has a chi-square distribution with K − 1 degrees of freedom. The paper proposes to use the X 2 statistic as a heuristic criterion to rank the pairs of items. For each pair of item, the higher the X2 value, the more the degree of co-occurrence of the items varies within the K groups. The use of a chi-square-based ﬁlter for item selection aims to identify item pairs with a non-constant degree of association within the groups: this is the ﬁrst feature characterizing interesting items. The latter ﬁlter, focusing on the comparison of local and global associations, considers the support of each selected item pair within each group and with respect to the whole data set. The support of an item pair (j, j ), with j = j and j, j = 1, . . . , p, with respect to the kth group (k = 1, . . . , K), can be expressed in terms of probability approximation as follows P((Aj ∩ Aj )|Gk ); remembering that the total support for the item pair (j, j ) is P(Aj ∩ Aj ). We need to determine, for each item pair, whenever or not there is difference between the global degree of association and the local one. However, a direct comparison of the two probabilities would not make any sense: a common way to compare two proportions consists of the odds ratio between Table 1 Contingency table kth stratum

Aj Not Aj

Aj

Not Aj

sjj k ...

... ...

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

3273

the global and local supports. In terms of probabilities of success (items j and j ), we have jj |k = P((Aj ∩ Aj )|Gk ) and jj = P(Aj ∩ Aj ), respectively, for local and global proportions. Then the odds for global and local supports are, respectively, jj |k =jj |k /(1−jj |k ) and jj =jj /(1−jj ), with corresponding odds ratio jj |k =jj |k /jj .The item pairs selected by the chi-square-based ﬁlter are characterized by a high degree of association in one or more of the considered groups; the odds-based ﬁlter detects the item pairs whose global support differs considerably from the K corresponding local supports. In both cases it is possible to rank the item pairs in descending order, selecting in this way a user-deﬁned proportion, say q, of the most interesting items. 5. Item roles in the rules: CA approach The last phase of the strategy aims to identify the roles of antecedent and consequent items and to represent the co-occurrence relations using a version of CA on each Ck , k = 1, . . . , K. CA (Benzécri, 1973; Greenacre, 1984, 2007) is an MDA technique for the visualization of frequency tables. The application of CA to square asymmetric Ck requires special treatment because of their particular data structure. We refer the reader to Greenacre (1984) for a deﬁnition of CA. The aim in this case is to detect the off-diagonal term differences and assign the roles to the items. The standard application of CA to the conﬁdence matrices would not be suitable, owing to the strong inﬂuence of values on the diagonal of each square matrix Ck . The identiﬁcation of roles of antecedent or consequent for each of the considered item pair (Aik , Aj k ), where i and j vary in [1, 2, . . . , p], is based on the comparison between P (Aik |Aj k ) and P (Aj k |Aik ). These values correspond to the off-diagonal cells in each conﬁdence matrix Ck , hence the aim is to identify the highest differences between cij and cj i . However, the differences are very low if compared to the diagonal values, since cij and cj i take values in the interval [0, 1] and cii = 1 = max ∀i ∈ [1, . . . , p]. In order to ease the notation, hereinafter we refer to C as the general group conﬁdence deﬁned (previously indicated as Ck ). The ones on the main diagonal of C are irrelevant to the analysis: it is possible to emphasize the off-diagonal term differences through an iterative alternating procedure according to Daunis et al. (2002). The matrix C is decomposed into a symmetric component Cs and skew-symmetric component Csk (Greenacre, 2000) and then treated to replace the diagonal elements, as follows: • decomposition C in Cs and Csk ; • CA of Cs ; and • reconstruction of the main diagonal of Cs through the general reconstruction formula: ⎛ ⎞ F npij = nr i rj ⎝1 + −0.5 if jf ⎠ , f

(3)

f =1

with i, j = 1, . . . , p and f = 1, . . . , F . F is the number of considered factors, p is the number of considered items, if is the principal coordinate of the ith item on the factor f; ri and rj represent the row margins of Cs (note that the row and column margins are the same here, since Cs is symmetric); • comparison of the reconstructed diagonal with the previous main diagonal for Cs : if there is no difference then the whole matrix C∗s is reconstructed using formula 3; else Cs is updated with the obtained diagonal and repeat the previous steps. Each of the previously selected items is considered as an antecedent or consequent part of interesting rule depending on its deviation from symmetry: items having a positive deviation are considered interesting heads, the remaining items are then the bodies. The reconstructed symmetric component C∗s is analyzed to obtain a factorial representation of the co-occurrence structures of the considered items. 6. Example The aim of the present section is to describe the application of the described procedure to a real world data set, in order to show key aspects of the usage and interpretation of the exploratory study of association. The data set considered here is the BMS-Webview-2 that was used in KDD-cup 2000 competition and is available at KDD-cup

3274

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

Fig. 1. Classiﬁcation quality versus number of classes.

2000 home page1 . The data set of transactions of an e-commerce company has already been used in many data mining applications. However, since the approach presented here is complementary to the computer science-based proposals, a direct comparison of the results would not make any sense. In the BMS-Webview-2 data set each statistical unit represents the click-streams of a single session of a visitor in the e-commerce web-site; each item corresponds to a single product. The raw data set is characterized by 77512 web click-streams and 3340 product-pages (items). In a pre-processing phase, product-pages never visited are discarded as well as web-sessions with a very low number of visited pages. Groups of homogeneous transactions are obtained by incremental K-means (see Section 4): because there is no a priori information about the number of clusters characterizing the data set structure, the clustering algorithm is applied with different numbers of clusters. Fig. 1 shows the quality of the clustering solution plotted versus the number of clusters: it is evident in the ﬁgure that the best solution is obtained for K = 5. The quality of the solution is measured as the reciprocal of the mean distance between each unit and its related cluster centroid: quality = [((1/n) ni=1 (ti , j th centroid))−1 ], where is the dissimilarity computed between each transaction ti and the corresponding cluster centroid. Once groups of homogeneous transactions have been obtained, the one-to-one association between items are analyzed in order to identify the item pairs of higher interest. At this stage the user can directly choose the proportion of item pairs to consider as interesting, thus keeping the amount of output under control. The ﬁrst criterion for the selection of items is based on chi-square, as described in Section 4. A simple but straightforward representation of the X2 values for the different item pairs is provided in Fig. 2. For each of the determined groups, the item pairs are further ranked according to the corresponding odds ratio values: the aim in this case is to detect different global/local degrees of co-occurrence. The conﬁdence matrices Ck relating to each of the K groups are processed as explained in Section 5: we obtain in this way K factorial maps representing the co-occurrence structure underling the items within the group. Furthermore, through the application of the procedure in Section 5, we obtain the reconstructed one-to-one conﬁdences. Then the factorial maps are snapshots of the association characterizing the items in each one group; the interestingness of the simple (one-to-one) rules is determined through the IS measure proposed by Tan et al. (2002). The IS formula for a pair of items Aj and Aj is IS(Aj ,Aj ) = I(Aj ,Aj ) × f11 /n with I(Aj ,Aj ) = P (Aj , Aj )/P (Aj )P (Aj ) being the interest factor and f11 is the number of co-occurrences of the considered items. The IS measures has many useful properties: it is the combination between support and interest factor, and it takes into account of both signiﬁcance and interestingness of a pattern. In order to provide an immediate representation of the rules with highest IS, we simply used a bar diagram. The results of the analysis are then visualized in different 1 URL: http://www.ecn.purdue.edu/KDDCUP.

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

3275

Fig. 2. Colormap-wise representation of the X2 values.

manners: the graphical displays are of immediate interpretation; however, summary tables containing X 2 , odds-ratio, support, conﬁdence for each of the selected pairs are provided. In a visual data mining approach, it is possible to retrieve all the information available for an item pair by selecting it from the display (either factorial or bar-graph). In particular, the left side of Fig. 3 shows the factorial map obtained analyzing the reconstructed symmetric component of the conﬁdence matrix C of group 5. Points close on the map represent items with a high degree of co-occurrence: this means that the set of close items is present within the same transactions. The highlighted area denotes itemsets that could generate interesting complex rules. The representation shows global item co-occurrences: the lack of points being close to axes center means that just few of them are in all of the transactions belonging to the considered group. The right side of Fig. 3 displays the bar-chart of the most interesting item pairs in group 5: the user can choose, for example, the item pairs to consider on the basis of the maximum decrease of interestingness. Another aspect that must be remarked is that, in group 5, six of the 10 ﬁrst ranked item pairs have the same head, the item whose tid is “55267”. This means that, within the group 5, the item “55267” is a good consequent item. The remaining groups are described in Figs. 4–7. In the factorial display related to group 1 it is interesting to notice a set of points nearby the upper left corner of the map: the points within the outlined area in Fig. 4 (left side) are close, and in the same way they are far from the center of the map and from most of the other points as well. With respect to the bar-graph representation of group 1, the vertical line on the right side of Fig. 4 indicates the points of considerable decrease in IS values. In group 2 the factorial display shows the following situation: there are two groups of items, covering most of the items in the group. The two areas (indicated in Fig. 5, left side) are opposite with respect to the center of the map. Such representation of the structure indicates that in group 2, there are two different itemsets which have a higher degree of co-occurrence, then we may expect within this group that two complex rules will represent the structure. This situation seems to be conﬁrmed in the bar-graph representation in 5: with respect to one-to-one relations, there are no large differences in

3276

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

Fig. 3. Factorial plan and interestingness bar-chart representations of group 5.

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

Fig. 4. Factorial plan and interestingness bar-chart representations of group 1.

3277

3278

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

Fig. 5. Factorial plan and interestingness bar-chart representations of group 2.

IS values. This is due to the fact the association pattern characterizing the group concerns more than two items a time. The situation in group 3 is quite dissimilar from group 2: the points/items are spread out on the map, meaning that the association patterns involve small sets of items (see left side Fig. 6). In this case the bar-graph shows that the ﬁrst

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

Fig. 6. Factorial plan and interestingness bar-chart representations of group 3.

3279

3280

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

Fig. 7. Factorial plan and interestingness bar-chart representations of group 4.

A. Iodice D’Enza et al. / Computational Statistics & Data Analysis 52 (2008) 3269 – 3281

3281

ranked item pairs have higher values of IS. The last analyzed group represent a situation which is more or less the average of groups 2 and 3, with a structure similar to that of group 1. 7. Conclusion and perspectives Since the ARs were introduced for the analysis of large (and huge) databases, most of the computational aspects, in term of speed and memory, have been successfully solved. However, it is still an open problem how to interpret the massive output. Setting a priori threshold to mine rules, the user is not able to predict the number of rules that are actually mined: in the present approach the user sets the proportion of interesting items to consider, gaining the possibility to control the level of output. Future enhancements of the proposed approach are in several different directions. From a computational point of view, the aim is to improve the classiﬁcation step in order to obtain higher quality solutions in less time. Another important aspect is the generalization of the selection criteria from pairs of items to pairs of itemsets, or rather to generalize the procedure to complex rules. Furthermore, according to the exploratory nature of the procedure, it is necessary to improve the visualization tools. The enhancement of the graphical aspects is in terms of both friendly interpretation for non-expert users and interactivity user/display, under the modern visual data mining paradigm. Acknowledgment We would like to thank our anonymous referees for their comments and remarks that let us hopefully improve the quality of the paper. References Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Bocca, J.B., Jarke, M., Zaniolo, C. (Eds.), Proceedings of the 20th International Conference on Very Large Data Bases, VLDB. Morgan Kaufmann, Santiago, Chile, pp. 487–499. Agrawal, R., Imielinski, T., Swami, A., 1993. Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (Eds.), ACM SIGMOD International Conference on Management of Data, vol. 22. ACM Press, Washington, DC, pp. 207–216. Benzécri, J., 1973. L’analyse des Données. Dunod. Bove, G., 1992. Asymmetric multidimensional scaling and correspondence analysis for a square table. Statist. Appl. 4, 587–598. Daunis, J., Aluja, T., Thió, S., 2002. Anàlisi de matrius quadrades no simètriques: un enfocament integral usant anàlisi de correspondències. QUESTIIÓ (Quaderns d’Estadstica i Investigació Operativa). Elder, J., Pregibon, D., 1996. A statistical perspective on knowledge discovery in data bases. Adv. Knowledge Discovery Data Mining, 83–113. Gaber, M.M., Krishnaswamy, S., Zaslavsky, A., 2004. Cost-efﬁcient mining techniques for data streams. In: ACS, (Ed.), Proceedings of the Second Workshop on Australasian Information Security, Data Mining and Web Intelligence, and Software Internationalisation. Australian Computer Society Inc., Dunedin, New Zealand, pp. 109–114. Goodman, L., 1964. Simple methods for analyzing three-factor interaction in continency tables. J. Amer. Statist. Assoc. 59 (306), Gower, J., 1977. The Analysis of Asymmetry and Orthogonality. Recent Developments in Statistics. North-Holland, Amsterdam. Greenacre, M.J., 1984. Theory and Applications of Correspondence Analysis. Academic Press, London. Greenacre, M.J., 2000. Correspondence analysis of square asymmetric matrices. Appl. Statist. 49 (3), 297–310. Greenacre, M.J., 2007. Correspondence Analysis in Practice, second ed. Chapman & Hall, CRC, London, Boca Raton, FL. MacQueen, J., 1967. Some methods for classiﬁcation and analysis of multi-variate observations. In: Cam, L.M.L., Neyman, J. (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, Berkeley, CA. Ordonez, C., 2003. Clustering binary data streams with k-means. In: In Association for Computing Machinery (ACM) Special Interest Group: Management of Data (SIGMOD) Workshop on Research Issues on Data Mining and Knowledge Discovery. ACM Press, San Diego, CA, pp. 10–17. Plasse, M., Niang, N., Saporta, G., Villeminot, A., Leblond, L., 2007. Combined use of association rules mining and clustering methods to ﬁnd relevant links between binary rare attributes in a large data set. Comput. Statist. Data Anal. 52 (1), 596–613. Tan, P., Kumar, V., Srivastava, J., 2002. Selecting the right interestingness measure for association patterns. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, Edmonton, Canada, pp. 32–41.

Exploratory data analysis leading towards the most interesting simple association rules

Exploratory data analysis leading towards the most interesting simple association rules

Recommend Documents