Breaking Symmetries in Association Rules

Breaking Symmetries in Association Rules

Available online at www.sciencedirect.com Available online at www.sciencedirect.com Available online at www.sciencedirect.com ScienceDirect Procedi...

316KB Sizes 8 Downloads 122 Views

Available online at www.sciencedirect.com Available online at www.sciencedirect.com

Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 00 (2019) 000–000 Procedia Computer Science 14800 (2019) 283–290 Procedia Computer Science (2019) 000–000

www.elsevier.com/locate/procedia www.elsevier.com/locate/procedia

Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018) Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018)

Breaking Breaking Symmetries Symmetries in in Association Association Rules Rules

a c a Fatima-Zahra El Mazourib,∗ b,∗, Said Jabboura , Badran Raddaouic , Lakhdar Saisa , Fatima-Zahra El Mazouri , Said Jabbour , Badran Raddaoui , Lakhdar Sais , b b Mohammed Chaouki Abounaimab , Khalid Zenkouarb , Khalid Zenkouar Mohammed Chaouki Abounaima a CRIL-CNRS, Universit´e d’Artois, F-62307 Lens Cedex, France

a CRIL-CNRS, Universit´ b Faculty e d’Artois, F-62307 Cedex, of Sciences and Technology, P.O. BoxLens 30000, F`es, France Morocco b Faculty of Sciences and Technology, P.O. Box 30000, F` c SAMOVAR, Morocco T´el´ecom SudParis, CNRS, Univ. Paris-Saclay,es, Evry, France c SAMOVAR, T´ el´ecom SudParis, CNRS, Univ. Paris-Saclay, Evry, France

Abstract Abstract In this paper, we propose an extension of the framework proposed in [1] for breaking symmetries in association rules problems. In this paper, propose an extension of the framework proposed in [1] forthe breaking in association rules Symmetries are we defined as permutations between items that leave invariant set of symmetries the transactions. Such kind of problems. structural Symmetries are defined as permutations between that leaveclasses invariant the set of theitemsets. transactions. Such kind of structural knowledge induces a partition of the search space items into equivalent of symmetrical Our proposed approach aims knowledge induces a partition of the search spacethe into equivalent classes of symmetrical proposeddetected approach aims show that symmetries can be exploited to reduce search space of associations rules by itemsets. the use ofOur symmetries before. show that symmetries can be exploited to reduce the search space of associations rules by the use of symmetries detected before. Firstly, recall the symmetry discovery in transaction databases. Secondly, we propose how symmetries can be broken as a preproFirstly, the symmetry discovery in transaction databases. Secondly, we propose howfrom symmetries can bedatasets brokencontain as a preprocessing recall step. Our experiments clearly show that several association rules instances taken the available such cessing step.We Ouralso experiments clearly showevidence that several rules instances reduces taken from the available datasets contain such symmetries. provide experimental that association breaking such symmetries the size of the output on some families symmetries. We also provide experimental evidence that breaking such symmetries reduces the size of the output on some families of instances. of instances. c 2019  2019 The The Authors. Authors. Published Published by by Elsevier Elsevier B.V. B.V. © c 2019  The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND This is an open access article under the CC BY-NC-ND license license (http://creativecommons.org/licenses/by-nc-nd/3.0/) (https://creativecommons.org/licenses/by-nc-nd/4.0/) This is an open access article under the CC BY-NC-ND license of (http://creativecommons.org/licenses/by-nc-nd/3.0/) the Second International Conference on Intelligent Computing in Peer-review under responsibility of the scientific committee Peer-review under responsibility of the scientific committee of the Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018). Data Sciences (ICDS 2018). Keywords: Data Mining; Association rules; Symmetries Keywords: Data Mining; Association rules; Symmetries

1. Introduction 1. Introduction Extracting association rules from transactional databases have received intensive research since its introduction by Extracting association from transactional have received intensive research extended since its introduction by Rakesh Agrawal et al. in rules [2]. Initially referring todatabases data analysis, it have been successfully to other appliRakesh Agrawal et al. in [2]. Initially referring to data analysis, it have been successfully extended to other application domains have been, including bioinformatics, medical diagnosis, networks intrusion detection, web mining, cation domains haveand been, including medical diagnosis, networks intrusion web mining, documents analysis, scientific databioinformatics, analysis. This broad spectrum of applications enableddetection, association analysis to documents analysis, and scientific data analysis. This broad spectrum of applications enabled association analysis to be applied to a variety of datasets, including sequential, spatial, and graph-based data. Interestingly, association patbe applied to a variety of datasets, including sequential, spatial, and graph-based data. Interestingly, association patterns are now considered as a building block of several other learning problems such as classification, regression, and terns are now considered as a building block of several other learning problems such as classification, regression, and clustering. clustering. ∗ ∗

Corresponding author. Tel.: +212-619-630-386. Corresponding Tel.: +212-619-630-386. E-mail address:author. [email protected] E-mail address: [email protected] c 2019 1877-0509  The Authors. Authors. Published Published by by Elsevier Elsevier B.V. 1877-0509 © 2019 The B.V. c 2019 1877-0509  Thearticle Authors. Published by B.V. This isisan access under the CC licenselicense (http://creativecommons.org/licenses/by-nc-nd/3.0/) This anopen open access article under the BY-NC-ND CCElsevier BY-NC-ND (https://creativecommons.org/licenses/by-nc-nd/4.0/) This is an open access article under license Peer-review under responsibility of the the scientific committee of(http://creativecommons.org/licenses/by-nc-nd/3.0/) the Second International ConferenceConference on Intelligent Data Sciences Peer-review under responsibility of CC theBY-NC-ND scientific committee of the Second International onComputing IntelligentinComputing in Peer-review responsibility Data Sciences (ICDS 2018). of the scientific committee of the Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018).under (ICDS 2018). 10.1016/j.procs.2019.01.052

284 2

Fatima-Zahra El Mazouri et al. / Procedia Computer Science 148 (2019) 283–290 Author name / Procedia Computer Science 00 (2019) 000–000

Symmetry between items is another kind of structural information that might be recovered from the transaction databases and used to further reduce the size of the output by limiting the search space. Symmetries are a fundamental concept in computer science, mathematics, physics and many other domains. Many human artifacts (e.g. classroom in a university, aircraft seats, circuit patterns) and entities in nature (e.g. plants, molecules, DNA sequences, atoms) exhibits symmetries. Such symmetries allow us to reason and to understand more complex entities and systems. For example, this kind of structures has been widely exploited in constraint satisfaction (CSP) and propositional satisfiability (SAT) problems. As far as we know, symmetry reasoning has been first introduced by Krishnamurthy [3] to shorten resolution proofs in propositional logic. He shows that a resolution proof system augmented with the symmetry rule is more powerful than resolution. In [4], the authors showed how dynamic symmetry detection and elimination can lead to significant improvements of several automated deduction techniques. In constraint satisfaction problems, a weaker form of symmetry called value interchangeability is first introduced by E. Freuder in [5], while variable and value symmetry are studied in [6]. Symmetry breaking using additional constraints has been proposed independently by James Crawford in the context of first order [7] and propositional [8] logic theories and by JeanFranc¸ois Puget in CSP [6]. More recent contributions clearly show that symmetry remains an active research area in both CSP and SAT. In the context of data mining and particularly for the itemset mining problem, symmetries have been studied in limited papers . Let us mention two related works addressing a particular case of general symmetries considered in this paper [9, 10]. Indeed, this restricted form of symmetry, called pairwise items symmetry, is obtained by exchanging two items, while leaving the remaining items unchanged. This is clearly a restriction of the general symmetry principle. In [9], an efficient ZBDD algorithm for detecting pairwise symmetries is proposed and the property of symmetric items in transaction database are discussed. Pairwise items symmetries, called clone items, have been proposed by Medina and Nourine [11] to explain why sometimes, the number of rules of a minimum cover of a relation is exponential in the number of items of the relation. More recently, in the context of constraint programming for k-pattern set mining, Guns et al. [12] used symmetry breaking constraints to impose a strict ordering on the patterns in the pattern set. Said Jabbour and al. [1] proposed to use the symmetry breaking in the context of itemsets mining as a preprocessing step. In this paper, we propose an extension of such work for association rules. We prove that such symmetries can be also be used in this context to reduce the search space by reducing the transactions database by eliminating some items. The rest of this paper is organized as follows: in ”Preliminary definitions and notations” section, we showed an overview of association rules, in the second section we gave some definitions about ”symmetry in frequent itemset mining”, then the section 4 describes ”the symmetry detection in transaction databases ”. The experiments and the obtained results are introduced in ”experimental evaluation” section. In the last section, we concluded by summarizing the work done in this study and we cited the expected future works. 2. Preliminary definitions and notations 2.1. Association Rules Let Ω be a finite non empty set of symbols, called items. From now on, we assume that this set is fixed. We use the letters a, b, c, etc to range over the elements of Ω. An itemset I over Ω is defined as a subset of Ω, i.e., I ⊆ Ω. We use 2Ω to denote the set of itemsets over Ω and we use the capital letters I, J, K, etc to range over the elements of 2Ω . A transaction is an ordered pair (i, I) where i is a natural number, called transaction identifier, and I an itemset, i.e., (i, I) ∈ N × 2Ω . A transaction database D is defined as a finite non empty set of transactions (D ⊆ N × 2Ω ) where each transaction identifier refers to a unique itemset. Given a transaction database D and an itemset I, the cover of I in D, denoted C(I, D), is defined as {i ∈ N | (i, J) ∈ D and I ⊆ J}. The support of I in D, denoted S upp(I, D), corresponds to the cardinality of C(I, D), i.e., S upp(I, D) = |C(I, D)|. An itemset I ⊆ Ω such that S upp(I, D)  1 is a closed itemset if, for all itemsets J with I ⊂ J, S upp(J, D) < S upp(I, D). Example 1. For instance, let us consider the transaction database D depicted in Table 1. We have C({c, d}, D) = {1, 2, 3, 4, 5} and S upp({c, d}, D) = 5 while S upp({ f }, D) = 3. The itemset {c, d} is closed,



Fatima-Zahra El Mazouri et al. / Procedia Computer Science 148 (2019) 283–290 Author name / Procedia Computer Science 00 (2019) 000–000

285 3

while { f } is not since S upp({ f }, D) = S upp({c, d, f }, D). tid 001 002 003 004 005 006

Transactions A B F A B F B C F B D F E A B E C D

Name r1 r2 r3

Asso. Rules {A} → {F} {F} → {C} {A} → {B}

Support 1 2/6 1/6

Confidence 1 2/4 1/3

Table 2. Some association rules

Table 1. A Transaction Database D

In this work, we are interested in the problem of mining association rules (MAR). An association rule is a pattern of the form X → Y where X (called the antecedent) and Y (called the consequent) are two disjoint itemsets. In MAR, the interestingness predicate is defined using the notions of support and confidence. The support of an association rule X → Y in a transaction database D, defined as S upp(X → Y, D) = S upp(X∪Y,D) , determines how often a rule is |D| applicable to a given dataset, i.e., the occurrence frequency of the rule. The confidence of X → Y in D, defined as Con f (X → Y, D) = S Supp(X∪Y,D) upp(X,D) , provides an estimate of the conditional probability of Y given X. When there is no ambiguity, we omit to mention the transaction database D , and we simply note S upp(X → Y) and Con f (X → Y). A valid association rule is an association rule with support and confidence greater than or equal to the minimum support threshold (minsupp) and minimum confidence threshold (minconf) respectively. More precisely, given a transaction database D, a minimum support threshold minsupp and a minimum confidence threshold minconf, the problem of mining association rules consists in computing MAR(D, minsupp, mincon f ) = {X → Y | X, Y ⊆ Ω, S upp(X → Y, D)  minsupp, Con f (X → Y, D)  mincon f } Table 2 illustrates some association rules with their corresponding supports and confidences. For instance, S upp({A} → {B}) = 36 and Con f ({A} → {B}) = 1. Definition 1 (Association Rules Problem). The association rules problem consists in computing the following set: AR(D, λ, γ) = {X → Y | S upp(X → Y) ≥ λ, Con f (X → Y) ≥ γ} Definition 2 (Transaction Renaming). Let D be a transaction database. A renaming f over Tid (D) is a bijective mapping from Tid (D) to Tid (D). We can extend a renaming f to D as follows: f (D) = {( f (tid), I)|(tid, I) ∈ D}. 3. Symmetry in Frequent Itemset Mining Definition 3 (Permutation). A permutation σ over I is a bijective mapping from I to I. Let D be a transaction database. We can extend a permutation σ to D as follows: σ(D) = {(tid, σ(I))|(tid, I) ∈ D} where σ(I) = {σ(a)|a ∈ I}. We denote by P(I) the set of all the permutations over I and ◦ is the composition operation over the elements of P(I). It is easy to see that (P(I), ◦) is a group where the identity permutation is a neutral element. (S  , ◦) is a sub-group of the group (P(I), ◦) if and only if (S  , ◦) is a group and S  ⊆ P(I). A generating set of the group (S , ◦) is a subset Σ of S such that every element of the group can be obtained from the combination of finitely many elements of Σ and their inverses, and we write S =< Σ >.

286 4

Fatima-Zahra El Mazouri et al. / Procedia Computer Science 148 (2019) 283–290 Author name / Procedia Computer Science 00 (2019) 000–000

Each permutation σ can be represented by a set of cycles c1 . . . cn where each cycle ci = (a1 , . . . , ak ) is a list of elements of I such that σ(a j ) = a j+1 for j = 1, . . . , k − 1, and σ(ak ) = a1 . In the sequel, for clarity reasons, we omit cycles of the form (a, a) in the description of permutations and symmetries,. Definition 4 (Symmetry). Let D be a transaction database. A symmetry of D is a permutation σ ∈ P(I) such that there exists a transaction renaming f over Tid (D) where σ(D) = f (D) i.e. f −1 (σ(D)) = D. Example 2. Let us consider again the transaction database given in Table 2. σ = (A, B) is a symmetry because σ(D) = f (D) where f is a transaction renaming defined as follows:           f (x) =         

003 001 002 004 x

if x=001 if x=003 if x=004 if x=002 otherwise

Let S(D) be the set of all the symmetries. One can see that (S(D), ◦) is a sub-group of (P(I), ◦). Definition 5. Let D be a transaction database. Two items a, b ∈ Iitems (D) are symmetric if there exists a symmetry σ of D such that σ(a) = b. 4. Symmetry Detection in Transaction Databases In this section, we describe symmetry detection in transaction databases. One of the most popular means of discovering symmetries of a problem is to first convert the problem into a graph, and employ a general-purpose graph symmetry tool to uncover the symmetries [13]. These symmetries can then be reflected back into the original problem. The oldest and most established graph symmetry program is McKays Nauty [14]. Most of the available symmetry detection tools convert the original problem into a colored undirected graph, where vertices are labeled with colors. Such colored vertices are considered when searching for automorphism on the graph (i.e. vertices with different colors can not be permuted to each other). Definition 6. A colored undirected graph is a triplet G = (V, E, η) with vertex set V and edge set E ∈ V 2 and η is a function from V to N that associates a positive integer (a color) to each vertex. Definition 7. A permutation σ of V is a symmetry of G iff σ(G) = G or equivalently σ(E) = E. Definition 8. A symmetry σ of G respects a partition π of V if for each v ∈ V, v and σ(v) belong to the same cell of π. The set of all symmetries of G with respect to a partition π is called the automorphism group of G under π and is denoted A(G)π . We now show how a transaction database can be encoded as a colored undirected graph. Definition 9 (From D to G). Let I be a set of items and D a transaction database over I. We define the colored undirected graph G associated to D as G(D) = (V, E, η) where V = I ∪ Tid (D), E = {(tid, i)|∃(tid, I) ∈ D, i ∈ I} and ∀v ∈ V, η(v) =



0 1

if v ∈ I otherwise v ∈ Tid (D)

Let us note that π = {{v|v ∈ V, η(v) = 0}, {v|v ∈ V, η(v) = 1}} is the initial partition of V. Several refinements have been proposed to improve the partition π. The goal of these refinements is to reduce the search space by distinguishing as much as possible non symmetric vertices using vertex invariant such as vertex degree (e.g. [15]).



Fatima-Zahra El Mazouri et al. / Procedia Computer Science 148 (2019) 283–290 Author name / Procedia Computer Science 00 (2019) 000–000

287 5

The conversion of the transaction database D given in Example 1 to a colored undirected graph G(D) is depicted in Figure 1. For simplicity, we use the first letter of each item name in lower case (e.g. e for Eggs). In the figure we distinguish two kinds of nodes, items represented by circles (color 1) and transaction identifiers represented by rectangles (color 2). t

s

001

m

a

002

o

c

003

p

e

004

b

Fig. 1. From transaction database D to colored graph G(D)

With this construction, there is a one to one mapping between symmetries in the transaction database and the automorphisms of the colored undirected graph. Given G(D) and an initial partition π, the goal is to compute A(G(D))π . Using Nauty [14], one can find the automorphisms that leave G(D) invariant. For example, A = (A, B)(C)(D)(E)(F) [(001, 003)(002, 004)(005)(006)] is an automorphism of G(D). The corresponding symmetry σ is obtained from A by projection on the set of items I of D. 4.1. Symmetry Breaking In this section, we propose to extend the approach of [1] to reduce the search space of possible interesting association rules by detecting and breaking symmetries between itempairs. We only consider the particular case of the symmetries with binary cycles in our description. Moreover, we fix an order  on I and we denote by min(I), with I an itemset, the least item of I following . Proposition 1. Let σ be a symmetry of D, and X → Y a valid association rule. Then, σ(X) → σ(Y) is also a valid association rule. Using Proposition 1, it is clear that also the set of association rules can be partitioned into equivalence classes. To allow to keep a representative rule from each equivalence class, we follow the approach of [1] to remove items from D. Proposition 2. Let σ = (x1 , y1 ) . . . (xn , yn ) be a symmetry of D and T a transaction of D. If yi ∈ T and ∀, k < i, xk  T , then yi can be removed from T . Proposition 2 allows to simplify the original transaction database by removing some items according to some conditions. Let us show an example where Itempair-Based Symmetry Breaking approach eliminates items in the transaction database. Let D be a transaction database of Example 1. We have σ1 = (A, B) σ2 = (C, D) are two symmetries of D. Using Proposition 2, the new transactions database is tid 001 002 003 004 005 006

Transactions A C F A F C F F E A B E C D

288 6

Fatima-Zahra El Mazouri et al. / Procedia Computer Science 148 (2019) 283–290 Author name / Procedia Computer Science 00 (2019) 000–000 Dataset Dataset1 Dataset2 Dataset3 Dataset4 Dataset5 Dataset6 Dataset7

|D| 100 778 6110 7822 10000 60808 100

|I| 1006 3506 13680 17266 26445 27141 27151

|Iocc| 1020 4089 30886 40506 200102 246817 199869

%dens 0.099% 0.011% 0.014% 0.023% 0.075% 0.073% 7.37%

time(s) 1.29 1.67 2.64 2.89 3.87 7.72 8.5

|Irem| 894 2774 10577 7520 8330 13396 13032

#sym 899 2642 8371 9415 9283 9074 11607

Table 3. Details and results on simulated datasets

5. Experimental evaluation Our experimental studies are carried out on both real, public and simulated datasets widely used in the data-mining community. We perform different experiments to show how symmetry breaking can be achieve improvements in association rules mining. In our study, we carried out a series of three experimentations using different datasets: 1. Simulated datasets: In this experimentation, we use the well-known IBM itemset data generator to generate data with different features (such as dataset size, density, etc.). Note that this generator is open source and publically available1 . We generated different datasets with different parameters regarding the number of transactions, average number of items per transaction, number of different items, etc. In Table 3, we provide the features of each simulated dataset. 2. Public datasets: The datasets used in this evaluation are from the well-known itemset FIMI repository 2 . However, the datasets from FIMI are mostly simulated and not high dimensional. Therefore, we used S IDO1, a dataset containing pharmacology real data form the Causality challenge3 . This dataset contains massive data of 12678 transactions with 4932 items. 3. Real datasets: In this experimentation, we used real data regarding intrusion detection alerts. The raw data is collected from real and recent alert log files produced by Snort intrusion detection system (IDS) 4 monitoring a university campus network. These alert logs represent three months activity. The alert log files are preprocessed into alert windows. Each alert window represents 1 hour of alerts in dataset Alert1h and 8 hours in dataset Alert8h. In the obtained datasets, each transaction represents the alerts triggered during an alert window. In our experiments, we exploit Saucy5 . It is originally proposed in [13] and significantly improved in [15]. The latest version of Saucy outperforms all the existing tools by many orders of magnitude, in some cases improving runtime from several days to a fraction of a second. In the sequel, we note, |D| (resp. |I|) the number of transactions (resp. different items) in the dataset D. |Iocc| denotes the total number of item occurrences in D, while % dens represents the density of the input transaction dataset. time(s) denotes the cumulated time in seconds required to search and break symmetries. |Irem| represents the number of removed item occurrences from the initial dataset. #sym denotes the number of symmetries discovered in the dataset. Finaly, I set and I sset corresponds to the number of frequent itemsets obtained on the transaction dataset before and after breaking symmetries respectively. 5.1. Results on simulated datasets Table 3 provides details on the generated datasets and the results of symmetry breaking approach. Table 3 clearly shows the significant improvements (in terms of the number of removed item occurrences) that can be performed as the number of symmetries grows. For instance, in experimentation on Dataset3 with 6110 transactions, the number of symmetries is 8371. The second point to mention is that on the generated datasets, the 1 2 3 4 5

IBM Quest Market-Basket Synthetic Data Generator: http://sourceforge.net/projects/ibmquestdatagen/ Frequent Itemset Mining Implementations Repository: http://fimi.ua.ac.be/ http://www.causality.inf.ethz.ch/challenge.php http://snort.org Saucy2: Fast symmetry discovery http://vlsicad.eecs.umich.edu/BK/SAUCY/



Fatima-Zahra El Mazouri et al. / Procedia Computer Science 148 (2019) 283–290 Author name / Procedia Computer Science 00 (2019) 000–000 #sym 6 10 16 22 26

|I| 900 5586 37350 148471 311850

|Irem| 421 3711 32482 128471 289847

|I set| 1015830 17467891 2679034 1462982 22156

λ 5 90 375 900 1450

289 7

|I sset| 4370 12952 467 1232 33

Table 4. Comparison of the number of removed item occurrences and output size in the number of discovered symmetries. Dataset Mushroom Retail BMSWebView2 BMSPOS SIDO1

|D| 8124 88162 77512

|I| 186852 908576 3341

|Iocc| 198169866 4106009 358278

%dens 13% 0.005% 0.14%

|Irem| 1208 89 4

#sym 11 217 10

515597

1658

3367020

0.4%

3

16

9297

4926

4514145

9.85%

568

190

Table 5. Details and results on public datasets Dataset Alert1h Alert8h

|D| 22203 123

|I| 167 167

|Iocc| 27049 3453

%dens 0.73% 16.81%

|Irem| 86 248

#sym 23 19

Table 6. Details and results on the alert dataset

number of symmetries is important in datasets with low densities. In Table 4, we provide results pointing out the correlation between the size of the preprocessed datasets (in terms of item occurrences ) and the size of outputs (in terms of frequent itemsets) with the number of discovered symmetries. Note that this experiment is carried out on datasets we simulated specifically to involve interesting symmetries. To compare the output size, we use the SPMF 6 [16] association rules mining that isa multi-purpose data mining toolkits, implemented in Java, which incorporate a rich collection of data mining algorithms. For pure and closed association rules. One can see for instance in Table 4 that as the number of symmetries in the dataset increases, the size of the preprocessed dataset is significantly reduced. In particular, in simulated dataset, when the number of symmetries equals 26, the item occurrences is reduced from 311850 to 66272 (ensuring a reduction rate of 78%). Note also that the output size (see the number of frequent itemsets |I set| before preprocessing and the number frequent itemsets |I sset| on the preprocessed dataset) is significantly reduced on all the datasets of Table 4. The output size is computed for each dataset using the same support λ. 5.2. Results on public datasets In Table 5, we provide details on the used datasets and the obtained results. The results of Table 5 show that in the tested public datasets, symmetries can be found. For instance, the S IDO1 dataset contains 190 symmetries allowing to remove 568 item occurrences from the initial dataset. However, the number of symmetries is not as important as in some simulated datasets of Table 3. Also, the number of removed item occurrences is not significant, consequently the size of the outputs in terms of the number of frequent itemsets is not significantly reduced. 5.3. Results on real datasets In Table 6, we find details on the used alert datasets and the results of our approach. In Table 6, we see that our datasets regarding real intrusion detection alerts contain some symmetries. More importantly, on the dataset Alert8h, the size of the output in terms of the number of frequent itemsets (with a support λ=60) is decreased by a ratio of 50%. The two datasets have different densities and the number of removed items does not depend only on the number 6

SPMF: http://www.philippe-fournier-viger.com/spmf/

Fatima-Zahra El Mazouri et al. / Procedia Computer Science 148 (2019) 283–290 Author name / Procedia Computer Science 00 (2019) 000–000

290 8

of discovered symmetries but also on the nature of the data itself. Let us summarize the experimental evaluation of symmetry framework for frequent itemset mining problems. From the obtained results, we can make several interesting observations: • Several datasets contain symmetries and are discovered by symmetry detection method in a reasonable amount of time. • The size of the output (number of frequent itemsets) can be significantly reduced by symmetry-based framework. • The reduction rate in terms of the number of eliminated item occurrences in the transaction database does not depend only on the number of found symmetries but also on their form. Indeed, some of the discovered symmetries are made of permutations of items in the same transactions, leading to no reduction using symmetry breaking approaches. 6. Conclusion and Future Works Symmetry is an important structural property widely exploited to reduce the search space of many combinatorial problems. In this paper, show that the framework proposed in itemsets mining [1] can be extended to association rules. Used as a preprocessing approach, breaking symmetries allows to remove items from a transaction database and consequently to speed up the search. The symmetry breaking proposed in this paper for the association rules problem opens interesting research directions in data mining in general. We plan to integrate the symmetry pruning in specialized approaches. References [1] S. Jabbour, L. Sais, Y. Salhi, K. Tabia, Symmetries in itemset mining, in: ECAI 2012 - 20th European Conference on Artificial Intelligence. Including Prestigious Applications of Artificial Intelligence (PAIS-2012) System Demonstrations Track, Montpellier, France, August 27-31 , 2012, 2012, pp. 432–437. [2] T. I. R. Agrawal, A. N. Swami, Mining association rules between sets of items in large databases, in: ACM SIGMOD International Conference on Management of Data, ACM Press, Baltimore, 1993, pp. 207–216. [3] B. Krishnamurthy, Shorts proofs for tricky formulas, Acta Informatica 22 (1985) 253–275. [4] B. Benhamou, L. Sas, Tractability through symmetries in propositional calculus, Journal of Automated Reasoning 12 (1) (1994) 89–102. [5] E. C. Freuder, Eliminating interchangeable values in constraint satisfaction problems, in: Proceedings of the 9th National Conference on Artificial Intelligence (AAAI’91), 1991, pp. 227–233. [6] J.-F. Puget, On the satisfiability of symmetrical constrained satisfaction problems, in: Proceedings of the 7th International Symposium on Methodologies for Intelligent Systems (ISMIS’93), 1993, pp. 350–361. [7] J. Crawford, A theorical analysis of reasoning by symmetry in first order logic, in: Workshop on Tractable Reasonning (AAAI’92), 1992, pp. 17–22. [8] J. Crawford, M. L. Ginsberg, E. Luck, A. Roy, Symmetry-breaking predicates for search problems, in: Proceedings of the Fifth International Conference on Principles of Knowledge Representation and Reasoning (KR’96), Morgan Kaufmann, 1996, pp. 148–159. [9] S.-i. Minato, Symmetric item set mining based on zero-suppressed bdds, in: DS’06, 2006, pp. 321–326. [10] R. Medina, C. Noyer, O. Raynaud, Efficient algorithms for clone items detection, in: CLA’05, 2005, pp. 70–81. [11] A. G´ely, R. Medina, L. Nourine, Y. Renaud, Uncovering and reducing hidden combinatorics in guigues-duquenne bases, in: Proceedings of the third International ConferenceFormal Concept Analysis (ICFCA’05), 2005, pp. 235–248. [12] T. Guns, S. Nijssen, L. de Raedt, k-pattern set mining under constraints, IEEE TKDE 99 (PrePrints). [13] F. A. Aloul, A. Ramani, I. L. Markov, K. A. Sakallah, Solving difficult instances of boolean satisfiability in the presence of symmetry, Transactions on Computer Aided Design. [14] B. D. McKay, Practical graph isomorphism, Congressus Numerantium 30 (1981) 47–87. [15] P. T. Darga, K. A. Sakallah, I. L. Markov, Faster symmetry discovery using sparsity of symmetries, in: Proceedings of the 45th Design Automation Conference (DAC’08), 2008, pp. 149–154. [16] P. Fournier-Viger, A. Gomaric, T. Gueniche, A. Soltani, C.-W. Wu, V. S. Tseng, Spmf: a java open-source pattern mining library, The Journal of Machine Learning Research 15 (1) (2014) 3389–3393.