
Accepted Manuscript

Conditional Discriminative Pattern Mining: Concepts and Algorithms

Zengyou He, Feiyang Gu, Can Zhao, Xiaoqing Liu, Jun Wu, Ju Wang

PII: S0020-0255(16)30986-0
DOI: 10.1016/j.ins.2016.09.047
Reference: INS 12543

To appear in: Information Sciences

Received date: 2 July 2015
Revised date: 7 July 2016
Accepted date: 19 September 2016

Please cite this article as: Zengyou He, Feiyang Gu, Can Zhao, Xiaoqing Liu, Jun Wu, Ju Wang, Conditional Discriminative Pattern Mining: Concepts and Algorithms, Information Sciences (2016), doi: 10.1016/j.ins.2016.09.047

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Conditional Discriminative Pattern Mining: Concepts and Algorithms

Zengyou He a,b, Feiyang Gu a, Can Zhao a, Xiaoqing Liu a, Jun Wu c, Ju Wang d,∗

a School of Software, Dalian University of Technology, Dalian, China.
b Key Laboratory for Ubiquitous Network and Service Software of Liaoning, Dalian, China.
c School of Computer and Information Science, Zunyi Normal College, Zunyi, China.
d School of Economics and Management, University of Electronic Science and Technology of China, Chengdu, China.

Abstract


Discriminative pattern mining is used to discover a set of significant patterns that occur with disproportionate frequencies in different class-labeled data sets. Although many algorithms have been proposed, the redundancy issue that the discriminative power of many patterns mainly derives from their sub-patterns has not yet been resolved. In this paper, we consider a novel notion dubbed conditional discriminative pattern to address this issue. To mine conditional discriminative patterns, we propose an effective algorithm called CDPM (Conditional Discriminative Pattern Mining) that generates a set of non-redundant discriminative patterns. Experimental results on real data sets demonstrate that CDPM performs very well at removing redundant patterns derived from significant sub-patterns, generating a concise set of meaningful discriminative patterns.


Keywords: Discriminative pattern, emerging pattern, contrast sets, subgroup discovery, data mining.

1. Introduction


Over the past decades, pattern mining has become one of the most important tasks in data mining. A pattern is considered significant if its significance value satisfies some user-specified conditions [1], [5], [7]. This challenging task has proven of great value in simplifying the search for potentially useful knowledge and in characterizing large complex data sets in concise and informative forms.

Discriminative pattern mining is a popular group of pattern mining techniques designed to discover a set of significant patterns that occur with disproportionate frequencies in different class-labeled data sets [14], [32], [10]. For example, the US census data from 1979 to 1992 show that P(occupation=sales | Ph.D.) = 2.7% while P(occupation=sales | Bachelor) = 15.8% [4]. Hence (occupation=sales) could be a discriminative pattern due to its significantly different proportions between the group with a Ph.D. degree and the group with a bachelor's degree. Similarly, biological molecules such as genes and proteins that are differentially expressed between patients and healthy persons are also examples of discriminative patterns.

Discriminative patterns can provide valuable insights into data sets with class labels by contrasting the characteristics of given classes and identifying humanly interpretable differences. Moreover, the discrimination information gained from discriminative patterns generally reveals the underlying regulation of labeled data and contributes to the solution of complex problems such as the construction of accurate classification models [9], [39], [41]. In addition, the practical value of discriminative patterns has been demonstrated by a wide range of studies in several domains including bioinformatics, medical science, and marketing management [32], [24], [34], [8].

Discriminative pattern mining has drawn widespread attention and many algorithms have been proposed in recent years. Some studies investigate discriminative patterns under other names such as contrast sets [5], emerging patterns [13], and subgroups [45]. However, these definitions are essentially equivalent since their target patterns can be used interchangeably, with the same ability to capture the differences between distinct classes [34], [36]. In this paper, we will refer to these patterns collectively as discriminative patterns.

∗ Corresponding author. Email address: [email protected] (Ju Wang)



The exploration of discriminative patterns generally includes two aspects: the frequency and the statistical significance. On one hand, the frequency of a pattern can be assessed by its support, which is defined as the percentage of transactions in one class that contain this pattern. A pattern is frequent if its support value is no less than a given threshold. On the other hand, the statistical significance of discriminative patterns can be measured by various test statistics such as the odds ratio [37], information gain [9], and chi-square [5]. A pattern is defined to be significant if its significance value generated from a certain statistical measure meets some user-defined condition, e.g., it is no more (or less) than a given threshold. Generally, any statistical measure that is capable of quantifying the differences between classes can be used, and the choice of statistical measure typically does not affect the overall performance of discriminative pattern discovery algorithms.

Recent studies have achieved relatively promising efficiency and effectiveness for the task of discriminative pattern mining. Overall, existing algorithms for discriminative pattern mining can be divided into two categories according to their ranking strategies: algorithms with user-specified thresholds and algorithms based on the order of statistical significance. Algorithms falling into the first category usually set some feasible thresholds on statistical measures to reduce the search space and filter out insignificant patterns [14], [16], [28]. Algorithms in the second category primarily aim at finding a certain number of patterns with the strongest discriminative power [45], [42], [44]. Most of these studies adopt a two-step procedure: first generate a set of candidate patterns (i.e., frequent patterns in one class); then perform a statistical significance test to measure their discriminative power and prune insignificant patterns [9], [31], [11], [46]. However, due to the exponential number of combinations of a large number of items, the candidate generation process can incur a high computational cost, while a proportion of the candidates may turn out to possess low discriminative power in the second stage. This issue has motivated methods that directly mine discriminative patterns in a single stage, without candidate generation [10]. In addition, the majority of the popular approaches investigate discriminative patterns by evaluating every individual pattern separately [14], [16], [28], [3]. The newly developed pattern set mining technique is concerned with finding a condensed set of significant and non-redundant patterns by measuring the overall discriminative power of each pattern set [42], [26], [12].

Despite the significant improvements of discriminative pattern mining algorithms, the discovery of discriminative patterns in class-labeled data sets is still a challenging and time-consuming task. One important issue that remains unsolved in discriminative pattern mining is the redundancy of the detected pattern sets. For a given data set, most existing algorithms are able to report a number of significant patterns according to their definitions, while how to reduce the redundancy of the output pattern set has not been well addressed. To date, most existing discriminative pattern discovery approaches usually generate many redundant patterns.
These patterns might possess similar discriminative power in the sense that they distinguish different class-labeled data sets by covering similar sets of transactions. As a result, each such pattern provides no additional discriminative insight compared to the other patterns. Though such patterns appear as different variants, they are essentially equivalent, possessing the same level of discriminative power across classes. Generating all these redundant patterns in the data mining procedure is quite time-consuming and makes it difficult to interpret the results. Accordingly, it is feasible to keep one or a few representative patterns out of all the equivalent ones, which reduces the degree of redundancy and increases the ease of understanding and using discriminative patterns in real-world applications. Overall, eliminating these redundant patterns is a critical task.

To address the redundancy problem, research attempts from different angles have been made. In general, the basic idea is to impose some constraints on the relevance between different patterns so as to reduce the redundancy level to some extent [42], [12], [22], [21]. The simplest strategy is to employ the closedness constraint during the pattern discovery process, which ensures that only closed discriminative patterns are selected [22], [6], [17], [29]. For binary labeled data sets (positive vs. negative), some studies adopt the concept of relevance to remove redundancy so as to report a set of non-redundant discriminative patterns [17], [20], [27]. They define one discriminative pattern to be irrelevant if it is dominated by another discriminative pattern in the sense that: (1) in the positive class, the set of transactions containing the dominating pattern is a superset of those containing the dominated pattern; (2) in the negative class, the transactions containing the dominating pattern are covered by those containing the dominated pattern. Such relevance-based approaches have been proven to be effective in the context of non-redundant discriminative pattern mining. However, this strategy does not fully address the issue of eliminating the redundancy caused by significant sub-patterns. In [30], one pattern is deemed to be non-redundant with respect to its sub-patterns if their confidence intervals of odds ratios do not overlap. Another feasible strategy is to apply some additional evaluation procedures


such as permutation tests in statistics [19], [18] to post-process discriminative pattern sets discovered by existing algorithms. One representative approach is to employ the "sequential permutation test" method [35] to reduce the effects of constituent sub-patterns when evaluating the statistical significance of discriminative patterns. However, the permutation test procedure is relatively complex and very time-consuming. To remove the influence of subsets of patterns, some methods proposed in bioinformatics such as Motif-X (Motif extractor) [40], MMFPh (Maximal Motif Finder for Phosphoproteomics) [43] and C-Motif (Conditional Phosphorylation Motif) [33] consider a modified version of the standard pattern discovery task, where patterns are investigated on the data sets induced by their subsets.

This paper focuses on a specific type of redundant patterns, whose discriminative power mainly derives from their sub-patterns. More precisely, we propose a new pattern mining problem called conditional discriminative pattern mining by generalizing recent research devoted to the non-redundancy of pattern sets. Intuitively, a pattern is a conditional discriminative pattern on condition that its discriminative power derives from the pattern as a whole rather than from any of its components. Owing to this new problem formulation, the interplay of the subsets is removed. In the following, we first elucidate the concept of conditional discriminative pattern, and subsequently show how this formulation can exclude the patterns whose discriminative power mainly benefits from their subsets.

A pattern p is a k-pattern if it consists of k items. If the items comprising the pattern p are all also contained in another pattern q (i.e., the set of items contained in p is a subset of the set of items contained in q), then p is a sub-pattern of q and q is a super-pattern of p. Notably, every transaction that contains a pattern must also contain its sub-patterns but does not necessarily contain its super-patterns. Hence, the set of transactions that contain a pattern must be a subset of the transaction collection that contains any of its sub-patterns. A k-pattern p has 2^k − 2 non-empty proper sub-patterns, and each sub-pattern of p corresponds to the set of transactions that contain this sub-pattern. Thus, based on the sub-patterns of p, we can obtain 2^k − 2 new transaction sets, whose class label information remains unchanged. If we calculate the statistical significance value of the k-pattern on these new transaction sets, the induced significance value is called the local significance or conditional significance. Similarly, we define the significance value computed on the original data sets in the traditional way as the global significance. Consequently, the statistical significance evaluation of conditional discriminative patterns includes two aspects: local significance evaluation and global significance evaluation. In this context, a k-pattern is defined as a conditional discriminative pattern if it can pass the significance tests at both the local and the global level. In fact, the evaluation of a candidate discriminative pattern on its sub-pattern induced transaction sets is motivated by the concept of conditional association based on the partial table in statistics [2].
The so-called partial table is a contingency table that displays the relationship between two variables X and Y at fixed levels of another variable Z, hence showing the effect of X on Y while removing the effect of Z by holding its value constant. In our context, X corresponds to the candidate pattern, Y is the class variable and Z is the target sub-pattern. Since the use of conditional association removes the effect of the fixed variable, the effect of sub-patterns can be eliminated as well, because we fix all possible sub-patterns during the calculation of the local significance.

To address the problem of conditional discriminative pattern mining, we propose a tree-based approach dubbed CDPM in this paper. CDPM first condenses the given data into a tree structure, and then follows the tree traversal to produce significant conditional discriminative patterns in a depth-first manner. This method employs statistical measures to evaluate both the local and the global significance of each candidate pattern. Experiments on real data sets demonstrate that CDPM succeeds in discovering significant conditional discriminative patterns and excluding redundant patterns effectively. The concept of conditional discriminative pattern mining enables us to eliminate redundant patterns whose discriminative power mainly comes from their sub-patterns. We believe that CDPM is a good choice if one desires to remove such kind of redundancy from the reported pattern set.

The remainder of this paper is organized as follows. Section 2 presents the details of the CDPM algorithm. Section 3 presents the experimental results on real data sets. Section 4 concludes this paper.


2. Methods

2.1. Basic terminology


In this paper, for the ease of illustration and discussion, we focus on the two-class problem, where the set of transactions in one class is defined as the positive data set and the remaining transactions in the other class compose the negative data set. In fact, the two-class problem can be extended to the multi-class problem as described in [3], [4]. For instance, we can first convert the multi-class problem to a two-class problem by choosing the transactions in one class or in some classes as the positive data and the set of remaining transactions as the negative data, and then execute our method iteratively.

Let D be a data set with two classes, denoted by D = {D1, D2}, where D1 corresponds to the positive data and D2 represents the negative data. The set of items that appear in the transactions is denoted as I = {i1, i2, i3, ..., in}. For a pattern p = {x1, x2, ..., xr}, the number of its occurrences in Dt is denoted as Occ(p, Dt) (t = 1, 2), and that in the whole data set is Occ(p, D) = Occ(p, D1) + Occ(p, D2). We use Sup(p, D) to denote the support of the pattern p in the data set D, i.e., Sup(p, D) = Occ(p, D)/|D|. A pattern is frequent in class Dt if its corresponding support Sup(p, Dt) is no less than a user-specified threshold θ_sup. In fact, the discovery of discriminative patterns can be interpreted as finding a set of patterns that are "over-expressed" in the positive data set against the negative data set. Thus, researchers typically only use the positive data to assess the frequency of a pattern.

Definition 1. p is a frequent pattern if Sup(p, D1) ≥ θ_sup.

We use Sig(p, D) to denote the statistical significance function for the pattern p, which measures the discriminative power of p with respect to a data set D. In fact, many significance measures such as DiffSup, the odds ratio and the relative risk can be used for evaluating discriminative patterns in different applications. DiffSup is defined as the absolute difference of the relative supports of p in D1 and D2:

Sig(p, D) = |Sup(p, D1) − Sup(p, D2)|.   (1)
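To make these definitions concrete, here is a minimal Python sketch (our own illustration, not the authors' code) that computes Occ, Sup and the DiffSup measure of Eq. (1), representing transactions and patterns as item sets:

def occ(pattern, transactions):
    # Occ(p, D): number of transactions that contain every item of the pattern
    return sum(1 for t in transactions if pattern <= t)

def sup(pattern, transactions):
    # Sup(p, D) = Occ(p, D) / |D|
    return occ(pattern, transactions) / len(transactions)

def diffsup(pattern, pos, neg):
    # DiffSup (Eq. 1): |Sup(p, D1) - Sup(p, D2)|
    return abs(sup(pattern, pos) - sup(pattern, neg))

# Hypothetical toy data: three positive and three negative transactions.
pos = [frozenset("abc"), frozenset("abd"), frozenset("ab")]
neg = [frozenset("ac"), frozenset("bd"), frozenset("cd")]
print(diffsup(frozenset("ab"), pos, neg))   # |1.0 - 0.0| = 1.0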


In this paper, we choose DiffSup as the significance measure to evaluate the discriminative power of each pattern. In fact, other statistical measures for evaluating the discriminative power of patterns are applicable as well, and the choice of significance measure generally does not change the behavior of the pattern discovery algorithm. In addition, other user-specified constraints derived from specific domain knowledge can also be incorporated in certain applications.

For a discriminative pattern, we should investigate whether its discriminative power mainly comes from its significant sub-patterns. If p has m candidate sub-patterns, we denote the sets of transactions in which these sub-patterns occur as D(p1), D(p2), ..., D(pm), respectively. If D(p1), D(p2), ..., D(pm) are used as the new data sets instead of D when estimating the discriminative power, then we obtain m different significance values for p: Sig(p, D(p1)), Sig(p, D(p2)), ..., Sig(p, D(pm)). If all these significance values pass the given significance threshold, then p is a conditional discriminative pattern. Thus, two significance constraints are imposed, one at the global level and one at the local level. In order to simplify the problem, we focus on the patterns that occur more frequently in D1 than in D2. If we want to obtain patterns that are over-expressed in D2, we can simply exchange D1 and D2. More precisely, the global and the local significance values are defined as:

Sig_g(p, D) = Sup(p, D1) − Sup(p, D2),   (2)

Sig_l(p, D) = min_{1 ≤ i ≤ m} Sig_g(p, D(p_i)).   (3)

Definition 2. p is a pseudo discriminative pattern if p is frequent in D1 and Sig_g(p, D) is no less than a user-specified threshold θ_sig^(g).

Definition 3. p is a conditional discriminative pattern if p is a pseudo discriminative pattern and Sig_l(p, D) is no less than a user-specified threshold θ_sig^(l).


However, a pseudo discriminative pattern p of length k has 2^k − 2 sub-patterns. If we take all these sub-patterns into account, it will be time-consuming to accomplish the mining task. In fact, it is sufficient to only check those sub-patterns that are themselves pseudo discriminative patterns. This is because if one sub-pattern is not a pseudo discriminative pattern, it will not be included in the final result set, and hence the redundancy issue cannot be caused by such a sub-pattern. To calculate the value of Sig_l(p, D) for the pattern p, we first need to obtain the value of Sig_g(p, D(p_i)) for each significant sub-pattern p_i:


Sig_g(p, D(p_i)) = Sup(p, D1(p_i)) − Sup(p, D2(p_i))   (4)
                 = Sup(p, D1)/Sup(p_i, D1) − Sup(p, D2)/Sup(p_i, D2).   (5)
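As an illustration of Eqs. (2)-(5), the following sketch (ours, with hypothetical helper names) evaluates the global significance of a pattern and its local significance conditioned on a given collection of sub-patterns by directly materializing the sub-pattern induced data sets; it is a brute-force transcription of the definitions, not the tree-based computation used later by CDPM:

from itertools import combinations

def sup(p, ts):
    return sum(1 for t in ts if p <= t) / len(ts)

def sig_g(p, pos, neg):
    # Eq. (2): Sup(p, D1) - Sup(p, D2)
    return sup(p, pos) - sup(p, neg)

def sig_g_on(p, sub, pos, neg):
    # Eqs. (4)-(5): Sig_g of p on the data set induced by the sub-pattern `sub`.
    pos_i = [t for t in pos if sub <= t]
    neg_i = [t for t in neg if sub <= t]
    # If the sub-pattern never occurs in the negative class, 0/0 is treated as 1
    # (the paper's convention); doing the same for the positive class is our own assumption.
    f = sup(p, pos_i) if pos_i else 1.0
    b = sup(p, neg_i) if neg_i else 1.0
    return f - b

def sig_l(p, subpatterns, pos, neg):
    # Eq. (3): minimum conditioned significance over the given sub-patterns.
    return min(sig_g_on(p, s, pos, neg) for s in subpatterns)

def proper_subpatterns(p):
    # Enumerates the 2^k - 2 non-empty proper sub-patterns of a k-pattern.
    items = sorted(p)
    for r in range(1, len(items)):
        for c in combinations(items, r):
            yield frozenset(c)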

Table 1: A sample data set. For every transaction, ✓ and × represent the presence and absence of an item, respectively. [The table lists ten transactions (IDs 1-10, the first five labeled + and the last five labeled −) over the items A, B, C and D.]

Once we have the value of Sig_g(p, D(p_i)), Sig_l(p, D) can be calculated accordingly. To further illustrate how the concept of conditional discriminative pattern can filter out insignificant patterns, we take the data in Table 1 as an example. Here columns stand for items and rows represent transactions from two classes. Let us consider three different patterns: P1 = {A, D}, P2 = {B, C} and P3 = {A, B, C}. In this example, P1 is obviously not a discriminative pattern since it has identical frequency in the positive data and the negative data; thus, it cannot be a conditional discriminative pattern either. P2 is a pattern that contains 2 items and has two sub-patterns: {B} and {C}. The value of Sup(P2, D1) is 0.8, the value of Sig_g(P2, D) is 0.6 and Sig_l(P2, D) = 0.47. As a result, we can regard P2 as a conditional discriminative pattern if we set θ_sup = 0.5 and θ_sig^(g) = θ_sig^(l) = 0.4. P3 is a pseudo discriminative pattern since its support value and global significance value are both above the thresholds. However, Sig_l(P3, D) = 0, which indicates that P3 is not a conditional discriminative pattern: its discriminative power mainly comes from its sub-pattern P2. This simple example illustrates that it is feasible to filter out those discriminative patterns that are redundant with respect to their sub-patterns according to our definition.


2.2. Problem formulation


In the discovery of conditional discriminative patterns, there are three parameters to measure the significance of patterns: θ_sup, θ_sig^(g) and θ_sig^(l). As mentioned above, θ_sup is the support threshold, θ_sig^(g) is the global significance threshold and θ_sig^(l) is the local significance threshold. Here we provide a clear problem formulation of conditional discriminative pattern mining.

• Input: A data set D consisting of two classes, the positive data set D1 and the negative data set D2, and three parameters θ_sup, θ_sig^(g), θ_sig^(l).

• Output: A set of conditional discriminative patterns, where each pattern p satisfies: (1) Sup(p, D1) ≥ θ_sup; (2) Sig_g(p, D) ≥ θ_sig^(g); (3) Sig_l(p, D) ≥ θ_sig^(l).
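The problem statement can also be pinned down by a naive reference miner that enumerates every itemset that is frequent in D1 and keeps those passing the two significance thresholds. The sketch below (our own, exponential in the number of items) is only meant to clarify the semantics of the three parameters; following the observation in Section 2.1, the local test conditions only on sub-patterns that are themselves pseudo discriminative. CDPM avoids this exhaustive enumeration.

from itertools import combinations

def sup(p, ts):
    return sum(1 for t in ts if p <= t) / len(ts)

def mine_cdp_naive(pos, neg, theta_sup, theta_g, theta_l):
    items = sorted(set().union(*pos, *neg))
    result = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            p = frozenset(combo)
            if sup(p, pos) < theta_sup:                       # condition (1)
                continue
            if sup(p, pos) - sup(p, neg) < theta_g:           # condition (2), Eq. (2)
                continue
            ok = True
            for k in range(1, r):                             # proper sub-patterns of p
                for sub in map(frozenset, combinations(combo, k)):
                    # condition only on pseudo discriminative sub-patterns
                    if sup(sub, pos) < theta_sup or sup(sub, pos) - sup(sub, neg) < theta_g:
                        continue
                    pos_i = [t for t in pos if sub <= t]
                    neg_i = [t for t in neg if sub <= t]
                    f = sup(p, pos_i) if pos_i else 1.0
                    b = sup(p, neg_i) if neg_i else 1.0       # 0/0 treated as 1
                    if f - b < theta_l:                       # condition (3), Eq. (3)
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                result.append(p)
    return result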


Table 2: An example data set and the ranks of items.

Class  Transaction        Item α  Occ(α, D)  Rank
1      {a,b,c,d}          c       5          1
1      {a,b,d,f}          a       4          2
1      {b,c,e}            b       4          3
2      {a,c}              d       3          4
2      {b,c,d}            e       2          5
2      {a,c,e,f}          f       2          6

2.3. CDPM algorithm


2.3.1. CP-tree and PDP-tree

The CP-tree (Conditional-Pattern tree) is an extension of the FP-tree (Frequent-Pattern tree) [23] and is the data structure used in the CDPM algorithm. Here we utilize an example to show how to build CP-trees. Table 2 presents a transaction data set. In this data set, there are two classes (class 1 and class 2) and each class consists of three transactions. The sets of transactions in class 1 and class 2 correspond to the positive data and the negative data, respectively. The items in the data set are first sorted in decreasing order of Occ(α, D), where α is an item. In this example, we obtain the ordered sequence c ≻ a ≻ b ≻ d ≻ e ≻ f. Then the items in each transaction are resorted according to ≻. After obtaining the ranks of the items, we can use the rank to represent the corresponding item, as shown in Table 3.

Table 3: The example data set in which items are represented by their ranks.

Class  Transaction (items)   Transaction (ranks)
1      {c,a,b,d}             {1,2,3,4}
1      {a,b,d,f}             {2,3,4,6}
1      {c,b,e}               {1,3,5}
2      {c,a}                 {1,2}
2      {c,b,d}               {1,3,4}
2      {c,a,e,f}             {1,2,5,6}

Figure 1: The initial CP-tree of the example data and its header table.
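The re-ranking step illustrated by Tables 2 and 3 is easy to reproduce; the following sketch (ours) counts Occ(α, D) over both classes, ranks items by decreasing count (breaking ties alphabetically, which happens to match the order used above) and rewrites each transaction as a rank-sorted list:

from collections import Counter

data = [({"a","b","c","d"}, 1), ({"a","b","d","f"}, 1), ({"b","c","e"}, 1),
        ({"a","c"}, 2), ({"b","c","d"}, 2), ({"a","c","e","f"}, 2)]

counts = Counter(item for t, _ in data for item in t)          # Occ(alpha, D)
order = sorted(counts, key=lambda i: (-counts[i], i))          # c, a, b, d, e, f
rank = {item: r for r, item in enumerate(order, start=1)}      # c->1, a->2, b->3, ...

encoded = [(sorted(rank[i] for i in t), label) for t, label in data]
print(encoded[0])   # ([1, 2, 3, 4], 1) for transaction {c,a,b,d}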

The construction of the initial CP-tree is very similar to that of the FP-tree. In the FP-tree, each node has a value that indicates its frequency count. In order to mine discriminative patterns, a CP-tree node carries two frequency counts: pos_count (the frequency count in the positive data) and neg_count (the frequency count in the negative data). The initial CP-tree of the data set D in Table 3 is depicted in Figure 1. For every item α, all the nodes containing α are connected. In the header table, there is a head of node-link for each item and these heads are sorted according to their frequency counts. Like FP-growth [23], CDPM traverses a suffix enumeration tree by constructing a conditional CP-tree for every item to mine discriminative patterns. For example, let us consider how to build a conditional CP-tree for item 3.


Figure 2: The conditional CP-tree of item 3.


There are three "transactions" conditioned by item 3: {1,2}[1,0], {1}[1,1] and {2}[1,0], and the corresponding conditional CP-tree is shown in Figure 2. To calculate the value of Sig_l(p, D) conveniently, we define another tree called the PDP-tree (Pseudo-Discriminative-Pattern tree). The PDP-tree is a trie in which each node also contains two counts: pos_count and neg_count. In fact, the PDP-tree is the same as the CP-tree except that the head of node-link is excluded. The function of the PDP-tree is to store the pseudo discriminative patterns that CDPM has already detected in the pattern enumeration process.
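To make the two data structures concrete, a minimal node definition that could underlie both the CP-tree and the PDP-tree is sketched below (our own naming; the paper does not prescribe an implementation, and it keeps children in a list rather than the dictionary used here). The header-table node-links of the CP-tree are omitted for brevity; a PDP-tree node additionally uses the istail flag to mark the end of a stored pseudo discriminative pattern:

class TreeNode:
    # A node carrying the two class-wise frequency counts used by CP-trees and PDP-trees.
    def __init__(self, item=None, parent=None):
        self.item = item
        self.pos_count = 0       # frequency count in the positive data
        self.neg_count = 0       # frequency count in the negative data
        self.children = {}       # item -> child node
        self.parent = parent
        self.istail = False      # PDP-tree: marks the end of a stored pattern

def insert_transaction(root, ranked_items, is_positive):
    # Insert one rank-sorted transaction into a CP-tree, updating counts on its prefix path.
    node = root
    for item in ranked_items:
        child = node.children.get(item)
        if child is None:
            child = TreeNode(item, parent=node)
            node.children[item] = child
        if is_positive:
            child.pos_count += 1
        else:
            child.neg_count += 1
        node = child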


2.3.2. CDPM algorithm

The CDPM algorithm is presented in Algorithm 1. In this algorithm, we first initialize several variables. L is the set of conditional discriminative patterns, and P is a PDP-tree that stores the pseudo discriminative patterns. At the beginning, x is an empty pattern. Then, CDPM calls Growth(T, x), in which T is the initial CP-tree.

Algorithm 1 CDPM(D, θ_sup, θ_sig^(g), θ_sig^(l))
1: L ← an empty conditional discriminative pattern set
2: P ← an empty PDP-tree
3: T ← the initial CP-tree of D
4: x ← an empty pattern
5: call Growth(T, x)
6: return L


Growth(T, x) is the key procedure of CDPM and is an extension of the popular frequent pattern discovery algorithm FP-growth. FP-growth utilizes a divide-and-conquer strategy to discover frequent patterns without candidate generation in a depth-first manner. This method constructs a frequent pattern tree (called the FP-tree) and then builds conditional trees by updating the frequency counts along the prefix paths and removing all infrequent items. Similar operations are repeated until all the frequent patterns are discovered.

Figure 3: The main process of Growth(T0, x). [Starting from ⟨x, T0⟩, the items at the head of T0 are added to x to construct a series of patterns. For each enumerated pattern y: (1) if y is a conditional discriminative pattern, it is added into the result set L and inserted into the PDP-tree P; (2) if y is a pseudo discriminative pattern, it is inserted into the PDP-tree P; (3) if y is a frequent pattern, the corresponding conditional CP-tree T1 is constructed and Growth(T1, y) is called, continuing from ⟨y, T1⟩.]



The main process of Growth(T, x) is shown in Figure 3. To obtain each candidate pattern, we iteratively add an item of the head of T0 to x to generate a new pattern y. Then, we apply frequency and significance tests to evaluate whether y is a conditional discriminative pattern, a pseudo discriminative pattern or a frequent pattern. If y turns out to be a conditional discriminative pattern, it is included in the result set and inserted into the PDP-tree P as well. Otherwise, y is inserted into the PDP-tree P if it is just a pseudo discriminative pattern. Additionally, if y is a frequent pattern, we recursively call Growth(T1, y), in which T1 is the corresponding conditional CP-tree.

The pseudocode of Growth(T, x) is provided in Algorithm 2, together with the corresponding detailed explanations. In Growth(T0, x), we enumerate the items according to their corresponding order in the header table (lines 3-4). For an enumerated item α, we add it to the end of the pattern x to generate a new pattern y (line 5). We assign α.pos_count to y.pos_count and α.neg_count to y.neg_count. Sup(y, D1) and Sig_g(y, D) are then calculated so as to judge whether y is a conditional discriminative pattern (lines 6-7). If y is a pseudo discriminative pattern, we need to calculate Sig_l(y, D) (lines 8-9). If Sig_l(y, D) ≥ θ_sig^(l), the pattern y is a conditional discriminative pattern and it is added into L (line 11). To enable the calculation of Sig_l(z, D) for some pattern z (where z is a super-pattern of y) in the future, y is inserted into P if it is a pseudo discriminative pattern (line 13). As the items of y are ordered by the reverse of ≻, we need to reverse y before it is inserted into P. Similar to FP-growth, if y is a frequent pattern, we construct the conditional CP-tree T of the item α and recursively call Growth(T, y) (lines 15-18). Here we can ensure that when we deal with the pattern y, all significant subsets of y have already been considered, since the heads of the items are sorted according to ≻ and the order of enumeration is also governed by ≻ in Algorithm 2. Growth is a recursive procedure and the terminating condition is that T0 is an empty tree. Notably, insert_pse_p(reverse(y), 0, P.root) and get_sig_l(y, P.root) are called in Growth(T, x); the details of these two functions are described in Algorithm 3 and Algorithm 4.


Algorithm 2 Growth(T0, x)
1: H0 ← the header table of T0
2: size ← the number of items in H0
3: for i = 1 to size do
4:   α ← the ith item of H0
5:   generate pattern y = x ∪ α with pos_count = α.pos_count and neg_count = α.neg_count
6:   Sup(y, D1) ← pos_count/|D1|
7:   Sig_g(y, D) ← pos_count/|D1| − neg_count/|D2|
8:   if Sup(y, D1) ≥ θ_sup and Sig_g(y, D) ≥ θ_sig^(g) then
9:     Sig_l(y, D) ← get_sig_l(y, P.root)
10:    if Sig_l(y, D) ≥ θ_sig^(l) then
11:      add y into L
12:    end if
13:    insert_pse_p(reverse(y), 0, P.root)
14:  end if
15:  if Sup(y, D1) ≥ θ_sup then
16:    T ← the conditional CP-tree of item α
17:    Growth(T, y)
18:  end if
19: end for

We maintain the PDP-tree P in order to facilitate the calculation of the local significance value. As long as a pattern x is a pseudo discriminative pattern, it is inserted into the PDP-tree P by calling the function insert_pse_p(x, k, root). More precisely, insert_pse_p(x, k, root) inserts the kth item of the pattern x into the node root of the PDP-tree P. The case k > x.length is meaningless, so the function proceeds only on condition that k ≤ x.length (line 1). We first look for a child of root whose item equals x[k].item and assign its index to a variable dubbed pos (line 2). If pos is not null, we recursively call insert_pse_p(x, k + 1, root.children[pos]) (lines 3-5). Otherwise, we build a new node A with item = x[k].item and assign x[k].pos_count to A.pos_count and x[k].neg_count to A.neg_count (lines 7-10). Note that if k = x.length, we label A as the end of the pattern x (lines 11-14). Then, A is added to the end of the list of root's children and insert_pse_p(x, k + 1, A) is called recursively (lines 16-17).


Algorithm 3 insert_pse_p(x, k, root)
1: if k ≤ x.length then
2:   pos ← the index which satisfies that x[k].item = root.children[pos].item
3:   if pos ≠ null then
4:     insert_pse_p(x, k + 1, root.children[pos])
5:     return
6:   end if
7:   build a new node A
8:   A.item ← x[k].item
9:   A.pos_count ← x[k].pos_count
10:  A.neg_count ← x[k].neg_count
11:  if k = x.length then
12:    A.istail ← true
13:  else
14:    A.istail ← false
15:  end if
16:  add A to the end of the list of root's children
17:  insert_pse_p(x, k + 1, A)
18: end if


Algorithm 4 get_sig_l(x, root)
1: ret ← ∞
2: S ← {k | ∃ m (root.children[k].item = x[m].item)}
3: for k in S do
4:   if root.children[k].istail = true then
5:     f_frequency ← x.pos_count/root.children[k].pos_count
6:     if root.children[k].neg_count = 0 then
7:       b_frequency ← 1
8:     else
9:       b_frequency ← x.neg_count/root.children[k].neg_count
10:    end if
11:    ret ← min(ret, f_frequency − b_frequency)
12:    if ret ≤ θ_sig^(l) then
13:      return ret
14:    end if
15:  end if
16:  ret ← min(ret, get_sig_l(x, root.children[k]))
17: end for
18: return ret

The function get_sig_l(x, root) calculates the local significance value of the pattern x, that is, Sig_l(x, D). To achieve this goal, we need to calculate the global significance value Sig_g(x, D(x_i)) for each x_i that is a pseudo discriminative sub-pattern of x. In get_sig_l(x, root), we only consider the children of root whose items are also contained in the pattern x; these children compose a set S (line 2). Then we enumerate every element in S in order to calculate the local significance of the pattern x. In addition, we recursively try to find more candidate subsets of x (line 16) and correspondingly get the value of Sig_g(x, D(x_i)) whenever root.children[k].istail is true (lines 4-15). When the neg_count of root's child equals 0, x.neg_count must also be 0. In this case, we regard


that 0/0 equals 1 (lines 6-7). In addition, there is a pruning step: if ret ≤ θ_sig^(l), the pattern x is certainly not a conditional discriminative pattern (lines 12-14).

For the purpose of illustration and ease of interpretation, we utilize the data in Table 2 as an example to explain how Growth(T0, x) works. We set the threshold parameters as θ_sup = θ_sig^(g) = θ_sig^(l) = 0.3. In order to make the process easy to understand, we only consider the specific situation when x is {3}. The corresponding CP-tree is the tree in Figure 2, and the corresponding PDP-tree P contains only one pattern, {3}, whose pos_count and neg_count are 3 and 1, respectively. Then, we enumerate the items in T0 to construct new patterns. First, we add the item '1' to x to obtain a new pattern y1 = {3,1}[2,1], in which 2 and 1 are the values of pos_count and neg_count of y1. We can calculate that the value of Sup(y1, D1) is 0.67 and that of Sig_g(y1, D) is 0.33. Hence, y1 is a pseudo discriminative pattern since it is frequent and globally significant according to the definitions. In addition, there is only one sub-pattern of y1 in the PDP-tree P, namely {3}[3,1], and the resulting value of Sig_l(y1, D) is −0.33. Thus y1 is not a conditional discriminative pattern since its local significance value is less than the given threshold. Meanwhile, we should insert it into P as introduced in Figure 3. Furthermore, we should construct the CP-tree T1 of y1 and call Growth(T1, y1) recursively. Next, we add the item '2' to x and get a new pattern y2 = {3,2}[2,0]. The corresponding values of Sup(y2, D1) and Sig_g(y2, D) are both 0.67, which are larger than the corresponding thresholds. The pattern {3}[3,1] is the only sub-pattern in P and the value of Sig_l(y2, D) is 0.67 as well. According to the definition, y2 is a conditional discriminative pattern and should be added into L and inserted into P. Then the Growth method is recursively called to construct more patterns in similar ways.

In the worst case, CDPM has an exponential time complexity with respect to the number of items when the support threshold is zero. Similar to FP-growth, the practical running efficiency of CDPM depends on the user-specified support threshold. That is, the practical time complexity of CDPM is mainly determined by the number of candidate frequent patterns. Therefore, when the support threshold is relatively high, CDPM can be very efficient in practice.

The data sets in practical applications are usually complex, containing not only categorical attributes but also numeric attributes. The proposed algorithm can only handle categorical data sets at the current stage. To discover discriminative patterns from data sets with numeric attributes, some proper discretization algorithm (supervised or unsupervised) should be applied first to transform numeric attributes into categorical attributes.


3. Experiments


In order to evaluate the performance of our algorithm, we conduct a series of experiments. First, we compare the number of patterns discovered by CDPM with the number of patterns reported by the algorithm that does not consider the local significance. Second, to demonstrate that the information contained in the patterns from CDPM is no less than that of the algorithm without the local significance, we use the patterns from both methods to build classifiers and test their classification accuracies. Finally, we compare our algorithm with the SNS algorithm [30], which is a recently proposed algorithm for the discovery of non-redundant discriminative patterns. All the experiments were conducted on a workstation with an Intel(R) Xeon(R) CPU (E5607 @ 2.27GHz) and 8 GB memory.

We use 5-fold cross-validation to test the classification accuracy. In the construction of the classifiers, we extract m features, where m is the number of patterns and each feature corresponds to a pattern x. That is, for every transaction t, if x ⊆ t, then the corresponding feature value is set to 1 and otherwise to 0 [25]. We choose Naive Bayes, support vector machine (SVM) and decision tree as implemented in the scikit-learn package [38] as the classifiers in the experiments. To obtain stable performance evaluation results, we repeat the cross-validation process 10 times to compute the average classification accuracy.
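The feature construction and evaluation protocol described above can be sketched with scikit-learn roughly as follows (the mined patterns and encoded transactions are placeholders, and the choice of BernoulliNB for the binary features is our own; the paper only states that Naive Bayes, SVM and decision tree classifiers from scikit-learn are used):

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def pattern_features(transactions, patterns):
    # Binary feature matrix: entry (t, j) is 1 iff pattern j is contained in transaction t.
    return np.array([[1 if p <= t else 0 for p in patterns] for t in transactions])

def average_accuracy(transactions, labels, patterns, repeats=10):
    X = pattern_features(transactions, patterns)
    y = np.array(labels)
    classifiers = {"Naive Bayes": BernoulliNB(), "SVM": SVC(),
                   "Decision Tree": DecisionTreeClassifier()}
    results = {}
    for name, clf in classifiers.items():
        # Repeat 5-fold cross-validation with different splits and average the scores.
        scores = [cross_val_score(clf, X, y,
                                  cv=StratifiedKFold(n_splits=5, shuffle=True,
                                                     random_state=r)).mean()
                  for r in range(repeats)]
        results[name] = float(np.mean(scores))
    return results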

3.1. Data sets from UCI

3.1.1. The Data Set Description

We use 6 data sets from UCI [15]: mushroom, breast-cancer, car, tic-tac-toe, vote and monks. The details of these data sets are provided at http://archive.ics.uci.edu/ml/. Here we use mushroom as an example to describe our pre-processing procedure. There are 22 attributes and 8124 transactions in mushroom. Since all the attributes are nominal, we transform the nominal attributes into 116 binary attributes. The positive set and the negative set are


composed of the transactions from the two classes in this data set, respectively. More precisely, the positive set has 3916 transactions and the negative set has 4208 transactions. Note that in the data set car there are four classes, "unacc", "acc", "good" and "vgood". As the numbers of transactions in "acc", "good" and "vgood" are smaller than that of "unacc", we merge these three classes into one class named "acc".

3.1.2. Results

For each data set, we fix θ_sup and θ_sig^(g) and examine the number of patterns mined by CDPM while varying θ_sig^(l). Here the situation θ_sig^(l) = 0 denotes that the patterns are generated without considering the local significance. From Figure 4, we can clearly see that CDPM can remove a lot of redundant patterns. When θ_sup and θ_sig^(g) are fixed, within a certain range the number of patterns is negatively correlated with θ_sig^(l); that is, a bigger θ_sig^(l) reduces the number of patterns detected. It should be noted that for most parameter settings CDPM removes a lot of redundant patterns, but if θ_sup or θ_sig^(g) is high, this effect is less evident.

Figure 4: The number of patterns mined by CDPM from 6 UCI data sets when the local significance threshold varies from 0.05 to 1. Note that θ_sig^(l) = 0 means that the patterns are mined without considering the local significance. [Panels: mushroom (θ_sup = 0.45, θ_sig^(g) = 0.05), breast-cancer (0.25, 0.15), car (0.10, 0.05), tic-tac-toe (0.20, 0.05), vote (0.50, 0.45), monks (0.05, 0.05).]


To make the comparative analysis more concrete, here we use several identified discriminative patterns in the breast-cancer data set from UCI as examples. When the parameters are set to θ_sup = 0.25, θ_sig^(g) = 0.15 and θ_sig^(l) = 0.4, 43 significant patterns are generated by the method with only the global significance evaluation, while the number of patterns discovered by CDPM is 7. Furthermore, these 7 patterns are all of size 1: {deg_malig = 3}, {irradiat = yes} and {node_caps = yes} appear more frequently in the positive data, while the other 4 patterns {inv_nodes = 0-2}, {irradiat = no}, {node_caps = no} and {deg_malig = 2} have more occurrences in the negative data (the left sides of the equal signs are attribute names and the right sides are the corresponding values). In addition, {irradiat = yes, breast = left} is a pattern of size 2, which possesses two sub-patterns of size 1: {irradiat = yes} and {breast = left}. Since the pattern {breast = left} is insignificant, its influence on the discriminative power of the super-pattern is negligible, so we focus the analysis on the other sub-pattern, {irradiat = yes}. According to the significance assessment results, the global significance value of {irradiat = yes, breast = left} is 0.18 and the local significance value of this pattern is 0.30, while those significance

values of the pattern {irradiat = yes} are 0.18 and 1.0, respectively. In particular, these two patterns possess the same level of global significance, so the super-pattern brings little improvement in discriminative power compared to the sub-pattern. Accordingly, we can deduce that the discriminative power of {irradiat = yes, breast = left} mainly derives from its sub-pattern {irradiat = yes}. In real-world applications, filtering out {irradiat = yes, breast = left} and keeping {irradiat = yes} might be a good choice since they distinguish the different classes with equivalent ability. In this regard, {irradiat = yes, breast = left} can be considered a redundant pattern according to our definition with θ_sig^(g) = 0.15 and θ_sig^(l) = 0.4, although it passes the global significance evaluation in the traditional way. Similar analyses can also be made for the other patterns. Hence, CDPM is of considerable value in reducing the redundancy in pattern mining by filtering out the patterns whose discriminative power mainly comes from their sub-patterns.

As CDPM removes many patterns, it is very important to check whether those removed patterns are really redundant. Here we test this hypothesis by utilizing the patterns discovered by CDPM to build classifiers and testing their classification accuracy. For each data set, we fix the support threshold as well as the global significance threshold and vary θ_sig^(l) from 0.05 to 1. In addition, three different classifiers (Naive Bayes, SVM and Decision Tree) are employed to make the results more convincing. As shown in Figure 5, the accuracies of the classifiers built from the conditional discriminative patterns are comparable to those of the classifiers built from the patterns containing redundant ones. On some data sets, classifiers generated from CDPM even have higher accuracies. This implies that the discriminative information is not reduced after removing the redundant patterns.

Figure 5: The accuracies of the classifiers which are built by the patterns mined by CDPM from 6 UCI data sets when the local significance threshold varies from 0.05 to 1. Here three different classifiers are used: Naive Bayes, SVM and Decision Tree. The situation θ_sig^(l) = 0 means that the patterns are produced without considering the local significance. [Panels: mushroom (θ_sup = 0.45, θ_sig^(g) = 0.05), breast-cancer (0.25, 0.15), car (0.10, 0.05), tic-tac-toe (0.20, 0.05), vote (0.50, 0.45), monks (0.05, 0.05).]

Even though we repeat the 5-fold cross-validation process 10 times, different runs may still generate different classification results. To check whether such performance fluctuations are negligible, we also calculate the standard deviation of the 10 classification accuracy values, where each accuracy value is generated from an independent 5-fold cross-validation procedure. As shown in Figure 6, the standard deviations are generally less than 0.04 on most UCI data sets for all classifiers. Therefore, the reported classification results in the experiments are stable and convincing.

In the above experiments, we mainly focus on the performance comparison between our algorithm and the standard algorithm without removing redundant patterns. To further demonstrate the merit of the CDPM algorithm, we compare our algorithm with the SNS algorithm [30]. The SNS algorithm is a recently proposed algorithm that is also devoted to removing subset-induced redundant discriminative patterns. We obtain the SNS software from http://nugget.unisa.edu.au/jiuyong/Subgroup/subgroup.htm.

Figure 6: The standard deviations of classification accuracies. The classifiers are constructed with the patterns mined by CDPM from 6 UCI data sets when the local significance threshold varies from 0.05 to 1. The situation θ_sig^(l) = 0 means that the patterns are produced without considering the local significance.


The executable SNS software allows the user to specify two parameters: the support threshold and the maximal length of reported patterns. In our experiments, we use the same support threshold value in both CDPM and SNS but remove the constraint on the length of patterns for SNS. To make a fair comparison on the classification accuracy, a similar number of patterns should be reported by the two algorithms. As shown in Figure 7, the number of patterns reported by CDPM is generally smaller than that of SNS. Therefore, the subsequent comparison on the performance of classifiers constructed from the patterns reported by the two algorithms is convincing and fair.


Figure 7: The number of patterns mined by SNS and CDPM from 6 UCI data sets. In CDPM, the threshold values for θ_sup and θ_sig^(g) are the same as specified in Figure 5 and θ_sig^(l) = 0.75. In SNS, the support value on each data set is the same as that used in CDPM.

Figure 8 presents the comparison between SNS and CDPM with respect to the classification accuracy, where the classifiers are built based on the discriminative patterns reported by each algorithm. This figure shows that CDPM can always achieve similar or better performance compared to SNS. In particular, CDPM outperforms SNS significantly on the monks data set across all classifiers.

Figure 8: The accuracies of the classifiers that are built by the patterns mined by SNS and CDPM from 6 UCI data sets. In CDPM, the threshold values for θ_sup and θ_sig^(g) are the same as specified in Figure 5 and θ_sig^(l) = 0.75. In SNS, the support threshold value on each data set is the same as that used in CDPM.


3.2. A Breast Cancer Gene Expression Data Set


3.2.1. The Data Set Description

The breast cancer gene expression data set derived from [14] is a binary data set with 11,962 items and 295 transactions. It is constructed from a real data set with the expression profiles of 25,000 genes in 295 breast cancer patients. The patients are categorized into two classes, where the set of survivors corresponds to the positive data and the set of remaining patients is used as the negative data. In the experiment, we only focus on 5,981 genes, as in [14], since their frequency counts are significantly different with at least a 2-fold change between the two classes. Between-gene variations have been properly eliminated to normalize the data. This data set is stored in a binary table, in which two binary columns together present the information of a single gene: a 1 in the first column means the expression of the gene is less than −0.2, whereas a 1 in the second column indicates the expression of the gene is greater than 0.2. Genes whose expression values lie between −0.2 and 0.2 are not encoded in this data set, since they are expected to be irrelevant and to involve substantial noise. In this way, the 5,981 genes are transformed into 11,962 binary attributes. Discriminative pattern mining on this data set can help to uncover the pathogenesis of breast cancer and possibly even its cure. Our experiments are designed to evaluate the efficacy of CDPM on this complex real data set.
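The binarization described above (two indicator columns per gene, thresholded at −0.2 and 0.2) can be written compactly; the sketch below (ours) assumes a NumPy matrix of normalized expression values with one row per patient and one column per gene:

import numpy as np

def binarize_expression(expr, low=-0.2, high=0.2):
    # expr: (n_patients, n_genes) matrix of normalized expression values.
    # Returns an (n_patients, 2 * n_genes) 0/1 matrix: for every gene, the first
    # column flags expression < low and the second flags expression > high;
    # values between the two thresholds yield 0 in both columns.
    under = (expr < low).astype(int)
    over = (expr > high).astype(int)
    out = np.empty((expr.shape[0], 2 * expr.shape[1]), dtype=int)
    out[:, 0::2] = under
    out[:, 1::2] = over
    return out

# binarize_expression(np.array([[-0.5, 0.1], [0.3, -0.1]]))
# -> [[1, 0, 0, 0],
#     [0, 1, 0, 0]]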


3.2.2. Results

The experiments on this data set are primarily intended to assess the performance of CDPM on large real data sets. As this data set has 11,962 binary attributes, the number of potential patterns is very large. To obtain the experimental results within a reasonable time, we impose an additional constraint on the maximal length of candidate patterns and we only conduct one experiment for each group of parameter combinations. Table 4 shows the experimental results on the breast cancer gene expression data set, from which we can see that the use of local significance is effective in removing many redundant patterns (when θ_sig^(l) ≠ 0). Figure 9 lists the number of patterns of different sizes mined by CDPM and by the method without the local significance test. We can find that these two methods generate the same set of patterns of size 1. This is because such patterns have no sub-patterns, so they are inherently free from the influence of sub-patterns. However, there is an obvious difference in the number of patterns of size > 1.

Figure 9: The number of patterns of different sizes with different parameters. As the size of a pattern can be very large, for the ease of illustration, we only present the number of patterns whose sizes are less than four. CDP (Conditional Discriminative Patterns) denotes the set of conditional discriminative patterns and PDP (Pseudo Discriminative Patterns) denotes the set of discriminative patterns mined by the method without considering the local significance. [Panels: θ_sup = 0.60, θ_sig^(g) = 0.30, θ_sig^(l) = 0.10; θ_sup = 0.55, θ_sig^(g) = 0.30, θ_sig^(l) = 0.10; θ_sup = 0.50, θ_sig^(g) = 0.30, θ_sig^(l) = 0.10; θ_sup = 0.40, θ_sig^(g) = 0.20, θ_sig^(l) = 0.20.]

Table 4: Experimental results on the breast cancer gene expression data. In this table, θ_sig^(l) = 0 means that the patterns are generated without considering the local significance.

θ_sup  θ_sig^(g)  θ_sig^(l)  Pattern size  #Patterns  Time(sec)  Naive Bayes  SVM    Decision Tree
0.60   0.30       0          1-10          3130       1396       0.597        0.710  0.625
0.60   0.30       0.10       1-10          72         1313       0.574        0.714  0.661
0.55   0.30       0          1-10          60542      3268       0.623        0.672  0.689
0.55   0.30       0.10       1-10          83         3213       0.618        0.691  0.673
0.50   0.30       0          1-5           21217      13584      0.662        0.603  0.721
0.50   0.30       0.10       1-5           425        8756       0.660        0.656  0.751
0.40   0.20       0          1-3           61290      38002      0.758        0.727  0.758
0.40   0.20       0.20       1-3           879        20653      0.811        0.755  0.755

* All the experiments in this table could be accomplished within 24 hours; experimental results that cost more than 24 hours are not listed.
* We do not compare our algorithm with the SNS algorithm on this data set because SNS does not scale well with the number of dimensions. Actually, we tried to run SNS on this data set, but it failed to generate the results due to its unrealistic memory requirement.


With the increase of the pattern size, the gap between the number of relevant patterns discovered by CDPM and by the traditional method becomes more visible, reaching several orders of magnitude in some cases. This is because a pattern of bigger size possesses more sub-patterns, so its significant constituent parts can exert a potentially greater influence. Thus, the empirical comparison shows that CDPM outperforms the traditional method in mining non-redundant discriminative patterns, and increasing the size of the target patterns further highlights the advantage of CDPM over the method without the local significance test. As the method without the local significance test needs more memory to store the patterns and more I/O operations to write the patterns into a file, it is also more time-consuming than CDPM. Furthermore, all classifiers based on only hundreds of patterns provided by CDPM achieve similar and even better performance than the classifiers based on tens of thousands of patterns. This indicates that the patterns removed by CDPM are really redundant.

For high-dimensional data sets such as the breast cancer gene expression data, the union of a very frequent pattern

ACCEPTED MANUSCRIPT

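The classification results reported in Table 4 are obtained from classifiers built on top of the mined patterns. As a rough illustration of how such an evaluation can be set up (not necessarily the exact protocol used in our experiments), each sample can be encoded as a binary vector that records which mined patterns it contains, and standard classifiers can then be cross-validated on these features; the pattern list and the toy data below are placeholders.

# Rough sketch: mined patterns as binary indicator features for standard classifiers.
# `patterns`, `transactions` and `labels` are toy placeholders, not the paper's data.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def pattern_features(patterns, transactions):
    """Encode each transaction as a 0/1 vector over the mined patterns."""
    return np.array([[1 if p <= t else 0 for p in patterns] for t in transactions])

def evaluate(patterns, transactions, labels, cv=5):
    X = pattern_features(patterns, transactions)
    y = np.array(labels)
    models = {"Naive Bayes": BernoulliNB(), "SVM": SVC(), "Decision Tree": DecisionTreeClassifier()}
    return {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}

# Toy usage with placeholder item sets and labels:
transactions = [{"g1", "g2"}, {"g1"}, {"g2", "g3"}, {"g3"}] * 5
labels = [1, 1, 0, 0] * 5
patterns = [frozenset({"g1"}), frozenset({"g2", "g3"})]
print(evaluate(patterns, transactions, labels))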
4. Conclusion and future work


In this paper, we propose the concept of conditional discriminative pattern and present a new algorithm called CDPM to mine such patterns effectively. To demonstrate the performance of CDPM, we conduct a series of experiments on UCI data sets and a breast cancer gene expression data set. The empirical performance studies show that CDPM can remove many redundant patterns whose discriminative ability is similar to that of their sub-patterns. In addition, the application of CDPM is not restricted to discriminative pattern discovery: the approach can also be useful in a wide range of knowledge extraction problems such as classification and disease biomarker discovery in bioinformatics. Therefore, CDPM is a promising tool for analyzing data from different fields to discover non-redundant knowledge in the form of patterns. In the future, we will develop new algorithms for mining top-k conditional discriminative patterns. Moreover, how to build an accurate classifier from conditional discriminative patterns will be investigated as well.

Acknowledgements


The comments and suggestions from the editor and anonymous reviewers have greatly improved the paper. We would like to thank Prof. Jiuyong Li for his help with the SNS software. This work was partially supported by the Natural Science Foundation of China under Grants No. 61572094 and No. 71472023, and by the Fundamental Research Funds for the Central Universities of China (DUT14QY07).
