Mining inter-sequence patterns


Expert Systems with Applications 36 (2009) 8649–8658


Chun-Sheng Wang a,*, Anthony J.T. Lee b

a Department of Information Management, Jinwen University of Science and Technology, 99, An-Chung Road, Hsin-Tien, Taipei, Taiwan, ROC
b Department of Information Management, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei, Taiwan, ROC

Keywords: Data mining; Inter-sequence pattern; Inter-transaction pattern; Sequential pattern

Abstract

Sequential pattern and inter-transaction pattern mining have long been important issues in data mining research. The former finds sequential patterns without considering the relationships between transactions in databases, while the latter finds inter-transaction patterns without considering the ordered relationships of items within each transaction. However, if we want to find patterns that cross transactions in a sequence database, called inter-sequence patterns, neither of the above models can perform the task. In this paper, we propose a new data mining model for mining frequent inter-sequence patterns. We design two algorithms, M-Apriori and EISP-Miner, to find such patterns. The former is an Apriori-like algorithm that can mine inter-sequence patterns, but it is not efficient. The latter, a new method that we propose, employs several mechanisms for mining inter-sequence patterns efficiently. Experiments show that EISP-Miner is very efficient and outperforms M-Apriori by several orders of magnitude.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Finding frequent patterns in different types of databases is one of the most important issues in data mining (Han & Kamber, 2000; Roddick & Spiliopoulou, 2002). Consequently, sequential pattern mining has been studied extensively. Many algorithms (Agrawal & Srikant, 1995, 1996; Ayres, Flannick, Gehrke, & Yiu, 2002; Han et al., 2000; Pei et al., 2004; Zaki, 2001) have been proposed for finding sequential patterns in sequence databases, in which every transaction contains a sequence. However, all the above algorithms treat sequences independently, without considering the relationships between sequences. We call this intra-sequence pattern mining, because all patterns are confined to a single sequence.

In addition to intra-sequence pattern mining, the problem of finding inter-transaction patterns has also been widely studied and several algorithms have been proposed (Feng, Dillon, & Liu, 2001; Feng, Yu, Lu, & Han, 2002; Huang, Chang, & Lin, 2004; Lu, Feng, & Han, 2000; Lee & Wang, 2007; Tung, Lu, Han, & Feng, 2003). Although such algorithms can find frequent patterns containing itemsets that span several transactions, they do not consider the ordered relationships between items within a transaction, because the items are treated as unordered sets (i.e., itemsets).

In this paper, we are interested in inter-sequence patterns, which describe associations across several sequences. We propose a new data mining model that mines frequent inter-sequence patterns in sequence databases. We call this inter-sequence pattern mining.

Since there is an ordered relationship between the items (or itemsets) in a transaction in a sequence database, inter-transaction pattern mining algorithms are not suitable for mining frequent inter-sequence patterns. Similarly, since there is a cross-sequence relationship between the transactions in a sequence database, traditional intra-sequence pattern mining algorithms are not suitable either. Inter-sequence pattern mining is more general than either sequential pattern mining or inter-transaction pattern mining. By mining inter-sequence patterns, we can discover both sequential patterns within a transaction and sequential patterns across several different transactions. Therefore, our data mining model provides more informative patterns than the two traditional models.

For example, in financial markets, an inter-sequence association rule might be: "If the steel market's price index increases more than the exchange rate of the US dollar in the first month, the real estate market's price index will probably increase more than that of the gold market in the third month." This is a useful inter-sequence association rule that could help investors manage their portfolios, since they could invest more in real estate than in gold two months ahead. Likewise, a rule like "if the demand for beer is greater than the demand for orange juice in week one, the demand for soda will likely be greater than that for cola two weeks later" could help retailers plan future beverage purchases. In another example, the following rule may help weather forecasters: "if more tornados occurred in Texas than in Kansas last year, it is likely that there will be more tornados in Colorado than in Utah this year."

Designing an efficient algorithm is critical for mining frequent inter-sequence patterns. We observe that inter-sequence patterns have the following anti-monotone Apriori property: if any length-k pattern is not frequent in the database, then its length-(k + 1) super-pattern cannot be frequent.


To mine such patterns, we modify the Apriori algorithm (Agrawal & Srikant, 1994) and call the result the M-Apriori algorithm. Apriori-like approaches (Agrawal & Srikant, 1994; Lu et al., 2000; Tung et al., 2003) perform well when mining intra-sequence patterns. However, their performance declines dramatically when they are modified to mine inter-sequence patterns. The M-Apriori algorithm suffers from the following non-trivial costs. (1) Handling a huge number of candidates is costly, since M-Apriori may generate a large number of candidates at each level, especially when candidates are generated inter-sequentially. (2) At each level, it is tedious to scan the database repeatedly to determine the support of each candidate. (3) When performing subset matching, each database scan may incur a huge computational overhead. (4) Using breadth-first search to generate candidates level by level requires that a large number of candidates be kept at certain levels, which may overload M-Apriori's memory capacity.

To resolve the above issues, we propose an algorithm called EISP-Miner (Enhanced Inter-Sequence Pattern Miner), which can efficiently discover the complete set of frequent inter-sequence patterns. The algorithm comprises two phases. First, it finds frequent 1-patterns and converts the original sequences into a structure called a patternlist for each 1-pattern, which stores a frequent pattern and a list of location information for that pattern. In other words, the original sequences are converted into a vertical representation of frequent patterns. Second, it uses a new data structure, called an ISP-tree, to enumerate all frequent inter-sequence patterns by joining patternlists in a depth-first search order. By using the ISP-tree and patternlists, EISP-Miner only requires one database scan, and can localize the joining and support counting operations to a small number of patternlists. Moreover, the search method is a partition-based, divide-and-conquer approach, rather than Apriori-like level-wise generation and checking of frequent patterns. Because of these features, the EISP-Miner algorithm is more efficient than the M-Apriori algorithm.

The remainder of this paper is organized as follows: Section 2 defines the frequent inter-sequence pattern mining problem and introduces some notation. In Section 3, we present the M-Apriori and EISP-Miner algorithms. Section 4 describes the experiments and their results. Finally, we present our conclusions and indicate some future research directions in Section 5.

2. A new model for inter-sequence pattern mining

2.1. An enhanced sequence database model

In traditional sequential pattern mining models, each transaction in a database contains a sequence of itemsets. Although transactions may occur in different contexts, such as time and location, this contextual information is ignored in traditional sequential pattern mining because the patterns are intra-transactional in nature. However, if we are interested in inter-sequence patterns across multiple transactions, the context in which a transaction occurs is important. Therefore, we define an enhanced sequence database for inter-sequence pattern mining as follows.

An itemset t = (u1, u2, . . . , um) is a set of items, where ui is an item for 1 ≤ i ≤ m. When there is only one item in an itemset, the parentheses can be omitted; that is, (u) can be written as u. Items in an itemset are listed in alphabetical order. A sequence s = ⟨t1, t2, . . . , tn⟩ is an ordered list of itemsets, where tj is an itemset for 1 ≤ j ≤ n. A sequence database D = {s1, s2, . . . , s|D|}, where |D| is the number of transactions in D and si (1 ≤ i ≤ |D|) is a transaction of the form ⟨Dat, Sequence⟩. Dat is the domain attribute of si, which describes the contextual information, such as the time stamp or spatial location, associated with si.
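To make the model concrete, the following sketch (in Python; all type names are ours) shows one way to represent an enhanced sequence database. The four transactions encode the running example: the second transaction is the sequence ⟨C(ABC)A⟩ given in the text, and the others follow from the 1-patternlists quoted in Section 3.2.

```python
from typing import List, Tuple

Itemset = Tuple[str, ...]           # items kept in alphabetical order, e.g., ('A', 'B', 'C')
Sequence = List[Itemset]            # an ordered list of itemsets
Transaction = Tuple[int, Sequence]  # <Dat, Sequence>; Dat is the domain attribute

# Four transactions consistent with the running example:
# dat 1: <C(AB)>, dat 2: <C(ABC)A>, dat 3: <AD>, dat 4: <A>.
D: List[Transaction] = [
    (1, [('C',), ('A', 'B')]),
    (2, [('C',), ('A', 'B', 'C'), ('A',)]),
    (3, [('A',), ('D',)]),
    (4, [('A',)]),
]
```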

Table 1
A sequence database.

Dat  Sequence
1    ⟨C(AB)⟩
2    ⟨C(ABC)A⟩
3    ⟨AD⟩
4    ⟨A⟩

To demonstrate our new inter-sequence mining model, we use a sequence database containing four transactions, shown in Table 1, as a running example in this paper. Note that we write ⟨C(ABC)A⟩, instead of ⟨(C)(BAC)(A)⟩, for the sequence of the second transaction.

2.2. Inter-sequence context

An inter-sequence context can be defined through the domain attribute of an enhanced sequence database. Let t1 and t2 be the domain attributes of sequences s1 and s2, respectively. If we take t1 as the reference point, the span between s2 and s1 is defined as [t2 − t1]. The sequence s2 at domain attribute t2 with respect to t1 is called an extended sequence (e-sequence for short) and denoted by s2[t2 − t1]. For example, in Table 1, if we take the domain attribute of the first transaction as the reference point, the extended sequence of the second transaction is ⟨C(ABC)A⟩[1].

Since a sequence contains itemsets, traditional concepts regarding itemsets and items can be applied in an inter-sequence context. Let an extended sequence s[i] = ⟨t1, t2, . . . , tn⟩[i], where tj is an itemset for 1 ≤ j ≤ n and [i] is the span of s. We define tj associated with [i] as an extended itemset (e-itemset for short), denoted by ⟨tj⟩[i]. Also, if tj = (u1, u2, . . . , um), where uk is an item for 1 ≤ k ≤ m, we define uk associated with [i] as an extended item (e-item for short), denoted by (uk)[i]. For example, the extended sequence ⟨C(ABC)A⟩[1] contains three e-itemsets, ⟨C⟩[1], ⟨(ABC)⟩[1], and ⟨A⟩[1], which can be decomposed into five e-items, (C)[1], (A)[1], (B)[1], (C)[1], and (A)[1].

Before applying the extended sequence concept to an enhanced sequence database, we define the following notation. Given a list of k consecutive transactions z1 = ⟨t1, s1⟩, z2 = ⟨t2, s2⟩, . . . , zk = ⟨tk, sk⟩ in a sequence database, w = s1[0] ∪ s2[t2 − t1] ∪ · · · ∪ sk[tk − t1] is called a megasequence, where k ≥ 1. Since w takes t1 as the reference point, we say that w starts from t1. In a megasequence, the span between the domain attribute of the first transaction and that of the last transaction must be less than or equal to maxspan (i.e., tk − t1 ≤ maxspan in w), where maxspan is a user-specified maximum span threshold. The third column of Table 1a shows four megasequences, each with maxspan equal to 1.

We introduce the concept of megasequences to find patterns that cross transaction boundaries in a sequence database. That is, we let a = s1[w1], s2[w2], . . . , sm[wm] be a subset of w, where each si[wi], 1 ≤ i ≤ m, is an e-sequence. We normalize a as b = s1[0], s2[w2 − w1], . . . , sm[wm − w1] and call b a pattern. Consider a pattern b = (u1)[v1], (u2)[v2], . . . , (un)[vn] expressed in e-item form, where each (ui)[vi], 1 ≤ i ≤ n, is an e-item. The number of e-items in a pattern is called the length of the pattern (i.e., n in b), and a pattern of length k is called a k-pattern.
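Continuing the representation above, the following sketch (our own helper, not code from the paper) forms one megasequence per starting transaction under a given maxspan, tagging each sequence with its span relative to the reference point.

```python
from typing import List, Tuple

Itemset = Tuple[str, ...]
Sequence = List[Itemset]
Transaction = Tuple[int, Sequence]
ESequence = Tuple[Sequence, int]    # the e-sequence s[x]: a sequence plus its span x

def megasequences(D: List[Transaction], maxspan: int) -> List[List[ESequence]]:
    """Form one megasequence per starting transaction: the union of the
    e-sequences of the consecutive transactions whose span from the
    reference point is at most maxspan."""
    result = []
    for i, (t1, s1) in enumerate(D):      # take t1 as the reference point
        w = [(s1, 0)]                     # the reference sequence has span 0
        for tk, sk in D[i + 1:]:          # transactions are ordered by dat
            if tk - t1 > maxspan:
                break
            w.append((sk, tk - t1))       # the e-sequence sk[tk - t1]
        result.append(w)
    return result

# With maxspan = 1, the megasequence starting at dat 1 is <C(AB)>[0] ∪ <C(ABC)A>[1].
```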


To determine the relationship between two consecutive e-items (ui)[vi] and (ui+1)[vi+1] in b, 1 ≤ i < n, we define three types of extension for (ui+1)[vi+1]. (1) If vi = vi+1 and (ui ui+1) is an itemset, we say (ui+1)[vi+1] is an itemset-extension (+i) of (ui)[vi]. (2) If vi = vi+1 and ⟨ui ui+1⟩ forms a sequence, we say (ui+1)[vi+1] is a sequence-extension (+s) of (ui)[vi]. (3) If vi < vi+1, we say (ui+1)[vi+1] is an inter-extension (+t) of (ui)[vi]. For example, ⟨(BD)⟩[0]⟨CA⟩[1] is a 4-pattern, where (D)[0] is an itemset-extension of (B)[0], (C)[1] is an inter-extension of (D)[0], and (A)[1] is a sequence-extension of (C)[1]. In other words, (A)[0] +i (B)[0] = ⟨(AB)⟩[0], (A)[0] +s (B)[0] = ⟨AB⟩[0], and (A)[0] +t (B)[1] = ⟨A⟩[0]⟨B⟩[1].

2.3. Inter-sequence pattern mining

We use the support of each pattern as the primary measurement in the inter-sequence pattern mining framework. To determine the support, we must define the relationships between patterns. First, given two sequences s = ⟨s1, s2, . . . , sn⟩ and s′ = ⟨s′1, s′2, . . . , s′m⟩, where n ≤ m, we say that s is a subsequence of s′ if we can find n numbers j1, j2, . . . , jn such that: (1) 1 ≤ j1 < j2 < · · · < jn ≤ m, and (2) s1 ⊆ s′j1, s2 ⊆ s′j2, . . . , sn ⊆ s′jn. For example, ⟨A(BC)DF⟩ is a subsequence of ⟨A(ABC)(AC)D(CF)⟩, but it is not a subsequence of ⟨(ABC)(AC)D(CF)⟩. Second, assume there are two patterns a = s1[i1], s2[i2], . . . , sn[in] and b = v1[j1], v2[j2], . . . , vm[jm], where 1 ≤ n ≤ m. Then a is a subpattern of b if we can find n e-sequences vk1[jk1], vk2[jk2], . . . , vkn[jkn] in b such that i1 = jk1 and s1 is a subsequence of vk1; i2 = jk2 and s2 is a subsequence of vk2; . . .; and in = jkn and sn is a subsequence of vkn. We also say that b contains (is a superpattern of) a. For example, both ⟨D⟩[0]⟨BAC⟩[3] and ⟨(AD)⟩[0]⟨(AC)C⟩[1] are subpatterns of ⟨D(AD)⟩[0]⟨B(AC)C⟩[1]⟨B(A)CC⟩[3].

In a sequence database D, let a be a pattern and Ta be the set of megasequences in D that contain a. The support of a, support(a), is defined as |Ta|. If support(a) is not less than the user-specified minimum support threshold minsup, a is called a frequent pattern. An inter-sequence association rule is written in the form a → b, where both a and a ∪ b are frequent patterns, a ∩ b = ∅, and the confidence of the rule, conf(a → b) = support(a ∪ b)/support(a), is not less than the user-specified minimum confidence.

The purpose of mining frequent inter-sequence patterns is to find the complete set of frequent patterns in a sequence database with respect to the user-specified minsup and maxspan thresholds. Suppose that minsup = 2 and maxspan = 1, and consider the input sequence database D shown in Table 1. Table 1b shows the complete set of frequent patterns with their supports. Among them, support(⟨CB⟩[0]⟨A⟩[1]) = 2, so it is a frequent pattern. We also have support(⟨CB⟩[0]) = 2, so conf(⟨CB⟩[0] → ⟨A⟩[1]) = 2/2 = 100%.
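The subsequence relation defined above admits a simple greedy test: match each itemset of s against the earliest possible itemset of s′. A minimal sketch (the function name is ours):

```python
def is_subsequence(s, sp):
    """True if sequence s is a subsequence of sp: every itemset of s is a
    subset of a distinct itemset of sp, with order preserved."""
    j = 0
    for itemset in s:
        # advance in sp until an itemset containing the current itemset is found
        while j < len(sp) and not set(itemset) <= set(sp[j]):
            j += 1
        if j == len(sp):
            return False
        j += 1  # the next itemset of s must match strictly later in sp
    return True

# The paper's example: <A(BC)DF> is a subsequence of <A(ABC)(AC)D(CF)> ...
assert is_subsequence([('A',), ('B', 'C'), ('D',), ('F',)],
                      [('A',), ('A', 'B', 'C'), ('A', 'C'), ('D',), ('C', 'F')])
# ... but not of <(ABC)(AC)D(CF)>
assert not is_subsequence([('A',), ('B', 'C'), ('D',), ('F',)],
                          [('A', 'B', 'C'), ('A', 'C'), ('D',), ('C', 'F')])
```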
3. Algorithms for mining inter-sequence patterns

3.1. The M-Apriori algorithm

In algorithms that find frequent inter-sequence patterns efficiently, the anti-monotone property of patterns is the key to reducing the search space.

Property 1 (Anti-monotone property). If an inter-sequence pattern p is frequent, so is any subpattern of p. In other words, if p is not frequent, its superpatterns cannot be frequent.

Based on this property, we modify the Apriori algorithm's pattern-joining and pattern-pruning operations. The new algorithm for finding frequent patterns is called the M-Apriori algorithm. We now introduce some definitions.


Definition 1. Let a = (im)[x] and b = (in)[y] be two e-items. We say that: (1) a is equal to b, a = b, if (im = in) ∧ (x = y); and (2) a is smaller than b, a < b, if x < y or (x = y) ∧ (im < in). For example, (C)[1] = (C)[1], (B)[1] < (A)[2], and (A)[1] < (C)[1].

Definition 2. Let p be a pattern. We define the function subi,j(p) to be the (j − i + 1) consecutive e-items of p from position i to position j. For example, sub1,4(⟨(AD)⟩[0]⟨(AC)C⟩[1]) = ⟨(AD)⟩[0]⟨(AC)⟩[1] and sub5,5(⟨(AD)⟩[0]⟨(AC)C⟩[1]) = (C)[1].

Definition 3. Let a = ⟨u⟩[0] and b = ⟨v⟩[0] be two frequent 1-patterns. We define that a is joinable to b in any instance, which yields three types of join operation: (1) itemset-join: a ∪i b = {⟨(uv)⟩[0] | u < v}; (2) sequence-join: a ∪s b = {⟨uv⟩[0]}; and (3) inter-join: a ∪t b = {⟨u⟩[0]⟨v⟩[x] | 1 ≤ x ≤ maxspan}. For example, suppose maxspan = 2; then ⟨A⟩[0] ∪i ⟨B⟩[0] = {⟨(AB)⟩[0]}, ⟨A⟩[0] ∪s ⟨B⟩[0] = {⟨AB⟩[0]}, and ⟨A⟩[0] ∪t ⟨B⟩[0] = {⟨A⟩[0]⟨B⟩[1], ⟨A⟩[0]⟨B⟩[2]}.

Definition 4. Let a and b be two frequent k-patterns, where k > 1, subk,k(a) = (u)[i], and subk,k(b) = (v)[j]. We say that a is joinable to b if sub1,k−1(a) = sub1,k−1(b) and i ≤ j, which yields three types of join operation: (1) itemset-join: a ∪i b = {a +i (v)[j] | (i = j) ∧ (u < v)}; (2) sequence-join: a ∪s b = {a +s (v)[j] | i = j}; and (3) inter-join: a ∪t b = {a +t (v)[j] | i < j}. For example, ⟨AB⟩[0] ∪i ⟨AC⟩[0] = {⟨A(BC)⟩[0]}, ⟨AB⟩[0] ∪s ⟨AC⟩[0] = {⟨ABC⟩[0]}, and ⟨AB⟩[0] ∪t ⟨A⟩[0]⟨C⟩[2] = {⟨AB⟩[0]⟨C⟩[2]}.
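As an illustration of these definitions, the sketch below implements the three 1-pattern joins of Definition 3 in Python, using the paper's string notation for readability; the u < v guard on the itemset-join is our reading of the (garbled) original, by analogy with Definition 4.

```python
def join_1patterns(u: str, v: str, maxspan: int):
    """The three join operations of Definition 3 for <u>[0] and <v>[0]."""
    itemset_join = [f"<({u}{v})>[0]"] if u < v else []  # assumed u < v guard
    sequence_join = [f"<{u}{v}>[0]"]
    inter_join = [f"<{u}>[0]<{v}>[{x}]" for x in range(1, maxspan + 1)]
    return itemset_join, sequence_join, inter_join

# With maxspan = 2, joining <A>[0] and <B>[0] reproduces the paper's example:
# (['<(AB)>[0]'], ['<AB>[0]'], ['<A>[0]<B>[1]', '<A>[0]<B>[2]'])
print(join_1patterns('A', 'B', 2))
```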

Let Lk represent the set of frequent k-patterns and Ck a candidate set of k-patterns; the M-Apriori algorithm is given in Fig. 1. As shown in Fig. 1, M-Apriori needs to scan the sequence database multiple times. In step 1, the algorithm scans the database to find L1. Let k = 2. In step 3, the set of all frequent (k − 1)-patterns Lk−1, found in the (k − 1)th scan, is used to generate the candidate set C′k, which is pruned in step 4 to obtain Ck. The candidate generation and pruning operations ensure that Ck is a superset of the set of all frequent k-patterns. In step 5, the algorithm scans consecutive transactions in the sequence database to form megasequences. For each megasequence w in D, M-Apriori determines which candidates c in Ck are contained in w in order to count the support of each c. At the end of the scan, Ck is examined to check which candidates are frequent, which yields Lk. In step 2, the algorithm terminates when Lk becomes empty.

Analyzing the M-Apriori algorithm, we find that it has several non-trivial computation costs. First, after a set of candidates of a certain length has been generated, the whole sequence database must be scanned to determine the support of each candidate. This incurs a high I/O overhead if a frequent pattern is very long. Second, because the patterns are of the inter-sequence type, many more candidates are generated; consequently, searching the database to count the candidates' supports is inefficient, and the main memory usage may be excessive. Third, the candidate generation, pruning, and support counting procedures require subset matching, which is very time consuming, especially when matching patterns that cross the boundaries of transactions.

3.2. The EISP-Miner algorithm

To overcome the shortcomings of the M-Apriori algorithm, we have developed the EISP-Miner algorithm, which is very efficient. It uses new data structures and search strategies that mine frequent inter-sequence patterns without generating candidates or scanning the database repeatedly. In the following, we present the main elements of EISP-Miner.


Fig. 1. The M-Apriori algorithm.

3.2.1. The patternlist

We have devised a data structure called a patternlist, which stores location information about frequent patterns during the mining process. Fig. 2 illustrates the initial construction of patternlists based on the running example in Table 1. In the figure, four patternlists of 1-patterns, ⟨A⟩[0], ⟨B⟩[0], ⟨C⟩[0], and ⟨D⟩[0], are constructed first. In addition to the pattern itself, each patternlist stores the dat (column t) and position (column p) information about the corresponding pattern as it appears in the database. Take the patternlist of ⟨C⟩[0] as an example. It contains two dats (1, 2) and two sets of positions ({1}, {1, 2}), where position set {1} is associated with dat 1 and position set {1, 2} is associated with dat 2. Note that positions are counted in units of itemsets. Thus, the numbers in the patternlist of ⟨C⟩[0] mean that the pattern is contained in the sequences of dats 1 and 2 in the database: it is located in the first itemset of the sequence of dat 1, and in the first and second itemsets of the sequence of dat 2. Based on this example, a patternlist can be defined as follows.

Definition 5. Given a pattern a, we define a patternlist alist = a{t1.p11 p12 . . . p1m1, t2.p21 p22 . . . p2m2, . . . , tn.pn1 pn2 . . . pnmn}, where {t1.p11 p12 . . . p1m1, . . . , tn.pn1 pn2 . . . pnmn} is called the list, ti is a dat (t-value), and pij is a position (p-value) at which a's last e-item appears in the database, 1 ≤ i ≤ n and 1 ≤ j ≤ mi. We also define support(alist) as the number of t-values contained in alist. If a is a k-pattern and support(alist) ≥ minsup, we say that alist is a frequent k-patternlist.

By the above definition, the four 1-patternlists in Fig. 2 can be expressed as ⟨A⟩[0]{1.2, 2.23, 3.1, 4.1}, ⟨B⟩[0]{1.2, 2.2}, ⟨C⟩[0]{1.1, 2.12}, and ⟨D⟩[0]{3.2}, for which we have support(⟨A⟩[0]list) = 4, support(⟨B⟩[0]list) = 2, support(⟨C⟩[0]list) = 2, and support(⟨D⟩[0]list) = 1, respectively. The first three patternlists are frequent, so they are retained; the last one is infrequent, so it is pruned. Note that in the above cases, the t-values and p-values in a patternlist are the locations of the pattern's last e-item in the database; for a 1-pattern, the last e-item is the pattern itself.

Fig. 2. Initial construction of the 1-patternlists in Table 1.
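A patternlist per Definition 5 maps naturally onto a dictionary from t-values to sets of p-values, with the support given by the number of distinct t-values. A minimal sketch (class and field names are ours):

```python
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class PatternList:
    """A pattern plus the (dat, position) occurrences of its last e-item."""
    pattern: str                                             # e.g., "<C>[0]"
    occ: Dict[int, Set[int]] = field(default_factory=dict)   # t-value -> p-values

    def support(self) -> int:
        return len(self.occ)  # number of distinct t-values

# <C>[0]{1.1, 2.12}: dat 1 at position 1; dat 2 at positions 1 and 2.
c_list = PatternList("<C>[0]", {1: {1}, 2: {1, 2}})
assert c_list.support() == 2
```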

3.2.2. The ISP-tree

To organize and store the frequent patternlists generated during the mining process, we use a search tree structure called an ISP-tree.

Definition 6. An ISP-tree is a search tree T that has the following properties. (1) A node in T represents a frequent patternlist. (2) The root of T contains a null patternlist, NULL. (3) The child nodes of the root include all frequent 1-patternlists. (4) If a node corresponds to a frequent k-patternlist alist, where k ≥ 1, then any child node of alist is a frequent (k + 1)-patternlist blist. In this case, a and b share the same k prefix e-items, and the last e-item of b is either an itemset-extension, a sequence-extension, or an inter-extension of a.



Given the definition of an ISP-tree, let us consider the sequence database in Table 1, where minsup = 2 and maxspan = 1. All the patternlists of frequent inter-sequence patterns can be enumerated in the ISP-tree, as shown in Fig. 3. Note that joining the patternlists in the ISP-tree is the key operation when generating frequent patterns.

Definition 7. Let alist and blist be two frequent 1-patternlists, where a = ⟨u⟩[0] and b = ⟨v⟩[0]. Also, let ta.pa and tb.pb be the t-values and p-values in alist and blist, respectively. We define that alist is joinable to blist in any instance, which yields three types of join operation: (1) itemset-join: alist ∪i blist = {⟨(uv)⟩[0]{tb.pb} | (ta = tb) ∧ (pa = pb)}; (2) sequence-join: alist ∪s blist = {⟨uv⟩[0]{tb.pb} | (ta = tb) ∧ (pa < pb)}; and (3) inter-join: alist ∪t blist = {⟨u⟩[0]⟨v⟩[x]{tb.pb} | (0 < x ≤ maxspan) ∧ (tb − ta = x)}.

Definition 8. Let alist and blist be two frequent k-patternlists, where k > 1, subk,k(a) = (u)[i], and subk,k(b) = (v)[j]. Also, let ta.pa and tb.pb be the t-values and p-values in alist and blist, respectively. We define that alist is joinable to blist if a is joinable to b, which yields three types of join operation: (1) itemset-join: alist ∪i blist = {a +i (v)[j]{tb.pb} | (i = j) ∧ (u < v) ∧ (ta = tb) ∧ (pa = pb)}; (2) sequence-join: alist ∪s blist = {a +s (v)[j]{tb.pb} | (i = j) ∧ (ta = tb) ∧ (pa < pb)}; and (3) inter-join: alist ∪t blist = {a +t (v)[j]{tb.pb} | (i < j) ∧ (tb − ta = j − i)}.

After a join operation, the support of the joined patternlist is equal to the number of t-values in its list. If the support of a patternlist is not less than minsup, it is stored in the ISP-tree; otherwise, it is pruned.

Next, we explain how the join operation is performed. Fig. 4 shows the three types of 1-patternlist join, applied to ⟨C⟩[0]{1.1, 2.12} and ⟨A⟩[0]{1.2, 2.23, 3.1, 4.1}. (1) For the itemset-join, since C > A, these 1-patternlists cannot be joined. (2) For the sequence-join, we get a joined 2-pattern ⟨CA⟩[0] with the list {1.2, 2.23} by matching the t-values and p-values of the two patternlists: we keep the t-values 1 and 2, which appear in both patternlists, together with those p-values of the second patternlist ({2} for dat 1 and {2, 3} for dat 2) that are larger than some p-value of the first patternlist. (3) For the inter-join, we get a joined 2-pattern ⟨C⟩[0]⟨A⟩[1] with the list {2.23, 3.1}: we increase the t-values of the first patternlist by 1 to obtain the matching t-values 2 and 3, and keep the corresponding p-values of the second patternlist ({2, 3} for dat 2 and {1} for dat 3).

Fig. 5 shows the three types of k-patternlist join, where k > 1. (1) For ⟨CA⟩[0]{1.2, 2.23} and ⟨CB⟩[0]{1.2, 2.2}, the itemset-join gives a joined 3-pattern ⟨C(AB)⟩[0] with the list {1.2, 2.2}, obtained by matching the identical t-values 1, 2 and the equal p-values 2, 2 in the two patternlists. (2) For ⟨CA⟩[0]{1.2, 2.23} and ⟨CB⟩[0]{1.2, 2.2}, the sequence-join fails: for the matching t-values, we cannot find any p-value in the second patternlist larger than one in the first, so a sequence-joined patternlist cannot be generated. (3) For ⟨C(AB)⟩[0]{1.2, 2.2} and ⟨CA⟩[0]⟨A⟩[1]{2.23, 3.1}, the inter-join gives a joined 4-pattern ⟨C(AB)⟩[0]⟨A⟩[1] with the list {2.23, 3.1}: we increase the t-values of the first patternlist by 1 to obtain the matching t-values 2 and 3, and keep the corresponding p-values of the second patternlist ({2, 3} for dat 2 and {1} for dat 3).
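As a concrete reading of Definition 7, the sketch below implements the sequence-join and inter-join of two 1-patternlists on the dictionary representation used earlier (helper names are ours) and reproduces the joins of Fig. 4.

```python
def sequence_join_1(a_occ, b_occ):
    """Sequence-join (Definition 7): keep (t, p) occurrences of the second
    patternlist whose dat also occurs in the first with a smaller position."""
    out = {}
    for t, pbs in b_occ.items():
        pas = a_occ.get(t, set())
        kept = {pb for pb in pbs if any(pa < pb for pa in pas)}
        if kept:
            out[t] = kept
    return out

def inter_join_1(a_occ, b_occ, x):
    """Inter-join with span x (0 < x <= maxspan): keep occurrences of the
    second patternlist whose dat is exactly x larger than a dat of the first."""
    return {t: set(pbs) for t, pbs in b_occ.items() if (t - x) in a_occ}

c_occ = {1: {1}, 2: {1, 2}}                  # <C>[0]{1.1, 2.12}
a_occ = {1: {2}, 2: {2, 3}, 3: {1}, 4: {1}}  # <A>[0]{1.2, 2.23, 3.1, 4.1}
assert sequence_join_1(c_occ, a_occ) == {1: {2}, 2: {2, 3}}  # <CA>[0]{1.2, 2.23}
assert inter_join_1(c_occ, a_occ, 1) == {2: {2, 3}, 3: {1}}  # <C>[0]<A>[1]{2.23, 3.1}
```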

Definition 9. Let alist be a frequent k-patternlist in an ISP-tree T, where k ≥ 1. We define the extended group of alist as T|alist = {b1list, b2list, . . . , bmlist}, where each bilist (1 ≤ i ≤ m) is a frequent (k + 1)-patternlist and a child of alist in T. We also define the joinable group of alist as J(alist) = {a1list, a2list, . . . , anlist}, where each ailist (1 ≤ i ≤ n) is a frequent k-patternlist in T to which alist is joinable.

Fig. 3. An ISP-tree and patternlists.

Fig. 4. Three types of 1-patternlist join operation.

Fig. 5. Three types of k-patternlist join operation, where k > 1.

Lemma 1. Let alist be a frequent k-patternlist and blist a frequent (k + 1)-patternlist that is also a child of alist in the ISP-tree T, where k ≥ 1. Then: (1) T|alist = {hlist = (alist ∪i clist) ∨ (alist ∪s clist) ∨ (alist ∪t clist) | clist ∈ J(alist) ∧ support(hlist) ≥ minsup}; (2) blist ∈ T|alist; and (3) J(blist) = {hlist | (hlist ∈ T|alist) ∧ (b is joinable to h)}.

We use Lemma 1 to join the patternlists and generate the frequent patterns in the ISP-tree, where the nodes are processed in a depth-first search order. The generation of frequent patternlists comprises two steps. Let alist be a patternlist in the ISP-tree T. First, alist is joined with each clist in J(alist), after which we have all of alist's child patternlists, denoted by T|alist. For each of alist's children blist in T|alist, J(blist) is the set of patternlists in T|alist that blist is joinable to; we can therefore join blist with each clist in J(blist) to find all of blist's child patternlists, denoted by T|blist. Second, treating each child blist as the new alist, we perform the first step recursively in a depth-first search order.

Consider the example in Fig. 3. Let alist = ⟨A⟩[0]{1.2, 2.23, 3.1, 4.1} and blist = ⟨(AB)⟩[0]{1.2, 2.2}. From the figure, we know that J(alist) = {⟨A⟩[0]list, ⟨B⟩[0]list, ⟨C⟩[0]list}; T|alist = {⟨A⟩[0]⟨A⟩[1]list, ⟨(AB)⟩[0]list}; J(blist) = {⟨A⟩[0]⟨A⟩[1]list}; and T|blist = {⟨(AB)⟩[0]⟨A⟩[1]list}. Fig. 3 shows that T|alist can be obtained by joining alist with each patternlist in J(alist), while T|blist can be obtained by joining blist with each patternlist in J(blist).
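The depth-first generation procedure implied by Lemma 1 can be summarized as the following skeleton. It is a structural sketch only: join_all and joinable are hypothetical helpers, passed in as parameters, standing in for the join operations of Definitions 7 and 8 and the joinability test.

```python
def dfs_mine(a_list, J_a, join_all, joinable, minsup, maxspan, results):
    """Expand the ISP-tree node a_list (Lemma 1): join it with every member
    of its joinable group J(a_list), keep the frequent children T|a_list,
    then recurse on each child with its joinable group restricted to the
    siblings it can join."""
    children = []                                  # T|a_list
    for c_list in J_a:
        for h_list in join_all(a_list, c_list, maxspan):  # hypothetical helper
            if h_list.support() >= minsup:
                children.append(h_list)
                results.append(h_list)
    for b_list in children:
        J_b = [h for h in children if joinable(b_list, h)]  # Lemma 1(3)
        dfs_mine(b_list, J_b, join_all, joinable, minsup, maxspan, results)
    # `children` (= T|a_list) can be discarded here, mirroring step 12 of
    # ISP-Joink, which deletes T|a_list to reduce memory usage.
```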

3.2.3. The EISP-Miner algorithm

Based on the patternlist structure, the ISP-tree, and the patternlist joining methods, we propose the EISP-Miner algorithm, which comprises two functions, ISP-Join1 and ISP-Joink. The algorithm and the two functions are shown in Figs. 6-8, respectively. As shown in step 1 of Fig. 6, the database D is scanned to find the frequent 1-patternlists, T|NULL, which is the extended group of the root node of the ISP-tree T. Next, in step 2 of Fig. 6, for each frequent 1-patternlist alist, we call ISP-Join1 to join alist with each clist in J(alist) to get T|alist, where J(alist) equals T|NULL and T|alist is the set of frequent 2-patternlists extended from alist. In Fig. 7, the joined frequent 2-patternlists generated by hlist in step 4, qlist in step 6, and rlist in step 8 are added to T|alist, and their corresponding frequent 2-patterns h, q, and r are added to FP. After we get T|alist, in step 3 of Fig. 6, we call ISP-Joink to join each k-patternlist blist in T|alist with each k-patternlist clist in J(blist) to get T|blist, where J(blist) is the subset of T|alist containing all frequent k-patternlists that blist can join. In Fig. 8, the frequent joined (k + 1)-patternlists generated by hlist in step 4, qlist in step 6, and rlist in step 8 are added to T|blist, and their corresponding frequent (k + 1)-patterns h, q, and r are added to FP. After all T|blist have been collected in step 10 of Fig. 8, we apply the ISP-Joink function recursively in a depth-first search order to find longer frequent patterns, until no further frequent patternlists can be generated. Note that in step 12 of Fig. 8, after we have added all frequent patterns of T|alist to FP and before leaving the current ISP-Joink call, we can reduce memory usage by deleting T|alist from T.

EISP-Miner can discover the complete set of frequent inter-sequence patterns very efficiently, as it only requires one database scan and limits the joining and support counting operations to a small number of patternlists. The support is counted by matching the t-values and p-values of the patternlists, instead of by the costly operation of matching subsets in megasequences. The search method used for mining is a partition-based, divide-and-conquer approach, rather than Apriori-like, level-wise generation and checking of frequent patterns. The proposed method dramatically reduces the size of the patternlists generated at subsequent search levels. All these features make the EISP-Miner algorithm more efficient than the M-Apriori algorithm.

Fig. 6. The EISP-Miner algorithm.


Fig. 7. The ISP-Join1 function.

Fig. 8. The ISP-Joink function.

3.3. An EISP-Miner example

Let us again consider the example in Table 1, where minsup = 2 and maxspan = 1. The frequent patternlists generated during the mining process are shown in Fig. 3. First, we compute the support count of each 1-pattern in the database. The support counts of the patterns ⟨A⟩[0], ⟨B⟩[0], ⟨C⟩[0], and ⟨D⟩[0] are 4, 2, 2, and 1, respectively. Since minsup = 2, a pattern is frequent if its support count is at least 2. Consequently, ⟨A⟩[0]{1.2, 2.23, 3.1, 4.1}, ⟨B⟩[0]{1.2, 2.2}, and ⟨C⟩[0]{1.1, 2.12} are added to the ISP-tree, as shown in Fig. 3.

Since we have three 1-patternlists, ⟨A⟩[0]list, ⟨B⟩[0]list, and ⟨C⟩[0]list, there are three corresponding joinable groups: J(⟨A⟩[0]list) = J(⟨B⟩[0]list) = J(⟨C⟩[0]list) = {⟨A⟩[0]list, ⟨B⟩[0]list, ⟨C⟩[0]list}.

We now describe how to join ⟨A⟩[0]list with every patternlist in J(⟨A⟩[0]list) to find the frequent 2-patternlists. Since ⟨A⟩[0]{1.2, 2.23, 3.1, 4.1} contains three t-values {1, 2, 3}, each of which is one less than one of the t-values {2, 3, 4} of the same patternlist, an inter-joined 2-patternlist ⟨A⟩[0]⟨A⟩[1]{2.23, 3.1, 4.1} can be generated. Next, for ⟨A⟩[0]{1.2, 2.23, 3.1, 4.1} and ⟨B⟩[0]{1.2, 2.2}, since ⟨A⟩[0]list shares {1.2, 2.2} with ⟨B⟩[0]list, we obtain an itemset-joined 2-patternlist ⟨(AB)⟩[0]{1.2, 2.2}. Consequently, T|⟨A⟩[0]list = {⟨(AB)⟩[0]list, ⟨A⟩[0]⟨A⟩[1]list}, as shown in Fig. 3.

Now that we have T|⟨A⟩[0]list, we can join each 2-patternlist blist in T|⟨A⟩[0]list with every 2-patternlist in J(blist) to find the frequent 3-patternlists.


Consider the patternlist ⟨(AB)⟩[0]list. We know that J(⟨(AB)⟩[0]list) = {⟨A⟩[0]⟨A⟩[1]list}, so we perform ⟨(AB)⟩[0]list ∪t ⟨A⟩[0]⟨A⟩[1]list. By matching the t-values and p-values of the two joinable patternlists, we can join ⟨(AB)⟩[0]{1.2, 2.2} and ⟨A⟩[0]⟨A⟩[1]{2.23, 3.1, 4.1} to get an inter-joined 3-patternlist ⟨(AB)⟩[0]⟨A⟩[1]{2.23, 3.1}. Consequently, T|⟨(AB)⟩[0]list = {⟨(AB)⟩[0]⟨A⟩[1]list}, as shown in Fig. 3.

A similar procedure can be applied to ⟨B⟩[0]list and ⟨C⟩[0]list by calling the ISP-Joink function recursively. Fig. 3 shows the complete set of frequent patternlists. Starting from ⟨B⟩[0]list, the only patternlist is ⟨B⟩[0]⟨A⟩[1]list. Starting from ⟨C⟩[0]list, the patternlists are ⟨CA⟩[0]list, ⟨CA⟩[0]⟨A⟩[1]list, ⟨C(AB)⟩[0]list, ⟨C(AB)⟩[0]⟨A⟩[1]list, ⟨C⟩[0]⟨A⟩[1]list, ⟨CB⟩[0]list, and ⟨CB⟩[0]⟨A⟩[1]list. The mining process terminates once ⟨CB⟩[0]⟨A⟩[1]list has been processed, because no frequent patternlist can be extended from it.


4. Performance studies

We evaluated the performance of the M-Apriori and EISP-Miner algorithms using a synthetic dataset and a real dataset. The former was generated synthetically with different parameters, while the latter comprised real stock index trading data obtained from Reuters. All experiments were performed on an IBM-compatible PC with an Intel Pentium IV 2.8 GHz CPU and 1 GB of main memory, running Microsoft Windows XP. Each method was implemented using Microsoft Visual C++ 6.0. All run times in the figures are in seconds.

4.1. Generation of synthetic data

There are three phases in the generation of the synthetic data. First, we generate potentially frequent itemsets, and then potentially frequent patterns. Finally, we generate sequences from the patterns. This method is similar to those of Agrawal and Srikant (1994, 1995) and Tung et al. (2003). The parameters used to generate the synthetic data are shown in Table 2.

We first generate all potentially frequent itemsets, Titem. The method used in this phase is the same as that proposed by Agrawal and Srikant (1994). Let L be the number of potentially frequent itemsets, and N the total number of distinct items. The length of each itemset follows a Poisson distribution with mean I. Next, we generate Ns potentially frequent patterns, Te_seq. When we generate each pattern, we need to determine the total number of itemsets in it; this number follows a Poisson distribution with mean S. All the itemsets in a potentially frequent pattern are selected from Titem, and every itemset is associated with a relative time in the range 0 to W (maxspan). Finally, from Te_seq, we generate |D| transactions, each of which contains a series of frequent patterns. The number of itemsets per sequence and the average length of the itemsets follow Poisson distributions with means C and T, respectively.

Table 2
Parameters.

Parameter  Description                                        Default
L          # of potential frequent itemsets                   25,000
N          # of distinct items                                1000
I          Avg. length of potential frequent itemsets         1.5
Ns         # of potential frequent patterns                   5000
S          Avg. # of itemsets in potential frequent patterns  6
W          Maxspan                                            2
|D|        # of sequences in the database                     10,000
C          Avg. # of itemsets per sequence                    4
T          Avg. # of items per itemset in a sequence          2
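To illustrate the first generation phase, the sketch below draws L potentially frequent itemsets whose lengths are Poisson-distributed with mean I; the sampling method and the nonempty-itemset clamp are our assumptions, since the paper specifies only the distributions.

```python
import math
import random

def poisson(mean: float) -> int:
    """Sample a Poisson variate (Knuth's method), clamped to at least 1 so
    that every generated itemset is nonempty (our assumption)."""
    threshold, k, p = math.exp(-mean), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return max(1, k - 1)

def gen_potential_itemsets(L: int, N: int, I: float):
    """Phase 1: L potentially frequent itemsets over N distinct items, with
    itemset lengths Poisson-distributed around I."""
    return [tuple(sorted(random.sample(range(N), min(N, poisson(I)))))
            for _ in range(L)]

T_item = gen_potential_itemsets(L=25000, N=1000, I=1.5)  # the paper's defaults
```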

4.2. Experiments on synthetic data

Except where one parameter is chosen as the variable for comparison, we use the default values shown in Table 2 in all experiments on the synthetic data: L = 25,000, N = 1000, I = 1.5, Ns = 5000, S = 6, W = 2, |D| = 10,000, C = 4, and T = 2.

Fig. 9 illustrates the run time versus the minimum support, where the support varies from 0.5% to 2%. The EISP-Miner algorithm runs 30-180 times faster than the M-Apriori algorithm, because the latter generates a huge number of candidates when the support is small. To determine the supports, whenever M-Apriori reads a new sequence from the database, each candidate must be matched against maxspan + 1 consecutive sequences, which is very time-consuming. In EISP-Miner, on the other hand, the supports are determined by joining the patternlists of joinable patterns. Thus, EISP-Miner is more efficient than M-Apriori.

Fig. 10 shows the run time versus the number of transactions in the database, which varies from 10 K to 100 K. The execution times of both algorithms increase almost linearly as the number of transactions increases, but EISP-Miner runs faster than M-Apriori in all cases.

Next, we investigate how both algorithms are affected by the average number of itemsets per sequence. The results are shown in Fig. 11. When the average length of a transaction increases, the execution time increases in both cases. The reason is that, under the same support threshold, when the average sequence length increases, more patterns are generated; therefore, more candidates need to be counted. In M-Apriori, it is necessary to perform subset matching on the sequences in order to count the support of each candidate; the longer the sequence, the greater the amount of subset matching required. In contrast, EISP-Miner counts the support by matching the t-values and p-values of the joined patternlists without subset matching; thus, once again, it is more efficient than the M-Apriori algorithm.

Fig. 9. Run time vs. minimum support.

Fig. 10. Run time vs. number of transactions.


Fig. 11. Run time vs. avg. # of itemsets per sequence.

Fig. 12. Run time vs. maxspan.

Fig. 13. Memory usage vs. minimum support.

Fig. 14. Memory usage vs. number of transactions.

In Fig. 12, we show the impact of maxspan on the performance of the algorithms. When maxspan increases, more candidates and patterns are generated, so the execution times of both algorithms increase. We observe that the run time of EISP-Miner increases slowly as maxspan increases. However, when maxspan increases in M-Apriori, the range of sequences in the megasequences that must be scanned to determine the supports of candidate patterns also increases, and the number of generated candidates grows rapidly under the breadth-first search strategy. Hence, the time required to determine the candidates' supports increases. In contrast, by joining patternlists in the ISP-tree, EISP-Miner uses depth-first search, so it does not have to scan the database to count the supports of candidates. Thus, it performs very efficiently as maxspan increases.

Fig. 13 shows the memory usage as the minimum support increases from 0.5% to 2%, and Fig. 14 shows the memory usage of both algorithms as the number of transactions increases from 10 K to 100 K. Since EISP-Miner uses a depth-first search strategy, it does not have to keep all the candidates of a given level in main memory. As a result, it requires 5-150 times less memory than the M-Apriori algorithm.

In summary, since the EISP-Miner algorithm employs an ISP-tree and patternlists to mine frequent patterns, it only requires one database scan and localizes the candidate joining and support counting operations to joinable patterns. Since EISP-Miner avoids costly inter-sequence subset matching, it requires less main memory and outperforms the M-Apriori algorithm by several orders of magnitude.

4.3. Experiments on real data

In this section, we apply our proposed algorithm to mining stock index sector rotation patterns (Cavaglia, Cho, & Singer, 2001). The sequence database mined is SECTOR-INDEX, the American Global Industry Classification Standard (GICS) stock price index, obtained from Reuters. It consists of 10 sector indexes and has 2334 trading-day records for the period 1995/1/3 to 2004/4/7. The stock index's raw data is transformed into a sequence database as follows. First, we calculate the daily index fluctuation rate of each sector. Next, we compute the weekly average index fluctuation rate of each sector by averaging the daily fluctuation rates over one week (usually 5 records). Then, we remove the sectors with negative average index fluctuation rates and sort the remaining sector codes by their weekly fluctuation rates in decreasing order, which forms a sequence. The resulting sequence database contains 483 sequences.

Next, we apply EISP-Miner to the sequence database with minsup = 5.8% (= 28/483) and maxspan = 4. After mining the database, we obtain 183,742 frequent patterns. Some interesting inter-sequence association rules can be mined from the sequence database; the following are two such rules.

Rule 1: "If the Health Care sector is stronger than the Energy sector in week 1 and the Energy sector is stronger than the Consumer Staples sector in week 1, then the Utilities sector is likely to be strong in week 2." The support is 6.00% and the confidence is 78.38%.

Rule 2: "If the Energy sector is strong in week 1, the Industrials sector is strong in week 2, and the latter is stronger than the Consumer Staples sector in week 3, then the Energy sector is likely to be strong in week 4." The support is 5.8% and the confidence is 90.32%.
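A sketch of the weekly preprocessing step described above (the sector names and rates in the usage example are illustrative, and the treatment of zero rates and of ties is our assumption):

```python
def week_to_sequence(weekly_rates):
    """Turn one week's average index fluctuation rates (sector -> rate) into
    a sequence: drop sectors with negative rates, then sort the remaining
    sector codes by rate in decreasing order, one singleton itemset each."""
    kept = [(rate, sector) for sector, rate in weekly_rates.items() if rate >= 0]
    kept.sort(key=lambda rs: (-rs[0], rs[1]))  # strongest sector first
    return [(sector,) for _, sector in kept]

# e.g., Energy outperformed Utilities; Financials had a negative week.
print(week_to_sequence({'Energy': 0.8, 'Utilities': 0.3, 'Financials': -0.2}))
# [('Energy',), ('Utilities',)]
```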


We also performed run time versus minimum support and run time versus maxspan experiments on the SECTOR-INDEX database. The results, shown in Figs. 15 and 16, respectively, demonstrate that EISP-Miner outperforms M-Apriori by several orders of magnitude.

Fig. 15. SECTOR-INDEX: Run time vs. minimum support.

Fig. 16. SECTOR-INDEX: Run time vs. maxspan.

5. Conclusions

We have proposed two algorithms, M-Apriori and EISP-Miner, for mining all frequent inter-sequence patterns, where a pattern can describe associations across many different sequences. Although M-Apriori is an Apriori-like algorithm that can mine inter-sequence patterns, it is not efficient. EISP-Miner, on the other hand, is a new method that we propose; it employs several mechanisms to mine inter-sequence patterns efficiently. There are two phases in EISP-Miner. First, we convert the original sequence database into a vertical format called patternlists, each of which records the locations of a frequent 1-pattern in the database. In the second phase, we devise a data structure, called an ISP-tree, which enumerates all frequent inter-sequence patterns based on a depth-first search strategy. By using the ISP-tree and patternlists to mine frequent patterns, EISP-Miner only requires one database scan and can localize the joining and support counting operations to a small number of patternlists, which avoids costly inter-sequence subset matching. Therefore, EISP-Miner is more efficient than M-Apriori. The experimental results show that EISP-Miner outperforms M-Apriori by several orders of magnitude; furthermore, it requires less main memory storage space.

Although we have shown that EISP-Miner can mine frequent inter-sequence patterns efficiently, there are still some issues to be addressed in future research. First, inter-sequence mining algorithms usually generate a large number of frequent patterns. To reduce the number of frequent patterns, we could consider other pattern structures, such as the closed patterns discussed in Lee, Wang, Weng, Chen, and Wu (2008), Yan, Han, and Afshar (2003), Wang and Han (2004), and Zaki and Hsiao (2005). Second, we could extend our proposed algorithm from two-dimensional transaction databases to higher-dimensional databases. Finally, without generalization, too many patterns may be mined and they may be too detailed; however, by generalizing with a concept hierarchy, we may be able to obtain patterns or rules that are more abstract and meaningful.

Acknowledgement

This research was supported in part by the National Science Council of the Republic of China under Grant No. NSC 94-2416-H002-017.

References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 1994 international conference on very large data bases (VLDB'94) (pp. 487-499). Santiago, Chile.

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 1995 international conference on data engineering (ICDE'95) (pp. 3-14). Taipei, Taiwan.

Agrawal, R., & Srikant, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the fifth international conference on extending database technology (pp. 3-17). Avignon, France.

Ayres, J., Flannick, J., Gehrke, J., & Yiu, T. (2002). Sequential pattern mining using a bitmap representation. In Proceedings of the eighth international conference on knowledge discovery and data mining (pp. 429-435). Alberta, Canada.

Cavaglia, S., Cho, D., & Singer, B. (2001). Risk of sector rotation strategies. Journal of Portfolio Management, 27, 35-44.

Feng, L., Dillon, T. S., & Liu, J. (2001). Inter-transaction association rules for multidimensional contexts for prediction and their application to studying meteorological data. Data and Knowledge Engineering, 37(1), 85-115.

Feng, L., Yu, J. X., Lu, H., & Han, J. (2002). A template model for multidimensional inter-transaction association rules. The VLDB Journal, 11(2), 153-175.

Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., & Hsu, M.-C. (2000). FreeSpan: Frequent pattern-projected sequential pattern mining. In Proceedings of the international conference on knowledge discovery and data mining (pp. 355-359).

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.

Huang, K. Y., Chang, C. H., & Lin, K. Z. (2004). COCOA: An efficient algorithm for mining inter-transaction associations for temporal databases. In Proceedings of the eighth European conference on principles and practice of knowledge discovery in databases (PKDD'04), Lecture notes in computer science (Vol. 3202, pp. 509-511). Springer.

Lee, A. J. T., & Wang, C. S. (2007). An efficient algorithm for mining frequent inter-transaction patterns. Information Sciences, 177(17), 3453-3476.

Lee, A. J. T., Wang, C. S., Weng, W. Y., Chen, Y. A., & Wu, H. W. (2008). An efficient algorithm for mining closed inter-transaction itemsets. Data and Knowledge Engineering, 66(1), 68-91.

Lu, H., Feng, L., & Han, J. (2000). Beyond intratransaction association analysis: Mining multidimensional inter-transaction association rules. ACM Transactions on Information Systems, 18(4), 423-454.

Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., et al. (2004). Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering, 16(10), 1424-1440.

Roddick, J. F., & Spiliopoulou, M. (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4), 750-767.

Tung, A. K. H., Lu, H., Han, J., & Feng, L. (2003). Efficient mining of intertransaction association rules. IEEE Transactions on Knowledge and Data Engineering, 15(1), 43-56.

Wang, J., & Han, J. (2004). BIDE: Efficient mining of closed sequences. In Proceedings of the 2004 international conference on data engineering (ICDE'04) (pp. 79-90). Boston, Massachusetts.

Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large databases. In Proceedings of the third SIAM international conference on data mining. San Francisco, CA.

Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42, 31-60.

Zaki, M. J., & Hsiao, C. J. (2005). Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4), 462-478.