Expert Systems with Applications 36 (2009) 8649–8658
Mining inter-sequence patterns

Chun-Sheng Wang a,*, Anthony J.T. Lee b

a Department of Information Management, Jinwen University of Science and Technology, 99, An-Chung Road, Hsin-Tien, Taipei, Taiwan, ROC
b Department of Information Management, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei, Taiwan, ROC
Keywords: Data mining; Inter-sequence pattern; Inter-transaction pattern; Sequential pattern
Abstract

Sequential pattern mining and inter-transaction pattern mining have long been important issues in data mining research. The former finds sequential patterns without considering the relationships between transactions in databases, while the latter finds inter-transaction patterns without considering the ordered relationships of items within each transaction. However, if we want to find patterns that cross transactions in a sequence database, called inter-sequence patterns, neither of the above models can perform the task. In this paper, we propose a new data mining model for mining frequent inter-sequence patterns. We design two algorithms, M-Apriori and EISP-Miner, to find such patterns. The former is an Apriori-like algorithm that can mine inter-sequence patterns, but it is not efficient. The latter, a new method that we propose, employs several mechanisms for mining inter-sequence patterns efficiently. Experiments show that EISP-Miner is very efficient and outperforms M-Apriori by several orders of magnitude.

© 2008 Elsevier Ltd. All rights reserved.
1. Introduction

Finding frequent patterns in different types of databases is one of the most important issues in data mining (Han & Kamber, 2000; Roddick & Spiliopoulou, 2002). Consequently, sequential pattern mining has been studied extensively. Many algorithms (Agrawal & Srikant, 1995, 1996; Ayres, Flannick, Gehrke, & Yiu, 2002; Han et al., 2000; Pei et al., 2004; Zaki, 2001) have been proposed for finding sequential patterns in sequence databases, in which every transaction contains a sequence. However, all the above algorithms treat sequences independently, without considering the relationships between sequences. We call this intra-sequence pattern mining, because all patterns are bounded within a single sequence.

In addition to intra-sequence pattern mining, the problem of finding inter-transaction patterns has also been widely studied, and several algorithms have been proposed (Feng, Dillon, & Liu, 2001, 2002; Huang, Chang, & Lin, 2004; Lu, Feng, & Han, 2000; Lee & Wang, 2007; Tung, Lu, Han, & Feng, 2003). Although such algorithms can find frequent patterns containing itemsets across several transactions, they do not consider the ordered relationships between items within a transaction, because the items are treated as unordered sets (i.e., itemsets).

In this paper, we are interested in inter-sequence patterns, which describe associations across several sequences. We propose a new data mining model that mines frequent inter-sequence patterns in sequence databases. We call this inter-sequence pattern mining. Since there is an ordered relationship between the items (or
itemsets) in a transaction in a sequence database, inter-transaction pattern mining algorithms are not suitable for mining frequent inter-sequence patterns. Similarly, since there is a cross-sequence relationship between the transactions in a sequence database, traditional intra-sequence pattern mining algorithms are not suitable either. Inter-sequence pattern mining is more general than either sequential pattern mining or inter-transaction pattern mining. By mining inter-sequence patterns, we can discover both a sequential pattern within a transaction and sequential patterns across several different transactions. Therefore, our data mining model provides more informative patterns than the two traditional models. For example, in financial markets, an inter-sequence association rule might be: ‘‘If the steel market’s price index increases more than the exchange rate of the US dollar in the first month, the real estate market’s price index will probably increase more than that of the gold market in the third month”. This is a useful inter-sequence association rule that could help investors manage their portfolios, since they could invest more in real estate than in gold two months ahead. Likewise, a rule like ‘‘if the demand for beer is greater than the demand for orange juice in week one, the demand for soda will likely be greater than that for cola two weeks later” could help retailers plan future beverage purchases. In another example, the following rule may help weather forecasters: ‘‘if more tornados occurred in Texas than in Kansas last year, it is likely that there will be more tornados in Colorado than in Utah this year”. Designing an efficient algorithm is critical for mining frequent inter-sequence patterns. We observe that inter-sequence patterns have the following anti-monotone Apriori property: if any length k pattern is not frequent in the database, then its length (k + 1)
super-pattern cannot be frequent. To mine such patterns, we modify the Apriori algorithm (Agrawal & Srikant, 1994) and call the result the M-Apriori algorithm. Apriori-like approaches (Agrawal & Srikant, 1994; Lu et al., 2000; Tung et al., 2003) perform well when mining intra-sequence patterns. However, their performance declines dramatically when they are modified to mine inter-sequence patterns. The M-Apriori algorithm suffers from the following non-trivial costs. (1) Handling a huge number of candidates is costly, since M-Apriori may generate a large number of candidates at each level, especially when candidates are generated inter-sequentially. (2) At each level, it is tedious to scan the database repeatedly to determine the support of each candidate. (3) When performing subset matching, each database scan may incur a huge computational overhead. (4) Using breadth-first search to generate candidates level-by-level requires that a large number of candidates be kept at certain levels, which may overload M-Apriori's memory capacity.

To resolve the above issues, we propose an algorithm called EISP-Miner (Enhanced Inter-Sequence Pattern Miner), which can efficiently discover the complete set of frequent inter-sequence patterns. The algorithm comprises two phases. First, it finds frequent 1-patterns and converts the original sequences into a structure called a patternlist for each 1-pattern, which stores a frequent pattern and a list of location information for that pattern. In other words, the original sequences are converted into a vertical representation of frequent patterns. Second, it uses a new data structure, called an ISP-tree, to enumerate all frequent inter-sequence patterns by joining patternlists in a depth-first search order. By using the ISP-tree and patternlists, EISP-Miner requires only one database scan, and can localize the joining and support counting operations to a small number of patternlists.
Moreover, the search method is a partition-based, divide-and-conquer approach, rather than Apriori-like level-wise generation and checking of frequent patterns. Because of these features, the EISP-Miner algorithm is more efficient than the M-Apriori algorithm. The remainder of this paper is organized as follows: Section 2 defines the frequent inter-sequence pattern mining problem and introduces some notations. In Section 3, we present the M-Apriori and EISP-Miner algorithms. Section 4 describes the experiments and their results. Finally, we present our conclusions and indicate some future research directions in Section 5.
2. A new model for inter-sequence pattern mining

2.1. An enhanced sequence database model

In traditional sequential pattern mining models, each transaction in a database contains a sequence of itemsets. Although transactions may occur in different contexts, such as time and location, this contextual information is ignored in traditional sequential pattern mining because the patterns are intra-transactional in nature. However, if we are interested in inter-sequence patterns across multiple transactions, the context in which a transaction occurs is important. Therefore, we define an enhanced sequence database for inter-sequence pattern mining as follows.

An itemset t = (u1, u2, …, um) is a set of items, where ui is an item for 1 ≤ i ≤ m. When there is only one item in an itemset, the parentheses can be omitted; that is, (u) can be written as u. Items in an itemset are listed in alphabetical order. A sequence s = ⟨t1, t2, …, tn⟩ is an ordered list of itemsets, where tj is an itemset for 1 ≤ j ≤ n. A sequence database D = {s1, s2, …, s|D|}, where |D| is the number of transactions in D and si (1 ≤ i ≤ |D|) is a transaction of the form ⟨Dat, Sequence⟩. Dat is the domain attribute of si that describes the contextual information, such as the time stamp or spatial location associated with si.
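For concreteness, an enhanced sequence database of this form can be encoded directly. The snippet below is a minimal sketch of ours; the sample transactions and the `is_valid_transaction` helper are illustrative and are not reproduced from Table 1.

```python
# Hypothetical enhanced sequence database: each transaction is a
# (Dat, Sequence) pair; a sequence is an ordered list of itemsets,
# and each itemset is a tuple of items kept in alphabetical order.
database = [
    (1, [("C",), ("A", "B")]),
    (2, [("C",), ("A", "B", "C"), ("A",)]),
    (3, [("A",), ("D",)]),
    (4, [("A",)]),
]

def is_valid_transaction(sequence):
    """Check the conventions of the model: every itemset is a
    duplicate-free tuple listed in alphabetical order."""
    return all(list(t) == sorted(set(t)) for t in sequence)

assert all(is_valid_transaction(seq) for _, seq in database)
```

A transaction such as `(2, [("C",), ("A", "B", "C"), ("A",)])` corresponds to the pair ⟨2, ⟨C(ABC)A⟩⟩ in the paper's notation.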
Table 1 A sequence database.
To demonstrate our new inter-sequence mining model, we use a sequence database containing four transactions, shown in Table 1, as a running example in this paper. Note that we write ⟨C(ABC)A⟩, instead of ⟨(C)(BAC)(A)⟩, for the sequence of the second transaction.

2.2. Inter-sequence context

An inter-sequence context can be defined through the domain attribute of an enhanced sequence database. Let t1 and t2 be the domain attributes of sequences s1 and s2, respectively. If we take t1 as the reference point, the span between s2 and s1 is defined as [t2 − t1]. The sequence s2 at domain attribute t2 with respect to t1 is called an extended sequence (e-sequence for short) and denoted by s2[t2 − t1]. For example, in Table 1, if we take the domain attribute of the first transaction as the reference point, the extended sequence of the second transaction is ⟨C(ABC)A⟩[1].

Since a sequence contains itemsets, traditional concepts regarding itemsets and items can be applied in an inter-sequence context. Let an extended sequence s[i] = ⟨t1, t2, …, tn⟩[i], where tj is an itemset for 1 ≤ j ≤ n and [i] is the span of s. We define tj associated with [i] as an extended itemset (e-itemset for short), denoted by ⟨tj⟩[i]. Also, if tj = (u1, u2, …, um), where uk is an item for 1 ≤ k ≤ m, we define uk associated with [i] as an extended item (e-item for short), denoted by (uk)[i]. For example, the extended sequence ⟨C(ABC)A⟩[1] contains three e-itemsets, ⟨C⟩[1], ⟨(ABC)⟩[1], and ⟨A⟩[1], which can be decomposed into five e-items, (C)[1], (A)[1], (B)[1], (C)[1], and (A)[1].

Before applying the extended sequence concept to an enhanced sequence database, we define the following notations. Given a list of k consecutive transactions z1 = ⟨t1, s1⟩, z2 = ⟨t2, s2⟩, …, zk = ⟨tk, sk⟩ in a sequence database, w = s1[0] ∪ s2[t2 − t1] ∪ … ∪ sk[tk − t1] is called a megasequence, where k ≥ 1. Since w takes t1 as the reference point, we say that w starts from t1.
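Decomposing an e-sequence into its e-itemsets and e-items is mechanical; a sketch follows, with function names of our own choosing, where a span is paired with each element.

```python
def e_itemsets(sequence, span):
    """E-itemsets <tj>[i] of an extended sequence s[i]: pair every
    itemset of the sequence with the sequence's span."""
    return [(itemset, span) for itemset in sequence]

def e_items(sequence, span):
    """E-items (uk)[i] of an extended sequence s[i]: pair every item
    of every itemset, in order, with the sequence's span."""
    return [(item, span) for itemset in sequence for item in itemset]

# The e-sequence <C(ABC)A>[1] from the running example:
s = [("C",), ("A", "B", "C"), ("A",)]
assert len(e_itemsets(s, 1)) == 3
assert e_items(s, 1) == [("C", 1), ("A", 1), ("B", 1), ("C", 1), ("A", 1)]
```

The assertions reproduce the decomposition quoted above: three e-itemsets and the five e-items (C)[1], (A)[1], (B)[1], (C)[1], (A)[1].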
In a megasequence, the span between the domain attribute of the first transaction and that of the last transaction must be less than or equal to maxspan (i.e., tk − t1 ≤ maxspan in w), where maxspan is a user-specified maximum span threshold. The third column in Table 1a shows four megasequences, each with maxspan equal to 1.

We introduce the concept of megasequences to find patterns that cross transaction boundaries in a sequence database. That is, we let a = s1[w1], s2[w2], …, sm[wm] be a subset of w, where each si[wi], 1 ≤ i ≤ m, is an e-sequence. We normalize a as b = s1[0], s2[w2 − w1], …, sm[wm − w1] and call b a pattern. Consider a pattern b = (u1)[v1], (u2)[v2], …, (un)[vn] expressed in e-item form, where each (ui)[vi], 1 ≤ i ≤ n, is an e-item. The number of e-items in a pattern is called the length of the pattern (i.e., n in b), and a
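The megasequence construction can be sketched as follows, assuming integer-valued dats; the helper name is ours.

```python
def megasequences(database, maxspan):
    """For each transaction taken as the reference point t1, collect the
    e-sequences of the consecutive transactions whose span t - t1 is at
    most maxspan.  An e-sequence is represented here as (sequence, span),
    so each megasequence starts with a span of 0."""
    result = []
    for i, (t1, _) in enumerate(database):
        w = [(seq, t - t1) for t, seq in database[i:] if t - t1 <= maxspan]
        result.append(w)
    return result

# With maxspan = 1, a database with dats 1, 2, 3 yields megasequences
# whose spans are [0, 1], [0, 1], and [0].
db = [(1, [("A",)]), (2, [("B",)]), (3, [("C",)])]
m = megasequences(db, 1)
assert [span for _, span in m[0]] == [0, 1]
assert [span for _, span in m[2]] == [0]
```

Each list `w` corresponds to a megasequence w = s1[0] ∪ s2[t2 − t1] ∪ … starting from the dat of its first transaction.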
pattern of length k is called a k-pattern. To determine the relationship between two consecutive e-items (ui)[vi] and (ui+1)[vi+1] in b, 1 ≤ i < n, we define three types of extension for (ui+1)[vi+1]. (1) If vi = vi+1 and (uiui+1) is an itemset, we say (ui+1)[vi+1] is an itemset-extension (+i) of (ui)[vi]. (2) If vi = vi+1 and ⟨uiui+1⟩ forms a sequence, we say (ui+1)[vi+1] is a sequence-extension (+s) of (ui)[vi]. (3) If vi < vi+1, we say (ui+1)[vi+1] is an inter-extension (+t) of (ui)[vi]. For example, ⟨(BD)⟩[0]⟨CA⟩[1] is a 4-pattern, where (D)[0] is an itemset-extension of (B)[0], (C)[1] is an inter-extension of (D)[0], and (A)[1] is a sequence-extension of (C)[1]. In other words, (A)[0] +i (B)[0] = ⟨(AB)⟩[0], (A)[0] +s (B)[0] = ⟨AB⟩[0], and (A)[0] +t (B)[1] = ⟨A⟩[0]⟨B⟩[1].

2.3. Inter-sequence pattern mining

We use the support of each pattern as the primary measurement in the inter-sequence pattern mining framework. To determine the support, we must define the relationships between patterns. First, given two sequences, s = ⟨s1, s2, …, sn⟩ and s′ = ⟨s′1, s′2, …, s′m⟩, where n ≤ m, we say that s is a subsequence of s′ if we can find n numbers, j1, j2, …, jn, such that: (1) 1 ≤ j1 < j2 < … < jn ≤ m, and (2) s1 ⊆ s′j1, s2 ⊆ s′j2, …, sn ⊆ s′jn. For example, ⟨A(BC)DF⟩ is a subsequence of ⟨A(ABC)(AC)D(CF)⟩, but it is not a subsequence of ⟨(ABC)(AC)D(CF)⟩. Second, assume there are two patterns, a = s1[i1], s2[i2], …, sn[in] and b = v1[j1], v2[j2], …, vm[jm], where 1 ≤ n ≤ m. Then, a is a subpattern of b if we can find n e-sequences, vk1[jk1], vk2[jk2], …, vkn[jkn] in b, such that i1 = jk1 and s1 is a subsequence of vk1; i2 = jk2 and s2 is a subsequence of vk2; …; and in = jkn and sn is a subsequence of vkn. We can also say that b contains (is a superpattern of) a. For example, both ⟨D⟩[0]⟨BAC⟩[3] and ⟨(AD)⟩[0]⟨(AC)C⟩[1] are subpatterns of ⟨D(AD)⟩[0]⟨B(AC)C⟩[1]⟨B(AC)C⟩[3].
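The subsequence test above can be implemented greedily, since matching each itemset of s at the earliest feasible itemset of s′ is optimal for subset containment. The function name is ours; itemsets are tuples as before.

```python
def is_subsequence(s, t):
    """True if s = <s1,...,sn> is a subsequence of t: there exist
    indices j1 < ... < jn with each si a subset of t[ji]."""
    j = 0
    for itemset in s:
        # advance to the earliest itemset of t that contains this itemset
        while j < len(t) and not set(itemset) <= set(t[j]):
            j += 1
        if j == len(t):
            return False
        j += 1  # the next itemset of s must match strictly later
    return True

# The paper's example: <A(BC)DF> is a subsequence of <A(ABC)(AC)D(CF)>
# but not of <(ABC)(AC)D(CF)>.
s = [("A",), ("B", "C"), ("D",), ("F",)]
assert is_subsequence(s, [("A",), ("A", "B", "C"), ("A", "C"), ("D",), ("C", "F")])
assert not is_subsequence(s, [("A", "B", "C"), ("A", "C"), ("D",), ("C", "F")])
```

A subpattern test would apply this routine to the e-sequences of two patterns while also requiring the spans to match, as in the definition above.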
In a sequence database D, let a be a pattern, and let Ta be the set of megasequences in D that contain a. The support of a, support(a), is defined as |Ta|. If support(a) is not less than the user-specified minimum support threshold minsup, a is called a frequent pattern. An inter-sequence association rule is written in the form a → b, where both a and a ∪ b are frequent patterns; a ∩ b = ∅; conf(a → b) is not less than the user-specified minimum confidence; and the confidence of the rule, conf(a → b), is defined as support(a ∪ b)/support(a).

The purpose of mining frequent inter-sequence patterns is to find the complete set of frequent patterns in a sequence database with respect to the user-specified minsup and maxspan thresholds. Suppose that minsup = 2 and maxspan = 1, and consider the input sequence database D shown in Table 1. Table 1b shows the complete set of frequent patterns with their supports listed afterwards. Among them, support(⟨CB⟩[0]⟨A⟩[1]) = 2, so it is a frequent pattern. Also, since support(⟨CB⟩[0]) = 2, we have conf(⟨CB⟩[0] → ⟨A⟩[1]) = 2/2 = 100%.

3. Algorithms for mining inter-sequence patterns

3.1. The M-Apriori algorithm

In algorithms that find frequent inter-sequence patterns efficiently, the anti-monotone property of patterns is the key to reducing the search space.

Property 1 (Anti-monotone property). If an inter-sequence pattern p is frequent, so is any subpattern of p. In other words, if p is not frequent, its superpatterns cannot be frequent.

Based on this property, we modify the Apriori algorithm's pattern-joining and pattern-pruning operations. The new algorithm for finding frequent patterns is called the M-Apriori algorithm. We now introduce some definitions.
Definition 1. Let a = (im)[x] and b = (in)[y] be two e-items. We say that: (1) a is equal to b, a = b, if (im = in) ∧ (x = y); and (2) a is smaller than b, a < b, if x < y or ((x = y) ∧ (im < in)). For example, (C)[1] = (C)[1], (B)[1] < (A)[2], and (A)[1] < (C)[1].

Definition 2. Let p be a pattern. We define the function subi,j(p) as the (j − i + 1) consecutive e-items of p from position i to position j. For example, sub1,4(⟨(AD)⟩[0]⟨(AC)C⟩[1]) = ⟨(AD)⟩[0]⟨(AC)⟩[1] and sub5,5(⟨(AD)⟩[0]⟨(AC)C⟩[1]) = (C)[1].

Definition 3. Let a = ⟨u⟩[0] and b = ⟨v⟩[0] be two frequent 1-patterns. We define that a is joinable to b in any instance, which yields three types of join operation: (1) itemset-join: a ∪i b = {⟨(uv)⟩[0] | u < v}; (2) sequence-join: a ∪s b = {⟨uv⟩[0]}; and (3) inter-join: a ∪t b = {⟨u⟩[0]⟨v⟩[x] | 1 ≤ x ≤ maxspan}. For example, suppose maxspan = 2; then, ⟨A⟩[0] ∪i ⟨B⟩[0] = ⟨(AB)⟩[0], ⟨A⟩[0] ∪s ⟨B⟩[0] = ⟨AB⟩[0], and ⟨A⟩[0] ∪t ⟨B⟩[0] = {⟨A⟩[0]⟨B⟩[1], ⟨A⟩[0]⟨B⟩[2]}.

Definition 4. Let a and b be two frequent k-patterns, where k > 1, subk,k(a) = (u)[i], and subk,k(b) = (v)[j]. We say that a is joinable to b if sub1,k−1(a) = sub1,k−1(b) and i ≤ j, which yields three types of join operation: (1) itemset-join: a ∪i b = {a +i (v)[j] | (i = j) ∧ (u < v)}; (2) sequence-join: a ∪s b = {a +s (v)[j] | i = j}; and (3) inter-join: a ∪t b = {a +t (v)[j] | i < j}. For example, ⟨AB⟩[0] ∪i ⟨AC⟩[0] = ⟨A(BC)⟩[0], ⟨AB⟩[0] ∪s ⟨AC⟩[0] = ⟨ABC⟩[0], and ⟨AB⟩[0] ∪t ⟨A⟩[0]⟨C⟩[2] = ⟨AB⟩[0]⟨C⟩[2].

Let Lk represent the set of frequent k-patterns and Ck a candidate set of k-patterns; the M-Apriori algorithm is shown in Fig. 1. As shown in Fig. 1, M-Apriori needs to scan the sequence database multiple times. In step 1, the algorithm scans the database to find L1. Let k = 2. In step 3, the set of all frequent (k − 1)-patterns Lk−1, found in the (k − 1)th scan, is used to generate the candidate set C′k. Then Ck is pruned in step 4.
The candidate generation and pruning operations ensure that Ck is a superset of the set of all frequent k-patterns. In step 5, the algorithm scans consecutive transactions in the sequence database to form megasequences. For each megasequence w in D, M-Apriori determines which candidates c in Ck are contained in w in order to count the support of each c. At the end of the scan, Ck is examined to check which candidates are frequent, thereby obtaining Lk. In step 2, the algorithm terminates when Lk becomes empty.

Analyzing the M-Apriori algorithm, we find that it has several non-trivial computation costs. First, after a set of candidates of a certain length has been generated, the whole sequence database must be scanned to determine the support of each candidate. This incurs a high I/O overhead if a frequent pattern is very long. Second, because the patterns are of the inter-sequence type, many more candidates are generated; consequently, searching the database to count the candidates' support is inefficient and the main memory usage may be excessive. Third, the candidate generation, pruning, and support counting procedures require subset matching, which is very time consuming, especially when matching patterns that cross the boundaries of transactions.
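The join operations of Definitions 3 and 4, which drive candidate generation above, can be sketched on a simple e-item encoding. Here a k-pattern is a tuple of (item, span, ext) triples, where ext records how an e-item extends its predecessor; this encoding is ours, not the authors'.

```python
def join(a, b):
    """Join two k-patterns (k > 1) sharing the same (k-1)-prefix,
    following Definition 4; returns whichever of the itemset-,
    sequence- and inter-joins apply.  ext is 'i', 's' or 't' for
    itemset-, sequence- and inter-extension (None for the first e-item)."""
    if a[:-1] != b[:-1]:
        return []               # not joinable: prefixes differ
    (u, i, _), (v, j, _) = a[-1], b[-1]
    out = []
    if i == j and u < v:
        out.append(a + ((v, j, "i"),))  # itemset-join: a +i (v)[j]
    if i == j:
        out.append(a + ((v, j, "s"),))  # sequence-join: a +s (v)[j]
    if i < j:
        out.append(a + ((v, j, "t"),))  # inter-join: a +t (v)[j]
    return out

# <AB>[0] joined with <AC>[0] yields <A(BC)>[0] and <ABC>[0];
# <AB>[0] joined with <A>[0]<C>[2] yields <AB>[0]<C>[2].
ab = (("A", 0, None), ("B", 0, "s"))
ac = (("A", 0, None), ("C", 0, "s"))
a_c2 = (("A", 0, None), ("C", 2, "t"))
assert len(join(ab, ac)) == 2
assert join(ab, a_c2) == [ab + (("C", 2, "t"),)]
```

The assertions mirror the worked examples of Definition 4.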
3.2. The EISP-Miner algorithm To overcome the shortcomings of the M-Apriori algorithm, we have developed the EISP-Miner algorithm, which is very efficient. It uses new data structures and search strategies that mine for frequent inter-sequence patterns without having to generate candidates or scan a database repeatedly. In the following, we present the main elements of EISP-Miner.
Fig. 1. The M-Apriori algorithm.
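The level-wise flow of Fig. 1 can be summarized structurally as follows. This is a sketch of ours in which the four phases are abstracted as callables; it is not the authors' pseudocode.

```python
def m_apriori(database, minsup, maxspan,
              find_l1, gen_candidates, prune, count_support):
    """Level-wise skeleton of M-Apriori: scan once for L1, then
    repeatedly generate, prune and count candidates until no frequent
    k-patterns remain."""
    L = find_l1(database, minsup, maxspan)            # step 1: find L1
    frequent = list(L)
    while L:                                          # step 2: stop when Lk is empty
        C = gen_candidates(L)                         # step 3: join Lk-1 with itself
        C = prune(C, L)                               # step 4: Apriori pruning
        counts = count_support(C, database, maxspan)  # step 5: scan megasequences
        L = [c for c in C if counts.get(c, 0) >= minsup]
        frequent.extend(L)
    return frequent

# Trivial placeholders: one frequent 1-pattern, no candidates after it.
result = m_apriori([], 2, 1,
                   lambda d, s, m: ["A", "B"],
                   lambda L: [],
                   lambda C, L: C,
                   lambda C, d, m: {})
assert result == ["A", "B"]
```

The repeated `count_support` call over the whole database at every level is exactly the cost that EISP-Miner's patternlists are designed to avoid.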
3.2.1. The patternlist

We have devised a data structure called a patternlist, which stores location information about frequent patterns during the mining process. Fig. 2 illustrates the initial construction of patternlists based on the running example in Table 1. In the figure, four patternlists of 1-patterns, ⟨A⟩[0], ⟨B⟩[0], ⟨C⟩[0], and ⟨D⟩[0], are constructed first. In addition to the pattern itself, each patternlist stores dat (column t) and position (column p) information about the corresponding pattern as it appears in the database. Take the patternlist of ⟨C⟩[0] as an example. It contains two dats (1, 2) and two sets of positions ({1}, {1, 2}), where position set {1} is associated with dat 1 and position set {1, 2} is associated with dat 2. Note that positions are counted over itemsets: the i-th itemset of a sequence has position i. Thus, the numbers in the patternlist of ⟨C⟩[0] mean that the pattern is contained in the sequences of dats 1 and 2 in the database; it is located in the first itemset of the sequence of dat 1, and in the first and second itemsets of the sequence of dat 2. Based on this example, a patternlist can be defined as follows.
Definition 5. Given a pattern a, we define a patternlist alist = a{t1 p11 p12 … p1m1, t2 p21 p22 … p2m2, …, tn pn1 pn2 … pnmn}, where {t1 p11 p12 … p1m1, t2 p21 p22 … p2m2, …, tn pn1 pn2 … pnmn} is called the list; ti is the dat (t-value); and pij is the position (p-value) at which a's last e-item appears in the database, for 1 ≤ i ≤ n and 1 ≤ j ≤ mi. We also define support(alist) as the number of t-values contained in alist. If a is a k-pattern and support(alist) ≥ minsup, we say that alist is a frequent k-patternlist.

By the above definition, the four 1-patternlists in Fig. 2 can be expressed as ⟨A⟩[0]{1.2, 2.23, 3.1, 4.1}, ⟨B⟩[0]{1.2, 2.2}, ⟨C⟩[0]{1.1, 2.12}, and ⟨D⟩[0]{3.2}, for which we have support(⟨A⟩[0]list) = 4, support(⟨B⟩[0]list) = 2, support(⟨C⟩[0]list) = 2, and support(⟨D⟩[0]list) = 1, respectively. The first three patternlists are frequent, so they are retained; the last patternlist is infrequent, so it is pruned. Note that, in the above cases, the t-values and p-values in a patternlist are the locations of the pattern's last e-item in the database; for a 1-pattern, the last e-item is the pattern itself.
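The first scan that produces 1-patternlists can be sketched as follows. The helper name is ours, and the small database used here is an illustrative stand-in chosen to be consistent with the patternlists quoted above, not a reproduction of Table 1.

```python
from collections import defaultdict

def build_1_patternlists(database, minsup):
    """First scan: for every item, record the dats (t-values) and the
    1-based itemset positions (p-values) where it occurs, then keep
    only the patternlists whose support (number of t-values) reaches
    minsup."""
    lists = defaultdict(lambda: defaultdict(list))
    for dat, sequence in database:
        for pos, itemset in enumerate(sequence, start=1):
            for item in itemset:
                lists[item][dat].append(pos)
    return {item: dict(tl) for item, tl in lists.items()
            if len(tl) >= minsup}

db = [
    (1, [("C",), ("A", "B")]),
    (2, [("C",), ("A", "B", "C"), ("A",)]),
    (3, [("A",), ("D",)]),
    (4, [("A",)]),
]
pls = build_1_patternlists(db, minsup=2)
assert pls["A"] == {1: [2], 2: [2, 3], 3: [1], 4: [1]}  # <A>[0]{1.2, 2.23, 3.1, 4.1}
assert "D" not in pls                                   # support 1 < minsup, pruned
```

The returned dictionary is the vertical representation described earlier: each entry maps a 1-pattern to its t-values and p-values, ready for patternlist joining.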
Definition 6. An ISP-tree is a search tree T with the following properties. (1) A node in T represents a frequent patternlist. (2) The root of T contains a null patternlist, NULL. (3) The child nodes of the root include all frequent 1-patternlists. (4) If a node corresponds to a frequent k-patternlist alist, where k ≥ 1, then any child node of alist is a frequent (k + 1)-patternlist blist. In this case, a and b share the