Mining distinct and contiguous sequential patterns from large vehicle trajectories




Knowledge-Based Systems xxx (xxxx) xxx

Contents lists available at ScienceDirect

Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys

Mining distinct and contiguous sequential patterns from large vehicle trajectories✩

Luke Bermingham, Ickjai Lee



Computer Science & Information Technology Academy, College of Science & Engineering, James Cook University, PO Box 6811, Cairns, QLD 4870, Australia

Article info

Article history: Received 4 June 2019; received in revised form 22 September 2019; accepted 25 September 2019; available online xxxx.

Keywords: Vehicle trajectory; Data mining; Sequential pattern mining; Contiguous patterns

Abstract

We focus on the problem of using contiguous Sequential Pattern Mining (SPM) to extract succinct, redundancy-controlled patterns from large vehicle trajectories. Although there exist several techniques to reduce the contiguous sequential pattern output, such as closed and max SPM, they still produce massive, redundant pattern outputs when the input sequence database is sufficiently large and homogeneous, as is often the case for vehicle trajectories. Therefore, in this work we propose DC-SPAN: a distinct contiguous SPM algorithm. DC-SPAN mines a set of sequential patterns where the maximum redundancy of the pattern output is controlled by a user-specified parameter. Through various experiments using real-world trajectory datasets we show that DC-SPAN effectively controls the redundancy of the pattern output with trade-offs in pattern distinctness. Our experiments also indicate that DC-SPAN efficiently computes these patterns, incurring only a marginal running time cost over existing state-of-the-art contiguous SPM approaches. Lastly, due to the less redundant and more succinct pattern output, we also briefly explore visualisation as a useful technique to interpret the discovered vehicle routes.

Crown Copyright © 2019 Published by Elsevier B.V. All rights reserved.

1. Introduction

Due to the affordability and widespread availability of GPS technology, the generation and collection of vehicle movements, or trajectories, is relatively straightforward and cost effective. These vehicle trajectories present a valuable opportunity to extract knowledge in domains such as urban planning [1], route planning [2], and traffic congestion [3]. In this work we focus on extracting this knowledge from vehicle trajectories through Sequential Pattern Mining (SPM). SPM is the process of finding frequently occurring sequences within a sequence database. However, SPM of vehicle trajectories is difficult for two reasons: (1) SPM requires sequences of discrete items; however, because GPS technology suffers from spatial uncertainty and urban black holes, the trajectory recordings are often quite noisy and far from discrete; (2) vehicle trajectories commonly contain hundreds of thousands, if not millions, of recordings, which cause many existing SPM approaches to have massive, redundant, and therefore incomprehensible pattern outputs [4,5].

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105076.
∗ Corresponding author. E-mail addresses: [email protected] (L. Bermingham), [email protected] (I. Lee).

The first problem is solved in other work [6] that constrains the vehicle trajectories to the appropriate road network as a pre-processing step. Doing so removes spatial uncertainty and converts the trajectories into discrete sequences of road node visitations. However, the second problem of mining a smaller, less redundant set of sequential patterns from the vehicle trajectories still remains. One approach towards alleviating this problem is to mine only a set of patterns that does not allow any sub-patterns in the output, in other words, to mine the so-called max patterns. The pattern output can be further reduced by enforcing a constraint on the discovered sequential patterns that requires the items in the candidate patterns to exist contiguously in the underlying sequence database. This constraint is called contiguous SPM [7,8] and is well suited to vehicle trajectories because it guarantees that the resulting vehicle patterns will always travel along real-world routes. Without this constraint, sequential patterns may be discovered that jump from place to place.

However, even with max contiguous SPM, the pattern output can still become highly redundant if the vehicle sequences are sufficiently homogeneous. To illustrate this problem we present an example in Fig. 1, which is a simplified scenario containing six vehicles moving through an intersection. In Table 1, we present the result of mining the set of contiguous max patterns from this example using a support of two (i.e. minSup = 2). From Table 1, we observe that even the frequent contiguous sequential patterns can become quite redundant. Specifically,

https://doi.org/10.1016/j.knosys.2019.105076 0950-7051/Crown Copyright © 2019 Published by Elsevier B.V. All rights reserved.

Please cite this article as: L. Bermingham and I. Lee, Mining distinct and contiguous sequential patterns from large vehicle trajectories, Knowledge-Based Systems (2019) 105076, https://doi.org/10.1016/j.knosys.2019.105076.


three-quarters of each pattern is repeated redundantly. We highlight that our scenario is not an exceptional or contrived case; rather, it demonstrates an issue that is further exacerbated when mining large real-world vehicle trajectories. Therefore, to solve this problem we present our algorithm to mine a set of Distinct Contiguous Sequential PAtterNs (DC-SPAN).

Fig. 1. An example of a simplified vehicle trajectory scenario (trajectories = $\{s_1, s_2, s_3, s_4, s_5, s_6\}$).

Table 1
Frequent contiguous sequential patterns of the data shown in Fig. 1.

ID   Sequences       Contiguous patterns   Support
1    ⟨A, B, C, D⟩    ⟨A, B, C, D⟩          3
2    ⟨A, B, C, D⟩
3    ⟨A, B, C, D⟩
4    ⟨A, B, C, E⟩    ⟨A, B, C, E⟩          2
5    ⟨A, B, C, E⟩
6    ⟨A, B, C, F⟩

2. Literature review

We briefly discuss the main differences between Association Rules Mining (ARM) and SPM, and review some recent approaches. Since our work has a strong basis in both SPM and trajectory data mining, we then review relevant works from both of these fields.

2.1. Association rules mining

ARM finds co-occurrence relationships in transactional databases, and it has been extended to several spatial datasets [9–13]. One main issue with ARM is that it does not consider the order of data items, and thus it is unable to find sequential patterns. In some spatial settings, such orderings are of importance. SPM, in contrast, considers the order of data items to find sequential patterns. Some studies [9,10] have been proposed with spatial datasets to improve traditional ARM by reducing the number of uninteresting and repeating patterns. They incorporate dependency relationships and semantics into the mining process to prune repeating patterns. However, these approaches are not designed to consider the sequence of data, and are thus unable to detect sequential patterns from sequence-natured trajectory datasets. There have been some studies with vehicle movement trajectories [14–17]. In [14,15], the authors analysed train movement situations in order to optimise early warnings and accident management, whilst studies [16,17] mined GPS trajectories to find similar user behaviours or traffic information.

2.2. Sequential pattern mining

A number of SPM approaches exist in the literature [18–23]; however, none of them is particularly well suited for mining extremely long and homogeneous vehicle trajectory sequences [8]. Additionally, vehicle trajectories move along a constrained road network, moving from one road node to the next, whereas traditional SPM algorithms such as those aforementioned are not required to find sequential patterns consisting of adjacent items in the underlying sequence database. This constraint of mining patterns that consist of contiguous items in the underlying sequence database is fairly common, particularly in domains like biology [24] and geography [4] where patterns consisting of nearby items are more meaningful. Accordingly, there exist various approaches, such as [7,25–28], that constrain the SPM process so that only items within a user-specified max-gap of each other are considered. For example, in our domain, where we assume that vehicles always move from one road node to the next, the max-gap parameter would be set to one, which results in all discovered patterns being contiguous.

In our preliminary investigation, CC-SPAN [7], a closed contiguous (max-gap = 1) SPM algorithm, was the fastest contiguous SPM approach we found for mining large sequence databases. CC-SPAN works by splitting each sequence in the database into single items that are iteratively grown one item at a time according to the underlying adjacent items in the sequence database. These grown candidate sequences are pruned using three techniques, which its authors call checked snippet pruning, pre-post-subsequence pruning, and support pruning. Due to CC-SPAN's efficiency, we use it throughout this work as a benchmark and basis for mining contiguous sequential patterns (specifically, in Sections 4 and 5). Despite CC-SPAN's efficiency, it and all existing gap-constrained SPM approaches only mine a set of all closed or max sequential patterns.
Recall that we highlighted in Section 1 that even max contiguous sequential mining approaches (i.e. the least redundant of all the concise approaches) are subject to producing highly redundant pattern outputs that are unsuitable for vehicular trajectory data mining. There are, however, some existing SPM approaches that do try to compress and reduce the redundancy of the pattern output. For example, GoKrimp [29] uses the Minimum Description Length (MDL) principle to mine a set of sequential patterns that reasonably compresses a given sequence database. Whilst this compression-based sequential patterns concept is somewhat similar to our notion of distinct sequential patterns, GoKrimp does not support contiguous SPM. Therefore, we do not consider it appropriate for our task of mining vehicle trajectories. Overall, it seems apparent that there exists no known approach that is suitable to mine a less redundant set of contiguous sequential patterns from large vehicle trajectories.

2.3. Sequential pattern mining of trajectory data

In general, there are quite a number of trajectory data mining approaches; however, in this subsection we focus on just those that use SPM techniques to extract patterns. Despite trajectories being sequential in nature, they do not easily conform to the required format for SPM. The reason is that SPM requires the input sequences to consist of discrete items (i.e. integer or character ids), such as products, websites, proteins, and so on, whereas trajectories are often far from discrete: they commonly consist of geographic coordinates that are spatially noisy and uncertain [30,31]. Therefore, the first task towards SPM of trajectories is to transform each trajectory entry into an identifiable discrete item.


The first group of trajectory-based SPM approaches we consider are those that discretise trajectory entries by discovering natural features or clusters within the dataset. For example, in [32] the authors convert raw geographic trajectories into discrete items by using line simplification to find key segments within their dataset. Then, each trajectory entry within a user-specified distance of these key segments is associated accordingly. Once all the trajectory entries are associated with segments, the set of all sequential patterns is mined. Another such approach is presented in [33], where the authors propose partitioning the trajectory study region into a spatial grid of uniform cells and then counting how many trajectories pass through these grid cells. These grid cells are then iteratively expanded to include their neighbours as long as a minimum overall trajectory count is maintained across the cells. These expanded groups of cells are called Regions-of-Interest (RoIs). The original trajectory dataset is then transformed into a series of discrete RoI visitations and from there the set of sequential patterns is mined.

An alternate approach to discretising the trajectory entries into regions is to constrain them to an underlying network. For example, in [6] the authors constrain vehicle trajectories to the relevant underlying road network using map-matching. By matching each trajectory entry with a likely node from the road network, the inherent spatial uncertainty of the GPS recordings is removed and the dataset is discretised. With a discretised trajectory dataset, they then mine some relevant sequential patterns to produce likely candidates for travel time estimations.

Overall, mining sequential patterns from trajectory datasets is definitely possible and beneficial; however, none of the literature we reviewed considers the specific challenges of mining an overall view of vehicle patterns using SPM. That is, even after some form of discretisation, the long length of the sequences will result in a huge number of sequential patterns being discovered (i.e. too many patterns to meaningfully interpret). Additionally, as we are interested in mining a set of patterns that describes the overall trends of vehicles, redundant patterns are not desirable. Yet, existing studies that use SPM on trajectories only focus on mining the set of all sequential patterns. Finally, existing trajectory SPM approaches focus on mining unconstrained (i.e. not necessarily contiguous) sequential patterns, which in our context of vehicle trajectories translates into discovering patterns where spatial jumps between distant places are legal. For example, such approaches may find a pattern such as home → shops. This kind of high-level pattern tells us nothing of the actual roads the vehicles travelled along, which would be extremely valuable in domains such as urban planning, route planning, and traffic management [34,35]. Overall, in this study we are interested in mining the specific, detailed patterns that groups of vehicles have taken, which is why, unlike previous studies, we focus on mining a set of distinct contiguous sequential patterns.

3. Problem statement

Given a highly repetitious database of vehicle sequences and their underlying road network, we wish to find the most travelled segments of the underlying road network. Using existing data mining approaches, we can easily find the set of closed-contiguous or max-contiguous sequential patterns that exist at some user-specified minimum support. However, because real-world vehicle sequence databases are so large, with many shared roads that branch off, the output contains too many overlapping paths to reasonably infer which specific road segments are of particular interest.
Our approach to this problem is to mine the set of closed-contiguous or max-contiguous sequential patterns and then prune them using a user-specified maximum


redundancy parameter. In our context of vehicle sequences, any redundancy in the pattern output directly relates to the same road segments being visualised repeatedly. Thus, by controlling the redundancy of the output we can produce a set of patterns (i.e. road segments) that only overlaps as much as the user allows, and therefore can be visualised and interpreted much more effectively. In the remainder of this section, we introduce some key concepts used throughout this work. Then we formally define the problem of mining distinct contiguous sequential patterns.

3.1. Preliminary concepts

3.1.1. Existing SPM concepts

Definition 1 (Items). Let $I = \{a_1, a_2, \ldots, a_n\}$ be a set of items. An item is represented as an integer or character.

Definition 2 (Sequence). A sequence $S$ is an ordered list of items, $\langle a_i, a_j, a_k, \ldots, a_m \rangle$, where $a_i$ occurs before $a_j$, which occurs before $a_k$, and so on.

This broad definition of a sequence means that many different types of data are candidates for SPM. Some examples within the field include retail transactions [18], nucleic acid sequences [24], and in our case, vehicle trajectories. Readers familiar with SPM may note that a sequence is often defined as a list of item sets. This is useful for some datasets where multiple items can occur in a sequence simultaneously. For example, in retail transactions a customer can buy {bread, milk, apple} in one transaction, then later, {juice, chocolate}. However, in our context of mining vehicle trajectories, we define sequences as ordered lists of single items since no vehicle can be in two places at once.

Definition 3 (Sequence Containment). A sequence $S_a = \langle a_1, a_2, \ldots, a_n \rangle$ is said to be contained in a sequence $S_b = \langle b_1, b_2, \ldots, b_m \rangle$ iff there exist integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $a_1 = b_{i_1}, a_2 = b_{i_2}, \ldots, a_n = b_{i_n}$ (denoted as $S_a \sqsubseteq S_b$). Additionally, we highlight that if $S_a$ is contained in $S_b$, then we can refer to $S_a$ as a sub-sequence of $S_b$, and by extension, we can refer to $S_b$ as a super-sequence of $S_a$.

Definition 4 (Sequence Database). A sequence database is a list of sequences, $SDB = \langle S_1, S_2, \ldots, S_n \rangle$. Typically a sequence database is a plain-text file where there is one sequence per line and each item in the sequence is consistently delimited.

Definition 5 (Sequence Support). Given a sequence $S_a$, its support is the number of sequences in a sequence database $SDB$ that contain $S_a$. Finding the support of a sequence $S_a$ is denoted as $sup(S_a, SDB)$, and for a sequential pattern that has its support stored it is denoted $sup(S_a)$. Support is typically used to find frequently occurring sequences within the sequence database, or in other words, frequent sequential patterns.

Definition 6 (Sequential Pattern). Given a user-specified minimum support threshold minSup and a sequence database $SDB$, a sequence $S_a$ is considered a sequential pattern if $sup(S_a, SDB) \ge minSup$. A sequential pattern is usually output as a sequence with its support value, like so: {a, b, c} [SUP:10]. A sequential pattern represents a frequently occurring trend within a sequence


database. Automatically identifying such trends through SPM is useful because it can lead to knowledge discoveries which would be extremely time consuming and tedious for a human to identify manually. In large sequence databases that contain many long sequences, it is common to uncover a truly massive number of sequential patterns. This is because a large sequential pattern contains a combinatorial number of smaller sub-patterns. The example given by [36] clearly illustrates the problem with discovering all sequential patterns. Consider a sequential pattern of length 100, $\{a_1, a_2, \ldots, a_{100}\}$; it contains $\binom{100}{1} = 100$ length-1 sub-patterns: $\{a_1\}, \{a_2\}, \ldots, \{a_{100}\}$; $\binom{100}{2}$ length-2 sub-patterns: $\{a_1, a_2\}, \{a_1, a_3\}, \ldots, \{a_{99}, a_{100}\}$; and so on. The total number of sub-patterns the length-100 sequential pattern contains is:

$$\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}. \tag{1}$$
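The count in Eq. (1) can be checked numerically. The short Python sketch below is an illustration added for the reader, not part of the original work; the helper name `count_subpatterns` is hypothetical, and the items are assumed to be distinct. It sums the binomial coefficients directly and cross-checks the closed form by enumerating the sub-patterns of a small pattern.

```python
from itertools import combinations
from math import comb


def count_subpatterns(n: int) -> int:
    """Number of non-empty sub-patterns of a length-n pattern with distinct items."""
    return sum(comb(n, k) for k in range(1, n + 1))


# Brute-force cross-check on a small pattern: enumerate every non-empty sub-pattern.
small = ("a1", "a2", "a3", "a4")
enumerated = {c for k in range(1, len(small) + 1) for c in combinations(small, k)}
assert len(enumerated) == count_subpatterns(4) == 2**4 - 1

total = count_subpatterns(100)
assert total == 2**100 - 1
print(f"{total:.2e}")  # 1.27e+30
```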

This is a truly massive number of sequential patterns, far too many to compute, let alone meaningfully interpret. One common solution that a user may employ to reduce the pattern output is to increase the minSup parameter. This will require candidate sequences to occur in more sequences from the database in order to become patterns. Increasing minSup does, however, increase the likelihood that important sequential patterns will be discarded as false negatives. Another approach for reducing the pattern output is to use a mining approach that mines a concise representation of the sequential patterns. "A concise representation is a subset of all sequential patterns that is meaningful and summarises the whole set of sequential patterns" [37]. Concise representations come in two varieties, lossless and lossy. The pattern output is lossless if the set of all sequential patterns (with their support scores) can be recovered without scanning the sequence database, whereas it is lossy if the set of all sequential patterns cannot be recovered without scanning the database. Two common concise representations are closed patterns and max patterns.

Definition 7 (Closed Pattern). Given a set of all sequential patterns $AS$, a sequential pattern $S_a$ is closed iff $S_a \in AS \wedge \nexists S_b \in AS$ such that $S_a \sqsubset S_b \wedge sup(S_a) = sup(S_b)$. The set of all closed patterns is denoted $CS$ and $CS \subseteq AS$.

Closed patterns considerably reduce the pattern output by ensuring that for every sequential pattern in the output there exists no sub-pattern in the output with the same support. Additionally, because of this rule closed patterns are lossless [37].

Definition 8 (Max Pattern). Given a set of all sequential patterns $AS$, a sequential pattern $S_a$ is maximal iff $S_a \in AS \wedge \nexists S_b \in AS$ such that $S_a \sqsubset S_b$. The set of all max patterns is denoted $MS$ and $MS \subseteq CS \subseteq AS$.

The set of all max patterns is generally even more concise than the set of all closed patterns. This is because max patterns discard many redundant sequential patterns by ensuring that no pattern in the output is a sub-pattern of any other. The trade-off for reducing redundancy in this way is that max patterns are lossy [37].

Definition 9 (Contiguously Contained). A sequence $S_a = \langle a_1, a_2, \ldots, a_n \rangle$ is said to be contiguously contained in a sequence $S_b = \langle b_1, b_2, \ldots, b_m \rangle$ iff there exists an integer $i \ge 1$ with $i + n - 1 \le m$ such that $a_1 = b_i, a_2 = b_{i+1}, a_3 = b_{i+2}, \ldots$, and $a_n = b_{i+n-1}$.

Contiguous containment is an additional constraint that is applied to SPM to reduce the pattern output. It is also used to discover specific patterns that appear in contiguous

blocks in the underlying sequence database. Sequential patterns that are discovered using the contiguous constraint are called contiguous sequential patterns. Please note that, for brevity, we do not define the contiguous version of each of the different types of sequential patterns (Definitions 6, 7, and 8). Instead, when we refer to the contiguous versions of these patterns, we ask the reader to keep in mind a small change to Definition 5 (calculating the support of a sequence): replace "contained" (Definition 3) with "contiguously contained" (Definition 9). With this change, the definitions for the different types of sequential patterns (Definitions 6, 7, and 8) all hold and now define their respective contiguous versions.

3.1.2. Distinct and contiguous sequential concepts

Definition 10 (Pair). Given a sequence $S_a = \langle a_1, a_2, \ldots, a_n \rangle$, a pair $p$ is a tuple of any two adjacent items in $S_a$. That is, $p = \langle a_i, a_{i+1} \rangle$ for $1 \le i < n$.

A pair is the next unit up from an item; however, unlike an item, a pair conveys the underlying sequential nature of the data. In other words, a pair represents sequential information whereas an item cannot, which is of importance in spatio-temporal trajectory data mining [30].

Definition 11 (Pair Set). Given a sequence database $SDB = \{S_1, S_2, \ldots, S_m\}$, a pair set $PS$ is the set of all possible pairs that occur within it. That is, $PS(SDB) = \{p \mid p \text{ is a pair in } S_k \text{ for some } S_k \in SDB\}$, whilst $PS(S_k)$ is the set of pairs in $S_k$.

In practice, the pair set of a trajectory is the set of all pairs of consecutive recordings, each representing two adjacent movement points. As we can see from Fig. 1 and Table 1, $PS(s_1) = \{AB, BC, CD\}$ whilst $PS(SDB) = \{AB, BC, CD, CE, CF\}$. That is, $AB$ is a pair in $PS(s_1)$. Additionally, the number of pairs within the pair set is $|PS(s_1)| = 3$, whilst $|PS(SDB)| = 5$.

Definition 12 (Cover Map). Given a set of all sequential patterns $AS$, a cover map $CM$ is a key–value map where each key is a pair in $PS(AS)$ and each associated value is the frequency with which the pair occurs in $AS$. Note that the cover map function is denoted as $CM(AS)$.

We call the frequency associated with each pair the cover of the pair. This is because it represents how much of the sequential patterns is covered by that particular pair. Additionally, please note that in practice this and all the other maps we define in this section are implemented as hash-maps so that they have O(1) lookup and all the standard map operations, such as get(p), contains(p), put(p, frequency), and remove(p) for a pair $p$.

Definition 13 (Sequence Cover). Given a cover map $CM(AS)$ of a set of sequential patterns $AS$ and the pair set of a sequence $PS(S) = \{p_1, p_2, \ldots, p_n\}$, the cover of the sequence is $\sum_{i=1}^{n} CM.get(p_i)$, where $get(p_i)$ returns the frequency of $p_i$ in $AS$. Computing the cover of a sequence is denoted as $cover(S, CM)$, and for a sequential pattern that already has its cover stored it is denoted as $cover(S)$.

By breaking a given sequence into its pair set, we can determine how much of the sequential pattern database $AS$ is covered by that sequence. The support value computed by existing approaches does not provide this kind of representative information. We provide an illustrative example to explain the extra information provided by computing the sequence cover. Consider two patterns found by a traditional SPM approach, {a, b, c} [SUP:10] and {b, a, c} [SUP:10]. With only the support information available, we are left to assume these two patterns


represent equal portions of the underlying sequence database. However, if we compute the cover for these two sequential patterns the result may become {a, b, c} [SUP:10 COVER:50] and {b, a, c} [SUP:10 COVER:150]. Although both of these patterns are found in 10 sequences, the cover reveals that the second pattern has sub-sequences that occur in three times more of the sequential pattern database. In other words, the cover provides us with a quantitative metric to judge how representative a sequence is with regard to the underlying sequential pattern database, $AS$. In practice, sequence cover is a quantitative measure of sequence repetition: the higher the sequence cover, the higher the repetition. For instance, $cover(s_1) = CM.get(AB) + CM.get(BC) + CM.get(CD) = 5 + 5 + 3 = 13$ whilst $cover(s_4) = CM.get(AB) + CM.get(BC) + CM.get(CE) = 5 + 5 + 2 = 12$ for the example shown in Fig. 1 and Table 1. That is, $s_1$ has a higher sequence cover value than $s_4$; namely, $s_1$ is more repetitive than $s_4$.

Definition 14 (Sequence Map). Given a sequence database $SDB = \{S_1, S_2, \ldots, S_m\}$ and a set of all sequential patterns $AS$, a sequence map $SM(S_k)$ for $S_k \in AS$ is a key–value map where each key is the first unique id of $S_k$ in $SDB$ and each value is a sequence in $AS$. Creating a sequence map is denoted by $SM(AS)$.

The sequence map is used for fast lookup of sequences (or sequential patterns) by their id. For the example shown in Fig. 1 and Table 1, $SM(s_1) = id : 1$ whilst $SM(s_4) = id : 4$.

Definition 15 (Pair to Sequence Ids Map). Given a sequence map $SM(S)$ of a sequence $S$, a pair $p \in PS(S)$ to sequence ids map $P2SID(p)$ is a key–value map where each key is a pair $p$ and each associated value is a set of sequence ids indicating which sequences contain that pair. The notation for creating a pair to sequence ids map is $P2SID(SM)$, whilst $P2SID(p)$ returns the ids of sequential patterns.
The pair to sequence ids map is a data structure we use to track which pairs have appeared in the output of our distinct SPM. Once a pair appears in the pattern output, it is removed from the map to indicate it has been marked as redundant. Controlling the pair redundancy within the pattern output is one of the unique features of our algorithm. In practice, $P2SID$ of a pair $AB$ returns a set of sequence ids that represents the degree of repetitiveness of the pair in contiguous patterns. For instance, $P2SID(AB) = \{1, 4\}$, $P2SID(BC) = \{1, 4\}$, $P2SID(CD) = \{1\}$, and $P2SID(CE) = \{4\}$ for the example shown in Fig. 1 and Table 1. This means $AB$ and $BC$ are more repetitive than the pairs $CD$ and $CE$ in contiguous pattern outputs.

Definition 16 (Sequence Redundancy). Given a pair $p$ to sequence ids map $P2SID(p)$ and the pair set of a sequence $S$, $PS(S) = \{p_1, p_2, \ldots, p_n\}$, the redundancy of the sequence $S$, $redund(S)$, is computed as:

$$redund(S) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 0, & \text{if } P2SID(p_i).contains(), \\ 1, & \text{otherwise.} \end{cases}$$

The $P2SID(p).contains()$ function checks if $p$ is in each sequential pattern in a given set of sequential patterns. Computing the redundancy of a sequence is denoted as $redund(S, P2SID)$. For the example shown in Fig. 1 and Table 1, $redund(s_1) = (redund(AB) + redund(BC) + redund(CD)) / |PS(s_1)| = (1 + 1 + 0)/3 = 2/3$, whilst $redund(s_4) = (redund(AB) + redund(BC) + redund(CE)) / |PS(s_4)| = (1 + 1 + 0)/3 = 2/3$. We highlight that the redundancy of all sub-sequences and sequential patterns found in a sequential pattern database will initially be zero because $P2SID$ contains every relevant pair to begin with. However, as pairs are removed from the map, sequences that contain those pairs will have their redundancy increased.
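To make Definitions 9 to 16 concrete, the following Python sketch reproduces the Fig. 1 and Table 1 numbers. It is an illustrative reading of the definitions, not the authors' implementation, and it rests on several assumptions: single-character items are encoded as Python strings, the max contiguous patterns are mined by brute force, each pair's cover is weighted by the support of the patterns containing it (which is what yields the values 5, 5, 3 and 2 used in the worked example), and a pair counts as redundant once it has been removed from the pair to sequence ids map. All function names are hypothetical.

```python
from collections import Counter

# The six trajectories of Fig. 1, keyed by sequence id. Single-character
# items let a Python string stand in for a sequence (an assumption).
SDB = {1: "ABCD", 2: "ABCD", 3: "ABCD", 4: "ABCE", 5: "ABCE", 6: "ABCF"}


def contiguous_blocks(seq):
    """Every non-empty contiguous sub-sequence of seq (Definition 9)."""
    return {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}


def mine_max_contiguous(sdb, min_sup):
    """Brute-force max contiguous patterns as {pattern: support}."""
    support = Counter()
    for seq in sdb.values():
        support.update(contiguous_blocks(seq))  # one count per containing sequence
    frequent = {p: s for p, s in support.items() if s >= min_sup}
    # Keep only patterns that are not a contiguous block of another frequent pattern.
    return {p: s for p, s in frequent.items()
            if not any(p != q and p in q for q in frequent)}


def pair_set(seq):
    """Adjacent item pairs of a sequence (Definitions 10 and 11)."""
    return {seq[i:i + 2] for i in range(len(seq) - 1)}


def cover_map(patterns):
    """Pair -> frequency (Definition 12), weighting each pattern by its support."""
    cm = Counter()
    for pattern, sup in patterns.items():
        for pair in pair_set(pattern):
            cm[pair] += sup
    return cm


def cover(seq, cm):
    """Sequence cover (Definition 13): sum of the covers of the sequence's pairs."""
    return sum(cm[pair] for pair in pair_set(seq))


def sequence_map(patterns, sdb):
    """First sequence id -> pattern (Definition 14)."""
    return {min(sid for sid, seq in sdb.items() if p in seq): p for p in patterns}


def pair_to_ids(seq_map):
    """Pair -> ids of the patterns containing it (Definition 15)."""
    p2sid = {}
    for sid, pattern in seq_map.items():
        for pair in pair_set(pattern):
            p2sid.setdefault(pair, set()).add(sid)
    return p2sid


def redund(seq, p2sid):
    """Fraction of the sequence's pairs no longer present in P2SID (Definition 16)."""
    pairs = pair_set(seq)  # assumes len(seq) >= 2, so pairs is non-empty
    return sum(pair not in p2sid for pair in pairs) / len(pairs)


patterns = mine_max_contiguous(SDB, min_sup=2)   # ABCD (support 3), ABCE (support 2)
cm = cover_map(patterns)
p2sid = pair_to_ids(sequence_map(patterns, SDB))
print(cover("ABCD", cm), cover("ABCE", cm))      # 13 12
print(redund("ABCE", p2sid))                     # 0.0 while every pair is still in the map
for pair in pair_set("ABCD"):                    # mark the pairs of ABCD as already output
    del p2sid[pair]
print(redund("ABCE", p2sid))                     # 0.666..., i.e. 2/3
```

Plain dictionaries and sets are used here because they give the O(1) get, contains, put and remove operations that the text asks of its hash-maps.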


Definition 17 (Distinct Pattern). Given the set of all sequential patterns for a database $AS$, a user-specified maximum redundancy threshold maxRedund, and a pair $p$ to sequence ids map $P2SID(p)$, a sequential pattern $S_a$ is distinct iff $S_a \in AS \wedge redund(S_a) \le maxRedund \wedge \nexists S_b \in AS$ such that $cover(S_b) > cover(S_a)$. Note that the set of all distinct patterns is denoted $DS$.

This definition implies that there can only be one distinct pattern in the set of all sequential patterns; however, as we explain in more detail in Section 4, once a distinct pattern has been found it is removed from $AS$ and its relevant pairs are also removed from $P2SID$. This allows us to find a set of distinct patterns, each with maximal cover w.r.t. the already found distinct patterns. Additionally, by repeatedly finding the distinct pattern with the maximum cover we produce a concise, redundancy-controlled set of sequential patterns that is representative of as large a portion of the underlying sequential database as possible. For the example shown in Fig. 1 and Table 1, the two sequential patterns $s_1$ and $s_4$ are distinct patterns when maxRedund = 70%. In practice, these distinct patterns are unique sequential patterns that minimise redundancies.

3.2. Problem definition

Based on the definitions in Section 3.1, we now introduce the problem that we aim to solve in this work.

Definition 18 (Distinct Contiguous SPM). Given a sequence database $SDB$, a user-specified minimum support threshold minSup, and a user-specified maximum redundancy maxRedund, distinct contiguous SPM is to find the set of all distinct contiguous sequential patterns.

4. Methodology

In Fig. 2, we present our overall framework for mining distinct contiguous sequential patterns from vehicle trajectories. Briefly, each stage of our framework in Fig. 2 is as follows:

1. Raw vehicle trajectories. The purpose of our framework is to extract a set of patterns that represents frequent routes that vehicles have taken within their relevant road networks. The vehicle trajectories we consider in this work are all recorded using GPS and are stored in plain-text files as sequences of timestamped geographic coordinates. More details on the specific datasets we used are provided in Section 5.1.

2. Road network. In order to mine the vehicle trajectories using SPM, we assume that the vehicles are constrained to travelling along the relevant road network in the study region. We obtained the relevant road networks for each of our datasets from MapZen’s Open Street Maps metro extracts repository (see footnote 1).

3. Map-matching. It is well established that GPS recordings can be noisy and inaccurate. Thus, it is not an easy task to match each recording in the vehicle’s trajectory with the correct road segment from the underlying road network. For this task of map-matching the vehicle trajectories we used the HMM-based approach proposed in [38]. For our purposes, we find that it yields logical matches for all of our datasets.

1 https://goo.gl/c0qrIS.

4. Road node visitation sequences. Using the trajectories and relevant road network as input, the map-matching produces a plain-text file containing the sequence of road nodes that each trajectory visited. This is our sequence database and we use it for SPM in the following stage of our framework.
5. Contiguous SPM. The sequence database produced in the previous stage is now mined for contiguous sequential patterns. However, as discussed in Section 1, the set of patterns produced is often quite redundant, and in the case of vehicle trajectories overlapping patterns essentially show the same section of road being visited with minor detours. In order to refine these redundant patterns later on, we store them as a sequence database.
6. Contiguous sequential patterns. This is a plain-text database of contiguous sequential patterns mined in the previous stage.
7. Distinct SPM. We run our algorithm on the sequence database of contiguous sequential patterns and refine it down to a set of patterns that does not surpass a user-specified maximum redundancy parameter. This so-called distinct set of patterns is once again stored as a sequence database.
8. Distinct contiguous sequential patterns. This is the output of our algorithm and we use it in our experiments to measure pattern output succinctness, compression, and redundancy. This output is also what we use to interpret the vehicle patterns we uncover, as it is much more succinct than the output from Step 5. This increased succinctness in turn allows us to visualise the vehicular patterns we uncovered, as we demonstrate in Section 5.6.

Please note that Step 5 corresponds to the state-of-the-art algorithm whilst Step 7 corresponds to our proposed DC-SPAN.

Fig. 2. Our framework for mining distinct contiguous sequential patterns from vehicle trajectories.
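To make the contiguous SPM stage of the framework concrete, the following Python sketch enumerates every contiguous sub-sequence of a small sequence database and keeps those whose support reaches a threshold. It is a brute-force stand-in for illustration only and is not CC-SPAN or any of its variants; the function name `mine_contiguous`, the use of an absolute support count for `min_sup`, and the toy road-node labels are all assumptions.

```python
from collections import defaultdict


def mine_contiguous(sdb, min_sup):
    """Return {pattern: support} for every contiguous sub-sequence whose
    support (number of sequences that contain it) is at least min_sup."""
    containing = defaultdict(set)  # pattern -> ids of sequences containing it
    for sid, seq in enumerate(sdb):
        n = len(seq)
        for start in range(n):
            for end in range(start + 1, n + 1):
                containing[tuple(seq[start:end])].add(sid)
    return {p: len(ids) for p, ids in containing.items() if len(ids) >= min_sup}


# Toy sequence database of road node visitation sequences (hypothetical data).
sdb = [list("ABCD"), list("ABCE"), list("BCD")]
patterns = mine_contiguous(sdb, min_sup=2)
for pattern, support in sorted(patterns.items()):
    print("".join(pattern), support)
```

Even on this toy input the output contains the overlapping patterns AB, BC, and ABC, which is the kind of redundancy that the distinct SPM stage is intended to control. The quadratic enumeration per sequence is only workable for very short sequences.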

The main contribution of our work is DC-SPAN: our algorithm for discovering distinct contiguous sequential patterns. We present the details of DC-SPAN in Algorithm 1. Note that DC-SPAN does not compute contiguous sequential patterns itself and instead refines the output of an existing contiguous SPM algorithm. Thus, DC-SPAN inherits the performance characteristics and bottlenecks of whichever algorithm is chosen. For example, in our implementation, the function call to MineACSP(SDB, minSup) on Line 2 is replaced with a call to a modified CC-SPAN [7] algorithm that mines the set of all contiguous sequential patterns. Our explanation of this choice and the resulting performance characteristics are reported in Section 5. Despite our implementation choices, we highlight that in practice there is no explicit need to compute the set of all contiguous patterns inside the algorithm; if desired, they can be precomputed by other means and passed in as a parameter.

First, DC-SPAN calls an SPM function in Line 2 to generate the set of all contiguous sequential patterns for a given dataset. After obtaining this set, Lines 3–6 initialise the set of distinct patterns and the data structures (CM, SM and P2SID) that are used to refine the patterns down to the set of so-called distinct patterns (see Definition 17). DC-SPAN then computes the cover of each sequence in Lines 7 and 8, before it processes SM in the main loop between Lines 10 and 25. Inside the main loop, the pattern with the greatest cover is found in Line 11, removed from SM in Line 12, and saved to the set of distinct patterns in Line 13 (see Definition 13 for an explanation of sequence/pattern cover). Next, Lines 14–19 use P2SID to retrieve the ids of every pattern that shares a pair with the distinct pattern (Line 17) and remove those pairs from the pair to sequence ids map (Line 18), effectively marking them redundant because they now appear in the distinct pattern output.
Finally, Lines 20–24 remove every affected pattern in the set of all contiguous sequential patterns that now contains too many redundant pairs, in order to ensure the user-specified maximum redundancy is not surpassed. This process repeats until there are no remaining patterns to refine. Note that the source code of DC-SPAN is publicly available.²

² https://github.com/lukehb/137-SPM.

5. Results and discussion

In order to gauge the efficiency and effectiveness of DC-SPAN at mining distinct contiguous sequential patterns from large vehicle trajectories, we conducted experiments measuring running time, compression, distinctness, and redundancy. Where appropriate, we compared DC-SPAN against other contiguous SPM algorithms that mine all, closed, and max patterns. One problem we faced is that we had planned to use CM-SPAM [23] to mine the set of all contiguous patterns and VMSP [39] to mine the set of all max contiguous patterns. However, both of these algorithms ran out of memory when mining the large trajectory datasets and thus turned out to be unsuitable for this study. As discussed earlier, we mine the set of all closed contiguous sequential patterns using CC-SPAN [7]. Using CC-SPAN as a base, we slightly modified the source code to produce two algorithms: one to mine the set of all contiguous sequential patterns (AC-SPAN), and the other to mine the set of all max contiguous sequential patterns (MC-SPAN). Thus, for most of our experiments we compared the output produced by DC-SPAN against AC-SPAN, CC-SPAN, and MC-SPAN. Additionally, we highlight that all of our experiments were run on a machine with an i5-520M processor and 5 GB of unallocated memory. Furthermore, all of the algorithms were implemented in Java and an adequate JVM warm-up was used before all experiments.

Algorithm 1 DC-SPAN algorithm.
Input: (1) SDB, a sequence database; (2) maxRedund, the maximum allowed redundancy; (3) minSup, the minimum allowed support;
Output: DS, the set of all distinct contiguous sequential patterns;
 1: function DCSPAN(SDB, maxRedund, minSup)
        // Use a relevant contiguous SPM algorithm.
 2:     Assign AS to MineACSP(SDB, minSup);
 3:     Assign DS to ∅;
        // Make a cover map (Definition 12).
 4:     Compute CM(AS);
        // Make a sequence map of patterns (Definition 14).
 5:     Compute SM(AS);
        // Make a pair to sequence id map (Definition 15).
 6:     Compute P2SID(SM);
        // Find the cover of each sequence.
 7:     for (Sequential pattern S in SM) do
 8:         Assign S.cover to cover(S, CM);
 9:     end for
10:     while SM is not empty do
            // Find pattern with max cover.
11:         Assign Smax to argmax_{S ∈ SM} cover(S);
12:         Remove Smax from SM;
            // Write pattern into memory or disk.
13:         Save Smax to DS;
            // Remove the relevant pairs.
14:         Assign PSmax to PS(Smax);
15:         Assign IDs to ∅;
16:         for (Each pair p in PSmax) do
17:             Add all ids from P2SID(p).get() to IDs;
18:             Remove p from P2SID;
19:         end for
            // Remove patterns that have become redundant.
20:         for (Sequence id i in IDs) do
21:             if redund(SM(Si).get(), P2SID) ≥ maxRedund then
22:                 Remove i from SM;
23:             end if
24:         end for
25:     end while
26:     return DS;
27: end function
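As an illustrative companion to Algorithm 1, the refinement loop can be sketched in Python. This is a minimal sketch under stated assumptions and not the published Java implementation: the contiguous patterns are passed in precomputed, the cover of a pattern is assumed to be the sum, over its pairs, of the number of patterns containing each pair (a stand-in for Definitions 12 and 13, which are not reproduced here), patterns with no pairs are dropped up front, and ties in cover are broken by the lowest pattern id. The names `pairs_of` and `dc_span_refine` are hypothetical.

```python
def pairs_of(pattern):
    """PS(S): the adjacent item pairs of a pattern, in order."""
    return [(pattern[i], pattern[i + 1]) for i in range(len(pattern) - 1)]


def dc_span_refine(all_patterns, max_redund):
    """Refine precomputed contiguous patterns into a distinct set (sketch)."""
    # SM: pattern id -> pattern. Single-item patterns are dropped here, an
    # assumption based on the remark that they have a pair cover of zero.
    sm = {i: tuple(p) for i, p in enumerate(all_patterns) if len(p) > 1}

    # P2SID: pair -> ids of the patterns that contain the pair.
    p2sid = {}
    for i, p in sm.items():
        for pair in set(pairs_of(p)):
            p2sid.setdefault(pair, set()).add(i)

    # ASSUMED cover: sum over a pattern's pairs of how many patterns hold it.
    # Computed once before the loop, as in Lines 7-8 of Algorithm 1.
    cm = {pair: len(ids) for pair, ids in p2sid.items()}
    cover = {i: sum(cm[pair] for pair in pairs_of(p)) for i, p in sm.items()}

    def redund(p):
        pairs = pairs_of(p)
        return sum(pair not in p2sid for pair in pairs) / len(pairs)

    ds = []
    while sm:
        # Pattern with maximum cover; ties go to the lowest id (assumption).
        best = max(sm, key=lambda i: (cover[i], -i))
        s_max = sm.pop(best)
        ds.append(s_max)
        # Collect affected pattern ids and remove the pattern's pairs.
        ids = set()
        for pair in set(pairs_of(s_max)):
            ids |= p2sid.pop(pair, set())
        # Drop affected patterns, using the same >= test as Line 21.
        for i in ids:
            if i in sm and redund(sm[i]) >= max_redund:
                del sm[i]
    return ds


# Hypothetical two-pattern input reusing the item labels of the s1/s4 example.
patterns = [("A", "B", "C", "D"), ("A", "B", "C", "E")]
print(dc_span_refine(patterns, 0.7))  # both patterns survive (2/3 < 0.7)
print(dc_span_refine(patterns, 0.5))  # only the first pattern survives
```

With this toy input, the second pattern has a redundancy of 2/3 once the first has been selected, so it is kept at a maximum redundancy of 70% and removed at 50%.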

Table 2
Transformed trajectory datasets.

Name             TDrive      Buses     Trucks
Sequences        7,806       782       50
Items            5,400,239   702,384   204,662
Avg seq length   692         898       4,093
Distinct items   50,933      14,329    13,279

5.1. Experiment datasets

Across all of our experiments, we used the same three vehicular trajectory datasets as input. We highlight that prior to experimentation these vehicular trajectory datasets were transformed from a series of geographic coordinates and timestamps into sequences of visited road network nodes (see Section 4 for an explanation of this process). “TDrive” is the first and biggest dataset we used and is publicly available³ from Microsoft Research Asia. The TDrive dataset contains six days’ worth of taxi trajectories in the Beijing area [34,40]. “Buses” is the second dataset we used and is publicly available⁴ from the Dublin city council’s “Insight” project. The Buses dataset contains approximately a month of bus trajectories moving around Dublin. This dataset is massive, so we used just a subset, which alone contains 782 bus trajectories. “Trucks” is the last and smallest dataset we used and is a well-researched trajectory dataset [35,41–44] that is publicly available from the Chorochronos archive.⁵ The Trucks dataset is so called because it contains the trajectories of various cement trucks making daily deliveries around the Athens region. In Table 2 we present some specific information about each of the datasets after they underwent map-matching.

³ https://goo.gl/C8MN3Y.
⁴ https://goo.gl/CekXgX.
⁵ https://goo.gl/ljnLT1.

5.2. Running time

The aim of this experiment is to empirically measure the efficiency of DC-SPAN against existing all (AC-SPAN), closed (CC-SPAN), and max (MC-SPAN) contiguous SPM algorithms when mining large real-world vehicular sequence databases. We present the results of this experiment in Fig. 3. Analysing Fig. 3, we observe that DC-SPAN has a negligible running time compared to the other approaches we measured. However, it would be misleading not to restate that DC-SPAN requires the set of all contiguous sequential patterns to be computed during its routine (i.e. it runs AC-SPAN internally). Therefore, the overall running time for DC-SPAN to produce a result can be thought of as DC-SPAN + AC-SPAN, and it is always tied to whichever algorithm is used to mine the set of all contiguous sequential patterns (AC-SPAN in this case). Additionally, it follows that the overall running time of DC-SPAN is longer than that of mining the set of closed or max contiguous sequential patterns. Another observation we make is that the running time of DC-SPAN appears to be mostly insensitive to changing support levels, which cannot be said for the other algorithms. Overall, these results suggest that mining the set of distinct contiguous sequential patterns imposes little performance overhead if the set of all contiguous sequential patterns is already computed.

Fig. 3. Running time analysis (lower is better).

5.3. Compression

The aim of this experiment was to measure the relative compression achieved by each algorithm’s pattern output. The compression value we computed is the size of the pattern output relative to the set of all contiguous sequential patterns. Specifically, we computed compression in this experiment using Eq. (2).

\[ \text{Compression} = 1 - \frac{I_{XS}}{I_{AS}}, \tag{2} \]

where each term is as follows:

• $I_{XS}$: the total number of items in a given algorithm’s contiguous sequential pattern output;
• $I_{AS}$: the total number of items in the set of all contiguous sequential patterns.

We clarify that computing the compression produced by a specific algorithm at a given minimum support requires computing the set of all contiguous sequential patterns at that same support level. With the preliminaries defined, we present the results from this experiment in Fig. 4.

Fig. 4. The pattern output compression achieved by each algorithm (higher is better). DC-SPAN was tested at varying maximum redundancies.

The results in Fig. 4 indicate that for all the tested datasets DC-SPAN produces a smaller pattern output than the approaches that mined the set of closed or max contiguous sequential patterns. Figs. 4b and 4c indicate DC-SPAN achieves approximately 99% compression for the respective datasets. We
highlight that for these datasets the sets of all closed and max patterns also achieve very high compression, around 98%–99%. This result is explained by the fact that the set of all contiguous patterns for these datasets is massive, containing many redundant sub-patterns. Thus, when these sub-patterns are removed, huge portions of the patterns are compressed away. Because the pattern output is far smaller, the results in Fig. 4a are perhaps more telling of the overall compression abilities of each algorithm. Specifically, Fig. 4a indicates more clearly that DC-SPAN achieves a better overall compression than the other algorithms. Additionally, we also observe that increasing the maximum allowed redundancy of DC-SPAN shifts its compression closer towards that of the set of all max contiguous sequential patterns.

5.4. Distinctness

The aim of this experiment was to measure the percentage of patterns that are distinctively expressed by DC-SPAN. The distinctness of an algorithm at a given support is computed by counting the number of all contiguous sequential patterns that are not contiguously contained (see Definition 9) in the given pattern output. The distinctness of a given pattern output is given by Eq. (3).

\[ \text{Distinctness} = 1 - \frac{\sum_{S \in AS} \begin{cases} 1, & \text{if } S \sqsubseteq XS, \\ 0, & \text{otherwise} \end{cases}}{|AS|} \tag{3} \]

where each term is as follows:

• AS: the set of all contiguous sequential patterns;
• S: a sequential pattern in AS;
• XS: the set of contiguous sequential patterns produced by a given algorithm.

We highlight that this definition of distinctness gives all, closed, and max contiguous SPM algorithms a distinctness of zero because, by their very definitions (see Definitions 7 and 8), every sub-pattern from the set of all contiguous sequential patterns will be contiguously contained in their output. This is not the case for DC-SPAN, which discards sequential patterns based on their redundancy. Therefore, in this experiment we do not test AC-SPAN, CC-SPAN, or MC-SPAN; instead, we only test DC-SPAN at maximum redundancy levels of 0%, 25%, 50%, and 75%. The results of this experiment are provided in Fig. 5.

Fig. 5. The percentage of all sequential patterns that are distinctly expressed by each algorithm (higher is better). DC-SPAN was tested at varying maximum redundancies.

Fig. 5 shows some quite varied results in terms of the distinctness DC-SPAN achieves for each of the three datasets. Specifically, for the TDrive dataset the output gets more distinct as support is increased, whilst for the Buses dataset the distinctness is mostly steady, and for the Trucks dataset the distinctness gradually declines as the support is increased. Our investigation into these results reveals that in each case the result is explained by the homogeneity or heterogeneity of the pattern output produced by mining the set of all contiguous sequential patterns. In this context we say the pattern output is homogeneous if the sequential patterns discovered have a high number of pairs shared among them, and heterogeneous in the opposite case. For example, in the results for the TDrive dataset the set of all contiguous patterns becomes increasingly homogeneous because the total number of patterns discovered is so small. At a relative support of 0.18 (i.e. a minimum absolute support of 1405 sequences) only 48 sequences are found with 4 pairs among them. Many of the sequences consist only of single items and are therefore discarded by DC-SPAN as they have a pair cover of zero. DC-SPAN is designed for mining long sequences, and setting the support close to its maximum for the dataset will produce small patterns that it readily discards.

Additionally, for the Trucks dataset the set of all contiguous sequential patterns is very homogeneous. For example, at a relative support of 0.06 the output contains 1,008,755 sequences with a total of 58,003,641 items. This indicates that a huge number of sub-patterns makes up the pattern output, meaning it is very homogeneous. With so many patterns sharing pairs, huge chunks of the output become redundant as the distinct patterns are mined. Thus, in this case, the main reason distinctness decreases as support is raised is simply that the set of all sequential patterns is pruned and becomes more heterogeneous.

Finally, for the Buses dataset the distinctness scores are fairly consistent across varying support levels. Once again, our investigation into the results revealed that the pattern output for the Buses dataset is the most heterogeneous of the three datasets we tested. Specifically, we found that the number of distinct pairs in the pattern output was very close to the total number of pairs. This means that most of the patterns in the output already have quite a low number of shared pairs, and therefore mining the set of distinct contiguous patterns causes fewer patterns to become redundant. Additionally, from Fig. 5 we observe that, for all of the datasets we tested, an increase in the maximum redundancy parameter correlates with a decrease in distinctness. In other words, the higher the maximum redundancy parameter, the closer the output becomes to the set of all max contiguous patterns.
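The compression and distinctness measures of Eqs. (2) and (3) can be illustrated with a short Python sketch. It assumes that patterns are tuples of items and that S ⊑ XS means S occurs as a contiguous run inside at least one pattern of XS; this reading stands in for Definition 9, which is not reproduced here. The function names and the toy pattern sets are hypothetical.

```python
def item_count(patterns):
    """Total number of items across a set of patterns (I_XS or I_AS)."""
    return sum(len(p) for p in patterns)


def compression(xs, all_patterns):
    """Eq. (2): 1 - I_XS / I_AS."""
    return 1 - item_count(xs) / item_count(all_patterns)


def contiguously_contains(container, pattern):
    """True if `pattern` occurs as a contiguous run inside `container`."""
    n, m = len(container), len(pattern)
    return any(tuple(container[i:i + m]) == tuple(pattern)
               for i in range(n - m + 1))


def distinctness(xs, all_patterns):
    """Eq. (3): share of AS not contiguously contained in any pattern of XS."""
    contained = sum(
        1 for s in all_patterns
        if any(contiguously_contains(x, s) for x in xs)
    )
    return 1 - contained / len(all_patterns)


# Toy example: AS holds four patterns, XS keeps only the longest one.
AS = [("A", "B"), ("B", "C"), ("A", "B", "C"), ("C", "E")]
XS = [("A", "B", "C")]
print(compression(XS, AS))   # 1 - 3/9
print(distinctness(XS, AS))  # CE is the only pattern not contained in XS
```

In this toy case the output keeps 3 of 9 items, and one of the four patterns in AS is not contiguously contained in XS, giving a distinctness of 0.25. Passing AS itself as XS yields a distinctness of zero, in line with the remark above about all, closed, and max pattern outputs.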

5.5. Redundancy

The aim of this experiment was to measure the number of redundant pairs in the pattern output of each algorithm we tested. We ask readers to refer to Definitions 15 and 16 for our explanation of pair redundancy. The redundancy score we compute in this experiment is the number of redundant pairs divided by the total number of pairs, giving us a percentage that describes the overall pattern output redundancy. The results of this experiment are provided in Fig. 6.

Fig. 6. The percentage of redundant pairs produced by each algorithm (lower is better). DC-SPAN was tested at varying maximum redundancies.

From Fig. 6 we observe that DC-SPAN achieves its intended purpose of controlling the redundancy of the pattern output. Specifically, we highlight that when the maximum redundancy is set to 0% a redundancy of 0% is obtained. Furthermore, as the maximum allowed redundancy is increased, DC-SPAN produces pattern outputs with correspondingly more redundancy. Additionally, even at a high maximum redundancy of 75%, DC-SPAN achieves an overall redundancy that is no higher than, and often substantially lower than, that of both the closed and max pattern outputs across all datasets and support levels.

5.6. Visualisation

Vehicle trajectory datasets are often huge and repetitive, making them difficult to inspect visually but ideal to mine. In our experience, using traditional SPM approaches on vehicle trajectory datasets often produces pattern outputs that are still too dense to interpret visually because of the large number of repeated and redundant patterns. However, with our distinct contiguous SPM algorithm the redundancy can be controlled, so visualisation becomes a practical means of interpreting the pattern output. In Fig. 7 we present visualisations for each of the raw trajectory datasets we used in our experiments and an accompanying pattern output mined by DC-SPAN. GPS noise is a well-known difficulty and is frequent in these vehicular datasets, particularly the TDrive and Buses datasets (see Figs. 7a and 7b). Despite this, our combination of map-matching and distinct contiguous SPM has uncovered a succinct and clean set of sequential patterns (i.e. road segments in our context) which map very accurately to the underlying road network topology. These road segments are easy to identify and interpret through visual inspection, which was not possible using the pattern outputs of existing closed contiguous and max contiguous SPM approaches. We argue that these visualisations address the problem we outlined in Section 3.1, namely to allow a human user to easily interpret the visualised road segments and identify the hot-spots within the road network. The visualisations shown in Fig. 7 lead us to conclude that these patterns do indeed align with plausible real-world highways, intersections, and roadways, which supports the validity of our approach in general.

Fig. 7. The raw trajectory datasets (top row) and the distinct contiguous patterns mined from them (bottom row).

6. Conclusion

Although there exist many efficient and effective SPM algorithms, none of them is particularly well suited to mining large vehicle trajectories where the patterns should ideally both be contiguous and have their redundancy controlled. In this work, we have presented our approach, DC-SPAN, to solve this problem. Through experimentation we have shown that DC-SPAN is able to mine distinct, non-redundant, contiguous sequential patterns from large and varied real-world vehicle trajectories with very little additional overhead compared to existing approaches. Additionally, the experimental results revealed that the set of distinct patterns mined by our approach is more succinct than that of traditional approaches. Specifically, the experimental results indicated that there exists a trade-off between increased redundancy and decreased compression and distinctness. The more skewed this trade-off is towards decreased compression and distinctness, the more our algorithm resembles max contiguous SPM. However, this trade-off is mostly moot in practice because realistically we expect the usefulness of the algorithm to become apparent at fairly aggressive maximum redundancy settings, in many cases perhaps using settings of 0% maximum redundancy. In our visualisation experiment we have shown that such settings can reduce real-world cases of large, repetitious vehicle sequences to a set of succinct and easily interpretable road segments. Overall, we conclude that the main usefulness of our algorithm is the tuneable redundancy of the pattern output, which in the case of vehicle sequences directly enables effective visualisation and interpretation of pattern outputs that were previously too large and repetitious. Based on this work, some possible future directions include:

• Increasing efficiency. DC-SPAN mines the pattern output of another algorithm, therefore its running time is inherently tied to the chosen algorithm. In order to break this dependency, we may investigate the mining of distinct contiguous sequential patterns within a single algorithm; • Other domains. In this work we focused solely on mining vehicle trajectories, however, other fields of study such as biology may also find the patterns produced by DC-SPAN interesting; • Big data. Our implementation relies on the sequence databases fitting into machine memory, however, many vehicle trajectory datasets are extremely massive and cannot easily be loaded into memory. Thus, as a possible future direction we may investigate an on-disk or distributed computing modification for DC-SPAN. References [1] Y. Zheng, L. Capra, O. Wolfson, H. Yang, Urban computing: Concepts, methodologies, and applications, ACM Trans. Intell. Syst. Technol. 5 (3) (2014) 38:1–38:55, http://dx.doi.org/10.1145/2629592. [2] Z. Chen, H.T. Shen, X. Zhou, Discovering popular routes from trajectories, in: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, in: ICDE ’11, IEEE Computer Society, Washington, DC, USA, 2011, pp. 900–911, http://dx.doi.org/10.1109/ICDE.2011.5767890. [3] X. Kong, Z. Xu, G. Shen, J. Wang, Q. Yang, B. Zhang, Urban traffic congestion estimation and prediction based on floating car trajectory data, Future Gener. Comput. Syst. 61 (C) (2016) 97–107, http://dx.doi.org/10.1016/j. future.2015.11.013.

[4] S. Atev, G. Miller, N.P. Papanikolopoulos, Clustering of vehicle trajectories, IEEE Trans. Intell. Transp. Syst. 11 (3) (2010) 647–657. [5] Z. Chen, H.T. Shen, X. Zhou, Discovering popular routes from trajectories, in: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, in: ICDE ’11, IEEE Computer Society, Washington, DC, USA, 2011, pp. 900–911. [6] Y. Wang, Y. Zheng, Y. Xue, Travel time estimation of a path using sparse trajectories, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in: KDD ’14, ACM, New York, NY, USA, 2014, pp. 25–34, http://dx.doi.org/10.1145/2623330. 2623656. [7] J. Zhang, Y. Wang, D. Yang, Ccspan: Mining closed contiguous sequential patterns, Knowl.-Based Syst. 89 (2015) 1–13, http://dx.doi.org/10.1016/j. knosys.2015.06.014. [8] C. Yang, G. Gidófalvi, Mining and visual exploration of closed contiguous sequential patterns in trajectories, Int. J. Geogr. Inf. Sci. 32 (7) (2018) 1282–1304. [9] V. Bogorny, B. Kuijpers, L.O. Alvares, Reducing uninteresting spatial association rules in geographic databases using background knowledge: A summary of results, Int. J. Geogr. Inf. Sci. 22 (4) (2008) 361–386, http: //dx.doi.org/10.1080/13658810701412991. [10] V. Bogorny, J.F. Valiati, L.O. Alvares, Semantic-based pruning of redundant and uninteresting frequent geographic patterns, GeoInformatica 14 (2) (2010) 201–220, http://dx.doi.org/10.1007/s10707-009-0082-7. [11] I. Lee, P. Phillips, Urban crime analysis through areal categorized multivariate associations mining, Appl. Artif. Intell. 22 (5) (2008) 483–499, http://dx.doi.org/10.1080/08839510802028496. [12] I. Lee, V. Estivill-Castro, Exploration of massive crime data sets through data mining techniques, Appl. Artif. Intell. 25 (5) (2011) 362–379, http: //dx.doi.org/10.1080/08839514.2011.570153. [13] P. Phillips, I. Lee, Mining co-distribution patterns for large crime datasets, Expert Syst. Appl. 
39 (14) (2012) 11556–11563, http://dx.doi.org/10.1016/ j.eswa.2012.03.071. [14] Y. Zhou, X. Tao, Z. Yu, H. Fujita, Train-movement situation recognition for safety justification using moving-horizon tbm-based multisensor data fusion, Knowl.-Based Syst. 177 (2019) 117–126, http://dx.doi.org/10.1016/ j.knosys.2019.04.010.

Please cite this article as: L. Bermingham and I. Lee, Mining distinct and contiguous sequential patterns from large vehicle trajectories, Knowledge-Based Systems (2019) 105076, https://doi.org/10.1016/j.knosys.2019.105076.

12

L. Bermingham and I. Lee / Knowledge-Based Systems xxx (xxxx) xxx

[15] Y. Zhou, X. Tao, L. Luan, Z. Wang, Safety justification of train movement dynamic processes using evidence theory and reference models, Knowl.Based Syst. 139 (2018) 78–88, http://dx.doi.org/10.1016/j.knosys.2017.10. 012. [16] P. Mazumdar, B.K. Patra, R. Lock, S.B. Korra, An approach to compute user similarity for gps applications, Knowl.-Based Syst. 113 (2016) 125–142, http://dx.doi.org/10.1016/j.knosys.2016.09.017. [17] J. Yu, P. Lu, Learning traffic signal phase and timing information from lowsampling rate taxi gps trajectories, Knowl.-Based Syst. 110 (2016) 275–292, http://dx.doi.org/10.1016/j.knosys.2016.07.036. [18] R. Srikant, R. Agrawal, Mining sequential patterns: Generalizations and performance improvements, in: Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, in: EDBT ’96, Springer-Verlag, London, UK, UK, 1996, pp. 3–17. [19] Z. Yang, Y. Wang, M. Kitsuregawa, Lapin: Effective sequential pattern mining algorithms by last position induction for dense databases, in: Proceedings of the 12th International Conference on Database Systems for Advanced Applications, in: DASFAA’07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 1020–1023. [20] J. Ayres, J. Flannick, J. Gehrke, T. Yiu, Sequential pattern mining using a bitmap representation, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in: KDD ’02, ACM, New York, NY, USA, 2002, pp. 429–435, http://dx.doi.org/ 10.1145/775047.775109. [21] M.J. Zaki, Spade: An efficient algorithm for mining frequent sequences, Mach. Learn. 42 (1) (2001) 31–60, http://dx.doi.org/10.1023/A: 1007652502315. [22] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, M.-C. Hsu, Mining sequential patterns by pattern-growth: The prefixspan approach, IEEE Trans. Knowl. Data Eng. 16 (11) (2004) 1424–1440, http: //dx.doi.org/10.1109/TKDE.2004.77. [23] P. Fournier-Viger, A. Gomariz, M. Campos, R. 
Thomas, Fast vertical mining of sequential patterns using co-occurrence information, in: V.S. Tseng, T.B. Ho, Z.-H. Zhou, A.L.P. Chen, H.-Y. Kao (Eds.), Advances in Knowledge Discovery and Data Mining: 18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014. Proceedings, Part I, Springer International Publishing, Cham, 2014, pp. 40–52. [24] T.P. Exarchos, C. Papaloukas, C. Lampros, D.I. Fotiadis, Mining sequential patterns for protein fold recognition, J. Biomed. Inform. 41 (1) (2008) 165–179. [25] C. Antunes, A.L. Oliveira, Generalization of pattern-growth methods for sequential pattern mining with gap constraints, in: P. Perner, A. Rosenfeld (Eds.), Machine Learning and Data Mining in Pattern Recognition: Third International Conference, MLDM 2003 Leipzig, Germany, July 5–7, 2003 Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, pp. 239–251. [26] X. Zhu, X. Wu, Mining complex patterns across sequences with gap requirements, in: Proceedings of the 20th International Joint Conference on Artifical Intelligence, in: IJCAI’07, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007, pp. 2934–2940. [27] C. Li, Q. Yang, J. Wang, M. Li, Efficient mining of gap-constrained subsequences and its various applications, ACM Trans. Knowl. Discov. Data 6 (1) (2012) 2:1–2:39, http://dx.doi.org/10.1145/2133360.2133362. [28] T. Van, B. Vo, B. Le, Mining sequential patterns with itemset constraints, Knowl. Inf. Syst. 57 (2) (2018) 311–330, http://dx.doi.org/10.1007/s10115018-1161-6. [29] H.T. Lam, F. Mörchen, D. Fradkin, T. Calders, Mining compressing sequential patterns, Stat. Anal. Data Min. 7 (1) (2014) 34–52, http://dx.doi.org/10. 1002/sam.11192.

[30] Y. Zheng, Trajectory data mining: An overview, ACM Trans. Intell. Syst. Technol. 6 (3) (2015) 29:1–29:41, http://dx.doi.org/10.1145/2743025.
[31] D.-W. Choi, J. Pei, T. Heinis, Efficient mining of regional movement patterns in semantic trajectories, Proc. VLDB Endow. 10 (13) (2017) 2073–2084, http://dx.doi.org/10.14778/3151106.3151111.
[32] H. Cao, N. Mamoulis, D.W. Cheung, Mining frequent spatio-temporal sequential patterns, in: Proceedings of the Fifth IEEE International Conference on Data Mining, in: ICDM ’05, IEEE Computer Society, Washington, DC, USA, 2005, pp. 82–89, http://dx.doi.org/10.1109/ICDM.2005.95.
[33] F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, Trajectory pattern mining, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in: KDD ’07, ACM, New York, NY, USA, 2007, pp. 330–339, http://dx.doi.org/10.1145/1281192.1281230.
[34] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, Y. Huang, T-drive: Driving directions based on taxi trajectories, in: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, in: GIS ’10, ACM, New York, NY, USA, 2010, pp. 99–108, http://dx.doi.org/10.1145/1869790.1869807.
[35] J.C. Herrera, D.B. Work, R. Herring, X.J. Ban, Q. Jacobson, A.M. Bayen, Evaluation of traffic data obtained via gps-enabled mobile phones: The mobile century field experiment, Transp. Res. C 18 (4) (2010) 568–583, http://dx.doi.org/10.1016/j.trc.2009.10.006.
[36] J. Pei, J. Han, R. Mao, Closet: An efficient algorithm for mining frequent closed itemsets, in: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM, New York, NY, USA, 2000, pp. 21–30.
[37] P. Fournier-Viger, J.C.-W. Lin, R.U. Kiran, Y.S. Koh, R. Thomas, A survey of sequential pattern mining, Data Sci. Pattern Recognit. 1 (1) (2017) 54–77.
[38] P. Newson, J. Krumm, Hidden markov map matching through noise and sparseness, in: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, in: GIS ’09, ACM, New York, NY, USA, 2009, pp. 336–343, http://dx.doi.org/10.1145/1653771.1653818.
[39] P. Fournier-Viger, C.-W. Wu, A. Gomariz, V.S. Tseng, Vmsp: Efficient vertical mining of maximal sequential patterns, in: M. Sokolova, P. van Beek (Eds.), Advances in Artificial Intelligence: 27th Canadian Conference on Artificial Intelligence, Canadian AI 2014, Montréal, QC, Canada, May 6-9, 2014. Proceedings, Springer International Publishing, Cham, 2014, pp. 83–94.
[40] J. Yuan, Y. Zheng, X. Xie, G. Sun, Driving with knowledge from the physical world, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in: KDD ’11, ACM, New York, NY, USA, 2011, pp. 316–324, http://dx.doi.org/10.1145/2020408.2020462.
[41] N. Pelekis, I. Kopanakis, E. Kotsifakos, E. Frentzos, Y. Theodoridis, Clustering trajectories of moving objects in an uncertain world, in: 2009 Ninth IEEE International Conference on Data Mining, IEEE Computer Society, Washington, DC, USA, 2009, pp. 417–427, http://dx.doi.org/10.1109/ICDM.2009.57.
[42] N. Pelekis, I. Kopanakis, E.E. Kotsifakos, E. Frentzos, Y. Theodoridis, Clustering uncertain trajectories, Knowl. Inf. Syst. 28 (1) (2011) 117–147, http://dx.doi.org/10.1007/s10115-010-0316-x.
[43] O. Abul, F. Bonchi, M. Nanni, Never walk alone: Uncertainty for anonymity in moving objects databases, in: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, in: ICDE ’08, IEEE Computer Society, Washington, DC, USA, 2008, pp. 376–385, http://dx.doi.org/10.1109/ICDE.2008.4497446.
[44] C. Panagiotakis, N. Pelekis, I. Kopanakis, E. Ramasso, Y. Theodoridis, Segmentation and sampling of moving object trajectories based on representativeness, IEEE Trans. Knowl. Data Eng. 24 (7) (2012) 1328–1343, http://dx.doi.org/10.1109/TKDE.2011.39.

Please cite this article as: L. Bermingham and I. Lee, Mining distinct and contiguous sequential patterns from large vehicle trajectories, Knowledge-Based Systems (2019) 105076, https://doi.org/10.1016/j.knosys.2019.105076.