A general methodology for n-dimensional trajectory clustering


Expert Systems with Applications 42 (2015) 7573–7581

Contents lists available at ScienceDirect

Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

Luke Bermingham, Ickjai Lee ⇑

Information Technology, College of Business, Law and Governance, James Cook University, PO Box 6811, Cairns, QLD 4870, Australia

Article info

Article history: Available online 15 June 2015

Keywords: Trajectory clustering; High dimensional clustering; Trajectory data mining

Abstract

Trajectory data is rich in dimensionality, often containing valuable patterns in more than just the spatial and temporal dimensions. Yet existing trajectory clustering techniques only consider a fixed number of dimensions. We propose a general trajectory clustering methodology which can detect clusters using any arbitrary number of the n dimensions available in the data. To exemplify our methodology we apply it to an existing trajectory clustering approach, TRACLUS, to create the so-called ND-TRACLUS. Furthermore, in order to better describe the trajectory clusters uncovered when clustering arbitrary dimensions, we also introduce Retraspam, a novel algorithm for n-dimensional representative trajectory formulation. We qualitatively and quantitatively evaluate both our methodology and Retraspam using two real world datasets and find valuable, previously unknown higher dimensional trajectory patterns.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

GPS, sensors, animal trackers, video surveillance systems, and the internet are all being used more frequently than ever before to record moving entities. These entity movements, which are recorded instance by instance, can then be reconnected into a sequence known as a trajectory. This abundance of trajectory data presents a valuable opportunity for both industry and scientific communities to extract previously unknown and insightful patterns regarding entity movements. A popular data mining approach used to discover trajectory movement patterns is trajectory clustering.

Trajectory clustering is the grouping together of similar entity movements in a trajectory dataset. To the best of our knowledge, existing approaches all consider some fixed set of the following dimensions during trajectory clustering: space, time, direction, and speed. These dimensions are logical as they are common to all spatio-temporal trajectories; however, trajectory data often has high dimensionality, and with existing fixed dimension approaches certain high-dimensional trajectory clusters are unobtainable. Thus, there exists the need to perform trajectory clustering using any set of the available quantitative dimensions. Therefore, in this paper we introduce a general methodology for creating n-dimensional trajectory clustering algorithms.

However, we are aware that trajectory clustering by itself, especially as dimensionality increases, is not always sufficient to reveal interesting patterns in the underlying trajectory data. Even after clustering, results can still be too noisy or dense to interpret. To combat this issue a number of previous works (Lee, Han, & Whang, 2007; Li, Lee, Li, & Han, 2010; Yuan, Xia, Zhang, Zhou, & Ji, 2011) introduce algorithms for formulating representative trajectories. In general, these algorithms extract the underlying movement information describing a given trajectory cluster and represent that information as a single comprehensible trajectory. However, when processing n-dimensional trajectory clusters the fixed dimensionality representative trajectory methods described in previous works are not applicable. Therefore, the second contribution of this paper is an n-dimensional representative trajectory formulation algorithm.

The rest of the paper is organised as follows. Section 2 reviews the relevant existing trajectory clustering methods and highlights the problems they face when processing n-dimensional trajectory data. Section 3 identifies and formulates a problem statement. Section 4 covers (1) our n-dimensional trajectory clustering methodology and (2) the application of our methodology to an existing fixed dimensionality approach. Section 5 presents our proposed n-dimensional representative trajectory formulation algorithm. Section 6 shows the experimental results on a range of real world datasets when we apply the n-dimensional trajectory clustering and representative trajectory formulation algorithms we propose. Finally, Section 7 concludes our findings and provides several directions for future work.

2. Literature review

⇑ Corresponding author. Tel.: +61 7 4232 1083; fax: +61 7 4232 1284. E-mail addresses: [email protected] (L. Bermingham), Ickjai. [email protected] (I. Lee).
http://dx.doi.org/10.1016/j.eswa.2015.06.014
0957-4174/© 2015 Elsevier Ltd. All rights reserved.

In this section we review a number of relevant and recent trajectory clustering methods. Our goal is to demonstrate to the reader


the lack of existing trajectory clustering methods able to process an arbitrary set of dimensions, and the hindrances preventing existing methods from being extended to consider extra dimensions. We group the methods we review by the problems they face when extending to n dimensions. We assert that all the methods we consider have one or both of the following two problems: use of a distance function to compute similarity, or use of a fixed dimensionality index to store and retrieve trajectories (or trajectory segments).

2.1. Distance function methods

In data mining, clustering relies on determining the (dis)similarity between a given set of objects (Lee & Yang, 2009). A common approach for computing this similarity is using a distance function. However, as Beyer, Goldstein, Ramakrishnan, and Shaft (1999) highlight, due to the so-called curse of dimensionality there exist certain data distributions where, as dimensionality increases, "the contrast in distances to different data points becomes non-existent". This is one of the reasons distance function based methods are unsuitable for n-dimensional trajectory clustering.

The second argument against distance function based methods being used to cluster arbitrary dimensions is the lack of generality they impose. All of the distance function based trajectory clustering methods we review utilise distance functions that are designed to compute similarity using a very specific number of data dimensions (i.e. spatio-temporal distance functions). To clarify, we do not claim these fixed dimensionality distance functions are ineffective; on the contrary, by design they can uncover very specific types of trajectory patterns. However, what we do claim is that trajectory clustering methods which utilise fixed dimensionality distance functions miss potentially valuable extra dimensional trajectory patterns.

The final issue we highlight with using distance functions in trajectory clustering is the difficulty in extending distance functions to handle additional dimensions. Consider for example a tropical cyclone trajectory dataset that contains dimensions of space, time, wind speed, rainfall, and humidity. Firstly, a fixed dimensionality distance function could not cluster the cyclone trajectories using all available dimensions. Secondly, due to the semantic difference across the dimensions, the notion of a distance function that uses all the dimensions becomes ill-defined.

We now briefly review a number of existing trajectory clustering methods which compute similarity using distance functions. TRACLUS (Lee et al., 2007) is a spatial trajectory clustering method that is based on DBSCAN (Ester, Kriegel, Sander, & Xu, 1996). TRACLUS partitions raw trajectory data into segments to remove noise and decrease clustering time, and then groups the partitioned line segments into clusters. During the clustering phase of TRACLUS similarity is computed using a geometric distance measure. Zhu, Luo, Yin, and Zhou (2010) propose a grid-based clustering method that extracts frequent movement corridors using a spatio-temporal Fréchet distance. Li et al. (2010) extend TRACLUS so that it can handle incremental trajectory data. Clustering is performed incrementally using the concept of micro-clusters that are updated as new data arrives. However, similar to TRACLUS, this method still relies on a geometric distance function. More recent approaches include Wu, Yeh, and Chen (2013), who use a spatial shift distance, a temporal speed distance, and a dimensional weighting function to perform a modified k-means clustering on trajectory segments. Lastly, Costa, Manco, and Masciari (2014) convert raw trajectory data into the frequency space using the discrete Fourier transform and then apply a modified k-means for clustering. Costa et al. (2014) claim their approach "can be generalised to a multi-dimensional representation". Indeed this is true of the mathematical transforms applied; however, the clustering method they employ relies on a fixed dimensionality distance function. Therefore, of the literature presented, none can perform trajectory clustering using any arbitrary number of dimensions.

2.2. Fixed dimensionality index methods

In order to facilitate efficient neighbourhood queries, indexing structures are often employed in clustering algorithms. Many existing trajectory clustering techniques utilise indexing structures in their algorithms. However, the existing algorithms we review that utilise indexing structures all maintain fixed dimensionality indexing structures. Therefore, due to their fixed dimensionality these indexing structures have a number of the same aforementioned problems as distance functions. Specifically, they are difficult to extend to additional dimensions and are unable to discover extra dimensional trajectory patterns. Furthermore, a number of indexing structures compute neighbourhood queries using distance functions, and therefore by extension share all the aforementioned issues of distance functions.

We now briefly review a number of existing trajectory clustering methods which utilise fixed dimensionality indexing structures. Yuan et al. (2011) introduce a trajectory clustering algorithm that uses a scoring and similarity measure based on segment direction, speed, and angle. These scored segments are then indexed in an R*-tree like data structure for clustering. Yanwei, Qin, Xiaodong, Huan, and Jie (2013) introduce an incremental spatio-temporal trajectory clustering method that uses a fixed dimensionality index tree to store line segments. However, unlike Yuan et al. (2011), this approach does not use a scoring metric, but rather a geometric distance function between trajectory line segments. Lastly, Deng, Hu, Zhu, Huang, and Du (2014) introduce a parallelised GPGPU approach that uses a spatio-temporal index and distance function to perform density based trajectory clustering.
Therefore, we highlight again that, of the literature presented, none can perform trajectory clustering using any arbitrary number of dimensions.

3. Problem statement

In Table 1, we summarise our review of existing trajectory clustering approaches and the issues each faces when extended to n-dimensional trajectory data. Based on a review of the existing literature we have identified the following gaps in the field of trajectory clustering:

• No existing method can perform trajectory clustering using any set of the available quantitative trajectory dimensions;
• By extension there currently exists no n-dimensional representative trajectory formulation method.

In order to overcome the identified gaps in the literature we propose: (1) a methodology for achieving n-dimensional trajectory clustering, and (2) an n-dimensional representative trajectory formulation method.

Table 1
Overview of relevant trajectory clustering literature, highlighting the restrictions present for clustering n-dimensional trajectory data.

| Existing method | Clustering concept | n-D* | Issue for n-dimensional data |
|---|---|---|---|
| Lee et al. (2007) | Density | x, y | Distance function |
| Zhu et al. (2010) | Hierarchy | x, y, t, s | Distance function |
| Li et al. (2010) | Density | x, y | Distance function |
| Yuan et al. (2011) | Density | x, y, d, s | Fixed dimensionality index |
| Yanwei et al. (2013) | Density | x, y, t | Fixed dimensionality index and distance function |
| Wu et al. (2013) | Density | x, y, t, s | Distance function |
| Deng et al. (2014) | Density | x, y, t | Distance function |
| Costa et al. (2014) | Density | x, y, t, d | Distance function |

* n-D are the dimensions considered in each approach. The lettered abbreviations are as follows: x/y = spatial dimensions (i.e. longitude and latitude), t = temporal dimension (i.e. time-stamp), s = speed dimension, d = directional dimension.

4. n-Dimensional trajectory clustering methodology

In order to achieve trajectory clustering using any set of arbitrary dimensions we must deviate from existing approaches. Specifically, our methodology avoids the use of both distance functions and fixed dimensionality indexing structures. Without the use of such fundamental clustering techniques, the reader may think our methodology lacks applicability to existing approaches. Therefore, to demonstrate the applicability of our methodology we apply it to TRACLUS (Lee et al., 2007). We choose TRACLUS because it is a well known spatial trajectory clustering algorithm that uses a distance function. Thus, it is an ideal candidate to demonstrate the full application of our methodology for creating n-dimensional trajectory clustering algorithms. In the following sections we detail the specifics of the various steps required to apply our methodology.

4.1. n-Dimensional trajectory feature vectors

The core of our n-dimensional clustering approach is representing trajectories in an n-dimensional feature space. First, we consider each point in the trajectory sequence as n-dimensional (storing a value for each considered dimension). Then, because trajectories describe moving entities, we connect sequential points together to form n-dimensional feature vectors (or in geometric terms, line segments). An illustration of our data representation is shown in Fig. 1.

4.2. n-Dimensional indexing structure

The next stage of our methodology is calculating line segment similarity without the use of a distance function.
We achieve this by utilising an n-dimensional indexing structure that fulfils neighbourhood queries using n-dimensional hyper rectangles. The specific data structure we choose is the PR-Tree (Arge, Berg, Haverkort, & Yi, 2008).1 We choose the PR-Tree not only for its ability to index our n-dimensional trajectory line segments, but also because of its practical efficiency, as was shown experimentally (Arge et al., 2008). Arge et al. (2008) claim that a query on the PR-Tree can be answered in the theoretically optimal $O(\sqrt{N/B} + T/B)$ I/Os.2

The PR-Tree is designed to store and return neighbourhood queries with regard to n-dimensional hyper rectangles. Therefore, in order to populate the PR-Tree with the n-dimensional trajectory line segments we must compute their n-dimensional Minimum Bounding Boxes (n-D MBBs). Once the MBBs are known we can perform queries using trajectory line segments. The real usefulness is gained when we expand the line segment MBBs in each dimension (using specified epsilon parameters), because then line segment neighbourhood queries can be performed. These neighbourhood queries are utilised in our methodology to determine similar line segments during clustering. This process of calculating the MBBs and expanding them for neighbourhood querying is shown in Fig. 2.

1 The specific PR-Tree implementation we use is found at: http://www.khelekore.org/prtree/.
2 Where N is the number of rectangles stored in the tree, B is the disk block size, and T is the output size.

Fig. 1. n-Dimensional trajectory feature vectors.
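To make the representation of Section 4.1 and the MBB-based neighbourhood query of Section 4.2 concrete, the following is a minimal Python sketch rather than the authors' Java implementation. All function names are hypothetical, a brute-force scan stands in for the PR-Tree, and it is assumed that a neighbourhood query returns every segment whose MBB intersects the expanded query MBB.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]                           # one value per clustered dimension
Segment = Tuple[Point, Point]                       # (start point, end point)
Box = Tuple[Tuple[float, ...], Tuple[float, ...]]   # (minimums, maximums)


def to_segments(points: Sequence[Point]) -> List[Segment]:
    """Connect consecutive n-dimensional points into line segments."""
    return [(points[i], points[i + 1]) for i in range(len(points) - 1)]


def mbb(seg: Segment) -> Box:
    """Minimum bounding box of a segment: per-dimension minimum and maximum."""
    start, end = seg
    return (tuple(min(a, b) for a, b in zip(start, end)),
            tuple(max(a, b) for a, b in zip(start, end)))


def expand(box: Box, eps: Sequence[float]) -> Box:
    """Grow a box by its own epsilon in each dimension."""
    lo, hi = box
    return (tuple(l - e for l, e in zip(lo, eps)),
            tuple(h + e for h, e in zip(hi, eps)))


def intersects(a: Box, b: Box) -> bool:
    """Two boxes intersect when their extents overlap in every dimension."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]))


def neighbours(query: Segment, segments: Sequence[Segment],
               eps: Sequence[float]) -> List[int]:
    """Indices of segments whose MBB meets the expanded MBB of the query.

    Assumption: a linear scan replaces the PR-Tree lookup.
    """
    window = expand(mbb(query), eps)
    return [i for i, s in enumerate(segments) if intersects(window, mbb(s))]


# Hypothetical trajectories with dimensions (x, y, t).
traj_a = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
traj_b = [(8.0, 8.0, 8.0), (9.0, 9.0, 9.0)]
segments = to_segments(traj_a) + to_segments(traj_b)
eps = (0.5, 0.5, 0.5)
print(neighbours(segments[0], segments, eps))
```

Because each dimension has its own epsilon and is only ever compared with itself, no single distance value across semantically different dimensions is needed.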

4.3. Using our methodology to create ND-TRACLUS

TRACLUS performs line segment trajectory clustering by using a modified DBSCAN approach. This modified DBSCAN performs neighbourhood queries using a spatial distance function. For reasons listed in Section 2.1 this use of a distance function hinders TRACLUS from being applied to n-dimensional trajectory data. Therefore to create a so-called ND-TRACLUS we must apply our methodology. The first and fundamental stage of our methodology, as discussed in Section 4.1, is the expression of trajectory data as n-dimensional line segments. Once we have n-dimensional line segments we can construct an n-dimensional indexing structure for performing neighbourhood queries. In the context of TRACLUS the n-dimensional index allows us to perform the modified DBSCAN using any set of arbitrary dimensions. This construction of a PR-Tree for our ND-TRACLUS implementation is given in Algorithm 1.

Algorithm 1. BuildND
 1: Input: A set of n-dimensional line segments, D.
 2: Output:
 3: (1) A PRTree of R^n line segments P = {S_1, ..., S_nSegs}.
 4: for line segment LS ∈ D do
 5:     /* Compute a R^n MBB */
 6:     Compute N for LS
 7:     /* Set the segment state for TRACLUS */
 8:     Set status of LS to UNASSIGNED
 9:     Insert N into P
10: end for
11: return P

However, if we were applying our methodology to a trajectory clustering algorithm that did not rely on a distance function and already had a specific indexing structure then we could forgo this step. For example, in the case of Yuan et al. (2011), which already contains an indexing structure, we could try to modify the indexing structure to accommodate our n-dimensional line segments; however, due to the scoring function they use we speculate this may be non-trivial.

Fig. 2. Computing the n-dimensional MBB of a trajectory line segment and then expanding it in each dimension for neighbourhood querying.

We now revise the modified DBSCAN algorithm presented in Lee et al. (2007). We make minor changes to facilitate the extension to n dimensions. The biggest change we make is the inclusion of multiple epsilon values when searching the line segment neighbourhood. ND-TRACLUS is a straightforward extension of TRACLUS to n dimensions using our methodology, therefore any original parameters that are tied to dimensionality must be replicated as dimensionality increases. We highlight that this is not a fault of our methodology but a reflection of the nature of the underlying algorithm being extended. We present our extended ND-TRACLUS algorithm, with the revised modified DBSCAN, in Algorithm 2.

Algorithm 2. ND-TRACLUS
 1: Input:
 2: (1) T, a trajectory database of line segments.
 3: (2) MinLns, min line segments per cluster.
 4: (3) Eps[], array of epsilon in each dimension.
 5: Output: C, a set of line segment clusters.
 6: /* Build PR-Tree */
 7: Assign P to BuildND(T)
 8: /* Trajectory line segment clusters */
 9: Assign C to ∅
10: for Segment LS in P do
11:     if LS is UNASSIGNED then
12:         /* Create an empty cluster */
13:         Assign C_i to ∅
14:         /* Create a set of growth candidates */
15:         Assign G to {LS}
16:         while (length of G > 0) do
17:             /* Assign current growth candidate */
18:             Create G_0 by popping head of G
19:             Create E by expanding G_0 by Eps[]
20:             /* Find neighbours of the expanded segment */
21:             Assign N to the query of E in P
22:             /* N.B. N always contains LS */
23:             if (N has >= MinLns + 1 segments) then
24:                 for Segment S in N do
25:                     Set status of S to ASSIGNED
26:                     Add S to C_i
27:                     /* Add neighbours to growth candidates */
28:                     if (S ≠ LS) then
29:                         Add S to G
30:                     end if
31:                 end for
32:             else
33:                 for S in N do
34:                     Set status of S to NOISE
35:                 end for
36:             end if
37:         end while
38:         /* Preserve the current cluster */
39:         if C_i ≠ ∅ then
40:             Add C_i to C
41:         end if
42:     end if
43: end for
44: return C
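The cluster-growing loop of Algorithm 2 can be sketched in Python as below. This is an illustrative approximation, not the authors' code: the names are hypothetical, a linear scan replaces the PR-Tree, and two behaviours are assumptions added so that the loop terminates and earlier assignments are not overwritten, namely that a segment is enqueued only the first time it becomes ASSIGNED and that only UNASSIGNED segments are ever marked NOISE.

```python
from collections import deque
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Segment = Tuple[Point, Point]

UNASSIGNED, ASSIGNED, NOISE = "UNASSIGNED", "ASSIGNED", "NOISE"


def _bounds(seg: Segment):
    """Per-dimension minimum and maximum of a segment (its MBB)."""
    start, end = seg
    lo = tuple(min(a, b) for a, b in zip(start, end))
    hi = tuple(max(a, b) for a, b in zip(start, end))
    return lo, hi


def _query(idx: int, segments: Sequence[Segment], eps: Sequence[float]) -> List[int]:
    """Brute-force stand-in for querying the expanded MBB in the PR-Tree."""
    lo, hi = _bounds(segments[idx])
    lo = tuple(l - e for l, e in zip(lo, eps))
    hi = tuple(h + e for h, e in zip(hi, eps))
    hits = []
    for j, other in enumerate(segments):
        o_lo, o_hi = _bounds(other)
        if all(a <= d and c <= b for a, b, c, d in zip(lo, hi, o_lo, o_hi)):
            hits.append(j)
    return hits


def nd_traclus(segments: Sequence[Segment], min_lns: int,
               eps: Sequence[float]) -> List[List[int]]:
    """Return clusters as lists of segment indices."""
    status = [UNASSIGNED] * len(segments)
    clusters: List[List[int]] = []
    for seed in range(len(segments)):
        if status[seed] != UNASSIGNED:
            continue
        cluster: List[int] = []
        growth = deque([seed])
        while growth:
            current = growth.popleft()
            found = _query(current, segments, eps)  # always contains `current`
            if len(found) >= min_lns + 1:
                for s in found:
                    if status[s] != ASSIGNED:       # assumption: handle each segment once
                        status[s] = ASSIGNED
                        cluster.append(s)
                        if s != current:
                            growth.append(s)
            else:
                for s in found:
                    if status[s] == UNASSIGNED:     # assumption: keep earlier assignments
                        status[s] = NOISE
        if cluster:
            clusters.append(cluster)
    return clusters


# Hypothetical 2-D data: four nearby segments and one isolated segment.
demo = [((0.0, 0.0), (1.0, 0.0)), ((0.0, 0.1), (1.0, 0.1)),
        ((0.0, 0.2), (1.0, 0.2)), ((0.1, 0.3), (1.1, 0.3)),
        ((10.0, 10.0), (11.0, 10.0))]
print(nd_traclus(demo, min_lns=2, eps=(0.5, 0.5)))
```

Adding a dimension to the clustering only means adding a coordinate to each point and one more entry to `eps`; the loop itself is unchanged.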

5. n-Dimensional representative trajectory formulation

We now introduce the other contribution of this paper, our algorithm for representative trajectory formulation using n-dimensional trajectory line segment clusters. Our approach is based on splitting and merging the most representative line segments to form simplified, yet dataset preserving, trajectory representations. Due to this splitting and merging process we call our method "Retraspam" (REpresentative TRAjectory SPlit And Merge). Retraspam takes a trajectory line segment cluster as input and divides it to extract the most representative segments. Any nearby representative segments are then grouped together to form sub-graphs, which finally have their disjoint segments merged together to create the representative trajectories. A visual overview of the process is illustrated in Fig. 3. The exact Retraspam algorithm is given in two parts: first the splitting part in Algorithm 3, and secondly the merging part in Algorithm 5.

Algorithm 3. Retraspam
 1: Input:
 2: (1) C, n-dimensional line segments cluster.
 3: (2) MinLns, min line segments per cluster.
 4: (3) Eps[], array of epsilon in each dimension.
 5: Output: G, Representative trajectory sub-graphs.
 6: /* (1) Extracting and splitting phase */
 7: /* Create PRTree from segments in cluster */
 8: Create P from BuildND(C)
 9: /* Create a MBB from the PR-Tree */
10: Create Q_0 from boundary of P
11: Set DEPTH of Q_0 to 0
12: /* A set of MBB to split */
13: Create toSplit = {Q_0}
14: /* Create MBB from the epsilon dimensions */
15: Create Q_small from MBB of Eps[]
16: Assign RepSegs to ∅
17: while (toSplit is not empty) do
18:     Assign Q_cur to first popped off toSplit
19:     Assign Segs to the query of Q_cur in P
20:     /* Remove any already represented segments */
21:     Remove all RepSegs from Segs
22:     Assign N_cells to number of Segs
23:     /* If current MBB is dense enough to split
24:        and the current MBB is greater than any epsilons */
25:     if (N_cells > MinLns AND Q_cur greater than Q_small) then
26:         /* Halve the current MBB across n-dimensions */
27:         Assign Q_splits to the halving of Q_cur in R^n
28:         Set DEPTH of all MBB in Q_splits to DEPTH of Q_cur + 1
29:         Add Q_splits to toSplit
30:     else if (N_cells > 0 AND (N_cells + DEPTH of Q_cur) > MinLns) then
31:         Assign S_rep to ExtractRepSeg(Dims, Segs)
32:         Add S_rep to RepSegs
33:     end if
34: end while
35: return GraphAndJoin(RepSegs, MinLns)
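The halving step on lines 26–27 of Algorithm 3 splits an n-dimensional MBB at its midpoint in every dimension, giving 2^n children. A minimal Python sketch follows; the function names are hypothetical, and reading "Q_cur greater than Q_small" as "every side of the box exceeds the epsilon of that dimension" is an assumption.

```python
from itertools import product
from typing import List, Sequence, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (minimums, maximums)


def halve(box: Box) -> List[Box]:
    """Split a box at its midpoint in every dimension, giving 2**n children."""
    lo, hi = box
    mid = tuple((l + h) / 2.0 for l, h in zip(lo, hi))
    children = []
    # Each choice picks the lower (0) or upper (1) half in each dimension.
    for choice in product((0, 1), repeat=len(lo)):
        child_lo = tuple(m if c else l for c, l, m in zip(choice, lo, mid))
        child_hi = tuple(h if c else m for c, m, h in zip(choice, mid, hi))
        children.append((child_lo, child_hi))
    return children


def larger_than(box: Box, eps: Sequence[float]) -> bool:
    """Assumed reading of the size test: every side is longer than its epsilon."""
    lo, hi = box
    return all((h - l) > e for l, h, e in zip(lo, hi, eps))


square = ((0.0, 0.0), (2.0, 2.0))
cube = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
print(len(halve(square)), len(halve(cube)))
```

The number of children doubles with each added dimension, so the epsilon-based size test matters in practice for bounding how deep the splitting goes.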

Firstly we detail the process taking place in Algorithm 3. We start by inputting a trajectory line segment cluster for representative trajectory extraction. This is followed on lines 7–8 by populating a PRTree with the n-dimensional line segment cluster. This is the same process as described in Section 4.2. The next step spans lines 9–34; in this step an MBB of the whole PR-Tree is computed and repeatedly divided in half along each dimension. Specifically, a 2D MBB becomes four rectangles, a 3D MBB becomes eight cubes, and so on for each order of dimensionality. During this splitting process a heuristic based on the minimum lines and epsilon parameters from the clustering stage is used to determine how to process the split quadrants. Lines 23–29 show that if the MBB segment density is too high and the MBB is still larger than each epsilon dimension then the MBB is split. Whereas, lines 30–33 state that if the density of the current MBB plus the number of times it has been split is greater than or equal to the minimum number of lines to form a cluster then a representative segment should be extracted from the current MBB. The process of extracting a representative line segment from a set of line segments is given in Algorithm 4.

Algorithm 4. ExtractRepSeg
 1: Input:
 2: Dims, a set of clustering dimensions.
 3: Segs, a set of n-dimensional line segments.
 4: Output: A representative line segment S_rep.
 5: for (Dimension Dim_cur in Dims) do
 6:     /* Find mean dimensional value of segments */
 7:     Compute μ of Dim_cur from Segs
 8:     /* Find dimension standard deviation of segments */
 9:     Compute σ of Dim_cur from Segs
10:     for (Segment S in Segs) do
11:         Add to SCORE of S: (|Dim_cur of S − μ|) ÷ σ
12:     end for
13: end for
14: return S with lowest SCORE
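Algorithm 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the paper does not say how a segment's value in a dimension is obtained, so the sketch takes the midpoint of its two endpoints; it uses the population standard deviation; and it skips any dimension whose standard deviation is zero to avoid dividing by zero. The function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Segment = Tuple[Point, Point]


def extract_rep_seg(dims: Sequence[int], segs: Sequence[Segment]) -> Segment:
    """Return the segment with the lowest summed deviation from the mean,
    measured in standard deviations, over the given dimensions."""
    scores = [0.0] * len(segs)
    for d in dims:
        # Assumption: a segment's value in dimension d is its endpoint midpoint.
        values = [(start[d] + end[d]) / 2.0 for start, end in segs]
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            continue  # all segments agree in this dimension; nothing to score
        for i, v in enumerate(values):
            scores[i] += abs(v - mu) / sigma
    best = min(range(len(segs)), key=scores.__getitem__)
    return segs[best]


# Hypothetical 2-D segments that differ only in their y value.
group = [((0.0, 0.0), (1.0, 0.0)),
         ((0.0, 1.0), (1.0, 1.0)),
         ((0.0, 5.0), (1.0, 5.0))]
print(extract_rep_seg((0, 1), group))
```

Because the result is always one of the input segments, the representative is an observed movement rather than a synthetic average.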

To briefly explain Algorithm 4, a representative line segment is computed by finding the average and standard deviation of all the line segments in each dimension. A score is then calculated based on how many standard deviations away from the mean the current segment's dimensional values are. To summarise, the line segment with the lowest total dimensional dissonance from the dimensional mean must be a good average representation, and therefore the best candidate to represent the segment group. Once the representative line segment extraction phase is complete, the next stage of Retraspam is to merge the segments together into sub-graphs. This process is presented in Algorithm 5.

Algorithm 5. GraphAndJoin
 1: Input:
 2: (1) RepSegs, representative n-dimensional line segments.
 3: (2) MinLns, min line segments per cluster.
 4: Output: G, Representative trajectory sub-graphs.
 5: /* (2) Graph construction phase */
 6: /* Create PRTree from representative segments */
 7: Create P from BuildND(RepSegs)
 8: /* Create a set for processed segments */
 9: Create S_done as ∅
10: for Segment S in RepSegs do
11:     /* Find neighbours of segment */
12:     Assign N to query of MBB S in P
13:     Remove all S_done from N
14:     /* Check if segment in sub-graphs */
15:     if (S not in S_done) then
16:         /* Start a sub-graph with segment */
17:         Create G_cur = {S, N_0, ..., N_n}
18:         Set {N_0, ..., N_n} as edges of S
19:         Add S and {N_0, ..., N_n} to S_done
20:         Add G_cur to G
21:     else
22:         /* Find sub-graph containing segment */
23:         if (N not empty) then
24:             Find G_cur containing S
25:             Add {N_0, ..., N_n} to G_cur
26:             Set {N_0, ..., N_n} as edges of S
27:             Add {N_0, ..., N_n} to S_done
28:         end if
29:     end if
30: end for
31: /* (3) Pruning phase */
32: for Subgraph G_cur in G do
33:     if (Size of G_cur <= MinLns) then
34:         Remove G_cur from G
35:     else if G_cur has leaf nodes then
36:         Remove leaf nodes
37:     end if
38: end for
39: /* (4) Merging phase */
40: for Subgraph G_cur in G do
41:     for Segment S in G_cur do
42:         if (S is disjoint) then
43:             /* Find the closest segment */
44:             Find S_close to S
45:             /* Join segments by closest terminals */
46:             In G_cur create edge S_close to S
47:         end if
48:     end for
49: end for
50: return G

Algorithm 5 is the final stage of the Retraspam algorithm, merging the representative segments together. The process begins on lines 6–7 with the creation of a PRTree that is filled with representative n-dimensional line segments. Lines 9–30 show the construction of sub-graphs from these line segments. Specifically, each line segment is iterated over and its MBB neighbours computed in lines 10–13. If the current line segment has not been processed yet then in lines 14–20 the current line segment and its neighbours are added to a new sub-graph, where the current segment has edge connections to all the computed neighbours. Whereas, if the current segment has been processed, lines 21–29 find the sub-graph the current segment belongs to and add the neighbour segments as edges to the current segment. This concludes the graph construction phase and begins the pruning phase.

The pruning phase spans lines 31–38. Its purpose is firstly to remove any sub-graphs which do not have at least the minimum number of lines needed to form a cluster (i.e. noise), and secondly to remove any leaf nodes from the sub-graphs as these are assumed to be of no interest. Lastly, this brings us to the merging phase from lines 39–49. In the merging phase each sub-graph is traversed and any segments that are not connected to every other segment (disjoint) are connected by their closest terminal to the closest terminal of a non-disjoint line segment in the sub-graph. The result is a fully connected, yet potentially multi-path, representative trajectory that preserves real trajectory movements.

There are a number of advantages that Retraspam has over traditional representative trajectory methodologies.
Firstly the representative trajectories it produces utilise the underlying data, which provides a more accurate representation than averaging based representative methods such as Lee et al. (2007). Secondly it can uncover complex graph-like representative trajectory paths (e.g. a forking river can now be represented). Finally, the density-based approach it uses considers n dimensions, similar to that used in Section 4.2, therefore n-dimensional patterns are still preserved in the results.

Fig. 3. Retraspam representative trajectory formulation.

6. Experimental results

We now conduct a number of experiments on the applied version of our methodology, ND-TRACLUS. Our experiments aim to discover the effect our methodology has on the original clustering algorithm, TRACLUS, with regard to: (1) efficiency, and (2) uncovering previously unknown trajectory clusters. The second set of experiments we perform compares the efficiency and effectiveness of Retraspam against the Lee et al. (2007) representative trajectory method, which we henceforth refer to as RepAvg, due to its averaging nature. Our experimental framework is implemented in Java, which allows it to use the 3D OpenGL GIS visualisation features of NASA's "WorldWind SDK 2.0".3 The three dimensions are utilised in visualisation of trajectories and patterns as follows: {x → longitude, y → latitude, z → temporal} (z is the up-axis).

6.1. Trajectory datasets

In the experiments we conduct, we use two real world datasets, which we refer to as the Athens trucks dataset4 and Starkey elk dataset.5 By using real datasets we can gauge the applicability of our proposed methods in terms of their suitability for other real world trajectory datasets. We chose the Starkey elk dataset as it is used by Lee et al. (2007) in their experiments, thus it is of interest whether our methodology, when applied to their trajectory clustering algorithm, can uncover previously undiscovered higher dimensional patterns. The other dataset, the Athens truck dataset, is a popular dataset used previously in the field of trajectory clustering (Pelekis, Kopanakis, Kotsifakos, Frentzos, & Theodoridis, 2009). The Athens truck dataset is a road network constrained trajectory dataset of GPS-tracked trucks transporting concrete. The truck dataset essentially maps the transit roads of Athens, therefore it is an ideal candidate to test each method's ability to preserve the underlying road features. The relevant characteristics of each dataset are shown in Table 2; additionally, Fig. 4 provides a visual illustration of each dataset.

Table 2
Dataset characteristics.

| | Starkey elk | Athens trucks |
|---|---|---|
| Trajectories | 33 | 50 |
| Entries | 47204 | 112203 |
| Time range | May 6 1993–Aug 15th 1993 | Aug 6 2002–Sept 16th 2002 |
| Diagonal distance | 11 km | 50 km |
| Data type | Real, noisy | Real, structured |

6.2. ND-TRACLUS experiments

Using the approach outlined in Section 4.3 we use our methodology to produce ND-TRACLUS, which can cluster in n-dimensions. In order to evaluate our methodology we conduct a number of experiments on ND-TRACLUS to ascertain that it: (1) retains useful clustering results as dimensionality increases, and (2) produces results in an affordable running time. The first set of experiments we conduct on the Starkey elk and Athens trucks datasets are designed to highlight the effects of increasing dimensionality on trajectory clustering. In these experiments we investigate the power of higher dimensional clustering in regards to finding previously unknown dimensional patterns, the effects that dimensionality has in general on trajectory clustering, and the correlation between certain dimensional combinations. First off we conduct clustering with increasing dimensionality on the Starkey elk dataset. ND-TRACLUS inherits all the properties on TRACLUS, meaning whilst it can benefit from density based clustering, it also suffers from parameter sensitivity. Thus all the ND-TRACLUS trajectory clustering experiments we conduct in this paper are performed using hand picked optimal parameters. We present the results of this experiment in Fig. 5. We chose the following parameters: spatial epsilon of 55.454 m, temporal epsilon 1 h 6 min 53 s 395 ms, directional epsilon of 30°, speed epsilon of 7  108 m/s. Fig. 5 indicates that as dimensionality increases clusters are harder to form and thus representative trajectories are less frequent. In particular, we draw focus to the fact that in order to keep 3 4 5

http://worldwind.arc.nasa.gov/java/ http://www.chorochronos.org/?q=node/5 http://www.fs.fed.us/pnw/starkey/mapsdata.shtml

L. Bermingham, I. Lee / Expert Systems with Applications 42 (2015) 7573–7581


Fig. 4. Datasets used in experiments: (a) Starkey project elk telemetry dataset; (b) Athens trucks dataset. Trajectories shown in red. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. Higher dimensional trajectory clusters in Athens trucks data: (a) 3D (x, y, time) – 13 trajectories, Min Lines 15; (b) 4D (x, y, direction, time) – 21 trajectories, Min Lines 14; (c) 5D (x, y, direction, time, speed) – 20 trajectories, Min Lines 7.

Fig. 5. Dimensional effects on Starkey elk data: (a) 3D (x, y, time) – 8 trajectories, Min Lines 12; (b) 4D (x, y, direction, time) – 5 trajectories, Min Lines 5; (c) 5D (x, y, direction, time, speed) – 13 trajectories, Min Lines 3.
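One way to read the trend in Fig. 5 is that every added dimension contributes one more threshold that a pair of line segments must satisfy, so a segment's neighbourhood can only stay the same size or shrink, and the minimum number of lines has to be lowered to keep forming clusters. The following is a minimal sketch of that effect and not the distance functions of ND-TRACLUS itself. It assumes, purely for illustration, that each segment is summarised by a hypothetical `Segment` record (midpoint, mean timestamp, heading) and that two segments are neighbours when every selected per-dimension difference is within its epsilon. The epsilon values mirror the Starkey parameters quoted in the text; the speed dimension is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical summary of a trajectory line segment (an assumption)."""
    x: float        # midpoint easting in metres
    y: float        # midpoint northing in metres
    t: float        # mean timestamp in seconds
    heading: float  # direction of travel in degrees, [0, 360)


def spatial_diff(a, b):
    return math.hypot(a.x - b.x, a.y - b.y)


def temporal_diff(a, b):
    return abs(a.t - b.t)


def directional_diff(a, b):
    # Smallest angle between the two headings, in degrees.
    d = abs(a.heading - b.heading) % 360.0
    return min(d, 360.0 - d)


# One (difference function, epsilon) pair per dimension.
DIMENSIONS = {
    "space": (spatial_diff, 55.454),                      # metres
    "time": (temporal_diff, 1 * 3600 + 6 * 60 + 53.395),  # seconds
    "direction": (directional_diff, 30.0),                # degrees
}


def neighbours(query, segments, dims):
    """Segments within epsilon of `query` in every selected dimension."""
    return [
        s for s in segments
        if s is not query
        and all(DIMENSIONS[d][0](query, s) <= DIMENSIONS[d][1] for d in dims)
    ]


q = Segment(0.0, 0.0, 0.0, 90.0)
segments = [
    q,
    Segment(10.0, 10.0, 100.0, 95.0),   # close in every dimension
    Segment(20.0, 0.0, 9000.0, 92.0),   # close in space, far in time
    Segment(30.0, 5.0, 200.0, 200.0),   # close in space and time, other heading
    Segment(500.0, 0.0, 50.0, 90.0),    # far in space
]

counts = [
    len(neighbours(q, segments, dims))
    for dims in (("space",), ("space", "time"), ("space", "time", "direction"))
]
print(counts)
```

Under these assumptions the neighbourhood of `q` shrinks with each added dimension, which is consistent with the lower Min Lines values reported for the 4D and 5D panels.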

discovering patterns as dimensionality increased we had to decrease the minimum number of lines required to form a cluster. Despite this, the experimental results show that ND-TRACLUS uncovers a number of interesting observations due to the inclusion of higher dimensionality. Specifically, in Fig. 5(a) we see the initial detection of the core spatio-temporal movement regions, which, when evolved by the dimensional effects of (b) and (c), are refined but clearly preserved. We assert that this preservation of the same region throughout higher dimensions validates the detection of a real pattern; this is one advantage of the higher dimensional clustering offered by our methodology. In contrast to this preserved cluster, ND-TRACLUS also detects patterns unique to each level of dimensionality. For example, in (b) we see the inclusion of entity movement direction, and this reveals a large straight pattern (orange) moving along the top of the dataset. Additionally, in (c) we observe the core regions from (a) staying in the same region yet demonstrating new paths; we conjecture that the inclusion of speed discovers specific elk movements where many elk moved at a steady rate in the same direction (i.e. to a new feeding area). These results indicate that ND-TRACLUS can discover previously unavailable higher dimensional trajectory patterns.

We now perform the same experiment on the Athens trucks dataset. The results are shown in Fig. 6. Clustering was performed using the following parameters: spatial epsilon of 246.739 m, temporal epsilon of 3 h 51 min 36 s 115 ms, directional epsilon of 10°, speed epsilon of 5  106 m/s. The results in Fig. 6 are similar to the results from the Starkey dataset: as dimensionality increases the core patterns become refined, yet are still preserved. This is particularly evident in the transition from (a) to (b), where we observe the large spatio-temporal pattern to the north of the dataset being divided into smaller,

straighter subsets of the original pattern. Furthermore, this splitting behaviour is also observed when transitioning from (b) to (c). Additionally, mirroring the higher dimensional findings of the Starkey dataset, we again find that when dimensionality is increased in the Athens dataset, new unique patterns emerge. Notably, in (b) and (c) we discover the pattern of trucks travelling on a south-eastern road. This is significant because this pattern is not found in (a), which leads us to conclude that added dimensionality does not solely reveal subset patterns but, given the optimal parameters, can also uncover new, previously unknown patterns.

The next experiment evaluates the efficiency implications of our methodology. We do this by using a range of parameters and measuring the running time of the clustering stage for both TRACLUS and ND-TRACLUS. The results are shown in Fig. 7. Fig. 7 indicates an expected result: ND-TRACLUS clusters the Starkey elk data approximately 3 times faster, and the larger Athens trucks dataset approximately 6 times faster. This is because the clustering stage of the original TRACLUS is O(n²) without a spatial index, whereas ND-TRACLUS utilises an indexing structure. What this experiment clearly shows is that no efficiency penalty is imposed by applying our methodology to TRACLUS; in fact, the opposite is true. We expect this to hold true for application of our methodology to other trajectory clustering approaches too.

6.3. Retraspam experiments

The first experiment we conduct is designed to examine the quality of representative trajectories produced by Retraspam against those produced by RepAvg. In this experiment we perform trajectory clustering on the Athens data using TRACLUS. Then, using the TRACLUS clusters as input, we perform Retraspam and RepAvg. The representative trajectories extracted using each approach are shown side by side in Fig. 8. The results shown in Fig. 8 reveal that the RepAvg approach produces representative trajectories in areas where there is no road network, whereas our Retraspam approach, which utilises the underlying data, clearly maps to the constrained road network. The results also indicate a valuable type of specific pattern our Retraspam approach is able to uncover. Specifically, Fig. 8(a) uncovers forking movements in the road network. This is significant because due to RepAvg’s averaging


Fig. 7. Average running time of clustering stage.

Fig. 9. Density based RoI score applied to representative trajectory methods.
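The claim that an averaging-based representative cannot express a fork can be illustrated with a toy example. This is a simplified stand-in and not the RepAvg procedure itself: it assumes two hypothetical trajectories that share a trunk and then diverge symmetrically, and it averages their y-values at each shared x position. The averaged path continues straight ahead between the two arms, where no movement was observed.

```python
def branch(sign, xs):
    """One trajectory: shared trunk along y = 0 up to x = 5, then a diagonal arm."""
    return [(x, sign * max(0.0, x - 5.0)) for x in xs]


def average_representative(trajectories):
    """Average the y-values of all trajectories at each shared x position."""
    rep = []
    for points in zip(*trajectories):
        x = points[0][0]
        mean_y = sum(p[1] for p in points) / len(points)
        rep.append((x, mean_y))
    return rep


def gap_to_data(point, trajectories):
    """Vertical gap between a representative point and the nearest observation at the same x."""
    x, y = point
    return min(
        abs(y - py)
        for traj in trajectories
        for (px, py) in traj
        if px == x
    )


xs = [float(x) for x in range(11)]
upper = branch(+1.0, xs)   # arm heading up after the fork
lower = branch(-1.0, xs)   # arm heading down after the fork

rep = average_representative([upper, lower])
gaps = [gap_to_data(p, [upper, lower]) for p in rep]
print(gaps)
```

On the shared trunk the gap is zero, but after the fork it grows with every step, so the averaged representative drifts away from both arms. A representative built from the observed data, as the text describes for Retraspam, is not subject to this particular failure.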

based approach it is not possible for it to uncover such a pattern. Finally, analysis of the forking patterns uncovered by Retraspam indicates that they align with major highways in Athens. Therefore we claim these results indicate that Retraspam outperforms existing approaches in terms of representative trajectory quality. We further investigate this outcome with quantitative effectiveness experiments.

In order to quantitatively gauge the effectiveness of Retraspam against RepAvg we utilise a density-based Region of Interest (RoI) detection method presented by Giannotti, Nanni, Pinelli, and Pedreschi (2007). The RoI detection method partitions the study region into a grid, and then computes the density of each grid cell using the number of trajectories that pass through it. Dense cells are then connected into dense regions using an expansion heuristic proposed in Giannotti et al. (2007). Once dense RoIs have been calculated we can compute the percentage of a representative trajectory that passes through those RoIs. RoIs are a coarse indicator of dataset density, whereas representative trajectories are a granular indicator. Therefore, any representative trajectory that does not pass through the much coarser RoIs cannot accurately represent the underlying dataset density. Results of this experiment, averaged across a range of clustering parameters, are shown in Fig. 9. The results in Fig. 9 indicate that across our experimental datasets Retraspam outperforms RepAvg in terms of effectively representing the underlying dataset density. Retraspam is a density-based approach, whereas RepAvg is not; therefore we expect these results to hold across other datasets as well.

The final experiment we conduct is a comparison of running times between Retraspam and RepAvg. Once again we perform this experiment on our experimental datasets and average the

Fig. 8. Athens truck dataset and the representative trajectories found using Retraspam (a) and RepAvg (b). Blue loops in (b) highlight the non-representative result. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
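The density-based RoI score used in the quantitative comparison can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure of Giannotti et al. (2007): trajectories are treated as sampled points rather than continuous lines, a cell is called dense when at least `min_trajectories` distinct trajectories visit it, and the expansion heuristic that merges dense cells into regions is omitted. The function names, the cell size and the threshold are all hypothetical.

```python
from collections import defaultdict


def cell_of(point, cell_size):
    """Grid cell (column, row) containing a point."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def dense_cells(trajectories, cell_size, min_trajectories):
    """Cells visited by at least `min_trajectories` distinct trajectories."""
    visitors = defaultdict(set)
    for traj_id, traj in enumerate(trajectories):
        for p in traj:
            visitors[cell_of(p, cell_size)].add(traj_id)
    return {c for c, ids in visitors.items() if len(ids) >= min_trajectories}


def roi_score(representative, dense, cell_size):
    """Percentage of representative points that fall inside dense cells."""
    if not representative:
        return 0.0
    inside = sum(1 for p in representative if cell_of(p, cell_size) in dense)
    return 100.0 * inside / len(representative)


# Three trajectories share a corridor along y = 0.5; one lone trajectory lies elsewhere.
corridor = [[(x + 0.5, 0.5) for x in range(5)] for _ in range(3)]
lone = [[(0.5, 3.5), (1.5, 3.5)]]
dense = dense_cells(corridor + lone, cell_size=1.0, min_trajectories=2)

rep_on_corridor = [(0.5, 0.5), (2.5, 0.5), (4.5, 0.5)]
rep_half_off = [(0.5, 0.5), (2.5, 0.5), (0.5, 3.5), (1.5, 3.5)]
print(roi_score(rep_on_corridor, dense, 1.0), roi_score(rep_half_off, dense, 1.0))
```

A representative that stays on the dense corridor scores 100%, while one that spends half of its points in the sparsely visited area scores 50%, which is the sense in which the score penalises representatives that leave the dense parts of the data.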

results across a range of parameters. The results of this experiment are shown in Fig. 10. The results in Fig. 10 reveal quantitatively that Retraspam far outperforms RepAvg as dataset size increases. We reason this is because Retraspam constructs representative trajectories using a PR-Tree, whereas RepAvg is exhaustive. Therefore, we expect this disparity in running speed to widen across even larger datasets.

6.4. Results summary and lessons from arbitrary dimension trajectory clustering

After conducting experiments a number of interesting results have been found. The following is a summary of the most significant findings with regard to the aims of this paper:

• Our proposed representative trajectory formulation method, Retraspam, proved to be quantitatively both more efficient and more effective than the existing approach, RepAvg.
• The application of our methodology to create ND-TRACLUS was shown to uncover previously unknown n-dimensional patterns from a range of trajectory datasets, whilst imposing no running time penalty.
• Our methodology extends existing approaches. Thus, as dimensionality increases, fundamental weaknesses of existing approaches, such as parameter sensitivity, are compounded.

Fig. 10. Running time of representative methods.


• Clustering with increasing dimensionality can be useful for confirming strong patterns that exist across dimensions within trajectory data, whilst also uncovering new patterns specific to particular dimensions. However, in our experience parameters for cluster formation generally have to be relaxed as additional dimensions are considered.
• Clustering in carefully selected sets of dimensions can reveal high dimensional results that have some correlation. In our experiments this process was performed manually; however, feature selection could perhaps automate the discovery of these correlated trajectory dimensions.

7. Conclusion

Trajectory clustering has already been shown to detect valuable patterns in trajectory data. However, existing approaches do not exploit the valuable sets of dimensions available in trajectory datasets. Therefore, in this paper we have presented a methodology which can be applied to perform n-dimensional trajectory clustering. Furthermore, given n-dimensional trajectory line segment clusters, we also present a representative trajectory formulation algorithm, Retraspam, to describe the core movement information of these clusters. We apply our methodology to an existing trajectory clustering approach, TRACLUS, and extend it to n-dimensions to create the so-called ND-TRACLUS. We then conduct a number of experiments to verify the usefulness of our framework by evaluating the efficiency and effectiveness of ND-TRACLUS and Retraspam. The experiments have shown that the algorithm produced by our methodology can produce effective n-dimensional trajectory clusters, which can then be represented efficiently and effectively by our n-dimensional representative trajectory formulation algorithm, Retraspam. The following are some of the future research directions:

• Evaluating the effect of true high dimensional trajectory clustering using a vast number of dimensions.
• Application of our methodology to other existing fixed dimensionality trajectory clustering approaches.
• Many of the methods in the literature (Elnekave, Last, & Maimon, 2007; Hu, Li, & Tian, 2013; Jensen, Lin, & Ooi, 2007; Yanwei et al., 2013) consider streaming trajectory clustering, given that GPS data is recorded incrementally; thus future research should consider how to handle incremental data.


References

Arge, L., Berg, M. D., Haverkort, H., & Yi, K. (2008). The priority r-tree: A practically efficient and worst-case optimal r-tree. ACM Transactions on Algorithms, 4(1), 9:1–9:30.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "nearest neighbor" meaningful. In C. Beeri & P. Buneman (Eds.), Database theory – ICDT’99. Lecture notes in computer science (Vol. 1540, pp. 217–235). Berlin Heidelberg: Springer.
Costa, G., Manco, G., & Masciari, E. (2014). Dealing with trajectory streams by clustering and mathematical transforms. Journal of Intelligent Information Systems, 42(1), 155–177.
Deng, Z., Hu, Y., Zhu, M., Huang, X., & Du, B. (2014). A scalable and fast optics for clustering trajectory big data. Cluster Computing, 1–14.
Elnekave, S., Last, M., & Maimon, O. (2007). Incremental clustering of mobile objects. In Proceedings of the IEEE 23rd international conference on data engineering workshop (pp. 585–592). IEEE Computer Society.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Second international conference on knowledge discovery and data mining (pp. 226–231). AAAI Press.
Giannotti, F., Nanni, M., Pinelli, F., & Pedreschi, D. (2007). Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 330–339). ACM.
Hu, W., Li, X., & Tian, G. (2013). An incremental DPMM-based method for trajectory clustering, modeling, and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5), 1051–1065.
Jensen, C. S., Lin, D., & Ooi, B. C. (2007). Continuous clustering of moving objects. IEEE Transactions on Knowledge and Data Engineering, 19(9), 1161–1174.
Lee, J.-G., Han, J., & Whang, K.-Y. (2007). Trajectory clustering: A partition-and-group framework. In Proceedings of the 2007 ACM SIGMOD international conference on management of data (pp. 593–604). ACM Press.
Lee, I., & Yang, J. (2009). Common clustering algorithms. Comprehensive chemometrics: Chemical and biochemical data analysis (Vol. 2, pp. 577–618). Elsevier.
Li, Z., Lee, J.-G., Li, X., & Han, J. (2010). Incremental clustering for trajectories. Proceedings of the 15th international conference on database systems for advanced applications (Vol. Part II, pp. 32–46). Springer-Verlag.
Pelekis, N., Kopanakis, I., Kotsifakos, E., Frentzos, E., & Theodoridis, Y. (2009). Clustering trajectories of moving objects in an uncertain world. In Proceedings of the ninth IEEE international conference on data mining (pp. 417–427). IEEE Computer Society.
Wu, H.-R., Yeh, M.-Y., & Chen, M.-S. (2013). Profiling moving objects by dividing and clustering trajectories spatiotemporally. IEEE Transactions on Knowledge and Data Engineering, 25(11), 2615–2628.
Yanwei, Y., Qin, W., Xiaodong, W., Huan, W., & Jie, H. (2013). Online clustering for trajectory data stream of moving objects. Computer Science and Information Systems, 10(3), 49.
Yuan, G., Xia, S., Zhang, L., Zhou, Y., & Ji, C. (2011). An efficient trajectory-clustering algorithm based on an index tree. Transactions of the Institute of Measurement and Control, 34(7), 850–861.
Zhu, H., Luo, J., Yin, H., & Zhou, X. (2010). Mining trajectory corridors using Frechet distance and meshing grids. In Proceedings of the 14th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 228–237). Vol. Part I.