Image and Vision Computing 32 (2014) 1102–1116
Review article
Dynamic scene understanding using temporal association rules☆ Ayesha M. Talha, Imran N. Junejo ⁎ Dept. Of Computer Science, University of Sharjah, U.A.E. 27272
Article info

Article history: Received 23 July 2013; Received in revised form 12 April 2014; Accepted 28 August 2014; Available online 20 October 2014

Keywords: Scene understanding; Computer vision; Association rules; Traffic surveillance
Abstract

The basic goal of scene understanding is to organize the video into sets of events and to find the associated temporal dependencies. Such systems aim to automatically interpret activities in the scene, as well as detect unusual events that could be of particular interest, such as traffic violations and unauthorized entry. The objective of this work, therefore, is to learn behaviors of multi-agent actions and interactions in a semi-supervised manner. Using tracked object trajectories, we organize similar motion trajectories into clusters using the spectral clustering technique. This set of clusters depicts the different paths/routes, i.e., the distinct events taking place at various locations in the scene. A temporal mining algorithm is used to mine interval-based frequent temporal patterns occurring in the scene. A temporal pattern indicates a set of events that are linked based on their relationship with other events in the set, and we use Allen's interval-based temporal logic to describe these relations. The resulting frequent patterns are used to generate temporal association rules, which convey the semantic information contained in the scene. Our overall aim is to generate rules that govern the dynamics of the scene and perform anomaly detection. We apply the proposed approach on two publicly available complex traffic datasets and demonstrate considerable improvements over the existing techniques.

© 2014 Elsevier B.V. All rights reserved.
Contents

1. Introduction
   1.1. Motivation
   1.2. Contributions
   1.3. Assumptions
2. Related work
3. Overview
4. Feature extraction and segmentation
5. Video association mining
   5.1. Event sequence representation
   5.2. Frequent temporal pattern discovery
        5.2.1. Complexity
        5.2.2. Pruning patterns
   5.3. Generating temporal association rules
        5.3.1. Complexity
6. Hierarchical spatio-temporal anomaly detection
   6.1. Spatial level
   6.2. Spatio-temporal level
7. Experimental evaluation
   7.1. Datasets
        7.1.1. Junction
        7.1.2. Roundabout
        7.1.3. Motion segmentation
   7.2. Discovered frequent temporal patterns
   7.3. Generated temporal association rules
   7.4. Spatial anomaly detection
   7.5. Spatio-temporal anomaly detection
   7.6. Comparative analysis
8. Conclusions
References

☆ This paper has been recommended for acceptance by Ivan Laptev.
⁎ Corresponding author. E-mail address: [email protected] (I.N. Junejo).
http://dx.doi.org/10.1016/j.imavis.2014.08.010
0262-8856/© 2014 Elsevier B.V. All rights reserved.
1. Introduction

In visual surveillance, there has been an increasing interest in recognizing object behaviors by interpreting the high-level semantics of scene dynamics. However, computing relationships between different actions in the scene, or detecting rare events in an ocean of video data, is a daunting task. Analyzing event interactions manually is practically impossible, yet current practice depends solely on human operators. In addition, as the scene gets crowded, the complexity of the relationships between the agents increases as well. Even though this has become an active research area, it remains a complex problem with many constraints, and an unsupervised method is required to make the task easier. An elegant solution to this problem can open doors to a wide spectrum of applications, such as video surveillance [1], anomaly detection [2], and crowd analysis [3].

Typically, the input to a dynamic scene analysis system is a video, and the first task is to detect moving objects and record their motion characteristics in the form of object trajectories (or optical flow). Each trajectory denotes an individual event in the scene during a time interval. This step is generally followed by behavior or activity segmentation, which identifies semantically meaningful components and groupings to reveal different events. Traditionally, algorithms such as K-means and fuzzy clustering have been used extensively, while many recent works have explored spectral clustering and normalized cuts [4]. The resulting clusters model the various events, indicating the spatial layout of the scene. Finally, the last step is to learn the temporal scene behavior. Behavior, in our context, describes the way an object acts in relation to the other objects in the scene; it can be defined as a sequence of events with spatial and temporal constraints. Recently, probabilistic methods such as Dynamic Bayesian Networks (DBN) [5], Hidden Markov Models (HMM) [6], and Probabilistic Topic Models (PTM) [2] have been used extensively by the computer vision community to learn the scene dynamics.

The dynamic scene understanding problem can thus be expressed as: obtain the motion patterns in the scene, build the scene structure and, lastly, interpret the high-level semantics of the scene. A dynamic scene may also involve multiple agents interacting with one another, and the actions may occur in parallel or recur over time. Thus, we are interested in answering questions such as: what is happening in the scene, where are the objects located, and how do they interact within their environment.

In this work, we aim at developing a robust system that can learn the scene model with minimal human intervention. In this regard, video mining can help extract salient information from a video without such supervision [7]. In order to analyze and discover the temporal interdependencies and relationships between various events occurring in a scene, we make use of temporal mining algorithms. These relationships between events are modeled as temporal patterns, discovered using a frequent temporal pattern mining algorithm. A frequent temporal pattern can be defined as a set of composite events that occur repetitively in the video, and these are expressed using temporal relations in Allen's taxonomy [8], such as before, after, and meet. Once these frequent patterns are obtained, forward temporal association rules are generated. These rules capture the correlations between the frequent temporal patterns present in the video.
We define an anomaly as an atypical behavioral pattern based entirely on the model in context; thus, every scene can have a different set of anomalies. In this work, anomaly detection is performed in a hierarchical manner. First, we identify unusual events within a spatial context. These spatial anomalies can be found once unique event clusters
are identified. The second type of anomalous behavior can be found by using frequent temporal patterns (and their time duration) to discriminate between the usual and the unusual complex composite events.

1.1. Motivation

Our goal is to extract complex activity patterns in a multi-agent environment. This is not trivial: in most real-world scenarios, the underlying dynamic scene behavior is very complex and perhaps ambiguous, making high-level activity interpretation a challenge. Most of the existing techniques employ various probabilistic models; however, learning and inference in such methods is computationally prohibitive. Moreover, as the scene gets crowded, the complexity of the relationships increases, which necessitates a huge amount of training data for accurate analysis. Therefore, in this work we propose to learn the scene dynamics using temporal mining techniques. The frequent pattern discovery algorithm utilized in this work is exploratory in nature. In addition, pattern matching allows for accurate and efficient anomaly detection.

1.2. Contributions

• To the best of our knowledge, temporal mining techniques have not been used for event recognition in dynamic scenes. We discover frequent temporal patterns using [9] to learn the scene behavior.
• We indicate exactly how two events are related (overlaps, equals, starts, etc.) using Allen's relations [8]. Moreover, we include the duration of composite events in each pattern.
• To eliminate spurious frequent temporal patterns, we suggest a few steps in Section 5.2 to prune the pattern space.
• Once these patterns are obtained, we generate temporal association rules. These temporal association rules help model the traffic cycle sequence, which is the main test domain for our work.
• Using a hierarchical anomaly detection algorithm, spatial anomalies are detected based on object trajectories, and spatio-temporal anomalies are identified using a frequent pattern matching approach.

1.3. Assumptions

• We track objects to obtain events that unfold over time. As with any trajectory-based approach, a good tracking algorithm is needed to overcome its inherent issues. In this paper, we focus only on vehicle motion in complex traffic scenes. Pedestrian activity is disregarded, as complete trajectories are hard to obtain in crowded scenes.
• For the temporal mining algorithms, user-defined parameters have to be determined by domain experts. Even though mining techniques do not require the definition of events or rules in advance, the temporal support and the confidence thresholds (cf. Table 1) have to be specified.

The work is organized as follows: Section 2 presents some existing works on the topic. Section 3 briefly describes the proposed methodology. Section 4 focuses on feature extraction and segmentation, while Section 5 presents the second phase, i.e., using video mining techniques to learn the dynamic scene model. The anomaly detection methodology is discussed at length in Section 6. Experiments are conducted on two datasets, and the results with evaluation measures are illustrated and explained in Section 7, followed by conclusions in Section 8.
Table 1
List of temporal mining symbols.

Symbol       Description
s            Starting time of a given event
f            Finishing time of a given event
k            Dimension of the pattern (number of intervals)
patt         A frequent temporal pattern
duration     Time duration of a composite event based on the relation
support      The fraction of all sequences in the dataset in which a pattern is contained
confidence   The probability of the consequent occurring given the antecedent
minS         Minimum support threshold; a pattern is frequent if its support is ≥ minS
minC         Minimum confidence threshold; a rule is strong if its confidence is ≥ minC
2. Related work

Existing approaches in the literature generally start with motion feature extraction, such as object trajectories or optical flow. Event modeling is then done by clustering these features using similarity-based distance measures. Trajectory-based approaches [3,10–12] primarily rely on how well a tracker performs; the results may be compromised in crowded scenarios due to the presence of multiple objects, inter-object occlusions and low-resolution videos [3,13]. In their seminal work, Stauffer and Grimson [14] present a robust tracker based on a novel probabilistic method for background subtraction. Their method models each pixel as a mixture of Gaussians, and an online approximation is used to update the model. On the other hand, if the video at hand is of high resolution, both spatially and temporally, optical flow may be the feature of choice [1,2,6,13,15–19]. Yang et al. [20] represent the direction of pixel-wise optical flow as a Bag of Words and apply Diffusion Maps to it; the key motion patterns are obtained by performing a spectral analysis on the Markov matrix of the graph. Saleemi et al. [21] represent the salient unquantized optical flow patterns as a Gaussian Mixture Model (GMM), initialized using K-means, and show that motion patterns can be statistically inferred by computing the conditional optical flow expectation. A key limitation of all the methods discussed above is that they model the scene structure, in terms of spatial layout and basic object patterns, while ignoring the higher-level global scene behavior.

Some researchers have dealt with scene learning using generative models. A generative model is a hierarchical probabilistic model which typically has some hidden parameters; in such probabilistic approaches, events are classified as outliers if they have a low likelihood under the learned model. Kratz and Nishino [6] use a multivariate Gaussian distribution on spatio-temporal gradients, and model temporal behaviors with a distribution-based HMM. They then use the symmetric Kullback–Leibler (KL) divergence distance measure to extract prototypes, and statistical anomalies are detected via the forward–backward algorithm. Hervieu et al. [22] compute trajectories with a tracking method using a color-based particle filter. Trajectories are defined by combining curvature and motion magnitude; the temporal causality of the features is then captured by HMMs, and the similarity between trajectories is expressed by exploiting the quantization-based HMM framework. HMMs are utilized in the last two approaches, and their main drawback is that they require more training data than other approaches in order to avoid over-fitting. Shi et al. [23] train Propagation Nets (an extension of DBNs) and boosted event detectors for activity modeling and recognition. Activities represent the duration of intervals explicitly and are ordered with respect to time; the main limitation is that the structure of the Net has to be predefined, hence restricting the number of events in an activity. Benezeth et al. [24] obtain motion labels by background subtraction, and then learn co-occurrence statistics for a Markov Random Field (MRF) potential function to detect abnormalities via a likelihood ratio test. However, they correlate activities within a short time frame and hence are unable to model complex dynamic scene behavior. Prabhakar et al. [25] eliminate the need for accurate flows or trajectories when exploring the concept of causality to express event relations. They model space–time interest points by a point-process model and then identify patterns of interaction over long intervals using Granger causality. However, their approach is restricted to classifying videos with just one interaction.

Probabilistic Topic Models (PTMs) are capable of handling noise in motion features. These models divide the video into clips (documents)
Fig. 1. The proposed dynamic scene analysis framework.
Fig. 2. Observed raw trajectories.
by quantizing pixel motion in the image. Kuettel et al. [1] first extract activities based on a Hierarchical Dirichlet Process (HDP) model. Next, they learn co-occurring activities jointly with their time dependencies using a new model, the Dependent Dirichlet Process (DDP)-HMM; lastly, Gibbs sampling is used to model the training data. However, the temporal order of activities is considered only at the global scene level, and the inter-clip traffic state changes are disregarded as well. Similarly, Jouneau and Carincotte [10] use HMMs to classify particle-based trajectories and utilize HDP-HMMs to identify co-occurring trajectories. Furthermore, they discover the temporal relations between trajectories as well as the abnormalities, where a clip is labeled as an anomaly even if there is some normal interaction within the sequence. In [26], Hospedales et al. build a Markov Clustering TM based on DBN and LDA that can capture the temporal characteristics of the object behaviors hierarchically. However, their hybrid model requires the user to define the number of topics in advance. Moreover, as we show in Section 7.6, they find only one global rule for any dynamic scene as a result of ignoring the temporal order aspect. Varadarajan and Odobez [17] also made use of topic models. Their pLSA-TM found the spatio-temporal correlations among optical flow features, and abnormal events are found by computing the likelihood from modeled GMMs. But there is no incorporation of time information in their model. This is a major limitation of topic models, as they fail to establish temporal dependencies among low-level features; additionally, they are computationally expensive. In [27], however, the same authors mine sequential patterns to overcome these shortcomings.

Zhou et al. [3] start with fragments of trajectories called tracklets, obtained from a keypoint tracker. A Random Field TM is used to cluster these tracklets. The LDA model is then integrated with MRFs as a prior to enforce spatial and temporal coherence between the tracklets. However, it cannot be readily used for visual surveillance, as estimating the trajectory of each individual in the scene is nearly impossible. In order to determine the shared number of topics automatically and overcome the main limitation of topic models, non-parametric approaches are required. For instance, Emonet et al. [15] mine activities hierarchically in videos using temporal motifs and discover the periodic cycle dictated by traffic lights. While topic model approaches do not correlate atomic activities inside a clip, Zen and Ricci [28] propose an Earth Mover's Distance (EMD) prototype learning algorithm to overcome this limitation. Here, temporal segmentation of the video is done to obtain a set of multi-scale patterns, which are represented using histograms; lastly, a multi-scale anomaly score is used to find anomalous patterns. However, they fail to model the global scene behavior.

On the other hand, some works use rule-based approaches and event logic for event recognition. Morariu and Davis [11] represent low-level features using Allen's Interval Logic. We utilize Allen's Logic [8] as
Fig. 3. The obtained events: from A to H for the Junction dataset.
Fig. 4. The obtained events: from A to E for the Roundabout dataset.
well, since it is quite powerful for event reasoning. Probabilistic inference is done using Markov Logic Networks (MLN), where knowledge is provided manually via rules. Brendel et al. [29] represent temporal constraints among events using Probabilistic Event Logic (PEL), where a PEL knowledge base comprises a set of weighted logic formulas. However, in both of these approaches, rules have to be defined manually to capture inter-related relationships between events. Similarly, Zhang et al. [12] demonstrate the utility of a grammar-based approach using Allen's Logic. Even though the authors manage to learn event rules automatically using a Minimum Description Length-based grammar induction algorithm built on a Stochastic Context-Free Grammar (SCFG), their parser is unable to handle sub-events with parallel temporal relations. Sridhar et al. [30] discover event classes from videos by learning the spatio-temporal relationships between objects across several sets of interacting tracks in space and time (using Allen's logic). Then, a Markov Chain Monte Carlo procedure is used to interpret the event class represented by the activities. However, their approach requires huge datasets to learn events. Zhang et al. [31] propose an interval temporal Bayesian network to overcome the shortcomings of traditional graphical models. They combine Bayesian networks with interval algebra (i.e., Allen's 13 temporal relations) to model multiple event dependencies over time intervals in order to recognize both parallel and sequential activities. The structure and parameters are learned automatically from training data using advanced algorithms. Unlike our approach, their model is unable to capture multiple occurrences of the same event in an activity. None of these approaches go beyond learning event rules, i.e., they do not attempt anomaly detection.

In contrast to grammar-based approaches, Hamid et al. [32] adopt a more data-driven approach. They model contiguous activity subsequences as event n-grams and use them to learn the activity structure. This allows them to discover the various activity classes and find abnormal events. Although they demonstrate the applicability of the approach in various domains, the atomic actions as well as the start and end of activities are known a priori. Moreover, overlapping activities and parallel events cannot be modeled with their method.

Most of the approaches discussed previously derive their scene models based on activities taking place at fixed points in time. In our work, on the other hand, we model events based on their duration. Our approach uses temporal video mining algorithms. Video mining comprises two main steps: (1) transforming the video into clusters, and (2) applying temporal mining algorithms to identify and establish relationships between activities or events [33]. Jakkula et al. [34] use the Apriori algorithm to mine temporal patterns to build a robust anomaly detection system for smart homes. They utilize a probabilistic model where the evidence is calculated for current activities with respect to previous activities. Their work focuses simply on deviations in the movements of a single individual, whereas we are concerned with modeling multi-object correlations in a complex dynamic scene.

3. Overview

The proposed approach is illustrated in Fig. 1 and comprises the following steps:

• Feature extraction: We employ a semi-automatic mean-shift tracker [35] to obtain object trajectories.
• Motion segmentation: Spectral clustering is used to cluster trajectories into different event classes. The number of clusters is determined iteratively.
• Learning frequent temporal patterns: Relationships are discovered between events based on their time duration characteristics. Temporal patterns, often represented by composite events consisting of a hierarchy of relations, have the time duration of each relationship appended to them. Temporal patterns are retained based on their frequency of occurrence.
• Generating temporal association rules: Temporal association rules are generated from pairs of frequent temporal patterns. Forward temporal association rules are formulated based on the confidence, to be defined later.
• Detecting anomalous events: Anomalies are detected at two levels: 1) the spatial level, where test trajectories are matched to the model clusters to determine their membership, and 2) the spatio-temporal level, where anomalies are detected based on the scene model represented by the learned frequent temporal patterns.

4. Feature extraction and segmentation

Objects tend to follow common pathways in a traffic scenario, and two key points are of particular interest: the entry point, where an object appears in the scene, and the exit point, where it disappears from the scene. Since we focus solely on traffic scenarios in this work, we use [35] to perform the object tracking, and pedestrian trajectories, if any, are subsequently removed (as in [28,36]). Moving average low-pass filters are used to remove noise from the trajectories. The extracted vehicle trajectories are of unequal lengths, and Dynamic Time Warping (DTW), based on the simple Euclidean distance, is used to find the distance between trajectories [37].
Table 2
Allen's temporal relationships.

r   Temporal relation   Endpoints               Inverse relation     Duration
b   X before Y          sx < fx < sy < fy       Y after X            sy − sx
m   X meets Y           sx < sy = fx < fy       Y met by X           fx ∨ sy
e   X equals Y          sx = sy < fx = fy       Y equals X           fx − sx
o   X overlaps Y        sx < sy < fx < fy       Y overlapped by X    fx − sy
s   X starts Y          sx = sy < fx < fy       Y starts with X      fx − sx ∨ fy − sy
d   X during Y          sy < sx < fx < fy       Y contains X         fy − sy
f   X finishes Y        sy < sx < fx = fy       Y finishes with X    fx − sx ∨ fy − sy
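As a small illustration of the endpoint constraints in Table 2 (a sketch, not part of the authors' implementation), the relation between two intervals X = [sx, fx] and Y = [sy, fy] can be read off directly from endpoint comparisons. The function below returns the symbol from the first column of the table, and None when only the inverse relation applies (in which case the caller can swap the arguments):

```python
def allen_relation(sx, fx, sy, fy):
    """Classify the Allen relation of interval X = [sx, fx] w.r.t. Y = [sy, fy],
    following the endpoint constraints of Table 2."""
    if fx < sy:
        return 'b'                      # X before Y
    if fx == sy:
        return 'm'                      # X meets Y
    if sx == sy and fx == fy:
        return 'e'                      # X equals Y
    if sx == sy and fx < fy:
        return 's'                      # X starts Y
    if sy < sx and fx == fy:
        return 'f'                      # X finishes Y
    if sy < sx and fx < fy:
        return 'd'                      # X during Y
    if sx < sy < fx < fy:
        return 'o'                      # X overlaps Y
    return None                         # only the inverse relation holds; swap X and Y

# Example: event B = [0, 9] meets event C = [9, 11] in the first sequence of Table 3
assert allen_relation(0, 9, 9, 11) == 'm'
```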
Fig. 5. The composite event pattern (above) and the associated temporal relation matrix.
Fig. 6. Temporal association rule.
For two trajectories Ta and Tb of length n and m, respectively, let DTW(Ta, Tb) denote the DTW distance between them. We use spectral clustering, where an affinity matrix is constructed from these DTW distances within a Gaussian kernel function [38]: for two given trajectories, the similarity function is defined as

s(Ta, Tb) = exp( −DTW²(Ta, Tb) / (σa σb) ).     (1)

Selecting a σ value for each trajectory can be done simply by studying the statistics of the trajectory neighborhood. For instance,

σa = DTW(Ta, Tc).     (2)

This implies that the local scaling parameter σa of Ta is the DTW distance between Ta and the c-th nearest trajectory.
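As a rough, illustrative sketch of this step (not the authors' implementation), the affinity of Eq. (1) with the local scaling of Eq. (2) can be built from pairwise DTW distances and handed to an off-the-shelf spectral clustering routine; the neighbour index c and the cluster count used below are assumed values.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def dtw(a, b):
    """DTW distance between two trajectories a, b of shape (len, 2),
    using the Euclidean distance between points."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def cluster_trajectories(trajs, n_clusters=8, c=7):
    """Affinity of Eq. (1) with local scaling sigma_a = DTW(T_a, T_c) (Eq. (2)),
    followed by spectral clustering into the dominant events."""
    N = len(trajs)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            D[i, j] = D[j, i] = dtw(trajs[i], trajs[j])
    sigma = np.sort(D, axis=1)[:, c]            # distance to the c-th nearest trajectory
    S = np.exp(-D ** 2 / np.outer(sigma, sigma))
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(S)
```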
Table 3
Event sequences.

A 0 5    B 0 9    C 9 11   B 10 11
C 0 7    B 0 3    A 3 11   B 9 11
A 0 11   C 1 6    D 1 5    C 7 11   D 10 11
A 0 4    D 0 1    C 0 3    E 6 7    G 7 11
It is important to note that the resulting similarity measure is symmetric in nature. K-means is then employed in the last spectral clustering step to obtain the dominant events in the scene [39].

Raw trajectories for the datasets used in this work (explained in detail in Section 7) are shown in Fig. 2. Trajectories with 25 observations (frames) or fewer are eliminated; these are likely tracking errors, as any vehicle motion lasts more than 1 s. Events are represented by the clustered trajectories, where each cluster is representative of an individual route in the scene. Using [39], the number of clusters in Junction (i.e., dominant routes) was found to be 8, and these are shown in Fig. 3 with their respective event descriptions. The Roundabout dataset has 5 dominant routes; the discovered events and descriptions are given in Fig. 4.

The accuracy of spectral clustering can be evaluated by finding the one-to-one mapping between the ground truth and the clustering labels. The clustering quality is measured by the Correct Clustering Rate (CCR) [40]. A high CCR value indicates that different trajectories are grouped in different clusters, while trajectories that are spatially close are assigned to the same cluster:

CCR = (1/N) Σ_{c=1}^{K} t_c.     (3)
N denotes the total number of trajectories in the dataset, and the summation counts the number of trajectories correctly assigned to each cluster, where t_c signifies the total number of trajectories belonging to the c-th cluster. Using Eq. (3), the resulting accuracy for Junction is 95.6%. The Roundabout dataset has a similar accuracy of 95.8%.

Fig. 7. Minimum support threshold and minimum confidence threshold.

Table 4
k-pattern sets.

              Junction           Roundabout
              Before    After    Before    After
2-patterns    55        40       23        17
3-patterns    33        26       5         4
4-patterns    4         3        1         1

5. Video association mining

Events reoccur over time, which means that each event corresponds to multiple time intervals. We first form event sequences and then extract the frequent temporal patterns from them. Allen's first-order interval logic is used to describe relationships between event pairs in sequences. Next, temporal association rules are generated from the obtained frequent patterns (Section 5.3). Association rules are used to predict future events or the expected behavior between various objects in the scene. These rules help infer the underlying dynamics of the traffic cycle contained within the scene. Table 1 shows the symbols used in this paper.

5.1. Event sequence representation

An ordered event sequence is defined by a series of triplets

(e1, s1, f1), (e2, s2, f2), (e3, s3, f3), …     (4)

where each triplet has an associated event e and occurs during a closed interval [s, f], with s and f defined in Table 1. Event triplets are ordered by their start time in temporal order, i.e., triplet1 is before triplet2 iff:

1) s1 < s2, or
2) s1 = s2 and f1 < f2, or
3) s1 = s2 and f1 = f2 and e1 < e2.

Every event triplet has to be maximal, such that intervals for the same event do not overlap in the same sequence. If they do, they are merged into one interval [9]:

(event label, min{s1, s2}, max{f1, f2}).     (5)

This event sequence set is utilized for mining the frequent temporal patterns, as described next.
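A minimal sketch of this representation (assuming the triplet format above; not the authors' code) merges overlapping same-event intervals as in Eq. (5) and orders the resulting triplets by the three rules listed:

```python
from collections import defaultdict

def build_event_sequence(triplets):
    """Build one ordered event sequence (Section 5.1): merge overlapping intervals
    of the same event label as in Eq. (5), then order the triplets by
    (start, finish, label). `triplets` is a list of (label, start, finish)."""
    by_label = defaultdict(list)
    for e, s, f in sorted(triplets, key=lambda t: (t[1], t[2])):
        ivals = by_label[e]
        if ivals and s <= ivals[-1][1]:                        # same-event intervals overlap
            ivals[-1] = (ivals[-1][0], max(ivals[-1][1], f))   # Eq. (5): min start, max finish
        else:
            ivals.append((s, f))
    merged = [(e, s, f) for e, ivals in by_label.items() for s, f in ivals]
    return sorted(merged, key=lambda t: (t[1], t[2], t[0]))

# The first Junction sequence of Table 3 is already maximal and ordered:
print(build_event_sequence([('A', 0, 5), ('B', 0, 9), ('C', 9, 11), ('B', 10, 11)]))
# -> [('A', 0, 5), ('B', 0, 9), ('C', 9, 11), ('B', 10, 11)]
```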
5.2. Frequent temporal pattern discovery

Table 5
Junction frequent temporal patterns.

2-patterns:
before(D,E)[4.5], {4.35%}    before(A,G)[7], {2.17%}    before(C,B)[9], {2.17%}
equals(A,B)[11], {2.17%}     equals(C,D)[3], {2.17%}
meets(B,C)[8.5], {4.35%}     meets(H,E)[4], {2.17%}
starts(C,D)[4.5], {4.35%}    starts(G,E)[1], {2.17%}
finishes(D,C)[5.5], {4.35%}
during(A,B)[6], {6.52%}      during(H,F)[3.5], {4.35%}
overlaps(B,D)[1], {2.17%}    overlaps(E,G)[2], {4.35%}

3-patterns:
finishes(overlaps(C,A),B) [2], {2.17%}
before(starts(A,C),G) [7], {2.17%}
during(before(D,G),E) [4], {2.17%}

4-patterns:
during(before(starts(D,C),G),E) [4], {2.17%}
meets(before(starts(A,C),E),G) [7], {2.17%}

Table 6
Roundabout frequent temporal patterns.

2-patterns:
starts(A,B)[5], {5%}
overlaps(D,E)[4], {2.5%}
overlaps(D,C)[4], {5%}
before(E,A)[9], {5%}
meets(D,E)[10], {2.5%}

3-patterns:
finishes(before(E,B),A) [3], {2.5%}
before(starts(B,A),D) [7], {2.5%}

4-patterns:
finishes(before(starts(B,A),D),C) [2], {2.5%}

An event in the scene may have a temporal relation with other events in the same sequence, and there might also be composite events. Allen's symbolic temporal relations [8] are used to encode these relationships; Table 2 shows Allen's relations and the related constraints. A composite event consisting of k events, i.e., a temporal pattern of length k, is also known as a k-pattern. In order to capture the inter-event time constraints, the time duration is computed for each pair of events and incorporated in the temporal pattern structure. The temporal relation among the k events is captured by a k × k matrix. This temporal matrix is anti-symmetric and describes Allen's relations between the atomic events used to form the composite event; the upper triangle is sufficient to convey the k-pattern. An instance of a canonical 3-pattern is shown with its temporal matrix in Fig. 5, where the duration of the composite events is expressed in seconds. Such a temporal pattern matrix can also be interpreted as the adjacency matrix of a graph, having interval relationships as edge labels [41].

The support of a temporal pattern is defined as the number of event sequences in which the pattern occurs. A k-pattern can only be generated if its support exceeds a minimum threshold, minS. A frequent pattern is defined as a pattern that occurs many times in the data, and we aim at unearthing these interesting patterns [41]. Apriori [42] is the first algorithm designed for frequent itemset mining; it is based on a level-wise principle as well as the anti-monotone property of set theory. Our algorithm of choice, IEMiner [9], is based on Apriori but, unlike Apriori, does not require multiple scans of the event sequences. For the sake of completeness, Algorithm 1 describes the approach briefly:

• Candidate generation: Initially, all the 2-patterns are generated. In subsequent passes, a (k + 1)-pattern is generated from a frequent k-pattern and a 2-pattern. This is done by storing the dominant event in each pattern, which is the event that has the latest end time among all events in the pattern. A (k + 1)-pattern is formed by joining a k-pattern and a frequent 2-pattern if they share the same dominant event, but it is considered a candidate pattern if and only if the 2-pattern involved occurs in at least (k − 1) frequent k-patterns.
• Counting support: At each level of the algorithm, the number of occurrences of each candidate is counted to determine whether it is frequent. This is achieved by maintaining a list of active events, i.e., events that are taking place within the time interval under observation.

The final output is a set of frequent temporal patterns, along with their support measure. Each temporal pattern represents a composite event in the scene. Furthermore, each frequent pattern relationship carries its time duration, computed as the average of the individual durations of each pattern.

5.2.1. Complexity

The algorithm checks the occurrence of a k-pattern in a set of event sequences with s events, which takes O(ks) time. As a whole, the computational complexity is O(nks), as the procedure is repeated for n k-patterns. However, since the value of k is generally less than 10 and IEMiner scans the event sequences only once (to count the support) for s events, the running time is greatly reduced.

5.2.2. Pruning patterns

The IEMiner algorithm, unfortunately, generates a huge number of temporal patterns.

Algorithm 1. Learning frequent temporal patterns.
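For intuition only, the following simplified sketch mines frequent 2-patterns in an Apriori-like, level-wise spirit; it is not the IEMiner algorithm of [9], which additionally joins k-patterns with 2-patterns via dominant events and avoids rescanning the sequences. The `relation` argument is a function such as the allen_relation sketch shown after Table 2, and the composite duration used here (span from the earliest start to the latest finish) is an assumption rather than the per-relation duration of Table 2.

```python
from itertools import combinations
from collections import Counter, defaultdict

def frequent_2_patterns(sequences, min_support, relation):
    """Count, over all event sequences, how often each (relation, e1, e2) 2-pattern
    occurs, and keep those whose support (fraction of sequences) reaches min_support.
    `sequences` are lists of (label, start, finish) triplets; `relation` classifies
    the Allen relation between two intervals."""
    counts = Counter()
    durations = defaultdict(list)
    for seq in sequences:
        seen = set()
        for (e1, s1, f1), (e2, s2, f2) in combinations(seq, 2):
            r = relation(s1, f1, s2, f2)
            if r is None:
                continue
            key = (r, e1, e2)
            if key not in seen:                 # support counts sequences, not instances
                seen.add(key)
                counts[key] += 1
            durations[key].append(max(f1, f2) - min(s1, s2))   # crude composite duration
    n = len(sequences)
    return {k: (c / n, sum(durations[k]) / len(durations[k]))
            for k, c in counts.items() if c / n >= min_support}
```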
Table 7
Junction rules for the dominant traffic cycle.

starts(A,B) [4] → meets(starts(A,B),C) [9] {50%}
before(starts(D,C),G) [5.5] → overlaps(before(starts(D,C),G),E) [4] {50%}
before(G,F) [4] → during(before(G,F),H) [5] {100%}
during(F,H) [3.5] → before(during(F,H),A) [4] {50%}
We observed in our experiments that a relationship {X r Y} may be represented by different temporal relations that are, in essence, quite similar. While mining temporal patterns using [9], this concern was ignored. Thus, we introduce global constraints to prune the pattern search space. We obtain the disjunctive combination of temporal patterns by applying the following rules:

1. Add the supports of 2-patterns whose relation is any of [overlaps, equals, during].
2. Add the supports if patt1 has relation starts/finishes and patt2 has any one of [overlaps, equals, during].
3. Eliminate rules where a relationship is established between the same event, such as A starts A.

Consequently, the combined support of the resulting pattern has to be calculated. The disjunction of patterns is computed by adding their individual supports. The resulting combination is retained only if the combined support is greater than or equal to minS. The final step is to recompute the time duration as the average of the combined durations.

Fig. 9. Junction traffic alternate light cycle.

5.3. Generating temporal association rules

Once frequent temporal patterns are discovered, the goal is to obtain meaningful temporal association rules that model the spatio-temporal events in the scene. We can think of this step as modeling the global behavior of the objects in the scene, while the previous step of computing the frequent temporal patterns finds the local behavior between the scene objects. A rule comprises a pair of propositions with the following implication: when the left-hand side, the antecedent, is true, then the right-hand side, the consequent, will be true as well. Just as a pattern's frequency is measured by its support, a rule's strength is measured by its confidence. The support of a pattern is the joint probability of events X and Y, i.e., P(X,Y), while the confidence (cf. Eq. (6)) is defined as the conditional probability of the two patterns, P(patt2 | patt1). A rule is formed if the antecedent is a sub-pattern of the consequent, and it is of interest only if it has a high confidence value:

confidence = support_Y / support_X.     (6)

Temporal association rule mining [43] is used to discover time-dependent correlations between events in data. Briefly, the algorithm (cf. Algorithm 2) operates as follows: a temporal association rule is constructed from every pair of frequent temporal patterns, patt1 and patt2, where the antecedent pattern has to be at least a 2-pattern. A rule is generated if its confidence is higher than minC. An example of a temporal rule [2-pattern → 3-pattern] is illustrated in Fig. 6:

(B starts A)[4.5 s] → (overlaps(B starts A), C)[1.25 s] {confidence = 100%}.

It shows that events A and B start together, followed by the co-occurrence of events A and C.

Table 8
Junction rules for alternate traffic light cycle.

starts(A,C) [2.5] → before(starts(A,C),D) [4], {75%}
meets(A,D) [2] → finishes(meets(A,D),B) [3], {50%}

Algorithm 2. Temporal association rule generation.
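A sketch of forward rule generation in the spirit of Algorithm 2 and Eq. (6) (not the authors' code): every frequent pattern that is a proper sub-pattern of another yields a candidate rule, which is kept when its confidence reaches minC. Patterns are represented here, for simplicity, as frozensets of 2-pattern components, and the supports in the toy example are hypothetical.

```python
def generate_rules(frequent_patterns, min_conf):
    """Forward temporal association rules: for every pair where the antecedent is a
    proper sub-pattern of the consequent, keep the rule if
    confidence = support(consequent) / support(antecedent) >= min_conf (Eq. (6)).
    `frequent_patterns` maps a pattern (frozenset of (relation, e1, e2) components)
    to its support."""
    rules = []
    for ante, sup_a in frequent_patterns.items():
        for cons, sup_c in frequent_patterns.items():
            if ante < cons:                       # proper sub-pattern check
                conf = sup_c / sup_a
                if conf >= min_conf:
                    rules.append((ante, cons, conf))
    return rules

# Hypothetical toy supports, expressed as fractions of sequences
patterns = {frozenset({('s', 'A', 'B')}): 0.04,
            frozenset({('s', 'A', 'B'), ('m', 'B', 'C')}): 0.02}
print(generate_rules(patterns, min_conf=0.5))     # one rule with confidence 0.5
```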
Fig. 10. Roundabout: traffic light cycle.

5.3.1. Complexity

Rule generation has almost negligible running time compared to frequent pattern discovery. For n k-patterns with a maximum of s events, the complexity is O(2sn). Usually, the resulting number of frequent k-patterns is limited (cf. Table 4), and hence temporal association rule generation is very efficient.

6. Hierarchical spatio-temporal anomaly detection

6.1. Spatial level

Each trajectory cluster defines a single event, and each event is represented by its cluster centroid. That is, the centroid models the general appearance of trajectories for any given event [37]. Having obtained the individual events in the scene, trajectories in test clips are classified to their respective event categories. The nearest-neighbor classification scheme is utilized for this purpose, where the distance of each test trajectory to all centroid trajectories is computed using the DTW distance measure. In order to classify trajectories with similar spatial characteristics but opposite direction, we calculate the direction vector of the test trajectory, followed by the direction vector of the centroid trajectory of the cluster it is classified to. Next, the DTW distance between these two directional vectors is obtained. If the distance is greater than the threshold of the cluster it belongs to, the trajectory is regarded as an outlier. The threshold is the maximum distance among all trajectories in a cluster with respect to its centroid trajectory. This directional information helps find anomalies that deviate greatly from the centroid trajectory of the membership cluster, such as illegal U-turns or cars moving in the opposite direction, as shown in Fig. 11.

Table 9
Roundabout rules governing the scene.

before(starts(B,A),D) [7] → finishes(before(starts(B,A),D),C) [2] {100%}
before(F,B) [7] → finishes(before(F,B),A) [3] {100%}

Algorithm 3. Spatio-temporal anomaly detection.

Fig. 11. Junction: illegal U-turn.

6.2. Spatio-temporal level

To detect spatio-temporal anomalies, the first step is to obtain the event sequence from the test video. Next, a composite event pattern is extracted from the test event sequence. For anomaly detection, patterns are matched in a hierarchical manner against the frequent patterns obtained from the training set, after pruning has been done. The procedure starts by comparing 2-patterns from the test sequence to the 2-patterns from the set of frequent patterns. Additionally, the law of transitivity [44] is incorporated in the matching process to avoid false positives. The composite test event time duration should also be within an acceptable range, based on the durations of the trained frequent patterns:

μ_trainP − 2 σ_trainP ≤ duration_testP ≤ μ_trainP + 2 σ_trainP.     (7)

The same procedure is repeated for patterns at higher levels. If a mismatch is encountered, the test sequence is classified as an anomaly. The hierarchical anomaly detection procedure is listed as Algorithm 3.
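The hierarchical check could be sketched roughly as below (assumptions: `dtw` is a trajectory distance such as the earlier DTW sketch, trajectories are NumPy arrays, and trained patterns are summarised by mean and standard-deviation durations as in Eq. (7)); this is illustrative and not the authors' Algorithm 3.

```python
import numpy as np

def is_spatial_anomaly(traj, centroids, thresholds, dtw):
    """Spatial level (Section 6.1), roughly sketched: assign the test trajectory
    (an (n, 2) array) to its nearest event centroid under DTW, then flag it when
    the DTW distance between the direction vectors exceeds that cluster's
    learned threshold (the maximum intra-cluster distance)."""
    k = int(np.argmin([dtw(traj, c) for c in centroids]))
    dir_dist = dtw(np.diff(traj, axis=0), np.diff(centroids[k], axis=0))
    return dir_dist > thresholds[k], k

def is_spatio_temporal_anomaly(test_patterns, trained):
    """Spatio-temporal level (Section 6.2): every composite pattern extracted from
    the test sequence must match a trained frequent pattern, and its duration must
    fall inside the Eq. (7) range mu +/- 2*sigma. `trained` maps a pattern key,
    e.g. ('s', 'A', 'B'), to its (mean, std) duration."""
    for key, duration in test_patterns:
        if key not in trained:
            return True                                   # unseen composite event
        mu, sigma = trained[key]
        if not (mu - 2 * sigma <= duration <= mu + 2 * sigma):
            return True                                   # implausible duration
    return False

# Toy check: starts(F,A) never became frequent in Junction training (cf. Fig. 13)
trained = {('s', 'A', 'B'): (4.5, 0.8)}                   # hypothetical statistics
print(is_spatio_temporal_anomaly([(('s', 'F', 'A'), 3.0)], trained))   # -> True
```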
7. Experimental evaluation

7.1. Datasets

We test our system on two public datasets [45]. These datasets feature complex activities between numerous agents in the scene, governed by traffic lights.

7.1.1. Junction

The Junction dataset contains a busy street intersection where the traffic flow in different directions is regulated by the traffic lights, which imposes a temporal order on the scene behavior. The video is approximately 50 min long, recorded at ps, with a frame size of 360 × 288. The first 8 min of the video were used for training, while the remainder was retained for testing purposes.

7.1.2. Roundabout

This is a more complex dataset, as vehicles follow a circular path. Similar to Junction, traffic flows in different directions are regulated by traffic lights. The video is approximately 62 min long, recorded at 25 fps, with a frame size of 360 × 288. The video is divided into two parts: the training set (12 min) and the remainder for testing.

7.1.3. Motion segmentation

The raw trajectories for both datasets were shown above in Fig. 2. The number of clusters (i.e., dominant routes) was found to be 8 for Junction and 5 for Roundabout.

7.2. Discovered frequent temporal patterns

Having classified trajectories into events, the event sequences for each clip can now be formulated. Each event sequence has a list of events, each with its start time and end time. Some sample event sequences from Junction are shown in Table 3, where [A 0 5] in the first sequence indicates that event A lasts for 5 s from the start time within that sequence.

A frequent temporal pattern mining algorithm, as defined in Section 5.2, is used to learn recurring patterns in the scene. One of the advantages of using a temporal mining technique is its fast speed and low computational cost. These event sequences are used as input to the mining algorithm, and the output is composed of sets of k-patterns. Each k-pattern shows a composite event with its time duration and support. Sequences in these datasets are short and thus the resulting patterns have a low frequency count; consequently, the value of the minS threshold has to be low.
Fig. 12. Roundabout: driving in the wrong direction.
Fig. 13. Junction: fire truck interrupting traffic.
Experiments were conducted with different values of the minS threshold, and the resulting number of patterns for each value is shown in Fig. 7a. Based on this experimentation, minS is set to 2% for both datasets. This threshold ensures that we produce neither excessive patterns nor too few. For Junction, this resulted in a total of 92 patterns, while Roundabout has 29 patterns in all. Roundabout has fewer activities, and hence a small number of patterns is sufficient to model its underlying dynamic behavior. The total numbers of frequent k-pattern sets before and after pruning for both datasets are shown in Table 4.

Lastly, we discuss the k-pattern representations. Each frequent temporal pattern shows atomic events along with their relations, appended with its average time duration. Additionally, the support of each pattern, shown as a percentage here, is appended at the end of the pattern. A sample of k-patterns mined from the Junction dataset is shown in Table 5, while sample k-patterns resulting from the Roundabout training set are shown in Table 6. The resulting patterns are composed using all seven of Allen's temporal relations. We have listed only a few patterns for each relation, and the first 2-pattern from each category is explained as:

1) before(D,E)[4.5], {4.35%}: event D occurs 4.5 s before event E.
2) equals(A,B)[11], {2.17%}: event A and event B occur together for 11 s.
3) meets(B,C)[8.5], {4.35%}: event B finishes after 8.5 s, and event C follows.
4) starts(C,D)[4.5], {4.35%}: event C and event D start together, 4.5 s into the cycle.
5) finishes(D,C)[5.5], {4.35%}: event D and event C finish together after 5.5 s.
6) during(A,B)[6], {6.52%}: event B occurs alongside event A for 6 s.
7) overlaps(B,D)[1], {2.17%}: event B and event D co-occur for 1 s.

A support value of 4.35% indicates the percentage of sequences in the dataset that contain this relationship.

7.3. Generated temporal association rules

All possible forward rules are enumerated using Algorithm 2, as defined in Section 5. Rules are retained only if their confidence value is higher than minC. The accuracy for different values of minC is computed, and the results are displayed in Fig. 7b. Based on this, minC for both datasets is set to 50%. The dominant traffic cycle, shown in Fig. 8, can be explained using four temporal rules (cf. Table 7):

1) Rule 1: Events A and B start together, representing a vertical flow in opposite directions. From the 2-pattern equals(A,B) in Table 5, it is clear that they last for the same time duration. Event C starts when A and B finish (meets relationship). The rule has a confidence value of 50%, implying that the rule holds for only 50% of the sequences containing the 2-pattern.
2) Rule 2: Events C and D start together, indicating left- and right-turning traffic flow in the scene. The 2-pattern equals(C,D) (Table 5) shows that they last for the same duration. Event G takes place after C and D are done. Moreover, the overlaps relation implies that events G and E co-occur for a short time interval.
3) Rule 3: Event G ends before F starts. Event F is contained within H's interval.
4) Rule 4: This follows from Rule 3, where event F is contained within H's interval. This is followed by event A, thus completing the cycle.

Junction is a street intersection with a complex behavior. Therefore, the method also produces an alternate traffic cycle, illustrated in Fig. 9. Only rules relating to events A, B, C and D are affected; these rules are shown in Table 8. There are two main differences between this cycle and the previous one (cf. Fig. 8):

• Vertical flow (A) is interleaved with rightward-turning traffic (C).
• Vertical flow (B) is interleaved with leftward-turning traffic (D).
Fig. 14. Roundabout: jumping the red light.
Table 10
Junction and Roundabout clustering accuracy.

              Ours     EMD-L1 [28]    Standard pLSA [36]
Junction      95.6%    92.4%          89.7%
Roundabout    95.8%    86.4%          84.5%

Table 12
Roundabout typical event comparison.

                        Ours    Zen and Ricci [28]    Emonet et al. [15]
Horizontal flow (AB)    Yes     Only A                Yes
Vertical flow (CDE)     Yes     Only D                Yes
Cycle                   1       0                     0

The rules governing the Roundabout traffic flow sequence, as shown in Fig. 10, are listed in Table 9. The rules can be interpreted as follows:

1) Rule 1: Events A and B co-occur for the same duration {equals(B,A)}. This activity is followed by event D. Events C and D start and finish together {equals(C,D)}. The confidence value for the rule is 100%, which implies that for 100% of the sequences containing the 3-pattern (left-hand side), the 4-pattern also occurs.
2) Rule 2: Lastly, the vertical traffic flow denoted by F precedes events A and B.

7.4. Spatial anomaly detection

Given a test trajectory, we have to classify it to the set of obtained activity clusters (A, B, …). A trajectory is considered atypical if it does not match any representative clustered route. The direction vectors are computed to check whether the trajectories have the same direction of motion. If not, the trajectory is an outlier and the test clip denotes a spatial anomaly. This type of anomaly was discovered in both Junction and Roundabout. For instance, in the Junction dataset, a car performs an illegal U-turn and disrupts the traffic flow; the outlying trajectory, along with other test trajectories, is shown in Fig. 11. An example from the Roundabout dataset is shown in Fig. 12, where a vehicle interrupts the traffic flow by moving in the reverse direction of the flow.

7.5. Spatio-temporal anomaly detection

The temporal patterns extracted from the test sequences are matched with the frequent temporal patterns learned from the training set. For the Junction dataset, an instance of an anomalous interaction is shown in Fig. 13. In this test clip, a fire truck disrupts the traffic flow. The resulting pattern from the test sequence that is not part of the frequent patterns is starts(F,A), implying that events A and F cannot co-occur; therefore, it is considered anomalous. An example of a normal behavior can be seen in Fig. 6, and the corresponding temporal rule is shown in Section 5.3. Furthermore, a comprehensive list of the various types of anomalies in Junction can be found in Table 13. In the Roundabout dataset, an example of an anomalous interaction is shown in Fig. 14. This test sequence indicates an example of reckless driving, as event A and event D cannot co-occur. The resulting pattern from the test sequence that is not part of the frequent patterns is overlaps(D,A); thus, it is an anomaly. Table 14 lists the different anomalies found in Roundabout.

7.6. Comparative analysis

The Junction and Roundabout datasets have been used extensively in the literature for dynamic scene analysis. These traffic scenes are more challenging than the MIT dataset [15,19,26], as the scene is busier and more complex. We compare the typical behaviors as well as the anomalies with state-of-the-art approaches and demonstrate that our method compares favorably.

In our experiments, we identified individual traffic flows, where each flow represents an atomic event. In Junction, 8 different flow directions were discovered, while in Roundabout, 5 were identified. We compare our clustering accuracy (using Eq. (3)) to other methods and present slightly better results in Table 10. The complex spatio-temporal behavior between the events is modeled using temporal patterns and rules, and the resulting traffic cycles are shown in Figs. 8, 9 and 10. Most of the existing approaches were successful in finding the typical activities in Junction (cf. Table 11), even though some do not recover the traffic light cycles. However, very few existing works attempt to model Roundabout due to its complex nature; Table 12 shows two approaches that found the dominant activities without the traffic cycle.

In contrast, only a few existing approaches [10,26,28,36] perform anomaly detection. A fair comparison with [28,36] is not possible, since they only show instances of 3 and 7 anomalies, respectively, along with an anomaly score. To quantitatively compare the anomalies discovered in Junction with [26,10], we have tested on the same duration of the video (around 40 min). Based on the ground truth, we determine the accuracy of the detection process in terms of: 1) True Positive (TP): an abnormal test sequence is classified as abnormal; 2) False Positive (FP): normal behavior is classified as abnormal.

The various classes of anomalies discovered in Junction are shown in Table 13. The Drive Wrong Way category, which disrupts the normal traffic flow, consists mostly of fire trucks and police cars. The Normal category indicates a behavior which is tagged as abnormal. Our method was able to find single-agent (i.e., illegal U-turns) as well as multi-agent anomalies, and the combined accuracy is depicted in the table. In order to make a better comparison, we add a column depicting the accuracy (%) for each anomaly detected by the different methods. Our accuracy is comparably much higher, and we model all the events in a single clip. Also, we correlate events with one another with an explicit time duration to find abnormalities easily. Moreover, spatio-temporal interactions are well-defined in longer clips, whereas the drawback of probabilistic topic models is that they might fail to detect abnormal events occurring alongside regular events within the same clip; this is apparent in their low accuracy. For the Roundabout dataset, none of the existing works perform anomaly detection. Therefore, we report our anomaly detection results (cf. Table 14) based on the ground truth provided in [45], and the accuracy is reasonably high.
Table 11
Junction typical event comparison.

                                  Ours    Hospedales et al. [26]    Zen and Ricci [28]    Loy et al. [18]    Kuettel et al. [1]    Emonet et al. [15]
Vertical flow (AB)                Yes     Yes                       Yes                   Yes                Yes                   Yes
Interleaved vertical flow (CD)    Yes     Yes                       No                    Yes                Yes                   Yes
Horizontal flow (EF)              Yes     Yes                       Yes                   Yes                Yes                   Yes
Horizontal turn flow (GH)         Yes     Only H                    No                    Yes                Yes                   Yes
Cycles                            2       1                         0                     0                  2                     1
Table 13
Junction anomaly detection — quantitative analysis. GT stands for ground truth and FTP stands for frequent temporal patterns.

Ours (clip size: 12 s, 300 f)
                    GT     FTP    Accuracy
Break red light     7      7      100%
Illegal U-turn      12     10     83%
Drive wrong way     4      3      75%
Unusual turns       2      2      100%
Normal              185    183    98%
TP (%)                     88.0
FP (%)                     1.1

Hospedales et al. [26] (clip size: 1 s, 25 f)
                    GT      MCTM    Accuracy
Break red light     13      4       31%
Illegal U-turn      15      5       33%
Drive wrong way     15      12      80%
Unusual turns       10      6       60%
Normal              2663    2636    99%
TP (%)                      52.8
FP (%)                      1.0

Jouneau and Carincotte [10] (clip size: 2 s, 60 f)
                    HDP-HMM    Accuracy
Break red light     –          –
Illegal U-turn      3          27%
Drive wrong way     11         69%
Unusual turns       –          –
Normal              8          –
TP (%)              51.9
FP (%)              –
Table 14
Roundabout anomaly detection. GT stands for ground truth and FTP stands for frequent temporal patterns. Clip size: 12 s (300 f).

                    GT    FTP
Break red light     3     3
Drive wrong way     3     2
Normal              53    50
TP (%)                    83.3
FP (%)                    5.7
8. Conclusions

In this work, we have proposed a method that analyzes traffic patterns and detects irregular events. To the best of our knowledge, temporal mining techniques have not previously been used for event recognition in dynamic scenes. We first discover frequent temporal patterns and use Allen's temporal relations [8] for their representation; the time duration of composite events is included in the patterns as well. Temporal association rules are then generated from these frequent patterns. These association rules help model the traffic cycle sequence and detect anomalous behavior. Recent works describe the traffic cycle as an ordered sequence of activities; we, on the other hand, indicate exactly how two events are related (overlaps, equals, starts, etc.) and the time interval they occupy. Spatio-temporal anomalies are identified and detected in a hierarchical manner. We show results on two standard public datasets and demonstrate considerable improvement over the current methods.
References
[1] D. Kuettel, M. Breitenstein, L. Van Gool, V. Ferrari, What's going on? Discovering spatio-temporal dependencies in dynamic scenes, IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1951–1958.
[2] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.
[3] B. Zhou, X. Wang, X. Tang, Random field topic model for semantic region analysis in crowded scenes from tracklets, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3441–3448.
[4] U. Von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416.
[5] D. Damen, D. Hogg, Recognizing linked events: searching the space of feasible explanations, IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 927–934.
[6] L. Kratz, K. Nishino, Spatio-temporal motion pattern modeling of extremely crowded scenes, The 1st International Workshop on Machine Learning for Vision-based Motion Analysis, 2008.
[7] N. Harikrishna, S. Satheesh, S. Sriram, K. Easwarakumar, Temporal classification of events in cricket videos, National Conference on Communications (NCC), IEEE, 2011, pp. 1–5.
[8] J. Allen, G. Ferguson, Actions and events in interval temporal logic, J. Log. Comput. 4 (5) (1994) 531.
[9] D. Patel, W. Hsu, M. Lee, Mining relationships among interval-based events for classification, Proceedings of the SIGMOD International Conference on Management of Data, ACM, 2008, pp. 393–404.
[10] E. Jouneau, C. Carincotte, Particle-based tracking model for automatic anomaly detection, ICIP, 2011.
[11] V. Morariu, L. Davis, Multi-agent event recognition in structured scenarios, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3289–3296.
[12] Z. Zhang, K. Huang, T. Tan, L. Wang, Trajectory series analysis based event rule induction for visual surveillance, IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[13] T. Hospedales, J. Li, S. Gong, T. Xiang, Identifying rare and subtle behaviours: a weakly supervised joint topic model, IEEE Trans. Pattern Anal. Mach. Intell. 99 (2011).
[14] C. Stauffer, W. Grimson, Learning patterns of activity using real-time tracking, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 747–757.
[15] R. Emonet, J. Varadarajan, J. Odobez, Extracting and locating temporal motifs in video scenes using a hierarchical nonparametric Bayesian model, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3233–3240.
[16] J. Li, S. Gong, T. Xiang, Discovering multi-camera behaviour correlations for on-the-fly global activity prediction and anomaly detection, IEEE International Workshop on Visual Surveillance, Kyoto, Japan, 2009.
[17] J. Varadarajan, J. Odobez, Topic models for scene analysis and abnormality detection, IEEE 12th International Conference on Computer Vision Workshops, 2009, pp. 1338–1345.
[18] C. Loy, T. Xiang, S. Gong, Stream-based active unusual event detection, ACCV, 2010, pp. 161–175.
[19] L. Song, F. Jiang, Z. Shi, A. Katsaggelos, Understanding dynamic scenes by hierarchical motion pattern mining, IEEE International Conference on Multimedia and Expo, 2011, pp. 1–6.
[20] Y. Yang, J. Liu, M. Shah, Video scene understanding using multi-scale analysis, IEEE 12th International Conference on Computer Vision, 2009, pp. 1669–1676.
[21] I. Saleemi, L. Hartung, M. Shah, Scene understanding by statistical modeling of motion patterns, IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2069–2076.
[22] A. Hervieu, P. Bouthemy, J. Le Cadre, A statistical video content recognition method using invariant features on object trajectories, IEEE Trans. Circ. Syst. Video Technol. 18 (11) (2008) 1533–1543.
[23] Y. Shi, A. Bobick, I. Essa, Learning temporal sequence model from partially labeled data, Computer Vision and Pattern Recognition, 2006, pp. 1631–1638.
[24] Y. Benezeth, P. Jodoin, V. Saligrama, C. Rosenberger, Abnormal events detection based on spatio-temporal co-occurences, IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2458–2465.
[25] K. Prabhakar, S. Oh, P. Wang, G. Abowd, J. Rehg, Temporal causality for the analysis of visual events, IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1967–1974.
[26] T. Hospedales, S. Gong, T. Xiang, A Markov clustering topic model for mining behaviour in video, IEEE 12th International Conference on Computer Vision, 2009, pp. 1165–1172.
[27] J. Varadarajan, R. Emonet, J.-M. Odobez, A sequential topic model for mining recurrent activities from long term video logs, Int. J. Comput. Vis. (2013) 1–27.
[28] G. Zen, E. Ricci, Earth mover's prototypes: a convex learning approach for discovering activity patterns in dynamic scenes, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3225–3232.
[29] W. Brendel, A. Fern, S. Todorovic, Probabilistic event logic for interval-based event recognition, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3329–3336.
[30] M. Sridhar, A.G. Cohn, D.C. Hogg, Unsupervised learning of event classes from video, AAAI, 2010.
[31] Y. Zhang, E. Swears, N. Larios, Z. Wang, Q. Ji, Modeling temporal interactions with interval temporal Bayesian networks for complex activity recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2013) 1.
[32] R. Hamid, S. Maddi, A. Johnson, A. Bobick, I. Essa, C. Isbell, A novel sequence representation for unsupervised analysis of human activities, Artif. Intell. 173 (14) (2009) 1221–1244.
[33] B. SivaSelvan, N. Gopalan, Efficient algorithms for video association mining, Adv. Artif. Intell. (2007) 250–260.
[34] V. Jakkula, D. Cook, A. Crandall, Temporal pattern discovery for anomaly detection in a smart home, 3rd IET International Conference on Intelligent Environments, 2007, pp. 339–345.
[35] A. Yilmaz, K. Shafique, N. Lobo, X. Li, T. Olson, M. Shah, Target-tracking in FLIR imagery using mean-shift and global motion compensation, IEEE Workshop on Computer Vision Beyond the Visible Spectrum, 2001, pp. 54–58.
[36] J. Li, S. Gong, T. Xiang, Scene segmentation for behaviour correlation, ECCV, 2008, pp. 383–395.
[37] I. Junejo, H. Foroosh, Euclidean path modeling for video surveillance, Image Vis. Comput. 26 (4) (2008) 512–528.
[38] P. Perona, L. Zelnik-Manor, Self-tuning spectral clustering, Adv. Neural Inf. Process. Syst. 17 (2004) 1601–1608.
[39] S. Atev, O. Masoud, N. Papanikolopoulos, Learning traffic patterns at intersections by spectral clustering of motion trajectories, IEEE International Conference on Intelligent Robots and Systems, 2006, pp. 4851–4856.
[40] B. Morris, M. Trivedi, Learning trajectory patterns by clustering: experimental studies and comparative evaluation, IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 312–319.
[41] S. Laxman, P. Sastry, A survey of temporal data mining, SADHANA Acad. Proc. Eng. Sci. 31 (2) (2006) 173–198.
[42] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, ACM SIGMOD Record, vol. 22, no. 2, 1993, pp. 207–216.
[43] E. Winarko, J. Roddick, Discovering richer temporal association rules from interval-based data, Data Warehous. Knowl. Discov. (2005) 315–325.
[44] F. Höppner, Learning temporal rules from state sequences, IJCAI Workshop on Learning from Temporal and Spatial Data, vol. 25, 2001.
[45] Datasets, http://www.eecs.qmul.ac.uk/~jianli/Dataset-List.html.