Image and Vision Computing 32 (2014) 1102–1116
Review article
Dynamic scene understanding using temporal association rules☆ Ayesha M. Talha, Imran N. Junejo ⁎ Dept. Of Computer Science, University of Sharjah, U.A.E. 27272
Article info

Article history: Received 23 July 2013; Received in revised form 12 April 2014; Accepted 28 August 2014; Available online 20 October 2014

Keywords: Scene understanding; Computer vision; Association rules; Traffic surveillance
Abstract

The basic goal of scene understanding is to organize the video into sets of events and to find the associated temporal dependencies. Such systems aim to automatically interpret activities in the scene, as well as detect unusual events that could be of particular interest, such as traffic violations and unauthorized entry. The objective of this work, therefore, is to learn behaviors of multi-agent actions and interactions in a semi-supervised manner. Using tracked object trajectories, we organize similar motion trajectories into clusters using the spectral clustering technique. This set of clusters depicts the different paths/routes, i.e., the distinct events taking place at various locations in the scene. A temporal mining algorithm is used to mine interval-based frequent temporal patterns occurring in the scene. A temporal pattern indicates a set of events that are linked based on their relationship with other events in the set, and we use Allen's interval-based temporal logic to describe these relations. The resulting frequent patterns are used to generate temporal association rules, which convey the semantic information contained in the scene. Our overall aim is to generate rules that govern the dynamics of the scene and perform anomaly detection. We apply the proposed approach on two publicly available complex traffic datasets and demonstrate considerable improvements over the existing techniques.

© 2014 Elsevier B.V. All rights reserved.
Contents

1. Introduction
   1.1. Motivation
   1.2. Contributions
   1.3. Assumptions
2. Related work
3. Overview
4. Feature extraction and segmentation
5. Video association mining
   5.1. Event sequence representation
   5.2. Frequent temporal pattern discovery
        5.2.1. Complexity
        5.2.2. Pruning patterns
   5.3. Generating temporal association rules
        5.3.1. Complexity
6. Hierarchical spatio-temporal anomaly detection
   6.1. Spatial level
   6.2. Spatio-temporal level
7. Experimental evaluation
   7.1. Datasets
        7.1.1. Junction
        7.1.2. Roundabout
        7.1.3. Motion segmentation
   7.2. Discovered frequent temporal patterns
   7.3. Generated temporal association rules
   7.4. Spatial anomaly detection
   7.5. Spatio-temporal anomaly detection
   7.6. Comparative analysis
8. Conclusions
References

☆ This paper has been recommended for acceptance by Ivan Laptev.
⁎ Corresponding author. E-mail address: [email protected] (I.N. Junejo).
http://dx.doi.org/10.1016/j.imavis.2014.08.010
0262-8856/© 2014 Elsevier B.V. All rights reserved.
1. Introduction

In visual surveillance, there has been an increasing interest in recognizing object behaviors by interpreting the high-level semantics of scene dynamics. However, computing relationships between different actions in the scene, or detecting rare events in an ocean of video data, is a daunting task. Analyzing event interactions manually is practically impossible, yet current practice depends solely on human operators. In addition, as the scene gets crowded, the complexity of the relationships between the agents increases as well. Even though this has become an active research area, it remains a complex problem with many constraints, and an unsupervised method is required to make the task easier. An elegant solution to this problem can open doors to a wide spectrum of applications, such as video surveillance [1], anomaly detection [2], and crowd analysis [3].

Typically, the input to a dynamic scene analysis system is a video, and the first task is to detect moving objects and record their motion characteristics in the form of object trajectories (or optical flow). Each trajectory denotes an individual event in the scene during a time interval. This step is generally followed by behavior or activity segmentation, which identifies semantically meaningful components and groupings to reveal different events. Traditionally, algorithms such as K-means and fuzzy clustering have been used extensively, while many recent works have explored spectral clustering and normalized cuts [4]. The resulting clusters model the various events, indicating the spatial layout of the scene. Finally, the last step is to learn the temporal scene behavior. Behavior, in our context, describes the way an object acts in relation to the other objects in the scene; it can be defined as a sequence of events with spatial and temporal constraints. Recently, probabilistic methods such as Dynamic Bayesian Networks (DBN) [5], Hidden Markov Models (HMM) [6], and Probabilistic Topic Models (PTM) [2] have been used extensively by the computer vision community to learn the scene dynamics.

The dynamic scene understanding problem can thus be expressed as: obtain the motion patterns in the scene, build the scene structure and, lastly, interpret the high-level semantics of the scene. A dynamic scene may also involve multiple agents interacting with one another, and the actions may occur in parallel or recur over time. Thus, we are interested in answering questions such as: what is happening in the scene, where are the objects located, and how do they interact within their environment.

In this work, we aim at developing a robust system that can learn the scene model with minimal human intervention. In this regard, video mining can help extract salient information from a video without such supervision [7]. In order to analyze and discover the temporal interdependencies and relationships between various events occurring in a scene, we make use of temporal mining algorithms. These relationships between events are modeled as temporal patterns, discovered using a frequent temporal pattern mining algorithm. A frequent temporal pattern can be defined as a set of composite events that occur repetitively in the video, and these are expressed using temporal relations in Allen's taxonomy [8], such as before, after, and meet. Once these frequent patterns are obtained, forward temporal association rules are generated. These rules capture the correlations between the frequent temporal patterns present in the video.
We define an anomaly as an atypical behavioral pattern based entirely on the model in context; thus, every scene can have a different set of anomalies. In this work, anomaly detection is performed in a hierarchical manner. First, we identify unusual events within a spatial context. These spatial anomalies can be found once unique event clusters
are identified. The second type of anomalous behavior can be found by using frequent temporal patterns (and their time duration) to discriminate between the usual and the unusual complex composite events.

1.1. Motivation

Our goal is to extract complex activity patterns in a multi-agent environment. This is not trivial: in most real-world scenarios, the underlying dynamic scene behavior is very complex and perhaps ambiguous, making high-level activity interpretation a challenge. Most of the existing techniques employ various probabilistic models; however, learning and inference in such methods is computationally prohibitive. Moreover, as the scene gets crowded, the complexity of the relationships increases, which necessitates a huge amount of training data for accurate analysis. Therefore, in this work we propose to learn the scene dynamics using temporal mining techniques. The frequent pattern discovery algorithm utilized in this work is exploratory in nature. In addition, pattern matching allows for accurate and efficient anomaly detection.

1.2. Contributions

• To the best of our knowledge, temporal mining techniques have not been used for event recognition in dynamic scenes. We discover frequent temporal patterns using [9] to learn the scene behavior.
• We indicate exactly how two events are related (overlaps, equals, starts, etc.) using Allen's relations [8]. Moreover, we include the duration of composite events in each pattern.
• To eliminate spurious frequent temporal patterns, we suggest a few steps in Section 5.2 to prune the pattern space.
• Once these patterns are obtained, we generate temporal association rules. These temporal association rules help model the traffic cycle sequence, which is the main test domain for our work.
• Using a hierarchical anomaly detection algorithm, spatial anomalies are detected based on object trajectories, and spatio-temporal anomalies are identified using a frequent pattern matching approach.

1.3. Assumptions

• We track objects to obtain events that unfold over time. As with any trajectory-based approach, a good tracking algorithm is needed to overcome its inherent issues. In this paper, we focus only on vehicle motion in complex traffic scenes. Pedestrian activity is disregarded, as complete trajectories are hard to obtain in crowded scenes.
• For the temporal mining algorithms, user-defined parameters have to be determined by domain experts. Even though mining techniques do not require the definition of events or rules in advance, the temporal support and the confidence thresholds (cf. Table 1) have to be specified.

The work is organized as follows: Section 2 presents some existing works on the topic. Section 3 briefly describes the proposed methodology. Section 4 focuses on feature extraction and segmentation, while Section 5 presents the second phase, i.e., using video mining techniques to learn the dynamic scene model. The anomaly detection methodology is discussed at length in Section 6. Experiments are conducted on two datasets, and the results with evaluation measures are illustrated and explained in Section 7, followed by conclusions in Section 8.
Table 1
List of temporal mining symbols.

Symbol       Description
s            Starting time of a given event
f            Finishing time of a given event
k            Dimension of the pattern (number of intervals)
patt         A frequent temporal pattern
duration     Time duration of a composite event based on the relation
support      The fraction of all sequences in the dataset in which a pattern is contained
confidence   The probability of the consequent occurring given the antecedent
minS         Minimum support threshold; a pattern is frequent if its support is ≥ minS
minC         Minimum confidence threshold; a rule is strong if its confidence is ≥ minC
2. Related work

Existing approaches in the literature generally start with motion feature extraction, such as object trajectories or optical flow. Event modeling is then done by clustering these features using similarity-based distance measures. Trajectory-based approaches [3,10–12] primarily rely on how well a tracker performs; the results may be compromised in crowded scenarios due to the presence of multiple objects, inter-object occlusions and low-resolution videos [3,13]. In their seminal work, Stauffer and Grimson [14] present a robust tracker based on a novel probabilistic method for background subtraction. Their method models each pixel as a mixture of Gaussians, and an online approximation is used to update the model. On the other hand, if the video at hand is of high resolution, both spatially and temporally, optical flow may be the feature of choice [1,2,6,13,15–19]. Yang et al. [20] represent the direction of pixel-wise optical flow as a Bag of Words and apply Diffusion Maps to it; the key motion patterns are obtained by performing a spectral analysis on the Markov matrix of the graph. Saleemi et al. [21] represent the salient unquantized optical flow patterns as a Gaussian Mixture Model (GMM), initialized using K-means, and show that motion patterns can be statistically inferred by computing the conditional optical flow expectation. A key limitation of all the methods discussed above is that they model the scene structure, in terms of spatial layout and basic object patterns, while ignoring the higher-level global scene behavior.

Some researchers have dealt with scene learning using generative models. A generative model is a hierarchical probabilistic model which typically has some hidden parameters; in such probabilistic approaches, events are classified as outliers if they have a low likelihood under the learned model. Kratz and Nishino [6] use a multivariate Gaussian distribution on spatio-temporal gradients, and model temporal behaviors with a distribution-based HMM. They then use the symmetric Kullback–Leibler (KL) divergence distance measure to extract prototypes, and statistical anomalies are detected via the forward–backward algorithm. Hervieu et al. [22] compute trajectories with a tracking method using a color-based particle filter. Trajectories are defined by combining curvature and motion magnitude; the temporal causality of the features is then captured by HMMs, and the similarity between trajectories is expressed by exploiting the quantization-based HMM framework. HMMs are utilized in the last two approaches, and their main drawback is that they require more training data than other approaches in order to avoid over-fitting. Shi et al. [23] train Propagation Nets (an extension of DBNs) and boosted event detectors for activity modeling and recognition. Activities represent the duration of intervals explicitly and are ordered with respect to time; the main limitation is that the structure of the Net has to be predefined, hence restricting the number of events in an activity. Benezeth et al. [24] obtain motion labels by background subtraction, and then learn co-occurrence statistics for a Markov Random Field (MRF) potential function to detect abnormalities via a likelihood ratio test. However, they correlate activities within a short time frame and hence are unable to model complex dynamic scene behavior. Prabhakar et al. [25] eliminate the need for accurate flows or trajectories when exploring the concept of causality to express event relations. They model space–time interest points by a point-process model and then identify patterns of interaction over long intervals using Granger causality. However, their approach is restricted to classifying videos with just one interaction.

Probabilistic Topic Models (PTMs) are capable of handling noise in motion features. These models divide the video into clips (documents)
Fig. 1. The proposed dynamic scene analysis framework.
Fig. 2. Observed raw trajectories.
by quantizing pixel motion in the image. Kuettel et al. [1] first extract activities based on a Hierarchical Dirichlet Process (HDP) model. Next, they learn co-occurring activities jointly with their time dependencies using a new model, the Dependent Dirichlet Process (DDP)-HMM; lastly, Gibbs sampling is used to model the training data. However, the temporal order of activities is considered only at the global scene level, and the inter-clip traffic state changes are disregarded as well. Similarly, Jouneau and Carincotte [10] use HMMs to classify particle-based trajectories and utilize HDP-HMMs to identify co-occurring trajectories. Furthermore, they discover the temporal relations between trajectories as well as the abnormalities, where a clip is labeled as an anomaly even if there is some normal interaction within the sequence. In [26], Hospedales et al. build a Markov Clustering TM based on DBN and LDA that can capture the temporal characteristics of the object behaviors hierarchically. However, their hybrid model requires the user to define the number of topics in advance. Moreover, as we show in Section 7.6, they find only one global rule for any dynamic scene as a result of ignoring the temporal order aspect. Varadarajan and Odobez [17] also made use of topic models. Their pLSA-TM found the spatio-temporal correlations among optical flow features, and abnormal events are found by computing the likelihood from modeled GMMs. But there is no incorporation of time information in their model. This is a major limitation of topic models, as they fail to establish temporal dependencies among low-level features; additionally, they are computationally expensive. In [27], however, the same authors mine sequential patterns to overcome these shortcomings.

Zhou et al. [3] start with fragments of trajectories called tracklets, obtained from a keypoint tracker. A Random Field TM is used to cluster these tracklets. The LDA model is then integrated with MRFs as a prior to enforce spatial and temporal coherence between the tracklets. However, it cannot be readily used for visual surveillance, as estimating the trajectory of each individual in the scene is nearly impossible. In order to determine the shared number of topics automatically and overcome the main limitation of topic models, non-parametric approaches are required. For instance, Emonet et al. [15] mine activities hierarchically in videos using temporal motifs and discover the periodic cycle dictated by traffic lights. While topic model approaches do not correlate atomic activities inside a clip, Zen and Ricci [28] propose an Earth Mover's Distance (EMD) prototype learning algorithm to overcome this limitation. Here, temporal segmentation of the video is done to obtain a set of multi-scale patterns, which are represented using histograms; lastly, a multi-scale anomaly score is used to find anomalous patterns. However, they fail to model the global scene behavior.

On the other hand, some works use rule-based approaches and event logic for event recognition. Morariu and Davis [11] represent low-level features using Allen's Interval Logic. We utilize Allen's Logic [8] as
Fig. 3. The obtained events: from A to H for the Junction dataset.
Fig. 4. The obtained events: from A to E for the Roundabout dataset.
well, since it is quite powerful for event reasoning. Probabilistic inference is done using Markov Logic Networks (MLN), where knowledge is provided manually via rules. Brendel et al. [29] represent temporal constraints among events using Probabilistic Event Logic (PEL), where a PEL knowledge base comprises a set of weighted logic formulas. However, in both of these approaches, rules have to be defined manually to capture inter-related relationships between events. Similarly, Zhang et al. [12] demonstrate the utility of a grammar-based approach using Allen's Logic. Even though the authors manage to learn event rules automatically using a Minimum Description Length-based grammar induction algorithm built on a Stochastic Context-Free Grammar (SCFG), their parser is unable to handle sub-events with parallel temporal relations. Sridhar et al. [30] discover event classes from videos by learning the spatio-temporal relationships between objects across several sets of interacting tracks in space and time (using Allen's logic). Then, a Markov Chain Monte Carlo procedure is used to interpret the event class represented by the activities. However, their approach requires huge datasets to learn events. Zhang et al. [31] propose an interval temporal Bayesian network to overcome the shortcomings of traditional graphical models. They combine Bayesian networks with interval algebra (i.e., Allen's 13 temporal relations) to model multiple event dependencies over time intervals in order to recognize both parallel and sequential activities. The structure and parameters are learned automatically from training data using advanced algorithms. Unlike our approach, their model is unable to capture multiple occurrences of the same event in an activity. None of these approaches go beyond learning event rules, i.e., they do not attempt anomaly detection.

In contrast to grammar-based approaches, Hamid et al. [32] adopt a more data-driven approach. They model contiguous activity subsequences as event n-grams and use them to learn the activity structure. This allows them to discover the various activity classes and find abnormal events. Although they demonstrate the applicability of the approach in various domains, the atomic actions as well as the start and end of activities are known a priori. Moreover, overlapping activities and parallel events cannot be modeled with their method.

Most of the approaches discussed previously derive their scene models based on activities taking place at fixed points in time. In our work, on the other hand, we model events based on their duration. Our approach uses temporal video mining algorithms. Video mining comprises two main steps: (1) transforming the video into clusters, and (2) applying temporal mining algorithms to identify and establish relationships between activities or events [33]. Jakkula et al. [34] use the Apriori algorithm to mine temporal patterns to build a robust anomaly detection system for smart homes. They utilize a probabilistic model where the evidence is calculated for current activities with respect to previous activities. Their work focuses simply on deviations in the movements of a single individual, whereas we are concerned with modeling multi-object correlations in a complex dynamic scene.

3. Overview

The proposed approach is illustrated in Fig. 1 and comprises the following steps:

• Feature extraction: We employ a semi-automatic mean-shift tracker [35] to obtain object trajectories.
• Motion segmentation: Spectral clustering is used to cluster trajectories into different event classes. The number of clusters is determined iteratively.
• Learning frequent temporal patterns: Relationships are discovered between events based on their time duration characteristics. Temporal patterns, often represented by composite events consisting of a hierarchy of relations, have the time duration of each relationship appended to them. Temporal patterns are retained based on their frequency of occurrence.
• Generating temporal association rules: Temporal association rules are generated from pairs of frequent temporal patterns. Forward temporal association rules are formulated based on the confidence, to be defined later.
• Detecting anomalous events: Anomalies are detected at two levels: 1) the spatial level, where test trajectories are matched to the model clusters to determine their membership, and 2) the spatio-temporal level, where anomalies are detected based on the scene model represented by the learned frequent temporal patterns.

4. Feature extraction and segmentation

Objects tend to follow common pathways in a traffic scenario, and two key points are of particular interest: the entry point, where an object appears in the scene, and the exit point, where it disappears from the scene. Since we focus solely on traffic scenarios in this work, we use [35] to perform the object tracking, and pedestrian trajectories, if any, are subsequently removed (as in [28,36]). Moving average low-pass filters are used to remove noise from the trajectories. The extracted vehicle trajectories are of unequal lengths, and Dynamic Time Warping (DTW), based on the simple Euclidean distance, is used to find the distance between trajectories [37].
Table 2
Allen's temporal relationships.

r   Temporal relation   Endpoints               Inverse relation     Duration
b   X before Y          sx < fx < sy < fy       Y after X            sy − sx
m   X meets Y           sx < sy = fx < fy       Y met by X           fx ∨ sy
e   X equals Y          sx = sy < fx = fy       Y equals X           fx − sx
o   X overlaps Y        sx < sy < fx < fy       Y overlapped by X    fx − sy
s   X starts Y          sx = sy < fx < fy       Y starts with X      fx − sx ∨ fy − sy
d   X during Y          sy < sx < fx < fy       Y contains X         fy − sy
f   X finishes Y        sy < sx < fx = fy       Y finishes with X    fx − sx ∨ fy − sy
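As a small illustration of the endpoint constraints in Table 2 (a sketch, not part of the authors' implementation), the relation between two intervals X = [sx, fx] and Y = [sy, fy] can be read off directly from endpoint comparisons. The function below returns the symbol from the first column of the table, and None when only the inverse relation applies (in which case the caller can swap the arguments):

```python
def allen_relation(sx, fx, sy, fy):
    """Classify the Allen relation of interval X = [sx, fx] w.r.t. Y = [sy, fy],
    following the endpoint constraints of Table 2."""
    if fx < sy:
        return 'b'                      # X before Y
    if fx == sy:
        return 'm'                      # X meets Y
    if sx == sy and fx == fy:
        return 'e'                      # X equals Y
    if sx == sy and fx < fy:
        return 's'                      # X starts Y
    if sy < sx and fx == fy:
        return 'f'                      # X finishes Y
    if sy < sx and fx < fy:
        return 'd'                      # X during Y
    if sx < sy < fx < fy:
        return 'o'                      # X overlaps Y
    return None                         # only the inverse relation holds; swap X and Y

# Example: event B = [0, 9] meets event C = [9, 11] in the first sequence of Table 3
assert allen_relation(0, 9, 9, 11) == 'm'
```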
Fig. 5. The composite event pattern (above) and the associated temporal relation matrix.
Fig. 6. Temporal association rule.
For two trajectories Ta and Tb of length n and m, respectively, let DTW(Ta, Tb) denote the DTW distance between them. We use spectral clustering, where an affinity matrix is constructed from these DTW distances within a Gaussian kernel function [38]: for two given trajectories, the similarity function is defined as

s(Ta, Tb) = exp( −DTW²(Ta, Tb) / (σa σb) ).     (1)

Selecting a σ value for each trajectory can be done simply by studying the statistics of the trajectory neighborhood. For instance,

σa = DTW(Ta, Tc).     (2)

This implies that the local scaling parameter σa of Ta is the DTW distance between Ta and the c-th nearest trajectory.
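As a rough, illustrative sketch of this step (not the authors' implementation), the affinity of Eq. (1) with the local scaling of Eq. (2) can be built from pairwise DTW distances and handed to an off-the-shelf spectral clustering routine; the neighbour index c and the cluster count used below are assumed values.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def dtw(a, b):
    """DTW distance between two trajectories a, b of shape (len, 2),
    using the Euclidean distance between points."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def cluster_trajectories(trajs, n_clusters=8, c=7):
    """Affinity of Eq. (1) with local scaling sigma_a = DTW(T_a, T_c) (Eq. (2)),
    followed by spectral clustering into the dominant events."""
    N = len(trajs)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            D[i, j] = D[j, i] = dtw(trajs[i], trajs[j])
    sigma = np.sort(D, axis=1)[:, c]            # distance to the c-th nearest trajectory
    S = np.exp(-D ** 2 / np.outer(sigma, sigma))
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(S)
```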
Table 3
Event sequences.

A 0 5    B 0 9    C 9 11   B 10 11
C 0 7    B 0 3    A 3 11   B 9 11
A 0 11   C 1 6    D 1 5    C 7 11   D 10 11
A 0 4    D 0 1    C 0 3    E 6 7    G 7 11
It is important to note that the resulting similarity measure is symmetric in nature. K-means is then employed in the last spectral clustering step to obtain the dominant events in the scene [39].

Raw trajectories for the datasets used in this work (explained in detail in Section 7) are shown in Fig. 2. Trajectories with 25 observations (frames) or fewer are eliminated; these are likely tracking errors, as any vehicle motion lasts more than 1 s. Events are represented by the clustered trajectories, where each cluster is representative of an individual route in the scene. Using [39], the number of clusters in Junction (i.e., dominant routes) was found to be 8, and these are shown in Fig. 3 with their respective event descriptions. The Roundabout dataset has 5 dominant routes; the discovered events and descriptions are given in Fig. 4.

The accuracy of spectral clustering can be evaluated by finding the one-to-one mapping between the ground truth and the clustering labels. The clustering quality is measured by the Correct Clustering Rate (CCR) [40]. A high CCR value indicates that different trajectories are grouped in different clusters, while trajectories that are spatially close are assigned to the same cluster:

CCR = (1/N) Σ_{c=1}^{K} t_c.     (3)
N denotes the total number of trajectories in the dataset, and the summation counts the number of trajectories correctly assigned to each cluster, where t_c signifies the total number of trajectories belonging to the c-th cluster. Using Eq. (3), the resulting accuracy for Junction is 95.6%. The Roundabout dataset has a similar accuracy of 95.8%.

Fig. 7. Minimum support threshold and minimum confidence threshold.

Table 4
k-pattern sets.

              Junction           Roundabout
              Before    After    Before    After
2-patterns    55        40       23        17
3-patterns    33        26       5         4
4-patterns    4         3        1         1

5. Video association mining

Events reoccur over time, which means that each event corresponds to multiple time intervals. We first form event sequences and then extract the frequent temporal patterns from them. Allen's first-order interval logic is used to describe relationships between event pairs in sequences. Next, temporal association rules are generated from the obtained frequent patterns (Section 5.3). Association rules are used to predict future events or the expected behavior between various objects in the scene. These rules help infer the underlying dynamics of the traffic cycle contained within the scene. Table 1 shows the symbols used in this paper.

5.1. Event sequence representation

An ordered event sequence is defined by a series of triplets

(e1, s1, f1), (e2, s2, f2), (e3, s3, f3), …     (4)

where each triplet has an associated event e and occurs during a closed interval [s, f], with s and f defined in Table 1. Event triplets are ordered by their start time in temporal order, i.e., triplet1 is before triplet2 iff:

1) s1 < s2, or
2) s1 = s2 and f1 < f2, or
3) s1 = s2 and f1 = f2 and e1 < e2.

Every event triplet has to be maximal, such that intervals for the same event do not overlap in the same sequence. If they do, they are merged into one interval [9]:

(event label, min{s1, s2}, max{f1, f2}).     (5)

This event sequence set is utilized for mining the frequent temporal patterns, as described next.
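A minimal sketch of this representation (assuming the triplet format above; not the authors' code) merges overlapping same-event intervals as in Eq. (5) and orders the resulting triplets by the three rules listed:

```python
from collections import defaultdict

def build_event_sequence(triplets):
    """Build one ordered event sequence (Section 5.1): merge overlapping intervals
    of the same event label as in Eq. (5), then order the triplets by
    (start, finish, label). `triplets` is a list of (label, start, finish)."""
    by_label = defaultdict(list)
    for e, s, f in sorted(triplets, key=lambda t: (t[1], t[2])):
        ivals = by_label[e]
        if ivals and s <= ivals[-1][1]:                        # same-event intervals overlap
            ivals[-1] = (ivals[-1][0], max(ivals[-1][1], f))   # Eq. (5): min start, max finish
        else:
            ivals.append((s, f))
    merged = [(e, s, f) for e, ivals in by_label.items() for s, f in ivals]
    return sorted(merged, key=lambda t: (t[1], t[2], t[0]))

# The first Junction sequence of Table 3 is already maximal and ordered:
print(build_event_sequence([('A', 0, 5), ('B', 0, 9), ('C', 9, 11), ('B', 10, 11)]))
# -> [('A', 0, 5), ('B', 0, 9), ('C', 9, 11), ('B', 10, 11)]
```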
5.2. Frequent temporal pattern discovery

Table 5
Junction frequent temporal patterns.

2-patterns:
before(D,E)[4.5], {4.35%}    before(A,G)[7], {2.17%}    before(C,B)[9], {2.17%}
equals(A,B)[11], {2.17%}     equals(C,D)[3], {2.17%}
meets(B,C)[8.5], {4.35%}     meets(H,E)[4], {2.17%}
starts(C,D)[4.5], {4.35%}    starts(G,E)[1], {2.17%}
finishes(D,C)[5.5], {4.35%}
during(A,B)[6], {6.52%}      during(H,F)[3.5], {4.35%}
overlaps(B,D)[1], {2.17%}    overlaps(E,G)[2], {4.35%}

3-patterns:
finishes(overlaps(C,A),B) [2], {2.17%}
before(starts(A,C),G) [7], {2.17%}
during(before(D,G),E) [4], {2.17%}

4-patterns:
during(before(starts(D,C),G),E) [4], {2.17%}
meets(before(starts(A,C),E),G) [7], {2.17%}

Table 6
Roundabout frequent temporal patterns.

2-patterns:
starts(A,B)[5], {5%}
overlaps(D,E)[4], {2.5%}
overlaps(D,C)[4], {5%}
before(E,A)[9], {5%}
meets(D,E)[10], {2.5%}

3-patterns:
finishes(before(E,B),A) [3], {2.5%}
before(starts(B,A),D) [7], {2.5%}

4-patterns:
finishes(before(starts(B,A),D),C) [2], {2.5%}

An event in the scene may have a temporal relation with other events in the same sequence, and there might also be composite events. Allen's symbolic temporal relations [8] are used to encode these relationships; Table 2 shows Allen's relations and the related constraints. A composite event consisting of k events, i.e., a temporal pattern of length k, is also known as a k-pattern. In order to capture the inter-event time constraints, the time duration is computed for each pair of events and incorporated in the temporal pattern structure. The temporal relation among the k events is captured by a k × k matrix. This temporal matrix is anti-symmetric and describes Allen's relations between the atomic events used to form the composite event; the upper triangle is sufficient to convey the k-pattern. An instance of a canonical 3-pattern is shown with its temporal matrix in Fig. 5, where the duration of the composite events is expressed in seconds. Such a temporal pattern matrix can also be interpreted as the adjacency matrix of a graph, having interval relationships as edge labels [41].

The support of a temporal pattern is defined as the number of event sequences in which the pattern occurs. A k-pattern can only be generated if its support exceeds a minimum threshold, minS. A frequent pattern is defined as a pattern that occurs many times in the data, and we aim at unearthing these interesting patterns [41]. Apriori [42] is the first algorithm designed for frequent itemset mining; it is based on a level-wise principle as well as the anti-monotone property of set theory. Our algorithm of choice, IEMiner [9], is based on Apriori but, unlike Apriori, does not require multiple scans of the event sequences. For the sake of completeness, Algorithm 1 describes the approach briefly:

• Candidate generation: Initially, all the 2-patterns are generated. In subsequent passes, a (k + 1)-pattern is generated from a frequent k-pattern and a 2-pattern. This is done by storing the dominant event in each pattern, which is the event that has the latest end time among all events in the pattern. A (k + 1)-pattern is formed by joining a k-pattern and a frequent 2-pattern if they share the same dominant event, but it is considered a candidate pattern if and only if the 2-pattern involved occurs in at least (k − 1) frequent k-patterns.
• Counting support: At each level of the algorithm, the number of occurrences of each candidate is counted to determine whether it is frequent. This is achieved by maintaining a list of active events, i.e., events that are taking place within the time interval under observation.

The final output is a set of frequent temporal patterns, along with their support measure. Each temporal pattern represents a composite event in the scene. Furthermore, each frequent pattern relationship carries its time duration, computed as the average of the individual durations of each pattern.

5.2.1. Complexity

The algorithm checks the occurrence of a k-pattern in a set of event sequences with s events, which takes O(ks) time. As a whole, the computational complexity is O(nks), as the procedure is repeated for n k-patterns. However, since the value of k is generally less than 10 and IEMiner scans the event sequences only once (to count the support) for s events, the running time is greatly reduced.

5.2.2. Pruning patterns

The IEMiner algorithm, unfortunately, generates a huge number of temporal patterns.

Algorithm 1. Learning frequent temporal patterns.
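For intuition only, the following simplified sketch mines frequent 2-patterns in an Apriori-like, level-wise spirit; it is not the IEMiner algorithm of [9], which additionally joins k-patterns with 2-patterns via dominant events and avoids rescanning the sequences. The `relation` argument is a function such as the allen_relation sketch shown after Table 2, and the composite duration used here (span from the earliest start to the latest finish) is an assumption rather than the per-relation duration of Table 2.

```python
from itertools import combinations
from collections import Counter, defaultdict

def frequent_2_patterns(sequences, min_support, relation):
    """Count, over all event sequences, how often each (relation, e1, e2) 2-pattern
    occurs, and keep those whose support (fraction of sequences) reaches min_support.
    `sequences` are lists of (label, start, finish) triplets; `relation` classifies
    the Allen relation between two intervals."""
    counts = Counter()
    durations = defaultdict(list)
    for seq in sequences:
        seen = set()
        for (e1, s1, f1), (e2, s2, f2) in combinations(seq, 2):
            r = relation(s1, f1, s2, f2)
            if r is None:
                continue
            key = (r, e1, e2)
            if key not in seen:                 # support counts sequences, not instances
                seen.add(key)
                counts[key] += 1
            durations[key].append(max(f1, f2) - min(s1, s2))   # crude composite duration
    n = len(sequences)
    return {k: (c / n, sum(durations[k]) / len(durations[k]))
            for k, c in counts.items() if c / n >= min_support}
```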
Table 7
Junction rules for the dominant traffic cycle.

starts(A,B) [4] → meets(starts(A,B),C) [9] {50%}
before(starts(D,C),G) [5.5] → overlaps(before(starts(D,C),G),E) [4] {50%}
before(G,F) [4] → during(before(G,F),H) [5] {100%}
during(F,H) [3.5] → before(during(F,H),A) [4] {50%}
We observed in our experiments that a relationship {X r Y} may be represented by different temporal relations that are, in essence, quite similar. While mining temporal patterns using [9], this concern was ignored. Thus, we introduce global constraints to prune the pattern search space. We obtain the disjunctive combination of temporal patterns by applying the following rules:

1. Add the supports of 2-patterns whose relation is any of [overlaps, equals, during].
2. Add the supports if patt1 has relation starts/finishes and patt2 has any one of [overlaps, equals, during].
3. Eliminate rules where a relationship is established between the same event, such as A starts A.

Consequently, the combined support of the resulting pattern has to be calculated. The disjunction of patterns is computed by adding their individual supports. The resulting combination is retained only if the combined support is greater than or equal to minS. The final step is to recompute the time duration as the average of the combined durations.

Fig. 9. Junction traffic alternate light cycle.

5.3. Generating temporal association rules

Once frequent temporal patterns are discovered, the goal is to obtain meaningful temporal association rules that model the spatio-temporal events in the scene. We can think of this step as modeling the global behavior of the objects in the scene, while the previous step of computing the frequent temporal patterns finds the local behavior between the scene objects. A rule comprises a pair of propositions with the following implication: when the left-hand side, the antecedent, is true, then the right-hand side, the consequent, will be true as well. Just as a pattern's frequency is measured by its support, a rule's strength is measured by its confidence. The support of a pattern is the joint probability of events X and Y, i.e., P(X,Y), while the confidence (cf. Eq. (6)) is defined as the conditional probability of the two patterns, P(patt2 | patt1). A rule is formed if the antecedent is a sub-pattern of the consequent, and it is of interest only if it has a high confidence value:

confidence = support_Y / support_X.     (6)

Temporal association rule mining [43] is used to discover time-dependent correlations between events in data. Briefly, the algorithm (cf. Algorithm 2) operates as follows: a temporal association rule is constructed from every pair of frequent temporal patterns, patt1 and patt2, where the antecedent pattern has to be at least a 2-pattern. A rule is generated if its confidence is higher than minC. An example of a temporal rule [2-pattern → 3-pattern] is illustrated in Fig. 6:

(B starts A)[4.5 s] → (overlaps(B starts A), C)[1.25 s] {confidence = 100%}.

It shows that events A and B start together, followed by the co-occurrence of events A and C.

Table 8
Junction rules for alternate traffic light cycle.

starts(A,C) [2.5] → before(starts(A,C),D) [4], {75%}
meets(A,D) [2] → finishes(meets(A,D),B) [3], {50%}

Algorithm 2. Temporal association rule generation.
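A sketch of forward rule generation in the spirit of Algorithm 2 and Eq. (6) (not the authors' code): every frequent pattern that is a proper sub-pattern of another yields a candidate rule, which is kept when its confidence reaches minC. Patterns are represented here, for simplicity, as frozensets of 2-pattern components, and the supports in the toy example are hypothetical.

```python
def generate_rules(frequent_patterns, min_conf):
    """Forward temporal association rules: for every pair where the antecedent is a
    proper sub-pattern of the consequent, keep the rule if
    confidence = support(consequent) / support(antecedent) >= min_conf (Eq. (6)).
    `frequent_patterns` maps a pattern (frozenset of (relation, e1, e2) components)
    to its support."""
    rules = []
    for ante, sup_a in frequent_patterns.items():
        for cons, sup_c in frequent_patterns.items():
            if ante < cons:                       # proper sub-pattern check
                conf = sup_c / sup_a
                if conf >= min_conf:
                    rules.append((ante, cons, conf))
    return rules

# Hypothetical toy supports, expressed as fractions of sequences
patterns = {frozenset({('s', 'A', 'B')}): 0.04,
            frozenset({('s', 'A', 'B'), ('m', 'B', 'C')}): 0.02}
print(generate_rules(patterns, min_conf=0.5))     # one rule with confidence 0.5
```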
Fig. 10. Roundabout: traffic light cycle.

5.3.1. Complexity

Rule generation has almost negligible running time compared to frequent pattern discovery. For n k-patterns with a maximum of s events, the complexity is O(2sn). Usually, the resulting number of frequent k-patterns is limited (cf. Table 4), and hence temporal association rule generation is very efficient.

6. Hierarchical spatio-temporal anomaly detection

6.1. Spatial level

Each trajectory cluster defines a single event, and each event is represented by its cluster centroid. That is, the centroid models the general appearance of trajectories for any given event [37]. Having obtained the individual events in the scene, trajectories in test clips are classified to their respective event categories. The nearest-neighbor classification scheme is utilized for this purpose, where the distance of each test trajectory to all centroid trajectories is computed using the DTW distance measure. In order to classify trajectories with similar spatial characteristics but opposite direction, we calculate the direction vector of the test trajectory, followed by the direction vector of the centroid trajectory of the cluster it is classified to. Next, the DTW distance between these two directional vectors is obtained. If the distance is greater than the threshold of the cluster it belongs to, the trajectory is regarded as an outlier. The threshold is the maximum distance among all trajectories in a cluster with respect to its centroid trajectory. This directional information helps find anomalies that deviate greatly from the centroid trajectory of the membership cluster, such as illegal U-turns or cars moving in the opposite direction, as shown in Fig. 11.

Table 9
Roundabout rules governing the scene.

before(starts(B,A),D) [7] → finishes(before(starts(B,A),D),C) [2] {100%}
before(F,B) [7] → finishes(before(F,B),A) [3] {100%}

Algorithm 3. Spatio-temporal anomaly detection.

Fig. 11. Junction: illegal U-turn.

6.2. Spatio-temporal level

To detect spatio-temporal anomalies, the first step is to obtain the event sequence from the test video. Next, a composite event pattern is extracted from the test event sequence. For anomaly detection, patterns are matched in a hierarchical manner against the frequent patterns obtained from the training set, after pruning has been done. The procedure starts by comparing 2-patterns from the test sequence to the 2-patterns from the set of frequent patterns. Additionally, the law of transitivity [44] is incorporated in the matching process to avoid false positives. The composite test event time duration should also be within an acceptable range, based on the durations of the trained frequent patterns:

μ_trainP − 2 σ_trainP ≤ duration_testP ≤ μ_trainP + 2 σ_trainP.     (7)

The same procedure is repeated for patterns at higher levels. If a mismatch is encountered, the test sequence is classified as an anomaly. The hierarchical anomaly detection procedure is listed as Algorithm 3.
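The hierarchical check could be sketched roughly as below (assumptions: `dtw` is a trajectory distance such as the earlier DTW sketch, trajectories are NumPy arrays, and trained patterns are summarised by mean and standard-deviation durations as in Eq. (7)); this is illustrative and not the authors' Algorithm 3.

```python
import numpy as np

def is_spatial_anomaly(traj, centroids, thresholds, dtw):
    """Spatial level (Section 6.1), roughly sketched: assign the test trajectory
    (an (n, 2) array) to its nearest event centroid under DTW, then flag it when
    the DTW distance between the direction vectors exceeds that cluster's
    learned threshold (the maximum intra-cluster distance)."""
    k = int(np.argmin([dtw(traj, c) for c in centroids]))
    dir_dist = dtw(np.diff(traj, axis=0), np.diff(centroids[k], axis=0))
    return dir_dist > thresholds[k], k

def is_spatio_temporal_anomaly(test_patterns, trained):
    """Spatio-temporal level (Section 6.2): every composite pattern extracted from
    the test sequence must match a trained frequent pattern, and its duration must
    fall inside the Eq. (7) range mu +/- 2*sigma. `trained` maps a pattern key,
    e.g. ('s', 'A', 'B'), to its (mean, std) duration."""
    for key, duration in test_patterns:
        if key not in trained:
            return True                                   # unseen composite event
        mu, sigma = trained[key]
        if not (mu - 2 * sigma <= duration <= mu + 2 * sigma):
            return True                                   # implausible duration
    return False

# Toy check: starts(F,A) never became frequent in Junction training (cf. Fig. 13)
trained = {('s', 'A', 'B'): (4.5, 0.8)}                   # hypothetical statistics
print(is_spatio_temporal_anomaly([(('s', 'F', 'A'), 3.0)], trained))   # -> True
```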
7. Experimental evaluation

7.1. Datasets

We test our system on two public datasets [45]. These datasets feature complex activities between numerous agents in the scene, governed by traffic lights.

7.1.1. Junction

The Junction dataset contains a busy street intersection where the traffic flow in different directions is regulated by the traffic lights, which imposes a temporal order on the scene behavior. The video is approximately 50 min long, recorded at ps, with a frame size of 360 × 288. The first 8 min of the video were used for training, while the remainder was retained for testing purposes.

7.1.2. Roundabout

This is a more complex dataset, as vehicles follow a circular path. Similar to Junction, traffic flows in different directions are regulated by traffic lights. The video is approximately 62 min long, recorded at 25 fps, with a frame size of 360 × 288. The video is divided into two parts: the training set (12 min) and the remainder for testing.

7.1.3. Motion segmentation

The raw trajectories for both datasets were shown above in Fig. 2. The number of clusters (i.e., dominant routes) was found to be 8 for Junction and 5 for Roundabout.

7.2. Discovered frequent temporal patterns

Having classified trajectories into events, the event sequences for each clip can now be formulated. Each event sequence has a list of events, each with its start time and end time. Some sample event sequences from Junction are shown in Table 3, where [A 0 5] in the first sequence indicates that event A lasts for 5 s from the start time within that sequence.

A frequent temporal pattern mining algorithm, as defined in Section 5.2, is used to learn recurring patterns in the scene. One of the advantages of using a temporal mining technique is its fast speed and low computational cost. These event sequences are used as input to the mining algorithm, and the output is composed of sets of k-patterns. Each k-pattern shows a composite event with its time duration and support. Sequences in these datasets are short and thus the resulting patterns have a low frequency count; consequently, the value of the minS threshold has to be low.
Fig. 12. Roundabout: driving in the wrong direction.
Fig. 13. Junction: fire truck interrupting traffic.
Experiments were conducted with different values of the minS threshold, and the resulting number of patterns for each value is shown in Fig. 7a. Based on this experimentation, minS is set to 2% for both datasets. This threshold ensures that we produce neither excessive patterns nor too few. For Junction, this resulted in a total of 92 patterns, while Roundabout has 29 patterns in all. Roundabout has fewer activities, and hence a small number of patterns is sufficient to model its underlying dynamic behavior. The total numbers of frequent k-pattern sets before and after pruning for both datasets are shown in Table 4.

Lastly, we discuss the k-pattern representations. Each frequent temporal pattern shows atomic events along with their relations, appended with its average time duration. Additionally, the support of each pattern, shown as a percentage here, is appended at the end of the pattern. A sample of k-patterns mined from the Junction dataset is shown in Table 5, while sample k-patterns resulting from the Roundabout training set are shown in Table 6. The resulting patterns are composed using all seven of Allen's temporal relations. We have listed only a few patterns for each relation, and the first 2-pattern from each category is explained as:

1) before(D,E)[4.5], {4.35%}: event D occurs 4.5 s before event E.
2) equals(A,B)[11], {2.17%}: event A and event B occur together for 11 s.
3) meets(B,C)[8.5], {4.35%}: event B finishes after 8.5 s, and event C follows.
4) starts(C,D)[4.5], {4.35%}: event C and event D start together, 4.5 s into the cycle.
5) finishes(D,C)[5.5], {4.35%}: event D and event C finish together after 5.5 s.
6) during(A,B)[6], {6.52%}: event B occurs alongside event A for 6 s.
7) overlaps(B,D)[1], {2.17%}: event B and event D co-occur for 1 s.

A support value of 4.35% indicates the percentage of sequences in the dataset that contain this relationship.

7.3. Generated temporal association rules

All possible forward rules are enumerated using Algorithm 2, as defined in Section 5. Rules are retained only if their confidence value is higher than minC. The accuracy for different values of minC is computed, and the results are displayed in Fig. 7b. Based on this, minC for both datasets is set to 50%. The dominant traffic cycle, shown in Fig. 8, can be explained using four temporal rules (cf. Table 7):

1) Rule 1: Events A and B start together, representing a vertical flow in opposite directions. From the 2-pattern equals(A,B) in Table 5, it is clear that they last for the same time duration. Event C starts when A and B finish (meets relationship). The rule has a confidence value of 50%, implying that the rule holds for only 50% of the sequences containing the 2-pattern.
2) Rule 2: Events C and D start together, indicating left- and right-turning traffic flow in the scene. The 2-pattern equals(C,D) (Table 5) shows that they last for the same duration. Event G takes place after C and D are done. Moreover, the overlaps relation implies that events G and E co-occur for a short time interval.
3) Rule 3: Event G ends before F starts. Event F is contained within H's interval.
4) Rule 4: This follows from Rule 3, where event F is contained within H's interval. This is followed by event A, thus completing the cycle.

Junction is a street intersection with a complex behavior. Therefore, the method also produces an alternate traffic cycle, illustrated in Fig. 9. Only rules relating to events A, B, C and D are affected; these rules are shown in Table 8. There are two main differences between this cycle and the previous one (cf. Fig. 8):

• Vertical flow (A) is interleaved with rightward-turning traffic (C).
• Vertical flow (B) is interleaved with leftward-turning traffic (D).
Fig. 14. Roundabout: jumping the red light.
Table 10
Junction and Roundabout clustering accuracy.

              Ours     EMD-L1 [28]    Standard pLSA [36]
Junction      95.6%    92.4%          89.7%
Roundabout    95.8%    86.4%          84.5%

Table 12
Roundabout typical event comparison.

                        Ours    Zen and Ricci [28]    Emonet et al. [15]
Horizontal flow (AB)    Yes     Only A                Yes
Vertical flow (CDE)     Yes     Only D                Yes
Cycle                   1       0                     0

The rules governing the Roundabout traffic flow sequence, as shown in Fig. 10, are listed in Table 9. The rules can be interpreted as follows:

1) Rule 1: Events A and B co-occur for the same duration {equals(B,A)}. This activity is followed by event D. Events C and D start and finish together {equals(C,D)}. The confidence value for the rule is 100%, which implies that for 100% of the sequences containing the 3-pattern (left-hand side), the 4-pattern also occurs.
2) Rule 2: Lastly, the vertical traffic flow denoted by F precedes events A and B.

7.4. Spatial anomaly detection

Given a test trajectory, we have to classify it to the set of obtained activity clusters (A, B, …). A trajectory is considered atypical if it does not match any representative clustered route. The direction vectors are computed to check whether the trajectories have the same direction of motion. If not, the trajectory is an outlier and the test clip denotes a spatial anomaly. This type of anomaly was discovered in both Junction and Roundabout. For instance, in the Junction dataset, a car performs an illegal U-turn and disrupts the traffic flow; the outlying trajectory, along with other test trajectories, is shown in Fig. 11. An example from the Roundabout dataset is shown in Fig. 12, where a vehicle interrupts the traffic flow by moving in the reverse direction of the flow.

7.5. Spatio-temporal anomaly detection

The temporal patterns extracted from the test sequences are matched with the frequent temporal patterns learned from the training set. For the Junction dataset, an instance of an anomalous interaction is shown in Fig. 13. In this test clip, a fire truck disrupts the traffic flow. The resulting pattern from the test sequence that is not part of the frequent patterns is starts(F,A), implying that events A and F cannot co-occur; therefore, it is considered anomalous. An example of a normal behavior can be seen in Fig. 6, and the corresponding temporal rule is shown in Section 5.3. Furthermore, a comprehensive list of the various types of anomalies in Junction can be found in Table 13. In the Roundabout dataset, an example of an anomalous interaction is shown in Fig. 14. This test sequence indicates an example of reckless driving, as event A and event D cannot co-occur. The resulting pattern from the test sequence that is not part of the frequent patterns is overlaps(D,A); thus, it is an anomaly. Table 14 lists the different anomalies found in Roundabout.

7.6. Comparative analysis

The Junction and Roundabout datasets have been used extensively in the literature for dynamic scene analysis. These traffic scenes are more challenging than the MIT dataset [15,19,26], as the scene is busier and more complex. We compare the typical behaviors as well as the anomalies with state-of-the-art approaches and demonstrate that our method compares favorably.

In our experiments, we identified individual traffic flows, where each flow represents an atomic event. In Junction, 8 different flow directions were discovered, while in Roundabout, 5 were identified. We compare our clustering accuracy (using Eq. (3)) to other methods and present slightly better results in Table 10. The complex spatio-temporal behavior between the events is modeled using temporal patterns and rules, and the resulting traffic cycles are shown in Figs. 8, 9 and 10. Most of the existing approaches were successful in finding the typical activities in Junction (cf. Table 11), even though some do not recover the traffic light cycles. However, very few existing works attempt to model Roundabout due to its complex nature; Table 12 shows two approaches that found the dominant activities without the traffic cycle.

In contrast, only a few existing approaches [10,26,28,36] perform anomaly detection. A fair comparison with [28,36] is not possible, since they only show instances of 3 and 7 anomalies, respectively, along with an anomaly score. To quantitatively compare the anomalies discovered in Junction with [26,10], we have tested on the same duration of the video (around 40 min). Based on the ground truth, we determine the accuracy of the detection process in terms of: 1) True Positive (TP): an abnormal test sequence is classified as abnormal; 2) False Positive (FP): normal behavior is classified as abnormal.

The various classes of anomalies discovered in Junction are shown in Table 13. The Drive Wrong Way category, which disrupts the normal traffic flow, consists mostly of fire trucks and police cars. The Normal category indicates a behavior which is tagged as abnormal. Our method was able to find single-agent (i.e., illegal U-turns) as well as multi-agent anomalies, and the combined accuracy is depicted in the table. In order to make a better comparison, we add a column depicting the accuracy (%) for each anomaly detected by the different methods. Our accuracy is comparably much higher, and we model all the events in a single clip. Also, we correlate events with one another with an explicit time duration to find abnormalities easily. Moreover, spatio-temporal interactions are well-defined in longer clips, whereas the drawback of probabilistic topic models is that they might fail to detect abnormal events occurring alongside regular events within the same clip; this is apparent in their low accuracy. For the Roundabout dataset, none of the existing works perform anomaly detection. Therefore, we report our anomaly detection results (cf. Table 14) based on the ground truth provided in [45], and the accuracy is reasonably high.
Table 11
Junction typical event comparison.

                                  Ours    Hospedales et al. [26]    Zen and Ricci [28]    Loy et al. [18]    Kuettel et al. [1]    Emonet et al. [15]
Vertical flow (AB)                Yes     Yes                       Yes                   Yes                Yes                   Yes
Interleaved vertical flow (CD)    Yes     Yes                       No                    Yes                Yes                   Yes
Horizontal flow (EF)              Yes     Yes                       Yes                   Yes                Yes                   Yes
Horizontal turn flow (GH)         Yes     Only H                    No                    Yes                Yes                   Yes
Cycles                            2       1                         0                     0                  2                     1
Table 13
Junction anomaly detection — quantitative analysis. GT stands for ground truth and FTP stands for frequent temporal patterns.

Ours (clip size: 12 s, 300 f)
                    GT     FTP    Accuracy
Break red light     7      7      100%
Illegal U-turn      12     10     83%
Drive wrong way     4      3      75%
Unusual turns       2      2      100%
Normal              185    183    98%
TP (%)                     88.0
FP (%)                     1.1

Hospedales et al. [26] (clip size: 1 s, 25 f)
                    GT      MCTM    Accuracy
Break red light     13      4       31%
Illegal U-turn      15      5       33%
Drive wrong way     15      12      80%
Unusual turns       10      6       60%
Normal              2663    2636    99%
TP (%)                      52.8
FP (%)                      1.0

Jouneau and Carincotte [10] (clip size: 2 s, 60 f)
                    HDP-HMM    Accuracy
Break red light     –          –
Illegal U-turn      3          27%
Drive wrong way     11         69%
Unusual turns       –          –
Normal              8          –
TP (%)              51.9
FP (%)              –
Table 14
Roundabout anomaly detection. GT stands for ground truth and FTP stands for frequent temporal patterns. Clip size: 12 s (300 f).

                    GT    FTP
Break red light     3     3
Drive wrong way     3     2
Normal              53    50
TP (%)                    83.3
FP (%)                    5.7
8. Conclusions

In this work, we have proposed a method that analyzes traffic patterns and detects irregular events. To the best of our knowledge, temporal mining techniques have not previously been used for event recognition in dynamic scenes. We first discover frequent temporal patterns and use Allen's temporal relations [8] for their representation; the time duration of composite events is included in the patterns as well. Temporal association rules are then generated from these frequent patterns. These association rules help model the traffic cycle sequence and detect anomalous behavior. Recent works describe the traffic cycle as an ordered sequence of activities; we, on the other hand, indicate exactly how two events are related (overlaps, equals, starts, etc.) and the time interval they occupy. Spatio-temporal anomalies are identified and detected in a hierarchical manner. We show results on two standard public datasets and demonstrate considerable improvement over the current methods.
References
[1] D. Kuettel, M. Breitenstein, L. Van Gool, V. Ferrari, What's going on? Discovering spatio-temporal dependencies in dynamic scenes, IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1951–1958.
[2] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.
[3] B. Zhou, X. Wang, X. Tang, Random field topic model for semantic region analysis in crowded scenes from tracklets, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3441–3448.
[4] U. Von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416.
[5] D. Damen, D. Hogg, Recognizing linked events: searching the space of feasible explanations, IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 927–934.
[6] L. Kratz, K. Nishino, Spatio-temporal motion pattern modeling of extremely crowded scenes, The 1st International Workshop on Machine Learning for Vision-based Motion Analysis, 2008.
[7] N. Harikrishna, S. Satheesh, S. Sriram, K. Easwarakumar, Temporal classification of events in cricket videos, National Conference on Communications (NCC), IEEE, 2011, pp. 1–5.
[8] J. Allen, G. Ferguson, Actions and events in interval temporal logic, J. Log. Comput. 4 (5) (1994) 531.
[9] D. Patel, W. Hsu, M. Lee, Mining relationships among interval-based events for classification, Proceedings of the SIGMOD International Conference on Management of Data, ACM, 2008, pp. 393–404.
[10] E. Jouneau, C. Carincotte, Particle-based tracking model for automatic anomaly detection, ICIP, 2011.
[11] V. Morariu, L. Davis, Multi-agent event recognition in structured scenarios, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3289–3296.
[12] Z. Zhang, K. Huang, T. Tan, L. Wang, Trajectory series analysis based event rule induction for visual surveillance, IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[13] T. Hospedales, J. Li, S. Gong, T. Xiang, Identifying rare and subtle behaviours: a weakly supervised joint topic model, IEEE Trans. Pattern Anal. Mach. Intell. 99 (2011).
[14] C. Stauffer, W. Grimson, Learning patterns of activity using real-time tracking, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 747–757.
[15] R. Emonet, J. Varadarajan, J. Odobez, Extracting and locating temporal motifs in video scenes using a hierarchical nonparametric Bayesian model, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3233–3240.
[16] J. Li, S. Gong, T. Xiang, Discovering multi-camera behaviour correlations for on-the-fly global activity prediction and anomaly detection, IEEE International Workshop on Visual Surveillance, Kyoto, Japan, 2009.
[17] J. Varadarajan, J. Odobez, Topic models for scene analysis and abnormality detection, IEEE 12th International Conference on Computer Vision Workshops, 2009, pp. 1338–1345.
[18] C. Loy, T. Xiang, S. Gong, Stream-based active unusual event detection, ACCV, 2010, pp. 161–175.
[19] L. Song, F. Jiang, Z. Shi, A. Katsaggelos, Understanding dynamic scenes by hierarchical motion pattern mining, IEEE International Conference on Multimedia and Expo, 2011, pp. 1–6.
[20] Y. Yang, J. Liu, M. Shah, Video scene understanding using multi-scale analysis, IEEE 12th International Conference on Computer Vision, 2009, pp. 1669–1676.
[21] I. Saleemi, L. Hartung, M. Shah, Scene understanding by statistical modeling of motion patterns, IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2069–2076.
[22] A. Hervieu, P. Bouthemy, J. Le Cadre, A statistical video content recognition method using invariant features on object trajectories, IEEE Trans. Circ. Syst. Video Technol. 18 (11) (2008) 1533–1543.
[23] Y. Shi, A. Bobick, I. Essa, Learning temporal sequence model from partially labeled data, Computer Vision and Pattern Recognition, 2006, pp. 1631–1638.
[24] Y. Benezeth, P. Jodoin, V. Saligrama, C. Rosenberger, Abnormal events detection based on spatio-temporal co-occurences, IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2458–2465.
[25] K. Prabhakar, S. Oh, P. Wang, G. Abowd, J. Rehg, Temporal causality for the analysis of visual events, IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1967–1974.
[26] T. Hospedales, S. Gong, T. Xiang, A Markov clustering topic model for mining behaviour in video, IEEE 12th International Conference on Computer Vision, 2009, pp. 1165–1172.
[27] J. Varadarajan, R. Emonet, J.-M. Odobez, A sequential topic model for mining recurrent activities from long term video logs, Int. J. Comput. Vis. (2013) 1–27.
[28] G. Zen, E. Ricci, Earth mover's prototypes: a convex learning approach for discovering activity patterns in dynamic scenes, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3225–3232.
[29] W. Brendel, A. Fern, S. Todorovic, Probabilistic event logic for interval-based event recognition, IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3329–3336.
[30] M. Sridhar, A.G. Cohn, D.C. Hogg, Unsupervised learning of event classes from video, AAAI, 2010.
[31] Y. Zhang, E. Swears, N. Larios, Z. Wang, Q. Ji, Modeling temporal interactions with interval temporal Bayesian networks for complex activity recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2013) 1.
[32] R. Hamid, S. Maddi, A. Johnson, A. Bobick, I. Essa, C. Isbell, A novel sequence representation for unsupervised analysis of human activities, Artif. Intell. 173 (14) (2009) 1221–1244.
[33] B. SivaSelvan, N. Gopalan, Efficient algorithms for video association mining, Adv. Artif. Intell. (2007) 250–260.
[34] V. Jakkula, D. Cook, A. Crandall, Temporal pattern discovery for anomaly detection in a smart home, 3rd IET International Conference on Intelligent Environments, 2007, pp. 339–345.
[35] A. Yilmaz, K. Shafique, N. Lobo, X. Li, T. Olson, M. Shah, Target-tracking in FLIR imagery using mean-shift and global motion compensation, IEEE Workshop on Computer Vision Beyond the Visible Spectrum, 2001, pp. 54–58.
[36] J. Li, S. Gong, T. Xiang, Scene segmentation for behaviour correlation, ECCV, 2008, pp. 383–395.
[37] I. Junejo, H. Foroosh, Euclidean path modeling for video surveillance, Image Vis. Comput. 26 (4) (2008) 512–528.
[38] P. Perona, L. Zelnik-Manor, Self-tuning spectral clustering, Adv. Neural Inf. Process. Syst. 17 (2004) 1601–1608.
[39] S. Atev, O. Masoud, N. Papanikolopoulos, Learning traffic patterns at intersections by spectral clustering of motion trajectories, IEEE International Conference on Intelligent Robots and Systems, 2006, pp. 4851–4856.
[40] B. Morris, M. Trivedi, Learning trajectory patterns by clustering: experimental studies and comparative evaluation, IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 312–319.
[41] S. Laxman, P. Sastry, A survey of temporal data mining, SADHANA Acad. Proc. Eng. Sci. 31 (2) (2006) 173–198.
[42] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, ACM SIGMOD Record, vol. 22, no. 2, 1993, pp. 207–216.
[43] E. Winarko, J. Roddick, Discovering richer temporal association rules from interval-based data, Data Warehous. Knowl. Discov. (2005) 315–325.
[44] F. Höppner, Learning temporal rules from state sequences, IJCAI Workshop on Learning from Temporal and Spatial Data, vol. 25, 2001.
[45] Datasets, http://www.eecs.qmul.ac.uk/~jianli/Dataset-List.html.