Author’s Accepted Manuscript Combining Motion and Appearance Cues for Anomaly Detection Ying Zhang, Huchuan Lu, Lihe Zhang, Xiang Ruan
PII: S0031-3203(15)00336-2
DOI: http://dx.doi.org/10.1016/j.patcog.2015.09.005
Reference: PR5509
To appear in: Pattern Recognition. Received 14 April 2015; revised 14 August 2015; accepted 7 September 2015.
Combining Motion and Appearance Cues for Anomaly Detection
Ying Zhang(a), Huchuan Lu(a,∗∗), Lihe Zhang(a), Xiang Ruan(b)
(a) School of Information and Communication Engineering, Dalian University of Technology, Dalian, 116023, China
(b) OMRON Corporation, Kusatsu, Japan
Abstract

In this paper, we present a novel anomaly detection framework which integrates motion and appearance cues to detect abnormal objects and behaviors in video. For motion anomaly detection, we employ statistical histograms to model the normal motion distributions and propose the notion of a "cut-bin" in histograms to distinguish unusual motions. For appearance anomaly detection, we develop a novel scheme based on Support Vector Data Description (SVDD), which obtains a spherically shaped boundary around the normal objects to exclude abnormal objects. The two complementary cues are finally combined to achieve more comprehensive detection results. Experimental results show that the proposed approach can effectively locate abnormal objects in multiple public video scenarios, achieving comparable performance to other state-of-the-art anomaly detection techniques.

Keywords: Anomaly detection, motion model, appearance model, Support Vector Data Description (SVDD)
∗∗ Corresponding author. Email: [email protected]; Tel./Fax: 86-411-84708971.
Preprint submitted to Pattern Recognition, September 22, 2015.
1. Introduction

Intelligent video surveillance [1, 2] has attracted more and more attention in recent years. Among the many research objectives, anomaly detection from video sequences plays an important role in discovering various irregularities, such as restricted-area access [3], wrong direction or route [4, 5, 6], people carrying cases [7], falling [8, 9, 10], group fighting and panics [11], a car making an illegal U-turn [12, 13], jaywalkers [12], and other unusual events. Although anomaly detection has been successfully used in intelligent transportation systems and in security alarm systems for houses, offices and public places, it still faces a series of challenges, which can be summarized as follows: (1) It is hard to define representative normal regions in observed video streams; (2) The boundary between normal and abnormal objects is often ambiguous; (3) The exact notion of anomaly is rather subjective and changes over different applications; (4) It is tricky to label abnormal behaviors in videos; (5) Normal behaviors tend to keep evolving; (6) Observed data may contain noise. A great diversity of approaches have been proposed to solve one or more of the problems mentioned above. Most of them were designed to work for specific scenarios, where different representations of motion and appearance were analyzed with different models. A large majority of these methods detect abnormalities following two main assumptions:

• Abnormal events rarely occur in video sequences [14, 15, 16, 17];
• Abnormal events have low similarities to normal events [4, 18, 19, 20].
In this paper we consider the properties of abnormal instances from the perspective of human cognition. Unusual objects are rare things with unexpected appearance or motion patterns. In order to detect anomalies generically, we argue that any abnormal object has at least one of the following characteristics:

• Motion anomaly, such as objects moving at an unexpected speed, in a wrong direction, along unplanned routes, or at undesired locations;
• Appearance anomaly, such as strange postures or unidentified objects.

Many abnormal objects have both of the above characteristics. In this paper, we propose a novel framework that explores motion and appearance cues for anomaly detection. For appearance anomaly detection, we employ spatio-temporal gradients as appearance descriptors and develop a novel scheme based on Support Vector Data Description (SVDD) [21], which obtains a spherically shaped boundary in the non-linearly transformed feature space to include most of the normal samples, while test examples falling outside the boundary are considered unusual. For motion anomaly detection, we utilize optical flow vectors to characterize motion features and describe the distribution of normal motion patterns via histogram statistics, where we introduce the notion of a "cut-bin", analogous to the boundary in SVDD. A new observation is regarded as anomalous if it crosses the line of the "cut-bin". The overall abnormal objects are located according to the results of both motion and appearance anomaly detection. The proposed method is robust for discovering multiple unusual events in different scenarios. Figure 1 shows the overview of our approach.
Figure 1: Overview of the anomaly detection algorithm. For motion anomaly detection, optical flow vectors are computed and aggregated into sorted histograms, and the cut-bin of the histogram at each location is calculated to judge abnormalities. For appearance anomaly detection, we extract spatio-temporal gradients from normal samples and train an SVDD model to identify anomalies.
2. Related Work and Paper Contents

2.1. Related Work

Many approaches have been proposed for abnormality detection. According to whether samples of normal and abnormal activities are needed for training or initialization before anomaly detection, they can be broadly classified into three fundamental types: supervised, semi-supervised and unsupervised approaches.

The first type of anomaly detection method requires labels for both normal and abnormal samples, which generally involves training classifiers. These methods are customarily designed for specific abnormal behaviors whose properties are predefined by people, such as falling detection [9, 22, 10], fighting detection [23] and traffic violation detection [12, 13]. Velocity and trajectories act as the two most widely used cues to classify normal and abnormal cases.

The second type of approach only needs normal data for training, which is the most popular technique adopted by researchers. These methods can be further divided into two sub-categories: rule based methods and model based methods. Rule based approaches attempt to establish a rule such that any sample which breaks the rule is considered irregular. Sparse coding [24, 4, 25] is widely used in rule based approaches, with a larger reconstruction cost indicating a higher probability of being abnormal. Online dictionary updating [4] has been added to the coding framework to handle concept drift, while Lu et al. [25] learned sparse combinations to speed up the coding process, reaching a speed of 140 ∼ 150 frames per second on average. However, the detection results are greatly affected by the threshold, which often varies over different scenes. Similarity based methods are also exploited by many researchers [26, 18, 19, 20]; the abnormal score of a test sample is measured by its similarity to the training samples. Clustering [20] and sub-class discovering [19] were adopted to speed up the computation. Since the differences between normal activities are usually not obvious, clustering might not work as well as expected. Model based approaches try to build a model for normal behaviors, and instances of low probability with respect to the model are rejected as anomalies. The most widely used models are the Markov Random Field (MRF) model [14, 16] and the Hidden Markov Model (HMM) [27, 28, 29, 30], which have been extended in various applications. Kim et al. [14] introduced a space-time
MRF model characterizing the distribution of normal motion patterns, while Benezeth et al. [16] used an MRF to describe the co-occurrence distribution of normal observations. Kratz et al. [27] modeled the temporal and spatial relationships between local spatio-temporal motion patterns using a distribution-based HMM and a coupled HMM respectively. Andrade et al. [29] grouped video segments into several different classes using spectral clustering and trained a Multiple Observation Hidden Markov Model (MOHMM) for each class. Some other models have also been designed for anomaly detection, such as the social force model (SFM) proposed by Mehran et al. [5] for describing individual motion dynamics and interaction forces in crowds, and the mixture of dynamic textures (MDT) [31] modeling normal occurrences both in space and time. Adam et al. [15] presented a simple and fast algorithm for unusual event detection based on optical flow histogram statistics of normal activities collected by multiple fixed-location monitors, while Wu et al. [6] learned a normal scene model using chaotic invariants of Lagrangian particle trajectories to characterize crowded scenes. Energy models have also been used to predict irregularities. Cui et al. [11] proposed an interaction energy potential function to model the action and interaction with the surroundings of normal objects changing over time, and events with sudden changes in the function are likely to be unusual. Entropy calculated on the spatio-temporal information of interest points is used to measure disorder in [32]. Kwon et al. [33] developed a graph editing framework with a predefined energy model whose parameters reflect causality, frequency, and significance of events, which can be used for event summarization and rare event detection. The energy based methods are often sensitive to multiple parameters, and a particular energy function
can only function well for specific scenes.

The third type of approach requires neither normal nor abnormal examples in advance, and anomaly detection is based on the assumption that anomalies are very rare compared to normal data. Hu et al. [17] provided a scan statistic method using windows of various shapes and sizes to scan a video; the abnormality of a window is computed by the likelihood ratio test statistic. Zhong et al. [34] employed document keyword analysis to compute a co-occurrence matrix reflecting the relationship between prototype features and video segments, and video segments with a small inter-cluster similarity are considered unusual events.

Our work belongs to the semi-supervised, model-based approaches. We focus on the statistical distribution of all observations in this paper; abnormal activities are regarded as outliers disobeying the distribution of normal samples. We avoid complex distribution models with too many parameters and develop simpler but effective models for anomaly detection. An anomalous object is characterized from both the motion and appearance views, which are shown to be complementary by extensive experimental results.

2.2. Paper Contributions and Organization

The proposed algorithm has significant advantages over previous methods. Firstly, we emphasize both the motion and appearance information of the observations, which helps detect abnormalities more comprehensively and accurately. When a car or bicycle drives very slowly through crowds on the sidewalk, many previous methods [15, 24] that focus on motion features lose the capability to judge the violation. Secondly, the proposed algorithm can be applied both in crowded and uncrowded
scenes. Different from the methods [17, 31] which compare the target behavior with the surrounding activities and only aim at crowded scenes, we can estimate the abnormal degree of each observation with the established models whether the scene is crowded or uncrowded. Furthermore, the proposed approach is easy to implement and has low sensitivity to its parameters. We develop straightforward but effective statistical models which find the boundary that distinguishes the anomalies, and the experimental results also show that the parameters change very little across different datasets to achieve the best performance. The contributions of this paper are summarized as follows:

• We propose a novel assumption that the properties of abnormal events are reflected in two aspects, motion anomaly and appearance anomaly, and develop an algorithm integrating both motion and appearance cues for video anomaly detection.
• For motion anomaly detection, we propose a novel notion of "cut-bin" in the statistical histogram of optical flows to distinguish irregular motions.
• For appearance anomaly detection, we present a scheme based on Support Vector Data Description (SVDD) to find a spherical boundary around normal samples for excluding unusual objects.

We organize the remainder of this paper as follows. In Section 3, we separately elaborate on motion and appearance anomaly detection, as well as the combination of the two cues for video anomaly detection. Section 4 provides the experimental results on three public datasets and the comparison
with state-of-the-art methods to demonstrate the effectiveness of the proposed method. Finally, we conclude this paper in Section 5.

3. Anomaly Detection

In this section, we present in detail how to detect abnormal objects according to motion patterns and appearance, both in space and time. Our motion model is a histogram-based statistical model which uses an adaptive "cut-bin" instead of various thresholds [15] to identify anomalies, while our appearance model is an SVDD-based model that seeks a spherical boundary to exclude irregular objects.

3.1. Motion Anomaly Detection

Motion anomaly detection is inspired by the multiple fixed-location monitors which collect low-level statistics [15]. While they require multiple parameters or thresholds to produce an "alarming frame", we introduce the notion of "cut-bin", avoiding numerous parameters for detecting anomalies. As a popular method of motion estimation in computer vision, optical flow is defined as the apparent motion of brightness patterns in the image. Generally, optical flow corresponds to the instantaneous velocity of the moving pixels on objects, calculated by exploring the variation of pixels between consecutive frames. In this paper, we use the optical flow proposed by [35] as the low-level measure of activity for each pixel. Optical flow statistical histograms are constructed separately for each spatial location, in both direction and magnitude.

• Direction: Optical flow vectors are quantized into D directions from 0° ∼ 360°. The orientation of every single pixel falls into one of these
bins. The value of each bin corresponds to the number of times pixels move in the direction associated with the bin at a fixed location.
• Magnitude: The magnitude of each motion vector votes into one of M bins, and the range of each bin is tailored to the specific scene, which is easy to compute from magnitude statistics. The higher the value of a certain bin, the more frequently pixels move at the speed represented by that bin.

The direction histogram can be used to detect wrong-direction events at an entrance or exit gate, while the magnitude histogram is capable of detecting irregular speed, such as running or driving a car on the sidewalk. In surveillance videos, one or both of the histograms can be employed to detect abnormal events, depending on the actual scene and the associated requirements. In this section, we take irregular speed detection with the magnitude histogram as an example to illustrate the proposed method of detecting motion anomalies. We propose a histogram based method to predict speed irregularities via "cut-bins". First, we rank the bins in descending order according to their values; then the cut-bin is set following the principle that the ratio of the sum of the values of the bins before the "cut-bin" to the total values of all bins is not less than r, where 0 < r < 1 indicates the percentage of normal examples in all the training data. In general, r is set to about 0.99, which is designed to accept most of the data as normal with some flexibility to take noise into consideration. We compute the "cut-bin" of the sorted histogram for every fixed spatial location over all the training video frames. Figure 2 shows that we aggregate the magnitudes of optical flow into sorted histograms for "cut-bin" calculation at different locations. We show three
locations labeled as red points with different histogram statistics. We can see that the optical flow magnitudes of point 1 mainly distribute from the 1st bin to the 9th bin, showing the lower magnitudes observed in the training data, and the "cut-bin" of the corresponding histogram is the 5th bin. Point 2 has a "cut-bin" number of 1, which means that no moving objects pass through this point in normal settings. Point 3 ranges widely from the 1st bin to the 15th bin, with a "cut-bin" number of 8, indicating that more motion patterns appeared at this location. It is easy to conclude that the optical flow magnitude distribution as well as the "cut-bin" vary across locations, and a smaller distance to the camera may lead to a wider distribution and a larger "cut-bin".

Figure 2: "Cut-bin" of the magnitude histogram for different locations.
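The cut-bin computation described above — sort the bins in descending order and take the smallest prefix whose counts cover a fraction r of all observations — can be sketched as follows (a minimal illustration; the function name and array layout are ours, not the authors' implementation):

```python
import numpy as np

def cut_bin(hist, r=0.99):
    """Cut-bin of a per-location histogram (sketch of the text's rule).

    hist : 1-D array of bin counts for one pixel location.
    r    : fraction of training observations to accept as normal.

    Returns the cut-bin index T (1-based, counted in the sorted
    histogram): the smallest prefix of the descending-sorted bins whose
    counts sum to at least a fraction r of all counts.
    """
    sorted_hist = np.sort(hist)[::-1]       # rank bins by value, descending
    cum = np.cumsum(sorted_hist)
    total = cum[-1]
    if total == 0:
        return 1                            # no motion ever observed here
    # first position where the cumulative ratio reaches r
    return int(np.searchsorted(cum, r * total) + 1)
```

For a histogram concentrated in its first bin the cut-bin is 1, as for point 2 in Figure 2, while a location with a wide magnitude distribution gets a larger cut-bin.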
Given the sorted magnitude histogram H_p and the "cut-bin" number T_p of a certain point p, we measure the magnitude abnormality of a new observation at this pixel position based on the relationship between the bin it falls in and the cut-bin. The magnitude abnormality of the observation is computed as

$$A_{mag}(p) = \begin{cases} 0, & 0 < K \le T_p \\[4pt] w_p \left( 1 - \dfrac{S_K}{\max\left( \sum_{i=T_p+1}^{L} S_i,\ 1 \right)} \right), & T_p < K \le L \end{cases} \tag{1}$$
where K is the sequence number of the bin that the new observation falls in, L is the sequence number of the last bin of the sorted histogram, and S_i represents the value of the i-th bin. The likelihood of a test sample being anomalous is inversely proportional to the frequency with which it appeared in the training videos. w_p is the weight calculated by a sigmoid function with T_p and K as parameters to control the abnormal degree at the specific position:

$$w_p = \frac{1}{1 + \exp(-\lambda_1 \cdot (K - T_p))} \tag{2}$$
where λ_1 is the scale parameter. Figure 2 also demonstrates that the histogram distribution varies at different locations, so it is necessary to set different weights for different pixels. We can see that the magnitudes observed at point 1 are usually smaller than those at point 3, while the magnitudes of point 2 exist only in the first bin. If we have a new observation with a magnitude falling in the 6th bin, it is absolutely abnormal for point 2, somewhat abnormal for point 1, and normal for point 3. Therefore, given the same observation value, pixels with lower cut-bins are assigned higher abnormal weights according to Eq. 2.
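A direct transcription of Eqs. (1) and (2) might look like the following (a sketch under our own naming; bin indices are 1-based, as in the text):

```python
import numpy as np

def magnitude_abnormality(sorted_hist, T, K, lam1=1.2):
    """Abnormality of an observation falling into bin K of a sorted
    magnitude histogram with cut-bin T (sketch of Eqs. 1-2; names ours).

    sorted_hist : bin values S_1..S_L in descending order.
    T, K        : cut-bin index and falling-bin index (both 1-based).
    """
    if K <= T:
        return 0.0                               # within the normal bins
    # Eq. (2): position-dependent sigmoid weight
    w = 1.0 / (1.0 + np.exp(-lam1 * (K - T)))
    # Eq. (1): rarity of bin K relative to all bins beyond the cut-bin,
    # with the denominator floored at 1 to avoid division by zero
    S_K = sorted_hist[K - 1]
    tail = max(sorted_hist[T:].sum(), 1)         # sum_{i=T+1}^{L} S_i
    return float(w * (1.0 - S_K / tail))
```

Observations at or below the cut-bin score zero, while a magnitude never seen in training scores close to the full sigmoid weight.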
Figure 3: Abnormal weights for pixels with different cut-bins.
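A quick numerical check of Eq. (2) confirms the behavior plotted in Figure 3 (a standalone sketch; the helper name is ours):

```python
import math

def weight(K, T, lam1=1.2):
    # Eq. (2): sigmoid abnormal weight for falling bin K and cut-bin T
    return 1.0 / (1.0 + math.exp(-lam1 * (K - T)))

# A sample falling exactly on the cut-bin gets a weight of 0.5; weights
# grow toward 1 past the cut-bin and shrink toward 0 below it, and a
# lower cut-bin yields a higher weight for the same falling bin.
print(weight(5, 5))                    # 0.5
print(weight(10, 5) > weight(10, 8))   # True
```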
We show the assignment of abnormal weights for pixels with different cut-bins in Figure 3, in which the horizontal axis represents the falling bin of the test sample, the vertical axis denotes the abnormal weight w_p, and the colors of the curves indicate different values of the "cut-bin". We can see that as the cut-bin of a pixel increases, the probability of the pixel being abnormal decreases. Taking the blue line with "T=5" for example: when the "cut-bin" of a histogram is the 5th bin, if the falling bin number of a test example is larger than 5, then the abnormal weight is larger than 0.5; otherwise the weight is less than 0.5. The larger the falling bin, the larger the abnormal weight, implying higher confidence that the motion is abnormal. Therefore, the abnormal weight is able to adjust the abnormal degree according to the gap between the "cut-bin" and the falling bin for different pixels, leading to more stable and credible detection results.

Similar to irregular speed detection, the direction abnormality A_dir(p) of pixel p is calculated based on the "cut-bin" of the direction histogram. According to different scenes and associated requirements, the motion abnormal degree A_motion(p) is derived from the magnitude and direction abnormality in the following way:

• For irregular speed detection applications, A_motion(p) = A_mag(p).
• For wrong direction detection requirements, A_motion(p) = A_dir(p).
• For both kinds of detection, A_motion(p) = max{A_mag(p), A_dir(p)}.

Here we pick the larger abnormality when both kinds of detection are required, to avoid missed detections. The motion anomaly detection is capable of detecting various abnormal activities such as crossing the entrance or exit gate in the opposite direction, sudden running in case of an emergency, and violating traffic rules by running a red light or speeding. By exploiting the "cut-bin" of the statistical histogram for different locations, we obtain an adaptive boundary dividing normal and abnormal motions, increasing the robustness against perspective distortion and scene changes.

3.2. Appearance Anomaly Detection

It has been suggested that optical flow features are not powerful enough to detect anomalous occurrences in terms of joint appearance and motion [31]. The properties of a moving object are mainly reflected in two aspects: the motion patterns and the appearance. Optical flow can be used to detect abnormal motion patterns such as wrong direction and irregular speed, while it loses its advantages in detecting unexpected objects. Therefore we need to develop appearance anomaly detection as a supplementary cue for more
accurate detection. We can see from Figure 1 that the motion anomaly map is ineffective at highlighting the slowly moving bicycle in the test video, while the appearance model makes up for the deficiency by detecting the bicycle as an abnormal object, leading to more accurate detection results.

3.2.1. Video Representation

To describe the appearance of objects along different dimensions, we utilize spatio-temporal cuboids to represent videos. We first extract moving regions by frame differencing to obtain informative data. Then we uniformly partition each moving region into non-overlapping blocks of fixed size. Corresponding regions in τ continuous frames are stacked together to form a spatio-temporal cuboid. We compute the spatio-temporal gradients for each cuboid using the method in [27]. For each pixel p in cuboid I, the spatio-temporal gradient ∇I_p is calculated as

$$\nabla I_p = [I_{p,x}, I_{p,y}, I_{p,t}]^\top = \left[ \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}, \frac{\partial I}{\partial t} \right]^\top \tag{3}$$

where x, y, and t are the video's horizontal, vertical, and temporal dimensions respectively. The values in the x and y directions describe the pose or shape of an object, and the values in the temporal direction characterize how the appearance changes over time.

3.2.2. Support Vector Data Description (SVDD)

In this work, we estimate the distribution of normal data and formalize anomaly detection as outlier detection. As a tool of data description, SVDD has been widely used in outlier detection because of
its flexibility and good generalization ability [36]. The basic idea of SVDD is to model a spherical boundary around the data in the feature space. To minimize the effect of incorporating outliers, the volume of this hypersphere is minimized. Given a set of training data X = {x_1, x_2, ..., x_n} ∈ R^{m×n}, our strategy is to map the data into the feature space and seek a hypersphere as small as possible while including most of the data. This can be formulated as the following optimization problem:

$$\min_{a,R,\xi}\ R^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\ \ \|\phi(x_i) - a\|^2 \le R^2 + \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n \tag{4}$$
where φ is a function mapping the data to a higher-dimensional space, a and R are the center and radius of the sphere, ξ_i is a slack variable for incorporating outliers, and C > 0 is a user-specified parameter controlling the trade-off between the volume of the hypersphere and the number of excluded outliers. We can solve this optimization with Lagrange multipliers:

$$L(a, R, \xi, \alpha, \gamma) = R^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left( R^2 + \xi_i - \|\phi(x_i) - a\|^2 \right) - \sum_{i=1}^{n} \gamma_i \xi_i \tag{5}$$

where α_i ≥ 0 and γ_i ≥ 0 are Lagrange multipliers. The original problem then
becomes the Lagrange dual problem [37]:

$$\max_{\alpha \ge 0,\ \gamma \ge 0}\ \min_{a, R, \xi}\ L(a, R, \xi, \alpha, \gamma) \tag{6}$$
Setting the partial derivatives to zero gives the constraints:

$$\begin{aligned}
\frac{\partial L}{\partial R} = 0 \ &\Rightarrow\ R\Big(1 - \sum_{i=1}^{n}\alpha_i\Big) = 0 \ \Rightarrow\ \sum_{i=1}^{n}\alpha_i = 1 \\
\frac{\partial L}{\partial a} = 0 \ &\Rightarrow\ \sum_{i=1}^{n}\alpha_i\phi(x_i) - a\sum_{i=1}^{n}\alpha_i = 0 \ \Rightarrow\ a = \sum_{i=1}^{n}\alpha_i\phi(x_i) \\
\frac{\partial L}{\partial \xi_i} = 0 \ &\Rightarrow\ C - \alpha_i - \gamma_i = 0 \ \Rightarrow\ 0 \le \alpha_i \le C
\end{aligned} \tag{7}$$
According to Eq. 7, the center of the sphere can be expressed as a linear combination of φ(x_i), which makes it possible to express the original problem in dual form with kernel functions:

$$\max_{\alpha}\ \sum_{i} \alpha_i k(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \quad \text{s.t.}\ \ 0 \le \alpha_i \le C,\ \ \sum_{i} \alpha_i = 1 \tag{8}$$
In this paper, we choose the radial basis function (RBF) kernel defined as k(x_i, x_j) = exp(−γ‖x_i − x_j‖²), γ > 0. This kernel is adept at handling the non-linear case, with few parameters and good performance in most cases. The optimal α can be found by QP optimization methods.

$$\begin{aligned}
\|\phi(x_i) - a\|^2 < R^2 \ &\Rightarrow\ \alpha_i = 0,\ \gamma_i = 0 \\
\|\phi(x_i) - a\|^2 = R^2 \ &\Rightarrow\ 0 < \alpha_i < C,\ \gamma_i = 0 \\
\|\phi(x_i) - a\|^2 > R^2 \ &\Rightarrow\ \alpha_i = C,\ \gamma_i > 0
\end{aligned} \tag{9}$$
Only objects x_i with α_i > 0 contribute to the characterization of the sphere; these objects are therefore called the support vectors of the description [21]. Figure 4 illustrates some key principles of SVDD using 2D data points. SVDD tries to find a circle (or a hypersphere in a higher-dimensional space) of radius R centered at a that encloses a single class of data. The green points on the black circle are defined as unbounded support vectors, while the red points falling outside the boundary are outliers, defined as bounded support vectors.

Figure 4: Support Vector Data Description.
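As a concrete illustration of training, the dual problem in Eq. (8) can be handed to a generic constrained solver; the sketch below uses SciPy's SLSQP and recovers the radius from an unbounded support vector, as is standard for SVDD. This is our own naive illustration, not the authors' solver; a dedicated QP package would be preferred in practice.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=1.0):
    # RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def train_svdd(X, C=100.0, gamma=1.0):
    """Naive SVDD training: solve the dual of Eq. (8) numerically."""
    n = len(X)
    K = rbf(X, X, gamma)

    # Negate the dual objective so it can be minimized:
    # maximize sum_i a_i K_ii - sum_ij a_i a_j K_ij
    def obj(a):
        return -(a @ np.diag(K) - a @ K @ a)

    cons = {"type": "eq", "fun": lambda a: a.sum() - 1.0}
    res = minimize(obj, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=cons)
    alpha = res.x
    # Radius: distance from the center to any unbounded support vector
    sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))
    R2 = K[sv, sv] - 2 * alpha @ K[:, sv] + alpha @ K @ alpha
    return alpha, R2

def svdd_score(z, X, alpha, R2, gamma=1.0):
    # Decision value f(z) of the test rule: positive => outside the sphere
    kz = rbf(z[None, :], X, gamma)[0]
    K = rbf(X, X, gamma)
    return 1.0 - 2 * alpha @ kz + alpha @ K @ alpha - R2
```

A point far from the training cluster receives a positive score (outside the boundary), while points inside the cluster score near or below zero.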
To test an observation z centered at p, we calculate the distance to the center of the sphere. The test observation will be considered as abnormal when the distance is larger than the radius. We can measure the abnormal degree of an observation by the following decision function:
$$f(z) = k(z, z) - 2 \sum_{i} \alpha_i k(z, x_i) + \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - R^2 \tag{10}$$
The greater the decision value, the higher the likelihood that the observation is anomalous. We employ the logistic function to map the decision value to a probability value within [0, 1], and the appearance abnormality of a pixel p is calculated as

$$A_{appearance}(p) = \frac{1}{1 + \exp(-\lambda_2 \cdot f(z))} \tag{11}$$

where λ_2 is the scale parameter controlling the growth rate of the logistic curve at the central point. Finally, the overall abnormality map is the combination of the motion and appearance abnormality maps computed by Eq. 1 and Eq. 11:

$$A(p) = A_{motion}(p) + A_{appearance}(p) \tag{12}$$
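Combining the two maps per Eq. (12) and suppressing tiny connected regions (the small-area filter described in the text) could be sketched as follows; the binary threshold value and all names are our assumptions for illustration:

```python
import numpy as np
from scipy import ndimage

def combine_and_filter(A_motion, A_appearance, thresh=0.5, S=50):
    """Sum the motion and appearance abnormality maps (Eq. 12), then
    drop connected abnormal regions smaller than S pixels."""
    A = A_motion + A_appearance                  # Eq. (12)
    mask = A > thresh                            # threshold is our assumption
    labels, n = ndimage.label(mask)              # 4-connected components
    for i in range(1, n + 1):
        region = labels == i
        if region.sum() < S:                     # area filter on small blobs
            mask[region] = False
    return A, mask
```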
Considering that illumination changes and noise may produce false detections, we apply a filtering process that removes small abnormal regions whose area is less than S pixels. Experimental results demonstrate that this filtering process helps reduce false alarms significantly.

4. Experiment

In this section, we show the performance of the proposed anomaly detection method on the UCSD dataset [31], the Subway dataset [15] and the UMN dataset [38], both qualitatively and quantitatively. Experimental results show that the proposed method is effective at detecting abnormal events in various scenes. Comparisons with the methods of scan statistics (SS13) [17], sparse combination (SC13) [25], sparse reconstruction (SR11) [24], online sparse coding
(OSC11) [4], mixture of dynamic textures (MDT10) [31], chaotic invariants (CI10) [6], social force (SF09) [5], mixture of optical flow (MPPCA09) [14], and multiple fixed-location monitors (MFLM08) [15] are also conducted on all the datasets, which shows that our approach performs favorably against current state-of-the-art methods in terms of Receiver Operating Characteristic (ROC) [31], Area Under the ROC Curve (AUC) [31], number of correct detections [14], and number of false alarms [15]. It should be noted that each approach is denoted by the combination of the abbreviation of its key words and the last two digits of its publication year.

4.1. UCSD Dataset

The UCSD Anomaly Detection Dataset [31] contains video clips of two pedestrian scenes from a campus, Ped1 and Ped2. The Ped1 dataset contains clips of groups of people walking towards and away from the camera, with a resolution of 158×238 (34 clips for training, and 36 clips for testing). The Ped2 dataset contains scenes of pedestrians moving parallel to the camera plane, with a resolution of 360×240 (16 clips for training, and 12 clips for testing). The video footage recorded from each scene was split into various clips of approximately 200 frames. In the normal settings, the videos contain only pedestrians. Examples of anomalies include buses, wheelchairs, bicycles, and skaters, which exist only in the test data. Each testing clip in the UCSD dataset is labeled with frame level and pixel level ground truth. The frame level ground truth is a binary flag indicating whether there is any anomalous activity in the corresponding frame. Manually generated pixel level binary masks are provided to identify the regions containing anomalies.
To apply our motion anomaly detection, we first calculate the optical flow for each spatial location in the video frames, where the directions of optical flow vectors are quantized into 20 (D = 20) directions from 0° ∼ 360° in the direction histogram, while magnitudes are aggregated into the magnitude histogram with 50 (M = 50) bins. We compute the cut-bin which covers 99% of all optical flows to act as a decision boundary for locating abnormal pixels. For appearance anomaly detection, we partition the videos into spatio-temporal cuboids of 5 pixels × 5 pixels × 5 frames for the Ped1 dataset (and 8 pixels × 8 pixels × 5 frames for the Ped2 dataset). Dimension reduction and whitening of the features extracted from all cuboids are implemented with PCA. Then we train the SVDD model using all the training samples. Finally, we integrate the motion and appearance cues to get the overall anomaly detection results. The other parameters for computing abnormality are set as follows: the scale parameter λ1 = 1.2, the penalty coefficient C = 100, the scale parameter λ2 = 150, the kernel parameter γ = 1 and the filter area S = 50. Some examples of the abnormal detection results are shown in Figure 5, where the detection results on the Ped1 and Ped2 datasets are shown in Figure 5 (a-e) and Figure 5 (f-j) respectively. From the examples we can see that our method can localize different types of anomalous activities with high accuracy. Abnormal objects such as cars, bicycles, skaters, and wheelchairs are exactly detected both in crowded (Fig. 5 (c,g,h)) and uncrowded scenes (Fig. 5 (a,b,d,f,i)). The motion model is capable of detecting fast moving objects such as skaters (Fig. 5 (b,d,j)) and running cars (Fig. 5 (a,i)), while slowly moving bicycles can be identified by the appearance model, as shown in Figure 5 (a,g). Since there are few people walking towards and away from
the camera in the training data, false detections may occur in Figure 5 (i,j), where people walking in such previously unobserved directions are treated as new events.
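The cut-bin decision boundary described above can be sketched as follows. This is a minimal illustration under our own assumptions: optical flow magnitudes arrive as a NumPy array, and the helper name `cut_bin_threshold` is ours, not from the paper.

```python
import numpy as np

def cut_bin_threshold(magnitudes, num_bins=50, coverage=0.99):
    """Find the magnitude value (the "cut-bin" edge) whose histogram
    bins cumulatively cover the given fraction of all observed
    optical-flow magnitudes; larger magnitudes are treated as abnormal."""
    counts, edges = np.histogram(magnitudes, bins=num_bins)
    cdf = np.cumsum(counts) / counts.sum()
    # First bin index at which cumulative coverage reaches 99%.
    cut = np.searchsorted(cdf, coverage)
    return edges[cut + 1]

# Toy usage: mostly slow motions plus a few fast outliers.
rng = np.random.default_rng(0)
mags = np.concatenate([rng.rayleigh(1.0, 10000), rng.uniform(8, 10, 5)])
thr = cut_bin_threshold(mags)       # M = 50 bins, 99% coverage
abnormal = mags > thr               # flags the fast outliers
```

The same construction applies to the D = 20 direction histogram; only the binning changes.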
Figure 5: Examples of abnormal activities detected on the UCSD dataset. Abnormal objects such as bicycles, cars, skaters and wheelchairs are accurately detected.
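The appearance pipeline above (PCA whitening of cuboid features, then a one-class boundary) can be sketched with scikit-learn. Note the substitution: with an RBF kernel, the ν-parameterized `OneClassSVM` learns a spherical boundary equivalent to SVDD, whereas the paper uses the C-parameterized SVDD formulation; the feature matrix, component count, and ν value here are illustrative assumptions only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

# Toy stand-in for cuboid features: rows are flattened 5x5x5 cuboids.
rng = np.random.default_rng(0)
train_feats = rng.normal(0.0, 1.0, size=(500, 125))

# PCA with whitening, as in the paper; 20 components is our assumption.
pca = PCA(n_components=20, whiten=True).fit(train_feats)
z_train = pca.transform(train_feats)

# RBF OneClassSVM stands in for SVDD's spherical boundary; we use
# gamma="scale" here so the toy example is well-conditioned.
svdd = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01).fit(z_train)

# Score new cuboids: negative decision values fall outside the boundary.
test_feats = np.vstack([rng.normal(0, 1, (5, 125)),    # normal-like
                        rng.normal(6, 1, (5, 125))])   # abnormal-like
scores = svdd.decision_function(pca.transform(test_feats))
abnormal = scores < 0
```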
In Figure 6, we show our anomaly localization ROC curves computed with pixel-level ground truth on the UCSD dataset. We adopt the anomaly localization evaluation criteria proposed by Mahadevan et al. [31]. The results demonstrate that combining motion and appearance cues not only greatly improves detection accuracy but also reduces false alarms. In Figure 7, we show the quantitative comparison of the proposed method with previous approaches on the UCSD Ped1 dataset, from which we can see that the proposed algorithm is inferior to other methods on frame-level anomaly detection, while its pixel-level anomaly localization performance is superior or similar to the other methods. Note that the proposed method is slightly inferior to SS13 [17], mainly due to the appearance of new moving objects at locations without observations in the training data, which leads to some false alarms in our detection
Figure 6: Pixel-level ROC curves (Combine, Appearance, Motion) for the UCSD dataset. (a) ROCs for the Ped1 dataset using pixel-level ground truth; (b) ROCs for the Ped2 dataset using pixel-level ground truth.
results, while the two-round scanning algorithm proposed in SS13 [17] helps to identify the true regions of irregular events and thereby reduces false alarms. We show the frame-level ROC curves of all the methods on the UCSD Ped2 dataset in Figure 8. We can see that the proposed method achieves similar or superior performance to other state-of-the-art methods. The ROC curve of the proposed algorithm is slightly below that of SS13 [17], mainly because the abnormality in the proposed method is decided by an imperfect boundary learned from insufficient training videos, which can easily result in false detections. Besides, the skaters in the Ped2 dataset escape detection because they skate slowly and appear like normal walking pedestrians.

4.2. Subway Dataset

The subway dataset provided by Adam et al. [15] contains 2 surveillance videos monitoring the subway exit gate and entrance gate respectively. In both videos, there are typically 1 to 10 people moving in a frame,
Figure 7: Comparisons of ROC curves for the UCSD Ped1 dataset. (a) Pixel-level ROC curves: Ours AUC=0.65, SC13 AUC=0.63, SS13 AUC=0.66, SR11 AUC=0.47, MDT10 AUC=0.44, SF09 AUC=0.21, SF+MPPCA AUC=0.21, MPPCA09 AUC=0.13, MFLM08 AUC=0.18. (b) Frame-level ROC curves: Ours AUC=0.85, SC13 AUC=0.91, SS13 AUC=0.87, SR11 AUC=0.91, MDT10 AUC=0.84, SF09 AUC=0.77, SF09+MPPCA AUC=0.77, MPPCA09 AUC=0.67, MFLM08 AUC=0.65.
with a frame size of 512 × 384. In the normal setting, people walk through the gates without disturbance. For the anomaly setting, we follow the definition in [14], where the abnormal activities are: (a) Wrong direction (WD): people exit through the entrance gate or enter through the exit gate; (b) No payment (NP): people sneak through or jump over the gate without paying; (c) Loitering (LT): people loiter at the station for a long time; (d) Irregular interactions between persons (II): two people awkwardly zigzag to avoid each other; (e) Misc.: e.g. a person suddenly stops walking, or runs fast. In this experiment, we uniformly sample the optical flow every 10 pixels for the motion model and set the size of the spatio-temporal cuboids to 10 pixels × 10 pixels × 5 frames for the appearance model. Other parameters are set as follows: M = 50, D = 20, r = 0.9998, λ1 = 1.2, C = 100, λ2 = 150, γ = 1 and S = 200.
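The spatio-temporal cuboid partitioning used for the appearance model can be sketched as follows; a minimal illustration assuming the clip is a (frames, height, width) array, with the helper name `extract_cuboids` being ours.

```python
import numpy as np

def extract_cuboids(video, ph=10, pw=10, pt=5):
    """Partition a (frames, height, width) volume into non-overlapping
    pt x ph x pw spatio-temporal cuboids, each flattened to a vector."""
    t, h, w = video.shape
    cuboids = []
    for f in range(0, t - pt + 1, pt):
        for y in range(0, h - ph + 1, ph):
            for x in range(0, w - pw + 1, pw):
                cuboids.append(video[f:f+pt, y:y+ph, x:x+pw].ravel())
    return np.asarray(cuboids)

# A 10-frame 40x60 toy clip yields 2 * 4 * 6 = 48 cuboids of length 500.
clip = np.zeros((10, 40, 60), dtype=np.float32)
feats = extract_cuboids(clip)
```

The resulting rows are the features that are PCA-whitened before SVDD training.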
Figure 8: Comparison of frame-level ROC curves for the UCSD Ped2 dataset: Ours AUC=0.9, SS13 AUC=0.94, MDT10 AUC=0.85, MPPCA09 AUC=0.77, SF+MPPCA AUC=0.71, SF09 AUC=0.63, MFLM08 AUC=0.63.
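Frame-level evaluation reduces each frame to a binary label (anomalous or not) and a per-frame anomaly score, so ROC curves and AUC values like those above can be reproduced with standard tooling. A sketch using scikit-learn; the labels and scores here are synthetic, for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Per-frame ground-truth flags and per-frame anomaly scores (e.g. the
# maximum pixel abnormality in each frame); both synthetic here.
rng = np.random.default_rng(0)
labels = np.array([0] * 80 + [1] * 20)
scores = np.concatenate([rng.normal(0.0, 1.0, 80),   # normal frames
                         rng.normal(2.5, 1.0, 20)])  # abnormal frames

# Sweeping the detection threshold traces out the frame-level ROC curve.
fpr, tpr, thresholds = roc_curve(labels, scores)
auc = roc_auc_score(labels, scores)
```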
4.2.1. Subway Exit

The subway exit surveillance video is 43 minutes long with 64,900 frames. The frames in the first 5 minutes are used for training, and the remaining frames, used for testing, include 19 abnormal events of the WD, LT and Misc. types. The abnormal event detection results for a few frames are shown in Figure 9, from which we can see that the proposed method is able to detect people going through the gate in the opposite direction (Fig. 9 (a,b)) and sudden running events (Fig. 9 (d,e)). Loitering events, with people moving only slightly within a region, can also be detected by the proposed model, as shown in Figure 9 (c).
Figure 9: Examples of abnormal activities detected on the Subway exit dataset: (a,b) WD; (c) LT; (d,e) MISC.
Table 1 shows the quantitative comparison of our results with other methods. We can see that the proposed method obtains high detection rates with low false alarm rates, achieving similar performance to other state-of-the-art methods. It should be noted that when a train arrives at or leaves the exit, large lighting variations may lead to false alarms for all the methods.

Table 1: Comparison of unusual event detection rate and false alarm rate on the subway exit surveillance dataset
                WD   LT   MISC   Total   FA
Ground Truth     9    3     7     19      0
MPPCA09 [14]     9    3     7     19      3
OSC11 [4]        9    3     7     19      2
SR11 [24]        9    -     -      -      2
SC13 [25]        9    3     7     19      2
SS13 [17]        9    3     7     19      2
Ours             9    3     7     19      2
4.2.2. Subway Entrance

The subway entrance video is 1 hour and 36 minutes long with 144,249 frames in total. The first 15 minutes are used for training, and 66 unusual events covering all types of abnormal activities are defined in the testing frames. Figure 10 shows examples of abnormal activities detected by our algorithm. We can see that different kinds of anomalies, such as wrong direction (Fig. 10 (a)), crossing the gate without paying (Fig. 10 (b)), loitering (Fig. 10 (c)), suspicious interaction (Fig. 10 (d)) and turning around at the gate (Fig. 10 (e)), can all be accurately located with the proposed method.
Figure 10: Examples of abnormal activities detected on the Subway entrance dataset: (a) WD; (b) NP; (c) LT; (d) II; (e) MISC.
The quantitative results are summarized in Table 2. From the comparison we can see that our method detects more events while producing slightly more false alarms. The missed NP detections are mainly due to the low resolution of the objects and the fact that people go through the gate with their backs to the camera, so the extracted features are insufficient to identify their irregular behaviors. The false alarms are produced by new events in the test videos, such as two people gesticulating while talking, or a person holding up a briefcase when going through the gate; these behaviors are missing from the ground truth annotations and are hence counted as false alarms in the evaluation. In addition, the lack of motion patterns collected during training reduces the accuracy of the statistical models.

4.3. UMN Dataset

The UMN dataset [38] contains three different scenes of crowded escape events with a resolution of 320×240. Each scene begins with people walking around as normal behavior, and the sudden running of all people is taken as the abnormal event. The frame numbers of the three scenes are 1453, 4143 and 2143 respectively. We used the first several hundred frames (400
Table 2: Comparison of unusual event detection rate and false alarm rate on the subway entrance surveillance dataset
                WD   NP   LT   II   MISC   Total   FA
Ground Truth    26   13   14    4     9     66      0
MPPCA09 [14]    24    8   13    4     8     57      6
OSC11 [4]       25    9   14    4     8     60      5
SR11 [24]       21    6    -    -     -      -      4
SC13 [25]       25    7   13    4     8     57      4
SS13 [17]       26    6   14    4     8     58      6
Ours            25    9   14    4     9     61      5
frames for scene1 and scene3, 300 frames for scene2) to train our models and used the rest for testing. As for the experimental settings, we set the sampling interval to 10 pixels for the optical flow calculation in the motion model and the size of the spatio-temporal cuboids to 10 pixels × 10 pixels × 5 frames for the appearance model. Other related parameters are set as follows: M = 50, D = 20, r = 0.9999, λ1 = 1.2, C = 100, λ2 = 150, γ = 1, S = 200. We show some abnormal frames detected by our algorithm in Figure 11. Our method is able to locate multiple running people in scenes of panic and to provide alarm frames. The results also show that our method is robust in both outdoor (Fig. 11 (a,b,e)) and indoor (Fig. 11 (c,d)) scenes with different illuminations. Table 3 shows the areas under the ROC curves of different methods on the UMN dataset; we can see that our algorithm achieves similar performance to other state-of-the-art methods. The slight
difference in AUC values is probably caused by the filtering of abnormal frames at the beginning and the end of the sudden running, where the regions of running behavior are not large enough to be regarded as anomalies.
Figure 11: Examples of abnormal activities detected on the UMN dataset.
Table 3: Comparison of area under ROC curves on the UMN dataset
               CI10 [6]   SF09 [5]   SR11 [24]   SS13 [17]   Ours
UMN scene1       0.99       0.96       0.995       0.991     0.993
UMN scene2                             0.975       0.951     0.969
UMN scene3                             0.964       0.99      0.988

(CI10 and SF09 report a single AUC over all three scenes.)
5. Conclusion

In this paper we proposed a method to detect abnormal events based on the integration of motion and appearance cues. A histogram model and an SVDD model are utilized to characterize the motion patterns and appearance distribution of normal samples respectively. In both models, our algorithm learns a boundary from the training videos to measure the abnormality of a new observation, avoiding decision rules with many parameters. Since any object can be described in terms of its motion and appearance, our
method can be widely used in various surveillance scenes. Moreover, the histogram and SVDD models do not restrict the type of features or the type of scenes, which allows the method to be extended to broader research areas. Experimental results illustrate that the proposed method is effective for real-world videos containing different types of normal and abnormal activities. Our future work will focus on extending the method to more video applications.

References

[1] L. Lee, R. Romano, G. Stein, Introduction to the special section on video surveillance, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 745.

[2] M. Shah, O. Javed, K. Shafique, Automated visual surveillance in realistic scenarios, IEEE MultiMedia 14 (1) (2007) 30–39.

[3] J. Konrad, Motion detection and estimation, Handbook of Image and Video Processing (2000) 207–225.

[4] B. Zhao, L. Fei-Fei, E. P. Xing, Online detection of unusual events in videos via dynamic sparse coding, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3313–3320.

[5] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 935–942.
[6] S. Wu, B. E. Moore, M. Shah, Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2054–2060.

[7] I. Haritaoglu, D. Harwood, L. S. Davis, W4: Real-time surveillance of people and their activities, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 809–830.

[8] Y. T. Liao, C.-L. Huang, S.-C. Hsu, Slip and fall event detection using bayesian belief network, Pattern Recognition 45 (1) (2012) 24–32.

[9] V. Vishwakarma, C. Mandal, S. Sural, Automatic detection of human fall in video, in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, pp. 616–623.

[10] C. Rougier, J. Meunier, A. St-Arnaud, J. Rousseau, Fall detection from human shape and motion history using video surveillance, in: Advanced Information Networking and Applications Workshops, 21st International Conference on, Vol. 2, 2007, pp. 875–880.

[11] X. Cui, Q. Liu, M. Gao, D. N. Metaxas, Abnormal detection using interaction energy potentials, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3161–3167.

[12] G. Zen, E. Ricci, Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3225–3232.
[13] Z. Fu, W. Hu, T. Tan, Similarity based vehicle trajectory clustering and anomaly detection, in: Proceedings of IEEE International Conference on Image Processing, Vol. 2, 2005, pp. 602–605.

[14] J. Kim, K. Grauman, Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2921–2928.

[15] A. Adam, E. Rivlin, I. Shimshoni, D. Reinitz, Robust real-time unusual event detection using multiple fixed-location monitors, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (3) (2008) 555–560.

[16] Y. Benezeth, P.-M. Jodoin, V. Saligrama, C. Rosenberger, Abnormal events detection based on spatio-temporal co-occurences, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2458–2465.

[17] Y. Hu, Y. Zhang, L. S. Davis, Unsupervised abnormal crowd activity detection using semiparametric scan statistic, in: Computer Vision and Pattern Recognition Workshops, IEEE Conference on, 2013, pp. 767–774.

[18] V. Saligrama, Z. Chen, Video anomaly detection based on local statistical aggregates, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2112–2119.
[19] R. Hamid, A. Johnson, S. Batta, A. Bobick, C. Isbell, G. Coleman, Detection and explanation of anomalous activities: representing activities as bags of event n-grams, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, 2005, pp. 1031–1038.

[20] M. J. Roshtkhari, M. D. Levine, Online dominant and anomalous behavior detection in videos, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2611–2618.

[21] D. M. Tax, R. P. Duin, Support vector data description, Machine Learning 54 (1) (2004) 45–66.

[22] A. Williams, D. Ganesan, A. Hanson, Aging in place: fall detection and localization in a distributed smart camera network, 2007, pp. 892–901.

[23] E. Esen, M. A. Arabaci, M. Soysal, Fight detection in surveillance videos, in: Content-Based Multimedia Indexing, 11th International Workshop on, 2013, pp. 131–135.

[24] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3449–3456.

[25] C. Lu, J. Shi, J. Jia, Abnormal event detection at 150 fps in matlab, in: Proceedings of the IEEE International Conference on Computer Vision, 2013.

[26] O. Boiman, M. Irani, Detecting irregularities in images and in video, in: International Journal of Computer Vision, 2007, pp. 17–31.
[27] L. Kratz, K. Nishino, Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1446–1453.

[28] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, Semi-supervised adapted hmms for unusual event detection, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, 2005, pp. 611–618.

[29] E. L. Andrade, S. Blunsden, R. B. Fisher, Modelling crowd scenes for event detection, in: Pattern Recognition, Vol. 1, 2006, pp. 175–178.

[30] K. Ouivirach, S. Gharti, M. N. Dailey, Incremental behavior modeling and suspicious activity detection, Pattern Recognition 46 (3) (2013) 671–680.

[31] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.

[32] M. H. Sharif, C. Djeraba, An entropy approach for abnormal activities detection in video streams, Pattern Recognition 45 (7) (2012) 2543–2561.

[33] J. Kwon, K. M. Lee, A unified framework for event summarization and rare event detection, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1266–1273.

[34] H. Zhong, J. Shi, M. Visontai, Detecting unusual activity in video, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2004, pp. 819–826.

[35] C. Liu, W. T. Freeman, E. H. Adelson, Y. Weiss, Human-assisted motion annotation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

[36] S.-W. Lee, J. Park, S.-W. Lee, Low resolution face recognition based on support vector data description, Pattern Recognition 39 (9) (2006) 1809–1812.

[37] W.-C. Chang, C.-P. Lee, C.-J. Lin, A revisit to support vector data description (svdd).

[38] Unusual crowd activity dataset. http://mha.cs.umn.edu/movies/crowdactivity-all.avi.