Author’s Accepted Manuscript
Quaternion Discrete Cosine Transformation Signature Analysis in Crowd Scenes for Abnormal Event Detection
Huiwen Guo, Xinyu Wu, Shibo Cai, Nannan Li, Jun Cheng, Yen-Lun Chen
www.elsevier.com/locate/neucom

PII: S0925-2312(16)30114-X
DOI: http://dx.doi.org/10.1016/j.neucom.2015.07.153
Reference: NEUCOM16913
To appear in: Neurocomputing
Received date: 13 March 2015; Revised date: 12 July 2015; Accepted date: 25 July 2015
Quaternion Discrete Cosine Transformation Signature Analysis in Crowd Scenes for Abnormal Event Detection
Huiwen Guo a, Xinyu Wu a,b,∗, Shibo Cai c, Nannan Li a, Jun Cheng a, Yen-Lun Chen a
a Guangdong Provincial Key Lab of Robotics and Intelligent Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China
b Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong
c Key Laboratory of E&M (Zhejiang University of Technology), Ministry of Education, China
Abstract
In this paper, an abnormal event detection approach inspired by the saliency attention mechanism of the human visual system is presented. Conventional statistics-based methods suffer from dependence on visual scale, the complexity of normal events, and the insufficiency of training data, because a normal-behavior model is established from normal video data to detect unusual behaviors under the assumption that anomalies are events with rare appearance. Instead, we assume that anomalies are events that attract human attention. Temporal and spatial anomaly saliency are considered consistently by representing each pixel value in a frame as a quaternion with weighted components composed of intensity, contour, motion-speed, and motion-direction features. For each quaternion frame, the Quaternion Discrete Cosine Transformation (QDCT) and a signature operation are applied. The spatio-temporal anomaly saliency map is developed by inverse QDCT and Gaussian smoothing. Through multi-scale analysis, abnormal events appear in areas with high saliency scores. Experiments on typical datasets show that our method achieves highly accurate results.
∗ Corresponding author.
Email addresses: [email protected] (Huiwen Guo), [email protected] (Xinyu Wu), [email protected] (Shibo Cai), [email protected] (Nannan Li), [email protected] (Jun Cheng), [email protected] (Yen-Lun Chen)
Preprint submitted to Neurocomputing, April 6, 2016
Keywords: video anomaly detection, quaternion discrete cosine transformation, spectral analysis, multi-scale analysis
1. Introduction
Intelligent video surveillance has become an important technique for monitoring dense crowds in public places, serving crowd-disaster prevention, emergency alarming, and the protection of life and property. It has attracted much attention from researchers because it not only speeds up the response time of security agencies but also liberates a large number of security personnel from the tedious work of watching videos. However, abnormal event detection still faces problems. The basic problem is the universal definition of an abnormal event. On one hand, it is infeasible to list all abnormal events in a given scene due to their variety. On the other hand, some normal events are treated as abnormal in different scenes: whether an event is normal is closely related to the place, the time of occurrence, and the surroundings. Given the variety of surveillance places and abnormal events, we focus on two typical types of abnormal events in crowd scenes: wrong object movement in the surveillance area, for example, a cyclist moving on a walking street; and wrong behaviors of objects, such as people fleeing across a square. To detect these abnormal events, a popular idea is to define them as events that occur infrequently [1]. A probabilistic viewpoint can be adopted for an abnormal event, which conforms to intuition, and a normal-event model is built within a statistics-based framework. Clustering-based [2], reconstruction-based [3], and inference-based [4] methods are frequently adopted. However, these methods encounter some big challenges. Firstly, they make anomalies dependent on the visual scale at which training data is defined: an abnormal behavior at a fine visual scale may be perceived as normal when a larger scale is considered, and vice versa. Secondly, different normal-event models are needed for different
scenes. For instance, people appearing on a walking street is a normal event, but people appearing on a highway is an abnormal event. Thirdly, a large amount of normal training data is needed to build normal-event models robustly. Interestingly, a child without much prior knowledge can find an abnormal event in a video at a glance, such as a gathering crowd or a vehicle on the sidewalk. This does not depend on the prior knowledge of the individual, simply because these events are very different from those around them. People do not do this deliberately; they neglect the brain's summary of normal events and attend only to special events. Thus, we hope to utilize this inherent human characteristic and avoid the aforementioned problems of traditional methods. Based on this, we define abnormal events as events that attract attention in the saliency view, inspired by the saliency attention mechanism of the human visual system. Primates have a remarkable ability to qualitatively interpret complex scenes or events in real time, despite the limited speed of the neuronal hardware available for such tasks [4]. Before further processing the huge amount of information, intermediate and higher visual processes appear to select a subset of the available sensory information [5], most likely to reduce the complexity of scene analysis [6]. The selected spatially circumscribed region is called the focus of attention [4], which scans the scene both in a rapid, bottom-up, saliency-driven, and task-independent manner and in a slower, top-down, volition-controlled, and task-dependent manner [6]. In the visual system of primates, visual input is first decomposed into a set of topographic feature maps. Competition arises within maps across different spatial locations, and only those locations that locally stand out from their surround can persist [4].
In primates, such a map is believed to be located in the posterior parietal cortex [7] as well as in the various visual maps in the pulvinar nuclei of the thalamus [8]. According to these research findings, it is plausible that the locations of abnormal events in surveillance videos coincide with the saliency attention regions located by the human visual nervous system. The sparseness and rareness of abnormal events make them stand out from normal events after the intermediate and higher visual processes of humans. Anomaly in our definition is consistent, to some extent, with the popular definition based on comparison between events. Intuitively, our assumption can be demonstrated by practical examples: abnormal events such as a rider on a sidewalk, riding the wrong way on a one-way street, a car appearing on a sidewalk, or fighting on the street attract our attention immediately. In this work, we propose a saliency-attention-based approach to detect two kinds of abnormal events in crowd scenes, according to the definition that attention-attracting events are anomalies. For the consistent consideration of temporal and spatial saliency attention in video, each frame is represented by a weighted quaternion image with four channels: frame intensity, object contour, motion speed, and motion direction. By applying the Quaternion Discrete Cosine Transformation (QDCT) and a signature operation to the frequency components, new frequency components are obtained. The spatio-temporal anomaly saliency attention map is obtained by the inverse QDCT operation and Gaussian smoothing with multi-scale analysis. Abnormal events are located in areas with high anomaly saliency attention scores. The main contributions of this work can be summarized as follows:
• A new assumption about abnormal events is introduced, which allows one to detect abnormal events in crowd scenes without any prior knowledge of anomalies and without specifying anomalous activity classes.
• By using a quaternion frame, the low-level feature representation of events considers temporal and spatial information consistently.
• There are no extra complex calculation steps, which results in low computational complexity in detection and makes the system run at near real-time speed on a standard PC.
The rest of this paper is organized as follows. Section 2 gives a brief introduction of related work on video-anomaly detection and spectral approaches. Section 3 provides a detailed description of the proposed method and analyzes some of its characteristics. The performance evaluation of the proposed method is presented in Section 4, and conclusions and future work are given in Section 5.
2. Related Work
In the past decade, many researchers have focused on abnormal event detection, and comprehensive surveys of this problem can be found in several review papers [9, 10]. Existing approaches to abnormal event detection can generally be divided into two broad categories, supervised and unsupervised, depending on how the model is constructed. Supervised methods usually establish the normal and abnormal behavior models from labeled video data and can detect anomalies defined beforehand in the training phase. Information extraction [11], feature preprocessing [12–14], and classifiers such as SVMs [15] are usually used. However, the built models depend on pre-defined behavior classes containing both normal and anomalous ones, and can only detect specific anomalies under rigorous restrictions on video scene conditions [16, 17]. A state transition model has been used to perform semantic anomaly detection via online learning [18]. However, in real, complex environments [11, 19] where it is impossible to specify the types of anomalies, these methods cannot work effectively. Unsupervised methods are provided with normal instances only; the central idea of such approaches is to learn the normal behavior model either automatically or through a training process. In the literature, one category of existing approaches is based on trajectories, such as [20–25]. The significance of a trajectory is clear, although the extraction process is complicated. The typical process used to obtain spatio-temporal trajectories is object tracking, which is not robust in crowd scenes. Wu [20] obtained particle trajectories in crowd scenes inspired by particle advection. A similar procedure was followed in [21], which extracts partial and short-time trajectories. To find the optimal spatio-temporal trajectories in a 3D volume, [22] used a path optimization algorithm. Once the trajectories are extracted, a normal event model is obtained by clustering them. [23] presented a multi-level k-means clustering algorithm in both temporal and spatial aspects. [24] considered not only the dynamics of abnormal trajectories but also the co-occurrence relationships between trajectories. Treating points in trajectories as graph nodes, spectral graph analysis was used to find abnormal trajectories [25]. By training on whole normal trajectories, [26] calculated the trajectory trend to predict unfinished trajectories online. The performance of trajectory-based methods depends heavily on the trajectory extraction process. In scenes with few people, these approaches can obtain precise detection results; however, in dense crowds it is quite difficult to get robust results, although some improvements have been proposed. Another category of approaches utilizes statistics-learning-based methods to build a normal behavior model for anomaly detection. Such methods rely on low-level features extracted from image patches or spatio-temporal video volumes (STVVs). These features include optical flow [2, 27, 28], histograms of optical flow [29], histograms of spatio-temporal gradients [30, 31], texture of optical flow [32], dynamic texture [28], etc. Furthermore, some scholars [33, 34] integrate various features to obtain better detection results, taking advantage of the mutual complementation of motion descriptions from different features. Compared to the aforementioned tracking-based methods, these approaches demonstrate robustness in complex environments with dense crowds, since the low-level features on which they depend can be extracted reliably.
After the low-level features have been extracted, the normal model is established using various approaches. [2] first used a bag-of-words method to model the mutual energy between volumes. [35] overcame the disorder in spatio-temporal video volumes, which may be of crucial importance for detecting anomalies in specific contexts, by taking the configuration of spatio-temporal video volumes into account. Wang [36] compared several topic models used for unsupervised activity perception in crowded scenes, including the latent Dirichlet allocation (LDA) mixture model, the hierarchical Dirichlet processes (HDP) model, and the Dual Hierarchical Dirichlet Processes (Dual-HDP) model, and reported their performance. Considering both the appearance and dynamics of abnormal events, probabilistic graphical models have been applied to crowd scenes, including Markov random fields (MRF) [2], hidden Markov models (HMM) [23], conditional random fields [37], and dynamic Bayesian networks (DBN). It is worth noting that sparse representation is widely used in anomaly detection. [3] weighted the components of a trained dictionary to detect specific abnormal events. To detect several kinds of abnormal events, dictionary sets were applied in [38], and online self-adapting learning and update processes were used in [39]. The main drawback of statistics-learning-based methods is their high computational complexity in the training stage. Furthermore, the performance of these methods depends heavily on the training data. By exploiting the saliency property of abnormal events, the proposed work avoids the tedious training step and applies a spectral approach to detect abnormal events. The effectiveness of spectral approaches in saliency detection was first discovered by Hou [40], and several spectral saliency models have been proposed since then [41–45]. Using the spectral residual, which attenuates the magnitude components by subtracting the average log magnitude to emphasize prominent objects, Hou achieved state-of-the-art performance for salient region detection and on psychological test patterns. The basis of Hou's work is that suppressing the magnitude components in the frequency domain highlights signal components such as lines, edges, or narrow events [46, 47].
To consider the color image as a whole, Guo [42] proposed the use of the phase spectrum of the quaternion Fourier transform (PQFT) and introduced a multi-resolution attention selection mechanism. By weighting the quaternion components, Bian [43] adapted the work to color images. By theoretically analyzing the use of the discrete cosine transform (DCT) for spectral saliency detection, the method referred to as "image signature" [48] outperforms all other approaches. Schauerte [49] extended this work and introduced the definition of a quaternion DCT and its signature. As in [50], multi-scale analysis and weighted components are used as improvements. The most related work is [42], which applied the spectral approach to video; however, it only used one component to represent the inter-frame dynamics. For multi-view learning, some semi-supervised learning algorithms have been proposed [51]. [52] proposed the multi-view intact space learning algorithm to encode complementary information in multiple views. Multiple sources bring irregular data, such as clicks [53]; multimodal sparse coding [54] has been proposed to solve this problem. From the point of view of probability, some stochastic learning methods have been proposed [55]. In our work, the appearance and dynamics of abnormal information are considered consistently by using the quaternion discrete cosine transform. In addition, through comparison and analysis, several improvements are applied in our method. Thus, our method can effectively detect anomalous events in crowded and complicated scenes.
3. The proposed method
For video surveillance, the anomaly detection task is to find all frames in which abnormal events appear and to label all pixels where abnormal events are located in those frames. To achieve this goal, for every pixel in each frame we must determine whether it belongs to an anomaly. Our method processes all pixels in a frame simultaneously, and the result for one frame is called the "anomaly saliency map". If some pixels in the anomaly saliency map have large values, the frame is considered abnormal, and those pixels give the location of the abnormal events. An overview of our algorithm is illustrated in Fig. 1. The process consists of two stages. First, a weighted quaternion frame is constructed for each original frame, QDCT is applied to the quaternion frame, and the signature operation is applied to the frequency components of the QDCT result. Second, from the result of the first stage, the abnormal saliency map, which represents the anomaly
Figure 1: Overview of the proposed approach for anomaly detection
score is obtained from the reconstructed frame by inverse QDCT and Gaussian smoothing. The region with a high saliency value is considered an abnormal region. The details of our method are discussed in the following subsections.

3.1. Creating a Weighted Quaternion Frame
A quaternion can be treated as an extension of the 2D complex number to a 4D algebra. A quaternion q is defined as q = a + bi + cj + dk, or q = (a + bi) + (c + di)j, with a, b, c, d ∈ R, where i, j, k (i² = j² = k² = ijk = −1) provide the necessary basis. As quaternions do not satisfy the commutative law of multiplication, left-sided and right-sided operations must be distinguished when the Hamilton product is involved. A quaternion q is called pure imaginary if a = 0, and a unit pure imaginary quaternion is q = (b/w)i + (c/w)j + (d/w)k, where w = \sqrt{b^2 + c^2 + d^2}. Naturally, we can represent every frame F_t in a given video with at most 4 channels as a quaternion image Q_t:

Q_t = I_t^1 + I_t^2 i + I_t^3 j + I_t^4 k    (1)

where I_t^1, I_t^2, I_t^3, I_t^4 are 4 channel images of the same size as the frame. For brevity, the time index t is omitted in the following paragraphs.
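As a quick illustration of this algebra (a minimal sketch for the reader, not part of the authors' implementation), the Hamilton product and the defining relations can be checked directly:

```python
def qmul(p, q):
    """Hamilton product of quaternions stored as (a, b, c, d) = a + bi + cj + dk."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,   # real part
            a1*b2 + b1*a2 + c1*d2 - d1*c2,   # i component
            a1*c2 - b1*d2 + c1*a2 + d1*b2,   # j component
            a1*d2 + b1*c2 - c1*b2 + d1*a2)   # k component

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)

# i^2 = j^2 = k^2 = ijk = -1, and ij = k while ji = -k: the product is
# non-commutative, which is why left- and right-sided transforms differ below.
assert qmul(i, i) == (-1, 0, 0, 0)
assert qmul(qmul(i, j), k) == (-1, 0, 0, 0)
assert qmul(i, j) == k and qmul(j, i) == (0, 0, 0, -1)
```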
Figure 2: One example of the anomaly saliency map (the small images on the left are the four channels of the quaternion frame: the intensity and contour channels in the upper row, the motion speed and motion direction channels in the lower row; on the right is the calculated anomaly saliency map).
The key to finding anomalous frames is the representation of the channel information of Q. Considering spatial and temporal anomalies, intensity, contour, motion speed, and motion direction are adopted. For frames with gray intensity values, the first channel is the intensity image. For frames with color values, according to color visual saliency theory [56], the first channel is obtained as follows:

R = r − (g + b)/2
G = g − (r + b)/2
B = b − (r + g)/2
Y = (r + g)/2 − |r − g|/2 − b    (2)

RG = R − G
BY = B − Y    (3)

I^1 = \min(RG + BY, 255)    (4)

where r, g, b are the corresponding channels of the RGB frame and the sum in Eq. 4 is clamped to the 8-bit range. The second channel also represents the spatial anomaly. To obtain it, a background subtraction step is applied between the previous frame and the current frame: the second channel is the absolute difference of these two frames.
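For concreteness, the per-pixel computation of the first channel can be sketched as follows (a hypothetical helper; we read the clamp in Eq. 4 as a min against the 8-bit maximum):

```python
def opponent_channel(r, g, b):
    """First quaternion channel for one color pixel (Eqs. 2-4); r, g, b in [0, 255]."""
    R = r - (g + b) / 2.0
    G = g - (r + b) / 2.0
    B = b - (r + g) / 2.0
    Y = (r + g) / 2.0 - abs(r - g) / 2.0 - b
    RG = R - G
    BY = B - Y
    return min(RG + BY, 255)  # clamp the opponency response to the 8-bit range

# A gray pixel has no color opponency, so the channel response is zero.
assert opponent_channel(100, 100, 100) == 0
# A saturated red pixel produces a strong (clamped) response.
assert opponent_channel(255, 0, 0) == 255
```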
The third and fourth channels both represent motion information. Optical flow is used to compute the basic motion. The third channel is the amplitude of the optical flow, set to 0 if the amplitude is less than a threshold T_{c3}. The fourth channel is the angle of the optical flow, set to 0 if the corresponding amplitude is less than T_{c3}. We choose T_{c3} = 0.4 empirically. In summary, we obtain four channels for each frame, two spatial and two temporal, so the original frame can be represented as a quaternion frame Q. One example is shown in Fig. 2. Naturally, each channel reveals a different anomaly aspect, so in practice different channels should make different contributions to the final result. As done by [43], we model the relative importance of the four components for anomaly detection by introducing a quaternion component weight vector w = [w_1, w_2, w_3, w_4] and adapting Eq. 1 accordingly:

Q = w_1 I^1 + w_2 I^2 i + w_3 I^3 j + w_4 I^4 k    (5)
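A minimal sketch of assembling the weighted quaternion frame of Eq. 5 (a hypothetical function; the default weights are those chosen empirically in Section 4.2, and channels are plain nested lists rather than actual image buffers):

```python
def quaternion_frame(I1, I2, I3, I4, w=(0.3, 1.1, 5.0, 0.2)):
    """Pack four equally-sized channels into a weighted quaternion frame:
    each pixel becomes (w1*I1, w2*I2, w3*I3, w4*I4), i.e.
    w1*I1 + w2*I2*i + w3*I3*j + w4*I4*k as in Eq. 5."""
    H, W = len(I1), len(I1[0])
    return [[(w[0] * I1[y][x], w[1] * I2[y][x], w[2] * I3[y][x], w[3] * I4[y][x])
             for x in range(W)] for y in range(H)]

# One-pixel frame: intensity 1, contour 2, motion speed 3, motion direction 4.
Q = quaternion_frame([[1]], [[2]], [[3]], [[4]])
assert Q[0][0] == (0.3, 2.2, 15.0, 0.8)
```

Raising w_3 (motion speed) biases detection toward overspeed events, while raising w_4 favors wrong-direction events, matching the sensitivity discussion that follows.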
By setting a different weight for each channel, our algorithm can be made sensitive to different abnormal events. For example, it is inclined to detect retrograde motion when the weight of the motion-direction channel is emphasized, and it is likely to detect overspeed when the weight of the motion-speed channel is high.

3.2. The 2D QDCT
Following the definition of the QDCT, we transform the X × Y (size) quaternion matrix Q:

QDCT^L(q, s) = \alpha_q^X \alpha_s^Y \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \mu_Q \, Q(x, y) \, \beta_{q,x}^X \, \beta_{s,y}^Y    (6)

QDCT^R(q, s) = \alpha_q^X \alpha_s^Y \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} Q(x, y) \, \beta_{q,x}^X \, \beta_{s,y}^Y \, \mu_Q    (7)
where μ_Q is a unit pure quaternion that serves as the QDCT axis. Following the Type-II DCT, α and β are defined as:

\alpha_q^X = \begin{cases} \sqrt{1/X} & q = 0 \\ \sqrt{2/X} & q \neq 0 \end{cases}    (8)

\beta_{q,x}^X = \cos\!\left( \frac{\pi}{X} \left( x + \frac{1}{2} \right) q \right)    (9)
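The α and β of Eqs. 8 and 9 form an orthonormal Type-II DCT basis, which is what makes the inverse transform below exact. A small numerical check (illustrative only, with hypothetical helper names):

```python
import math

def alpha(q, N):
    """Normalization factor of Eq. 8."""
    return math.sqrt(1.0 / N) if q == 0 else math.sqrt(2.0 / N)

def beta(q, x, N):
    """Type-II DCT basis function of Eq. 9."""
    return math.cos(math.pi / N * (x + 0.5) * q)

def basis_dot(q1, q2, N):
    """Inner product of two normalized DCT-II basis rows: 1 if q1 == q2, else 0."""
    return alpha(q1, N) * alpha(q2, N) * sum(
        beta(q1, x, N) * beta(q2, x, N) for x in range(N))

assert abs(basis_dot(3, 3, 8) - 1.0) < 1e-9   # unit norm
assert abs(basis_dot(2, 5, 8)) < 1e-9          # orthogonality
```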
The corresponding inverse QDCT is defined as:

IQDCT^L(x, y) = \sum_{q=0}^{X-1} \sum_{s=0}^{Y-1} \alpha_q^X \alpha_s^Y \, \mu_Q \, QDCT^L(q, s) \, \beta_{q,x}^X \, \beta_{s,y}^Y    (10)

IQDCT^R(x, y) = \sum_{q=0}^{X-1} \sum_{s=0}^{Y-1} \alpha_q^X \alpha_s^Y \, QDCT^R(q, s) \, \beta_{q,x}^X \, \beta_{s,y}^Y \, \mu_Q    (11)
The choice of the axis is arbitrary; we use \mu_Q = -\sqrt{1/3}\,i - \sqrt{1/3}\,j - \sqrt{1/3}\,k by default. Unless otherwise mentioned, the left-sided operation is used.

3.3. Spatio-temporal Anomaly Saliency Map
For the real DCT, the signature operation, called signFun, is:

signFun(w_{(q,s)}) = \begin{cases} 1 & w_{(q,s)} > 0 \\ -1 & w_{(q,s)} < 0 \end{cases}    (12)

The signature operation for a quaternion, however, can be considered as the quaternion "direction" and is defined as:

signFun(w_{(q,s)}) = \begin{cases} w_{i(q,s)} / |w_{(q,s)}| & |w_{i(q,s)}| \neq 0 \\ 0 & |w_{i(q,s)}| = 0 \end{cases}    (13)

where w_{i(q,s)} denotes the imaginary (vector) part of w_{(q,s)} and |w_{(q,s)}| = \sqrt{w_{i(q,s)}^2 + w_{j(q,s)}^2 + w_{k(q,s)}^2}. After executing the IQDCT on the
frequency components after the signature operation, the final anomaly saliency map S(Q) is obtained by smoothing the reconstructed frame with a Gaussian filter:

S(Q) = g * [T_Q \circ \bar{T}_Q]    (14)

where T_Q = IQDCT(signFun(QDCT(Q))), \bar{T}_Q is the conjugate of T_Q, \circ denotes the element-wise product, and g is a Gaussian smoothing filter. To account for anomalies at multiple spatial scales, we rely on a hierarchical mixture of QDCT signature results. For spectral approaches, the scale is defined by the resolution of the frame Q [57]. Thus, by resizing the quaternion frame and repeating the above operations, anomaly saliency maps at various resolutions are obtained. To calculate a final anomaly saliency map, as proposed by Peters [H1], we combine the maps at the various frame scales. Let S^m(Q) denote the anomaly saliency map at scale m ∈ M; then S^M(Q) is defined as:

S^M(Q) = h_{\sigma_M} * \sum_{m \in M} f_R(S^m(Q))    (15)

where f_R resizes a matrix to the target anomaly saliency map resolution R, and h_{\sigma_M} is an additional, optional Gaussian filter.

3.4. The Characteristics of the QDCT Signature
In the QDCT, the four channels of the quaternion are considered as a whole, so the key part of abnormal detection is the expression of events. Unlike image saliency detection, which only considers the static image, abnormal event detection must take static and dynamic anomalies into account simultaneously. For the consistent consideration of spatial and temporal anomalies, we choose the intensity/color, contour, motion speed, and motion direction information; an event is thus described by the combination of intensity, contour, motion speed, and motion orientation in the video. Intuitively, according to the above definition and description of events, normal events are frequent and analogous to the background, while abnormal events are sparse and analogous to the foreground. We can therefore infer that the background is frequent in the spatio-temporal domain and is represented as spike signals in the frequency domain, whereas abnormal events are sparse in the spatio-temporal domain and spread widely in the frequency domain. Thus, abnormal events can be detected by removing the spike signals using the signature operation.
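This spike-removal intuition can be illustrated with a real-valued 1-D analogue of Eqs. 12 and 14 (a sketch under simplifying assumptions: one channel, one scale, naive O(N²) transforms; not the authors' implementation). A sparse "foreground" spike survives the sign-and-reconstruct step and dominates the smoothed energy map:

```python
import math

def dct_1d(f):
    """Naive orthonormal 1-D DCT-II (real-valued special case of Eq. 6)."""
    N = len(f)
    return [(math.sqrt(1.0 / N) if q == 0 else math.sqrt(2.0 / N)) *
            sum(f[x] * math.cos(math.pi / N * (x + 0.5) * q) for x in range(N))
            for q in range(N)]

def idct_1d(F):
    """Matching inverse (real-valued special case of Eq. 10)."""
    N = len(F)
    return [sum((math.sqrt(1.0 / N) if q == 0 else math.sqrt(2.0 / N)) *
                F[q] * math.cos(math.pi / N * (x + 0.5) * q) for q in range(N))
            for x in range(N)]

def saliency_1d(f, sigma=1.0):
    """Real analogue of Eq. 14: S = g * (T . T) with T = IDCT(sign(DCT(f)))."""
    T = idct_1d([0 if c == 0 else (1 if c > 0 else -1) for c in dct_1d(f)])
    energy = [t * t for t in T]
    # Gaussian smoothing by direct truncated convolution (edge-replicated).
    radius = int(3 * sigma)
    g = [math.exp(-0.5 * (d / sigma) ** 2) for d in range(-radius, radius + 1)]
    s = sum(g)
    g = [v / s for v in g]
    N = len(energy)
    return [sum(g[d + radius] * energy[min(max(x + d, 0), N - 1)]
                for d in range(-radius, radius + 1)) for x in range(N)]

f = [0.0] * 32
f[10] = 5.0                               # a sparse foreground spike
S = saliency_1d(f)
assert max(range(32), key=lambda x: S[x]) in (9, 10, 11)
```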
Consider frames with the following structure:

I = f + b,  \quad  I, f, b \in \mathbb{R}^N    (16)

where I represents the frame, f represents the foreground of the frame and is assumed to be sparse in the spatial domain, and b represents the background and is assumed to be sparsely supported in the basis of the Discrete Cosine Transform. According to the theoretical analysis of the DCT signature by Hou [48], the frame reconstructed from the frame signature approximates the location of a sufficiently sparse foreground on a sufficiently sparse background:

E\!\left( \frac{\langle \hat{f}, \hat{I} \rangle}{\|\hat{f}\| \cdot \|\hat{I}\|} \right) > 0.5, \quad \text{for } |\Omega_b| < \frac{N}{6}    (17)

where E(X) denotes the expectation of the random variable X, \hat{f} denotes IDCT(signFun(DCT(f))), \|\hat{f}\| denotes the \ell_2 norm of the vector \hat{f}, and |\Omega_b| is the size of the support set of b. The detailed proof is given in [48]. An important note is that Eq. 17 only depends on the relative sparseness of the foreground and background, \|\hat{f}\| and \|\hat{I}\|. The sparse foreground means spatial sparsity, while the sparse background means spectral sparsity, and the only limit is that the background b is sufficiently sparse, |\Omega_b| < N/6. This meets our criterion for event representation. The property shows that the key to locating the foreground is the sparsity of the foreground and the simplicity of the background. Simplification steps consisting of background subtraction and thresholding benefit the anomaly detection, and sparsification steps can also be used: when obtaining the second channel, the object contour is adopted to reduce the spatial region of large objects in the image. Meanwhile, the sparsity of the background is unchanged because of the large number of normal objects. The discussion above certifies that abnormal events can be located. Following the proposition in [48], Gaussian smoothing is applied to improve the coverage of abnormal event regions. For a foreground f with non-zero elements independently drawn from the unit Gaussian distribution, over 79% of \hat{f} is expected to be contained in the support of the foreground U_f:

E\!\left( \frac{\alpha}{\sqrt{\alpha^2 + \beta^2}} \right) \geq \sqrt{\frac{2}{\pi}} \approx 0.7979, \quad \alpha = \sqrt{\sum_{i \in U_f} \hat{f}_i^2}, \quad \beta = \sqrt{\sum_{j \notin U_f} \hat{f}_j^2}    (18)

where U_f is the support set of f. This means that anomaly regions with high scores in the anomaly saliency attention map will be enlarged after Gaussian smoothing.
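A 1-D numerical sketch of this concentration property (hypothetical parameters; the paper's actual test in Section 4.1 is 2-D): a background that is sparse in the DCT domain plus a small Gaussian-valued foreground block, where the signature reconstruction concentrates energy on the foreground support far above the uniform level:

```python
import math
import random

def dct_1d(f):
    """Naive orthonormal 1-D DCT-II."""
    N = len(f)
    return [(math.sqrt(1.0 / N) if q == 0 else math.sqrt(2.0 / N)) *
            sum(f[x] * math.cos(math.pi / N * (x + 0.5) * q) for x in range(N))
            for q in range(N)]

def idct_1d(F):
    """Matching inverse DCT-II."""
    N = len(F)
    return [sum((math.sqrt(1.0 / N) if q == 0 else math.sqrt(2.0 / N)) *
                F[q] * math.cos(math.pi / N * (x + 0.5) * q) for q in range(N))
            for x in range(N)]

random.seed(0)
N = 256
# Background b: sparse support in the DCT domain, coefficients in [-4, 4].
B = [0.0] * N
for q in random.sample(range(N), 8):           # |Omega_b| = 8, well below N/6
    B[q] = random.uniform(-4.0, 4.0)
b = idct_1d(B)
# Foreground f: a small spatial block with N(0.05, 0.05) intensities.
support = set(range(100, 110))                  # |U_f| = 10
f = [random.gauss(0.05, 0.05) if x in support else 0.0 for x in range(N)]
I = [bi + fi for bi, fi in zip(b, f)]
# Signature reconstruction: I_hat = IDCT(sign(DCT(I))).
I_hat = idct_1d([1 if c > 0 else -1 for c in dct_1d(I)])
# Mean reconstructed energy per sample, inside vs. outside the support.
E_in = sum(I_hat[x] ** 2 for x in support) / len(support)
E_out = sum(I_hat[x] ** 2 for x in range(N) if x not in support) / (N - len(support))
assert E_in > 3.0 * E_out   # energy concentrates on the tiny, faint foreground
```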
4. The Experiments
4.1. Characteristic analysis of our method on psychological patterns
In Section 3.4, we theoretically analyzed the effectiveness of our method for the detection of anomalous events and indicated that the characteristics of the method match those of our representation of abnormal events. Psychological patterns are generally accepted as cases of attracting attention and are widely used to test the effectiveness of saliency detection algorithms. From another point of view, these psychological patterns are consistent with the representation of abnormal events at the feature level. Thus, we use eight kinds of psychological patterns to confirm that the assumption that abnormal events attract human attention is correct. Since the test images have only one channel (converted to gray), we treat the gray intensity image as the second extracted channel in the quaternion frame and set the other channels to zero. Weighting and multi-scale analysis are not used, due to the lack of channels and the fixed size of the objects. The standard deviation σ of the Gaussian kernel, proportional to the size of the object of interest, is set to 10, which implicitly assumes that the width of the object is about 20 pixels. The results are shown in Fig. 3. In Fig. 3, it is easy to observe that all these psychological patterns correspond to the quaternion-frame channels of abnormal events. For example, the first and second patterns usually correspond to the intensity-channel features extracted when strange objects appear in
Figure 3: Results on psychological pattern images. The first row shows size, color, shape, and density saliency patterns, and the third row shows two dissimilarity and orientation saliency patterns. The second and fourth rows are the corresponding saliency anomaly results.
surveillance areas, such as a white car appearing on a walkway. The third and fourth patterns represent the shape and density saliency patterns, respectively; these are similar to the second-channel features extracted when a cyclist or a gathering crowd appears on the walkway. The fifth and sixth patterns show salient length, which usually appears for abnormal objects moving at excessive speed or standing still on the street. Salient orientation patterns are shown in the last two images; such patterns are extracted when a few people walk differently from others. As discussed above, the appearance of abnormal events certainly attracts the attention of human vision, and our method is able to effectively detect these visually salient anomalies.
The effectiveness of our method relies on a precondition: the sparsity of the background and foreground. To evaluate our method thoroughly, we designed psychological patterns that test our approach under extreme conditions. First, we generated the test images. For the background b, which needs sparsity in the DCT domain, we randomly select the support of \hat{b} in the DCT domain; the number of non-zero entries of \hat{b} is |\Omega_b|, with values chosen randomly in [−4, 4]. For the foreground, which needs sparsity in the gray domain, we place a block of size |U_f| at a random location in the image and generate pixel intensities following a normal distribution with mean 0.05 and standard deviation 0.05. The evaluation criterion is the proportion of energy concentrated in the foreground support U_f. The standard deviation of the Gaussian kernel corresponds to the value of |U_f|. Setting different values of |\Omega_b| and |U_f|, the results are shown in Fig. 4. It is clear that the energy is concentrated in the foreground support, despite the very small amplitude of the foreground. It is worth mentioning that even with a complex background (|\Omega_b| = 70000), the abnormal saliency map still clearly shows the shape of the foreground support of the image, as shown in Fig. 4. Although the energy is proportional to the size of the foreground, the bigger the object is, the more difficult the detection becomes, because the mean energy per foreground pixel decreases when objects have a large area.

4.2. Experiment setup
As mentioned previously, we create a weighted quaternion frame to represent an event. The value of each weight is decided by the desired sensitivity to abnormal events. Although it is impossible to test all kinds of abnormal events, we choose four typical kinds from ped2 in the UCSD dataset to evaluate the sensitivity of each channel: a cyclist, cyclists moving in different directions, a vehicle, and a skater on the walkway.
Each channel feature in turn is placed on the second channel to create a quaternion frame, with the other channels set to zero. Using the proportion of energy concentrated in the foreground support as the evaluation criterion, the results are shown in Fig. 5.

Figure 4: Performance of the approach with different background complexity on a 240 × 320 image (N = 76800). The left figure presents the results for different |Ωb| (0, 100, 1000, 5000, 10000, 20000, 30000, 40000, 50000, 60000, 70000) and different foreground sizes |Uf| (1%, 5%, 10%, 20%, 25% of the area of the whole image). The right figure shows one example: the background, the foreground, the generated image and the resulting anomaly saliency map, in order (all in the spatial domain).

Figure 5: The sensitivity of each channel for the detection of different kinds of abnormal events.

It can be seen in Fig. 5 that the intensity and contour channels are more sensitive to objects other than common ones, such as a vehicle or a skater. Motion speed, an important cue to anomalous events, contributes more to the detection of abnormal events than the other channels. Considering the prior of abnormal events and the sensitivity of each channel, we empirically set the weights to [0.3, 1.1, 5, 0.2]. For the multi-scale analysis, the scale factors are [0.6, 0.8, 1.0, 1.2]. Based on our experimental analysis, we recommend tuning each channel weight within the range 0.1 to 10, and the scale factor within the range 0.1 to 0.5.

4.3. Quantitative evaluation results

To verify the effectiveness of the proposed algorithm, we apply it to the publicly available UCSD dataset [28]. The UCSD dataset consists of two subsets, ped1 and ped2, both of which are surveillance videos captured by a fixed camera overlooking pedestrian walkways. Ped1 shows a scene of people moving towards and away from the camera, with some perspective distortion; ped2 contains video of people moving parallel to the camera. The resolutions are 238 × 158 and 360 × 240, respectively. The normal events in the dataset are sequences of pedestrians on the walkways, with density varying from a few people to very dense. The anomalies include cyclists, skaters, small carts, people walking on a lawn, vehicles on walkways, gatherings, etc. These are all related to dynamic and appearance abnormality, and occur spontaneously, i.e., they are not staged on purpose. Ped1 contains 36 video clips for testing, whereas ped2 contains 12. Ten video clips in ped1 and all 12 in ped2 have ground truth annotated by pixel-level binary masks, which identify the regions containing anomalies. All experiments were run in MATLAB 2014a on the Windows platform, on a PC equipped with an Intel i5 CPU and 12 GB of memory. The Spectral Visual Saliency Toolbox by Boris Schauerte [50]
and the Quaternion Toolbox [58] by Steve Sangwine and Nicolas Le Bihan were used to implement the method. Frame-level and pixel-level criteria have been used to evaluate anomaly-detection accuracy. A frame containing anomalies is denoted as positive, otherwise as negative; likewise, a pixel belonging to an abnormal object is denoted as positive, otherwise as negative. The true positive rate (TPR) is the number of true positive frames divided by the number of positive frames, and the false positive rate (FPR) is the number of false positive frames divided by the number of negative frames. The ROC curve of TPR versus FPR is then plotted. Performance is also summarized by the equal error rate (EER), the ratio of misclassified frames at the point where FPR = 1 − TPR, and by the area under the ROC curve (AUC). The proposed approach is compared with five state-of-the-art approaches, as reported in [4]: the Mixture of Dynamic Textures (MDT) [4], the Mixture of Probabilistic PCA models (MPPCA) [27], the Social Force model (SF) [2], Social Force with MPPCA (SF-MPPCA) [4], and the method of Adam et al. [59]. Brief descriptions of the five algorithms can be found in [4]. In [4], the authors also provide the pixel-level groundtruth, which uses a binary mask to indicate the anomalous region. We gradually raise the threshold from small to large, each time calculating the false positive and true positive rates, to obtain a ROC curve. Fig. 6 shows the ROC curves of the proposed approach and the other methods on ped1 and ped2 based on frame-level groundtruth. The EER values of the ROC curves are shown in Tab. 1. We also calculate the AUC, shown in Tab. 2, from which we can observe that the average AUC of our approach is 86.5%, better than that of the comparison methods. Influenced by the perspective projection of the camera, our curve on ped2 is lower than MDT's at small false positive rates, although our EER value is better. A possible reason is that the DCT is a linear transform, while perspective projection is not. The quantitative results of anomaly detection based on pixel-level groundtruth are illustrated in Fig. 6. From the comparison, it can be observed that our method also achieves a high rate of anomaly localization, which benefits from the property that over 79% of the saliency energy is expected to be contained in the support of the foreground; the other methods suffer from the fixed scale used in their training phase.

Figure 6: ROC curves of different approaches on the ped1 and ped2 datasets. The upper row shows the ROC curves on ped1 (left) and ped2 (right) with the frame-level groundtruth. The lower row shows the ROC curve on ped2 with the pixel-level groundtruth.

Some examples of anomaly detection results are shown in Fig. 7. Note that the localization of the vehicle covers only part of it; this is because we set the standard deviation of the Gaussian filter to 5, i.e., we implicitly assume that the width of an abnormal object is about 10 pixels, so objects covering more than about 100 pixels are not localized precisely.
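The frame-level evaluation protocol used above (ROC curve, EER and AUC) can be sketched as follows; the anomaly scores here are synthetic stand-ins for per-frame saliency scores.

```python
import numpy as np

def roc_curve(scores, labels):
    """Frame-level ROC: sweep a threshold over anomaly scores.
    labels: 1 for frames containing anomalies (positive), 0 otherwise."""
    order = np.argsort(-scores)          # descending score = decreasing threshold
    labels = labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    tpr = tp / max(labels.sum(), 1)
    fpr = fp / max((1 - labels).sum(), 1)
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

def auc(fpr, tpr):
    """Area under the ROC curve by trapezoidal integration."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

def eer(fpr, tpr):
    """Equal error rate: the point on the curve where FPR = 1 - TPR."""
    i = int(np.argmin(np.abs(fpr - (1 - tpr))))
    return float((fpr[i] + (1 - tpr[i])) / 2)

# Synthetic scores: anomalous frames tend to score higher than normal ones.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.5, 100), rng.normal(0.0, 0.5, 100)])
labels = np.concatenate([np.ones(100), np.zeros(100)])
fpr, tpr = roc_curve(scores, labels)
print(f"AUC = {auc(fpr, tpr):.3f}, EER = {eer(fpr, tpr):.3f}")
```

Lower EER and higher AUC indicate better separation of anomalous from normal frames, which is how Tables 1 and 2 should be read.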
Figure 7: Examples of anomaly detection results with the proposed approach on the ped1 dataset (upper row) and the ped2 dataset (lower row). The anomalous events include a skater, a cyclist, a vehicle and a gathering.

5. Conclusion

In this paper, we proposed a saliency-based abnormal-event detection method. A weighted quaternion frame is used to represent events, and the QDCT signature is used to obtain the anomaly saliency attention map. The consistency between the saliency attention mechanism of human vision and the properties of the quaternion representation of events has been confirmed by theoretical analysis and extensive experiments. Experimental results on the UCSD datasets show that our method can detect anomalies efficiently and precisely. In the future, we will consider the representation of long-term events, e.g., applying QDCT over longer sequences of frames. The main motivation is that statistics-based frameworks have an inherent advantage in detecting long-lasting events through inference methods such as HMMs and CRFs, whereas, limited by its feature space, a quaternion frame exploits information from only a few frames.
6. Acknowledgements The work described in this paper is partially supported by National Natural Science Foundation of China (61473277, 61403364), Shenzhen Fundamental Research Program (JCYJ20140901003939022), SIAT Innovation Program for Ex-
Anomaly Detection Experiment: EER

Algorithm    SF     MPPCA   SF+MPPCA   Adam et al.   MDT    Ours
Ped1         31%    40%     32%        38%           25%    23%
Ped2         42%    30%     36%        42%           25%    20%
Average      37%    35%     34%        40%           25%    22%

Table 1: Quantitative performance comparison of different anomaly-detection algorithms. The first two rows show the EER values on the ped1 and ped2 datasets, and the average values are shown in the third row. They are based on frame-level groundtruth.
Anomaly Detection Experiment: AUC

Algorithm    SF      MPPCA   SF+MPPCA   Adam et al.   MDT     Ours
Ped1         68.3%   63.0%   69.3%      62.4%         80.7%   84.1%
Ped2         61.3%   72.4%   67.9%      60.8%         83.4%   88.9%
Average      64.8%   67.7%   68.6%      61.6%         82.1%   86.5%

Table 2: Performance comparison for anomaly detection based on frame-level groundtruth via different algorithms. The first two rows show the AUC values on the ped1 and ped2 datasets. The average values are given in the third row. Larger values indicate better performance.
cellent Young Researchers (201315), and Guangdong Innovative Research Team Program (201001D0104648280).
References [1] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Computing Surveys 41 (3) (2009) 15–15. [2] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 935–942. [3] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3449–3456.
[4] W. Li, V. Mahadevan, N. Vasconcelos, Anomaly detection and localization in crowded scenes, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (1) (2014) 18–32. [5] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis, F. Nuflo, Modeling visual attention via selective tuning, Artificial intelligence 78 (1) (1995) 507–545. [6] R. Parasuraman, S. Yantis, The attentive brain, Mit Press Cambridge, MA, 1998. [7] J. P. Gottlieb, M. Kusunoki, M. E. Goldberg, The representation of visual salience in monkey parietal cortex, Nature 391 (6666) (1998) 481–484. [8] D. L. Robinson, S. E. Petersen, The pulvinar and visual salience, Trends in neurosciences 15 (4) (1992) 127–132. [9] V. Saligrama, J. Konrad, P.-M. Jodoin, Video anomaly identification, IEEE Signal Processing Magazine 27 (5) (2010) 18–33. [10] W. Hu, T. Tan, L. Wang, S. Maybank, A survey on visual surveillance of object motion and behaviors, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34 (3) (2004) 334–352. [11] C. Xu, D. Tao, C. Xu, Large-margin multi-view information bottleneck, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (8) (2014) 1559–1572. [12] D. Tao, X. Li, X. Wu, S. J. Maybank, Geometric mean for subspace selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 260–274. [13] B. Geng, D. Tao, C. Xu, Y. Yang, X.-S. Hua, Ensemble manifold regularization, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (6) (2012) 1227–1233.
[14] D. Tao, X. Li, X. Wu, S. J. Maybank, General tensor discriminant analysis and Gabor features for gait recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (10) (2007) 1700–1715. [15] D. Tao, X. Tang, X. Li, X. Wu, Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (7) (2006) 1088–1099. [16] S. Gong, T. Xiang, Recognition of group activities using dynamic probabilistic networks, in: IEEE International Conference on Computer Vision, 2003, pp. 742–749. [17] H. M. Dee, D. Hogg, Detecting inexplicable behaviour, in: The British Machine Vision Conference, 2004, pp. 1–10. [18] O. Raz, P. Koopman, M. Shaw, Semantic anomaly detection in online data sources, in: International Conference on Software Engineering, 2002, pp. 302–312. [19] C. Ding, C. Xu, D. Tao, Multi-task pose-invariant face recognition, IEEE Transactions on Image Processing 24 (3) (2015) 980–993. [20] S. Wu, B. E. Moore, M. Shah, Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2054–2060. [21] H. Guo, X. Wu, N. Li, R. Fu, G. Liang, W. Feng, Anomaly detection and localization in crowded scenes using short-term trajectories, in: IEEE International Conference on Robotics and Biomimetics, 2013, pp. 245–249. [22] D. Tran, J. Yuan, Optimal spatio-temporal path discovery for video event detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3321–3328.
[23] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, S. Maybank, A system for learning statistical motion patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (9) (2006) 1450–1464. [24] F. Jiang, J. Yuan, S. A. Tsaftaris, A. K. Katsaggelos, Anomalous video event detection using spatiotemporal context, Computer Vision and Image Understanding 115 (3) (2011) 323–333. [25] S. Calderara, U. Heinemann, A. Prati, R. Cucchiara, N. Tishby, Detecting anomalies in peoples trajectories using spectral graph analysis, Computer Vision and Image Understanding 115 (8) (2011) 1099–1111. [26] E. B. Ermis, V. Saligrama, P. Jodoin, J. Konrad, Motion segmentation and abnormal behavior detection via behavior clustering, in: IEEE International Conference on Image Processing, 2008, pp. 769–772. [27] J. Kim, K. Grauman, Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2921–2928. [28] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981. [29] D. Xu, X. Wu, D. Song, N. Li, Y.-L. Chen, Hierarchical activity discovery within spatio-temporal context for video anomaly detection, in: IEEE International Conference on Image Processing, 2013, pp. 3597–3601. [30] M. J. Roshtkhari, M. D. Levine, An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions, Computer Vision and Image Understanding 117 (10) (2013) 1436–1452. [31] M. Bertini, A. Del Bimbo, L. Seidenari, Multi-scale and real-time nonparametric approach for anomaly detection and localization, Computer Vision and Image Understanding 116 (3) (2012) 320–329.
[32] D. Ryan, S. Denman, C. Fookes, S. Sridharan, Textures of optical flow for real-time anomaly detection in crowds, in: IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2011, pp. 230–235. [33] V. Reddy, C. Sanderson, B. C. Lovell, Improved anomaly detection in crowded scenes via cell-based analysis of foreground speed, size and texture, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2011, pp. 55–61. [34] T. Xiang, S. Gong, Incremental and adaptive abnormal behaviour detection, Computer Vision and Image Understanding 111 (1) (2008) 59–73. [35] N. Li, X. Wu, D. Xu, H. Guo, W. Feng, Spatio-temporal context analysis within video volumes for anomalous-event detection and localization, Neurocomputing 155 (1) (2015) 309–319. [36] X. Wang, X. Ma, W. E. L. Grimson, Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (3) (2009) 539–555. [37] O. Boiman, M. Irani, Detecting irregularities in images and in video, International Journal of Computer Vision 74 (1) (2007) 17–31. [38] C. Lu, J. Shi, J. Jia, Abnormal event detection at 150 fps in matlab, in: IEEE International Conference on Computer Vision, 2013, pp. 2720–2727. [39] S. Han, R. Fu, S. Wang, X. Wu, Online adaptive dictionary learning and weighted sparse coding for abnormality detection, in: IEEE International Conference on Image Processing, 2013, pp. 151–155. [40] X. Hou, L. Zhang, Saliency detection: A spectral residual approach, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[41] R. Peters, L. Itti, The role of Fourier phase information in predicting saliency, Journal of Vision 8 (6) (2008) 879–879. [42] C. Guo, Q. Ma, L. Zhang, Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8. [43] P. Bian, L. Zhang, Biological plausibility of spectral domain approach for spatiotemporal visual saliency, in: Advances in Neuro-Information Processing, Vol. 55, 2009, pp. 251–258. [44] C. Guo, L. Zhang, A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, IEEE Transactions on Image Processing 19 (1) (2010) 185–198. [45] M. D. Levine, X. An, H. He, Saliency detection based on frequency and spatial domain analysis, Neuroscience 8 (8) (2005) 975–977. [46] A. V. Oppenheim, J. S. Lim, The importance of phase in signals, Proceedings of the IEEE 69 (5) (1981) 529–541. [47] T. Huang, J. Burnett, A. Deczky, The importance of phase in image processing filters, IEEE Transactions on Acoustics, Speech and Signal Processing 23 (6) (1975) 529–542. [48] X. Hou, J. Harel, C. Koch, Image signature: Highlighting sparse salient regions, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1) (2012) 194–201. [49] B. Schauerte, R. Stiefelhagen, Predicting human gaze using quaternion DCT image signature saliency and face detection, in: IEEE Workshop on Applications of Computer Vision, 2012, pp. 137–144. [50] B. Schauerte, R. Stiefelhagen, Quaternion-based spectral saliency detection for eye fixation prediction, in: European Conference on Computer Vision, 2012, pp. 116–129.
[51] W. Liu, D. Tao, Multiview hessian regularization for image annotation, IEEE Transactions on Image Processing 22 (7) (2013) 2676–2687. [52] C. Xu, D. Tao, C. Xu, Multi-view intact space learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (2015) 1–14 (in press). [53] J. Yu, D. Tao, M. Wang, Y. Rui, Learning to rank using user clicks and visual features for image retrieval, IEEE Transactions on Cybernetics 45 (4) (2015) 767–779. [54] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, IEEE Transactions on Image Processing 23 (5) (2014) 2019–2032. [55] J. Yu, Y. Rui, Y. Y. Tang, D. Tao, High-order distance-based multiview stochastic learning in image classification, IEEE Transactions on Cybernetics 44 (12) (2014) 2431–2442. [56] S. Engel, X. Zhang, B. Wandell, Colour tuning in human visual cortex measured with functional magnetic resonance imaging, Nature 388 (6637) (1997) 68–71. [57] T. Judd, F. Durand, A. Torralba, Fixations on low-resolution images, Journal of Vision 11 (4) (2011) 14. [58] T. A. Ell, N. Le Bihan, S. J. Sangwine, Quaternion Fourier Transforms for Signal and Image Processing, John Wiley & Sons, 2014. [59] A. Adam, E. Rivlin, I. Shimshoni, D. Reinitz, Robust real-time unusual event detection using multiple fixed-location monitors, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (3) (2008) 555–560.