Parallel spatial-temporal convolutional neural networks for anomaly detection and location in crowded scenes

Zheng-ping HU 1,2, Le ZHANG 1,2, Shu-fang LI 1,2, De-gang SUN 3
(1 School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China)
(2 Hebei Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao, Hebei 066004, China)
(3 School of Electronic Information and Engineering, Shandong Huayu University of Technology, Dezhou 253000, China)
Abstract: Anomaly detection and location in crowded scenes have recently attracted considerable attention in the computer vision research community, owing to the growing use of intelligent surveillance to improve public security. We propose a novel parallel spatial-temporal convolutional neural network model to detect and localize abnormal behavior in video surveillance. Our approach contains two main steps. First, considering the typical position of the camera and the large amount of background information, we introduce a novel spatial-temporal cuboid-of-interest detection method based on a varied-size cell structure and an optical flow algorithm. Then, we use parallel 3D convolutional neural networks to describe the same behavior over different temporal lengths. This step ensures that most of the behavior information in the cuboids is captured while information unrelated to the major behavior is reduced. Evaluation results on benchmark datasets show the superiority of our method compared to state-of-the-art methods.

Keywords: abnormal detection; video surveillance; parallel 3D convolutional neural networks; spatial-temporal interest cuboids
Ⅰ. Introduction
With the spread of intelligent systems and the growing attention to public safety, people have an urgent need for intelligent monitoring systems capable of processing massive video data in real time. Video anomaly detection has therefore become a research focus of image processing and machine vision, and it plays an increasingly important role in social life [1].
There are various definitions of "anomaly" in different video sequences, because the meaning of an anomaly always depends on the specific scene and location [2]. For example, normal bicycle riding is defined as abnormal behavior in a pedestrian scene [3]. However, the common point among these definitions is that abnormal behavior is always opposed to normal behavior; that is, a behavior in the video whose spatial-temporal features differ from those of normal behavior is likely to be abnormal. As a typical binary classification problem, anomaly detection and location in video surveillance not only needs to classify the anomalous target effectively, but also needs to detect its location. One difference between abnormal event detection and an ordinary classification problem is that, in anomaly detection, several normal events share a common label and are represented by one set of features [4]. Representing the vast variety of normal events in crowded scenes with a single set of features is likely to cause inaccurate detection and high computational cost. The other difference is that anomaly detection requires abnormal behaviors not only to be recognized exactly in video surveillance, but also to be located accurately in both the spatial and temporal domains.
Many methods for detecting abnormal events in crowds have been proposed in recent years. Most of them fall roughly into two categories: (1) anomaly detection methods based on frames, which capture 2D visual features from single frames and motion information between frames [5-7]; and (2) anomaly detection methods based on spatial-temporal interest cuboids, which extract spatial-temporal interest cuboids from surveillance video sequences and then acquire spatial-temporal information directly from the 3D cuboids [4,8,9]. Compared with frame-based methods, methods that directly describe the spatial-temporal interest cuboid achieve better performance. In fact, the size of the cuboid is a sensitive factor for anomalous events in crowded-scene videos, because it should be big enough to contain the whole behavior and, at the same time, small enough to exclude noise [9].
Owing to the typical position of the camera used to capture surveillance video, regions of the scene that are relatively close to the camera provide more descriptive information [7]. Thus, we need to take the camera's position into consideration when extracting features from video sequences. In order to extract more effective features from targets of different scales in the scene, we apply optical flow and a variable-sized cell grid to obtain spatio-temporal interest cuboids. The cells with varying spatial sizes are segmented into two kinds of cuboids with different temporal lengths, so that each action event is represented over both a long and a short time span. To detect and localize anomalous behavior in video sequences of crowded scenes, we propose a parallel spatial-temporal CNN model with long and short frame lengths, which describes events over different temporal lengths.
The main contributions of this paper are as follows: (1) We acquire the spatial-temporal interest volumes by optical flow and a variable-sized cell structure, which both discards the background information in the crowded scene and reduces the computational cost. (2) The variable-sized cell grid, with larger cells at the lower region of the scene (closer to the camera), ensures that most of the information about a single object can be extracted in crowded scenes. (3) A parallel spatial-temporal convolutional neural network is proposed to learn spatial-temporal features from cuboids of different temporal lengths, which enables better performance.
We test the effectiveness of the proposed method on the classical local abnormal behavior dataset UCSD and the global abnormal behavior dataset UMN, and compare our results with classical anomaly detection algorithms under different evaluation criteria. Experimental results show that the proposed method is effective in detecting both global and local anomalies, and the detection results under different evaluation criteria are similar, which indicates that our method achieves robust performance.
The rest of this paper is organized as follows. In Section II, we briefly review existing abnormal crowd activity detection methods. Section III gives the details of the proposed approach. In Section IV, we present evaluation results and discussion. We conclude this paper in Section V.
Ⅱ. The Related Work
In video anomaly detection research, appropriate feature extraction plays an important role in identifying abnormal behavior in crowded scenes, and researchers have proposed various methods for feature extraction and behavior representation. These methods can be divided into two categories: handcrafted features designed manually, and deep features obtained directly from the video sequence. Both are rooted in theories of biological vision; the difference is that handcrafted features imitate the human visual framework, whereas deep feature extraction focuses on learning the distribution of the data.
Handcrafted features are extracted from images according to the sensitivity of human vision and usually have a clear physical meaning. Handcrafted features commonly used in video anomaly detection include texture features [10-11], color features [12], MoSIFT (Motion Scale Invariant Feature Transform) [13], optical flow, trajectories, and so on. For example, Li et al. use mixtures of dynamic textures (MDT) to build a normal behavior model and use saliency to distinguish anomalies from normal events in space, so that outliers under the model can be regarded as anomalous events [10-11]. Rao et al. construct an anomaly detection framework that describes the spatial characteristics of contrast, correlation and uniformity of anomalous events or objects through the Gray Level Co-occurrence Matrix (GLCM) from a statistical point of view, and detect anomalous wandering behavior in the crowd by spatio-temporal coding [12]. MoSIFT, an effective feature descriptor, detects moving and distinctive interest points in space and measures their intensity according to the optical flow magnitude around them. Xu et al. use Kernel Density Estimation (KDE) to select among the low-level MoSIFT descriptors extracted from video and achieve better anomaly detection performance [13].
Optical flow, as an effective moving-object descriptor, is widely applied in anomaly detection research. For example, optical acceleration and the histogram of optical flow gradients are used by Hajananth et al. [14] to detect abnormal objects and speed violations in the scene. In contrast to common optical flow methods, the histogram of optical flow orientation (HOFO) descriptor proposed by Wang et al. [15] has a lower feature dimension and obtains better performance when describing global abnormal events. Wang et al. [8] propose two novel local motion video descriptors, SL-HOF and ULGP-OF, to describe normal events in video sequences. As a novel optical flow descriptor, SL-HOF captures the spatial distribution of three-dimensional local motion within the spatial-temporal interest block. The ULGP-OF descriptor describes the motion of local texture regions in the video foreground by combining the classic 2D texture descriptor Local Gradient Pattern (LGP) with optical flow.
Trajectories not only provide information about the moving object, but also represent it from different aspects such as length, pixels and location. Biswas et al. [16] propose a novel trajectory extraction method that is suitable for scenes with different crowd densities. In contrast to traditional tracklets, short local trajectories (SLT) are extracted as super-pixels belonging to foreground objects by capturing both spatial and temporal information of a candidate moving object. Li et al. [17] propose a visual abnormal behavior detection method based on trajectory sparse reconstruction analysis, which builds the dictionary using Least-squares Cubic Spline Curve Approximation (LCSCA) to detect anomalous events in video scenes. In addition, Mousavi et al. [18] extend the efficient histogram of oriented tracklets (HOT) video representation to an improved HOT and achieve state-of-the-art results on the available datasets. The argument put forward by Coşar et al. [19] is that feature extraction methods, whether trajectory-based or pixel-based, are limited in representing all kinds of abnormal behavior.
Trajectory features can detect anomalies related to speed and direction, but they struggle with abnormal movements that involve human limb motion, such as jumping or fighting. Meanwhile, pixel-based methods may not detect behaviors related to the overall movement of people, such as wandering terrorists or thieves. The authors therefore introduce an integrated method that combines the output of object trajectory analysis with pixel-based analysis for abnormal behavior inference. This method can detect abnormal behaviors related to the speed and direction of object trajectories as well as complex behaviors related to the finer motion of each object.
Although handcrafted feature extraction has a solid foundation, its subjective factors are too strong to express behavior objectively. Moreover, features extracted in this way often have strong limitations across datasets; that is, handcrafted features may perform well on some databases but poorly on others. When deep feature extraction is carried out by learning from data directly, only the rules of feature extraction need to be designed. For example, a neural network obtains the parameters of a deep model and extracts deep features by manually designing the network structure and learning rules, so the obtained features often cannot be given a physical interpretation for each dimension. In recent years, thanks to the rapid development of deep learning and cloud computing, breakthroughs have been made in computer vision. A standard CNN extracts only 2D features by applying two-dimensional convolutions to images, so it is limited in capturing spatial-temporal information from video sequences. To address this issue, Simonyan and Zisserman propose the two-stream network, which performs feature learning and behavior discrimination on the spatial information of RGB images and on the optical flow maps of the video sequence, respectively. The final action classification is obtained by fusing the outputs of the two networks, and experiments show that the two-stream network is effective for feature extraction and behavior representation [20]. A series of improved algorithms are based on the two-stream network, such as the convolutional two-stream network [21], Temporal Segment Networks (TSN) [22] and TSN with weighted fusion [23].
Although the two-stream convolutional neural network can extract temporal and spatial features of video sequences to some extent, this kind of method still relies on hand-designed features and retains a certain subjectivity. In addition, Tran et al. [24] introduce a deep 3-dimensional convolutional neural network (C3D) to solve the behavior classification problem in video by obtaining spatial-temporal features directly from the video sequence. Based on this method, Zhou et al. [9] take the spatial-temporal volumes of interest containing moving objects, detected by optical flow, as the input of C3D to detect and locate abnormal behavior in video. Sabokrou et al. [4] propose a similar framework with cascading 3D deep neural networks, in which a 3D auto-encoder is used to detect the spatial-temporal interest blocks that are fed to the C3D. Convolutional neural networks can recognize abnormal behavior in video sequences through supervised deep learning; how to perform unsupervised deep learning, however, remains an open problem. Ravanbakhsh et al. [5] realize an unsupervised method based on Generative Adversarial Nets (GAN) to detect and locate anomalies in video sequences by exploiting the game between the generative model and the discriminative model. In order to learn an internal representation of scene normality, the GAN is trained using normal frames and the corresponding optical-flow images. During the test phase, the appearance and motion representations of the real data are compared with those of normal data, and abnormal areas are detected by measuring how well they coincide locally.
Ⅲ. The Proposed Approach
In this paper, considering the typical position of the camera and the large amount of background information, we introduce a novel spatial-temporal cuboid-of-interest detection method based on a varied-size cell structure and an optical flow algorithm. Then, a parallel 3D convolutional neural network is used to describe the same behavior over different temporal lengths. This not only ensures that most of the behavior information in the spatial-temporal interest cuboids is captured, but also reduces the information unrelated to the major behavior in each cuboid. Finally, the overall abnormality judgment is obtained by fusing the judgments made on the spatial-temporal interest blocks of different frame lengths, and the detection and location of abnormal behavior in the surveillance video are realized from the location of each block together with its abnormality judgment. An overview of the procedure for detecting and localizing anomalies is depicted in Fig. 1.
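The paper states that the per-cuboid judgments of the two temporal lengths are fused, but does not spell out the fusion rule, so the snippet below is only a minimal sketch of one plausible choice (an "AND" rule on the two softmax outputs with a 0.5 threshold); the function name and the threshold are assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_decisions(p_long, p_short, threshold=0.5):
    """Fuse abnormality probabilities from the 7-frame and 3-frame C3D streams.

    p_long, p_short: per-cuboid abnormality probabilities (softmax outputs).
    A cuboid is declared abnormal only when both streams exceed the threshold;
    this 'AND' rule is an assumption, as the paper only states that the two
    judgments are fused.
    """
    p_long = np.asarray(p_long)
    p_short = np.asarray(p_short)
    return (p_long > threshold) & (p_short > threshold)

# Example: three cuboids scored by the two streams.
print(fuse_decisions([0.9, 0.2, 0.7], [0.8, 0.1, 0.4]))  # -> [ True False False]
```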
Fig. 1 Overview of the proposed video anomaly detection approach: the varied-size cell structure and optical flow yield 7-frame and 3-frame cuboids of interest, which are resized to 32×32×7 and 32×32×3 and fed to the parallel double C3D for normal/abnormal classification.
A. The detection of spatial-temporal cuboids
In previous works, researchers usually segment the video frame with an equal-sized cell structure and optical flow to detect spatial-temporal volumes of interest, similar to the segmentation used in BOW. This non-overlapping cell structure divides the video scene into small pieces of the same size, which reduces both noise and computational cost. In fact, because of the typical position of the camera, more information can be extracted from regions of the scene relatively close to the camera than from those far away from it. Therefore, when an equal-sized grid is used to segment the whole scene, a cell may be too large for the upper part of the scene, so that it captures mostly irrelevant information, while being too small for the lower part, so that it misses most of the information about an object there. Object tracking is an excellent object detection method and could be used to compensate for the perspective of the scene; however, it is difficult to track specific objects in crowded scenes. After analyzing the majority of existing surveillance videos, a valid assumption can be made that the surveillance video is acquired by a camera installed in a relatively high position looking downward [7]. Therefore, we introduce a non-overlapping cell structure with a variable-sized grid over the spatial domain to extract more effective features. This method guarantees that more of the behavior of interest and less of the unrelated content is captured in the spatial region of a given spatial-temporal volume.
Fig. 2 Example cell structure for a scene.
The cell structure in this paper is created in the following order: from top to bottom and from left to right. Cells in the same row share the same size, the size of cells in the same column increases from top to bottom, and every cell is kept square. If the side length of a cell in the $k$-th row is denoted by $y_k$, the side length of a cell in the $(k+1)$-th row can be expressed as

$$y_{k+1} = a\, y_k \qquad (3\text{-}1)$$

If the growth rate $a$ of the cells is constant, the vertical size $Y$ of the frame can be expressed in terms of the recursive vertical dimension of each cell row as

$$Y = \sum_{k=0}^{n} a^{k} y_0 \qquad (3\text{-}2)$$

where $n$ indexes the cell rows in the vertical direction (so there are $n+1$ rows) and $y_0$ is the vertical dimension of the smallest cell. The cell sizes thus form a geometric sequence, so

$$Y = \frac{a^{n+1}-1}{a-1}\, y_0 \qquad (3\text{-}3)$$

An example of the cell structure is shown in Fig. 2. The growth rate can be set as a fixed number or varied with the frame size and the object region. The cell structure is designed to capture most of the information of a single object while ensuring that most of the global information of the entire scene is also captured; this requires a suitable choice of the growth rate $a$ and the initial cell size $y_0$.
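As a concrete illustration of Eqs. (3-1)-(3-3), the short sketch below computes the row heights of the variable-sized grid for a frame of height $Y$ given a growth rate $a$ and smallest cell size $y_0$; the values $a = 1.3$ and $y_0 = 12$ are illustrative only, and absorbing the leftover pixels into the last row is our own choice rather than a rule stated in the paper.

```python
def cell_row_heights(frame_height, y0, a):
    """Vertical size of each cell row, smallest at the top (Eqs. 3-1 to 3-3).

    Rows grow geometrically, y_{k+1} = a * y_k, until the frame height Y is
    covered; the leftover pixels are absorbed into the last (largest) row.
    """
    heights, y, used = [], float(y0), 0.0
    while used + y < frame_height:
        heights.append(y)
        used += y
        y *= a                                  # Eq. (3-1): next row is a times taller
    if not heights:                             # frame shorter than one cell
        return [float(frame_height)]
    heights[-1] += frame_height - used          # make the rows sum exactly to Y
    return heights

# Illustrative values only: a 240-pixel-high frame, 12-pixel top row, growth rate 1.3.
rows = cell_row_heights(240, y0=12, a=1.3)
print([round(h, 1) for h in rows], round(sum(rows), 1))   # heights top to bottom, total 240
```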
Fig. 3 Illustration of a 2D CNN applied to images.
Fig. 4 Illustration of C3D applied to successive frames in a video. In this example, the 3D kernel has a temporal dimension of 3; that is, each feature map is obtained by performing spatial-temporal convolutions across 3 adjacent frames.
Optical flow, as an effective motion detection and robust object tracking method, is applied in our method to extract the spatial-temporal volumes of interest. When detecting and localizing anomalous events in crowded-scene video sequences, it is unnecessary to consider background regions or static object regions. Therefore, we discard the cuboids containing little or no motion information and feed the remaining cuboids to the parallel spatial-temporal neural networks. In our proposed method, the temporal lengths of the cuboids are chosen as 7 and 3 frames. A temporal length of 7 frames is long enough to learn the essential information of normal or abnormal behavior. However, a volume of 7 frames may contain two kinds of behavior: for example, the earlier frames of the patch may show a biker, while the later frames show a pedestrian entering the scene. It is hard to assign a single label to such a patch, because either label alone is unsuitable. Hence, we extract two kinds of cuboids, with temporal lengths of 7 and 3 frames, to describe each behavior over long and short time spans. After resizing, we obtain two kinds of patches with sizes of 32×32×7 and 32×32×3. In the process of extracting candidate spatio-temporal volumes, the complex problem of describing multiple behaviors in a crowded scene is simplified to extracting features of a few behaviors from the interest volumes of the whole video sequence. The large number of small cuboids of size 32×32×7 and 32×32×3 not only satisfies the need for large amounts of data when training convolutional neural networks, but also reduces the memory requirements.
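The sketch below shows one way this cuboid-of-interest extraction could be implemented with dense Farneback optical flow from OpenCV: cells whose mean flow magnitude is small are discarded, and the surviving cells are cropped, resized to 32×32 and stacked over 7 or 3 frames. The motion threshold, the use of Farneback flow and the grayscale input are assumptions made for illustration, not details reported in the paper.

```python
import cv2
import numpy as np

def motion_cells(prev_gray, curr_gray, cell_boxes, mag_thresh=0.5):
    """Return the cell boxes whose mean optical-flow magnitude exceeds a threshold.

    cell_boxes: list of (x, y, w, h) rectangles from the variable-sized grid.
    Cells with little or no motion (background, static objects) are discarded.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)           # per-pixel motion magnitude
    return [(x, y, w, h) for (x, y, w, h) in cell_boxes
            if mag[y:y + h, x:x + w].mean() > mag_thresh]

def build_cuboid(frames, box, length):
    """Stack `length` grayscale frames cropped to `box`, resized to 32x32 each."""
    x, y, w, h = box
    patches = [cv2.resize(f[y:y + h, x:x + w], (32, 32)) for f in frames[:length]]
    return np.stack(patches, axis=-1)            # shape: 32 x 32 x length
```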
B. Spatial-temporal convolution
As is well known, deep learning and convolutional neural networks have provided many new ideas for research in computer vision. Although the common convolutional neural network achieves excellent performance in learning features of two-dimensional images, it is limited in capturing spatial-temporal features of video sequences. To apply convolutional neural networks to video analysis, Simonyan and Zisserman [20] propose a novel two-stream convolutional neural network, which captures spatial information from single images and temporal information from the optical flow maps of the video sequence within the same network structure. However, the temporal information used in the two-stream CNN is derived from hand-crafted features, which restricts the extraction of effective information to some extent. In Tran et al. [24], a deep 3-dimensional convolutional neural network (C3D) is proposed to obtain spatial-temporal features from video sequences simply and efficiently. When a 2D CNN is applied to detect anomalous events in crowded-scene video, only spatial information from individual frames is obtained, and the temporal characteristics of the behavior are neglected. The schematic diagram of the 2D convolution operation on an image is shown in Fig. 3. The three-dimensional convolutional neural network allows spatial-temporal features to be extracted directly from the video sequence with a three-dimensional convolution kernel; the corresponding schematic is shown in Fig. 4. In the example of Fig. 4, the temporal dimension of the convolution kernel is 3, so the convolution is performed across 3 adjacent frames. In a 3D convolution, each feature map in a convolution layer is connected to several adjacent continuous frames in the layer above, so motion information can be captured directly from surveillance video by C3D.
As shown in Fig. 4, one value in the convolution map is obtained from the local receptive field of three consecutive frames at the same location in the layer above. Compared with a 2D CNN, a 3D CNN captures not only the spatial information of the behavior but also the temporal information that is significant for discriminating abnormal behavior. It obtains spatial-temporal features directly from the video sequence during convolution, and therefore expresses not only the spatial characteristics of object appearance but also temporal characteristics such as speed. 3D CNN is a supervised method that achieves high performance on video datasets with abundant samples. The video feature descriptor learned by a 3D CNN has powerful discriminative ability and can represent different types of video; this relatively simple descriptor is conducive to solving real-time tasks that demand high computational efficiency. The $j$-th feature map $a_{ij}$ in the $i$-th layer is calculated as

$$a_{ij} = f\Big(\sum_{n} W_{n} * a_{(i-1)n} + b_{ij}\Big) \qquad (3\text{-}4)$$

where $a_{(i-1)n}$ is the $n$-th feature map of the $(i-1)$-th layer, $f(\cdot)$ is the nonlinear function defined as $f(x)=\max(0,x)$, $W_n$ is the filter kernel, $n$ indexes the set of feature maps of the $(i-1)$-th layer connected to the current feature map, $*$ denotes the convolution operation, and $b_{ij}$ is the scalar bias term of the current feature map. The values of $W_n$ and $b_{ij}$ are obtained by training the network. In the spatial-temporal convolution operation, the response of the 3D kernel $W_n$ on the spatial-temporal volume $a_{(i-1)n}$ at position $(x, y, t)$ is defined as

$$W_{n} * a_{(i-1)n} = \sum_{u=0}^{U_i-1}\sum_{v=0}^{V_i-1}\sum_{r=0}^{R_i-1} w_{n}^{uvr}\, a_{(i-1)n}^{(x+u)(y+v)(t+r)} \qquad (3\text{-}5)$$

where $U_i$, $V_i$ and $R_i$ are the height, width and temporal length of the 3D kernel, respectively, and $(x, y, t)$ ranges over the spatial-temporal interest cuboid $a_{(i-1)n}$.
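A minimal PyTorch sketch of the operation in Eqs. (3-4) and (3-5): a bank of 3×3×3 spatio-temporal kernels convolved over a stack of frames, followed by the nonlinearity f(x) = max(0, x). The single input channel and the batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# One spatio-temporal convolution layer: 64 kernels of size 3x3x3 (Eq. 3-5),
# followed by the ReLU nonlinearity f(x) = max(0, x) used in Eq. (3-4).
conv = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=(3, 3, 3), padding=1)
relu = nn.ReLU()

# A batch of 8 single-channel cuboids, 7 frames of 32x32 pixels each
# (PyTorch layout: batch, channels, temporal length, height, width).
cuboids = torch.randn(8, 1, 7, 32, 32)
feature_maps = relu(conv(cuboids))
print(feature_maps.shape)  # torch.Size([8, 64, 7, 32, 32]) thanks to padding=1
```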
C. The parallel spatial-temporal convolution

Fig. 5 The spatial-temporal CNN structure used in this paper. It contains 5 convolutional layers, 5 pooling layers and 3 fully connected layers. The two kinds of input share the same network structure, but the parameters of the two networks differ: those for the 32×32×7 input are shown in the upper part of the figure, and those for the 32×32×3 input in the lower part.
In this subsection, we introduce the parallel spatial-temporal neural network structure developed in our method for anomaly detection in crowded-scene video sequences. In detecting and localizing anomalous events, different kinds of anomalies require different temporal lengths to be identified. Clearly, little information can be obtained over a very short time; however, if the temporal length of the cuboid is too long, information unrelated to the major behavior may be captured, and a larger input patch for C3D also requires more storage. Therefore, the parallel C3D is proposed to describe the behavior in the video sequence at two different temporal lengths. Learning the complete spatial-temporal features of the behavior ensures an accurate representation of the moving targets. The proposed network structure is based on VGG-16, and we apply edge extension (padding) in each convolution layer so that the output of each convolution layer has the same size as its input, which allows a deeper network and more effective features. The structure of the network used in the parallel C3D is shown in Fig. 5. The two kinds of input share the same network structure, but the parameters of the two networks differ; the parameters for the 32×32×7 input are shown in the upper part of the figure, and those for the 32×32×3 input in the lower part.
The input of the parallel C3D is the unprocessed spatial-temporal interest volumes of size 32×32×7 and 32×32×3. In the first layer Conv1, we apply spatial-temporal convolution with 64 different 3D kernels of size 3×3×3 (3×3 in the spatial dimensions and 3 in the temporal dimension) to the input data, generating 64 feature maps. In this layer the spatial pad and temporal pad are both set to 1, so the 64 feature maps keep the sizes 32×32×7 and 32×32×3 in the two streams. Each feature map of Conv1 is then subsampled by a factor of 2×2×1 in the pooling layer Pool1. Pooling only reduces the size of the feature maps, not their number, so after Pool1 the 64 feature maps of size 16×16×7 or 16×16×3 become the input of Conv2, which applies 3D convolution with the same edge extension as Conv1 and 3×3×3 kernels to generate 128 feature maps of size 16×16×7 or 16×16×3. In the second pooling layer Pool2, the same parameters are used in both C3D streams: the 128 feature maps of size 16×16×7 or 16×16×3 are subsampled by a factor of 2×2×2, giving 128 feature maps of size 8×8×4 or 8×8×2. In Conv3 we convolve the feature maps with edge extension in both the spatial and temporal domains, using 256 different 3×3×3 kernels, and obtain 256 feature maps of size 8×8×4 or 8×8×2. We then use different pooling layers in the two networks: the 8×8×4 feature maps are subsampled by a factor of 2×2×2, while the 8×8×2 feature maps are subsampled only spatially, by a factor of 2×2×1. After this pooling layer Pool3 the two networks have the same feature map size of 4×4×2. In the Conv4 layer, 256 feature maps of size 4×4×2 are generated by 3×3×3 kernels with edge expansion of size 1. In the Pool4 layer, the feature maps are subsampled to 2×2×1 by a factor of 2×2×2. In the last convolution layer Conv5, 3×3×3 kernels with edge expansion of size 1 are again applied, and the last pooling layer Pool5 subsamples the feature maps to 1×1×1 by a factor of 2×2×1. Pool5 is followed by the fully connected layer fc6, in which a 1×1×1 convolution is performed to capture higher-level features and generate 2048 feature maps of size 1×1×1. The second fully connected layer fc7 applies 2048 kernels and does not change the number or size of the feature maps. After the third fully connected layer, the number of output units of each network equals the number of behavior classes (abnormal or normal). In this parallel C3D network, max pooling is adopted in all pooling layers and a softmax is applied at the output. Better features are obtained by using 3×3×3 kernels as much as possible, so 3×3×3 kernels are used in all convolution layers of the network.
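The sketch below assembles the two streams of the parallel C3D in PyTorch following the layer sizes described above. The ceil-mode pooling and the single grayscale input channel are assumptions chosen so that the quoted feature-map sizes (e.g., 7 frames shrinking to 4 after Pool2) come out as stated; it is an illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class C3DStream(nn.Module):
    """One C3D stream of the parallel network (a sketch following Fig. 5).

    Pooling kernels are written as (temporal, height, width); ceil_mode=True is an
    assumption so that 7 frames shrink to 4 (not 3) after Pool2, matching the text.
    """
    def __init__(self, pools):
        super().__init__()
        chans = [1, 64, 128, 256, 256, 256]
        layers = []
        for i in range(5):
            layers += [nn.Conv3d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool3d(kernel_size=pools[i], stride=pools[i], ceil_mode=True)]
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Linear(256, 2048), nn.ReLU(inplace=True),   # fc6 (1x1x1 conv equivalent)
            nn.Linear(2048, 2048), nn.ReLU(inplace=True),  # fc7
            nn.Linear(2048, 2))                            # fc8: normal vs. abnormal

    def forward(self, x):                         # x: (batch, 1, T, 32, 32)
        x = self.features(x).flatten(1)           # -> (batch, 256) after Pool5
        return torch.softmax(self.classifier(x), dim=1)

# Pooling schedules for the two streams, written as (temporal, height, width):
long_stream  = C3DStream(pools=[(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (1, 2, 2)])
short_stream = C3DStream(pools=[(1, 2, 2), (2, 2, 2), (1, 2, 2), (2, 2, 2), (1, 2, 2)])

p_long  = long_stream(torch.randn(4, 1, 7, 32, 32))    # 32x32x7 cuboids
p_short = short_stream(torch.randn(4, 1, 3, 32, 32))   # 32x32x3 cuboids
print(p_long.shape, p_short.shape)                     # both torch.Size([4, 2])
```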
Ⅳ. Experiment
In this section, we evaluate the effectiveness of our proposed method on two benchmark datasets, UCSD [3] and UMN [25], and compare its performance with state-of-the-art video anomaly detection methods. Some experimental details are also given in this section.
Fig. 6 Example of abnormal detection and location in UCSD Ped1. Green rectangles show the motion areas while red rectangles denote anomalies detected by our method.
In UCSD Ped1, 80413 normal spatial-temporal interest blocks and 6297 blocks containing abnormal behavior are extracted and used for model training; at the test stage, about 129000 cuboids are extracted for anomaly detection. In UCSD Ped2, about 38637 normal spatial-temporal interest cuboids and 2600 anomalous cuboids are used for model training; at the test stage, about 43000 cuboids are extracted. The UMN dataset is a continuous video sequence, so in order to ensure reliable experimental results and efficient anomaly detection the video sequence is sampled separately; from UMN we extract about 47854 normal spatial-temporal interest blocks and 5253 anomalous blocks for training the 3D neural network model. Considering the problem of data imbalance in anomaly detection, we tried to keep the probability distributions of normal and abnormal samples consistent between training and testing, in order to achieve better anomaly detection results.
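The paper does not detail how the normal/abnormal mix is controlled during training, so the following sketch merely shows one common way to handle such an imbalance in PyTorch, re-weighting the sampler by inverse class frequency; the label counts mirror the Ped1 numbers above, while the index tensor stands in for the actual cuboids.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Illustrative label vector mirroring the Ped1 imbalance (80413 normal, 6297 abnormal).
labels = torch.cat([torch.zeros(80413, dtype=torch.long),
                    torch.ones(6297, dtype=torch.long)])
indices = torch.arange(len(labels))          # stand-ins for the actual cuboid tensors

# Weight each sample by the inverse frequency of its class so that mini-batches
# contain roughly equal numbers of normal and abnormal cuboids.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(TensorDataset(indices, labels), batch_size=64, sampler=sampler)
idx_batch, label_batch = next(iter(loader))
print(label_batch.float().mean())            # close to 0.5 on average
```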
A. Datasets
In this part, we introduce the challenging video anomaly detection datasets used in the experiments: the UCSD dataset and the UMN dataset. The UCSD dataset is recorded by a static camera at 10 fps and contains two subsets of different scenes, Ped1 and Ped2, both captured by a fixed elevated camera overlooking pedestrian walkways. The crowd density in the dataset changes over time. UCSD is a challenging, low-resolution dataset for abnormal behavior detection: the training samples contain only normal behaviors, and abnormal behaviors occur only in the test videos. A test frame may contain one, several, or no abnormal behaviors. Abnormal events in the UCSD dataset include bicycles, skaters, small carts, wheelchairs, etc., some of which are difficult for humans to notice in the crowded scene. Ped1 contains 34 normal video sequences for training and 36 abnormal video sequences for testing, with a certain perspective distortion. Each sequence has 200 frames, and the resolution of each frame is 158×238. The crowd moves in two directions: toward the camera and away from the camera. Ped2 depicts horizontal movement of the crowd and includes 16 normal training sequences and 12 abnormal test sequences. The length of each sequence varies from 120 to 170 frames, and the resolution of each frame is 320×240. The low resolution of objects in Ped1 and the occlusions in Ped2 make behavior identification difficult, so UCSD is a challenging local-anomaly dataset for congested scenarios. Examples of anomaly detection in UCSD Ped1 and Ped2 are shown in Fig. 6 and Fig. 7.
The anomalies in the UMN dataset are global anomalies, such as crowd panic and fleeing. The total duration of the UMN dataset is 257 seconds at 30 frames per second, and the size of each frame is 320×240. The UMN dataset contains 11 video clips recorded in three different scenarios: lawn, indoor and plaza. In each clip, people initially walk or wander in the scene and then begin to run in all directions from a certain frame. An example of anomaly detection on the UMN dataset is shown in Fig. 9.
Fig. 7 Example of abnormal detection and location in UCSD Ped2. Green rectangles show the motion areas while red rectangles denote anomalies detected by our method.
B. Evaluation Methodology
To evaluate the performance of our method, we not only use the Receiver Operating Characteristic (ROC) curve for comparison, but also adopt the AUC (area under the curve) and the EER (equal error rate) under different criteria to measure anomaly detection accuracy. The EER is the point on the ROC curve where the false positive rate (FPR, normal behavior considered abnormal) equals the false negative rate (FNR, abnormal behavior identified as normal); equivalently, it is the intersection of the ROC curve with the anti-diagonal of the ROC space (the line connecting [0, 1] and [1, 0]). For a good anomaly detection algorithm, the EER should be as small as possible and the AUC as large as possible. The ROC-based evaluation has three different levels: the frame-level criterion, the pixel-level criterion and the dual-pixel-level criterion. The ROC curves of our method under these criteria are shown in Fig. 8. The results of our method on UCSD Ped1 show no obvious degradation across the different levels, which indicates that the proposed method achieves robust anomaly detection performance.
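A small sketch of how frame-level AUC and EER can be computed from per-frame anomaly scores with scikit-learn; the toy scores and labels are invented for illustration, and the EER is taken at the ROC point closest to FPR = FNR.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_metrics(scores, ground_truth):
    """Compute AUC and EER from per-frame anomaly scores.

    scores: higher values mean 'more abnormal'; ground_truth: 1 = abnormal frame.
    The EER is read off the ROC curve where FPR equals FNR (= 1 - TPR).
    """
    fpr, tpr, _ = roc_curve(ground_truth, scores)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fnr - fpr))]   # point closest to FPR == FNR
    return auc(fpr, tpr), eer

# Toy example with made-up scores for ten frames.
gt     = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
scores = [0.1, 0.2, 0.15, 0.9, 0.6, 0.3, 0.05, 0.8, 0.4, 0.2]
print(frame_level_metrics(scores, gt))
```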
Table 1 EER and AUC for frame-level and pixel-level comparisons on UCSD Ped1

Method                   | Frame-level EER | Frame-level AUC | Pixel-level EER | Pixel-level AUC
Cascade DNN [4]          | 9.1%            | -               | 15.8%           | -
Spatial-temporal CNN [9] | 24%             | 0.85            | -               | 0.87
GMM-MRF [14]             | 14.9%           | 0.908           | -               | -
SR [2]                   | 20%             | 0.487           | -               | -
OCELM [8]                | 18%             | 0.885           | 33%             | 0.689
SLT [16]                 | 18.33%          | -               | -               | 0.6025
CFS [7]                  | 21.15%          | -               | 39.7%           | -
GAN [5]                  | 8%              | 0.974           | 35%             | 0.703
Binary Feature [6]       | 25.34%          | -               | 48.1%           | -
OADC-SA [27]             | -               | 0.91            | 9%              | 0.75
Tan Xiao et al. [28]     | 10%             | -               | 16%             | -
Ours                     | 6.29%           | 0.9673          | 9.22%           | 0.9527
Fig. 8 ROC curves for the UCSD dataset: (a) Ped1, (b) Ped2.
Fig. 9 Example of abnormal detection and location in UMN. Red rectangles show the anomalies detected by our method.
Under the frame-level criterion, a frame is recognized as abnormal if at least one anomaly is detected in it. This standard only considers the temporal accuracy of the abnormal behavior and does not consider whether its spatial location is correct. Therefore, false-positive coincidences may occur when using this criterion for performance evaluation; that is, an abnormal frame may be coincidentally identified as anomalous at the frame level when the true anomaly is missed but some normal behavior in the frame is misjudged as abnormal.
Table 2 EER and AUC for frame-level and pixel-level comparisons on UCSD Ped2

Method                   | Frame-level EER | Frame-level AUC | Pixel-level EER | Pixel-level AUC
Cascade DNN [4]          | 8.2%            | -               | 19%             | -
Spatial-temporal CNN [9] | 24.4%           | 0.86            | -               | 0.88
GMM-MRF [14]             | 4.89%           | 0.979           | -               | -
Mohammad Sabokrou [26]   | 19%             | -               | 24%             | -
SLT [16]                 | 12.77%          | -               | -               | 0.7631
CFS [7]                  | 19.2%           | -               | 36.6%           | -
OCELM [8]                | 12%             | 0.913           | 17%             | 0.801
GAN [5]                  | 14%             | 0.935           | -               | -
OADC-SA [27]             | -               | 0.925           | -               | -
Tan Xiao et al. [28]     | 10%             | -               | 17%             | -
Binary Feature [6]       | 21.2%           | -               | 38.4%           | -
iHOT [18]                | 8.59%           | -               | -               | -
Ours                     | 5.59%           | 0.9637          | 11.8%           | 0.9350
Under the pixel-level criterion, if over 40% of the abnormal region is detected correctly, the detection is considered successful and the frame is identified as an abnormal frame. The evaluation result based on this more rigorous pixel-level criterion is more reliable, because it requires the anomaly to be localized in both the spatial and temporal domains. Some researchers use a rate of detection (RD) instead of the EER to evaluate pixel-level results; a larger detection rate denotes better performance. In fact, when multiple abnormal regions are detected in one frame, it is possible that only one of them is annotated as an anomaly in the available ground truth. Such a lucky detection, in which several regions are misjudged as abnormal, is still counted as a positive detection under both the frame-level and the pixel-level criteria. Therefore, a more rigorous criterion, the dual-pixel-level evaluation criterion, is introduced to evaluate the performance of anomalous behavior detection [26]. At the dual pixel level (DPL), a frame is regarded as anomalous if: (1) the frame satisfies the pixel-level criterion; and (2) at least β% (e.g., 10%) of the pixels detected as anomalous lie within the anomaly ground truth. This criterion not only requires accurate detection and localization of the anomaly in the spatial and temporal domains, but is also sensitive to false-positive errors. Therefore, DPL is a more reliable criterion for reflecting the performance of an abnormal behavior detection algorithm.
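The per-frame decisions under the pixel-level and dual-pixel-level criteria can be computed from binary masks as in the sketch below; the 40% coverage and β = 5% precision thresholds follow the text, while the helper names and the toy masks are ours.

```python
import numpy as np

def pixel_level_hit(pred_mask, gt_mask, coverage=0.40):
    """Pixel-level criterion: a detection succeeds if at least 40% of the
    ground-truth abnormal pixels are covered by the predicted abnormal region."""
    gt_area = gt_mask.sum()
    if gt_area == 0:
        return False
    return (pred_mask & gt_mask).sum() / gt_area >= coverage

def dual_pixel_level_hit(pred_mask, gt_mask, coverage=0.40, beta=0.05):
    """Dual-pixel criterion: the pixel-level rule must hold AND at least beta
    (e.g., 5%) of the predicted abnormal pixels must lie inside the ground truth."""
    pred_area = pred_mask.sum()
    if pred_area == 0:
        return False
    precision = (pred_mask & gt_mask).sum() / pred_area
    return pixel_level_hit(pred_mask, gt_mask, coverage) and precision >= beta

# Toy 6x6 masks: the prediction overlaps half of the ground-truth region.
gt = np.zeros((6, 6), dtype=bool);   gt[2:4, 2:4] = True
pred = np.zeros((6, 6), dtype=bool); pred[2:4, 3:5] = True
print(pixel_level_hit(pred, gt), dual_pixel_level_hit(pred, gt))  # True True
```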
C. Experimental results
In this part, we compare our method with several recent high-performance methods. The results show that the proposed method not only detects the frames containing abnormal behavior in the video sequence, but also locates the abnormal area accurately in the crowded scene.
Table 1 compares our performance with other leading approaches at the frame level and pixel level on UCSD Ped1. Our method outperforms most of the state-of-the-art methods at both levels. At the frame level, the EER of our approach is better than those of GAN [5] and Cascade DNN [4] by 1.71% and 2.81%, respectively. The frame-level AUC of the presented method is also better than the other methods and close to that of GAN [5]. An example of abnormal detection and location in UCSD Ped1 is shown in Fig. 6, where green rectangles show the motion areas and red rectangles denote the anomalies detected by our method.
To evaluate our method on UCSD Ped2 at the frame level and pixel level, a comparison of AUC and EER with existing methods is provided in Table 2. It confirms that the anomaly detection method based on the parallel 3D CNN performs well compared to other approaches. At the frame level our method is comparable to the other anomaly detection methods, and is behind the best previous method based on GMM-MRF [14] by 0.7% in EER and 0.0153 in AUC. As shown in Table 2, the EER of our method at the pixel level is 11.8%, which is 5.2% better than the previous method based on OCELM [8]. The AUC at the pixel level is 0.9350, which is 0.0287 lower than that at the frame level. An example of abnormal detection and location in UCSD Ped2 is shown in Fig. 7, where green rectangles show the motion areas and red rectangles denote the anomalies detected by our method. The performance on the UCSD dataset demonstrates that our algorithm outperforms most of the other methods under both the frame-level and the pixel-level measures.
We also use the DPL measure to analyze the accuracy of anomaly localization. Most state-of-the-art methods did not report localization performance with respect to the DPL measure [26], so we compare our method only with [26] and [4]. The dual-pixel EER (for β = 5%) of our method on Ped1 and Ped2 is 12.37% and 21.43%, respectively. As shown in Table 3, we are better than [4] by 12.13% and 2.37% on Ped1 and Ped2, respectively. The ROC curves of our method on the UCSD dataset under the different criteria are shown in Fig. 8. It is evident that our method performs well for abnormal behavior detection, and the reasons why the results on Ped1 are better than those on Ped2 are as follows: (1) the scale change in Ped1 is relatively larger, and the varied-size cell structure used in this paper is better suited to this kind of situation; (2) in Ped2, because of the spatial-temporal cuboid-of-interest detection method and special locations such as the edge of the lawn, video blocks in the test set with normal behavior but no corresponding motion information in the training set are misjudged as abnormal; (3) the CNN gives relatively accurate classification results for the more abundant anomaly types in Ped1.
We also compare the performance of our method with state-of-the-art algorithms on the UMN benchmark. EER and AUC at the frame level are adopted to evaluate our method on the UMN dataset, and the results are shown in Table 4, where we compare against outstanding methods proposed in recent years. Previous methods already perform almost perfectly on this dataset, and the AUC of our method is comparable with those high-performance results. The EER of our approach is better than Cascade DNN [4] and OCELM [8] by 0.85% and 1.45%, respectively. An example of abnormal detection and location in UMN is shown in Fig. 9, where red rectangles show the anomalies detected by our method.
Table 3 EER for dual-pixel-level comparisons on UCSD Ped1 and Ped2

Method                 | UCSD Ped1 | UCSD Ped2
Mohammad Sabokrou [26] | -         | 27.50%
Cascade DNN [4]        | 24.50%    | 23.80%
Ours                   | 12.37%    | 21.43%
Table 4 EER and AUC comparisons on UMN

Method                   | EER   | AUC
Cascade DNN [4]          | 2.5%  | 0.9960
Spatial-temporal CNN [9] | -     | 0.9963
CFS [7]                  | -     | 0.8830
SR [2]                   | -     | 0.9700
OCELM [8]                | 3.1%  | 0.9900
GAN [5]                  | -     | 0.9900
OADC-SA [27]             | -     | 0.9967
iHOT [18]                | -     | 0.9940
Ours                     | 1.65% | 0.9974
Ⅴ. Conclusion
In this paper, we introduce a novel parallel three-dimensional convolutional neural network method for abnormal behavior detection and location in video surveillance. We adopt optical flow combined with a varied-size cell structure to segment the foreground and remove the background information, which not only effectively handles the uneven distribution of information in video frames, but also yields spatial-temporal interest blocks containing moving objects. After resizing the interest blocks to 32×32×3 and 32×32×7, a parallel three-dimensional convolutional neural network is used to describe their spatial and temporal features and classify the anomalous blocks. Training the parallel three-dimensional convolutional neural networks with long and short frame lengths for the same space-time block not only ensures that the complete spatial and temporal features of the interest blocks are fully learned, but also removes information unrelated to the main moving target's actions. Finally, the anomaly detection result is obtained by fusing the discrimination results of the interest blocks with different frame lengths, and the detection and location of anomalous objects in surveillance video are realized. The experimental results show that the proposed anomaly detection algorithm achieves promising performance, especially under the pixel-level and dual-pixel-level criteria.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant No. 61071199, the Natural Science Foundation of Hebei Province of China under Grant No. F2016203422, and the Hebei Key Laboratory of Information Transmission and Signal Processing. The authors declare that there is no conflict of interest regarding the publication of this paper.
HIGHLIGHTS
1. A novel parallel spatial-temporal convolutional neural network method is proposed for anomaly detection.
2. More behavior information is extracted per cuboid by optical flow and a varied-size cell structure.
3. Each behavior is described over different frame lengths by three-dimensional convolutional neural networks.
References
[1] A.B. Mabrouk, E. Zagrouba, Abnormal behavior recognition for intelligent video surveillance systems: A review. Expert Systems with Applications, 91 (2018) 480-491.
[2] P. Liu, Y. Tao, W. Zhao, X.L. Tang, Abnormal crowd motion detection using double sparse representation. Neurocomputing, 269 (2017) 3-12.
[3] University of California, San Diego. UCSD Anomaly Detection Dataset [EB/OL]. http://www.svcl.ucsd.edu/project-s/anomaly/dataset.html, 2010-10-10.
[4] M. Sabokrou, M. Fayyaz, M. Fathy, R. Klette, Deep-Cascade: Cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing, 26 (2017) 1992-2004.
[5] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, Abnormal event detection in video using generative adversarial nets, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 2017, pp. 1577-1281.
[6] R. Leyva, V. Sanchez, C.T. Li, Abnormal event detection in videos using binary features, in: Proceedings of the IEEE Telecommunications and Signal Processing Conference (TSP), Barcelona, Spain, 2017, pp. 621-625.
[7] R. Leyva, V. Sanchez, C.T. Li, Video anomaly detection with compact feature sets for online performance. IEEE Transactions on Image Processing, 26 (2017) 3463-3478.
[8] S.Q. Wang, E. Zhu, J.P. Yin, F. Porikli, Video anomaly detection and localization by local motion based joint video representation and OCELM. Neurocomputing, 277 (2018) 161-175.
[9] S.F. Zhou, W. Shen, D. Zeng, M. Fang, Y.W. Wei, Z.J. Zhang, Spatial-temporal convolutional neural networks for anomaly detection and localization in crowded scenes. Signal Processing: Image Communication, 47 (2016) 358-368.
[10] V. Mahadevan, W.X. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 2010, pp. 1975-1981.
[11] W.X. Li, V. Mahadevan, N. Vasconcelos, Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36 (2014) 18-32.
[12] A.S. Rao, J. Gubbi, S. Rajasegarar, et al., Detection of anomalous crowd behaviour using hyperspherical clustering, in: Proceedings of the International Conference on Digital Image Computing: Techniques and Applications (DICTA), Wollongong, NSW, Australia, 2014, pp. 1-8.
[13] L. Xu, C. Gong, J. Yang, et al., Violent video detection based on MoSIFT feature and sparse coding, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014, pp. 3538-3542.
[14] N. Hajananth, C. Fookes, S. Denman, S. Sridharan, An MRF based abnormal event detection approach using motion and appearance features, in: Proceedings of Advanced Video and Signal Based Surveillance (AVSS), Seoul, South Korea, 2014, pp. 343-348.
[15] T. Wang, H. Snoussi, Detection of abnormal visual events via global optical flow orientation histogram. IEEE Transactions on Information Forensics and Security, 9 (2014) 988-998.
[16] S. Biswas, R.V. Babu, Anomaly detection via short local trajectories. Neurocomputing, 242 (2017) 63-72.
[17] C. Li, Z.J. Han, Q.X. Ye, J.B. Jiao, Visual abnormal behavior detection based on trajectory sparse reconstruction analysis. Neurocomputing, 19 (2013) 94-100.
[18] H. Mousavi, M. Nabi, H.K. Galoogahi, A. Perina, V. Murino, Abnormality detection with improved histogram of oriented tracklets, in: Proceedings of Image Analysis and Processing (ICIAP), Genoa, Italy, 2015, pp. 722-732.
[19] S. Coşar, G. Donatiello, V. Bogorny, C. Garate, L.O. Alvares, F. Brémond, Toward abnormal trajectory and event detection in video surveillance. IEEE Transactions on Circuits and Systems for Video Technology, 27 (2017) 683-695.
[20] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2014, pp. 568-576.
[21] C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 1933-1941.
[22] L. Wang, Y. Xiong, Z. Wang, et al., Temporal segment networks: Towards good practices for deep action recognition, in: Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 2016, pp. 20-36.
[23] Z. Lan, Y. Zhu, A.G. Hauptmann, et al., Deep local video feature for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 1219-1225.
[24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4489-4497.
[25] University of Minnesota. UMN abnormal events detection dataset [EB/OL]. http://mha.cs.umn.deu/proj_events.shtml, 2009-4-12.
[26] M. Sabokrou, M. Fathy, M. Hoseini, R. Klette, Real-time anomaly detection and localization in crowded scenes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), MA, USA, 2015, pp. 56-62.
[27] Y. Yuan, J. Fang, Q. Wang, Online anomaly detection in crowd scenes via structure analysis. IEEE Transactions on Cybernetics, 45 (2015) 548-561.
[28] T. Xiao, C. Zhang, H. Zha, Learning to detect anomalies in surveillance video. IEEE Signal Processing Letters, 22 (2015) 1477-1481.
Author introduction
Zheng-ping HU, born in 1970 in Yilong, Sichuan Province. Ph.D, professor (Ph.D. advisor). He is a senior member of the Chinese Institute of Electronics and Computer Society CSIG. His research interests include deep learning and video pattern recognition.
Le Zhang, born in 1994 in Xinxiang, Henan Province. Master candidate at the School of Information Science and Engineering of Yanshan University. Her research interests include video understanding.
Shu-fang Li is a lecturer in the Department of Information Engineering of Hebei University of Environmental Engineering, China. Her current research interests include image processing and pattern recognition. Li received her M.E. degree from East China University of Technology, China, in 2007. She is currently pursuing the Ph.D. degree in the School of Information Science and Engineering, Yanshan University, China.
De-gang Sun, born in 1978 in Liaocheng, Shandong Province. Master, senior engineer, and assistant dean at the School of Electronic Information Engineering of Shandong Huayu Institute of Technology. His research interests include software engineering.