
Highlights
• A novel framework is proposed for anomaly detection in videos.
• The innovation lies in the combination of prediction and reconstruction methods.
• This work is more robust to noise and suitable for real-world surveillance videos.
• This work outperforms both prediction (baseline) and reconstruction approaches.


Pattern Recognition Letters journal homepage: www.elsevier.com

Integrating Prediction and Reconstruction for Anomaly Detection

Yao Tang, Lin Zhao∗∗, Shanshan Zhang, Chen Gong, Guangyu Li, Jian Yang∗∗

PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China

∗∗ Corresponding authors. E-mail: [email protected] (Lin Zhao), [email protected] (Jian Yang)

Keywords: Anomaly Detection, Reconstruction, Future Frame Prediction

Abstract. Anomaly detection in videos refers to identifying events that rarely happen or should not happen in a certain context. Among existing methods, the ideas of reconstruction and future frame prediction are predominant for detecting anomalies. Reconstruction methods try to minimize the reconstruction errors of training data, but cannot guarantee large reconstruction errors for abnormal events. Future frame prediction methods follow the assumption that normal events are predictable while abnormal ones are unpredictable. However, their results may drop rapidly since prediction is not robust to the noise in real-world surveillance videos. In this paper, we propose an approach that combines the advantages and balances the disadvantages of these two methods. An end-to-end network is designed to conduct future frame prediction and reconstruction sequentially. Future frame prediction makes the reconstruction errors large enough to facilitate the identification of abnormal events, while reconstruction helps enhance the predicted future frames from normal events. Specifically, we connect two U-Net blocks in the generator: one block works in the form of frame prediction, and the other tries to reconstruct the frames generated by the former block. Experiments over several benchmark datasets demonstrate the superiority of our method over previous state-of-the-art approaches, while running in real time at 30 frames per second.

1. Introduction

Anomaly detection is attracting increasing attention due to its application in video surveillance and the growing demand for social security. It is a challenging task in computer vision, as the definition of anomalies heavily depends on the context. But we can generally agree that anomalies should be unexpected events which occur less often than normal events (Chandola et al., 2009). Another difficulty is that the datasets are highly biased towards the normal class due to the insufficient number of abnormal samples, so only the familiar class is available during training in most cases. Many efforts have been made to solve these problems.

Many approaches adopt the idea of reconstruction (Cong et al., 2011; Lu et al., 2013; Hasan et al., 2016; Luo et al., 2017b), which assume that a frame can be labeled as an anomaly if it has a high reconstruction error during the testing period. But these methods cannot always guarantee a large reconstruction error for anomalies, which means that abnormal frames could be reconstructed with high quality. The reasons are as follows: i) deep neural networks nowadays have strong learning capacity; ii) anomalous events only occupy a small portion of image pixels in a frame; iii) reconstruction methods detect anomalies regardless of context information, since the generated frames are reconstructed from themselves. Recently, with the development of Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) and video prediction (Chen et al., 2017; Mathieu et al., 2015), Liu et al. (2018) propose a method which distinguishes normal and abnormal frames by comparing them with their predicted versions instead of reconstructing training data. It can overcome the limitations of reconstruction errors, expanding the gap between normal and abnormal frames.

Fig. 1. The pipeline of our proposed method. Four stacked frames are fed into the first U-Net block, which works in the form of frame prediction. The other block tries to reconstruct the frames generated by the former block. The discriminator tries to distinguish the generated frame from the ground truth, and outputs scores to judge whether a frame contains anomalies or not.

The reasons are as follows. For reconstruction methods, take a person running on a pedestrian street (an abnormal event) as an example: once a model is familiar with the appearance features of both the background and persons during the training period, it can easily reconstruct this abnormal scene from the input frame itself. Prediction methods, in contrast, have to use previous information and try to predict what will happen next. In the pedestrian street scene, the model learns the motion features of walking persons from the training data. Once stacked frames of a running person are fed to the model, it can only predict a person who keeps walking, which guarantees a large gap between the predicted frame and the ground-truth frame. However, this method is quite sensitive to noise, and possible reasons are as follows: i) optical flow is added as a motion constraint between the generated frame and the ground truth, and it is sensitive to varying illumination; ii) it adopts the Peak Signal to Noise Ratio (PSNR) as the score to judge the extent of abnormality, and as a standard for estimating image quality, PSNR is not always consistent with human perception (Gao et al., 2018; Gao and Yu, 2016); its value drops rapidly once images contain noise, causing more normal frames to be judged as abnormal; iii) the prediction method relies heavily on previous information, so the detection results are sensitive to any changes in former frames.

In this paper, we propose a framework which combines the advantages of prediction and reconstruction. Future frame prediction enlarges the reconstruction errors of abnormal events, making anomalies more distinguishable. Meanwhile, reconstruction enhances the predicted future frames from normal events, which guarantees robustness to noise. Thus we integrate these two methods in a generator, which connects two U-Net (Ronneberger et al., 2015) blocks in series. The first block works in the form of frame prediction, generating the intermediate frame based on the stacked input images. Then the second block reconstructs the future frame from the intermediate one. A discriminator tries to differentiate the generated frame and the ground truth, minimizing the difference between these two images. The pipeline of our method is illustrated in Fig. 1.

Compared to previous work, we summarize our contributions as follows: i) we propose a framework which unifies the prediction and reconstruction methods. The intermediate frame gains information about future frames from the first U-Net block and delivers it to the next, reconstructing block. In this way, we overcome the limitations of reconstruction errors, making normal and abnormal events more distinguishable. To the best of our knowledge, this is the first work that integrates these two methods. ii) Our framework is more robust to noise, since reconstruction helps enhance the predicted future frames from normal events. Further, extensive experiments on several datasets show the superiority of our method.

2. Related Work

Among existing methods, the ideas of reconstruction and prediction achieve good results in anomaly detection.

Reconstruction Methods. Early reconstruction work based on hand-crafted features (Cheng et al., 2015; Cong et al., 2011; Dutta and Banerjee, 2015) usually learns a dictionary that reconstructs normal events with small reconstruction errors, and labels the events not linearly represented by the dictionary (i.e., with high reconstruction errors) as abnormal. For example, Lu et al. (2013) propose to discard the sparse constraint and learn multiple dictionaries to encode normal scale-invariant patches in order to accelerate the training and testing phases, since optimizing the sparse coefficients is usually time-consuming. Deep learning has achieved great success in many fields, and it has also made progress in anomaly detection by adopting deep features for reconstruction. For instance, motivated by the strong capability of Convolutional Neural Networks (CNN) to learn spatial features, a deep convolutional Auto-Encoder (AE) is trained to reconstruct an input sequence of frames (Hasan et al., 2016). Other deep methods include the spatiotemporal AE (Chong and Tay, 2017), the 3D Conv-net AE (Zhao et al., 2017), the Temporally-coherent Sparse Coding Stacked RNN (Luo et al., 2017b) and the ConvLSTM-AE (Luo et al., 2017a).

Prediction Methods. Predictive models aim to model the future output frame as a function of several past frames. Recently, video frame prediction has been growing rapidly due to its widespread application in autonomous driving, video comprehension and so on. For example, a recurrent Auto-Encoder using an LSTM that models temporal dependence between patches from a sequence of input frames is used to detect video forgery (D'Avino et al., 2017). With the great improvement in video prediction (Mathieu et al., 2015), some work adopts the idea of prediction for anomaly detection, assuming that normal events are predictable while abnormal ones are unpredictable. For instance, Liu et al. (2018) propose a method which predicts the future frame with a U-Net architecture. Medel and Savakis (2016) use the ConvLSTM model as a unit within the composite LSTM model (Srivastava et al., 2015) following an encoder-decoder, with a branch for reconstruction and another for prediction. A convolutional feature representation is fed into an LSTM model to predict the latent space representation, and its prediction error is used to evaluate anomalies in a robotics application (Munawar et al., 2017).

Reconstruction methods cannot guarantee a large reconstruction error for anomalies (the reasons are mentioned in the Introduction), so it is difficult to set an appropriate threshold on reconstruction errors to judge whether a frame contains abnormal events or not. Prediction methods can expand the gap between normal and abnormal frames. However, noise-sensitive motion constraints such as optical flow are added to the framework, leading to worse robustness than reconstruction approaches. Accordingly, we propose a framework that combines the advantages of prediction and reconstruction, as detailed in the following sections.

3. Our Method

Generally, anomaly detection in videos consists of two tasks: frame-level detection and pixel-level detection. The former aims to label the frames with anomalies as abnormal, while the latter tries to mark the locations of the pixels containing an abnormal event in a frame. Our framework is flexible enough to conduct both tasks.

3.1. Generator

We connect two U-Net blocks in series as the generator. U-Net (Ronneberger et al., 2015) is a convolutional neural network originally developed for biomedical image segmentation. To avoid the problems of gradient vanishing and information imbalance, U-Net adds a shortcut between a high-level layer and a low-level layer with the same resolution. We slightly modify the architecture and adjust the number of layers from 4 to 5. The kernel sizes are set to 3×3 for convolution and deconvolution layers, and 2×2 for max pooling layers. The details are illustrated in Fig. 3. As data preprocessing, all frames are resized to 256×256 and pixel values are normalized to [-1,1]. In addition, we have tried using residual blocks instead of standard convolutions in U-Net, but this does not gain better results. We think the reasons are that the network we use is rather shallow and the amount of data is quite small, so when residual blocks deepen the network, it cannot be fully trained.

Fig. 3. The architecture of U-Net block. The images of the same layer are equal in resolution.
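To make the generator structure concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' released code) of two U-Net blocks connected in series, each with four poolings plus a bottleneck and the channel widths of Fig. 3. The 12-channel input assumes the four input frames are stacked along the channel axis, and the ReLU/tanh activation choices are assumptions; outputs are squashed to [-1, 1] to match the preprocessing range.

```python
# Sketch of the two-block generator (assumptions noted in the lead-in).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolutions per level, as in the standard U-Net
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UNet(nn.Module):
    def __init__(self, in_ch, out_ch, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.encoders = nn.ModuleList()
        prev = in_ch
        for w in widths[:-1]:                        # four encoder levels
            self.encoders.append(conv_block(prev, w))
            prev = w
        self.pool = nn.MaxPool2d(2)                  # 2x2 max pooling
        self.bottleneck = conv_block(widths[-2], widths[-1])
        self.upconvs = nn.ModuleList()
        self.decoders = nn.ModuleList()
        for w_skip, w_up in zip(reversed(widths[:-1]), reversed(widths[1:])):
            self.upconvs.append(nn.ConvTranspose2d(w_up, w_skip, 2, stride=2))
            self.decoders.append(conv_block(w_skip * 2, w_skip))
        self.head = nn.Conv2d(widths[0], out_ch, 1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                          # shortcut to the same-resolution decoder level
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return torch.tanh(self.head(x))

class PredictReconstructGenerator(nn.Module):
    """First block predicts an intermediate frame from four stacked frames;
    the second block reconstructs the final future frame from it."""
    def __init__(self):
        super().__init__()
        self.predictor = UNet(in_ch=12, out_ch=3)
        self.reconstructor = UNet(in_ch=3, out_ch=3)

    def forward(self, stacked_frames):               # (B, 12, 256, 256)
        intermediate = self.predictor(stacked_frames)
        return self.reconstructor(intermediate), intermediate

# usage: out, mid = PredictReconstructGenerator()(torch.randn(1, 12, 256, 256))
```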

Prediction part. For the first U-Net block, which works in the form of prediction, the input is four stacked frames $I_t, I_{t+1}, I_{t+2}, I_{t+3}$, and the output is an intermediate result $\hat{I}_m$. The intermediate frame $\hat{I}_m$ contains the information of future frames and delivers it to the next block.

Reconstruction part. For the second block, we feed $\hat{I}_m$ to the network, which reconstructs the predicted frame $\hat{I}_{t+4}$ from the intermediate one. Our goal is to minimize the difference between $\hat{I}_{t+4}$ and its ground truth $I_{t+4}$ through some constraints.

3.2. Frame-level Discriminator

The frame-level discriminator is composed of several convolutional layers; it competes with the generator and tries to differentiate the ground truth and the generated frames. The details are shown in Fig. 2. The kernel sizes are set to 5×5 for each layer and we choose Leaky ReLU as the activation function. The discriminator distinguishes whether a frame is a real-world image or not, promoting the generator to create frames of high quality. The output of the discriminator is a scalar value, which is considered as a score to judge the extent of abnormality of a frame. A lower score means that the frame is more likely to contain abnormal events.

Fig. 2. The architecture of the frame-level discriminator, which outputs scores and determines whether its input contains abnormal events or not. Inputs: 256×256×3; Conv: 5×5, 64 filters, stride 2; Conv: 5×5, 128 filters, stride 2; Conv: 5×5, 256 filters, stride 2; Conv: 5×5, 512 filters, stride 2; fully-connected layer; outputs: scores.
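A minimal PyTorch sketch of the frame-level discriminator of Fig. 2 follows (our illustration): four 5×5 stride-2 convolutions with 64/128/256/512 filters and Leaky ReLU, then a fully-connected layer that emits one score per frame. The Leaky ReLU slope and the final sigmoid that squashes the score into [0, 1] are our assumptions.

```python
# Sketch of the frame-level discriminator described in Fig. 2 (assumptions in lead-in).
import torch
import torch.nn as nn

class FrameLevelDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (64, 128, 256, 512):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)        # 256x256 -> 16x16 feature map
        self.fc = nn.Linear(512 * 16 * 16, 1)          # fully-connected scoring layer

    def forward(self, frame):                          # frame: (B, 3, 256, 256)
        h = self.features(frame).flatten(1)
        return torch.sigmoid(self.fc(h))               # score in [0, 1]; lower = more abnormal

# usage: score = FrameLevelDiscriminator()(torch.randn(1, 3, 256, 256))
```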

3.3. Pixel-level Discriminator

For the pixel-level task, we follow the PatchGAN discriminator (Isola et al., 2017) to predict the broad locations of abnormal events. The main difference between a PatchGAN and a conventional GAN discriminator is that the latter maps an input image to a single scalar output in the range [0,1], indicating the probability of the image being real or fake, while a PatchGAN outputs a matrix with each element signifying whether its corresponding patch is real or fake. The details of our pixel-level discriminator are illustrated in Fig. 4. Each element of the output matrix maps to its corresponding patch in a frame, and its value is used to judge whether this patch contains anomalies or not. Here the output size of the pixel-level discriminator is set to 16×16, and we discuss the impact of the matrix size on detection results in the following sections.

Fig. 4. The architecture of the pixel-level discriminator, which outputs matrices and predicts the broad locations of abnormal events. Inputs: 256×256×3; Conv: 4×4, 64 filters, stride 2, Leaky ReLU; Conv: 4×4, 128 filters, stride 2, batch normalization, Leaky ReLU; Conv: 4×4, 256 filters, stride 2, batch normalization, Leaky ReLU; Conv: 4×4, 512 filters, stride 2, sigmoid; output: 16×16×1.

3.4. Constraints

In order to minimize the difference between a generated frame and its ground truth, we adopt intensity, gradient and temporal image difference as constraints. The intensity constraint compares the value of each pixel between two frames, ensuring that pixel values in RGB space are similar over the whole picture. The gradient constraint compares the gradients of pixel values at the same positions of the two images and sharpens the generated frames. These two constraints are based on appearance, so we additionally add an image difference penalty as a temporal loss, which tries to keep the image difference the same in both the generated frame series and the ground-truth series. We denote the generated frame by $\hat{I}$ and its corresponding ground truth by $I$. The intensity loss is defined as follows:

$$ L_{int}(I, \hat{I}) = \left\| I - \hat{I} \right\|_2^2 \tag{1} $$

Gradient loss can be expressed as follows:

$$ L_{gd}(I, \hat{I}) = \sum_{i,j} \left\| \, |I_{i,j} - I_{i-1,j}| - |\hat{I}_{i,j} - \hat{I}_{i-1,j}| \, \right\|_1 + \left\| \, |I_{i,j} - I_{i,j-1}| - |\hat{I}_{i,j} - \hat{I}_{i,j-1}| \, \right\|_1 \tag{2} $$

where $i, j$ denote the spatial indices of a video frame. It is worth mentioning that we have also tried an L1 loss for the intensity constraint, and it gives almost the same results as the L2 loss. The L1 loss generally generates sharper images than the L2 loss and may give better performance. Nonetheless, the gradient constraint is also adopted in our method, which sharpens the generated frames and reduces blurring effects. Thus, whether the L2 loss or the L1 loss is used for the intensity constraint does not make much difference. Further, we define the image difference loss as follows:

$$ L_{dif}(I, \hat{I}) = \left\| \, |I_{t+1} - I_t| - |\hat{I}_{t+1} - \hat{I}_t| \, \right\|_2^2 \tag{3} $$
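A short PyTorch sketch of the three constraints of Eqs. (1)–(3) follows (our illustration, not the authors' code). Tensors are (B, C, H, W); `gt` is the ground truth $I$ and `gen` is the generated frame $\hat{I}$; pixel averages are used in place of sums purely for scale convenience.

```python
# Sketch of the intensity, gradient and temporal-difference constraints.
import torch

def intensity_loss(gt, gen):
    # Eq. (1): squared L2 distance between pixel values (averaged over pixels)
    return torch.mean((gt - gen) ** 2)

def gradient_loss(gt, gen):
    # Eq. (2): L1 distance between absolute spatial gradients along both axes
    gt_dx = torch.abs(gt[:, :, :, 1:] - gt[:, :, :, :-1])
    gen_dx = torch.abs(gen[:, :, :, 1:] - gen[:, :, :, :-1])
    gt_dy = torch.abs(gt[:, :, 1:, :] - gt[:, :, :-1, :])
    gen_dy = torch.abs(gen[:, :, 1:, :] - gen[:, :, :-1, :])
    return torch.mean(torch.abs(gt_dx - gen_dx)) + torch.mean(torch.abs(gt_dy - gen_dy))

def difference_loss(gt_prev, gt_next, gen_prev, gen_next):
    # Eq. (3): keep the frame-to-frame change of the generated series close to
    # the change of the ground-truth series
    return torch.mean((torch.abs(gt_next - gt_prev) - torch.abs(gen_next - gen_prev)) ** 2)
```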

3.5. Training and Testing

Recently, GANs have shown great success in image and video generation (Brock et al., 2018; Yu et al., 2017; Karras et al., 2019; Clark et al., 2019). In GAN architectures, the discriminator tries to differentiate real-world images from fake images generated by the generator, promoting the generator to update itself. The generator tries to fool the discriminator with more realistic images, which in turn drives the discriminator to improve its discriminative ability. Thus, adversarial training is implemented in an alternating update manner.

Training the discriminator (D). If we set 0 as the label for fake images and 1 for real ones, D aims to label $I_t$ with class 1 and $\hat{I}_t$ with class 0. For the frame-level discriminator, $D(I)$ denotes the output scalar for frame $I$, while for the pixel-level discriminator it denotes the average of all elements in the output matrix for frame $I$. Then we can define the adversarial training loss of D:

$$ L^{D}_{adv}(I, \hat{I}) = \frac{1}{2}\big(D(I) - 1\big)^2 + \frac{1}{2}\big(D(\hat{I}) - 0\big)^2 \tag{4} $$

where $D(I), D(\hat{I}) \in [0, 1]$. When training D, the goal is to minimize the following objective function:

$$ L_D = L^{D}_{adv}(I_{t+1}, \hat{I}_{t+1}) \tag{5} $$

Training the generator (G). G aims to generate images which can be classified into class 1. The adversarial training loss of G is expressed as follows:

$$ L^{G}_{adv}(\hat{I}) = \frac{1}{2}\big(D(\hat{I}) - 1\big)^2 \tag{6} $$

Considering the constraints in Section 3.4, when training G we aim to minimize the following objective function:

$$ L_G = \alpha_{int} L_{int}(I_{t+1}, \hat{I}_{t+1}) + \alpha_{gd} L_{gd}(I_{t+1}, \hat{I}_{t+1}) + \alpha_{dif} L_{dif}(I, \hat{I}) + \alpha_{adv} L^{G}_{adv}(\hat{I}_{t+1}) \tag{7} $$
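A sketch of the alternating updates of Eqs. (4)–(7) is given below, reusing the generator, discriminator and loss sketches above (our illustration, not the authors' code). The Adam learning rates follow the values quoted in the text for the gray-scale datasets; the data handling and the way the previous generated frame is supplied are assumptions.

```python
# Sketch of one alternating D/G training step (assumptions in the lead-in).
import torch

alpha_int, alpha_gd, alpha_dif, alpha_adv = 1.0, 1.0, 2.0, 0.05
G = PredictReconstructGenerator()
D = FrameLevelDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-5)

def train_step(stacked, gt_prev, gt_next, prev_gen):
    # stacked: (B, 12, 256, 256) four input frames; gt_next: ground-truth target frame.
    # gt_prev / prev_gen: the previous ground-truth frame and the frame generated at
    # the previous step, needed by the temporal term of Eq. (3).
    gen, _ = G(stacked)

    # Discriminator update, Eqs. (4)-(5): real frames toward 1, generated toward 0.
    opt_d.zero_grad()
    d_loss = 0.5 * (D(gt_next) - 1).pow(2).mean() + 0.5 * D(gen.detach()).pow(2).mean()
    d_loss.backward()
    opt_d.step()

    # Generator update, Eq. (7): weighted intensity, gradient, difference and
    # adversarial (Eq. (6)) terms.
    opt_g.zero_grad()
    g_loss = (alpha_int * intensity_loss(gt_next, gen)
              + alpha_gd * gradient_loss(gt_next, gen)
              + alpha_dif * difference_loss(gt_prev, gt_next, prev_gen, gen)
              + alpha_adv * 0.5 * (D(gen) - 1).pow(2).mean())
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```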

The coefficients $\alpha_{int}, \alpha_{gd}, \alpha_{dif}, \alpha_{adv}$ are set to 1.0, 1.0, 2.0 and 0.05, and are slightly different for each dataset. The learning rates of the generator and the discriminator are 0.0001 and 0.00001 for the gray-scale datasets, and 0.0002 and 0.00002 for the color dataset. For parameter optimization, we adopt the Adam (Kingma and Ba, 2014) based stochastic gradient descent method and set the mini-batch size to 4. Since the first 4 frames in a video cannot be predicted in a prediction-based method, these frames are ignored. During the testing period, the frame-level discriminator outputs scalars in the range [0,1], which are taken as the scores of the testing frames. If a frame gets a low score, it is more likely to contain abnormal events. The pixel-level discriminator works in the same way as the frame-level one, except that it outputs matrices for the corresponding patches. We adopt several evaluation metrics to measure our test results, as detailed in Section 4.2.

4. Experiment results

In this section, the proposed method is evaluated on three publicly available benchmark datasets, including the CUHK Avenue dataset (Lu et al., 2013), the UCSD Pedestrian dataset (Mahadevan et al., 2010) and the ShanghaiTech dataset (Luo et al., 2017b). The performance of our method is compared with state-of-the-art approaches and analyzed in detail.

4.1. Datasets

UCSD dataset. The UCSD Pedestrian dataset is composed of two subsets, namely Ped1 and Ped2. Ped1 contains 34 training and 36 test videos with a frame resolution of 238 × 158 pixels. Ped2 contains 16 training and 12 test videos, and the frame resolution is 360 × 240 pixels. Ped1 and Ped2 have different viewing angles, and anomalies include bicycles, vehicles, skateboarders and wheelchairs crossing pedestrian areas. Some anomalies in Ped1 are shown in Fig. 5.

Fig. 5. Some abnormal events appearing in Ped1, including a cart, wheelchair, skater and biker (marked with red rectangles).

CUHK Avenue dataset. It contains 16 training and 21 test videos, with 15328 frames in the training set and 15324 frames in the test set. Anomalies include throwing objects, loitering and running. For each test frame, ground-truth locations of anomalies are provided using pixel-level masks. The resolution of each video frame is 360 × 640 pixels.

ShanghaiTech dataset. It is considered one of the most comprehensive and realistic datasets for video anomaly detection currently available, and includes 330 training videos and 107 testing ones. The test set contains 130 abnormal events annotated at pixel level. In total, there are 13 different scenes with various lighting conditions and camera angles. The resolution of each video frame is 480 × 856 pixels.

4.2. Evaluation Metric

We adopt the Area Under Curve (AUC) and the Equal Error Rate (EER) as evaluation metrics. AUC measures the entire two-dimensional area underneath the ROC (Receiver Operating Characteristic) curve from (0,0) to (1,1). The EER is obtained at the threshold where the false acceptance rate and the false rejection rate are equal; the common value at this point is the equal error rate. A higher AUC means better performance, while the opposite is true for EER. The relationship between AUC and EER is illustrated in Fig. 6. For frame-level detection, we can directly use the scores of the frames to compute AUC and EER values. As for pixel-level detection, we follow the evaluation procedure proposed by Mahadevan et al. (2010). To test localization accuracy, detections are compared to pixel-level ground-truth masks. If at least 40% of the truly anomalous pixels are detected, the frame is considered to be detected correctly.
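A minimal sketch, assuming scikit-learn, of how per-frame discriminator scores are turned into the AUC and EER metrics described above (our illustration). Labels use 1 for abnormal frames; since a lower discriminator score means "more abnormal", the anomaly score is taken as the negated discriminator output.

```python
# Sketch of frame-level AUC / EER computation from discriminator scores.
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_auc_eer(disc_scores, labels):
    anomaly_scores = -np.asarray(disc_scores, dtype=float)   # lower score => more abnormal
    fpr, tpr, _ = roc_curve(np.asarray(labels), anomaly_scores)
    roc_auc = auc(fpr, tpr)
    # EER: point on the ROC curve where the false positive rate equals
    # the false negative rate (1 - tpr)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]
    return roc_auc, eer

# usage: a, e = frame_level_auc_eer([0.9, 0.2, 0.8], [0, 1, 0])
```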

Fig. 6. The relationship between AUC and EER. (Plot of true positive rate versus false positive rate, showing the ROC curve, the AUC area underneath it, and the EER line with the EER value at their intersection.)

4.3. Frame-level Detection Results


Our proposed method for frame-level anomaly detection is compared with several state-of-the-art approaches. We set Liu et al. (2018) as our baseline, since it is a successful work that leverages video prediction for anomaly detection and achieves state-of-the-art performance. First, we adopt AUC as the evaluation metric; the results are reported in Table 1. We can see that our method outperforms all existing methods on these benchmark datasets. In particular, our method provides an absolute gain of 0.8% and 2.7% in AUC on Ped1 and Ped2 respectively compared with the baseline. It is worth noting that the baseline we compare with in Table 1 has its optical flow constraint removed.

Table 1. Frame-level AUC performance of different methods on several datasets

Method | UCSD Ped1 | UCSD Ped2 | CUHK Avenue | ShanghaiTech
Discriminative framework (Del Giorno et al., 2016) | N/A | N/A | 78.3% | N/A
Unmasking (Tudor Ionescu et al., 2017) | 68.4% | 82.2% | 80.6% | N/A
Conv-AE (Hasan et al., 2016) | 75.0% | 85.0% | 80.0% | N/A
ConvLSTM-AE (Luo et al., 2017a) | 75.5% | 88.1% | 77.0% | N/A
Stacked RNN (Luo et al., 2017b) | N/A | 92.2% | 81.7% | N/A
Hinami et al. (Hinami et al., 2017) | N/A | 92.2% | N/A | N/A
Baseline (Liu et al., 2018) (without optical flow) | 81.8% | 93.5% | 83.6% | 71.3%
Our method | 82.6% | 96.2% | 83.7% | 71.5%

For a more comprehensive comparison, we also add the optical flow constraint to our proposed approach and compare it with the baseline. Optical flow is a good estimator of motion, and a temporal loss is defined as the difference between the optical flow of the predicted frames and that of the ground truth. Following the baseline method (Liu et al., 2018), we use Flownet (Dosovitskiy et al., 2015), a CNN-based approach for optical flow estimation. Denoting Flownet by $f$, the loss in terms of optical flow can be expressed as follows:

$$ L_{op} = \left\| f(\hat{I}_{t+1}, I_t) - f(I_{t+1}, I_t) \right\|_1 \tag{8} $$

The final results are reported in Table 2. It is obvious that our method achieves better results than the baseline, no matter whether optical flow is adopted or not. The reason is that reconstruction is integrated into our method, enhancing the predicted frames. Moreover, we can see that the improvement in AUC is more obvious on Ped1 and Ped2 than on the other datasets. This may be because Ped1 and Ped2 have lower resolution and contain more noise, which brings the reconstruction part of our method into full play. Besides, we adopt EER as the evaluation metric and compare our method with Sabokrou et al. (2018) on Ped2, since only the EER value on Ped2 is given in that work. The results, illustrated in Table 3, demonstrate the effectiveness of our method.

Fig. 7. Frame-level AUC value of different methods when the variance of Gaussian noise changes (curves: our method, our method + optical flow, baseline, baseline without optical flow).

Compared to Sabokrou et al. (2018), which adopts the reconstruction idea and learns a one-class classifier for novelty detection, our method achieves better results due to the prediction part, which makes normal and abnormal events more distinguishable. Moreover, our proposed method without optical flow also outperforms the baseline with optical flow on the EER metric.

4.4. Pixel-level Detection Results

For pixel-level detection, the discriminator outputs a matrix X and each element of X maps to a patch in the corresponding frame. We can trace the receptive field of X_{ij} to see which input pixels it is sensitive to. Taking the matrix size 16×16 as an example, each 1×1 pixel in the output layer maps to a 46×46 patch in the input layer according to the structure of the discriminator we adopt. All the 46×46 patches overlap and are arranged in order. For visual effect, we modify the patches to become non-overlapping partitions, and one example of the detection results is shown in Fig. 9. In order to find the most suitable output matrix size, pixel-level discriminators of different sizes are adopted, and their performances are tested on the CUHK Avenue dataset because of its wide use in pixel-level detection. The results are illustrated in Table 6, which indicates that a 16×16 matrix is the best choice for follow-up experiments.

Since the ShanghaiTech dataset does not provide pixel-level ground-truth masks, we compare pixel-level performance on UCSD Ped1, UCSD Ped2 and the CUHK Avenue dataset. As the most recent works (Ionescu et al., 2019; Nguyen and Meunier, 2019) on anomaly detection do not provide pixel-level detection, some classical methods are selected for the comparison, and we directly adopt the performances reported in the corresponding papers. The comparison results are given in Table 5. It is obvious that our method surpasses these approaches and gains a clear superiority on all datasets. Moreover, we find that all listed methods achieve better results on the CUHK Avenue dataset than on the other datasets. The reasons lie in the evaluation procedure and the ground-truth masks. The ground-truth masks of the UCSD datasets are annotated pixel by pixel, while the Avenue dataset is annotated roughly with rectangles (Fig. 9(a)). As pixel-level detection results are also presented in the form of rectangles (Fig. 9(b)), it is easier for the Avenue dataset to reach the evaluation criterion, which demands that at least 40% of the truly anomalous pixels be detected in an abnormal frame.
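The 46×46 figure quoted above can be checked with the standard receptive-field recursion applied to the four 4×4, stride-2 convolutions of Fig. 4; the short sketch below (our illustration) reproduces the arithmetic.

```python
# Sketch verifying the 46x46 receptive field of each output element of the
# pixel-level discriminator in Fig. 4 (four 4x4 convolutions with stride 2).
# Standard recursion: r_out = r_in + (kernel - 1) * jump, then jump *= stride.
def receptive_field(layers):
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

print(receptive_field([(4, 2)] * 4))  # -> 46, so each output element sees a 46x46 input patch
```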

Table 2. Frame-level AUC performance of baseline and our method on several datasets

Method | UCSD Ped1 | UCSD Ped2 | CUHK Avenue | ShanghaiTech
Baseline (Liu et al., 2018) (with optical flow) | 83.1% | 95.4% | 84.9% | 72.8%
Our method + optical flow | 84.7% | 96.3% | 85.1% | 73.0%

Fig. 8. Frames with different variances of Gaussian noise: (a) 0 (original frame), (b) 0.09², (c) 0.16².

Table 3. Comparison of EER performance on Ped2

Method | EER
One-class classifier (Sabokrou et al., 2018) | 0.13
Baseline (Liu et al., 2018) (with optical flow) | 0.11
Our method without optical flow | 0.10

Table 4. Real-time performance on Ped2

Method | Computational time (frames per second)
Unmasking (Tudor Ionescu et al., 2017) | 20
Cheng et al. (Cheng et al., 2015) | 2
Baseline (Liu et al., 2018) | 32
Our method | 30

4.5. Robustness to Noise

In order to prove the robustness of our method to noise, we add Gaussian noise to the training and testing sets. The variance of the Gaussian noise changes from 0 to 0.09², which leads to a drop in AUC. Frames from the ShanghaiTech dataset with different variances of Gaussian noise are illustrated in Fig. 8; for visual effect, the variances chosen there are 0.09² and 0.16². Since the baseline method does not fulfill the pixel-level task, we merely compare frame-level detection results.
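As a concrete illustration of the perturbation used in this experiment (our sketch, not the authors' code), zero-mean Gaussian noise with the stated variance can be added to frames normalized to [-1, 1]; the clipping back to the valid range is our assumption.

```python
# Sketch of adding Gaussian noise of a given variance to normalized frames.
import numpy as np

def add_gaussian_noise(frames, variance=0.09 ** 2, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=np.sqrt(variance), size=frames.shape)
    return np.clip(frames + noise, -1.0, 1.0)

# usage: noisy = add_gaussian_noise(np.zeros((4, 256, 256, 3)))
```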

Fig. 9. Pixel-level detection on the CUHK Avenue dataset: (a) ground truth (with anomalies); (b) corresponding detection results. Red rectangles represent ground-truth masks and green ones represent detection results.

Table 5. Pixel-level AUC performance of different methods

Method | UCSD Ped1 | UCSD Ped2 | CUHK Avenue
Lu et al. (2013) | 63.8% | 91.8% | 92.9%
Zhang et al. (2016) | 77.0% | 90.0% | N/A
Del Giorno et al. (2016) | N/A | N/A | 91.0%
Tudor Ionescu et al. (2017) | 52.5% | N/A | 93.0%
Fan et al. (2018) | 71.4% | 78.2% | N/A
Our method | 78.4% | 93.1% | 93.6%

Table 6. Pixel-level performance on the CUHK Avenue dataset

Output matrix size | 4×4 | 8×8 | 16×16 | 32×32
AUC | 90.9% | 92.0% | 93.6% | 91.2%

As for the varying curve of AUC, we take Ped2 as an example due to its obvious change in AUC values under different noise intensities. The results are shown in Fig. 7. We can see that once optical flow is added, AUC values drop for both our method and the baseline, since optical flow is sensitive to illumination and noise. In particular, the baseline with optical flow performs better than the baseline without it when the variance of the Gaussian noise is low, but the opposite is true at higher variance. This is because the influence of the noise sensitivity of optical flow on the results gradually exceeds the improvement brought by optical flow. Moreover, it is obvious that our method without optical flow is more robust to noise and more suitable for real-world surveillance videos, due to the enhancement of normal frames through the reconstruction combined in our method.

4.6. Real-time Performance

It is worth mentioning that our method runs in real time at 30 frames per second during the test period using a single NVIDIA Tesla P40 GPU, which means that our approach can process videos online and be better applied in practice.

Here we also compare the computational time of several previous works with our method, tested on the UCSD Ped2 dataset (resolution: 360 × 240 pixels) with the same device. The results are illustrated in Table 4. We can see that our approach runs almost as fast as the baseline. The reason is that we add only one more U-Net block for reconstruction than the baseline, which consumes a small amount of computation.

5. Conclusion

In this paper, since prediction and reconstruction methods have their own advantages and disadvantages in anomaly detection, we propose an approach that integrates these two methods. In our framework, the generator is composed of two U-Net blocks connected in series. The first block gains information about future frames, enlarging the gap between normal frames and abnormal ones. The second block reconstructs the frames generated by the former block, enhancing the images and the robustness of our method. Experiments across several datasets show that the proposed method outperforms both prediction and reconstruction methods.

References

Brock, A., Donahue, J., Simonyan, K., 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
Chandola, V., Banerjee, A., Kumar, V., 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41, 15.
Chen, B., Wang, W., Wang, J., 2017. Video imagination from a single image with transformation generation, in: Proceedings of the Thematic Workshops of ACM Multimedia 2017, ACM. pp. 358–366.
Cheng, K.W., Chen, Y.T., Fang, W.H., 2015. Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2909–2917.
Chong, Y.S., Tay, Y.H., 2017. Abnormal event detection in videos using spatiotemporal autoencoder, in: International Symposium on Neural Networks, Springer. pp. 189–196.
Clark, A., Donahue, J., Simonyan, K., 2019. Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571.
Cong, Y., Yuan, J., Liu, J., 2011. Sparse reconstruction cost for abnormal event detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 3449–3456.
D'Avino, D., Cozzolino, D., Poggi, G., Verdoliva, L., 2017. Autoencoder with recurrent neural networks for video forgery detection. Electronic Imaging 2017, 92–99.
Del Giorno, A., Bagnell, J.A., Hebert, M., 2016. A discriminative framework for anomaly detection in large videos, in: European Conference on Computer Vision, Springer. pp. 334–349.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T., 2015. FlowNet: Learning optical flow with convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766.
Dutta, J.K., Banerjee, B., 2015. Online detection of abnormal events using incremental coding length, in: Twenty-Ninth AAAI Conference on Artificial Intelligence.
Fan, Y., Wen, G., Li, D., Qiu, S., Levine, M.D., 2018. Video anomaly detection and localization via Gaussian mixture fully convolutional variational autoencoder. arXiv preprint arXiv:1805.11223.
Gao, F., Yu, J., 2016. Biologically inspired image quality assessment. Signal Processing 124, 210–219.
Gao, F., Yu, J., Zhu, S., Huang, Q., Tian, Q., 2018. Blind image quality prediction by exploiting multi-level deep representations. Pattern Recognition 81, 432–442.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.
Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K., Davis, L.S., 2016. Learning temporal regularity in video sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 733–742.
Hinami, R., Mei, T., Satoh, S., 2017. Joint detection and recounting of abnormal events by learning deep generic knowledge, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 3619–3627.
Ionescu, R.T., Khan, F.S., Georgescu, M.I., Shao, L., 2019. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7842–7851.
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A., 2017. Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
Karras, T., Laine, S., Aila, T., 2019. A style-based generator architecture for generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Liu, W., Luo, W., Lian, D., Gao, S., 2018. Future frame prediction for anomaly detection – a new baseline, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6536–6545.
Lu, C., Shi, J., Jia, J., 2013. Abnormal event detection at 150 fps in MATLAB, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2720–2727.
Luo, W., Liu, W., Gao, S., 2017a. Remembering history with convolutional LSTM for anomaly detection, in: 2017 IEEE International Conference on Multimedia and Expo (ICME), IEEE. pp. 439–444.
Luo, W., Liu, W., Gao, S., 2017b. A revisit of sparse coding based anomaly detection in stacked RNN framework, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 341–349.
Mahadevan, V., Li, W., Bhalodia, V., Vasconcelos, N., 2010. Anomaly detection in crowded scenes, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE. pp. 1975–1981.
Mathieu, M., Couprie, C., LeCun, Y., 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
Medel, J.R., Savakis, A., 2016. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390.
Munawar, A., Vinayavekhin, P., De Magistris, G., 2017. Spatio-temporal anomaly detection for industrial robots through prediction in unsupervised feature space, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE. pp. 1017–1025.
Nguyen, T.N., Meunier, J., 2019. Anomaly detection in video sequence with appearance-motion correspondence, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1273–1283.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241.

Gao, F., Yu, J., Zhu, S., Huang, Q., Tian, Q., 2018. Blind image quality prediction by exploiting multi-level deep representations. Pattern Recognition 81, 432–442. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in neural information processing systems, pp. 2672–2680. Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K., Davis, L.S., 2016. Learning temporal regularity in video sequences, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742. Hinami, R., Mei, T., Satoh, S., 2017. Joint detection and recounting of abnormal events by learning deep generic knowledge, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 3619–3627. Ionescu, R.T., Khan, F.S., Georgescu, M.I., Shao, L., 2019. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7842–7851. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A., 2017. Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Karras, T., Laine, S., Aila, T., 2019. A style-based generator architecture for generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 . Liu, W., Luo, W., Lian, D., Gao, S., 2018. Future frame prediction for anomaly detection–a new baseline, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6536–6545. Lu, C., Shi, J., Jia, J., 2013. Abnormal event detection at 150 fps in matlab, in: Proceedings of the IEEE international conference on computer vision, pp. 2720–2727. Luo, W., Liu, W., Gao, S., 2017a. Remembering history with convolutional lstm for anomaly detection, in: 2017 IEEE International Conference on Multimedia and Expo (ICME), IEEE. pp. 439–444. Luo, W., Liu, W., Gao, S., 2017b. A revisit of sparse coding based anomaly detection in stacked rnn framework, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 341–349. Mahadevan, V., Li, W., Bhalodia, V., Vasconcelos, N., 2010. Anomaly detection in crowded scenes, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE. pp. 1975–1981. Mathieu, M., Couprie, C., LeCun, Y., 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 . Medel, J.R., Savakis, A., 2016. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390 . Munawar, A., Vinayavekhin, P., De Magistris, G., 2017. Spatio-temporal anomaly detection for industrial robots through prediction in unsupervised feature space, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE. pp. 1017–1025. Nguyen, T.N., Meunier, J., 2019. Anomaly detection in video sequence with appearance-motion correspondence, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1273–1283. Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer. pp. 234– 241. 
Sabokrou, M., Khalooei, M., Fathy, M., Adeli, E., 2018. Adversarially learned one-class classifier for novelty detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3379–3388. Srivastava, N., Mansimov, E., Salakhudinov, R., 2015. Unsupervised learning of video representations using lstms, in: International conference on machine learning, pp. 843–852. Tudor Ionescu, R., Smeureanu, S., Alexe, B., Popescu, M., 2017. Unmasking the abnormal events in video, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2895–2903. Yu, J., Shi, S., Gao, F., Tao, D., Huang, Q., 2017. Towards realistic face photosketch synthesis via composition-aided gans, in: arXiv: 1712.00899. Zhang, Y., Lu, H., Zhang, L., Ruan, X., Sakai, S., 2016. Video anomaly detection based on locality sensitive hashing filters. Pattern Recognition 59, 302–311. Zhao, Y., Deng, B., Shen, C., Liu, Y., Lu, H., Hua, X.S., 2017. Spatio-temporal autoencoder for video anomaly detection, in: Proceedings of the 25th ACM international conference on Multimedia, ACM. pp. 1933–1941.