Future Generation Computer Systems 90 (2019) 198–210
Contents lists available at ScienceDirect
Future Generation Computer Systems journal homepage: www.elsevier.com/locate/fgcs
Simultaneous denoising and moving object detection using low rank approximation ∗
Shijila B., Anju Jose Tom , Sudhish N. George Department of Electronics & Communication Engineering, National Institute of Technology Calicut, Kerala, India
highlights • • • •
Denoising and Moving object detection based on Low-rank approximation and l1 -TV regularizations. Denoising is done by using nuclear norm and l2 -norm regularization. Spatial continuity constraint was effectively utilized by the TV regularization. Detection performance of the method outperformed the compared existing techniques in terms of Fjoint measures.
article
info
Article history: Received 19 March 2018 Received in revised form 16 July 2018 Accepted 28 July 2018 Available online xxxx Keywords: Moving object detection Low rank recovery Background subtraction Robust principle component analysis
a b s t r a c t Moving Object Detection (MOD) and Background Subtraction (BS) are the fundamental tasks in video surveillance systems. But, one of the major challenges which badly affects the accuracy of detection is the presence of noise in the captured video sequence. In this work, we propose a new moving object detection method from noisy video data named as De-Noising and moving object detection by Low Rank approximation and l1 -TV regularization (DNLRl1 TV ). In general, background of videos are assumed to lie in a low-rank subspace and moving objects are assumed to be piecewise smooth in the spatial and temporal direction. The proposed method consolidates the nuclear norm, l2 -norm, l1 -norm and Total Variation (TV) regularization in a unified framework to obtain simultaneous denoising and MOD performance. The nuclear norm exploits the low-rank property of the background, the sparsity is enhanced by the l1 norm, TV regularization is used to explore the foreground spatial smoothness and noise is detected and removed by the l2 -norm regularization. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art approaches in terms of denoising capability and detection accuracy. © 2018 Elsevier B.V. All rights reserved.
1. Introduction The detection of moving objects is an important step in computer vision to develop numerous kinds of systems, such as intelligent video surveillance and motion capture. These systems are used in a wide range of applications, including retail, home automation, security, traffic monitoring, control applications, safety etc. In visual surveillance systems, the detection of moving objects is an important task to identify the useful insights from video data, such as intrusion detection, sidelined objects, traffic data collection etc. Object detection in real time environment has several important applications. This provides better sense of security using visual information and helps to automatically recognize people and objects. In retail space, it helps to analyze shopping behavior of customers. In traffic management system, it helps to find the flow of vehicles and detect/warn accidents. In video editing scenarios, it ∗ Corresponding author. E-mail address:
[email protected] (A.J. Tom). https://doi.org/10.1016/j.future.2018.07.065 0167-739X/© 2018 Elsevier B.V. All rights reserved.
aids to eliminate cumbersome human operator interaction, for designing futuristic video effects. These are some of the applications of object detection in real time environment. The top level applications of computer vision such as visual surveillance systems, object tracking etc. demand Background Subtraction (BS) followed by detection of moving objects. The simplest method for background modeling is to obtain a background image without including any moving objects. The main conditions which can ensure good results using background subtraction are static camera, constant illumination and static background. However, the background can be affected under some situations, such as illumination changes, noise and dynamic background. Therefore, it is a mandatory requirement that the background representation model must be robust and adaptive with respect to these challenges. In addition to above mentioned problems, there are many challenges associated with general Moving Object Detection (MOD) to decide its accuracy and precision in performance. Noise, illumination changes, occlusion, camouflage, clutter, camera jitter etc. are some of them. A video signal is generally contaminated by noise
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
Fig. 1. Illustration of IBT [1] and GRASTA [2] against Gaussian noise: 1–6th column illustrates the input noisy video data, background and foreground masks estimated using GRASTA [2] and IBT [1] method respectively. Top row: Backdoor sequence, Bottom row: Highway sequence. The circled portions correspond to the misclassified background components.
during the recording process. That is mainly caused during signal transmission, acquisition, coding and between the processing steps. The different types of noises like sensor noise, speckle noise, compression artifacts etc. degrade the quality of videos. Usually, these noises can produce undesirable effects or artifacts in the background scenes. Many algorithms on BS and MOD have been proposed in response to the rising demands for surveillance cameras. Most of the existing methods are unable to achieve good performance due to the above mentioned issues. Usually, these methods classify moving objects as well as background pixels which are near to moving objects, as foreground pixels which reduces the precision badly. Even if there exists many methods for MOD, no method is capable of detecting a moving object from the noisy video by removing the noise from each frame in an effective manner. The noise affected frames stand against the efficient working of the classical methods, as those methods result in holes in the foreground as well as misclassifications in the background. Fig. 1 shows the visual performance of the IBT [1] and GRASTA [2] against noise. Both methods fail in the accurate background and foreground extraction. The considerable amount of foreground components is captured in the background, which reduces the precision and recall value of the background component. From Fig. 1, it can be seen that these methods detect noise as foreground pixels. Hence there is an immense requirement of developing a low complexity scheme for simultaneous denoising and moving object detection so that the accuracy of the detection can be maximized. This work is concentrated on extracting moving objects from a noisy background. 1.1. Literature review Nowadays, computer vision systems are commonly employed in many applications. Surveillance systems, traffic monitoring and robotics are some of the common areas where computer vision systems are employed for automation. Detection of moving objects is fundamental and crucial task in these systems. MOD methods are mainly classified as pixel based methods and frame-wise methods. Adaptive Median Filtering (AMF) [3], Running Gaussian Average (RGA) method, Mixture of Gaussian (MoG) [4] etc. are some of the examples of pixel based methods. The mixture of Gaussian is a method that can handle multi-modal distribution. In this method, all objects are filtered out and each pixel location is represented by a mixture of Gaussian functions that come together to form a probability distribution function. Pixel based methods result in comparatively poor visual quality for latest high definition (HD) videos and are computationally expensive. The methods developed later [5–8] are concentrated on framewise statistics. In 1999, Oliver et al. proposed a background model based on Principal Component Analysis (PCA) [9]. This technique
199
changes each frame to a PCA-transformed space and uses eigenvector to reconstruct the background image. Foreground detection is then achieved by thresholding the difference between the generated background image and the current image. PCA provides a robust model of the probability distribution function of the background, but not for the moving objects as they do not have a significant contribution to the model. Recently, Robust Principal Component Analysis (RPCA) has been extensively used in many algorithms in computer vision problems. However, it can handle only simple indoor and outdoor scenarios. It works well only with static background having few moving object with uniform movements. But in real life scenarios, the background is not always static in nature, and the objects (foreground) movements are not uniform. Recently, to improve the performance of conventional RPCA on moving object detection, many methods have been derived. Detecting Contiguous Outliers in the Low-Rank Representation (DECOLOR) [10] is based on RPCA template, where regularized non-convex l0 penalty and Markov Random Field (MRF) are used for extracting moving objects and background. This method gives better result than RPCA, but in some other contexts DECOLOR fails. This method detects some regions near to moving objects (greedy property) and they are misclassified as foreground pixels. Hence the exact shape of the foreground is lost. In recent years, low-rank subspace learning models [11,12] and sparse models [13,14] represent a new trend and achieve better performance than the state-of-the-art techniques. Cao et al., proposed a framework named Total Variation Regularized RPCA (TVRPCA) [15] which deals with dynamic backgrounds and longstanding or slowly moving objects. This method is based on two assumptions, namely, the moving foreground objects have spatial and temporal continuity and the dynamic background is sparser than foreground objects. However, it works only if the dynamic natured background is sparser than the moving foreground having smooth boundary and trajectory and becomes ineffective in addressing the special cases similar to camouflage and far away objects. Zhao et al. proposed Low Rank and Sparse Decomposition (LRSD) [16] to detect outliers, which prefers the regions that are relatively solid and contiguous. LRSD is able to tolerate dynamic background variations, without losing the sensitivity to detect real foreground objects. This method is weak in case of noise robustness performance. Tensor-based robust PCA [17] approach for background subtraction from compressive measurements, in which Tucker decomposition is utilized to model the spatial and temporal correlation of the background in video streams. 3D-TV [18] is employed to characterize the smoothness of video foreground. The convergence requirement for this method is about 250 iterations even for a video sequence with 20 frames. Hence this method is not suitable for real-time applications. Iterative Block Tensor Singular Value Thresholding (IBTSVT) [1] method developed by Chen et al., considered the video data as a 3D tensor. They proposed a new method called Tensor Principal Component Analysis (TPCA) to extract the principal components of the data based on tensor singular value decomposition. This is a block based approach, and hence the main limitation of this method is the selection of the block size. The block size depends on the number of frames and hence the computational complexity increases with the increase in block size. Sajid et al. proposed Online Tensor Decomposition (OSTD) method [19] to address this problem and is efficient for large size data. But the main problem with this method is that it considers single frame per each time instant and it is not a suitable choice for real time applications. Hu et al. proposed [20] another method for MOD based on Saliently Fused Sparse (SFS) decomposition for the low rank tensor [21]. This method combines 3D Locally Adaptive Total Variation (LATV) with l1 -norm to construct the fused-sparse regularization
200
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
of the foreground. Graph spectral theory is used to find LATV and this method helps to get a smooth solution of moving objects. The performance of this method to camouflaged sequences is not pleasing. Nowadays, the video sequences available from the surveillance system has increased dramatically. It is difficult to perform the background subtraction algorithm for such a large amount of data. In order to solve this issue, real time techniques are required [22, 23]. Many real time/incremental algorithms have been proposed in response to the rising demands for real time systems. Guo et al. proposed an online method for separating the slowly changing background and sparse foreground sequence that consists of one or more moving objects from a video (PracReProCS) [24]. This method uses some training sequences at initial stages which do not contain moving objects, but in real cases, it is difficult to get the clean frames. In addition to this, some factors like noise or dynamic background of the environment also influence the frames, which in turn affect the performance of the method. PracReProCS and incPCP [25] methods are extension of classical PCP algorithm. Incremental Gradient on the Grassmannian for online foreground and background separation in sub-sampled video (GRASTA) [2] is another method in this category, proposed by He et al. GRASTA updates the subspace where the background lies and separates the foreground in an online manner. Here l1 norm loss term is used for getting foreground components and Alternating Direction Method of Multipliers (ADMM) is used for subspace update. Recently, Xu et al. proposed Grassmannian Online Subspace Updates with Structured-sparsity (GOSUS) [26], which uses complex loss term to get structured sparsity term. The solution is based on ADMM similar to that of GRASTA. Seidel et al. proposed a smoothed lp -norm robust online subspace tracking method for real time background subtraction in videos (pROST) [27]. In the case of multi-modal backgrounds, pROST outperforms GRASTA. Most of the algorithms estimate background and foreground as low rank and sparse components respectively. These algorithms are not utilizing the structural property of both background and foreground components. Hence, the performance of these algorithms degrade sharply when they are used in real life cases (e.g. presence of dynamic background, irregular movements of object, presence of noise etc.). Sajid et al. proposed Online Spatio-temporal RPCA (OS-RPCA) [28], combined Active Random Field (ARF) and Online RPCA (ORPCA) [29] and Motion-assisted Spatio-temporal Clustering of Low-rank (MSCL) [30]. These algorithms utilize the structural property of the foreground component. The OS-RPCA [28] algorithm is a graph regularized algorithm which helps to separate background and foreground components from RGB-D videos. In this method, low-rank background spatiotemporal information is preserved in the form of dual spectral graphs. The combined ARF and Online RPCA [29] technique is used to separate the low-rank and sparse components from video with bad weather conditions such as fog, snow or haze, rain etc. In this method, ARF helps to remove rain, fog and snow. ORPCA is used to separate background and foreground component from ARF output. The MSCL [30] algorithm incorporates the spatial and temporal subspace clustering into RPCA for background modeling and the solution is based on LADMAP approach. Sorbel et al. [31] proposed double-constrained version of RPCA to improve the foreground detection in maritime environments and is called Shape and Confidence Map-based RPCA (SCM-RPCA). The algorithm makes use of double constraints extracted from spatial saliency maps to enhance the object detection in marine scenes. Dastanova et al. [32] proposed an algorithm for moving object detection, which illustrates the advantages of memristive circuits in terms of power and area consumption. The memristive crossbar arrays are used for storing the bit planes of frames and thresholdlogic XOR gates are used for comparing each frames. The binary
images are obtained by combining the XOR output using weighted summation circuits and then they threshold it using comparators. The Content Addressable Memory (CAM) array helps for finding the moving objects by finding the number of different object pixels in the first and second pairs of the processed frames, in a row-byrow manner. The main disadvantage of this method is the detection accuracy. This method is unable to detect all the foreground pixels in the ground truth. Cheng et al. proposed a hybrid background subtraction method [33] to separate the background and foreground pixels from a video sequence. This algorithm detects background pixels by the effective utilization of unimodal mode and the features of the pixels like luminance, gradient features etc. This method is developed on the assumption that background pixel values show higher frequencies and less variance than any foreground pixel. This method cannot handle occlusion, multiple moving object, dynamic background etc. efficiently. Three-pronged compensation and Hysteresis Thresholding for Moving Object Detection (THTMOD), was proposed by Chia et al. [34] based on the concepts of hysteresis thresholding and motion compensation. It combines both texture and color information to restore the broken foreground objects caused due to the motion problems such as slow motion or walking toward camera. However, the execution time of this approach is comparatively high. An advanced method for moving object detection was proposed by Huang et al. [35], where background initialization is done through modified moving average algorithm and Alarm Trigger (AT) module is used for detecting a moving object. This method has been tested for both outdoor and indoor environments, but the accuracy of detection is comparatively less. In recent years, many algorithms were proposed for detecting moving objects from traffic monitoring system [36–38]. Huang et al. [36] proposed a motion detection approach, which is based on the Cerebellar-Model-Articulation-Controller (CMAC) through Artificial Neural Networks (ANN) to completely and accurately detect moving objects in both high and low bit-rate video streams. In [37], Haung et al. used optical flow (OF) to track small objects. Genetic algorithm [38] is used to detect object by combining Genetic Dynamic Saliency Map (GDSM) and background subtraction algorithm. Panda et al. [39] proposed a background subtraction algorithm based on Fuzzy Color Difference Histogram (FCDH) by using fuzzy c-means (FCM) clustering and exploiting the Color Difference Histogram (CDH). The CDH measures the color difference between a pixel and its neighborhood pixels. The function of FCM clustering algorithm in CDH helps to find the intensity effect variation generated due to the change in illumination of the background or fake motion. Chiranjeevi et al. [40] proposed a multi-channel kernel fuzzy correlogram algorithm for background subtraction, which exploits the color information by taking into account the color dependencies and the inter-pixel relations across and within the color planes. Then, the correlograms are mapped to a space of reduced dimensionality by using a transformation based on a fuzzy membership matrix, whose elements indicate the belongingness of each intensity pair to the new bins. Huang et al. [41] proposed a new algorithm for object detection from dynamic scenes based on radial basis function of Artificial Neural Networks (ANN). It contains two operations, which are multi-background generation and a moving object detection. However, the detection accuracy of this method is less compared with the recent counterparts. Proposed work is developed on the popular RPCA [42] method. RPCA decomposes a matrix into a low rank background matrix and a sparse moving object (foreground) matrix. Basically, background portion is expected to be present in a low-rank subspace and the sparse mobile objects are assumed to be piecewise smooth in the spatial and continuous in the temporal directions. So RPCA is able to handle MOD problem as mentioned in [42]. In addition
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
201
to RPCA, Low Rank and Total Variation (LRSTV) model [43–47] developed for hyper-spectral remote sensing image denoising and other computer vision solutions, also aided in the formulation of the proposed method. 1.2. Our contributions This method offers the following improvements.
Fig. 2. Constraint region of l1 -norm.
• To the best of our knowledge, this is the first attempt to implement the Low Rank (LR) and TV based model for simultaneous MOD and denoising. This method utilizes the low rank property of background to remove the noise and extract the background. • Nuclear norm and l2 -norm regularization are used to separate the original background from the noisy background. The proposed method combines l1 -norm and spatial TV-norm regularization for the foreground part. The TV-norm enhances the spatial and spectral smoothness to preserve the edges more precisely. • The problem is resolved based on the Augmented Lagrange Multiplier (ALM) with Alternating Direction Minimizing (ADM) approach [48] and the proposed DNLRl1 TV model is tested with both synthetic and real world video sequences. The experimental results prove that the proposed method exhibits superior performance than the counterparts with reasonable computational overhead. The rest of this paper is organized as follows. In Section 2, we explain the basic principles which are used for the proposed DNLRl1 TV method. Section 3 presents our DNLRl1 TV method in details, which includes constrained problem formulation and solution of the proposed method. The experimental results and performance evaluations are explained in Section 4. Finally, the conclusions are drawn in Section 5. 2. Basic principles
2.1. Robust Principal Component Analysis (RPCA) and low rank approximation RPCA considered the input matrix as sum of the low-rank matrix and sparse matrix. This method recovers the low rank and sparse outliers matrix by solving an optimization problem [52]. It is formulated as, rank(A) + λ∥B∥0
s.t.
O=A+B
A,B
min
∥A∥∗ + λ∥B∥1
s.t.
O=A+B
A,B
(2)
where, ∥A∥∗ represents the nuclear norm of matrix A. The minimization of ∥A∥∗ enforces low rank in A, while the minimization of B maximizes the sparsity in B. The minimization of rank is obtained by applying Singular Value Decomposition (SVD) with thresholding of singular values up to a particular value as mentioned [43,25,53]. The SVD of an r × s matrix A is, A = UΣ VT
(3)
where, U is r × r matrix and V is s × s matrix. The column of U and V are called left and right singular matrices of A respectively. Σ is an r × s diagonal matrix, where diagonal entries are the singular values (square roots of eigenvalues) of AT A. They are arranged in descending order. Most of the energy is associated with the first few singular values and the smaller singular values generally represent noise or other sparse disturbance. This noise or disturbances can be removed by suitable thresholding operation. 2.2. l1 -norm
Learning low-rank [49–51] and sparse structures from the video, has substantiated the modeling of background and foreground structures in intelligent video surveillance systems. Here, our assumption on background and foreground modeling are, the background is assumed to be lie in a low dimensional subspace because the background of each frames are highly correlated, and sparse outliers usually represent the foreground (or moving) objects. For noisy videos, the background is assumed to lie in a low dimensional subspace whereas, moving objects are sparse. The noisy video O can be modeled as, O = A + B + N, where, A is the background component, B is the foreground component and N is the noise component. In the proposed method, the background and foreground modeling are based on RPCA [42].
min
rank and l0 -norm. Candes et al. [42] showed that A and B can be recovered by solving a convex optimization problem called Principal Component Pursuit (PCP). In PCP, rank is replaced with nuclear norm and l0 — norm with its convex surrogate l1 -norm. So Eq. (1) can be rewritten as,
(1)
where, A represents the low rank matrix, B represents the sparse matrix and λ is a trade off parameter to determine the relative significance of minimizing rank(A) and l0 -norm of B. Eq. (1) is highly non-convex optimization problem due to the presence of
The l1 -norm regularization, which makes the B matrix to be sparse, as most of its components are zeros and at the same time, the remaining non-zero components are very significant for perfect recovery. The constraint region of l1 -norm is shown in Fig. 2. The reason for using l1 -norm [54] to find a sparse solution is due to its special shape, as spikes that happen to be at sparse points. This sparse point moves through the solution surface and finally it touches the sparse solution. The l1 -norm of an r × r matrix B with element bij is given by the maximum value of column sum.
∥B∥1 = max ( 1≤j≤r
m ∑ |bij |)
(4)
i=1
In MOD, foreground parts are spatially contiguous. The l1 -norm assumes that each pixel is independently corrupted and this will neglect the adjacent pixel correlation. Hence it is difficult to extract foreground mask using l1 -norm alone. So, there must be some measure besides l1 -norm to extract the perfect shape of the foreground mask. 2.3. Spatial Total Variation (TV) norm Nowadays, most of the image processing methods in de-noising problem use TV regularization because of its effectiveness in preserving edge information and the spatial smoothness. It is based on the assumption that signals with excessive and possibly spurious details have high total variation. The TV regularization has become one of the standard techniques known for preserving edges
202
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
and object boundaries. TV is the most preferential measure to enhance the spatio-temporal continuity by filling up the gaps and suppressing the intensity changes due to the dynamic nature of the background. Generally, TV [43] is defined in terms of derivative function (Gradient operation). But for gray scale images, this derivatives are substituted by difference operators. The two types of discrete TV [55] are anisotropic and isotropic TV . The l1 -norm based anisotropic TV norm of gray-level image O of size M × N [56] is defined as,
∥O∥TVa =
∑ |Oi+1,j − Oi,j | + |Oi,j+1 − Oi,j |
(5)
i,j
The isotropic TV norm is defined as,
∥O∥TVi =
∑√
|Oi+1,j − Oi,j |2 + |Oi,j+1 − Oi,j |2
(6)
the nuclear norm (sum of singular values) and l1 -norm are substituted as their equivalent surrogates. Eq. (8) can be reformulated as, m in ∥B∥∗ + ∥N∥2F + λ ∥S∥1
B,S
s.t . ∥O − B − S∥2F ≤ ε, rank(B) ≤ r .
Eq. (9) works well only for the static background with uniformly moving objects and no shelter, but degrades sharply with realistic situations. Since TV -norm preserves edge information and spatial smoothness, we have incorporated both l1 -norm and TV -norm in the objective function for the foreground part. Then, the objective function is rewritten as, m in ∥B∥∗ + ∥N∥2F + λ ∥S∥1 + τ ∥S∥TV ,
B,S
s.t . ∥O − B − S∥2F ≤ ε, rank(B) ≤ r .
i ,j
Isotropic TV is mathematically formulated based on the l2 -norm [57] in a direction independent form. The direction independent nature of isotropic TV puts limits on getting access to the directional properties existing spatially and temporally. The TVregularized MOD problem recovers the foreground in such a way that the resulting image is piecewise smooth in all directions by exploiting the similarities existing between adjacent pixel values of each frames. Apart from grabbing the spatial smoothness of the foreground, the TV regularization may capture and preserve singularities (edges) [58]. Inspired by these capabilities of TV, this work uses anisotropic TV regularization for extraction of foreground matrix. 3. Proposed method In this section, the problem formulation and its solution are briefed. The aim of the proposed method is to formulate an algorithm for performing simultaneous denoising and moving object detection in a single framework which can perform well in various challenging situations such as camera jitter, dynamic background, shadow etc. The block diagram of DNLRl1 TV is shown in Fig. 3. 3.1. Problem formulation Let the input video stream be V ϵ RM ×N ×P where, P is the number of frames and M and N are the height and width of each frame. Initially, stacking of each frame are done as a column of matrix O and hence the problem is converted to a 2D platform with a newly formulated matrix O ϵ RMN ×P . The observed noisy video stream contains the background component B ϵ RMN ×P , the foreground S ϵ RMN ×P and noise component N ϵ RMN ×P . Hence, it can be decomposed as, O=B+S+N
(7)
Here, the background (almost static and highly correlated between frames) is assumed to lie in a low dimensional space and S is the sparse matrix. Hence, there can be a constraint on the rank of the input frame matrix which can give the knowledge of the sparse foreground on combining with the sparsity constraint. The l2 -norm is used for minimizing the noise component. The object detection is obtained through minimizing the following problem: m in rank(B) + ∥N∥2F + λ ∥S∥0
B,S
s.t . ∥O − B − S∥2F ≤ ε
(8)
where ∥.∥0 denotes the l0 -norm, ∥.∥F denotes Frobenius norm and rank(.) denotes rank of a matrix. Eq. (8) is similar to RPCA [42] decomposition, where rank(.) and l0 -norm are non-convex in nature. So, it is NP hard and difficult to approximate the solution. Hence,
(9)
(10)
where, the nuclear norm exploits the low-rank property of the background whereas noise is minimized by using l2 -norm regularization, the sparsity is enhanced by the l1 -norm and the foreground spatial smoothness is explored by TV regularization. In Eq. (10), on replacing N with O − B − S, the constrained convex optimization equation for the proposed method can be rewritten as, m in ∥B∥∗ + ∥O − B − S∥2F + λ ∥S∥1 + τ ∥S∥TV ,
B,S
s.t . O = B + S, rank(B) ≤ r .
(11)
where, λ is the parameter that determines the relative significance of B and S. λ is used to maximize the sparsity of the moving object and τ is the trade-off parameter used to control the TV -norm. Regularization parameter τ plays an important role in the object detection problem. 3.2. Solution The optimization problem in Eq. (11) is solved by using the Augmented Lagrange Multiplier (ALM) with Alternating Direction Minimizing (ADM) approach [48]. For the problem to be more clear, another variable C is imported and equated to S and is used as an additional constraint. m in ∥B∥∗ + ∥O − B − S∥2F + λ ∥S∥1 + τ ∥C∥TV ,
B,S
s.t . O = B + S, rank(B) ≤ r , S = C
(12)
Using ALM, Eq. (12) can be expanded as, Lµ (B, S, C, Λ1 , Λ2 )
= ∥B∥∗ + ∥O − B − S∥2F + λ ∥S∥1 + τ ∥C∥TV + ⟨Λ1 , O − B − S⟩ + ⟨Λ2 , C − S⟩ µ + (∥O − B − S∥2F + ∥C − S∥2F ) 2
(13)
where, µ is a positive penalty scalar, Λ1 , Λ2 ϵ RMN ×P are the Lagrange multiplier matrices, ⟨., .⟩ denotes the matrix inner product and ∥.∥F indicates the Frobenius norm. Since there are three variables besides the Lagrange multipliers, i.e., B, S, and C, it is difficult to optimize them simultaneously. So an approximate solution is obtained by finding the minimum value of one variable at a time while setting the other variables as constant (using the concept of ADM). A closed form solutions for B, S, and C are explained in each subproblem below. 3.2.1. B subproblem In a video sequence, the background of each frames are highly correlated. Hence, B is the matrix of low rank components and the aim is to find the r value indicating the low rank which is identified by the significant difference in the singular values. The
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
203
Fig. 3. Basic block diagram of the proposed DNLRl1 TV method.
update expression for B with the other terms fixed can be obtained as,
This can be reduced to , S
Bk+1
= argmin Lµ {B, Sk , Ck , Λk1 , Λk2 } B 2 = argmin ∥B∥∗ + O − B − Sk F + B 2 µ ⟨Λ1 , O − B − Sk ⟩ + (O − B − Sk F ) 2 ( = argmin ∥B∥∗ + (µ + 1) B − O − Sk + B
[ =S
k+1
O−B
λ µk +1
Λk + Λk2 + C + 1k µ +1 k
] (19)
where S [.] is the shrinkage operator defined in Eq. (17).
)2 Λk1 µk + 1 F
3.2.3. C subproblem C denotes the sparse foreground and on updating C under the same condition, the solution can be found by applying shrinkage operator given in [43], as given below. (14)
Λk1 ) µ +1 k
Bk+1 = US1/µk +1 (Σ )V T
(15)
2 µ = argmin τ ∥C∥TV + ⟨Λ2 , C − Sk+1 ⟩ + (C − Sk+1 F ) C 2 ( ) k 2 µ Λ 2 k+1 = argmin τ ∥C∥TV + C − S − (20) C 2 µk F This can be reduced to the closed form solution,
(16)
where, U Σ V T is the singular value decomposition (SVD) of (O − Λk
Sk − µk +1 1 ) and S [.] represents the shrinkage operator which is defined as follows, Sθ >0 (.) = sgn(b)max(|b| − θ, 0)
Ck+1 = argmin Lµ {Bk+1 , Sk+1 , C, Λk1 , Λk2 } C
This can be reduced to the closed form solution based on the mathematical proof given in [59], which is given as, (U Σ V T ) = sv d(O − Ck +
k+1
(17)
On applying this operator element-wisely, its matrix extension [60] is possible.
3.2.2. S subproblem Matrix S represents the denoised sparse foreground component. The update expression for S with the other terms as fixed is, Sk+1 = argmin Lµ {Bk+1 , S, Ck , Λk1 , Λk2 } S
2 = argmin λ ∥S∥1 + O − Bk+1 − SF + ⟨Λ2 , Ck − S⟩+ S 2 2 µ ⟨Λ1 , O − Bk+1 − Sk ⟩ + (O − Bk+1 − SF + Ck − SF ) 2
= argmin λ ∥S∥1 S ( )2 Λk1 + Λk2 k+1 k + (µ + 1) S − O − B +S + k µ + 1 F (18)
Ck+1 = S
[ τ 2µk
Sk+1 −
Λk2 µk
] (21)
3.2.4. Updating expression for multipliers The updating expressions for multipliers with other terms fixed are,
( ) Λk1+1 = Λk1 + µk O − Bk+1 − Sk+1 ( ) Λk2+1 = Λk2 + µk Ck+1 − Sk+1
(22)
Algorithm 1 gives a summary of the proposed method. Any surveillance video sequence is expected as input to implement the proposed method. After RPCA decomposition of the input, one of the components is modeled as low rank approximated matrix by SVD and soft thresholding. These tasks constitute the separation process of background. The remaining challenge is to pick up the sparse components to form foreground part. This is done by the TV — l1 regularization done on S and together these regularizations set off an extraordinary but low complexity scheme to acquire the foreground part from noisy background. 4. Experimental results The implementation of proposed method is carried out in MATLAB version 2014 on a single PC with its CPU specification being Intel (R) Core(TM) CPU i5-4590, 3.30 GHz for 8GB RAM and 64 bit operating system. For validating the robustness of the proposed DNLRl1 TV method under different noises, USCD (University of California at San Diego Academic & Science) Background Subtraction
204
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
Dataset [10], synthetic data sequences from SABS dataset [61], real world data from CD.net [62] and BMC dataset [63] are used. The CD.net dataset [62] contains more than 31 video and spanning six categories. Categories include bad weather, illumination changes, dynamic background etc. SABS dataset [61] contains synthetic video which is developed from modern ray-tracing on real models in real time. USCD background subtraction dataset [10] contains 17 video data with each frame in the JPEG format. This dataset covers various challenging situations such as baseline, dynamic background, camera jitter, intermittent motion, shadow and thermal. For comparing the performance of the proposed method, DECOLOR [10], TVRPCA [15], IBTSVT [1], OSTD [19], ReProCS [24], GOSUS [26], GRASTA [2] and incPCP [25] methods are used. For Quantitative evaluation of the proposed algorithm, first step is to calculate the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values. True Positive (TP): TP refers that both ground truth and system result agrees that the pixel belongs to the one or more objects. False Positive (FP): Ground truth declares that the pixel as background but the system result classifies the pixels as object. True Negative (TN): TN means that both ground truth and system result agree on the absence of object. False Negative (FN): FN means that the system result assigns the pixel to background but in ground truth which is an object. After finding these four values, we can calculate f-measures as follows [64]: Algorithm 1: DNLRl1 TV Algorithm Input : Original Video V ϵ RM ×N ×P , then add noise to the video to form a noisy video. Stack the noisy video frames to form O ϵ RMN ×P . Initialization : B0 = S0 = C0 = Λ01 = Λ02 = 0 ϵ RMN ×P , λ > 0 µ0 = 10−2 , ρ > 1, µmax = 104 and k = 0. Output : Optimal solution (Kk , Ck ) while not converged do Update Bk+1 via Eq. (13) Update Sk+1 via Eq. (16) Update Ck+1 via Eq. (18) Update Λk1+1 via Eq. (19) Update Λk2+1 via Eq. (19) Update the parameter µk+1 = min(ρµk , µmax ) k=k+1 Check the convergence condition ∥O − B − S∥2 ≤ 10−8 ∥O∥2F 10. end 1. 2. 3. 4. 5. 6. 7. 8. 9.
f0 − measure =
2RP R+P
, P=
TP TP + FP
, R=
TP TP + FN
(23)
where, P represents the relation between rightly classified foreground pixels and total number of foreground pixels after classification and R shows the relation of the foreground elements that are correctly classified with respect to the foreground elements included in the ground truth. f1 − measure =
2R1 P1 R1 + P1
, P1 =
TN TN + FN
, R1 =
TN TN + FP
(24)
where, P1 represents the relation between rightly classified background pixels and total number of background pixels after classification and R1 shows the relation of the background elements that are correctly classified with respect to the background elements included in the ground truth.
Combined measure of the foreground and the background performances (f0 and f1 measure) is calculated as follows [64]: fjoint − measure =
2f0 f1 f0 + f1
(25)
For good performance, fjoint measure must be close to 1. 4.1. Response to Gaussian noise corrupted videos Most common noise in image is Gaussian noise, which arises during acquisition and transmission time due to poor illumination and high temperature. To verify the performance of the DNLRl1 TV method, ten video sequences from CD.net dataset [62] are considered and are added with Gaussian noise of different variances. The variance of the Gaussian noise used are, σ =0.005, 0.01 and 0.02. The proposed method is examined in contrast with the leading methods of BS and MOD such as DECOLOR [10], TVRPCA [15], IBT [1], OSTD [19], ReProCS [24], GOSUS [26], GRASTA [2] and incPCP [25]. About 300 frames are used for the execution and 90 . Ten iterations are comparison. This work uses λ as, √max(M ,N)×P required on an average to get a visually pleasing foreground/ background separation for proposed method. First, a sample of Gaussian corrupted input video stream with a variance of 0.02 is executed. The results estimated for ten video sequences from CD.net dataset [62] and are illustrated in Fig. 4. The f -measure values obtained using the different real world videos from CD.net dataset for eight existing methods and the proposed DNLRl1 TV method are summarized in Table 1. For comparing the f -measure values of the other existing methods with the proposed method, we chose videos from the challenging cases. In Table 1, the highway and pedestrians data sequences are from baseline sequence, boat sequence is an example for dynamic background dataset, backdoor sequence is an example for shadow data sequence and boulevard sequence is an example for camera jitter case. Comparing to other methods, the proposed method give very good f -measure values and this method have 99.5% background f0 value. It can be found that DECOLOR method detects some region near to the foreground due to its greedy property which decreases its P0 values, which reduces f0 value. Even though TVRPCA method has good values of f1 , its f0 values are less. Since it is failed to detect many foreground pixels in the ground truth, R0 value and correspondingly f0 values are reduced. Since IBT detects noise as foreground pixels, it again fails to detect many foreground pixels in the ground truth. This reduces both P0 and R0 values, and hence the f0 value of IBT is less. The superiority of proposed method in detecting small moving object is clear from the ninth row of Fig. 4. The sequence from BMC dataset (seventh and tenth row) with small sized moving object as the surveillance camera is placed far away from the object. Even in this case, the proposed method efficiently recognizes and captures the moving object. In Fig. 4, the first row is from camera jitter sequence (bouleverd) and comparing to other methods, proposed method has large R0 value. This is because of the fact that the proposed method detects many more foreground pixels in the ground truth. The proposed method is proven to be excellent in resolving camera jitter problem to a very large extent because it is ahead of other methods such as [10,15,1,24,2,26,25,19] in Recall, Precision and f -measure values. Second row data sequence corresponds to dynamic background sequence. Comparing to the result of TVRPCA, proposed method yields far better result, even though TVRPCA is specifically for dynamic background–foreground separation cases. Third, fifth, sixth and eight row correspond to static background case. Comparing to the results of other methods, the proposed method gives better results, same as in the case of bad weather sequence (fourth row), and shadow sequence (eight row).
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
205
Fig. 4. Robustness to Gaussian noise: We added Gaussian noise with zero mean variance, σ = 0.02. 1–8 columns illustrate input noisy video data, ground truth, background obtained using proposed method, foreground masks estimated using DECOLOR, OSTD, TVRPCA, IBT and the proposed DNLRl1 TV method respectively; Top row: Camera jitter case-bouleverd sequence, Second row: dynamic background case-boats sequence, Third row static background case-pedestrian sequence, Fourth row:badweather sequence-Snowfall, Fifth row:static background case-Highway sequence, Sixth row: static background case-pedestrian-man sequence, Seventh row: small size object case-BMC sequence, Eighth row: static background case-backdoor sequence, Ninth row:static background case-pet sequence, Bottom row:small size object case-BMC sequence. For the foreground masks, white represents correctly detected foreground.
206
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
Table 1 Analysis of DNLRl1 TV method and comparison with different MOD Algorithms using CD.net dataset. f0 : f0 -measure, f1 : f1 -measure, fj : fjoint -measure (BEST: BOLD) Methods
Highway f0 f1
Case DECOLOR [10] TVRPCA [15] IBT [1] OSTD [19] ReProCS [24] incPCP [25] GRASTA [2] GOSUS [26] DNLRl1 TV Case DECOLOR [10] TVRPCA [15] IBT [1] OSTD [19] ReProCS [24] incPCP [25] GRASTA [2] GOSUS [26] DNLRl1 TV Case DECOLOR [10] TVRPCA [15] IBT [1] OSTD [19] ReProCS [24] incPCP [25] GRASTA [2] GOSUS [26] DNLRl1 TV Case DECOLOR [10] TVRPCA [15] IBT [1] OSTD [19] ReProCS [24] incPCP [25] GRASTA [2] GOSUS [26] DNLRl1 TV Case DECOLOR [10] TVRPCA [15] IBT [1] OSTD [19] ReProCS [24] incPCP [25] GRASTA [2] GOSUS [26] DNLRl1TV Case DECOLOR [10] TVRPCA [15] IBT [1] OSTD [19] ReProCS [24] incPCP [25] GRASTA [2] GOSUS [26] DNLRl1TV Case DECOLOR [10] TVRPCA [15] IBT [1] OSTD [19] ReProCS [24] incPCP [25] GRASTA [2] GOSUS [26] DNLRl1 TV
Gaussian noise with variance 0.005 0.48 0.99 0.65 0.41 0.98 0.57 0.52 0.75 0.98 0.85 0.69 0.98 0.81 0.75 0.27 0.79 0.40 0.27 0.90 0.42 0.45 0.62 0.98 0.76 0.63 0.98 0.76 0.61 0.58 0.99 0.73 0.54 0.98 0.69 0.57 0.45 0.97 0.60 0.48 0.96 0.64 0.38 0.59 0.98 0.72 0.66 0.98 0.78 0.58 0.62 0.98 0.75 0.68 0.98 0.80 0.58 0.89 0.99 0.94 0.83 0.98 0.90 0.85 Gaussian noise with variance 0.01 0.45 0.99 0.62 0.39 0.98 0.56 0.46 0.74 0.98 0.84 0.67 0.98 0.80 0.73 0.26 0.78 0.38 0.23 0.88 0.37 0.26 0.57 0.97 0.71 0.60 0.98 0.74 0.60 0.57 0.98 0.72 0.49 0.96 0.65 0.54 0.40 0.94 0.56 0.47 0.95 0.62 0.36 0.54 0.97 0.67 0.65 0.98 0.77 0.51 0.59 0.97 0.73 0.63 0.97 0.76 0.56 0.86 0.99 0.92 0.80 0.98 0.88 0.82 Gaussian noise with variance 0.02 0.44 0.99 0.61 0.38 0.98 0.54 0.44 0.69 0.98 0.80 0.64 0.98 0.78 0.69 0.21 0.72 0.33 0.22 0.86 0.35 0.22 0.56 0.96 0.70 0.57 0.98 0.72 0.59 0.51 0.96 0.66 0.47 0.94 0.62 0.52 0.28 0.93 0.42 0.43 0.93 0.58 0.32 0.44 0.96 0.57 0.63 0.97 0.76 0.41 0.57 0.96 0.71 0.60 0.96 0.73 0.54 0.79 0.99 0.88 0.79 0.98 0.87 0.77 Speckle Noise with variance 0.02 0.54 0.97 0.70 0.47 0.99 0.63 0.56 0.69 0.99 0.81 0.77 0.98 0.85 0.72 0.24 0.84 0.38 0.21 0.92 0.33 0.21 0.55 0.97 0.70 0.70 0.98 0.82 0.60 0.54 0.98 0.68 0.48 0.96 0.64 0.42 0.56 0.97 0.71 0.53 0.96 0.68 0.44 0.57 0.98 0.70 0.66 0.98 0.79 0.44 0.58 0.97 0.72 0.61 0.97 0.74 0.57 0.88 0.99 0.93 0.83 0.98 0.90 0.88 Speckle Noise with variance 0.05 0.42 0.97 0.46 0.46 0.99 0.63 0.52 0.68 0.99 0.80 0.73 0.98 0.84 0.67 0.19 0.72 0.30 0.16 0.88 0.28 0.19 0.54 0.96 0.69 0.68 0.98 0.80 0.59 0.50 0.95 0.65 0.44 0.93 0.59 0.4 0.34 0.95 0.49 0.38 0.93 0.54 0.21 0.48 0.97 0.62 0.63 0.98 0.77 0.46 0.56 0.96 0.70 0.57 0.96 0.71 0.55 0.83 0.98 0.90 0.83 0.98 0.90 0.82 Poisson Noise 0.29 0.97 0.44 0.30 0.98 0.47 0.39 0.69 0.99 0.81 0.78 0.98 0.86 0.73 0.54 0.45 0.50 0.41 0.45 0.57 0.98 0.55 0.97 0.70 0.65 0.98 0.78 0.56 0.54 0.96 0.69 0.55 0.97 0.70 0.42 0.34 0.95 0.49 0.38 0.93 0.54 0.21 0.63 0.98 0.75 0.55 0.95 0.68 0.44 0.62 0.98 0.76 0.58 0.96 0.72 0.48 0.88 0.99 0.93 0.83 0.99 0.91 0.84 Salt and Pepper noise with variance 0.02 0.43 0.99 0.59 0.52 0.98 0.67 0.57 0.67 0.98 0.79 0.72 0.98 0.83 0.80 0.26 0.84 0.40 0.18 0.95 0.31 0.30 0.55 0.96 0.69 0.69 0.98 0.81 0.59 0.54 0.96 0.69 0.52 0.97 0.67 0.48 0.40 0.94 0.56 0.48 0.96 0.64 0.35 0.55 0.97 0.68 0.58 0.97 0.72 0.44 0.59 0.98 0.74 0.60 0.97 0.74 0.52 0.80 0.99 0.88 0.83 0.98 0.90 0.84
fj
Backdoor f0 f1
fj
Pedestrians f0 f1 fj
Boulvard f0 f1
fj
Boat f0
f1
fj
Turnpike f0 f1
fj
Snowfall f0 f1
fj
0.98 0.98 0.96 0.98 0.97 0.97 0.99 0.98 0.99
0.67 0.85 0.60 0.75 0.71 0.55 0.73 0.72 0.92
0.49 0.71 0.29 0.62 0.44 0.43 0.44 0.48 0.77
0.98 0.98 0.86 0.98 0.97 0.93 0.93 0.94 0.98
0.65 0.82 0.43 0.75 0.59 0.57 0.59 0.63 0.86
0.68 0.78 0.27 0.58 0.41 0.38 0.57 0.43 0.82
0.99 0.99 0.87 0.98 0.97 0.94 0.97 0.95 0.99
0.76 0.88 0.41 0.72 0.57 0.54 0.72 0.58 0.90
0.31 0.32 0.58 0.53 0.58 0.54 0.65 0.64 0.82
0. 96 0.95 0.96 0.96 0.95 0.95 0.96 0.95 0.99
0.46 0.48 0.72 0.68 0.72 0.69 0.77 0.76 0.90
0.40 0.79 0.33 0.42 0.40 0.46 0.43 0.49 0.87
0.96 0.98 0.95 0.97 0.94 0.95 0.90 0.92 0.99
0.55 0.88 0.49 0.58 0.56 0.61 0.57 0.63 0.92
0.98 0.98 0.93 0.98 0.96 0.96 0.97 0.97 0.99
0.63 0.84 0.40 0.74 0.69 0.52 0.65 0.71 0.90
0.46 0.69 0.26 0.62 0.42 0.40 0.43 0.46 0.77
0.98 0.98 0.82 0.98 0.96 0.93 0.92 0.94 0.98
0.63 0.81 0.39 0.75 0.58 0.56 0.57 0.61 0.86
0.53 0.77 0.26 0.57 0.40 0.37 0.47 0.40 0.81
0.98 0.99 0.85 0.98 0.95 0.94 0.95 0.94 0.99
0.68 0.87 0.39 0.70 0.56 0.53 0.63 0.56 0.89
0.27 0.24 0.45 0.52 0.51 0.51 0.56 0.57 0.81
0.96 0.94 0.94 0.96 0.95 0.95 0.94 0.94 0.99
0.42 0.37 0.60 0.67 0.66 0.66 0.69 0.71 0.89
0.30 0.76 0.29 0.40 0.37 0.45 0.21 0.48 0.85
0.95 0.98 0.93 0.97 0.91 0.87 0.87 0.90 0.99
0.46 0.86 0.44 0.56 0.52 0.60 0.34 0.63 0.92
0.98 0.98 0.89 0.98 0.95 0.94 0.91 0.96 0.99
0.61 0.80 0.35 0.74 0.67 0.47 0.53 0.69 0.86
0.44 0.67 0.21 0.62 0.40 0.35 0.40 0.44 0.76
0.98 0.98 0.76 0.98 0.94 0.90 0.91 0.93 0.98
0.61 0.79 0.32 0.75 0.56 0.49 0.54 0.60 0.85
0.52 0.76 0.19 0.55 0.38 0.29 0.37 0.40 0.81
0.98 0.99 0.80 0.98 0.94 0.91 0.92 0.93 0.99
0.67 0.86 0.31 0.68 0.54 0.44 0.52 0.56 0.89
0.25 0.14 0.44 0.51 0.32 0.43 0.53 0.45 0.80
0.96 0.94 0.94 0.96 0.88 0.89 0.93 0.90 0.99
0.40 0.24 0.60 0.66 0.47 0.58 0.66 0.59 0.88
0.21 0.74 0.25 0.38 0.31 0.44 0.20 0.46 0.84
0.97 0.98 0.91 0.97 0.91 0.88 0.85 0.86 0.99
0.49 0.84 0.39 0.54 0.46 0.57 0.31 0.59 0.91
0.99 0.98 0.90 0.99 0.96 0.97 0.94 0.98 0.99
0.71 0.83 0.34 0.74 0.58 0.59 0.56 0.71 0.93
0.41 0.67 0.29 0.62 0.43 0.46 0.44 0.48 0.79
0.97 0.98 0.85 0.98 0.93 0.94 0.92 0.96 0.99
0.58 0.80 0.42 0.75 0.57 0.61 0.58 0.64 0.88
0.63 0.81 0.31 0.63 0.43 0.36 0.59 0.46 0.85
0.98 0.99 0.90 0.98 0.95 0.93 0.97 0.95 0.99
0.76 0.89 0.46 0.76 0.58 0.52 0.74 0.61 0.92
0.39 0.33 0.50 0.53 0.53 0.62 0.64 0.67 0.82
0.95 0.97 0.95 0.97 0.95 0.95 0.95 0.96 0.99
0.53 0.50 0.65 0.68 0.68 0.75 0.76 0.78 0.89
0.22 0.79 0.16 0.40 0.27 0.30 0.46 0.43 0.85
0.97 0.98 0.85 0.97 0.95 0.91 0.91 0.92 0.99
0.33 0.88 0.27 0.56 0.42 0.45 0.61 0.58 0.92
0.98 0.98 0.91 0.99 0.94 0.93 0.98 0.96 0.99
0.67 0.79 0.34 0.73 0.56 0.34 0.62 0.69 0.90
0.38 0.65 0.23 0.62 0.42 0.37 0.39 0.45 0.79
0.97 0.98 0.79 0.98 0.90 0.91 0.91 0.93 0.98
0.55 0.78 0.35 0.75 0.57 0.51 0.53 0.61 0.88
0.68 0.77 0.25 0.60 0.41 0.30 0.32 0.40 0.81
0.99 0.98 0.86 0.98 0.93 0.91 0.91 0.94 0.98
0.76 0.86 0.38 0.74 0.56 0.45 0.47 0.56 0.88
0.35 0.22 0.41 0.52 0.40 0.49 0.57 0.50 0.81
0.94 0.95 0.93 0.96 0.92 0.92 0.94 0.92 0.99
0.51 0.36 0.56 0.54 0.56 0.63 0.71 0.64 0.88
0.12 0.35 0.13 0.39 0.25 0.25 0.42 0.25 0.84
0.98 0.94 0.83 0.97 0.94 0.87 0.89 0.89 0.99
0.20 0.51 0.22 0.55 0.39 0.38 0.56 0.39 0.91
0.98 0.99 0.71 0.99 0.97 0.93 0.92 0.96 0.99
0.55 0.84 0.71 0.58 0.34 0.56 0.64 0.91
0.43 0.71 0.34 0.62 0.39 0.37 0.38 0.42 0.78
0.98 0.98 0.88 0.98 0.94 0.90 0.91 0.97 0.98
0.60 0.82 0.48 0.75 0.55 0.51 0.53 0.58 0.87
0.53 0.81 0.25 0.71 0.41 0.30 0.36 0.52 0.85
0.98 0.99 0.86 0.99 0.91 0.91 0.92 0.96 0.99
0.69 0.89 0.38 0.82 0.56 0.45 0.51 0.67 0.91
0.37 0.22 0.53 0.51 0.65 0.70 0.66 0.69 0.82
0.95 0.97 0.96 0.96 0.97 0.96 0.96 0.94 0.99
0.52 0.36 0.68 0.66 0.77 0.80 0.77 0.79 0.89
0.21 0.79 0.20 0.38 0.32 0.29 0.25 0.32 0.83
0.96 0.98 0.85 0.97 0.92 0.89 0.90 0.92 0.99
0.33 0.87 0.31 0.54 0.46 0.43 0.39 0.46 0.90
0.98 0.98 0.95 0.99 0.96 0.96 0.93 0.96 0.99
0.72 0.88 0.46 0.73 0.64 0.50 0.57 0.67 0.91
0.33 0.78 0.29 0.62 0.47 0.34 0.43 0.48 0.79
0.97 0.98 0.84 0.98 0.92 0.95 0.92 0.96 0.98
0.50 0.86 0.42 0.75 0.62 0.49 0.57 0.64 0.87
0.68 0.76 0.22 0.66 0.40 0.38 0.38 0.46 0.79
0.99 0.99 0.83 0.98 0.93 0.94 0.93 0.94 0.98
0.76 0.86 0.34 0.78 0.56 0.54 0.54 0.62 0.88
0.37 0.32 0.43 0.50 0.45 0.57 0.64 0.58 0.82
0.95 0.97 0.95 0.96 0.96 0.94 0.96 0.94 0.99
0.52 0.48 0.59 0.65 0.61 0.71 0.76 0.80 0.88
0.15 0.79 0.17 0.39 0.21 0.26 0.25 0.27 0.88
0.98 0.98 0.84 0.97 0.93 0.91 0.89 0.91 0.99
0.24 0.87 0.27 0.56 0.34 0.40 0.39 0.41 0.93
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
207
Fig. 5. Robustness to salt and pepper noise: We add input video data with salt and pepper noise having 0.02 noise. The 1–6 columns illustrate input noisy video data, ground truth, foreground masks estimated using DECOLOR, OSTD, IBT and the proposed DNLRl1 TV method respectively; First and Second row: static background case691th and 747th frame of highway sequence, Third and Fourth row: dynamic background case-1999th and 2000nd frame of boat sequence. For the foreground masks, white represents correctly detected foreground.
4.2. Response to Poisson noise corrupted videos In some cases, the image sensor produces Poisson noise which affects the dark parts of an image. In order to find the immunity of the proposed method against Poisson noise, we added Poisson noise to each frame. The f -measure values obtained using the different real world videos from CD.net datasets for eight existing methods and DNLRl1 TV are summarized in sixth section of Table 1. For each dataset the proposed method gives good performance. f1 -value of proposed method is 0.99 in almost every cases. Fig. 7 shows the visual results of proposed method, DECOLOR [10], IBT [1], and OSTD [19] with Poisson noise. Here, we use different frames of the turnpike sequence from CD.net dataset, which is an example of low frame rate data sequence. In Fig. 7, the first row shows the different frames of turnpike dataset, second row is the ground truth of these data sequences and third, fourth, fifth and sixth row corresponds to [10,19,1,24,26,2,25] and proposed method results. Comparing to other methods, the proposed method detects many foreground pixels in the ground truth. Comparing to the results of DECOLOR [10] method, the proposed method yields far better result. It is clear that DECOLOR [10] method cannot detect small sized object. Same is in the case of OSTD [19]. So this will reduce R0 value of these methods. IBT [1] misclassified the noise as foreground pixels. Hence its performance is adversely affected. However, proposed method efficiently recognizes and captures the small moving object. 4.3. Response to speckle noise corrupted videos Speckle noise is a multiplicative noise and has granular pattern. Removing this noise is one of the major challenge and least touched issue. For evaluating the performance of proposed method, speckle noise with variance values 0.02, and 0.05 are added. The obtained f -measure values are described in fourth and fifth section of Table 1. From the result it is clear that the proposed method yields better f -measure values. The qualitative results are shown in Fig. 6. From Fig. 6, it can be seen that the proposed method can detect and remove noise efficiently. Detection of moving object becomes a very difficult task when weather conditions are not pleasing like snow on the ground, fog,
snow storm etc. The proposed method is tested with bad weather sequence and Fig. 6, third to last row illustrates the visually favorable foreground. In this case, the foreground is comprised of cars which are not clearly visible in the original frame, but the method was efficient enough to gather the sparse components properly to discover the foreground part. 4.4. Response to salt and pepper noise corrupted videos Salt and pepper noise is an impulse type of noise, which is also referred to as intensity spikes. This is generally produced due to the errors in data transmission. It presents itself as sparsely occurring white and black pixels. Salt and pepper noise are high frequencies, so for salt noise the values of this noise type are high, and for the pepper noise the values of this noise type are low. To test the immunity of the proposed method against salt and pepper noise, the input video data is mixed with these types of noises. The observation obtained for a corrupted video sequence with variance 0.02 is disclosed in Fig. 5. The first two rows are from highway dataset. Comparing to other methods, proposed method can detect the small moving car. Third and fourth rows illustrate the dynamic background results. Proposed method detects boat and person effectively and also removes the dynamic background variation (as shown in last column of third and fourth row in Fig. 5). The performance evaluation results for noise corrupted videos on comparing with [15,10,1,19,26,2,25,24] are shown in last section of Table 1. Comparing to the evaluated state of art methods, the proposed method give better visual and quantitative results. 4.5. Computational complexity analysis and running time comparison The time required for the execution of the forenamed methods [10,15,1,19] for one hundred frames of the special case sequences in the CD.net dataset [62], with frame size 100, is analyzed and tabulated in Table 2. This implementation is carried out in MATLAB version 2014 on a single PC with its CPU specification being Intel (R) Core(TM) CPU i5-4590, 3.30 GHz for 8GB RAM and 64 bit operating system. To analyze the computational efficiency, consider a = MN and b = P. From Algorithm 1, each outer iteration requires four rules
208
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
Fig. 6. Robustness to Speckle noise: We add input video data with speckle noise having 0.02 variance. The 1–6 columns illustrate input noisy video data, ground truth, foreground masks estimated using DECOLOR, OSTD, IBT and the proposed DNLRl1 TV method respectively; First and Second row: static background case- 458th and 694th frame of pedestrian sequence, Third and Fourth row: camera jitter case- 814th and 815th frame of boulevard sequence, Fifth, Sixth and Seventh row: bad weather case807th, 832nd and 852nd frame of snowfall sequence, Eight and Ninth row: badweather sequence- 813th and 1351th frame of skating sequence. For the foreground masks, white represents correctly detected foreground.
for update with respect to B, S , C , and multipliers respectively. For updating B, first the SVD of an a × b matrix is to be found. Then shrink the singular value matrix and multiplies this matrix with two singular vector matrices. i.e the update of B requires O(a2 b + ab2 + b3 ) floating point operations. Element-wise addition and shrinkage operations of a × b matrices required for
updating C , U and multipliers which require O(ab) floating point operations. In short, each outer iteration of proposed algorithm requires O(a2 b + ab2 + b3 + 4ab) floating point operations. However, TVRPCA [15] requires O(a2 b + ab2 + b3 + 3ab + 20 (ablog(ab))) floating point operations. So the proposed method offers moderate computational complexity.
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
209
Table 2 Comparison of Running Time. (BEST: BOLD). Sequence
IBT [1]
TVRPCA [15]
DECOLOR [10]
OSTD [19]
GRASTA [2]
Proposed
Highway (Static background) Backdoor (Shadow) Boulevard (Camera jitter) Turnpike(Low Frame rate) Boats (Dynamic Background)
46.72 s 43.07 s 34.94 s 31.94 s 23.34 s
54.81 s 49.61 s 62.11 s 45.38 s 42.43 s
19.27 s 13.60 s 13.25 s 16.40 s 11.10 s
21.21 s 18.95 s 16.28 s 16.35 s 9.69 s
12.23 s 11.24 s 31.42 s 20.26 s 11.45 s
9.05 s 9.07 s 9.32 s 8.19 s 5.94 s
Fig. 7. Robustness to Poisson noise: 1–6th Column illustrates the input noisy video data, ground truth, foreground masks estimated using DECOLOR, OSTD, IBT and the proposed DNLRl1 TV method respectively. Each row corresponds to 816, 822 and 999th frames of turnpike sequence, which belongs to low frame rate case.
5. Conclusion
[8] J. Lei, Data driven based method for field information sensing, Math. Probl. Eng. (2014).
The proposed DNLRl1 TV method efficiently separates the background and foreground from noisy video sequence using low rank minimization and l1 , l2 and TV -norm regularizations. In this paper, we compared the proposed method with other methods, by adding different types of noises like Gaussian, Poisson, salt and pepper and speckle noises. In all these cases, the proposed method gives good visual results and f -measure values. The proposed method was tested with different challenges like shadow, camera jitter, bad weather, dynamic background, etc. For each case, our method gives better visual and quantitative results than that of other evaluated methods. The existing methods used for comparison are DECOLOR [10], TVRPCA [15], IBT [1], OSTD [19], GOSUS [26], GRASTA [2], incPCP [25] and ReProcs [24]. It is found that the proposed method outshines the compared methods.
[9] N.M. Oliver, B. Rosario, A.P. Pentland, A Bayesian computer vision system for modeling human interactions, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 831–843.
References [1] L. Chen, Y. Liu, C. Zhu, Iterative block tensor singular value thresholding for extraction of lowrank component of image data, in: Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 2017, pp. 1862–1866. [2] J. He, L. Balzano, A. Szlam, Incremental gradient on the grassmannian for online foreground and background separation in subsampled video, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 1568–1575. [3] X. Li, M.K. Ng, X. Yuan, Median filtering-based methods for static background extraction from surveillance video, Numer. Linear Algebra Appl. 22 (5) (2015) 845–865. [4] C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for realtime tracking, in: Computer Vision and Pattern Recognition, 1999 IEEE Computer Society Conference on, vol. 2, 1999, pp. 246–252. [5] H. Lee, H. Kim, J.-I. Kim, Background subtraction using background sets with image-and color-space reduction, IEEE Trans. Multimed. 18 (10) (2016) 2093– 2103. [6] Z. Zhang, X. Liang, Y. Ma, Unwrapping low-rank textures on generalized cylindrical surfaces, IEEE, 2011, pp. 1347–1354. [7] S. Li, Y. Fu, Learning robust and discriminative subspace with low-rank constraints, IEEE Trans. Neural Netw. Learn. Syst. 27 (11) (2016) 2160–2173.
[10] X. Zhou, C. Yang, W. Yu, Moving object detection by detecting contiguous outliers in the low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 597–610. [11] T. Bouwmans, A. Sobral, S. Javed, S.K. Jung, E.-H. Zahzah, Decomposition into low-rank plus additive matrices for background/foreground separation: A review for a comparative evaluation with a large-scale dataset, Computer Sci. Rev. 23 (2017) 1–71. [12] T. Bouwmans, Traditional and recent approaches in background modeling for foreground detection: An overview, Comput. Sci. Rev. 11 (2014) 31–66. [13] N. Vaswani, T. Bouwmans, S. Javed, P. Narayanamurthy, Robust PCA and Subspace Tracking, arXiv preprint. arXiv:1711.09492. [14] T. Bouwmans, E.H. Zahzah, Robust PCA via principal component pursuit: A review for a comparative evaluation in video surveillance, Comput. Vis. Image Underst. 122 (2014) 22–34. [15] X. Cao, L. Yang, X. Guo, Total variation regularized RPCA for irregularly moving object detection under dynamic background, IEEE Trans. Cybern. 46 (4) (2016) 1014–1027. [16] X. Liu, G. Zhao, J. Yao, C. Qi, Background subtraction based on low-rank and structured sparse decomposition, IEEE Trans. Image Process. 24 (8) (2015) 2502–2514. [17] W. Cao, Y. Wang, J. Sun, D. Meng, C. Yang, A. Cichocki, Z. Xu, Total variation regularized tensor rpca for background subtraction from compressive measurements, IEEE Trans. Image Process. 25 (9) (2016) 4075–4090. [18] C. Chen, X. Li, M.K. Ng, X. Yuan, Total variation based tensor decomposition for multi-dimensional data with time dimension, Numer. Linear Algebra Appl. 22 (6) (2015) 999–1019. [19] S. Javed, T. Bouwmans, S.K. Jung, Stochastic decomposition into low rank and sparse tensor for robust background subtraction, in: 6th International Conference on Imaging for Crime Prevention and Detection, ICDP-2015, 2015, pp. 5–7. [20] W. Hu, Y. Yang, W. Zhang, Y. Xie, Moving object detection using tensorbased low-rank and saliently fused-sparse decomposition, IEEE Trans. Image Process. 26 (2) (2017) 724–737. [21] K. Braman, Third-order tensors as linear operators on a space of matrices, Linear Algebra Appl. 433 (7) (2010) 1241–1253. [22] P. Narayanamurthy, N. Vaswani, Medrop: Memory-efficient dynamic robust pca, arXiv preprint. arXiv:1712.06061.
210
S. B. et al. / Future Generation Computer Systems 90 (2019) 198–210
[23] A. Sobral, C.G. Baker, T. Bouwmans, . E.-h. Zahzah, Incremental and multifeature tensor subspace learning applied for background modeling and subtraction, in: International Conference Image Analysis and Recognition, 2014, pp. 94–103. [24] H. Guo, C. Qiu, N. Vaswani, Practical reprocs for separating sparse and lowdimensional signal sequences from their sum—part 1, in: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 4161–4165. [25] P. Rodriguez, B. Wohlberg, Incremental principal component pursuit for video background modeling, J. Math. Imaging Vision 55 (1) (2016) 1–18. [26] J. Xu, V.K. Ithapu, L. Mukherjee, J.M. Rehg, V. Singh, Gosus: Grassmannian online subspace updates with structured-sparsity, in: Computer Vision (ICCV), 2013 IEEE International Conference on, 2013, pp. 3376–3383. [27] F. Seidel, C. Hage, M. Kleinsteuber, prost: A smoothed lp -norm robust online subspace tracking method for background subtraction in video, Mach. Vis. Appl. 25 (5) (2014) 1227–1240. [28] S. Javed, T. Bouwmans, M. Sultana, S.K. Jung, Moving object detection on rgb-d videos using graph regularized spatiotemporal rpca, 2017, pp. 230–241. [29] S. Javed, T. Bouwmans, S.K. Jung, Combining arf and or-pca for robust background subtraction of noisy videos, in: International Conference on Image Analysis and Processing, 2015, pp. 340–351. [30] S. Javed, A. Mahmood, T. Bouwmans, S.K. Jung, Background–foreground modeling based on spatiotemporal sparse subspace clustering, IEEE Trans. Image Process. 26 (12) (2017) 5840–5854. [31] A. Sobral, T. Bouwmans, E.-h. ZahZah, Double-constrained rpca based on saliency maps for foreground detection in automated maritime surveillance, in: Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on, 2015, pp. 1–6. [32] N. Dastanova, S. Duisenbay, O. Krestinskaya, A.P. James, Bit-plane extracted moving-object detection using memristive crossbar-cam arrays for edge computing image devices, IEEE Access 6 (2018) 18954–18966. [33] F.-C. Cheng, B.-H. Chen, S.-C. Huang, A hybrid background subtraction method with background and foreground candidates detection, ACM Trans. Intell. Syst. Technol. 7 (1) (2015) 7. [34] C.-H. Yeh, C.-Y. Lin, K. Muchtar, H.-E. Lai, M.-T. Sun, Three-pronged compensation and hysteresis thresholding for moving object detection in real-time video surveillance, IEEE Trans. Ind. Electron. 64 (6) (2017) 4945–4955. [35] S.-C. Huang, An advanced motion detection algorithm with video quality analysis for video surveillance systems, IEEE Trans. Circuits Syst. Video Technol. 21 (1) (2011) 1–14. [36] S.-C. Huang, B.-H. Chen, Highly accurate moving object detection in variable bit rate video-based traffic monitoring systems, IEEE Trans. Neural Netw. Learning Syst. 24 (12) (2013) 1920–1931. [37] S.-C. Huang, B.-H. Chen, Automatic moving object extraction through a realworld variable-bandwidth network for traffic monitoring systems, IEEE Trans. Ind. Electron. 61 (4) (2014) 2099–2112. [38] G. Lee, R. Mallipeddi, G.-J. Jang, M. Lee, A genetic algorithm-based moving object detection for real-time traffic surveillance, IEEE Signal Process. Lett. 22 (10) (2015) 1619–1622. [39] D.K. Panda, S. Meher, Detection of moving objects using fuzzy color difference histogram based background subtraction, IEEE Signal Process. Lett. 23 (1) (2016) 45–49. [40] P. Chiranjeevi, S. Sengupta, Detection of moving objects using multi-channel kernel fuzzy correlogram based background subtraction, IEEE Trans. Cybern. 44 (6) (2014) 870–881. [41] S.-C. Huang, B.-H. Do, Radial basis function based neural network for motion detection in dynamic scenes, IEEE Trans. Cybern. 44 (1) (2014) 114–125. [42] Y.M.E. Candes, X. Li, J. Wright, Robust principal component analysis? J. ACM 58 (3) (2011) 11–20. [43] Z.L. He W., H. Zhang, H. S., Total-variation-regularized low-rank matrix factorization for hyperspectral image restoration, IEEE Trans. Geosci. Remote Sens. 54 (1) (2016) 178–188. [44] C. Jiang, H. Zhang, L. Zhang, H. Shen, Q. Yuan, Hyperspectral image denoising with a combined spatial and spectral weighted hyperspectral total variation model, Can. J. Remote Sens. 42 (1) (2016) 53–72. [45] Y. Xie, Y. Qu, D. Tao, W. Wu, Q. Yuan, W. Zhang, Hyperspectral image restoration via iteratively regularized weighted schatten p-norm minimization, IEEE Trans. Geosci. Remote Sens. 54 (8) (2016) 4642–4659. [46] Y. Chen, T.-Z. Huang, X.-L. Zhao, L.-J. Deng, J. Huang, Stripe noise removal of remote sensing images by total variation regularization and group sparsity constraint, Remote Sens. 9 (6) (2017) 559. [47] L. Zhu, Y. Hao, Y. Song, L1/2 norm and spatial continuity regularized lowrank approximation for moving object detection in dynamic background, IEEE Signal Process. Lett. 25 (1) (2018) 15–19. [48] M.C.Z. Lin, Y. Ma, The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices, arxiv preprint arxiv, 2011. [49] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, Y. Ma, Robust photometric stereo via low-rank matrix completion and recovery, in: Asian Conference on Computer Vision, Springer, 2010, pp. 703–717. [50] W.-Z. Shao, Q. Ge, Z.-L. Gan, H.-S. Deng, H.-B. Li, A generalized robust minimization framework for low-rank matrix recovery, Math. Probl. Eng. (2014).
[51] S.V.M. Sagheer, S.N. George, Ultrasound image despeckling using low rank matrix approximation approach, Biomed. Signal Process. Control 38 (2017) 236–249. [52] T. Bouwmans, N.S. Aybat, E.-h. Zahzah, Handbook of Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing, CRC Press, 2016. [53] N. Batmanghelich, A. Gooya, S. Kanterakis, B. Taskar, C. Davatzikos, Application of Trace-Norm and Low-Rank Matrix Decomposition for Computational Anatomy, IEEE, 2010, pp. 146–153. [54] S. Brutzer, B. Höferlin, G. Heidemann, Evaluation of background subtraction techniques for video surveillance, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011, pp. 1937–1944. [55] A. Chambolle, V. Caselles, D. Cremers, M. Novaga, T. Pock, An introduction to total variation for image analysis, Theor. Found. Numer. Methods Sparse Recovery 9 (263–340) (2010) 227. [56] A. Beck, M. Teboulle, Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems, IEEE Trans. Image Process. 18 (11) (2009) 2419–2434. [57] L.I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D 60 (1–4) (1992) 259–268. [58] B. Madathil, S.N. George, Twist tensor total variation regularized-reweighted nuclear norm based tensor completion for video missing area recovery, Inform. Sci. 423 (2018) 376–397. [59] J.-F. Cai, E.J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (1956) 2010. [60] T. Boas, A. Dutta, X. Li, K.P. Mercier, E. Niderman, Shrinkage function and its applications in matrix approximation, arXiv preprint. arXiv:1601.07600. [61] S. Brutzer, B. Höferlin, G. Heidemann, Evaluation of background subtraction techniques for video surveillance, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011, pp. 1937–1944. [62] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, P. Ishwar, Changedetection. net: A new change detection benchmark dataset, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, 2012, pp. 1–8. [63] A. Vacavant, T. Chateau, A. Wilhelm, L. Lequièvre, A benchmark dataset for outdoor foreground/background extraction, Asian Conf. Comput. Vis. 12 (2012) 291–300. [64] D. Giveki, G.A. Montazer, M.A. Soltanshahi, Atanassov’s intuitionistic fuzzy histon for robust moving object detection, Internat. J. Approx. Reason. 91 (2017) 80–95.
Shijila B. received B.Tech degree in Electronics and Communication Engineering from University of Kannur, Kerala, India in 2012. She is currently pursuing M.Tech degree in Signal Processing from National Institute of Technology Calicut, India.
Anju Jose Tom received B. Tech degree in Applied Electronics and Instrumentation Engineering from Mahatma Gandhi University, Kerala, India and M.Tech in Signal Processing from CUSAT University, Kerala, India in 2014 and 2016 respectively. Currently she is pursuing Ph.D. in National Institute of Technology Calicut. Her research interests include image processing, low rank approximation, moving object detection.
Sudhish N. George received B.Tech degree in Electronics and Communication Engineering from M.G University, Kerala, India, in 2004, M.Tech degree in signal processing from Kerala University, India, in 2007 and PhD in 2014, in multimedia security from National Institute of Technology Calicut, India. He is working as Assistant Professor in Department of Electronics and Communication, National Institute of Technology Calicut from 2010 onwards. His current interests include sparse signal processing and low rank recovery etc.