Effects of Texture Addition on Optical Flow Performance in Images with Poor Texture

Mehran Andalibi, Lawrence L. Hoberock, Hossein Mohamadipanah

PII: S0262-8856(15)00055-4
DOI: 10.1016/j.imavis.2015.04.008
Reference: IMAVIS 3411

To appear in: Image and Vision Computing

Received date: 5 May 2014
Revised date: 15 March 2015
Accepted date: 28 April 2015

Please cite this article as: Mehran Andalibi, Lawrence L. Hoberock, Hossein Mohamadipanah, Effects of Texture Addition on Optical Flow Performance in Images with Poor Texture, Image and Vision Computing (2015), doi: 10.1016/j.imavis.2015.04.008

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Effects of Texture Addition on Optical Flow Performance in Images with Poor Texture
Mehran Andalibi∗, Lawrence L. Hoberock, Hossein Mohamadipanah

Advanced Technology Research Center, Oklahoma State University, Stillwater, OK 74078, U.S.A.
Abstract
This paper investigates the effects of adding texture to images with poorly-textured regions on optical flow performance, namely the accuracy of foreground boundary detection and computation time. Despite significant improvements in optical flow computations, poor texture still remains a challenge to even the most accurate methods. Accordingly, we explored the effects of simple modification of images, rather than the algorithms. To localize and add texture to poorly-textured regions in the background, which induce the propagation of foreground optical flow, we first perform a texture segmentation using Laws' masks and generate a texture map. Next, using a binary frame difference, we constrain the poorly-textured regions to those with negligible motion. Finally, we calculate the optical flow for the modified images with added texture using the best optical flow methods available. It is shown that if the threshold used for binarizing the frame difference is in a specific range determined empirically, variations in the final foreground detection will be insignificant. Employing the texture addition in conjunction with leading optical flow methods on multiple real and animation sequences with different texture distributions revealed considerable advantages, including improvement in the accuracy of foreground boundary preservation, prevention of object merging, and reduction in the computation time. The F-measure and the Boundary Displacement Error metrics were used to evaluate the similarity between detected and ground-truth foreground masks. Furthermore, preventing foreground optical flow propagation and reduction in the computation time are discussed using analysis of optical flow convergence.

Keywords: Optical Flow, Poor Texture, Foreground Detection, Laws' Masks, F-Measure, Boundary Displacement Error, Condition Number.
∗ Corresponding author.
Email addresses: [email protected] (Mehran Andalibi), [email protected] (Lawrence L. Hoberock), [email protected] (Hossein Mohamadipanah)

Preprint submitted to Elsevier Journal of Image and Vision Computing, March 15, 2015

1. Introduction

Accurate optical flow computation is crucial in many computer vision tasks, including motion estimation, object detection, and tracking. Three decades after the seminal contribution by Horn and Schunck [1], the accuracy of optical flow computation methods has improved significantly. However, images with poor texture, especially in the background, which occur in
many sequences, still remain a major challenge in this field [2]. Since solving for the optical flow components using the optical flow constraint is an ill-posed problem with two unknowns and one equation, extra constraint(s) are needed. Spatial smoothness of the optical flow components, introduced by Horn and Schunck (HS), is one of the most common constraints, used with various modifications in different publications, such as [3, 4, 5, 6]. The smoothness constraint causes blurring of the computed motion at object boundaries, together with the spread of nonzero foreground flow to neighboring background pixels. As we will see in 2.1, while making optical flow computation possible, in images with poorly-textured regions the smoothness constraint leads to disadvantages such as considerable deformations in the size and shape of the detected foreground objects, and accordingly in the position of the center area, which results in errors in foreground diagnosis and tracking. This is shown in the first row of Fig 1 for a sequence in which a wooden model (only the upper body) and its cast shadows move against a background with poor texture. The first and second frames are shown in parts (a) and (b), respectively; the magnitude of optical flow calculated by the method in [4] is shown in part (c), where propagation of the object flow to the neighboring background pixels with poor texture has deformed the object shape and led to difficulty in foreground detection. In images with multiple moving objects within a small region, smoothness of optical flow can lead to object merging. This is illustrated in the second row of Fig 1, where multiple cars with cast shadows move close to each other on a highway with insufficient texture. The first and second frames are shown in parts (d) and (e), respectively; the magnitude of optical flow calculated by the method in [6] is shown in part (f), where object merging is observable. Another negative effect of computing optical flow for poorly-textured regions is the considerable computation time required to solve the time-consuming Laplace equation with boundary conditions.
Figure 1: Negative effects of the smoothness constraint on images with poorly-textured regions: First frame of the Wooden Model Sequence (a); Second frame of the Wooden Model Sequence (b); Optical flow magnitude computed according to [4] where the object shape is distorted (c); First frame of the Highway Sequence (d); Second frame of the Highway Sequence (e); Optical flow magnitude computed according to [6] where object merging has occurred (f).
Researchers have attempted to overcome the negative effects of the smoothness term since the HS contribution. Nagel and Enkelmann [7] employed oriented derivatives in the smoothness term, observing that motion boundaries coincide with abrupt light intensity transitions. Using heuristically determined smoothness across and along the object boundaries, Alvarez et al. [8] proposed a modification improving the method of Nagel and Enkelmann. A manually-designed probabilistic model using a Markov Random Field (MRF) and a statistical model using patch-based motion discontinuity were used to relate light intensity edges and motion boundaries by Black [9] and Fleet et al. [10], respectively. Lei et al. [11] adopted a variable weight for the effectiveness of the smoothness term in the HS formulation; the weight coefficient is adapted through a threshold function based on the detection of gray boundaries and on the real-time detection of movement boundaries in the iterative process. The method of Nir et al. [12] solves for six affine parameters at each pixel position instead of two flow components. Sun et al. [13] as well as Werlberger et al. [2] modified the total energy function by adding non-local smoothness terms that employ adaptive weights for each pixel, which is essentially equivalent to applying median filtering after every warping step. Anisotropic weighting of the smoothness term is a recently employed breakthrough, including substitution of the standard quadratic penalizing function by the anisotropic Huber-L1 norm, first introduced in [14] and used in [15] and [16], and the application of smaller weights along the intensity boundaries compared to the orthogonal direction in [2]. A similar approach was proposed by Zimmer et al. [17], in which brightness constancy is used to determine the weights rather than the intensity gradient. A harmonic constraint has been imposed on the isotropic gradient vector field to create anisotropic diffusion in [18] and [19], where the authors utilized the divergence and curl of the vector field: the divergence controls the amount of diffusion, and the curl term controls the diffusion direction. Aubert et al. [20] added an extra term that penalizes computed motion in homogeneous blocks and only allows large values of the optical flow components in textured regions. Despite significant improvements in the suppression of motion blurring at object boundaries, even the accurate and sophisticated leading methods in the Middlebury¹, KITTI², and MPI Sintel³ rankings, such as [4], [6], and [13], fail to capture the proper size and contour of the foreground in images with poorly-textured background regions. Note that employing more complicated cost functions leads to larger computation times, which is detrimental in real-time tracking procedures.

After numerous observations of optical flow results using state-of-the-art methods, we came to the conclusion that no matter how sophisticated the algorithm, performance can be undesirable if the original frames have poor texture. This encouraged us to explore the outcomes of modifying the original images, rather than modifying the computing algorithms. Regions with poor texture exist both in the background and in the foreground; however, the regions in the background are the main cause of the propagation of foreground flow to neighboring pixels, object shape distortion, and even object merging. Furthermore, adding a static texture to these regions can be performed with sufficient accuracy. However, to generate a moving texture for the foreground regions, we would need to know the pixels' correspondence, which is not possible without calculating the optical flow. If we were to calculate optical flow once and use it again for generating a moving texture, even small inaccuracies in the optical flow vector field would lead to addition of texture to the wrong pixels and hence induce erroneous flow.

1 http://vision.middlebury.edu/flow/
2 http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=flow
3 http://ps.is.tue.mpg.de/project/MPI_Sintel_Flow
Moreover, any interpolation using feature matching is prone to mismatching errors, and it requires knowledge of the type of object motion (rigid or flexible, as they need different types of interpolation), which is not known a priori; thus it cannot be used to generate an accurate moving texture. Therefore, we treat only poorly-textured background regions and leave the modification of foreground regions to future investigations. It is important to mention that the camera is assumed to be stationary in this paper; therefore, we only add texture to the background pixels. Should the camera be moving, egomotion estimation and compensation would need to be performed before texture addition. To localize and add texture to poorly-textured regions in the background, we first perform a texture segmentation using Laws' masks and generate a texture map. Next, using a binary frame difference, we constrain the poorly-textured regions to those with negligible motion. Finally, we calculate the optical flow for the modified images with added texture. It is shown that if the threshold used for binarizing the frame difference lies in a specific range, determined empirically, variations in the final foreground detection will be insignificant. Note that optical flow calculations suffer from poor texture, and image differencing alone cannot provide reliable motion information, being significantly sensitive to illumination changes and to the binarizing threshold; however, computing optical flow while using image differencing and texture addition as just described provides improved accuracy and robustness in foreground detection, as will be shown. The main contributions of this study are: (1) creation of sharp motion boundaries and more accurate capture of the object size, position, and contour; (2) avoiding or mitigating object merging in sequences with multiple objects moving in a small area with poor texture; (3) reduction in computation time; and (4) mathematical analysis of the effects of texture addition on optical flow convergence and computation time. This paper is organized as follows: In Section 2, we describe the problem in more detail, together with the texture addition algorithm and its effects, accompanied by a mathematical analysis of optical flow convergence. Section 3 presents representative and quantitative results and discussion. Section 4 covers limitations and future work, while general conclusions are given in Section 5.

2. Algorithm and Analysis
In this section, we first discuss the problem in more detail in 2.1. Then, we describe the texture addition algorithm and its effects in 2.2, and explain these effects from a mathematical perspective in 2.3.

2.1. Problem Discussion
We will use the formulation of HS throughout this section for simpler explanations, while applying the accurate and leading methods in [4], [6], and [13] for demonstrations later. Denoting light intensity by I, the first-order spatial and temporal derivatives of light intensity by (I_x, I_y) and I_t, respectively, the optical flow components (u, v) in [1] are computed by minimizing the following total energy function:

$$\phi(u, v) = \iint \left( \rho^2 \phi_d(u, v) + \phi_c(u, v) \right) dx\, dy \qquad (1)$$

where the data error energy function is given by:

$$\phi_d(u, v) = (I_x u + I_y v + I_t)^2 \qquad (2)$$

and the smoothness energy function is defined as:

$$\phi_c(u, v) = \left( \frac{\partial u}{\partial x} \right)^2 + \left( \frac{\partial u}{\partial y} \right)^2 + \left( \frac{\partial v}{\partial x} \right)^2 + \left( \frac{\partial v}{\partial y} \right)^2. \qquad (3)$$

Parameter ρ determines the relative weight of the data term. The Euler-Lagrange equations then yield:

$$I_x^2 u + I_x I_y v - \rho^2 \nabla^2 u = -I_x I_t$$
$$I_x I_y u + I_y^2 v - \rho^2 \nabla^2 v = -I_y I_t. \qquad (4)$$

In regions where spatial light intensity variations are negligible (poor texture) along the x and y axes (I_x ≈ 0, I_y ≈ 0), (4) is approximated by the Laplace equations, with boundary conditions dictated by the neighboring windows:

$$\nabla^2 u \approx 0, \qquad \nabla^2 v \approx 0 \qquad (5)$$
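For concreteness, the following is a minimal NumPy sketch of the classical HS scheme (our illustration, not the authors' code). It uses the standard iterative formulation in which a smoothness weight `alpha` plays the role of the weight ρ in (4); the derivative kernels and the weighted-average approximation of the Laplacian follow [1]:

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(im1, im2, alpha=15.0, n_iter=200):
    """Classical Horn-Schunck optical flow (illustrative sketch)."""
    im1, im2 = im1.astype(np.float64), im2.astype(np.float64)
    # Derivative estimates averaged over the two frames, after [1]
    kx = 0.25 * np.array([[-1.0, 1.0], [-1.0, 1.0]])
    ky = 0.25 * np.array([[-1.0, -1.0], [1.0, 1.0]])
    kt = 0.25 * np.ones((2, 2))
    Ix = convolve(im1, kx) + convolve(im2, kx)
    Iy = convolve(im1, ky) + convolve(im2, ky)
    It = convolve(im2, kt) - convolve(im1, kt)
    # Weighted-average kernel: the Laplacian is approximated as (u_bar - u)
    avg = np.array([[1.0, 2.0, 1.0], [2.0, 0.0, 2.0], [1.0, 2.0, 1.0]]) / 12.0
    u, v = np.zeros_like(im1), np.zeros_like(im1)
    for _ in range(n_iter):
        u_bar, v_bar = convolve(u, avg), convolve(v, avg)
        # Jacobi update derived from the Euler-Lagrange equations (4)
        common = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_bar - Ix * common
        v = v_bar - Iy * common
    return u, v
```

Note that in poorly-textured regions, where I_x ≈ I_y ≈ 0, the update degenerates to u = ū and v = v̄, i.e. pure averaging of neighboring flow values. This is exactly the discrete form of the Laplace equations (5), and it is the mechanism by which foreground flow leaks into the background.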
If (5) holds in the background regions (outside the boundaries of the foreground objects), the nonzero optical flow of the object blocks will affect the neighboring background pixels, where zero motion is expected. This effect spreads, and the level of influence depends on the magnitude of light intensity variations around the object (as well as on the computing method); in images with considerably poor texture it can cover a region of the image substantially larger than the object itself. This is illustrated in Fig 2 for a sequence in which an airplane⁴ is moving against a uniform background sky, with the first frame, second frame, and the ground-truth of the foreground shown in parts (a), (b), and (c), respectively. The magnitude of optical flow computed according to the methods in [4], [6], and [13] can be seen in parts (d), (e), and (f) in the second row, respectively. Comparing the magnitudes of optical flow ($w = \sqrt{u^2 + v^2}$) with the ground-truth of the foreground, we can see that not only are the detected moving regions significantly larger than the object's size, especially in part (d), but also the contour of the object is not preserved.

Figure 2: First frame of the Airplane Sequence (a); Second frame of the Airplane Sequence (b); Ground-truth of the foreground (c); Magnitude of optical flow computed according to [4] (d); Magnitude of optical flow computed according to [6] (e); Magnitude of optical flow computed according to [13] (f).

4 http://www.youtube.com/watch?v=qF9VZSkVZI0

2.2. Texture Addition Algorithm and Effects

The first step is to localize poorly-textured regions in a frame. We use the Laws' masks introduced in [21] to measure the texture energy in different regions of an image. Denoting a gray-scale frame by IL, we apply Laws' 2-D convolution kernels to IL, which can be created using the following set of 1-D kernels of length three: L_3 = [1 2 1] (average gray level), E_3 = [1 0 -1] (edge extractor), and S_3 = [1 -2 1] (spot extractor). Although nine 2-D kernels can be built using the outer products of these filters, we did not use L_3^T L_3, since it only measures the average gray level in a 3 × 3 window, while we look for pixel-wise light intensity variations. Accordingly, each frame is convolved with the following set of eight 2-D masks: [L_3^T E_3, L_3^T S_3, E_3^T L_3, E_3^T E_3, E_3^T S_3, S_3^T L_3, S_3^T E_3, S_3^T S_3]. This set is capable of measuring light intensity variations in different patterns (i.e. texture) in a region. Denote this kernel set by [K_1, K_2, ..., K_8] and the convolution operation by (∗). Then the texture energy (TE) at a pixel in position (i, j) is given by:

$$TE(i, j) = \sum_{n=1}^{8} |F_n(i, j)|, \qquad F_n = K_n * IL \qquad (6)$$
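The texture-energy computation of Eq. (6) can be sketched as follows (a minimal illustration in Python/NumPy; the paper's own implementation was in Matlab, and the boundary handling here is our assumption):

```python
import numpy as np
from scipy.signal import convolve2d

# 1-D Laws kernels of length three, as defined above
L3 = np.array([1, 2, 1])   # average gray level
E3 = np.array([1, 0, -1])  # edge extractor
S3 = np.array([1, -2, 1])  # spot extractor

def texture_energy(IL):
    """Texture energy TE of Eq. (6): sum of |K_n * IL| over the eight
    Laws 2-D masks, excluding L3'L3 (pure local averaging)."""
    basis = [L3, E3, S3]
    kernels = [np.outer(a, b) for a in basis for b in basis]
    kernels = kernels[1:]  # drop L3'L3, keep the remaining eight
    te = np.zeros(IL.shape, dtype=np.float64)
    for K in kernels:
        te += np.abs(convolve2d(IL, K, mode='same', boundary='symm'))
    return te
```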
The next step is to create a binary texture energy map, denoted by TEB, which requires a threshold on the values of the texture energy, denoted here by γ. To avoid an empirically-determined threshold, which can fail for sequences not studied, we use an adaptive method based on the histogram of the texture energy values. Based on knowledge from image processing (namely, that the major portion of an image's information lies in the low texture energy values) and after investigating a large number of sequences, we found that the histogram of TE values for a typical image looks like the one shown in Fig 3 (here using 100 bins). It is a right-skewed distribution with mostly decreasing frequencies as the texture energy increases (there can be sudden increases, but very low texture energy values tend to have significantly larger frequencies). We call those regions "poorly-textured" whose texture energy levels are sufficiently different, here smaller than the other values (and with higher frequencies). In other words, we look for those texture energy values whose frequencies are so high that they are outliers in this histogram. Since the histogram of texture energy levels for most images does not follow a normal distribution, outliers are determined using the method introduced in [22] for skewed distributions. In that paper, Vanderviere and Huber introduced an adjusted boxplot taking into account the medcouple MC, a robust measure of skewness for a skewed distribution, which for a data series with sorted entries ($X_n = \{x_1, x_2, \ldots, x_n\}$, $x_1 \le x_2 \le \cdots \le x_n$) is given by:

$$MC = \underset{x_i \le med_k \le x_j}{med} \left( \frac{(x_j - med_k) - (med_k - x_i)}{x_j - x_i} \right) \qquad (7)$$
with $med$ and $med_k$ being the median operator and the median of $X_n$, respectively, where $x_i$ and $x_j$ must satisfy $x_i \le med_k \le x_j$ and $x_i \neq x_j$. Then, for right-skewed distributions like that in Fig 3, with MC ≥ 0, the boxplot limits given in (8) can be used to determine the outliers:

$$x_i < Q_1 - 1.5\, e^{-3.5\, MC}\, IQR, \qquad x_i > Q_3 + 1.5\, e^{4\, MC}\, IQR \qquad (8)$$

Figure 3: Histogram of texture energy values for a typical image.
where $Q_1$ and $Q_3$ are the first and third quartiles and IQR is the interquartile range. Here $X_n$ represents the sorted frequencies of TE values acquired from the histogram. We use the upper limit in (8) to determine those texture energy levels whose frequencies in the histogram are outliers from above. For instance, for the histogram shown in Fig 3, only the first two bins were found to be outliers; so, since 100 bins were used, a threshold of γ = 0.02 was used and multiplied by the maximum value of TE. Because the binary texture energy maps for the first and the second frames are not usually identical, due to factors such as lighting variations, we use the binary intersection operator to ensure that texture will only be added to identical locations in both frames. Denoting the texture energy maps for the first and the second frame by $TE_1$ and $TE_2$, the final binary texture map is given by:

$$TEB(i, j) = \begin{cases} 1 & \text{if } (TE_1(i, j) \ge \gamma \times \max(TE_1)) \cap (TE_2(i, j) \ge \gamma \times \max(TE_2)) \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$
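A sketch of the adaptive threshold and the binary texture map follows. This is our reading of the procedure in Eqs. (7)-(9): we assume the outlier bins are the leading (low-TE) ones, and we borrow the medcouple implementation from statsmodels rather than coding Eq. (7) directly:

```python
import numpy as np
from statsmodels.stats.stattools import medcouple  # robust skewness MC

def adaptive_gamma(te, n_bins=100):
    """Adaptive texture threshold from the TE histogram, per Eqs. (7)-(8):
    bin frequencies that are outliers from above mark the poorly-textured
    energy levels (assumed here to be the contiguous leading bins)."""
    freq, _ = np.histogram(te.ravel(), bins=n_bins)
    f = freq.astype(float)
    mc = medcouple(f)
    q1, q3 = np.percentile(f, [25, 75])
    upper = q3 + 1.5 * np.exp(4.0 * mc) * (q3 - q1)  # upper limit of Eq. (8)
    k = 0
    while k < n_bins and f[k] > upper:  # leading low-TE bins above the limit
        k += 1
    return k / float(n_bins)  # e.g. two outlier bins of 100 -> gamma = 0.02

def binary_texture_map(te1, te2, gamma):
    """Binary texture map TEB of Eq. (9), intersected over both frames."""
    m1 = te1 >= gamma * te1.max()
    m2 = te2 >= gamma * te2.max()
    return (m1 & m2).astype(np.uint8)
```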
Fig 4 shows the first and second frames from a laboratory sequence⁵, and the binary texture energy map, in parts (a), (b), and (c), respectively.

5 http://arma.sourceforge.net/shadows/

Figure 4: First frame of the Laboratory Sequence (a); Second frame of the Laboratory Sequence (b); Binary texture energy map, TEB (c).
TEB distinguishes only between regions with rich and poor textures. So, to localize and add texture only to the poorly-textured regions in the background, we must use some type of foreground detection algorithm. We opted for image differencing because of the small computation time required. Note that in addition to foreground detection (which is not accurate for poorly-textured images), optical flow also provides further information, such as the direction and magnitude of pixel displacement per frame. Furthermore, image differencing cannot detect the foreground with sufficient accuracy, due to lighting changes, small capture rates, etc., and the shape of the resulting binary map depends on the threshold utilized. Therefore, image differencing cannot replace optical flow for accurate detection of motion; it is merely used to ensure that the static texture is not added to pixels with apparent motion, so that the added texture does not induce erroneous flow. While the threshold used for binarizing the frame difference determines the shape of the binary map FDB, we show later that the final optical flow magnitude using the texture-added frames does not vary significantly, provided that the threshold β is selected from an empirically-determined range. Denoting the frame difference by $FD = IL_2 - IL_1$, we define FDB as:

$$FDB(i, j) = \begin{cases} 1 & \text{if } FD(i, j) \ge \beta \times \max(FD) \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$

Note that a fixed threshold is typically used to binarize the frame difference, but we can improve the results by employing an adaptive threshold. To avoid rendering small light intensity variations (due to illumination changes, inherent camera noise, etc.) as foreground pixels, we keep only those pixels with the largest light intensity variations, determined by comparing each pixel's variation magnitude to the maximum value of light intensity change; hence the max( ) function in Equation (10). We also perform binary image filling on FDB to avoid missing some of the moving pixels within the FDB borders. To generate the texture that will be added to the frames, we can select texture patches from available texture images or simply generate a stochastic texture. Selection of the texture type depends on the optical flow computation method, which will be discussed in Section 3. To create the stochastic texture, for each pixel (i, j) three random numbers (RND) are first drawn from a normal distribution with µ = 0 and σ = 1 for the three RGB channels ($RND(i, j, k) \sim N(0, 1)$, k = 1, 2, 3). They are then multiplied by a scalar SC that determines the magnitude of the texture for each pixel. The static stochastic texture image (STX) has the same size as both sequence frames, and is given by:

$$STX(i, j, k) = SC \times RND(i, j, k), \qquad k = 1, 2, 3. \qquad (11)$$
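The binary frame difference of Eq. (10) and the stochastic texture of Eq. (11) can be sketched as follows (our illustration: we take the absolute frame difference so changes in either direction are captured, whereas Eq. (10) writes the signed difference, and the hole filling uses scipy's binary_fill_holes as a stand-in for the binary image filling mentioned above):

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

def binary_frame_difference(IL1, IL2, beta=0.025):
    """Binary frame difference FDB of Eq. (10); beta in [0.01, 0.04] was
    found empirically to give stable results (see Fig 6)."""
    fd = np.abs(IL2.astype(np.float64) - IL1.astype(np.float64))
    fdb = fd >= beta * fd.max()
    # fill holes so moving pixels inside the FDB borders are not missed
    return binary_fill_holes(fdb).astype(np.uint8)

def stochastic_texture(shape_hw, sc=40, seed=None):
    """Static stochastic texture STX of Eq. (11): zero-mean unit-variance
    normal noise per RGB channel, scaled by SC. shape_hw is (height, width)."""
    rng = np.random.default_rng(seed)
    return sc * rng.standard_normal(shape_hw + (3,))
```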
Finally, the modified frames ($IN_1$, $IN_2$) are created by adding texture only to those regions in both original frames (here the original color frames are denoted by IM) that have poor texture and do not show apparent motion:

$$IN(i, j, k) = \begin{cases} IM(i, j, k) + STX(i, j, k) & \text{if } (TEB(i, j) = 0 \text{ and } FDB(i, j) = 0) \\ IM(i, j, k) & \text{otherwise.} \end{cases} \qquad (12)$$
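Combining the three maps per Eq. (12) is then a masked addition (a sketch; the clipping to the 8-bit range is our assumption, as the paper does not state how out-of-range values are handled):

```python
import numpy as np

def add_texture(IM, TEB, FDB, STX):
    """Modified frame IN of Eq. (12): add texture only where both the
    texture map and the frame difference are zero (static, poorly textured)."""
    IN = IM.astype(np.float64).copy()
    mask = (TEB == 0) & (FDB == 0)   # poorly-textured background pixels
    IN[mask] += STX[mask]            # applied across all RGB channels
    return np.clip(IN, 0, 255).astype(np.uint8)  # 8-bit range (assumed)
```

The same STX array would be used to build both IN1 and IN2, so the added texture is static across the two frames and induces no apparent motion of its own.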
In the first row of Fig 5, part (a) shows the binary frame difference for the Laboratory Sequence, and part (b) shows the map that localizes background regions with poor texture. Modified frames with texture addition can be seen in parts (c) and (d), respectively. Here, for frames with 8-bit unsigned integer values, SC = 40 was used for all pixels. Selection of SC will be discussed in 2.3. Magnitudes of optical flow calculated according to [4], [6], and [13] for the original and modified frames are shown in the second and third rows, respectively. Comparing the optical flow magnitude results for original and modified frames, we observe that the erroneous background flow has been suppressed, and thus the foreground boundaries have been preserved with higher accuracy by all methods, despite differences in the approaches used to calculate the optical flow components. Quantitative improvement in the preservation of foreground boundaries is discussed in Section 3.

Figure 5: First row: Binary frame difference (a); Background regions with poor texture rendered in solid white (b); Modified first frame (c); Modified second frame (d); Second row: Magnitude of optical flow for original frames computed according to [4], [6], and [13] in (e), (f), and (g), respectively; Third row: Magnitude of optical flow for modified frames computed according to [4], [6], and [13] in (h), (i), and (j), respectively.
We have illustrated the effect of β ∈ [0.002, 0.1] on the binary frame difference FDB and on the optical flow magnitude using the texture-added frames in Fig 6. In the first row, part (a) shows the magnitude of optical flow using original images computed according to [13], and part (b) shows the ground-truth for the foreground; in the second row, the binary frame difference is shown for β = 0.002, β = 0.01, β = 0.025, β = 0.04, and β = 0.1; and the third row shows the magnitude of optical flow using modified frames computed according to [13]. Very small thresholds, as in part (c), lead to detection of the majority of pixels as belonging to the foreground, since pixel light intensities do not remain unchanged under lighting changes, quantization effects, etc. Therefore, texture will not be added to the poorly-textured regions, as they are recognized as foreground regions, and the magnitude of optical flow for the modified and original images will be similar, without specific improvements. For large thresholds, as in part (g), only a fraction of foreground pixels will be correctly detected, and adding texture incorrectly to the moving pixels results in erroneous flow in the foreground regions (small flow magnitude in part (l) for some pixels). Our empirical results show that if the threshold is selected from the range β ∈ [0.01, 0.04], the magnitude of optical flow remains approximately the same, while the binary frame difference changes significantly with the threshold. Similar experiments with other sequences revealed that the range β ∈ [0.01, 0.04] (enclosed in a red rectangle in Fig 6) provides reasonable results.

Figure 6: First row: Magnitude of optical flow using original images computed according to [13] (a); Ground-truth for the foreground (b); Second row: Binary frame difference using different thresholds, (c) to (g); Third row: Magnitude of optical flow using modified images computed according to [13] with the different thresholds used for binarizing the frame difference, (h) to (l).
In images with multiple objects moving close to each other, smoothness of optical flow variations and propagation of foreground flow into neighboring background pixels lead to object merging, as observed in the second row of Fig 1 for the Highway Sequence. Modified frames and the magnitude of optical flow for these frames computed according to [6] are shown in parts (a) to (c) of Fig 7. (Here we use the method proposed in [6] rather than that in [13], since it is designed to handle large object displacements such as those in this sequence.) As compared to part (f) of Fig 1, adding texture to background regions with poor texture results in the suppression of erroneous flow for the background pixels, thus reducing object merging, although not completely eliminating the problem.

Figure 7: Modified first frame of the Highway Sequence (a); Modified second frame of the Highway Sequence (b); Magnitude of optical flow using modified images computed according to [6] (c).
The other advantage of adding texture to the images with poorly-textured regions in the background is reduction in the computation time, which is explained in 2.3. For the Laboratory Sequence shown in Fig 4, we noticed 20.1%, 7.8%, and 11.7% reduction in computation time for methods in [4], [6], and [13], respectively. Similar results have been observed in other sequences, which will be discussed in Section 3.
2.3. Analysis of Convergence and Computation Time Reduction

In this section, we explain why adding texture to poorly-textured background regions suppresses erroneous flow and why it reduces the computation time. Rewriting (4) in matrix notation helps analyze the convergence of the background optical flow values to very small values, and the reduced computation time. Mitchie et al. [23] used a discretization of the Laplace operator and rewrote (4) as a set of linear equations of the form Az = b, where A is a tridiagonal matrix, vector z contains the $2N^2$ unknown optical flow components u and v of an image with N × N pixels, and b is a constant vector of the same size as z. A proof of block-wise convergence for the unknown values in vector z using Gauss-Seidel iteration is provided in [23]. We can rewrite (4) for ρ = 1 as:

$$\nabla^2 \begin{pmatrix} u \\ v \end{pmatrix} + M \begin{pmatrix} u \\ v \end{pmatrix} = R \qquad (13)$$

where M and R are defined by:

$$M = -\begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \qquad R = \begin{bmatrix} I_x I_t \\ I_y I_t \end{bmatrix} \qquad (14)$$
and the magnitudes of $I_x$ and $I_y$ depend on SC. For the background regions where texture has been added, $I_x$ and $I_y$ are no longer negligible, while the temporal derivative $I_t$ is close to zero (because background pixels do not move and their light intensities do not change significantly with time). If we use finite difference formulas, then the higher the scalar value SC, the larger the spatial derivative terms become, and the larger the magnitudes of the entries in M and R. Since the entries of the matrix resulting from the Laplacian operator on the left-hand side of (13) are not affected by the texture magnitude, and are calculated using only a weighted averaging of the neighboring flow components (which are not large), the Laplacian operator term is significantly smaller than the entries of M and R for relatively large values of SC (SC > 20). Therefore, (13) can be written as $M \begin{pmatrix} u \\ v \end{pmatrix} \approx R$, which is approximately a set of linear homogeneous equations, whose solution must converge to zero due to the invertibility of the positive definite matrix M [23]. For the background blocks in the vicinity of the foreground objects, due to the coupling of optical flow components to the foreground values (as explained in [23]), the values of u and v will not converge to zero, but to small values. As we move further from the foreground object, the background flow approximately vanishes within a distance that depends on the magnitude of SC. The effect of SC on the suppression of the background flow can be clearly seen in Fig 8 for the Airplane Sequence of Fig 2. The first and second frames are shown in parts (a) and (b); modified second frames are shown in parts (c) to (e) for different values of SC (higher values mean stronger texture intensity); and the corresponding magnitudes of optical flow are displayed under each modified frame in (f) to (h), respectively. As the value of SC increases, the background flow vanishes at shorter distances from the object.
Figure 8: Effect of texture magnitude on the erroneous background flow suppression for the Airplane Sequence: First frame (a); Second frame (b); Modified second frame using SC = 5 (c); Modified second frame using SC = 40 (d); Modified second frame using SC = 75 (e); Magnitude of optical flow using SC = 5 computed according to [6] (f); Magnitude of optical flow using SC = 40 computed according to [6] (g); Magnitude of optical flow using SC = 75 computed according to [6] (h).
Higher values of SC, however, will lead to higher errors at the foreground object boundaries. Consider a background pixel and its immediate neighboring pixels in a poorly-textured background region that are covered by a foreground object only in the first frame (for example, pixels near boundaries). Using I for light intensity values and indices i, j, and k for the x-axis, y-axis, and time, respectively, we can write the variations in the light intensity values as:

$$I_{i,j,k+1} \to I_{i,j,k+1} + n_1 \qquad I_{i+1,j,k+1} \to I_{i+1,j,k+1} + n_2$$
$$I_{i,j+1,k+1} \to I_{i,j+1,k+1} + n_3 \qquad I_{i+1,j+1,k+1} \to I_{i+1,j+1,k+1} + n_4 \qquad (15)$$

where "→" indicates "becomes" and $n_1$ to $n_4$ are random numbers due to texture addition. If we employ the same discretizations used in [1] for the numerical differentiations in $I_x$, $I_y$, and $I_t$, then the expected values of the derivative terms are given by (16), where E[n] is the expected value of a random variable n:

$$E[I_x] \to E[I_x] + E[n_3 + n_4 - n_1 - n_2]$$
$$E[I_y] \to E[I_y] + E[n_2 + n_4 - n_1 - n_3]$$
$$E[I_t] \to E[I_t] + E[n_1 + n_2 + n_3 + n_4]. \qquad (16)$$
Therefore, selecting the random numbers in the synthetic texture from a normal distribution with expected value zero helps keep the expected values of the spatial and temporal derivative terms unchanged. Note that our zero-mean normal distribution is more likely than a zero-mean uniform distribution to produce a large number of random values close to zero. This reduces the errors around the foreground object boundaries. Higher SC values magnify the standard deviation of the foreground flow error, while suppressing the background flow faster. As we can see in Fig 8(c), employing large values of SC does not change the foreground size and shape considerably and provides negligible advantage. However, higher values of SC cause the optical flow error at the foreground boundary pixels to increase significantly. For the Airplane Sequence in Fig 8, the object undergoes only a translation of u = −5.11 and v = −2.28 pixels between the two frames. The percentages of average errors (∆u, ∆v) in the u and v components are (0.40%, 0.32%), (1.16%, 0.91%), and (2.52%, 2.11%) for the object boundary pixels using texture magnitudes of SC = 5, SC = 40, and SC = 75, respectively. Experiments with multiple sequences have revealed that a texture magnitude of SC = 40 for images with 8-bit unsigned integer values (or, equivalently, 15% of the maximum light intensity value) provides a reasonable compromise between suppression of erroneous background flow and errors in the u and v components at the object boundary pixels. Equation (13) can also help investigate the reduction in computation time. In a linear system of equations such as Az = b (where A is a positive definite matrix), the rate of convergence is directly related to the condition number κ of the coefficient matrix A, which for any symmetric positive definite matrix can be defined as the ratio of its largest eigenvalue $\lambda_{max}$ to its smallest eigenvalue $\lambda_{min}$ [24]. The higher the condition number, the slower the convergence. If we use a 9-point discretization of the Laplace operator on an image of N × N pixels, the condition number of the resulting positive definite matrix (denoted by L) is of order $O(N^2)$ [25], which can be very large (≫ 1). In images with poorly-textured backgrounds, such as the UFO sequence, since the matrix M for these regions approximately vanishes in (13), the condition number of the Laplace operator is the dominating term, and it leads to very slow convergence.
When synthetic texture is added, the entries in matrix M are no longer negligible, so we encounter the problem of the eigenvalues of the sum of two Hermitian (here, positive definite) matrices. If we denote the eigenvalues of a matrix in descending order by $\lambda_1 > \lambda_2 > \cdots > \lambda_{SZ}$, where SZ is the size of the matrix, then we can find upper and lower bounds for the largest and smallest eigenvalues of the sum of two positive definite matrices by referring to the Weyl inequality [26]:

$$\lambda_1(L + M) \le \lambda_1(L) + \lambda_1(M)$$
$$\lambda_1(L + M) \ge \max_r \left( \lambda_r(L) + \lambda_{SZ-r+1}(M) \right), \qquad r = 1, 2, \ldots, SZ$$
$$\lambda_{SZ}(L + M) \le \min_r \left( \lambda_r(L) + \lambda_{SZ-r+1}(M) \right), \qquad r = 1, 2, \ldots, SZ$$
$$\lambda_{SZ}(L + M) \ge \lambda_{SZ}(L) + \lambda_{SZ}(M). \qquad (17)$$

The condition number of L + M, denoted by κ(L + M), is the ratio of $\lambda_1(L + M)$ to $\lambda_{SZ}(L + M)$, which is bounded by the following inequalities:

$$\frac{\max_r \left( \lambda_r(L) + \lambda_{SZ-r+1}(M) \right)}{\min_r \left( \lambda_r(L) + \lambda_{SZ-r+1}(M) \right)} \le \kappa(L + M) \le \frac{\lambda_1(L) + \lambda_1(M)}{\lambda_{SZ}(L) + \lambda_{SZ}(M)}, \qquad r = 1, 2, \ldots, SZ. \qquad (18)$$
Within a poorly-textured background region, if the values of $I_x$ and $I_y$ using a very small scalar value $SC_0$ (which can correspond to the original image without texture addition) are designated $I_{x0}$ and $I_{y0}$, then amplifying the texture magnitude using a scalar value of $\eta \times SC_0$ changes these terms to $\eta \times I_{x0}$ and $\eta \times I_{y0}$, due to the linearity of finite difference calculations. Furthermore, multiplying the scalar SC by a factor of η magnifies all the entries and eigenvalues of M by a factor of η², including the smallest and largest eigenvalues. As a result, all numerators and denominators of the upper bound and lower bound terms in (18) increase in proportion to η². This leads to a reduction of both the upper and lower bounds, since $\lambda_1(L) \gg \lambda_{SZ}(L)$ and the bound terms are decreasing functions of the eigenvalues of M, or equivalently of η². The reduction in the upper and lower bounds leads to a reduction in the condition number of the equivalent matrix (M + L), such that the convergence rate increases.
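The claimed effect of texture magnitude on conditioning can be checked with a toy numerical experiment (ours, not from the paper). The sketch below uses a 5-point Laplacian on a small grid instead of the 9-point discretization, and a diagonal surrogate for M whose entries scale as η², with texture energies bounded away from zero so that M is positive definite, as assumed in [23]:

```python
import numpy as np

def laplacian_2d(n):
    """5-point Laplacian on an n x n grid with Dirichlet boundaries
    (symmetric positive definite; its condition number grows like O(n^2))."""
    d = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return np.kron(d, np.eye(n)) + np.kron(np.eye(n), d)

n = 8
L = laplacian_2d(n)
rng = np.random.default_rng(0)
# weak baseline texture energies, bounded away from zero
t0 = 0.01 * rng.uniform(0.5, 1.5, n * n)
for eta in (1, 5, 20, 80):
    M = (eta ** 2) * np.diag(t0)  # entries of M scale with eta^2, cf. (14)
    print(f"eta = {eta:3d}   cond(L + M) = {np.linalg.cond(L + M):8.1f}")
```

As η grows, the condition number of L + M falls from roughly that of L alone toward that of M alone, consistent with the behavior of the bounds in (18).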
3. Results and Discussion
In this section, we first demonstrate representative results in 3.1. Then, we evaluate the performance of the optical flow methods used in this study, with and without texture addition, in terms of foreground boundary preservation and computation time in 3.2. Finally, we provide discussion in 3.3.

3.1. Representative Results

As we have seen in the previous section, an important advantage of the preprocessing stage is the suppression of the erroneous background flow, yielding sharp motion boundaries and more accurate rendering of the foreground size and shape. To illustrate this effect, we employed texture addition in conjunction with the accurate and leading optical flow methods in [4], [6], and [13] on ten sequences with different texture distributions and numbers of moving objects, eight of which
are captured from real videos and two of which are animations. The number of methods we used is limited by the availability of the publication and the algorithm code, because many leading algorithms in the databases are from anonymous subscribers, without access to the publication or the algorithm. The method in [4] is an accurate algorithm with similar performance over all sequences used in this study, so it can be considered a reference for comparisons. The method in [6] is one of the leading methods in the KITTI database, and can handle large displacements more efficiently. The method in [13] uses median filtering, which leads to higher accuracy of object boundary preservation in many sequences; this method was ranked 1st in the Middlebury database in 2010, and is currently a leading method in the KITTI and MPI Sintel databases. Due to the large number of sequences and images, and to maintain sufficient space for each sequence, we have divided the results into two separate figures, Fig 9 and Fig 10. For each sequence, the results are displayed in one column, where the images from top to bottom are, respectively: the modified first frame, the original second frame, the ground-truth mask for the foreground, the magnitude of optical flow using the method in [4] for original images, the magnitude of optical flow using the method in [4] for modified images, the magnitude of optical flow using the method in [6] for original images, the magnitude of optical flow using the method in [6] for modified images, the magnitude of optical flow using the method in [13] for original images, and the magnitude of optical flow using the method in [13] for modified images. To generate ground-truth images, we used the edge maps of the frames, delineated by the Canny edge detector with manually selected parameters determined to render all edge pixels. Next, we asked multiple volunteers to manually eliminate extra edge pixels and connect non-connected edges. Finally, image filling was performed on the accurate edge maps. In Fig 9, the first sequence (Laboratory Sequence) shows a person moving away from the camera in a laboratory⁶, where the cabinet doors surrounding his body have poor texture. The second sequence shows a wooden model⁷ (Wooden Model Sequence), where the upper body and cast shadows move against a uniform background. The third sequence shows a personal vehicle captured by a surveillance camera⁸ (Surveillance Sequence), where the vehicle and its cast shadow move against a street with poor texture. In these sequences, a single object (and corresponding cast shadows) is moving, and the texture distribution varies significantly across the sequences. Texture addition with all methods helps suppress the erroneous flow around the objects (and shadows) and helps preserve the motion boundaries more accurately. The fourth and fifth sequences demonstrate multiple moving objects: two individuals are practicing in an indoor environment⁹ (Indoor Practice Sequence) with a poorly-textured background in the fourth sequence, and three rolling balls¹⁰ (and cast shadows) are animated on a curved surface with poor texture in the fifth sequence (Rolling Balls Sequence). For these sequences, texture addition and suppression of the background flow also help prevent object merging, specifically for the method in [4]. Comparing the ground-truth for the foreground in the third row to the magnitude of optical flow in the fifth, seventh, and ninth rows, we notice that the combination of the optical flow method in [13] with texture addition provides the most accurate foreground detection for the sequences in this figure.
6 http://arma.sourceforge.net/shadows/
7 http://www.youtube.com/watch?v=eJlqQSMifqk
8 http://www.youtube.com/watch?v=x6HRKncJuB0
9 http://www.youtube.com/watch?v=APDmcwT1ii4
10 http://visual.cs.ucl.ac.uk/pubs/flowConfidence/supp/
Figure 9: By rows: modified first frame, original second frame, ground-truth mask for the foreground, magnitude of optical flow according to [4] for original images, magnitude of optical flow according to [4] for modified images, magnitude of optical flow according to [6] for original images, magnitude of optical flow according to [6] for modified images, magnitude of optical flow according to [13] for original images, and magnitude of optical flow according to [13] for modified images: first column, Laboratory Sequence; second column, Wooden Model Sequence; third column, Surveillance Sequence; fourth column, Indoor Practice Sequence; fifth column, Rolling Balls Sequence.
Figure 10: By rows: modified first frame, original second frame, ground-truth mask for the foreground, magnitude of optical flow according to [4] for original images, magnitude of optical flow according to [4] for modified images, magnitude of optical flow according to [6] for original images, magnitude of optical flow according to [6] for modified images, magnitude of optical flow according to [13] for original images, and magnitude of optical flow according to [13] for modified images: first column, Highway Sequence; second column, Basketball Sequence; third column, UAV Sequence; fourth column, Office Sequence; fifth column, Playground Sequence.
In Fig 10, all sequences have multiple moving objects. The first sequence (Highway Sequence) shows four vehicles moving toward the camera on a highway¹¹ with poor texture, where object merging can be clearly observed. The second sequence is an indoor sequence (Basketball Sequence¹²), where two individuals are playing basketball against a background with partially poor texture. The third sequence shows two unmanned aerial vehicles flying in a partially cloudy sky (UAV Sequence¹³). The fourth sequence is taken from an indoor office video, where two individuals are moving against a uniform background (Office Sequence¹⁴). The fifth sequence is an animation in which three children are moving against a background with poorly-textured regions (Playground Sequence¹⁵). As in Fig 9, texture addition with all methods helps suppress the erroneous flow around the objects (and shadows), and helps preserve the motion boundaries more accurately. Note that while the combination of texture addition and the optical flow method in [13] provides the most accurate foreground detections for most of the sequences in Fig 10, for the Highway and Playground Sequences it does not show the best results and is not able to prevent the object merging problem. This is due to large object displacements in both sequences. As can be seen, the method in [6] demonstrates higher accuracy and is the only method capable of preventing object merging for the Highway Sequence, because it considers an extra term in the error functional, which employs feature matching to handle large displacements.

11 http://arma.sourceforge.net/shadows/
12 http://vision.middlebury.edu/flow/data/
13 http://www.youtube.com/watch?v=fgHjVvqLXV8
14 http://www.youtube.com/watch?v=cOyla67NMHk
15 http://www.youtube.com/watch?v=GXmW6S1iVCI

3.2. Quantitative Results
To quantify the effect of texture addition on the accuracy of foreground detection, we utilized two measures that are frequently used in the literature (such as in [27]) when two binary images are compared: the F-measure [28] and the Boundary Displacement Error (BDE) [29]. Given a specific weight α (here 0.5), we use $F_\alpha$ to designate the F-measure, which evaluates the amount of overlap between the ground-truth foreground mask and the detected foreground mask:

$$F_\alpha = \frac{(1 + \alpha) \times Precision \times Recall}{\alpha \times Precision + Recall}, \qquad (19)$$

where Precision and Recall are the ratios of the correctly detected foreground (overlapped area) to the detected and ground-truth foregrounds, respectively. Denoting the areas of the detected foreground mask and the ground-truth foreground mask by A(D) and A(G), respectively, Precision is given by:

$$Precision = \frac{A(D \cap G)}{A(D)}, \qquad (20)$$

and Recall is given by:

$$Recall = \frac{A(D \cap G)}{A(G)}. \qquad (21)$$

Clearly, larger values of $F_\alpha$ indicate higher overlap between the ground-truth foreground and the detected foreground, with $F_\alpha = 1$ indicating perfect overlap.
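Computing $F_\alpha$ from a pair of binary masks is straightforward; a minimal sketch of Eqs. (19)-(21), assuming both masks are non-empty:

```python
import numpy as np

def f_measure(detected, ground_truth, alpha=0.5):
    """F-measure of Eqs. (19)-(21) between two binary foreground masks."""
    D = detected.astype(bool)
    G = ground_truth.astype(bool)
    overlap = np.logical_and(D, G).sum()  # A(D intersect G)
    precision = overlap / D.sum()         # Eq. (20)
    recall = overlap / G.sum()            # Eq. (21)
    return ((1 + alpha) * precision * recall /
            (alpha * precision + recall))  # Eq. (19)
```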
The BDE measures the average displacement error between the boundaries of the two masks mentioned above. Let $B_D$ and $B_G$ represent the boundary point sets of the detected and ground-truth masks, respectively. The BDE from $B_D$ to $B_G$, denoted E(D, G), is computed as the average of the distances from every point p in $B_D$ to its closest point in $B_G$:

$$E(D, G) = \frac{\sum_{p \in B_D} d(p, B_G)}{|B_D|}. \qquad (22)$$

In Equation (22), $|B_D|$ represents the number of points in set $B_D$, and $d(p, B_G)$ represents the minimum Euclidean distance from p to all points in $B_G$:

$$d(p, B_G) = \min_{q \in B_G} \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}, \qquad (23)$$

where $(p_1, p_2)$ and $(q_1, q_2)$ are the coordinates of p and q, respectively. E(G, D) is computed similarly. The final BDE between $B_D$ and $B_G$ is the average of these two:

$$BDE(B_D, B_G) = \frac{1}{2} \left[ E(D, G) + E(G, D) \right]. \qquad (24)$$

Smaller values of BDE indicate lower average boundary displacement between the ground-truth foreground and the detected foreground, with BDE = 0 indicating a perfect boundary match.
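BDE can be computed the same way; the sketch below extracts 4-connected boundary pixels and uses a k-d tree for the nearest-neighbor distances in Eq. (23) (the boundary extraction details are our assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

def boundary_points(mask):
    """Boundary pixels of a binary mask: foreground pixels with at least
    one 4-connected background neighbour."""
    m = np.pad(mask.astype(bool), 1)  # pad with background
    interior = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                & m[1:-1, :-2] & m[1:-1, 2:])
    return np.argwhere(m[1:-1, 1:-1] & ~interior)  # (row, col) coordinates

def bde(detected, ground_truth):
    """Boundary Displacement Error of Eqs. (22)-(24); assumes both masks
    have a non-empty boundary."""
    bd, bg = boundary_points(detected), boundary_points(ground_truth)
    e_dg = cKDTree(bg).query(bd)[0].mean()  # E(D, G), Eqs. (22)-(23)
    e_gd = cKDTree(bd).query(bg)[0].mean()  # E(G, D)
    return 0.5 * (e_dg + e_gd)              # Eq. (24)
```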
Table 1 summarizes the values of $F_\alpha$ calculated for the optical flow methods in [4], [6], and [13], without and with texture addition, as applied to all sequences used in this study. The results indicate significant improvements in the preservation of foreground boundaries when the optical flow methods are accompanied by texture addition. Comparison of $F_\alpha$ shows 30%, 21%, and 28% more overlap on average between the detected and ground-truth foreground masks for the methods in [4], [6], and [13], respectively.

Table 1: Effect of adding texture on foreground boundary preservation ($F_\alpha$) for different methods, using SC = 40. Without texture addition: "wo tex add"; with texture addition: "w tex add". $F_\alpha$ values closer to unity are better.

Sequence          [4] wo tex add  [4] w tex add  [6] wo tex add  [6] w tex add  [13] wo tex add  [13] w tex add
Laboratory            0.395           0.686          0.792           0.864           0.672            0.927
Wooden Model          0.079           0.434          0.290           0.772           0.120            0.865
Surveillance          0.664           0.831          0.737           0.870           0.988            0.990
Indoor Practice       0.370           0.736          0.724           0.794           0.541            0.813
Rolling Balls         0.189           0.541          0.706           0.836           0.809            0.958
Highway               0.327           0.396          0.402           0.714           0.330            0.399
Basketball            0.634           0.808          0.660           0.863           0.602            0.898
UAV                   0.082           0.487          0.253           0.530           0.382            0.552
Office                0.259           0.808          0.547           0.869           0.288            0.930
Playground            0.327           0.614          0.553           0.683           0.338            0.476
Average               0.332           0.634          0.567           0.779           0.504            0.781
Table 2 summarizes the values of BDE calculated for the optical flow methods in [4], [6], and [13], without and with texture addition, as applied to all sequences used in this study. The results indicate significant improvements in the preservation of foreground boundaries (displacement error) when the optical flow methods are accompanied by texture addition. The BDE comparison shows an average reduction in the Boundary Displacement Error between detected and ground-truth foreground masks by a factor of 5 to 6 for all methods. Note that from the majority of the representative results, one would expect the method in [13] with texture addition to provide the most promising quantitative values. However, Table 1 shows only a marginal performance increase for [13] with respect to the method in [6] with texture addition. Moreover, Table 2 shows a higher BDE for the method in [13] compared with the method in [6].
Table 2: Effect of adding texture on foreground boundary preservation (BDE) for different methods, using SC = 40. Without texture addition: "wo tex add"; with texture addition: "w tex add". BDE values closer to zero are better.

Sequence          [4] wo tex add  [4] w tex add  [6] wo tex add  [6] w tex add  [13] wo tex add  [13] w tex add
Laboratory             8.443          1.722           0.842          0.464            3.300           0.138
Wooden Model          32.462          5.963          12.378          0.560           29.123           0.314
Surveillance           2.857          0.722           2.775          0.421            0.020           0.019
Indoor Practice       10.100          1.145           6.134          1.273            4.476           0.949
Rolling Balls          7.854          1.421           0.586          0.218            0.631           0.061
Highway               10.297          7.630           8.021          2.774           11.088           9.302
Basketball             6.558          1.426           5.724          0.706            7.460           0.489
UAV                    8.414          1.097           2.660          0.889            2.555           1.557
Office                25.149          0.859           6.604          0.453           27.748           0.236
Playground            11.349          1.993           2.866          1.111            8.490           6.429
Average               12.349          2.398           4.859          0.887            9.569           1.983
As explained earlier, the reason for the relatively poor performance of the method in [13] on the Highway Sequence and the Playground Sequence is large object displacement resulting from a low capture rate, which can be clearly observed in both tables, given that the method in [6] is designed to handle large displacements. For a fair comparison between competing methods, with data that is not biased toward one method, we need to compare $F_\alpha$ and BDE for the sequences in these tables other than the Highway and Playground Sequences. New calculations of the averages over the remaining sequences show that the average $F_\alpha$ values for [13] with texture and [6] with texture are 0.853 and 0.799, respectively, while the average BDE values for [13] with texture and [6] with texture become 0.470 and 0.623, respectively. It can now be seen that [13] with texture demonstrates the highest performance, provided that large object displacement does not occur.

Another advantage of texture addition is the reduction in computation time. Table 3 displays the variation percentage (defined as the difference between the computation time with texture and the computation time without texture, divided by the computation time without texture) in the running time for all three optical flow methods as applied to all sequences used in this study. Computations were performed using Matlab 2011a on an Intel(R) Core(TM) i7-2670QM CPU @ 2.20 GHz with 6.00 GB of installed RAM. Here, two types of texture are used: stochastic, which is generated according to (11), and regular, which is taken from a texture image. The reason for considering regular texture is that, as can be seen in the stochastic-texture column for [6] in Table 3, the computation time for the method in [6] increased when stochastic texture was added to the sequences. This is due to the extra feature-matching term in the energy functional of [6] for handling large displacements: when stochastic texture is added to the frames, the number of feature points detected and used in the calculations increases dramatically. Therefore, the increase in running time due to the larger number of feature point detections, descriptions, and matchings is considerably larger than the decrease in running time due to texture addition. Accordingly, we also employed a regular texture image for all methods, to investigate the difference between the effects of regular and stochastic textures on the computation time. Comparing the regular- and stochastic-texture columns for [6], we can see that since a regular texture (with a moderate texton size, as shown in Fig 11) does not add a very large number of feature points to each frame, the computation time of [6] with texture shows a reduction with respect to [6] without texture. Furthermore, the reduction percentages using regular and stochastic textures for the other methods exhibit only negligible differences. As a result, the texture type does not significantly change the computation time for the other methods. We note that the effects of the texture type on $F_\alpha$ and BDE are similarly small.
Figure 11: Regular texture image used in Table 3.

Table 3: Percentage variation in computation time from adding texture, for different methods, using regular texture ("r tex") and stochastic texture ("s tex"), SC = 40. Negative values indicate reduced computation time.

Sequence          [4] r tex  [4] s tex  [6] r tex  [6] s tex  [13] r tex  [13] s tex
Laboratory           -20.1      -18.2       -7.8      +58.2       -3.7       -11.7
Wooden Model         -19.6      -64.4      -24.8     +198.3      -34.4       -33.1
Surveillance         -24.6      -19.2      -14.8      +20.8       -2.2        -1.1
Indoor Practice      -55.7      -42.7      -44.1      +38.9      -10.1       -10.3
Rolling Balls        -54.0      -59.9      -33.2      +50.4      -18.6       -23.8
Highway              -47.6      -27.4      -27.9       +5.1       -0.1        -0.2
Basketball           -57.8      -64.9      -15.3       +8.1       -7.8        -6.7
UAV                  -69.3      -74.9      -12.9     +270.4      -19.5       -27.4
Office               -60.8      -64.0      -21.8     +212.2      -11.2        -8.6
Playground           -72.5      -70.1      -55.7      +18.9       -2.9        -0.8
Average              -48.2      -50.6      -25.8      +88.1      -11.1       -12.7
3.3. Discussion In this section, we discuss two remaining issues related to the algorithm suggested. First, the answer to the following question: ”what is the difference between adding static texture in nonmoving areas according to this algorithm with simply setting optical flow component to zero for those pixels?”; second, further discussion about the parameters used in the algorithm. Since a major goal of our proposed algorithm is to reduce the optical flow of the background regions with poor texture, ideally to zero, one might wonder why not setting the optical flow values to zero for the static poorly-textured regions after optical flow using original frames is calculated. To answer this question, we have to go back to the technique used for rough estimation of the moving pixels. On one hand, as explained in 2.2, image differencing can provide a fast yet crude estimate of the moving pixels. On the other hand, while results of texture segmentation are satisfying, there is potential for imperfection in the final binary mask. Therefore, when T EB and FDB are combined, the mask of static poorly-texture regions is not very accurate, as can be seen in Fig 5(b). In fact, this mask contains some false alarms as well as significant number of disjoint pixels in the background due to camera inherent noise, lighting changes, and so on. As a result, employing this mask along with the optical flow magnitude from original frames leads to a foreground mask with remarkable boundary distortions. This is shown in Fig 12 for four sequences used in this study, including the Laboratory Sequence in Fig 5. As observed in this figure, employing the final mask for static poorly-textured regions in or21
Figure 12: By columns: optical flow magnitude using original frames; mask of static non-textured pixels shared in both frames; optical flow magnitude using original frames, set to zero where TEB = FDB = 0; optical flow magnitude using modified images. First row: Laboratory Sequence; second row: Basketball Sequence; third row: Indoor Practice Sequence; fourth row: Office Sequence.
As observed in this figure, employing the final mask of static poorly-textured regions to suppress the optical flow does not yield accurate object boundaries, and it produces erroneous foreground blobs due to imperfections in this mask. In contrast, if this mask is used only to add texture to these regions, then even when the synthetic texture is mistakenly not added to some background pixels, the optical flow can still converge to zero at those pixels, owing to the zero flow calculated for neighboring pixels and the influence of those neighbors on the pixel deprived of texture (as discussed in 2.3). This can be viewed as the healing effect of optical flow. To quantitatively compare our proposed algorithm against the idea of simply suppressing the optical flow for pixels in the static poorly-textured regions, we calculated Precision, Recall, Fα, and BDE for the sequences in Fig 12 and show the results in Table 4. Since the crude frame-difference map detects even the smallest variations in a pixel's brightness, the probability of detecting all foreground pixels, and consequently the value of Recall, is expected to be higher when the latter method is used.
This can be seen in the fourth and fifth columns of Table 4, where the Recall values for the alternative method are comparable or higher. Our proposed algorithm, however, shows higher Precision values, owing to the suppression of erroneous background flow discussed earlier (second and third columns). For Fα, which provides a good representation of both Precision and Recall, and for BDE, our proposed algorithm outperforms the alternative, as can be seen by comparing the sixth and seventh columns and the eighth and ninth columns of this table, respectively. Hence, we can claim that our method provides higher foreground detection accuracy.
Table 4: Quantitative comparison of foreground detection accuracy between our proposed algorithm ("OM") and the idea of simply suppressing the optical flow for pixels in the static poorly-textured regions ("AM"), via Precision, Recall, Fα, and BDE (the winner for each metric is shown in bold).

Sequence          Precision (OM)  Precision (AM)  Recall (OM)  Recall (AM)  Fα (OM)  Fα (AM)  BDE (OM)  BDE (AM)
Laboratory             0.940           0.835          0.902        0.933      0.927    0.865     0.138     0.319
Office                 0.926           0.858          0.938        0.941      0.930    0.884     0.236     0.383
Indoor Practice        0.899           0.780          0.834        0.882      0.813    0.811     0.949     1.165
Basketball             0.961           0.758          0.795        0.922      0.898    0.770     0.489     1.506
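A hedged sketch of these evaluation metrics follows. Here `det` and `gt` are non-empty binary foreground masks (detected and ground truth); the exact weighting inside Fα is defined earlier in the paper, so the weighted harmonic mean below (which reduces to F1 at alpha = 0.5) is an assumption for illustration.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def precision_recall_f(det, gt, alpha=0.5):
    tp = np.logical_and(det, gt).sum()
    precision = tp / det.sum()
    recall = tp / gt.sum()
    # Weighted harmonic mean of precision and recall (assumed form of F_alpha).
    f_alpha = precision * recall / (alpha * recall + (1 - alpha) * precision)
    return precision, recall, f_alpha

def boundary_displacement_error(det, gt):
    # Mean distance from each boundary pixel of one mask to the nearest
    # boundary pixel of the other, averaged over both directions.
    def boundary(m):
        m = m.astype(bool)
        return m & ~binary_erosion(m)
    bd, bg = boundary(det), boundary(gt)
    d_det_to_gt = distance_transform_edt(~bg)[bd].mean()
    d_gt_to_det = distance_transform_edt(~bd)[bg].mean()
    return 0.5 * (d_det_to_gt + d_gt_to_det)
```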
The next topic concerns the effects of the parameters used in the algorithm: β, γ, and SC. The effects of β on the final binary mask of the image difference were discussed extensively in 2.2 and Fig 6. Investigating multiple sequences, we found empirically that β ∈ [0.01, 0.04] provides a safe range for accurate foreground detection; however, this range cannot guarantee perfect performance on all sequences. The threshold for texture segmentation, γ, was determined adaptively from the histogram of the texture energy values, as explained in 2.2 using (8) and Fig 3; owing to its adaptive nature, γ can be trusted over a larger variety of sequences. Finally, the effects of SC were studied extensively in 2.3. Since the effect of SC on the shrinkage of the foreground blobs was demonstrated only qualitatively in Fig 8, we employed the quantitative measure of foreground detection accuracy introduced in 3.2, Fα, to further justify the choice of SC = 40 for the synthetic texture intensity. Figure 13 shows the effect of SC on Fα for the Airplane Sequence used in Fig 8. As can be seen, beyond SC = 40 the accuracy of foreground detection increases only negligibly, while the disadvantages mentioned in 2.3 grow more tangibly. Therefore, the value SC = 40 is justified quantitatively.
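As an illustration, the sketch below shows one plausible reading of the β-controlled binarization and of how Fig 13 justifies SC = 40. The normalization of the frame difference by the 8-bit range and the `evaluate_f_alpha` hook are assumptions, not the authors' code.

```python
import numpy as np

def frame_difference_binary(f1, f2, beta=0.02):
    # FDB: 1 where the pixel changed between frames. beta is interpreted
    # here as a fraction of the 8-bit dynamic range, consistent with the
    # empirical safe range beta in [0.01, 0.04].
    diff = np.abs(f2.astype(np.float64) - f1.astype(np.float64)) / 255.0
    return (diff > beta).astype(np.uint8)

def pick_sc(evaluate_f_alpha, candidates=(10, 20, 30, 40, 50, 60), tol=1e-3):
    # Mimics the reading of Fig 13: stop increasing SC once F_alpha
    # improves by less than `tol`, since larger SC only adds drawbacks.
    scores = [evaluate_f_alpha(sc) for sc in candidates]
    for i in range(1, len(scores)):
        if scores[i] - scores[i - 1] < tol:
            return candidates[i - 1]
    return candidates[-1]
```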
Figure 13: Effect of the synthetic texture intensity SC on the foreground detection accuracy, measured by Fα.
4. Limitations and Future Work

The algorithm proposed herein is a first step toward solving the problem of poor texture in optical flow computation using a novel, simple, yet effective approach. As mentioned earlier in the Introduction, it has shortcomings, which include providing an adaptive threshold for binarizing the image-differencing results and adding a moving synthetic texture to poorly-textured blocks within the foreground. We have attempted to generate an accurate moving texture that does not produce erroneous flow, and we have tested multiple strategies for this problem, but they remain far from a final solution. For instance, adding a moving texture to non-textured moving regions using the displacement of feature points, with interpolation for arbitrary pixels within the moving blocks, cannot be performed accurately, since rigid and non-rigid motions require different types of texture interpolation and the motion type is not known a priori. The other option considered was to use the optical flow vector field itself to find the locations of the corresponding pixels in the second frame (sketched below), which has two limitations: first, the optical flow does not belong to either frame alone, and its magnitudes carry noticeable errors, leading to placement of the synthetic texture in the wrong location; second, it requires computing optical flow twice, which neutralizes the reduction in computation time provided by our algorithm. We leave these problems as future work.
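For illustration only, a minimal sketch of the flow-warping option follows; `map_coordinates` performs backward bilinear sampling, and the flow field (u, v) is assumed to map frame-1 coordinates toward frame 2.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_texture(texture, u, v):
    # Place the frame-1 texture into frame 2 by sampling it at positions
    # displaced backward by the flow. Any error in (u, v) misplaces the
    # texture, and (u, v) must already exist -- hence flow is computed twice.
    h, w = texture.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
    return map_coordinates(texture, [yy - v, xx - u], order=1, mode='nearest')
```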
5. Conclusions
In this study, we have investigated the effects of adding synthetic texture to images with poorly-textured regions on optical flow performance, namely the accuracy of foreground boundary detection and the computation time. We have demonstrated that texture addition leads to important advantages, including creation of sharp motion boundaries, more accurate capture of object contour and size, avoidance or mitigation of object merging, and reduction in computation time. Well-known quantitative metrics were employed to evaluate the effectiveness of combining texture addition with several leading optical flow methods on multiple real and animation sequences, and an analysis of optical flow convergence supported the resulting advantages from a mathematical perspective.

Acknowledgment

The authors would like to thank Dr. Yanqiu Wang of the Department of Mathematics, Oklahoma State University, for her assistance with the convergence analysis of the optical flow computation, and Dr. Damon Chandler of the Department of Electrical and Computer Engineering, Oklahoma State University, for his assistance with the quantitative measures of optical flow performance.

References

[1] B. K. P. Horn and B. G. Schunck, "Determining Optical Flow," Artificial Intelligence, vol. 17, (1981), pp. 185-203.
[2] M. Werlberger, T. Pock, and H. Bischof, "Motion Estimation with Non-Local Total Variation Regularization," Proc. CVPR, (2010).
[3] H. Haussecker and D. Fleet, "Computing Optical Flow with Physical Models of Brightness Variation," IEEE TPAMI, vol. 23, (2001), pp. 661-673.
[4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High Accuracy Optical Flow Estimation Based on a Theory for Warping," T. Pajdla and J. Matas, Eds., ECCV, LNCS, vol. 3024, Springer, Heidelberg, (2004), pp. 25-36.
[5] A. Bruhn, J. Weickert, and C. Schnörr, "Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods," IJCV, vol. 61, (2005), pp. 211-231.
[6] T. Brox and J. Malik, "Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation," IEEE TPAMI, vol. 33, no. 3, (2011), pp. 500-513.
[7] H.-H. Nagel and W. Enkelmann, "An Investigation of Smoothness Constraints for the Estimation of Displacement Vector Fields from Image Sequences," IEEE TPAMI, vol. 8, no. 5, (1986), pp. 565-593.
[8] L. Alvarez, R. Deriche, T. Papadopoulo, and J. Sanchez, "Symmetrical Dense Optical Flow Estimation with Occlusions Detection," IJCV, vol. 75, (2007), pp. 371-385.
[9] M. J. Black, "Combining Intensity and Motion for Incremental Segmentation and Tracking over Long Image Sequences," G. Sandini, Ed., ECCV, LNCS, vol. 588, Springer, Heidelberg, (1992), pp. 485-493.
[10] D. J. Fleet, M. J. Black, and O. Nestares, "Bayesian Inference of Visual Motion Boundaries," Exploring Artificial Intelligence in the New Millennium, Morgan Kaufmann, San Francisco, (2002), pp. 139-174.
[11] Y. Lei, L. Jinzong, and L. Dongdong, "Discontinuity-Preserving Optical Flow Algorithm," Journal of Systems Engineering and Electronics, vol. 18, no. 2, (2007), pp. 347-354.
[12] T. Nir, A. Bruckstein, and R. Kimmel, "Over-Parameterized Variational Optical Flow," IJCV, vol. 76, no. 2, (2008), pp. 205-216.
[13] D. Sun, S. Roth, and M. J. Black, "Secrets of Optical Flow Estimation and Their Principles," Proc. CVPR, (2010), pp. 2432-2439.
[14] P. J. Huber, "Robust Regression: Asymptotics, Conjectures and Monte Carlo," Annals of Statistics, vol. 1, no. 5, (1973), pp. 799-821.
[15] D. Shulman and J. Y. Hervé, "Regularization of Discontinuous Flow Fields," Proc. Workshop on Visual Motion, (1989), pp. 81-86.
[16] M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof, "Anisotropic Huber-L1 Optical Flow," Proc. BMVC, (2009).
[17] H. Zimmer, A. Bruhn, J. Weickert, L. Valgaerts, A. Salgado, B. Rosenhahn, and H.-P. Seidel, "Complementary Optic Flow," LNCS, vol. 5681, (2009), pp. 214-223.
[18] L. Kun and Y. Wang, "Oriented Smoothness Aided Harmonic Gradient Vector Flow for Active Contours," Proc. 2nd International Congress on Image and Signal Processing, (2009), pp. 1-5.
[19] J. Zhao, Y. Wang, and H. Wang, "Optical Flow with Harmonic Constraint and Oriented Smoothness," Sixth International Conference on Image and Graphics, (2011), pp. 94-99.
[20] G. Aubert, R. Deriche, and P. Kornprobst, "Computing Optical Flow via Variational Techniques," SIAM J. Appl. Math., vol. 60, (1999), pp. 156-182.
[21] K. Laws, "Textured Image Segmentation," Ph.D. Dissertation, University of Southern California, (1980).
[22] M. Hubert and E. Vandervieren, "An Adjusted Boxplot for Skewed Distributions," Computational Statistics and Data Analysis, vol. 52, no. 12, (2008), pp. 5186-5201.
[23] A. Mitiche and A. Mansouri, "On Convergence of the Horn and Schunck Optical-Flow Estimation Method," IEEE Trans. Image Processing, vol. 13, no. 6, (2004), pp. 848-852.
[24] L. N. Trefethen and D. Bau, "Numerical Linear Algebra," SIAM, (1997).
[25] R. J. LeVeque, "Finite Difference Methods for Ordinary and Partial Differential Equations: Steady State and Time Dependent Problems," SIAM, (2007).
[26] J. F. Queiró, "Partial Spectra of Sums of Hermitian Matrices," Mathematical Papers in Honour of Eduardo Marques de Sá, 39, (2006).
[27] C. Vu and D. Chandler, "Main Subject Detection via Adaptive Feature Refinement," Journal of Electronic Imaging, vol. 20, no. 1, (2011), pp. 013.
[28] T. Liu, J. Sun, N. N. Zheng, X. Tang, and H. Y. Shum, "Learning to Detect a Salient Object," Proc. CVPR, Minneapolis, Minnesota, USA, (2007), pp. 1-8.
[29] J. Freixenet, X. Munoz, D. Raba, J. Marti, and X. Cufi, "Yet Another Survey on Image Segmentation: Region and Boundary Information Integration," ECCV 2002: Proceedings of the 7th European Conference on Computer Vision, Part III, Springer-Verlag, London, (2002), pp. 408-422.
Highlights
- An initial step for optical flow estimation in poorly-textured images is proposed.
- The simple yet effective step preserves motion boundaries where other methods fail.
- The proposed algorithm reduces computation time meaningfully.
- Mathematical analysis is employed to explain the advantages provided.
- Quantitative measures have been introduced to assess the performance.