Pattern Recognition Letters 89 (2017) 73–80
Occlusion detecting window matching scheme for optical flow estimation with discrete optimization

Kyong Joon Lee a,∗, Il Dong Yun b

a Department of Radiology, Seoul National University Bundang Hospital, Seongnam 13620, Korea
b Division of Computer and Electronic Systems Engineering, Hankuk University of Foreign Studies, Yongin 17035, Korea
ARTICLE INFO

Article history: Received 5 September 2016; Available online 14 February 2017

MSC: 41A05, 41A10, 65D05, 65D17

Keywords: Optical flow; Occlusion detection; Window-matching
ABSTRACT

Occlusion detection plays an important role in optical flow estimation and vice versa. We propose a single framework to simultaneously estimate flow and detect occlusion using a novel support-weight based window matching. The proposed support-weight provides an effective clue to detect occlusion, based on the assumption that the occlusion occupies a relatively small portion of the window. By applying a coarse-to-fine approach we successfully address non-small occlusion problems as well. The proposed method also provides reasonable flow estimation for the occluded pixels. The energy model with the matching cost and the flow regularization cost is optimized by an efficient discrete optimization method. Experiments demonstrate that our method improves the accuracy of the estimated flow compared to the method without occlusion detection, particularly on motion boundaries. It also yields highly competitive occlusion detection results, outperforming the previous state-of-the-art methods.

© 2017 Elsevier B.V. All rights reserved.
1. Introduction
In estimating optical flow between a reference image and a target image, occlusion refers to a region of the reference image that does not correspond to any region in the target image due to the movement of objects and/or a view change. Unless properly handled, occlusion degrades the quality of estimation, particularly on object boundaries, and may lead to severe performance degradation in many applications of optical flow estimation; for example, frame interpolation [1,2], motion segmentation [3,4], motion layer ordering [5], and motion compensated coding [6,7]. A convincing way to find the exact occlusion is to grasp the exact motion of all objects in the images; conversely, if we obtain the exact occlusion in advance, the accuracy of flow estimation can be significantly improved. In practice, neither the exact motion nor the exact occlusion is provided in advance, so it is very challenging to obtain highly accurate optical flow and occlusion at the same time. We address this challenge with a novel window matching scheme on a unified discrete MRF (Markov Random Field) framework.
1.1. Related work

Various approaches have been presented for jointly estimating optical flow while detecting occlusion. Many of them employ individual sequential steps as follows: (1) calculating optical flow as if occlusion does not exist, (2) finding occlusion based on the estimated flow, and then (3) iterating the previous two steps until convergence. One simple approach to detect occlusion, given a flow estimate, is thresholding the residual between the reference image and the warped target image [8]. In [9], the authors introduce a probabilistic criterion employing histograms of image contents, and alternately calculate flow and visibility through the EM algorithm. Alvarez et al. define occlusion by checking the symmetric consistency of forward and backward flows [10]. Another method in [11] also detects occlusion by cross-checking the bi-directional flows, utilizing a coarse-to-fine approach with discrete optimization. To reduce the inherent computational complexity, it estimates the movement of groups of similar pixels (i.e., super-pixels) by over-segmenting the input images. A method in [12] utilizes the observation that a certain point in a target image is probably occluded if the point is reached by multiple pixels of the reference image through forward warping. It finally refines the estimated flow with the probability map of accessibility. All those approaches, however, may suffer from the fact that they depend on an initial flow result which could be incorrectly estimated in the occluded area, and subsequent iterations may accordingly yield erroneous results as well. Moreover, they
∗ Corresponding author. E-mail addresses: [email protected] (K.J. Lee), [email protected] (I.D. Yun).
require additional computational cost for obtaining the backward flow or the occlusion probability map.

Occlusion has also been recognized as a significant issue in stereo matching problems. Zitnick et al. propose to iteratively update a 3D disparity array using uniqueness and smoothness constraints, detecting occlusion by thresholding [13]. The uniqueness constraint implies that each pixel in a target image should have at most one correspondence in a reference image. An approach in [14] shows promising results by applying the Graph-cuts algorithm [15] to efficiently enforce the uniqueness constraint. A method in [16] uses backward disparity and visibility maps to detect occlusion symmetrically, using iterative optimization with Belief Propagation [17]. These methods generally find good solutions in discrete sample spaces; however, in two-dimensional flow estimation the size of the sample space grows exponentially, leading to very high computational complexity unless it is managed efficiently.

Meanwhile, Ballester et al. build on the assumption that an occluded pixel may be visible in the frame preceding the reference frame [18]. Their approach, however, is limited to cases where multiple frames are provided and the motion across the frames is relatively simple. A recent method in [19] utilizes over-segmentation to find image layers with their respective movements, detecting occlusion from the local ordering of the layers. While this method can address large occlusions in textureless regions, it may yield over-simplified flow estimation depending on the performance of the over-segmentation. Fortun et al. also presented an approach to manage large occlusion problems [20,21]. They first compute local flow candidates in non-occluded regions, and then fill in a large occlusion based on the candidates close to it. However, the filling-in may fail if no region is close enough or if multiple confusing candidate regions exist. In [22], the authors propose a model incorporating a cost for occlusion, which is supposed to be very sparse between input images within an infinitesimal time interval. While this method presents state-of-the-art performance in detecting occlusion, it degrades the performance of flow estimation as the process iterates. In addition, the performance can be very sensitive to a threshold value controlling the sparseness of the occlusion. Another work [23] also applies the sparsity constraint to estimate flow as well as occlusion, but the algorithm eventually depends on a weight map obtained from motion inconsistency, yielding performance short of the state of the art.

2. Proposed approach

This work aims to simultaneously estimate optical flow and detect occlusion within a single optimization framework. Our method does not iterate between flow estimation and occlusion detection. Compared to the previous state-of-the-art method [22], the proposed approach does not degrade the performance of optical flow estimation even without such iteration, and it does not require sensitive threshold parameter tuning.

2.1. Unified optimization framework

To this end, we propose a discrete MRF framework with a novel interpretation of the graphical nodes. In previous approaches employing the discrete framework [24–26], each pixel in a reference image is mapped to a node representing a 2D displacement vector. In contrast, a node in our framework represents a 3D vector, comprising a 2D displacement vector and an occlusion status.
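To make this node labeling concrete, the sketch below enumerates such a 3D label set for a single node, pairing every quantized 2D displacement with an occlusion flag. It is only an illustration of the representation described above; the quantization grid, label ordering, and function name are illustrative choices, not taken from the paper.

```python
import itertools

def build_label_set(max_disp, levels):
    """Enumerate 3D labels (u, v, o): a quantized 2D displacement plus an occlusion flag.

    max_disp : largest displacement covered in each direction
    levels   : number of quantization levels per direction (L in the text)
    """
    # Homogeneously quantized displacement samples in each direction.
    step = 2.0 * max_disp / (levels - 1)
    samples = [-max_disp + i * step for i in range(levels)]
    labels = []
    for o in (0, 1):                                   # occlusion status off / on
        for u, v in itertools.product(samples, samples):
            labels.append((u, v, o))
    return labels                                      # |labels| = 2 * levels**2

labels = build_label_set(max_disp=64, levels=8)        # e.g., 2 * 8**2 = 128 labels per node
print(len(labels))
```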
Our contribution also lies in providing an efficient method to find the optimal solution for the proposed framework. The proposed framework defines a reasonable matching cost for the new type of displacement vector whose occlusion status is turned on. In addition to the conventional intensity consistency cost
Fig. 1. Support-weight illustration for an occluded pixel. A reference frame (a) and a target frame (b) contain two moving objects; the foreground object (in bright brown) moves to the top, while the background object (in dark brown) moves to the left. The center pixel with a support window in the reference frame, shown in pink, is occluded in the target frame. Three matching candidate points with support windows in the target frame are located in the foreground (green), the occluded region (red), and the background (yellow), respectively. In (c), (d), and (e), support-weights for these three points are depicted for the normal weight (top row) and for the occlusion weight (bottom row). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
on each pixel, we define a new cost for the case that the pixel is occluded but matched to a certain position in the target image. These two costs should be well balanced so as not to yield a trivial solution (e.g., all pixels being occluded).
2.2. Matching cost for occluded pixel

The matching cost in this work is calculated by comparing windows based on a support-weight scheme [26–28] (the window refers to the local neighbourhood of the pixel to be matched). The support-weight accentuates pixels in the window during the calculation if they belong to the same object as the central pixel. When a pixel in a reference image is occluded, the pixel does not match any position in the target, and its actual matching cost is hard to define. To avoid this difficulty, previous approaches [8,16,22] simply assign a constant cost to occlusion candidates. However, the optimal constant cost is unknown, and the resulting flow estimate for the occluded pixel mostly depends on regularization with neighbouring flows.

We assign a reasonable cost to an occluded pixel by utilizing its local neighbourhood. We observe that non-occluded neighbors belonging to the same object as the occluded pixel can be employed to estimate the correct matching of the pixel. Based on this observation, we divide the support-weight into two different types as follows: denoting wref and wtar as the support-weights for windows in the reference image and the target image respectively, we define wref · wtar as the normal weight, and wref · (1 − wtar) as the occlusion weight.

Fig. 1 presents an example of the support-weights (normal and occlusion weights) for an occluded pixel. The bright region represents the background object while the dark region represents the foreground object. The foreground in the reference (a) moves to the left in the target (b). The pixel shown in pink in (a) is occluded in (b). The top and bottom rows of (c), (d), and (e) illustrate the normal weight and the occlusion weight respectively, for the cases where the window is located in the background (c), the occlusion (d), and the foreground (e) in the target. The highly weighted part of each window is shown in high intensity, which we refer to as the effective area. When the window is located in the occluded region as in (d), the effective area of the normal weight appears identical to the occlusion itself, assigning high weights to the pixels in the occlusion when calculating the matching cost. In contrast, the effective area of the occlusion weight exactly covers the non-occluded background area, excluding the occlusion from the cost calculation, which can be a good clue for correct matching. In the other cases, when the window is located in the foreground (c) or in the
background (e), the effective area of the occlusion weight is negligible.

2.3. Constraint for occlusion detection

We define a condition to determine when the occlusion weight should be considered prior to the normal weight. Assuming the time gap between the target and the reference is infinitesimal, and thus the occlusion occupies a relatively small portion of the window (i.e., it is sparse), we propose a simple yet powerful constraint for the decision: if the sum of the normal weight is smaller than the sum of the occlusion weight, then it is very likely that the window matches to occlusion, and we assign some penalty on the cost using the normal weight.
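As a minimal illustration of this decision rule, the sketch below compares the summed normal weight against the summed occlusion weight for one candidate window. The weight arrays are assumed to be precomputed (their actual definitions appear in Section 4), and the function name is hypothetical.

```python
import numpy as np

def occlusion_likely(w_ref, w_tar):
    """Return True when the occlusion weight dominates the normal weight.

    w_ref : support-weights of the reference window (2D array)
    w_tar : support-weights of the matched target window (2D array)
    """
    normal_sum = np.sum(w_ref * w_tar)             # sum of the normal weight
    occlusion_sum = np.sum(w_ref * (1.0 - w_tar))  # sum of the occlusion weight
    # Under the sparse-occlusion assumption, a larger occlusion sum suggests
    # that the candidate position falls in an occluded region.
    return occlusion_sum > normal_sum
```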
Fig. 2. Flow estimation results for toy examples. (a) Reference frames. (b) Target frames. (c) Estimated flow using the conventional weight, represented in HSI color space (direction: hue, magnitude: saturation). (d) Estimated flow, and (e) detected occlusion (shown in black), using the proposed weight.
3. Background

Let G be an undirected graph with a node set V and an edge set E. A node in V corresponds to a pixel in a reference image. Let l_s be a label, i.e., a random variable for a node s in a discrete sample space L_s = {1, ..., 2L^2}, representing the quantized vector set T_s = {u_s(1), ..., u_s(2L^2)}. A vector in T_s is three dimensional, i.e., u_s = (u_s, v_s, o_s). The first two dimensions represent a displacement vector in the x and y directions, homogeneously quantized by L labels in each direction. The last dimension o_s ∈ {0, 1} indicates the occlusion status of the node. The problem of jointly estimating optical flow and detecting occlusion can then be considered as finding the labels for all pixels that minimize an energy function:
\sum_{s \in V} \theta_s(l_s) + \sum_{(s,t) \in E} \psi_{st}(u_s(l_s) - u_t(l_t)),   (1)
where \theta_s(\cdot) imposes the cost for matching the correlation window for s, and \psi_{st}(\cdot) denotes the spatial smoothness term between s and t.

The discrete sample space L is a finite set. The size of the space |L| (= 2L^2) is proportional to the maximum displacement over the desired flow precision μ, such that |L| ∝ max(T)/μ. To cover large displacements with a limited number of labels, we build Gaussian image pyramids for the input images and find a rough solution at the top level of the pyramids. Down at the next level, a dense flow field can be generated by interpolating the coarse solution, and it is provided as the initial flow field for further estimation. The number of pyramid levels is determined by log_d(max(T)/|L|), where d is the downsampling factor used to build the image pyramid. We use d = 2 in our experiments.

4. Window matching scheme

The data matching cost in our work is based on a square window with support-weights [26,27], which can be defined as follows:
\theta_s(l_s) = \frac{\sum_{t \in W(s)} w^{ref}_s(t)\, w^{tar}_{s'}(t')\, \rho(t, t')}{\sum_{t \in W(s)} w^{ref}_s(t)\, w^{tar}_{s'}(t')},   (2)
where W(s) is the set of neighboring nodes in the window supporting s, w^{ref}_s denotes the support-weight function for s in the reference, and w^{tar}_{s'} denotes the function for s' in the target. s' and t' are the positions to which s and t are mapped by the displacement vector u_s(l_s). \rho(t, t') denotes a similarity measure between the pixels at t and t', e.g., the absolute difference, the squared difference, or the gradient inner product.

4.1. Proposed support-weight

We propose to use a different support-weight according to the occlusion status. The modified definition of Eq. (2) is as follows:
\theta_s(l_s) = (1 - o_s)\, \frac{\sum_{t \in W(s)} w^{ref}_s(t)\, w^{tar}_{s'}(t')\, \rho(t, t')}{Z} + o_s \left( \frac{\sum_{t \in W(s)} w^{ref}_s(t)\, \bigl(1 - w^{tar}_{s'}(t')\bigr)\, \rho(t, t')}{Z_o} + \hat{\beta} \right),   (3)

where Z = \sum_{t \in W(s)} w^{ref}_s(t)\, w^{tar}_{s'}(t') and Z_o = \sum_{t \in W(s)} w^{ref}_s(t)\, (1 - w^{tar}_{s'}(t')) are normalization factors. \hat{\beta} is a conditional parameter that implements the constraint for sparse occlusion detection:

\hat{\beta} = \begin{cases} 0 & \text{if } Z > Z_o, \\ \beta & \text{otherwise}. \end{cases}   (4)
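The following sketch assembles the occlusion-aware data term of Eqs. (3) and (4) for a single node and one candidate match. It assumes the support-weight windows and the per-pixel dissimilarity have already been computed (the weights themselves are defined in Eqs. (5) and (6) below); the small epsilon guard and the function name are illustrative additions, not part of the paper.

```python
import numpy as np

def data_term(w_ref, w_tar, rho, occluded, beta=1.0, eps=1e-8):
    """Occlusion-aware matching cost for one node and one candidate match.

    w_ref    : support-weights of the reference window, shape (H, W)
    w_tar    : support-weights of the matched target window, shape (H, W)
    rho      : per-pixel dissimilarity between the two windows, shape (H, W)
    occluded : occlusion status o_s of the candidate label (0 or 1)
    beta     : occlusion penalty parameter of Eq. (4)
    """
    normal_w = w_ref * w_tar            # normal weight of Section 2.2
    occl_w = w_ref * (1.0 - w_tar)      # occlusion weight of Section 2.2

    Z = normal_w.sum()
    Z_o = occl_w.sum()

    beta_hat = 0.0 if Z > Z_o else beta  # conditional parameter of Eq. (4)

    if occluded:
        return (occl_w * rho).sum() / (Z_o + eps) + beta_hat
    return (normal_w * rho).sum() / (Z + eps)
```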
To define w^{ref}, we employ the conventional bilateral filtering based approach [27]:

w^{ref}_s(t) = \exp\left( -\frac{\|x_s - x_t\|^2}{2\sigma_g^2} - \frac{\|I(s) - I(t)\|^2}{2\sigma_r^2} \right),   (5)
where x_s, x_t denote the 2D coordinates, and I(s), I(t) the color values, of the points s and t, respectively. \sigma_g and \sigma_r are parameters controlling the effect of the geometric and range differences. w^{tar} is defined in a similar manner, as presented in [26]:

w^{tar}_{s'}(t') = \exp\left( -\frac{(x_{s'} - x_{t'})^T T_g(s') (x_{s'} - x_{t'})}{2\sigma_g^2} - \frac{\|I(s') - I(t')\|^2}{2\sigma_r^2} \right),   (6)

where T_g(s') is an anisotropic tensor produced by the second moment matrix of the supporting window. Use of this tensor is better suited to preserving the non-occluded region along the motion boundary in the window, compared to the geometric term in (5), whose distribution is fixed regardless of image content.
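A minimal sketch of the two weights in Eqs. (5) and (6) follows. The anisotropic tensor is passed in as a given 2 × 2 matrix; its construction from the second moment matrix is not reproduced here, and the function names are assumptions.

```python
import numpy as np

def ref_weight(xs, xt, Is, It, sigma_g, sigma_r):
    """Bilateral support-weight of Eq. (5) for a reference pixel pair (s, t)."""
    geom = np.sum((xs - xt) ** 2) / (2.0 * sigma_g ** 2)
    rng = np.sum((Is - It) ** 2) / (2.0 * sigma_r ** 2)
    return np.exp(-geom - rng)

def tar_weight(xs, xt, Is, It, Tg, sigma_g, sigma_r):
    """Anisotropic support-weight of Eq. (6); Tg is a given 2x2 tensor."""
    d = xs - xt
    geom = float(d @ Tg @ d) / (2.0 * sigma_g ** 2)
    rng = np.sum((Is - It) ** 2) / (2.0 * sigma_r ** 2)
    return np.exp(-geom - rng)
```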
In Fig. 2, we present estimation examples for two toy problems. The reference (a) and target (b) images in the top row contain occlusion stemming from moving layers. The bottom row presents a case where occlusion arises from a pop-up object. As seen in the results from the conventional weight (c), the estimated flow in the occluded area is not consistent with the layers to which the area belongs. In contrast, the proposed weight contributes both to estimating consistent flow (d) for the homogeneous layers and to detecting accurate occlusion (e), shown in black.

4.2. Coarse-to-fine occlusion update

The proposed method assumes a relatively small occlusion between subsequent frames, so as to detect occlusion by comparing the normal weight to the occlusion weight. However, the input images occasionally contain large occlusions and, as shown in the middle row of Fig. 3, the proposed method may yield a poor result in the middle of a large occlusion.
Fig. 3. Coarse-to-fine occlusion update. Top row: The reference and the target frames in the Ambush 5 sequence. Detected occlusion is shown in black. Middle row: Estimation result without the update. Bottom row: Estimation result with the update. Note that the update improves flow estimation on large motion boundaries in addition to presenting plausible occlusion detection on those regions.

Fig. 4. Conceptual illustration of the node decomposition. Left: Original MRF model. A node represents a label for a 3D displacement vector (u, v, o). Right: The original node is decomposed into three nodes representing labels for the 1D vectors u, v, and o, respectively. The unary term (shown as a black square) in the original MRF model becomes a high-order potential term for the decomposed nodes.

We employ a coarse-to-fine update combined with a filtering method to address this challenge. Building Gaussian image pyramids for the input images, the occluded region is also scaled down at the upper levels of the pyramids, where it can be considered sparse. Down at the next level, we interpolate the found occlusion and apply rank filtering; in our experiments, we use rank 4 with a 5 × 5 filtering mask. The resulting occlusion map is then added to the occlusion map newly generated at the current level. The effect of this strategy can be seen in the bottom row of Fig. 3: the proposed approach improves flow estimation on large motion boundaries, in addition to presenting plausible occlusion detection on those regions. A minimal sketch of this update step follows.
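The sketch below illustrates one level of the coarse-to-fine occlusion update under the assumptions stated above: the coarse map is upsampled, cleaned with a rank filter (rank 4 on a 5 × 5 mask), and merged with the occlusion map detected at the current level. The upsampling method and the function name are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import rank_filter, zoom

def propagate_occlusion(coarse_occ, current_occ, factor=2, rank=4, size=5):
    """Carry an occlusion map from a coarser pyramid level down to the current level.

    coarse_occ  : binary occlusion map from the coarser level
    current_occ : binary occlusion map detected at the current level
    factor      : downsampling factor d between pyramid levels
    """
    # Upsample the coarse occlusion map to the current resolution
    # (shapes are assumed to line up after upsampling in this sketch).
    upsampled = zoom(coarse_occ.astype(float), factor, order=1) > 0.5
    # Rank filtering suppresses isolated responses while keeping larger regions.
    filtered = rank_filter(upsampled.astype(np.uint8), rank=rank, size=size) > 0
    # Merge with the occlusion newly found at this level.
    return np.logical_or(filtered, current_occ)
```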
5. Optimization

To find the optimal solution of the MRF formulation in (1), we employ the TRW-S algorithm [29], which has shown state-of-the-art results [30] in many discrete framework applications. The asymptotic computational complexity of TRW-S is in general O(|V|L^2), which for our current framework becomes O(|V|L^4). As the proposed method requires an adequate number of labels to yield satisfactory estimation results, we introduce techniques to address this complexity, which is dominated by the number of labels.

5.1. Node decomposition

We apply the node decomposition scheme [31,32], reducing the complexity to O(|V|L^2). The scheme decomposes the node s ∈ V into three nodes x ∈ V_x, y ∈ V_y, and o ∈ V_o. We define l_i as a random variable for a node i in a discrete sample space L_i = {1, ..., L}, representing the quantized 1D displacement vector set T_i = {u_i(1), ..., u_i(L)}, where i ∈ {x, y}; and l_o as a random variable in L_o = {0, 1}. The original displacement vector u_s(l_s) corresponds to (u_x(l_x), u_y(l_y), l_o). The original edge set E is decomposed into E^x, E^y, E^o, and a new hyper-edge set E^{xyo} is introduced to account for the high-order potential between the decomposed nodes. Fig. 4 shows a conceptual illustration of the decomposition scheme. The original MRF formulation in (1) is updated as follows:

\sum_{(x,y,o) \in E^{xyo}} \theta_{xyo}(l_x, l_y, l_o) + \sum_{(x,x') \in E^x} \psi_{xx'}(u_x(l_x) - u_{x'}(l_{x'})) + \sum_{(y,y') \in E^y} \psi_{yy'}(u_y(l_y) - u_{y'}(l_{y'})) + \sum_{(o,o') \in E^o} \psi_{oo'}(u_o(l_o) - u_{o'}(l_{o'})).   (7)
We note that the original unary potential \theta_s becomes the factor node, an element of a hyper-edge set E^F, represented by the ternary potential \theta_{xyo}. Unary potentials for the decomposed nodes are undefined, imposing no cost on any configuration. As the number of labels per node reduces to L, the complexity of message-passing for the pairwise interactions reduces to O(|V|L^2).

5.2. Conversion from high-order potential to pairwise interactions

The factor node induced by the decomposition is not easy to handle in the message-passing of the TRW-S algorithm. We apply the conversion proposed in [33,34] and introduce an efficient implementation of it. Let a be an auxiliary variable node associated with a new label z ∈ A, where A is the Cartesian product of the label spaces of the three connected nodes, i.e., A = L_x × L_y × L_o. We replace the factor node with the auxiliary node a. The unary and pairwise potentials of the converted energy model are described as follows:
\theta_a(z) = \theta_{xyo}(l_x, l_y, l_o),   (8)

\psi_{ai}(z, l_i) = \begin{cases} 0 & \text{if } z_i = l_i, \\ \infty & \text{otherwise}, \end{cases} \qquad \forall i \in \{x, y, o\},   (9)
where every possible value z has a one-to-one correspondence to a triplet (l_x, l_y, l_o), and z_i (i ∈ {x, y, o}) denotes the associated component of the triplet. The pairwise potential \psi_{ai} enforces z_i to be consistent with the labels of the neighboring decomposed nodes. After the conversion, the energy model (7) changes to the following:
\sum_{a} \theta_a(z) + \sum_{(a,i) \in E^F} \psi_{ai}(z, l_i) + \sum_{(x,x') \in E^x} \psi_{xx'}(u_x(l_x) - u_{x'}(l_{x'})) + \sum_{(y,y') \in E^y} \psi_{yy'}(u_y(l_y) - u_{y'}(l_{y'})) + \sum_{(o,o') \in E^o} \psi_{oo'}(u_o(l_o) - u_{o'}(l_{o'})).   (10)
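The following min-sum sketch shows the message sent from the auxiliary node a to the decomposed node x after the conversion: because the pairwise potentials of Eq. (9) are infinite for inconsistent configurations, only labels with z_i = l_i contribute, and the update reduces to a minimization over (l_y, l_o). This is a plain min-sum illustration, not the exact TRW-S update with its reweighting; the function name and array layout are assumptions.

```python
import numpy as np

def message_from_factor(theta_xyo, msg_y, msg_o):
    """Min-sum message from the auxiliary (factor) node to the x node.

    theta_xyo : ternary data cost, shape (Lx, Ly, Lo)
    msg_y     : incoming message from the y node, shape (Ly,)
    msg_o     : incoming message from the o node, shape (Lo,)
    Only configurations with z_i = l_i survive the infinite pairwise
    potentials of Eq. (9), so the minimization runs over (l_y, l_o) only.
    """
    total = theta_xyo + msg_y[None, :, None] + msg_o[None, None, :]
    return total.min(axis=(1, 2))   # one value per label l_x
```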
Fig. 5. Estimation results for the Alley 1 sequence. Top left: The reference frame. Top right: Flow estimation without occlusion detection. Bottom left/right: Flow estimation with the full algorithm, which successfully detects occlusion (shown in black) and provides reasonable estimation on the occluded regions as well. Considering that occlusion generally occurs on motion boundaries, the full algorithm clearly helps to distinguish the boundaries.
The auxiliary node may induce O(L^3) time complexity as well as O(L^2) memory to store messages. However, a message from a to i ∈ {x, y, o} does not require any storage, since the pairwise potential \psi_{ai} never adds a value to the message. We may also ignore all updating operations for configurations with z_i ≠ l_i. In sum, by simply modifying the message-update logic in the implementation, the time complexity remains O(L^2).

5.3. Min convolution

The decomposition enables defining the pairwise potential \psi_{st} as linear in the label difference; that is, we may rewrite \psi_{st}(u_s(l_s) − u_t(l_t)) as \psi_{st}(l_s − l_t). Then we can apply the min-convolution algorithm [35] within the TRW-S, reducing the time complexity to O(|V|L). In the experiments we set \psi_{st}(l_s − l_t) = α|l_s − l_t|, which is a parametric and robust convex penalizer (a minimal sketch of this step is given after Section 5.4).

5.4. Regularization for occlusion

The pairwise potential for E^o can impose regularization on the occlusion status between adjacent pixels. We applied the Potts model with various λ values, and found the best results with λ = 0, i.e., no regularization.
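For the linear penalty \psi_{st}(l_s − l_t) = α|l_s − l_t| used in Section 5.3, the min-convolution in the message update can be computed with the classic two-pass lower-envelope technique of [35]. The sketch below is a minimal, stand-alone illustration of that step; the function name and the plain Python loop are illustrative choices.

```python
import numpy as np

def min_convolution_l1(h, alpha):
    """Compute m[j] = min_i ( h[i] + alpha * |i - j| ) in O(L).

    h     : input costs (e.g., a partial message), shape (L,)
    alpha : slope of the linear pairwise penalty alpha * |l_s - l_t|
    """
    m = h.astype(float).copy()
    # Forward pass: best cost reachable from smaller labels.
    for j in range(1, len(m)):
        m[j] = min(m[j], m[j - 1] + alpha)
    # Backward pass: best cost reachable from larger labels.
    for j in range(len(m) - 2, -1, -1):
        m[j] = min(m[j], m[j + 1] + alpha)
    return m
```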
Fig. 6. Estimation results for the Bamboo 2 sequence using various algorithms. First row: The reference image and estimation with Xu et al. Second row: Estimation with Ayvaci et al. Third row: Estimation with ours. Fourth row: Ground-truth occlusion (shown in black) and flow. The proposed method presents outstanding flow estimation (AEPE = 0.29) as well as occlusion detection (F1 = 0.54), even in the region with large occlusion where the other methods provide degenerated estimation (e.g., the right wing of the butterfly).
6. Experiments

All the experiments are performed on a system with a 3.30 GHz Intel Core i5-2500 CPU and an Nvidia GeForce GTX 285 GPU (240 CUDA cores). We validate our method on various image sets from the Sintel dataset [36] and the Middlebury flow dataset [37]. To assess the accuracy of the estimated flow, we compare the average endpoint error (AEPE) and the average angular error (AAE). For occlusion detection performance, we calculate F1 scores using the ground-truth occlusion maps. We assume the maximum deformation in each direction to be 64 pixels, and quantize each direction by 8 pixels, with the target precision μ = 0.05. The size of the correlation window is fixed to 30 × 30 pixels. The parameters controlling the relative influence and strictness of the different constraints are fixed to optimal values for all experiments: σg = 7.2, σr = 3.8, σd = 3.8, and α = 0.05.

Fig. 5 shows the effect of occlusion detection, particularly on object boundaries. We compare the full algorithm to the algorithm without occlusion detection (so-called w/o occ.). We simply impose a very large penalty on the matching cost using the occlusion
weight, so that the algorithm w/o occ. never selects the occlusion label lo = 1 at any node. The result from this algorithm (top right) shows decent flow estimation on homogeneous motion segments, but degenerated estimation in the occluded regions, e.g., the upper regions of the left arm and the fruit, and the back of the hair. In contrast, the proposed full algorithm not only successfully detects occlusion (bottom left) but also provides reasonable estimation in the occluded regions (bottom right). Since occlusion generally occurs on motion boundaries, the proposed method clearly helps to distinguish the boundaries.

Fig. 6 presents an illustrative comparison of our algorithm to other related methods. We used the source code provided on the authors' websites. The method of Xu et al. [12] (the first row, right), one of the top-performing methods on the Middlebury flow site, shows state-of-the-art estimation overall (AEPE = 0.44). However, the lack of occlusion detection causes degenerated estimation around the region with large occlusion (e.g., the right wing of the butterfly). Estimation with Ayvaci et al. [22] (the second row) finds very delicate motion boundaries, but both flow estimation
Table 1
Flow estimation error (AEPE/AAE).

Sequence    Ayvaci et al.    Xu et al.     Ours w/o occ    Ours
alley1      0.36/3.09        0.23/2.44     0.25/3.01       0.16/1.75
alley2      0.47/5.63        0.16/2.28     0.17/2.34       0.16/1.74
ambush5     1.37/28.26       0.70/9.37     0.56/8.32       0.44/7.17
bamboo1     0.34/5.46        0.27/4.11     0.25/3.94       0.24/4.00
bamboo2     0.67/7.39        0.44/5.63     0.33/6.48       0.30/5.42
bandage1    1.31/8.59        0.48/3.57     0.53/4.24       0.49/3.45
bandage2    0.63/7.01        0.34/3.87     0.32/4.07       0.26/3.68
market2     1.44/10.03       0.96/6.27     1.07/7.66       0.81/6.05
temple2     1.64/8.54        0.78/4.06     0.75/4.27       0.65/3.68
temple3     0.51/2.17        0.57/2.11     0.65/2.15       0.57/1.90
average     0.87/8.62        0.49/4.37     0.49/4.65       0.41/3.88
Fig. 7. Estimation results for real scenes: the Army and the Schefflera sequences. Detected occlusion is shown in black. Top row: The reference frames. Middle row: Flow estimation with the proposed method. Bottom row: The ground truth flow.
and occlusion detection are also severely deteriorated (AEPE=0.67, F1=0.09), particularly on the region with large displacement. In contrast, the proposed method (the third row) presents outstanding flow estimation (AEPE=0.29) as well as occlusion detection (F1=0.54) even on the problematic region. Fig. 8 demonstrates additional results, comparing the proposed algorithm with the method of Ayvaci et al. and the ground truth. In Fig. 7, we also present estimation results for real scenes in the Middlebury dataset, which makes the achieved results more convincing.
In Table 1, we show a quantitative analysis for several sequences in the Sintel dataset, comparing two state-of-the-art methods to our methods. The reference frame is the tenth frame of each sequence. We excluded sequences whose displacements are so large that no algorithm produced meaningful estimation (AEPE below 10). For reference, we also provide estimation results without the occlusion weight, denoted ours w/o occ. Compared to the method of Xu et al., ours shows lower AEPE on average, probably due to better estimation in the occluded regions, since ours w/o occ. performs similarly to the method of Xu et al. Table 2 compares precision, recall, and F1-score for occlusion detection between the method of Ayvaci et al. (the previous state of the art) and ours. While the method of Ayvaci et al. shows better precision in some sequences, the proposed method outperforms it in recall and yields a higher F1-score for most sequences. We note that the method of Ayvaci et al. degrades the accuracy of flow estimation to obtain occlusion detection, while our method even improves the accuracy, as shown in Table 1.

Our current implementation takes 934 s (15 min and 34 s) on average for a 640 × 480 RGB image, with parallel computation of the data matching costs on the GPU. With the implementations from the authors, the method of Xu et al. takes 253 s and the method of Ayvaci et al. takes 361 s on average on our system. If we use a 15 × 15 pixel window and a sparse search scheme as in [24], our method takes only 319 s on average with a 4% increase in error, which is comparable to the other methods. We believe significantly faster processing can be obtained with a full implementation that computes the message-passing on parallel graphics hardware [38].
Fig. 8. Estimation results for the Alley 2 and the Bandage 2 sequences. Detected occlusion is shown in black. Left column: The reference frames. Second column: Flow estimation with Ayvaci et al. Third column: Flow estimation with the proposed method. Right column: The ground truth flow.
Table 2
Occlusion detection evaluation (precision/recall/F1-score).

Sequence    Ayvaci et al.     Ours
alley1      0.65/0.08/0.15    0.73/0.52/0.61
alley2      0.45/0.11/0.17    0.69/0.52/0.59
ambush5     0.94/0.17/0.28    0.77/0.49/0.60
bamboo1     0.84/0.23/0.36    0.35/0.63/0.45
bamboo2     0.33/0.05/0.09    0.58/0.50/0.54
bandage1    0.97/0.34/0.50    0.57/0.56/0.57
bandage2    0.62/0.02/0.04    0.49/0.40/0.44
market2     0.58/0.33/0.42    0.59/0.31/0.41
temple2     0.48/0.06/0.11    0.60/0.43/0.50
temple3     0.93/0.45/0.61    0.49/0.47/0.48
average     0.68/0.18/0.27    0.59/0.48/0.52
7. Conclusion

In this work, we presented a novel support-weight based window matching method for simultaneously estimating optical flow and detecting occlusion. Our method works on a unified optimization framework, which requires neither explicit flow estimation nor additional backward flow computation for occlusion detection. The proposed support-weight provides an effective clue to detect occlusion and improves flow estimation in the occluded area. Experiments showed that our method yields highly competitive results for optical flow estimation as well as occlusion detection. Compared to the previous state-of-the-art method, the proposed method does not degrade optical flow performance to enhance detection. We currently assume that foreground and background objects are distinguishable by their color, so our algorithm may not produce good results in regions where that assumption does not hold. A more effective approach to distinguish those objects could significantly improve the results in the future. We also plan to use multiple labels for occlusion to specify the class of occlusion, e.g., the order of motion layers, rotation, viewpoint change, or severe illumination change.
Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) (2015R1A5A7036384), and the grant (no. 14-2016-010) from the Seoul National University Bundang Hospital (SNUBH) Research Fund.

References

[1] D. Mahajan, F.-C. Huang, W. Matusik, R. Ramamoorthi, P. Belhumeur, Moving gradients: a path-based method for plausible image interpolation, ACM Trans. Graphics (TOG) 28 (3) (2009) 42.
[2] B.-D. Choi, J.-W. Han, C.-S. Kim, S.-J. Ko, Motion-compensated frame interpolation using bilateral motion estimation and adaptive overlapped block motion compensation, IEEE Trans. Circuits Syst. Video Technol. 17 (4) (2007) 407–416.
[3] D. Cremers, S. Soatto, Motion competition: a variational approach to piecewise parametric motion segmentation, Int. J. Comput. Vis. 62 (3) (2005) 249–265.
[4] M.M. Chang, A.M. Tekalp, M.I. Sezan, Simultaneous motion estimation and segmentation, IEEE Trans. Image Process. 6 (9) (1997) 1326–1333.
[5] J. Xiao, M. Shah, Motion layer extraction in the presence of occlusion using graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1644–1659.
[6] J. Ascenso, C. Brites, F. Pereira, Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding, in: 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Citeseer, 2005, pp. 1–6.
[7] A. Puri, H.-M. Hang, D. Schilling, An efficient block-matching algorithm for motion-compensated coding, in: Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'87, 12, IEEE, 1987, pp. 1063–1066.
[8] J. Xiao, H. Cheng, H. Sawhney, C. Rao, M. Isnardi, Bilateral filtering-based optical flow estimation with occlusion detection, Comput. Vis. ECCV 2006 (2006) 211–224.
[9] C. Strecha, R. Fransens, L. Van Gool, A probabilistic approach to large displacement optical flow and occlusion detection, in: Statistical Methods in Video Processing, Springer, 2004, pp. 71–82.
[10] L. Alvarez, R. Deriche, T. Papadopoulo, J. Sánchez, Symmetrical dense optical flow estimation with occlusions detection, Int. J. Comput. Vis. 75 (3) (2007) 371–385.
[11] C. Lei, Y.-H. Yang, Optical flow estimation on coarse-to-fine region-trees using discrete optimization, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 1562–1569.
[12] L. Xu, J. Jia, Y. Matsushita, Motion detail preserving optical flow estimation, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1744–1757.
[13] C.L. Zitnick, T. Kanade, A cooperative algorithm for stereo matching and occlusion detection, IEEE Trans. Pattern Anal. Mach. Intell. 22 (7) (2000) 675–684.
[14] V. Kolmogorov, R. Zabih, Computing visual correspondence with occlusions using graph cuts, in: Proceedings of Eighth IEEE International Conference on Computer Vision, 2001. ICCV 2001, 2, IEEE, 2001, pp. 508–515.
[15] Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 23 (11) (2001) 1222–1239.
[16] J. Sun, Y. Li, S.B. Kang, H.-Y. Shum, Symmetric stereo matching for occlusion handling, in: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2, IEEE, 2005, pp. 399–406.
[17] J. Sun, N.-N. Zheng, H.-Y. Shum, Stereo matching using belief propagation, IEEE Trans. Pattern Anal. Mach. Intell. 25 (7) (2003) 787–800.
[18] C. Ballester, L. Garrido, V. Lazcano, V. Caselles, A TV-L1 optical flow method with occlusion detection, in: Pattern Recognition, Springer, 2012, pp. 31–40.
[19] D. Sun, C. Liu, H. Pfister, Local layering for joint motion estimation and occlusion detection, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 1098–1105.
[20] D. Fortun, P. Bouthemy, C. Kervrann, Aggregation of local parametric candidates with exemplar-based occlusion handling for optical flow, Comput. Vision Image Understanding 145 (2016a) 81–94.
[21] D. Fortun, P. Bouthemy, C. Kervrann, A variational aggregation framework for patch-based optical flow estimation, J. Math. Imaging Vis. 56 (2) (2016b) 280–299.
[22] A. Ayvaci, M. Raptis, S. Soatto, Sparse occlusion detection with optical flow, Int. J. Comput. Vis. 97 (3) (2012) 322–338.
[23] P. Chen, X. Zhang, P.C. Yuen, A. Mao, Combination of spatio-temporal and transform domain for sparse occlusion estimation by optical flow, Neurocomputing 214 (2016) 368–375.
[24] B. Glocker, N. Paragios, N. Komodakis, G. Tziritas, N. Navab, Optical flow estimation with uncertainties through dynamic MRFs, in: Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[25] K.J. Lee, D. Kwon, I.D. Yun, S.U. Lee, Optical flow estimation with adaptive convolution kernel prior on discrete framework, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 2504–2511.
[26] K.J. Lee, I.D. Yun, S.U. Lee, Adaptive large window correlation for optical flow estimation with discrete optimization, Image Vis. Comput. 31 (9) (2013) 631–639.
[27] K.-J. Yoon, I.S. Kweon, Adaptive support-weight approach for correspondence search, IEEE Trans. Pattern Anal. Mach. Intell. 28 (4) (2006) 650–656.
[28] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, M. Gelautz, Fast cost-volume filtering for visual correspondence and beyond, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 3017–3024.
[29] V. Kolmogorov, Convergent tree-reweighted message passing for energy minimization, IEEE Trans. Pattern Anal. Mach. Intell. 28 (10) (2006) 1568–1583.
[30] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, C. Rother, A comparative study of energy minimization methods for Markov random fields with smoothness-based priors, IEEE Trans. Pattern Anal. Mach. Intell. 30 (6) (2008) 1068–1080.
[31] A. Shekhovtsov, I. Kovtun, V. Hlaváč, Efficient MRF deformation model for non-rigid image matching, Comput. Vis. Image Understanding 112 (1) (2008) 91–99.
[32] K.J. Lee, D. Kwon, I.D. Yun, S.U. Lee, Deformable 3D volume registration using efficient MRFs model with decomposed nodes, in: British Machine Vision Conference, 2008, pp. 1–10.
[33] Y. Weiss, W.T. Freeman, On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs, IEEE Trans. Inf. Theor. 47 (2) (2001) 736–744.
[34] M.J. Wainwright, T.S. Jaakkola, A.S. Willsky, MAP estimation via agreement on trees: message-passing and linear programming, IEEE Trans. Inf. Theor. 51 (11) (2005) 3697–3717.
[35] P.F. Felzenszwalb, D.P. Huttenlocher, Efficient belief propagation for early vision, Int. J. Comput. Vis. 70 (1) (2006) 41–54.
[36] D.J. Butler, J. Wulff, G.B. Stanley, M.J. Black, A naturalistic open source movie for optical flow evaluation, in: Computer Vision–ECCV 2012, Springer, 2012, pp. 611–625.
[37] S. Baker, D. Scharstein, J. Lewis, S. Roth, M.J. Black, R. Szeliski, A database and evaluation methodology for optical flow, Int. J. Comput. Vis. 92 (1) (2011) 1–31.
[38] C.-K. Liang, C.-C. Cheng, Y.-C. Lai, L.-G. Chen, H.H. Chen, Hardware-efficient belief propagation, IEEE Trans. Circuits Syst. Video Technol. 21 (5) (2011) 525–537.