Pattern Recognition 55 (2016) 198–206
An O(1) disparity refinement method for stereo matching

Xiaoming Huang, Yu-Jin Zhang
Department of Electronic Engineering, Tsinghua University, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China

Article history: Received 10 March 2015; Received in revised form 9 December 2015; Accepted 21 January 2016; Available online 5 February 2016

Abstract
Disparity refinement is the final step of stereo matching but also its timing bottleneck due to high computational complexity. Weighted median filter refinement and non-local refinement are two typical refinement methods with O(N) computational complexity per pixel, where N denotes the maximum disparity. This paper presents an O(1) disparity refinement method based on belief aggregation and belief propagation. The aggregated belief, which represents the likelihood that a disparity value is correct, is first computed efficiently on a minimum spanning tree; belief propagation is then performed quickly on another minimum spanning tree in two sequential passes (first from leaf nodes to root, then from root to leaf nodes). Only 2 additions and 4 multiplications are required for each pixel over all disparity levels, so the computational complexity is O(1). Performance evaluation on the Middlebury data sets shows that the proposed method performs well in both accuracy and speed. © 2016 Elsevier Ltd. All rights reserved.

Keywords: Stereo matching; Disparity refinement; Belief aggregation; Belief propagation; Non-local

1. Introduction

Stereo matching algorithms usually consist of four steps [1]: matching cost computation, cost aggregation, disparity computation, and disparity refinement. Much work has been devoted to developing robust cost computation [2,3] and cost aggregation methods [4–10], while the high complexity of disparity refinement has become the timing bottleneck of stereo matching algorithms. For example, for the guided-image based algorithm [11], the average runtime of disparity refinement is about 6.5 seconds on the Middlebury data sets [12], as reported by Yang [13]. Traditional refinement steps include a left–right check [14], hole filling, and a median filter.

Weighted median filter refinement is widely adopted by many stereo matching algorithms (e.g., [11,15]), but the high complexity of this filter makes it the timing bottleneck. Recently, a constant-time weighted median filtering method [16] was proposed, which performs refinement with the help of a histogram whose size is the maximum disparity N. This algorithm is driven by recent progress on fast median filtering [17–19], fast algorithms [20–23] for bilateral filtering [24], and other fast edge-aware filtering [25,26]; these existing O(1) edge-aware filters can be applied to each histogram bin. The method shows good accuracy but low speed due to its O(N) computational complexity per pixel; the speed evaluation of this method is given in Section 4.2. Moreover,

Corresponding author. Tel.: +86 10 62798540; fax: +86 10 62770317. E-mail addresses: [email protected] (X. Huang), [email protected] (Y.-J. Zhang). http://dx.doi.org/10.1016/j.patcog.2016.01.025 0031-3203/© 2016 Elsevier Ltd. All rights reserved.

one main shortcoming of this refinement method is that the support windows are of fixed size.

Yang proposed a non-local refinement method [13] using non-local aggregation on an MST (minimum spanning tree) structure. All pixels are first divided into stable and unstable pixels by a left–right disparity check; a new cost volume is then computed based on this checked disparity map, followed by non-local aggregation at each disparity level and a winner-take-all operation to propagate disparity values from stable pixels to unstable pixels. However, the speed is still very low due to the O(N) computational complexity per pixel; the speed evaluation of this method is given in Section 4.2.

In previous work [27], we presented a fast disparity refinement method based only on belief propagation. The method performs well with complicated cost aggregation methods (e.g., guided filter aggregation [11], non-local aggregation [13]) but fails with simple cost aggregation methods such as box-filter aggregation [1].

In this paper, we propose a fast refinement method based on belief aggregation and belief propagation. All pixels first receive an initial disparity belief. We build a hybrid MST whose edge weights are determined by both disparity distance and color distance; belief aggregation is computed efficiently on this hybrid MST in two sequential passes (as in the cost aggregation of [13]: first from leaf nodes to root, then from root to leaf nodes). A pixel obtains greater aggregated belief if it has more neighbors that are close in both disparity and color. We then build another MST whose edge weights are determined by color distance only; belief propagation is performed quickly on this MST in two sequential passes (first from leaf nodes to root, then from root to leaf nodes). A pixel with lower aggregated belief receives propagation from a pixel with


Fig. 1. Proposed refinement process on the tsukuba data set. (a) Left image of the tsukuba data set. (b) Guided filter aggregation [11] followed by left–right check and hole filling. (c) Belief aggregation of (b); brighter color indicates higher belief. (d) Belief propagation of (c). Although the disparity of h and j is the same in (b), h has more neighbors close in both disparity and color, so h has greater aggregated belief in (c). Compared with k, pixel j has close color in (a) but smaller aggregated belief in (c), so j receives propagation from k, and the final disparity of j is assigned the disparity of k. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

similar color and higher aggregated belief. The proposed refinement process on the tsukuba data set is demonstrated in Fig. 1: the left image of tsukuba and the disparity map to be refined are shown in Fig. 1(a) and (b), the belief aggregation result in Fig. 1(c), and the belief propagation result in Fig. 1(d). In the proposed refinement method, only 2 additions and 4 multiplications are required for each pixel over all disparity levels, so the computational complexity is O(1). Performance evaluation on the Middlebury data sets [12] shows good performance in both accuracy and speed.

The remainder of the paper is organized as follows: previous work is introduced in Section 2. Section 3 gives the details of the proposed disparity refinement algorithm. Experimental results are presented in Section 4. Finally, the paper concludes in Section 5.

2. Previous work

In this section, we mainly review the MST (minimum spanning tree) based non-local filter [13]. The reference color/intensity image I is represented as a connected, undirected graph G = (V, E), where each node in V corresponds to a pixel in I and each edge in E connects a pair of neighboring pixels. The graph G is thus simply the standard 4-connected or 8-connected grid. For an edge e connecting pixels s and r, its weight is determined as

w(s, r) = w(r, s) = |I(s) − I(r)|    (1)

A tree T can be constructed by selecting a subset of edges from E. Yang [13] proposed to construct an MST connecting all the pixels such that the sum of its edge weights is minimized. For any two pixels p and q, their distance W(p, q) is the sum of the edge weights along the path between them in T, and

S(p, q) = exp(−W(p, q) / σ)    (2)

denotes the similarity between p and q, where σ is a parameter that adjusts the similarity between two nodes. Let C_d(p) denote the matching cost for pixel p at disparity level d. The final aggregated cost of pixel p at disparity level d is computed as

C_d^A(p) = Σ_{q ∈ I} S(p, q) C_d(q)    (3)

Different from local filtering-based methods, in the non-local cost aggregation method p receives support weights from all the pixels in I. Yang proved that the non-local cost aggregation can be accomplished in exactly linear time by traversing the tree structure in two sequential passes: first from leaf to root, then from root to leaf.

In the first pass from leaf to root, the intermediate aggregated cost C_d^{A↑}(p) of each node p is computed as

C_d^{A↑}(p) = C_d(p) + Σ_{q is a child of p} S(p, q) C_d^{A↑}(q)    (4)

Note that C_d^{A↑}(p) equals the final aggregated cost C_d^A(p) if p is the root node. In the second pass from root to leaf, the final aggregated cost C_d^A(p) of each node p is computed as

C_d^A(p) = S(p, q) C_d^A(q) + (1 − S^2(p, q)) C_d^{A↑}(p)    (5)

where pixel q is the parent of pixel p.

The non-local filter is an efficient filter with the following advantages:

1. It provides a non-local solution, which theoretically and experimentally outperforms local cost aggregation methods.
2. It has low computational complexity: only 2 addition/subtraction operations and 3 multiplication operations are required for each pixel at each disparity level.
3. It can be used for non-local disparity refinement, which has been shown to be more robust and effective than the weighted median filter refinement method presented in [11].
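The two-pass aggregation of (4) and (5) can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes the tree is stored as parent/children index arrays with nodes indexed root-first (all names are hypothetical):

```python
import math

def aggregate_on_tree(cost, parent, children, weight, sigma):
    """Two-pass non-local aggregation for one disparity level (Eqs. (4)-(5)).

    cost[p]     : matching cost C_d(p) of node p
    parent[p]   : parent index of p (the root has parent -1)
    children[p] : list of child indices of p
    weight[p]   : tree-edge weight between p and parent[p] (unused for the root)
    Nodes are assumed indexed 0..n-1 root-first (e.g., BFS order), so a
    reverse sweep visits every child before its parent.
    """
    n = len(cost)
    # Edge similarity S(p, parent[p]) = exp(-w / sigma), cf. Eq. (2)
    S = [math.exp(-weight[p] / sigma) if parent[p] >= 0 else 1.0
         for p in range(n)]

    # Pass 1, leaf -> root (Eq. (4)):
    # C^{A up}(p) = C_d(p) + sum over children q of S(p,q) * C^{A up}(q)
    up = list(cost)
    for p in reversed(range(n)):
        for q in children[p]:
            up[p] += S[q] * up[q]

    # Pass 2, root -> leaf (Eq. (5)):
    # C^A(p) = S(p,q) * C^A(q) + (1 - S^2(p,q)) * C^{A up}(p), q = parent of p
    agg = list(up)  # for the root, C^A equals C^{A up}
    for p in range(n):
        if parent[p] >= 0:
            agg[p] = S[p] * agg[parent[p]] + (1.0 - S[p] ** 2) * up[p]
    return agg
```

On a two-node chain with unit costs and an edge similarity of 0.5, both nodes end with an aggregated cost of 1.5, exactly what direct evaluation of (3) gives.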

3. Proposed disparity refinement

In this section, we first propose a belief aggregation method on a hybrid MST whose edge weights are determined by disparity distance and color distance, then present a belief propagation method on another MST whose edge weights are determined by color distance only, and lastly discuss the computational complexity of the algorithm.

3.1. Belief aggregation on hybrid MST

In a general local stereo matching algorithm, left and right disparity maps are obtained separately. A left–right disparity check then divides all the pixels into stable and unstable pixels: if the left disparity equals the corresponding right disparity, the pixel is regarded as stable, otherwise it is considered unstable. In addition, a hole filling step propagates disparity estimates from stable pixels to unstable pixels. Fig. 1(b) shows one disparity map obtained with the guided filter aggregation method [11] followed by left–right check and hole filling.

In this paper, we first define a disparity belief for each pixel to represent the likelihood that its disparity value is correct. Obviously, a pixel has greater disparity belief if it has a large number of neighbors that are close in both disparity and color. If we construct a hybrid MST


whose edge weight is determined by disparity distance and color distance, and initialize the disparity belief, then the aggregated belief can be computed on this hybrid MST in the same way as cost aggregation on an MST [13]. In the construction of the hybrid MST, the weight of the edge connecting pixels s and r is determined by disparity distance and color distance as

w_H(s, r) = w_H(r, s) = (1 − α)|D(s) − D(r)| + α|I(s) − I(r)|    (6)

where α is a balance parameter and D denotes the disparity map to be refined. Let B denote the initial disparity belief of D:

B(p) = 1 if p is stable; 0.1 if p is unstable.    (7)

Let B^A denote the aggregated belief. Similar to the cost aggregation in (3), B^A can be computed without normalization as

B^A(p) = Σ_{q ∈ I} S_H(p, q) B(q) = Σ_{q ∈ I} exp(−W_H(p, q) / σ_H) B(q)    (8)

where the subscript H indicates the hybrid MST and W_H(p, q) represents the distance from pixel p to pixel q, determined by the sum of the edge weights along the path between them in the hybrid MST. B^A can be computed efficiently on the hybrid MST in two sequential passes (as in the cost aggregation of [13]: first from leaf nodes to root, then from root to leaf nodes). Fig. 1(c) shows the aggregated belief of Fig. 1(b). Although the disparity of h and j is the same in Fig. 1(b), h has more neighbors close in both disparity and color, so h has greater aggregated belief in Fig. 1(c).

3.2. Belief propagation on MST

When r is the pixel to be refined, we need to find the largest neighbor area E that meets the following two conditions: 1) color and disparity are both similar inside area E; 2) the color of E is similar to that of r.

We construct another MST whose edge weight is determined only by color distance (the same as the MST in cost aggregation). For each pixel v, S(r, v) denotes the color similarity between r and v on this MST, and the aggregated belief B^A(v) represents the number of neighbors of v that are close in both disparity and color, so S(r, v)B^A(v) indicates the likelihood that r and v have the same disparity. When p = arg max_{v ∈ I} (S(r, v)B^A(v)), the disparity of p is the best disparity estimate of r, and max_{v ∈ I} (S(r, v)B^A(v)) measures the confidence of this best estimate. Since this confidence is propagated from the aggregated belief on the MST, it is defined as the propagated belief (denoted B^P), and the best disparity estimate is defined as the propagated disparity (denoted D^P). For each node r:

B^P(r) = max_{v ∈ I} (S(r, v) B^A(v))    (9)

D^P(r) = D(arg max_{v ∈ I} (S(r, v) B^A(v)))    (10)

The belief propagation result of Fig. 1(c) is shown in Fig. 1(d). Compared with pixel k, pixel j has close color but smaller aggregated belief, so j receives propagation from k, and the final disparity of j is assigned the disparity of k. Similar to the algorithm in [13], we can obtain the following claims:

Claim 1. Let T_r denote a sub tree with root node r and parent node s. Then the propagated belief from T_r to s is the maximum of: 1) the propagated belief from node r to s; 2) S(s, r) times the propagated belief from r's sub trees to r.

Claim 1 follows from (9) and the definition of the MST. Let v denote the nodes in r's sub trees; then the propagated belief from r's sub trees to r is max_v S(r, v)B^A(v). The propagated belief from T_r to s is the maximum of: 1) propagation from r to s, S(s, r)B^A(r); and 2) propagation from r's sub trees to s, max_v S(s, v)B^A(v) = max_v S(s, r)S(r, v)B^A(v) = S(s, r) max_v S(r, v)B^A(v), i.e. S(s, r) times the propagation from r's sub trees to r.

From Claim 1, the propagation can be performed from leaf nodes to root node. Let B^{P↑} denote the propagated belief obtained after this pass; then at each node s,

B^{P↑}(s) = max(B^A(s), max_{v is child of s} S(s, v) B^{P↑}(v))    (11)

Note that in (11), if node s has no child then B^{P↑}(s) = B^A(s), and if node s is the root node then B^P(s) = B^{P↑}(s) and the propagation from all nodes to the root node is finished.

Claim 2. Let T_r denote a sub tree with root node r and parent node s. Then the propagated belief from all nodes to node r satisfies

B^P(r) = max(B^{P↑}(r), S(s, r) B^P(s))    (12)

After disparity belief propagation from leaf nodes to root node, the root node has received propagation from all nodes, while the remaining nodes have only received propagation from their sub trees. Fig. 2(a) is an example of belief propagation from leaf nodes to root node v4: the propagation result B^{P↑}(v3) contains the propagation from v3's sub trees and itself (grouped in the red ellipse), and B^P(v4/v3) denotes the propagation v4 receives from all nodes other than v3 and its sub trees (grouped in the blue ellipse). According to Claim 1, B^P(v4), the propagation v4 receives from all nodes (grouped in the green rectangle), can be written as

B^P(v4) = max(S(v4, v3) B^{P↑}(v3), B^P(v4/v3))    (13)

Fig. 2(b) is an example of belief propagation from root node v4 to child node v3. According to Claim 1, B^P(v3), the propagation v3 receives from all nodes (grouped in the green rectangle), can be written as

B^P(v3) = max(B^{P↑}(v3), S(v4, v3) B^P(v4/v3))    (14)

If S(v4, v3) B^{P↑}(v3) ≤ B^P(v4/v3), then from (13)

B^P(v4) = B^P(v4/v3)

and (14) becomes

B^P(v3) = max(B^{P↑}(v3), S(v4, v3) B^P(v4))

If S(v4, v3) B^{P↑}(v3) > B^P(v4/v3), then from (13)

B^P(v4) = S(v4, v3) B^{P↑}(v3)

and, since S(v4, v3) ≤ 1,

B^{P↑}(v3) > S(v4, v3) B^P(v4/v3)
B^P(v3) = B^{P↑}(v3) = max(B^{P↑}(v3), S^2(v4, v3) B^{P↑}(v3)) = max(B^{P↑}(v3), S(v4, v3) B^P(v4))

Considering the above two cases, conclusion (12) is proved. Because B^P(v) = B^{P↑}(v) for the root node of the MST, the propagation result at each node can be computed using (12) by tracing from root node to leaf nodes. Hence, the whole propagation process is separated into two steps:

1) Propagation on the aggregated belief B^A from leaf nodes to root node using (11), with the intermediate disparity belief stored as B^{P↑}.


Fig. 2. Two belief propagation steps. (a) Belief propagation from leaf nodes to root node v4. B^{P↑}(v4) denotes the propagation v4 receives from all nodes (grouped in the green rectangle), B^{P↑}(v3) contains the propagation from v3's sub trees and itself (grouped in the red ellipse), and B^P(v4/v3) denotes the propagation v4 receives from all nodes other than v3 and its sub trees (grouped in the blue ellipse). (b) Belief propagation from root node v4 to child node v3. B^P(v3) denotes the propagation v3 receives from all nodes (grouped in the green rectangle), which can be proved to be B^P(v3) = max(B^{P↑}(v3), S(v4, v3) B^P(v4)). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

2) Propagation on the intermediate disparity belief B^{P↑} from root node to leaf nodes using (12), with the final result stored as B^P. In the propagation process, the disparity value of each pixel is updated using (10).

3.3. Computational complexity

As mentioned in Section 1, N denotes the maximum disparity in this section. In the non-local refinement method [13], a new cost volume is computed based on the left–right checked disparity map, followed by non-local aggregation on an MST at each disparity level and a winner-take-all operation to propagate disparity values from stable pixels to unstable pixels. The operations required for each pixel at each disparity level are 2 additions and 3 multiplications, so the total per pixel is 2N additions and 3N multiplications; the computational complexity per pixel is O(N). In the constant-time weighted median filter method [16], refinement is performed via a histogram of size N, and an existing O(1) edge-aware filter is applied to each histogram bin, so the total computational complexity per pixel is also O(N).

In the proposed refinement method, belief aggregation on the hybrid MST needs 2 additions and 3 multiplications per pixel, and belief propagation on the other MST needs 1 multiplication per pixel, so the total operations required for each pixel are 2 additions and 4 multiplications. In addition, we need time to build the two MSTs, which is independent of N. Clearly, the computational complexity per pixel is O(1).
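The two propagation passes of (11) and (12), together with the disparity update of (10), can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes the tree is stored as parent/children index arrays with nodes indexed root-first (all names are hypothetical):

```python
def propagate_belief(belief, disparity, parent, children, S):
    """Two-pass belief propagation sketch (Eqs. (10)-(12)).

    belief[p]    : aggregated belief B^A(p)
    disparity[p] : disparity D(p) to be refined
    parent[p]    : parent index of p (the root has parent -1)
    children[p]  : list of child indices of p
    S[p]         : similarity S(p, parent[p]) on the color MST (unused for root)
    Nodes are assumed indexed 0..n-1 root-first, so a reverse sweep visits
    every child before its parent.
    Returns the propagated belief B^P and propagated disparity D^P.
    """
    n = len(belief)
    # Pass 1, leaf -> root (Eq. (11)); src tracks the arg max for Eq. (10)
    up = list(belief)
    src = list(range(n))
    for p in reversed(range(n)):
        for c in children[p]:
            cand = S[c] * up[c]
            if cand > up[p]:
                up[p], src[p] = cand, src[c]

    # Pass 2, root -> leaf (Eq. (12)):
    # B^P(p) = max(B^{P up}(p), S(p, q) * B^P(q)), q = parent of p
    bp = list(up)  # for the root, B^P equals B^{P up}
    dp = [disparity[src[p]] for p in range(n)]
    for p in range(n):
        q = parent[p]
        if q >= 0 and S[p] * bp[q] > bp[p]:
            bp[p] = S[p] * bp[q]
            dp[p] = dp[q]  # p adopts the disparity of the stronger source
    return bp, dp
```

On a three-pixel chain where only the middle pixel has high belief, both low-belief neighbors adopt the middle pixel's disparity, which mirrors how unstable pixels receive propagation from a similar-colored pixel with higher aggregated belief.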

4. Experimental results

In this section, the experimental settings are introduced first, and the speed and accuracy evaluations are presented afterwards.

4.1. Experimental settings

Two typical refinement methods are compared with the proposed refinement method: the non-local refinement method [13]

(denoted as Nlocal) and the weighted median filter method [16] (denoted as WMF). In order to evaluate refinement performance on different cost aggregation methods, each disparity refinement method is evaluated with three aggregation methods: two complicated aggregation methods (the guided filter aggregation method [11] and the non-local aggregation method [13]) and the simple box-filter aggregation method [1]. The experiments are carried out on the Middlebury stereo benchmark [12]. In order to reliably evaluate performance on stereo pairs with different textures, we evaluate all the pairs that have ground truths available. We use as error metric the percentage of bad pixels with error threshold 1.

One shortcoming of the MST based non-local filter [13] is that the distance between two pixels is approximated by the sum of the edge weights along the path in the tree. This approximation leads to unreasonable support weights in textureless regions: although the pixels look almost identical there, color distances are usually small but nonzero. This causes the small-weight-accumulation problem, that is, many small-weight edges can accumulate along a long path and form an undesirably high weight in a textureless region. The left image of the baby3 dataset is presented in Fig. 3(a) with a textureless bottom region (shown in the blue rectangle); the support weight from one stable pixel (shown in green) to other pixels decreases quickly in Fig. 3(e), which makes the disparity refinement of unstable pixels in the bottom textureless region fail, as shown in Fig. 3(f). In order to suppress the influence of the small-weight-accumulation problem in textureless regions, the edge weight between neighboring pixels described in (1) is given a clip operation in the proposed refinement method:

w(s, r) = w(r, s) = |I(s) − I(r)| − 0.15  if |I(s) − I(r)| ≤ 2;  |I(s) − I(r)| otherwise    (15)

The support weight from the stable pixel to other pixels with the MST edge weight clip declines slowly in Fig. 3(g), which gives the disparity refinement of unstable pixels in the bottom textureless region better accuracy, as shown in Fig. 3(h). The non-local refinement [13] result is also presented in Fig. 3(d); it is comparable with the proposed refinement result without the MST edge weight clip.
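The clip of (15) can be illustrated as a small helper. This is a minimal sketch, assuming the subtractive reading of (15) with the result floored at zero (the floor is our assumption, added to keep edge weights non-negative; the constants 2 and 0.15 are those quoted in the text):

```python
def clipped_edge_weight(i_s, i_r, threshold=2.0, offset=0.15):
    """Edge weight of (1) with the clip of (15): color distances at or below
    `threshold` are reduced by `offset` (floored at zero, an added assumption)
    so that long chains of tiny weights in textureless regions cannot
    accumulate into a large path distance."""
    w = abs(i_s - i_r)
    if w <= threshold:
        return max(w - offset, 0.0)
    return w
```

With this clip, a path through many nearly identical pixels keeps a path distance W close to zero, so the support S(p, q) = exp(−W/σ) stays high across the textureless region.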


Fig. 3. The proposed refinement result with MST edge weight clip shows higher accuracy. (a) Left image of the baby3 dataset with a textureless bottom region (shown in the blue rectangle). (b), (c) Disparity ground truth and box-filter [1] aggregation result. (d) Non-local refinement [13] result of (c). (e) and (g) Support weight from one stable pixel (shown in green) to other pixels without and with the MST edge weight clip; the former decreases faster than the latter. (f) and (h) The proposed refinement results of (c) without and with the MST edge weight clip. In the textureless bottom region, the proposed refinement result with weight clip shows better accuracy. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

In the proposed method, we perform a left–right check and simple hole-filling as in [11] before refinement. The parameters of the proposed method are kept constant for all benchmark stereo pairs: σ = 0.04, σ_H = 0.0016, α = 0.05. For the non-local refinement method [13] and the weighted median filter method [16], the parameters follow the settings of the corresponding papers. The hardware platform for the tests is a desktop with an Intel Core Duo 3.00 GHz CPU and 2 GB RAM; no parallelism technique is utilized.

4.2. Speed evaluation

The speed comparison of the three refinement methods is shown in Table 1 and Fig. 4. Note that the times have been converted to time per 100,000 pixels to remove the influence of resolution. The time of the proposed refinement method stays constant while the times of the other two refinement methods increase with the maximum disparity, which is consistent with the computational complexity analysis.

4.3. Accuracy evaluation

Table 2 shows the accuracy evaluation on all Middlebury data sets [12] with three aggregation methods followed by three refinement methods. The percentages of erroneous pixels with error threshold 1 are used to evaluate accuracy, and the subscript numbers are the relative ranks of the methods on each data set. For visual comparison, we present the disparity results of four data sets with three aggregation methods followed by three refinement methods in Figs. 5 and 6. The bad pixels are marked in red, and the percentage of bad pixels is shown below each

Table 1
The time of different refinement methods on the Middlebury data sets [12]. Note that the times have been converted to time per 100,000 pixels to remove the influence of resolution.

Dataset | Max disparity | WMF (ms) | Nlocal (ms) | Proposed (ms)
Tsukuba | 15 | 95.85 | 96.75 | 122.07
Bull | 32 | 207.31 | 141.84 | 123.66
Baby1 | 46 | 312.15 | 174.73 | 126.30
Teddy | 59 | 405.33 | 207.70 | 125.04
Cloth2 | 76 | 530.55 | 250.61 | 126.71

disparity map. The complete disparity results can be found in the Supplementary material.

In general, the non-local refinement method [13] and the proposed refinement method show better accuracy than weighted median filter refinement [16]. A great advantage of the MST based non-local filter [13] is that it provides a more natural image-pixel similarity metric. It is more accurate than previous local filters, so every pixel in the image can correctly contribute to all the other pixels during filtering. In contrast, the weighted median filter refinement method [16] requires a fixed window, and only pixels inside this window provide support; it is impossible to find a window that is optimal for all data sets. Furthermore, the unstable regions handled by the non-local refinement method [13] can be huge, whereas the size of unstable regions that weighted median filter refinement [16] can handle is limited. Our proposed refinement is also based on the non-local filter. As a result, the non-local refinement method [13] and the proposed refinement method show better accuracy than weighted median filter refinement [16] on most of the test data sets.

X. Huang, Y.-J. Zhang / Pattern Recognition 55 (2016) 198–206

Fig. 4. Relationship between the time of the different refinement methods and the maximum disparity; only the time of the proposed refinement method stays constant, due to its O(1) computational complexity.

203

Compared with the non-local refinement method [13], our proposed refinement method shows better accuracy due to the following three factors.

First, one shortcoming of the MST based non-local refinement method [13] is the small-weight-accumulation problem in textureless regions described in Section 4.1; the MST edge weight clip presented in (15) is adopted by the proposed method.

Second, our proposed belief aggregation is computed on a hybrid MST as shown in (8), whose edge weights are determined by disparity distance and color distance as shown in (6). From (6) and (8), the contributions of disparity distance and color distance are both in exponential form. In the non-local refinement method [13], a new cost volume is computed, then non-locally aggregated on an MST and followed by a winner-take-all operation to select the best choice. The contribution of disparity distance is reflected in the computation of the new cost volume, and the contribution of color distance is reflected in the non-local aggregation. Clearly, the contribution of color distance is in exponential form, while the contribution of

Table 2
Accuracy evaluation on all Middlebury [12] data sets with error threshold 1. Three aggregation methods (guided filter [11], non-local filter [13] and box-filter [1]) followed by three refinement methods (weighted median filter [16], non-local [13] and proposed) are evaluated by the percentages of erroneous pixels. The number in parentheses after each value is the relative rank of the three refinement methods on that data set.

Dataset | Guided: WMF | Guided: Nlocal | Guided: Ours | Non-local: WMF | Non-local: Nlocal | Non-local: Ours | Box: WMF | Box: Nlocal | Box: Ours
Teddy | 12.11 (3) | 11.78 (2) | 11.00 (1) | 11.61 (3) | 10.62 (2) | 10.47 (1) | 13.38 (3) | 13.38 (3) | 12.77 (1)
Cones | 6.60 (3) | 6.26 (2) | 5.89 (1) | 6.92 (3) | 6.40 (1) | 6.82 (2) | 6.73 (2) | 6.75 (3) | 6.55 (1)
Venus | 0.99 (3) | 0.63 (1) | 0.86 (2) | 1.21 (2) | 1.13 (1) | 1.42 (3) | 1.13 (1) | 1.14 (2) | 1.18 (3)
Tsukuba | 4.26 (3) | 3.54 (1) | 3.65 (2) | 5.05 (3) | 3.43 (1) | 4.15 (2) | 4.30 (3) | 3.58 (1) | 3.98 (2)
Sawtooth | 1.94 (2) | 1.58 (1) | 2.01 (3) | 1.70 (2) | 1.51 (1) | 2.33 (3) | 1.64 (1) | 1.80 (2) | 2.23 (3)
Bull | 1.18 (2) | 0.98 (1) | 1.92 (3) | 0.87 (1) | 1.10 (2) | 1.11 (3) | 2.33 (2) | 1.45 (1) | 3.12 (3)
Poster | 1.17 (3) | 1.09 (2) | 0.90 (1) | 0.86 (1) | 1.21 (3) | 1.15 (2) | 1.12 (1) | 2.04 (3) | 2.00 (2)
Barn1 | 2.01 (3) | 1.43 (1) | 1.76 (2) | 2.08 (2) | 1.55 (1) | 2.26 (3) | 2.02 (3) | 1.71 (1) | 1.92 (2)
Barn2 | 2.85 (3) | 2.21 (1) | 2.64 (2) | 4.03 (3) | 2.72 (1) | 3.70 (2) | 2.71 (3) | 2.04 (1) | 2.60 (2)
Map | 5.33 (3) | 3.03 (1) | 4.47 (2) | 7.82 (3) | 4.21 (1) | 7.01 (2) | 4.64 (3) | 2.88 (1) | 3.68 (2)
Art | 15.25 (2) | 15.41 (3) | 13.57 (1) | 16.87 (3) | 15.69 (2) | 15.21 (1) | 16.15 (3) | 15.46 (2) | 14.23 (1)
Books | 17.98 (3) | 17.73 (1) | 17.98 (2) | 19.43 (2) | 18.97 (1) | 19.66 (3) | 18.99 (3) | 17.84 (2) | 17.42 (1)
Dolls | 11.40 (3) | 10.47 (2) | 9.25 (1) | 12.40 (3) | 10.98 (2) | 10.69 (1) | 10.64 (3) | 9.26 (2) | 8.90 (1)
Laundry | 20.01 (3) | 19.10 (2) | 18.13 (1) | 19.10 (3) | 18.24 (2) | 17.71 (1) | 21.28 (3) | 21.08 (2) | 19.48 (1)
Moebius | 16.22 (3) | 14.19 (2) | 13.02 (1) | 15.82 (3) | 12.92 (1) | 13.48 (2) | 16.20 (3) | 14.69 (2) | 13.82 (1)
Reindeer | 9.61 (1) | 12.01 (3) | 10.22 (2) | 14.38 (3) | 14.31 (2) | 13.54 (1) | 8.28 (2) | 11.01 (3) | 7.64 (1)
Aloe | 11.88 (3) | 10.81 (2) | 9.50 (1) | 12.13 (3) | 9.49 (2) | 8.40 (1) | 10.75 (3) | 9.65 (2) | 9.44 (1)
Baby1 | 9.03 (2) | 12.16 (3) | 9.02 (1) | 14.50 (1) | 16.72 (3) | 16.10 (2) | 8.49 (1) | 11.69 (3) | 8.60 (2)
Baby2 | 7.50 (1) | 14.58 (3) | 10.90 (2) | 17.61 (1) | 20.49 (3) | 20.28 (2) | 7.56 (1) | 13.71 (3) | 9.13 (2)
Baby3 | 10.33 (3) | 9.06 (2) | 7.57 (1) | 11.19 (3) | 10.76 (2) | 10.10 (1) | 10.53 (3) | 9.55 (2) | 7.38 (1)
Bowling1 | 22.53 (1) | 28.16 (3) | 22.70 (2) | 29.31 (1) | 31.83 (2) | 33.45 (3) | 25.89 (2) | 28.08 (3) | 21.61 (1)
Bowling2 | 17.32 (2) | 19.71 (3) | 16.89 (1) | 21.35 (1) | 21.40 (2) | 23.12 (3) | 14.60 (2) | 17.32 (3) | 14.06 (1)
Cloth1 | 6.87 (3) | 4.91 (2) | 4.84 (1) | 6.14 (3) | 4.16 (2) | 3.69 (1) | 5.19 (3) | 3.32 (2) | 3.03 (1)
Cloth2 | 11.54 (3) | 9.97 (2) | 9.72 (1) | 11.30 (3) | 9.76 (2) | 9.28 (1) | 9.31 (3) | 7.36 (2) | 7.08 (1)
Cloth3 | 6.41 (3) | 5.34 (1) | 5.40 (2) | 6.19 (3) | 4.94 (1) | 5.08 (2) | 5.33 (3) | 4.34 (1) | 4.35 (2)
Cloth4 | 8.06 (3) | 7.41 (2) | 7.03 (1) | 7.75 (3) | 7.40 (2) | 6.95 (1) | 7.01 (3) | 6.49 (2) | 5.98 (1)
Flowerpots | 21.93 (2) | 24.83 (3) | 20.27 (1) | 30.38 (2) | 31.11 (3) | 29.65 (1) | 20.36 (2) | 23.23 (3) | 18.94 (1)
Lampshade1 | 15.24 (3) | 12.88 (2) | 10.31 (1) | 18.51 (3) | 16.81 (1) | 18.17 (2) | 15.61 (3) | 13.26 (2) | 8.98 (1)
Lampshade2 | 21.78 (2) | 23.31 (3) | 12.21 (1) | 21.16 (3) | 18.02 (2) | 15.67 (1) | 27.06 (3) | 26.31 (2) | 9.40 (1)
Midd1 | 39.54 (3) | 31.63 (2) | 22.73 (1) | 25.47 (3) | 25.43 (2) | 19.93 (1) | 41.29 (3) | 40.50 (2) | 39.32 (1)
Midd2 | 32.07 (3) | 26.68 (2) | 24.83 (1) | 27.88 (2) | 33.11 (3) | 23.76 (1) | 37.86 (1) | 41.82 (3) | 39.64 (2)
Monopoly | 25.58 (3) | 23.54 (2) | 19.05 (1) | 23.32 (3) | 20.63 (2) | 20.04 (1) | 28.00 (2) | 31.28 (3) | 27.02 (1)
Plastic | 40.09 (2) | 40.07 (1) | 43.79 (3) | 50.31 (3) | 47.63 (1) | 48.00 (2) | 46.48 (2) | 48.14 (3) | 46.26 (1)
Rocks1 | 6.72 (2) | 7.18 (3) | 6.11 (1) | 5.59 (2) | 6.13 (3) | 5.17 (1) | 5.71 (2) | 5.81 (3) | 4.99 (1)
Rocks2 | 5.18 (3) | 5.16 (2) | 4.63 (1) | 5.32 (2) | 5.10 (1) | 5.39 (3) | 4.73 (2) | 4.86 (3) | 4.35 (1)
Wood1 | 9.94 (1) | 12.38 (2) | 12.99 (3) | 14.15 (2) | 13.66 (1) | 15.96 (3) | 9.30 (1) | 12.75 (3) | 9.85 (2)
Wood2 | 1.74 (1) | 3.25 (3) | 1.75 (2) | 2.93 (2) | 3.65 (3) | 2.89 (1) | 3.52 (2) | 4.11 (3) | 2.82 (1)
Avg. error | 12.44 (3) | 12.28 (2) | 10.80 (1) | 13.58 (3) | 13.06 (2) | 12.75 (1) | 12.87 (2) | 13.23 (3) | 11.45 (1)
Avg. rank | 2.49 (3) | 2.00 (2) | 1.54 (1) | 2.41 (3) | 1.81 (2) | 1.78 (1) | 2.32 (3) | 2.27 (2) | 1.43 (1)


[Fig. 5 panels, left to right: left image, right image, ground truth, guided agg., WMF, non-local, proposed. Bad-pixel percentages shown under each disparity map:
Teddy: guided agg. 23.17, WMF 12.11, non-local 11.78, proposed 11.00.
Cones: guided agg. 16.12, WMF 6.60, non-local 6.26, proposed 5.89.
Baby3: guided agg. 21.68, WMF 10.33, non-local 9.06, proposed 7.57.
Flowerpots: guided agg. 34.70, WMF 21.93, non-local 24.82, proposed 20.27.]

Fig. 5. Experimental results on some Middlebury data sets [12] with different refinement methods. Results for all data sets are attached as Supplementary material. (a) Left image. (b) Right image. (c) Ground truth of left disparity. (d) Guided filter aggregation [11] followed by left–right check. (e)–(g) Refinement results of (d) with different refinement methods (weighted median filter [16], non-local refinement [13] and proposed). Bad pixels are marked in red and their percentage is shown below each disparity map. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

[Fig. 6 panels, left to right: nonlocal agg., WMF, non-local, proposed, box agg., WMF, non-local, proposed. Bad-pixel percentages shown under each disparity map:
Teddy: nonlocal agg. 24.66, WMF 11.61, non-local 10.62, proposed 10.47; box agg. 25.93, WMF 13.38, non-local 13.38, proposed 12.77.
Cones: nonlocal agg. 18.77, WMF 6.92, non-local 6.40, proposed 6.82; box agg. 17.52, WMF 6.73, non-local 6.75, proposed 6.55.
Baby3: nonlocal agg. 23.71, WMF 11.19, non-local 10.76, proposed 10.10; box agg. 24.79, WMF 10.53, non-local 9.55, proposed 7.38.
Flowerpots: nonlocal agg. 46.78, WMF 30.38, non-local 31.11, proposed 29.65; box agg. 34.24, WMF 20.36, non-local 23.23, proposed 18.94.]

Fig. 6. (a) and (e) Non-local aggregation [13] and box-filter aggregation followed by left–right check. (b)–(d), (f)–(h) Refinement results of (a) and (e) with different refinement methods (weighted median filter [16], non-local refinement [13] and proposed). Bad pixels are marked in red and their percentage is shown below each disparity map. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 7. Non-local refinement [13] results with different computations of the new cost value. (a) and (b) Left image and ground truth of the Lampshade1 dataset. (c) Box-filter [1] aggregation result with a lot of disparity noise (shown in the blue rectangle). (d) and (e) Non-local refinement [13] results with the linear form (16) (error pixels 16.63%) and the exponential form (17) (error pixels 13.77%); more noisy pixels are suppressed by the latter form. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. The proposed refinement results without and with hole-filling. (a)–(c) Left image, right image, and ground truth of the Teddy data set. (d) Box-filter [1] aggregation result (one shadow region is inside the blue rectangle). (e) Non-local refinement [13] result (error pixels 15.14%). (f) The proposed result without hole-filling (error pixels 15.05%); (e) and (f) are comparable. (g) Hole-filling of (d). (h) The proposed refinement result based on hole-filling (error pixels 12.77%); accuracy in the shadow region is better than without hole-filling. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
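The simple hole-filling evaluated in Fig. 8 can be sketched as follows. This is one common variant (a minimal sketch, not the exact procedure of [11]): each invalid pixel takes the lower of the nearest valid disparities to its left and right in the same row, i.e. the background disparity. The function name and the use of −1 for invalid pixels are our own conventions.

```python
import numpy as np

def fill_holes(disp, invalid=-1):
    """Fill invalid pixels with the lower of the nearest valid
    disparities to the left and right along the scanline."""
    disp = disp.astype(float)
    out = disp.copy()
    for y in range(disp.shape[0]):
        row = disp[y]
        valid_x = np.flatnonzero(row != invalid)
        if valid_x.size == 0:
            continue  # nothing to copy from on this scanline
        for x in np.flatnonzero(row == invalid):
            left = valid_x[valid_x < x]
            right = valid_x[valid_x > x]
            cands = []
            if left.size:
                cands.append(row[left[-1]])
            if right.size:
                cands.append(row[right[0]])
            # Prefer the smaller (background) disparity, since holes
            # typically come from occluded background regions.
            out[y, x] = min(cands)
    return out
```

Filling with the background disparity is what improves the shadow regions in Fig. 8(h): occluded pixels belong to the background surface, so copying the smaller neighboring disparity is usually correct.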

In the non-local refinement method [13], the contribution of the disparity distance takes a linear form:

    C_d^new(p) = { |d − D(p)|   p is stable
                 { 0            p is unstable        (16)

where D(p) denotes the disparity of pixel p to be refined and C_d^new(p) denotes the new cost value of pixel p at disparity level d. Compared with this linear form, we find that the exponential form used in the proposed method is more robust on noisy initial disparity maps. To verify this conclusion, we modify the contribution of the disparity distance in the non-local refinement method [13] to the exponential form:

    C_d^new(p) = { exp(−|d − D(p)| / σ_c)   p is stable
                 { 0                        p is unstable        (17)

where σ_c is a smoothing parameter. Intuitively, suppose pixel p is stable but has a noisy initial disparity D(p), while a nearby pixel q is stable with a noise-free initial disparity D(q) and is expected to obtain the refined disparity d. Then |d − D(p)| is usually very large, so C_d^new(p) in (17) is close to zero. In the subsequent non-local aggregation, the contribution from the noisy pixel p to the noise-free pixel q is therefore small, which means the influence of the noisy pixel p is suppressed. In short, the influence of noisy pixels is suppressed more effectively by (17) than by (16). We evaluated non-local refinement [13] with the exponential form (17) and with the linear form (16) on the box-filter aggregation result, using σ_c = 10 and σ = 0.04 with the other parameters following the settings of the corresponding paper; the average percentage of error pixels on the Middlebury stereo benchmark [12] decreased from 12.97% to 12.43%. One example is shown in Fig. 7.

Finally, we perform simple hole-filling as in [11] after the left–right check in the proposed method, whereas hole-filling is not adopted by the non-local refinement method [13]. Hole-filling improves accuracy in shadow regions; one example is shown in Fig. 8.

Since the three factors discussed above lead the proposed refinement method to better accuracy than non-local refinement [13], we can also evaluate the two methods with the influence of these factors removed. That is, we compare the proposed method without weight clip and hole-filling (denoted as ours*) against non-local refinement modified with the exponential form (17) (denoted as Nlocal*). The parameter σ = 0.04 is best for both modified methods; the other parameters are the same as in the previous evaluation. The accuracy of the two modified methods is shown in Table 3 (details are given in the Supplementary material), and the two results are very close. Since both the proposed refinement and non-local refinement [13] are based on an MST-structured filter, this close accuracy performance is consistent with our expectation.

Table 3
Accuracy evaluation between non-local refinement [13] with (17) (denoted as Nlocal*) and the proposed method without weight clip and hole-filling (denoted as ours*); details are given in the Supplementary material. The results of the two modified methods are very close.

           | Guided aggregation | Non-local aggregation | Box aggregation
Dataset    | Nlocal*  | Ours*   | Nlocal*  | Ours*      | Nlocal* | Ours*
Avg. Error | 11.43    | 11.37   | 12.44    | 12.31      | 12.41   | 12.54
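The two cost assignments (16) and (17) discussed above can be sketched in NumPy as follows. This is a minimal sketch; the function names and the use of boolean stability masks are our own conventions, not the paper's.

```python
import numpy as np

def new_cost_linear(d, D, stable):
    # Eq. (16): |d - D(p)| for stable pixels, 0 for unstable pixels.
    return np.where(stable, np.abs(d - D), 0.0)

def new_cost_exponential(d, D, stable, sigma_c=10.0):
    # Eq. (17): exp(-|d - D(p)| / sigma_c) for stable pixels, 0 otherwise.
    return np.where(stable, np.exp(-np.abs(d - D) / sigma_c), 0.0)
```

The difference in robustness follows directly from the shapes of the two functions: a stable pixel whose noisy initial disparity is far from the candidate level d yields an unbounded value under (16) but a value near zero under (17), so in the subsequent aggregation its influence on neighboring pixels is negligible.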

5. Conclusions

We proposed an O(1) disparity refinement method based on belief aggregation and belief propagation. The proposed fast refinement method needs only 2 additions and 4 multiplications for each pixel over all disparity levels. Performance evaluation on the Middlebury stereo benchmark [12] demonstrates that the proposed refinement algorithm outperforms the non-local refinement method [13] and the weighted median filter method [16]. Future work should focus on constructing more effective trees to resolve the shortcomings of the MST-based non-local filter [13], and on a parallel implementation.

Conflict of interest None declared.

Acknowledgment This work was supported by National Natural Science Foundation of China (NNSF: 61171118).

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.patcog.2016.01.025.

References

[1] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. Comput. Vis. 47 (1–3) (2002) 7–42.
[2] S. Birchfield, C. Tomasi, A pixel dissimilarity measure that is insensitive to image sampling, IEEE Trans. Pattern Anal. Mach. Intell. 20 (4) (1998) 401–406.
[3] S. Chambon, A. Crouzil, Similarity measures for image matching despite occlusions in stereo vision, Pattern Recognit. 44 (9) (2011) 2063–2075.
[4] K.-J. Yoon, I.S. Kweon, Adaptive support-weight approach for correspondence search, IEEE Trans. Pattern Anal. Mach. Intell. 28 (4) (2006) 650–656.
[5] C. Richardt, D. Orr, I. Davies, A. Criminisi, N.A. Dodgson, Real-time spatiotemporal stereo matching using the dual-cross-bilateral grid, in: Proceedings of the European Conference on Computer Vision, 2010, pp. 510–523.
[6] M. Gong, Y. Zhang, Y.-H. Yang, Near-real-time stereo matching with slanted surface modeling and sub-pixel accuracy, Pattern Recognit. 44 (10–11) (2011) 2701–2710.
[7] X. Mei, X. Sun, W. Dong, H. Wang, X. Zhang, Segment-tree based cost aggregation for stereo matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 313–320.
[8] C. Pham, J. Jeon, Domain transformation-based efficient cost aggregation for local stereo matching, IEEE Trans. Circuits Syst. Video Technol. 23 (7) (2013) 1119–1130.
[9] J. Sun, N. Zheng, H.Y. Shum, Stereo matching using belief propagation, IEEE Trans. Pattern Anal. Mach. Intell. 25 (7) (2003) 787–800.
[10] Q. Yang, L. Wang, N. Ahuja, A constant-space belief propagation algorithm for stereo matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1458–1465.
[11] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, M. Gelautz, Fast cost-volume filtering for visual correspondence and beyond, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3017–3024.
[12] D. Scharstein, R. Szeliski, Middlebury Stereo Evaluation (Online). Available: 〈http://vision.middlebury.edu/stereo/eval〉.
[13] Q. Yang, A non-local cost aggregation method for stereo matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1402–1409.
[14] S.D. Cochran, G. Medioni, 3-D surface description from binocular stereo, IEEE Trans. Pattern Anal. Mach. Intell. 14 (10) (1992) 981–994.
[15] X. Sun, X. Mei, S. Jiao, M. Zhou, H. Wang, Stereo matching with reliable disparity propagation, in: Proceedings of 3DIMPVT, 2011, pp. 132–139.
[16] Z. Ma, K. He, Y. Wei, J. Sun, Constant time weighted median filtering for stereo matching and beyond, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 49–56.
[17] S. Perreault, P. Hébert, Median filtering in constant time, IEEE Trans. Image Process. 16 (9) (2007) 2389–2394.
[18] D. Cline, K.B. White, P.K. Egbert, Fast 8-bit median filtering based on separability, in: Proceedings of the IEEE International Conference on Image Processing, 2007, pp. 281–284.
[19] M. Kass, J. Solomon, Smoothed local histogram filters, SIGGRAPH, 2010, p. 100.
[20] F. Durand, J. Dorsey, Fast bilateral filtering for the display of high-dynamic-range images, SIGGRAPH, 2002, pp. 257–266.
[21] S. Paris, F. Durand, A fast approximation of the bilateral filter using a signal processing approach, in: Proceedings of the European Conference on Computer Vision, 2006, pp. 568–580.
[22] F. Porikli, Constant time O(1) bilateral filtering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[23] Q. Yang, K.-H. Tan, N. Ahuja, Real-time O(1) bilateral filtering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 557–564.
[24] C. Tomasi, R. Manduchi, Bilateral filtering for gray and color images, in: Proceedings of the IEEE International Conference on Computer Vision, 1998, pp. 839–846.
[25] K. He, J. Sun, X. Tang, Guided image filtering, in: Proceedings of the European Conference on Computer Vision, 2010, pp. 1–14.
[26] E.S. Gastal, M.M. Oliveira, Domain transform for edge-aware image and video processing, SIGGRAPH, 2011, p. 69.
[27] X. Huang, G. Cui, Y. Zhang, A fast non-local disparity refinement method for stereo matching, in: Proceedings of the IEEE International Conference on Image Processing, 2014, pp. 3823–3827.

Xiaoming Huang received the master's and bachelor's degrees from Peking University and Lanzhou University, respectively. He is currently a Ph.D. candidate in the Department of Electronic Engineering at Tsinghua University, Beijing, China. His research interests include image processing, computer vision, and machine learning.

Yu-Jin Zhang received the Ph.D. degree in applied science from the Montefiore Institute at the State University of Liège, Belgium, in 1989. He was a postdoctoral fellow and research fellow with the Department of Applied Physics and the Department of Electrical Engineering at the Delft University of Technology, the Netherlands, from 1989 to 1993. In 1993, he joined the Department of Electronic Engineering at Tsinghua University, Beijing, China, where he has been a professor of image engineering since 1997. He has authored more than 30 books and published more than 400 papers in the areas of image engineering (image processing, image analysis, and image understanding). He is the director of the academic committee of the China Society of Image and Graphics. He is a senior member of the IEEE and a fellow of SPIE.