Pattern Recognition 55 (2016) 198–206
An O(1) disparity refinement method for stereo matching

Xiaoming Huang, Yu-Jin Zhang
Department of Electronic Engineering, Tsinghua University, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China

Article history: Received 10 March 2015; Received in revised form 9 December 2015; Accepted 21 January 2016; Available online 5 February 2016

Abstract
Disparity refinement is the final step of stereo matching but also its timing bottleneck due to high computational complexity. Weighted median filter refinement and non-local refinement are two typical refinement methods with O(N) computational complexity per pixel, where N denotes the maximum disparity. This paper presents an O(1) disparity refinement method based on belief aggregation and belief propagation. The aggregated belief, which represents the likelihood that a disparity value is correct, is first computed efficiently on a minimum spanning tree; belief propagation is then performed quickly on another minimum spanning tree in two sequential passes (first from leaf nodes to root, then from root to leaf nodes). Only 2 additions and 4 multiplications are required for each pixel over all disparity levels, so the computational complexity is O(1). Performance evaluation on the Middlebury data sets shows that the proposed method performs well in both accuracy and speed. © 2016 Elsevier Ltd. All rights reserved.

Keywords: Stereo matching; Disparity refinement; Belief aggregation; Belief propagation; Non-local

1. Introduction

Stereo matching algorithms usually consist of four steps [1]: matching cost computation, cost aggregation, disparity computation, and disparity refinement. Much work has been devoted to developing robust cost computation [2,3] and cost aggregation methods [4–10], while the high complexity of disparity refinement has become the timing bottleneck of stereo matching algorithms. For example, for the guided-image based algorithm [11], the average runtime of disparity refinement is about 6.5 seconds on the Middlebury data sets [12], as reported by Yang [13]. Traditional refinement steps include a left–right check [14], hole filling, and a median filter.

Weighted median filter refinement is widely adopted by many stereo matching algorithms (e.g., [11,15]), but the high complexity of this filter makes it the timing bottleneck. Recently, a constant-time weighted median filtering method [16] was proposed, which performs refinement with the help of a histogram whose size is the maximum disparity N. This algorithm is driven by recent progress on fast median filtering [17–19], fast algorithms [20–23] for bilateral filtering [24], and other fast edge-aware filtering [25,26]; these existing O(1) edge-aware filters can be applied to each histogram bin. The method shows good accuracy but low speed due to its O(N) computational complexity per pixel; the speed evaluation of this method is given in Section 4.2. Moreover,

Corresponding author. Tel.: +86 10 62798540; fax: +86 10 62770317. E-mail addresses: [email protected] (X. Huang), [email protected] (Y.-J. Zhang). http://dx.doi.org/10.1016/j.patcog.2016.01.025 0031-3203/© 2016 Elsevier Ltd. All rights reserved.

one main shortcoming of this refinement method is that the support windows are of fixed size.

Yang proposed a non-local refinement method [13] using non-local aggregation on an MST (minimum spanning tree) structure. All pixels are first divided into stable and unstable pixels by a left–right disparity check; a new cost volume is then computed based on this checked disparity map, followed by non-local aggregation at each disparity level and a winner-take-all operation to propagate disparity values from stable pixels to unstable pixels. However, the speed is still very low due to the O(N) computational complexity per pixel; the speed evaluation of this method is given in Section 4.2.

In previous work [27], we presented a fast disparity refinement method based only on belief propagation. The method performs well with complicated cost aggregation methods (e.g., guided filter aggregation [11], non-local aggregation [13]) but fails with simple cost aggregation methods such as box-filter aggregation [1].

In this paper, we propose a fast refinement method based on belief aggregation and belief propagation. All pixels first receive an initial disparity belief. We build a hybrid MST whose edge weights are determined by both disparity distance and color distance; belief aggregation is computed efficiently on this hybrid MST in two sequential passes (as in the cost aggregation of [13]: first from leaf nodes to root, then from root to leaf nodes). A pixel obtains greater aggregated belief if it has more neighbors that are close in both disparity and color. We then build another MST whose edge weights are determined by color distance only; belief propagation is performed quickly on this MST in two sequential passes (first from leaf nodes to root, then from root to leaf nodes). A pixel with lower aggregated belief receives propagation from a pixel with


Fig. 1. Proposed refinement process on the tsukuba data set. (a) Left image of the tsukuba data set. (b) Guided filter aggregation [11] followed by left–right check and hole filling. (c) Belief aggregation of (b); brighter color indicates higher belief. (d) Belief propagation of (c). Although the disparity of h and j is the same in (b), h has more neighbors close in both disparity and color, so h has greater aggregated belief in (c). Compared with k, pixel j has close color in (a) but smaller aggregated belief in (c), so j receives propagation from k, and the final disparity of j is assigned the disparity of k. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

similar color and higher aggregated belief. The proposed refinement process on the tsukuba data set is demonstrated in Fig. 1: the left image of tsukuba and the disparity map to be refined are shown in Fig. 1(a) and (b), the belief aggregation result in Fig. 1(c), and the belief propagation result in Fig. 1(d). In the proposed refinement method, only 2 additions and 4 multiplications are required for each pixel over all disparity levels, so the computational complexity is O(1). Performance evaluation on the Middlebury data sets [12] shows good performance in both accuracy and speed.

The remainder of the paper is organized as follows: previous work is introduced in Section 2. Section 3 gives the details of the proposed disparity refinement algorithm. Experimental results are presented in Section 4. Finally, the paper concludes in Section 5.

2. Previous work

In this section, we mainly review the MST (minimum spanning tree) based non-local filter [13]. The reference color/intensity image I is represented as a connected, undirected graph G = (V, E), where each node in V corresponds to a pixel in I and each edge in E connects a pair of neighboring pixels. The graph G is thus simply the standard 4-connected or 8-connected grid. For an edge e connecting pixels s and r, its weight is determined as

w(s, r) = w(r, s) = |I(s) − I(r)|    (1)

A tree T can be constructed by selecting a subset of edges from E. Yang [13] proposed to construct an MST connecting all the pixels such that the sum of its edge weights is minimized. For any two pixels p and q, their distance W(p, q) is the sum of the edge weights along the path between them in T, and

S(p, q) = exp(−W(p, q) / σ)    (2)

denotes the similarity between p and q, where σ is a parameter that adjusts the similarity between two nodes. Let C_d(p) denote the matching cost for pixel p at disparity level d. The final aggregated cost of pixel p at disparity level d is computed as

C_d^A(p) = Σ_{q ∈ I} S(p, q) C_d(q)    (3)

Different from local filtering-based methods, in the non-local cost aggregation method p receives support weights from all the pixels in I. Yang proved that the non-local cost aggregation can be accomplished in exactly linear time by traversing the tree structure in two sequential passes: first from leaf to root, then from root to leaf.

In the first pass from leaf to root, the intermediate aggregated cost C_d^{A↑}(p) of each node p is computed as

C_d^{A↑}(p) = C_d(p) + Σ_{q is a child of p} S(p, q) C_d^{A↑}(q)    (4)

Note that C_d^{A↑}(p) equals the final aggregated cost C_d^A(p) if p is the root node. In the second pass from root to leaf, the final aggregated cost C_d^A(p) of each node p is computed as

C_d^A(p) = S(p, q) C_d^A(q) + (1 − S^2(p, q)) C_d^{A↑}(p)    (5)

where pixel q is the parent of pixel p.

The non-local filter is an efficient filter with the following advantages:

1. It provides a non-local solution, which theoretically and experimentally outperforms local cost aggregation methods.
2. It has low computational complexity: only 2 addition/subtraction operations and 3 multiplication operations are required for each pixel at each disparity level.
3. It can be used for non-local disparity refinement, which has been shown to be more robust and effective than the weighted median filter refinement method presented in [11].
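The two-pass aggregation of (4) and (5) can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes the tree is stored as parent/children index arrays with nodes indexed root-first (all names are hypothetical):

```python
import math

def aggregate_on_tree(cost, parent, children, weight, sigma):
    """Two-pass non-local aggregation for one disparity level (Eqs. (4)-(5)).

    cost[p]     : matching cost C_d(p) of node p
    parent[p]   : parent index of p (the root has parent -1)
    children[p] : list of child indices of p
    weight[p]   : tree-edge weight between p and parent[p] (unused for the root)
    Nodes are assumed indexed 0..n-1 root-first (e.g., BFS order), so a
    reverse sweep visits every child before its parent.
    """
    n = len(cost)
    # Edge similarity S(p, parent[p]) = exp(-w / sigma), cf. Eq. (2)
    S = [math.exp(-weight[p] / sigma) if parent[p] >= 0 else 1.0
         for p in range(n)]

    # Pass 1, leaf -> root (Eq. (4)):
    # C^{A up}(p) = C_d(p) + sum over children q of S(p,q) * C^{A up}(q)
    up = list(cost)
    for p in reversed(range(n)):
        for q in children[p]:
            up[p] += S[q] * up[q]

    # Pass 2, root -> leaf (Eq. (5)):
    # C^A(p) = S(p,q) * C^A(q) + (1 - S^2(p,q)) * C^{A up}(p), q = parent of p
    agg = list(up)  # for the root, C^A equals C^{A up}
    for p in range(n):
        if parent[p] >= 0:
            agg[p] = S[p] * agg[parent[p]] + (1.0 - S[p] ** 2) * up[p]
    return agg
```

On a two-node chain with unit costs and an edge similarity of 0.5, both nodes end with an aggregated cost of 1.5, exactly what direct evaluation of (3) gives.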

3. Proposed disparity refinement

In this section, we first propose a belief aggregation method on a hybrid MST whose edge weights are determined by disparity distance and color distance, then present a belief propagation method on another MST whose edge weights are determined by color distance only, and lastly discuss the computational complexity of the algorithm.

3.1. Belief aggregation on hybrid MST

In a general local stereo matching algorithm, left and right disparity maps are obtained separately. A left–right disparity check then divides all the pixels into stable and unstable pixels: if the left disparity equals the corresponding right disparity, the pixel is regarded as stable, otherwise it is considered unstable. In addition, a hole filling step propagates disparity estimates from stable pixels to unstable pixels. Fig. 1(b) shows one disparity map obtained with the guided filter aggregation method [11] followed by left–right check and hole filling.

In this paper, we first define a disparity belief for each pixel to represent the likelihood that its disparity value is correct. Obviously, a pixel has greater disparity belief if it has a large number of neighbors that are close in both disparity and color. If we construct a hybrid MST


whose edge weight is determined by disparity distance and color distance, and initialize the disparity belief, then the aggregated belief can be computed on this hybrid MST in the same way as cost aggregation on an MST [13]. In the construction of the hybrid MST, the weight of the edge connecting pixels s and r is determined by disparity distance and color distance as

w_H(s, r) = w_H(r, s) = (1 − α)|D(s) − D(r)| + α|I(s) − I(r)|    (6)

where α is a balance parameter and D denotes the disparity map to be refined. Let B denote the initial disparity belief of D:

B(p) = 1 if p is stable; 0.1 if p is unstable.    (7)

Let B^A denote the aggregated belief. Similar to the cost aggregation in (3), B^A can be computed without normalization as

B^A(p) = Σ_{q ∈ I} S_H(p, q) B(q) = Σ_{q ∈ I} exp(−W_H(p, q) / σ_H) B(q)    (8)

where the subscript H indicates the hybrid MST and W_H(p, q) represents the distance from pixel p to pixel q, determined by the sum of the edge weights along the path between them in the hybrid MST. B^A can be computed efficiently on the hybrid MST in two sequential passes (as in the cost aggregation of [13]: first from leaf nodes to root, then from root to leaf nodes). Fig. 1(c) shows the aggregated belief of Fig. 1(b). Although the disparity of h and j is the same in Fig. 1(b), h has more neighbors close in both disparity and color, so h has greater aggregated belief in Fig. 1(c).

3.2. Belief propagation on MST

When r is the pixel to be refined, we need to find the largest neighbor area E that meets the following two conditions: 1) color and disparity are both similar inside area E; 2) the color of E is similar to that of r.

We construct another MST whose edge weight is determined only by color distance (the same as the MST in cost aggregation). For each pixel v, S(r, v) denotes the color similarity between r and v on this MST, and the aggregated belief B^A(v) represents the number of neighbors of v that are close in both disparity and color, so S(r, v)B^A(v) indicates the likelihood that r and v have the same disparity. When p = arg max_{v ∈ I} (S(r, v)B^A(v)), the disparity of p is the best disparity estimate of r, and max_{v ∈ I} (S(r, v)B^A(v)) measures the confidence of this best estimate. Since this confidence is propagated from the aggregated belief on the MST, it is defined as the propagated belief (denoted B^P), and the best disparity estimate is defined as the propagated disparity (denoted D^P). For each node r:

B^P(r) = max_{v ∈ I} (S(r, v) B^A(v))    (9)

D^P(r) = D(arg max_{v ∈ I} (S(r, v) B^A(v)))    (10)

The belief propagation result of Fig. 1(c) is shown in Fig. 1(d). Compared with pixel k, pixel j has close color but smaller aggregated belief, so j receives propagation from k, and the final disparity of j is assigned the disparity of k. Similar to the algorithm in [13], we can obtain the following claims:

Claim 1. Let T_r denote a sub tree with root node r and parent node s. Then the propagated belief from T_r to s is the maximum of: 1) the propagated belief from node r to s; 2) S(s, r) times the propagated belief from r's sub trees to r.

Claim 1 follows from (9) and the definition of the MST. Let v denote the nodes in r's sub trees; then the propagated belief from r's sub trees to r is max_v S(r, v)B^A(v). The propagated belief from T_r to s is the maximum of: 1) propagation from r to s, S(s, r)B^A(r); and 2) propagation from r's sub trees to s, max_v S(s, v)B^A(v) = max_v S(s, r)S(r, v)B^A(v) = S(s, r) max_v S(r, v)B^A(v), i.e. S(s, r) times the propagation from r's sub trees to r.

From Claim 1, the propagation can be performed from leaf nodes to root node. Let B^{P↑} denote the propagated belief obtained after this pass; then at each node s,

B^{P↑}(s) = max(B^A(s), max_{v is child of s} S(s, v) B^{P↑}(v))    (11)

Note that in (11), if node s has no child then B^{P↑}(s) = B^A(s), and if node s is the root node then B^P(s) = B^{P↑}(s) and the propagation from all nodes to the root node is finished.

Claim 2. Let T_r denote a sub tree with root node r and parent node s. Then the propagated belief from all nodes to node r satisfies

B^P(r) = max(B^{P↑}(r), S(s, r) B^P(s))    (12)

After disparity belief propagation from leaf nodes to root node, the root node has received propagation from all nodes, while the remaining nodes have only received propagation from their sub trees. Fig. 2(a) is an example of belief propagation from leaf nodes to root node v4: the propagation result B^{P↑}(v3) contains the propagation from v3's sub trees and itself (grouped in the red ellipse), and B^P(v4/v3) denotes the propagation v4 receives from all nodes other than v3 and its sub trees (grouped in the blue ellipse). According to Claim 1, B^P(v4), the propagation v4 receives from all nodes (grouped in the green rectangle), can be written as

B^P(v4) = max(S(v4, v3) B^{P↑}(v3), B^P(v4/v3))    (13)

Fig. 2(b) is an example of belief propagation from root node v4 to child node v3. According to Claim 1, B^P(v3), the propagation v3 receives from all nodes (grouped in the green rectangle), can be written as

B^P(v3) = max(B^{P↑}(v3), S(v4, v3) B^P(v4/v3))    (14)

If S(v4, v3) B^{P↑}(v3) ≤ B^P(v4/v3), then from (13)

B^P(v4) = B^P(v4/v3)

and (14) becomes

B^P(v3) = max(B^{P↑}(v3), S(v4, v3) B^P(v4))

If S(v4, v3) B^{P↑}(v3) > B^P(v4/v3), then from (13)

B^P(v4) = S(v4, v3) B^{P↑}(v3)

and, since S(v4, v3) ≤ 1,

B^{P↑}(v3) > S(v4, v3) B^P(v4/v3)
B^P(v3) = B^{P↑}(v3) = max(B^{P↑}(v3), S^2(v4, v3) B^{P↑}(v3)) = max(B^{P↑}(v3), S(v4, v3) B^P(v4))

Considering the above two cases, conclusion (12) is proved. Because B^P(v) = B^{P↑}(v) for the root node of the MST, the propagation result at each node can be computed using (12) by tracing from root node to leaf nodes. Hence, the whole propagation process is separated into two steps:

1) Propagation on the aggregated belief B^A from leaf nodes to root node using (11), with the intermediate disparity belief stored as B^{P↑}.


Fig. 2. Two belief propagation steps. (a) Belief propagation from leaf nodes to root node v4. B^{P↑}(v4) denotes the propagation v4 receives from all nodes (grouped in the green rectangle), B^{P↑}(v3) contains the propagation from v3's sub trees and itself (grouped in the red ellipse), and B^P(v4/v3) denotes the propagation v4 receives from all nodes other than v3 and its sub trees (grouped in the blue ellipse). (b) Belief propagation from root node v4 to child node v3. B^P(v3) denotes the propagation v3 receives from all nodes (grouped in the green rectangle), which can be proved to be B^P(v3) = max(B^{P↑}(v3), S(v4, v3) B^P(v4)). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

2) Propagation on the intermediate disparity belief B^{P↑} from root node to leaf nodes using (12), with the final result stored as B^P. In the propagation process, the disparity value of each pixel is updated using (10).

3.3. Computational complexity

As mentioned in Section 1, N denotes the maximum disparity in this section. In the non-local refinement method [13], a new cost volume is computed based on the left–right checked disparity map, followed by non-local aggregation on an MST at each disparity level and a winner-take-all operation to propagate disparity values from stable pixels to unstable pixels. The operations required for each pixel at each disparity level are 2 additions and 3 multiplications, so the total per pixel is 2N additions and 3N multiplications; the computational complexity per pixel is O(N). In the constant-time weighted median filter method [16], refinement is performed via a histogram of size N, and an existing O(1) edge-aware filter is applied to each histogram bin, so the total computational complexity per pixel is also O(N).

In the proposed refinement method, belief aggregation on the hybrid MST needs 2 additions and 3 multiplications per pixel, and belief propagation on the other MST needs 1 multiplication per pixel, so the total operations required for each pixel are 2 additions and 4 multiplications. In addition, we need time to build the two MSTs, which is independent of N. Clearly, the computational complexity per pixel is O(1).
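The two propagation passes of (11) and (12), together with the disparity update of (10), can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes the tree is stored as parent/children index arrays with nodes indexed root-first (all names are hypothetical):

```python
def propagate_belief(belief, disparity, parent, children, S):
    """Two-pass belief propagation sketch (Eqs. (10)-(12)).

    belief[p]    : aggregated belief B^A(p)
    disparity[p] : disparity D(p) to be refined
    parent[p]    : parent index of p (the root has parent -1)
    children[p]  : list of child indices of p
    S[p]         : similarity S(p, parent[p]) on the color MST (unused for root)
    Nodes are assumed indexed 0..n-1 root-first, so a reverse sweep visits
    every child before its parent.
    Returns the propagated belief B^P and propagated disparity D^P.
    """
    n = len(belief)
    # Pass 1, leaf -> root (Eq. (11)); src tracks the arg max for Eq. (10)
    up = list(belief)
    src = list(range(n))
    for p in reversed(range(n)):
        for c in children[p]:
            cand = S[c] * up[c]
            if cand > up[p]:
                up[p], src[p] = cand, src[c]

    # Pass 2, root -> leaf (Eq. (12)):
    # B^P(p) = max(B^{P up}(p), S(p, q) * B^P(q)), q = parent of p
    bp = list(up)  # for the root, B^P equals B^{P up}
    dp = [disparity[src[p]] for p in range(n)]
    for p in range(n):
        q = parent[p]
        if q >= 0 and S[p] * bp[q] > bp[p]:
            bp[p] = S[p] * bp[q]
            dp[p] = dp[q]  # p adopts the disparity of the stronger source
    return bp, dp
```

On a three-pixel chain where only the middle pixel has high belief, both low-belief neighbors adopt the middle pixel's disparity, which mirrors how unstable pixels receive propagation from a similar-colored pixel with higher aggregated belief.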

4. Experimental results

In this section, the experimental settings are introduced first, and the speed and accuracy evaluations are presented afterwards.

4.1. Experimental settings

Two typical refinement methods are compared with the proposed refinement method: the non-local refinement method [13]

(denoted as Nlocal) and the weighted median filter method [16] (denoted as WMF). In order to evaluate refinement performance on different cost aggregation methods, each disparity refinement method is evaluated with three aggregation methods: two complicated aggregation methods (the guided filter aggregation method [11] and the non-local aggregation method [13]) and the simple box-filter aggregation method [1]. The experiments are carried out on the Middlebury stereo benchmark [12]. In order to reliably evaluate performance on stereo pairs with different textures, we evaluate all the pairs that have ground truths available. We use as error metric the percentage of bad pixels with error threshold 1.

One shortcoming of the MST based non-local filter [13] is that the distance between two pixels is approximated by the sum of the edge weights along the path in the tree. This approximation leads to unreasonable support weights in textureless regions: although the pixels look almost identical there, color distances are usually small but nonzero. This causes the small-weight-accumulation problem, that is, many small-weight edges can accumulate along a long path and form an undesirably high weight in a textureless region. The left image of the baby3 dataset is presented in Fig. 3(a) with a textureless bottom region (shown in the blue rectangle); the support weight from one stable pixel (shown in green) to other pixels decreases quickly in Fig. 3(e), which makes the disparity refinement of unstable pixels in the bottom textureless region fail, as shown in Fig. 3(f). In order to suppress the influence of the small-weight-accumulation problem in textureless regions, the edge weight between neighboring pixels described in (1) is given a clip operation in the proposed refinement method:

w(s, r) = w(r, s) = |I(s) − I(r)| − 0.15  if |I(s) − I(r)| ≤ 2;  |I(s) − I(r)| otherwise    (15)

The support weight from the stable pixel to other pixels with the MST edge weight clip declines slowly in Fig. 3(g), which gives the disparity refinement of unstable pixels in the bottom textureless region better accuracy, as shown in Fig. 3(h). The non-local refinement [13] result is also presented in Fig. 3(d); it is comparable with the proposed refinement result without the MST edge weight clip.
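The clip of (15) can be illustrated as a small helper. This is a minimal sketch, assuming the subtractive reading of (15) with the result floored at zero (the floor is our assumption, added to keep edge weights non-negative; the constants 2 and 0.15 are those quoted in the text):

```python
def clipped_edge_weight(i_s, i_r, threshold=2.0, offset=0.15):
    """Edge weight of (1) with the clip of (15): color distances at or below
    `threshold` are reduced by `offset` (floored at zero, an added assumption)
    so that long chains of tiny weights in textureless regions cannot
    accumulate into a large path distance."""
    w = abs(i_s - i_r)
    if w <= threshold:
        return max(w - offset, 0.0)
    return w
```

With this clip, a path through many nearly identical pixels keeps a path distance W close to zero, so the support S(p, q) = exp(−W/σ) stays high across the textureless region.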


Fig. 3. The proposed refinement result with MST edge weight clip shows higher accuracy. (a) Left image of the baby3 dataset with a textureless bottom region (shown in the blue rectangle). (b), (c) Disparity ground truth and box-filter [1] aggregation result. (d) Non-local refinement [13] result of (c). (e) and (g) Support weight from one stable pixel (shown in green) to other pixels without and with the MST edge weight clip; the former decreases faster than the latter. (f) and (h) The proposed refinement results of (c) without and with the MST edge weight clip. In the textureless bottom region, the proposed refinement result with weight clip shows better accuracy. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

In the proposed method, we perform a left–right check and simple hole-filling as in [11] before refinement. The parameters of the proposed method are kept constant for all benchmark stereo pairs: σ = 0.04, σ_H = 0.0016, α = 0.05. For the non-local refinement method [13] and the weighted median filter method [16], the parameters follow the settings of the corresponding papers. The hardware platform for the tests is a desktop with an Intel Core Duo 3.00 GHz CPU and 2 GB RAM; no parallelism technique is utilized.

4.2. Speed evaluation

The speed comparison of the three refinement methods is shown in Table 1 and Fig. 4. Note that the times have been converted to time per 100,000 pixels to remove the influence of resolution. The time of the proposed refinement method stays constant while the times of the other two refinement methods increase with the maximum disparity, which is consistent with the computational complexity analysis.

4.3. Accuracy evaluation

Table 2 shows the accuracy evaluation on all Middlebury data sets [12] with three aggregation methods followed by three refinement methods. The percentages of erroneous pixels with error threshold 1 are used to evaluate accuracy, and the subscript numbers are the relative ranks of the methods on each data set. For visual comparison, we present the disparity results of four data sets with three aggregation methods followed by three refinement methods in Figs. 5 and 6. The bad pixels are marked in red, and the percentage of bad pixels is shown below each

Table 1
The time of different refinement methods on the Middlebury data sets [12]. Note that the times have been converted to time per 100,000 pixels to remove the influence of resolution.

Dataset | Max disparity | WMF (ms) | Nlocal (ms) | Proposed (ms)
Tsukuba | 15 | 95.85 | 96.75 | 122.07
Bull | 32 | 207.31 | 141.84 | 123.66
Baby1 | 46 | 312.15 | 174.73 | 126.30
Teddy | 59 | 405.33 | 207.70 | 125.04
Cloth2 | 76 | 530.55 | 250.61 | 126.71

disparity map. The complete disparity results can be found in the Supplementary material.

In general, the non-local refinement method [13] and the proposed refinement method show better accuracy than weighted median filter refinement [16]. A great advantage of the MST based non-local filter [13] is that it provides a more natural image-pixel similarity metric. It is more accurate than previous local filters, so every pixel in the image can correctly contribute to all the other pixels during filtering. In contrast, the weighted median filter refinement method [16] requires a fixed window, and only pixels inside this window provide support; it is impossible to find a window that is optimal for all data sets. Furthermore, the unstable regions handled by the non-local refinement method [13] can be huge, whereas the size of unstable regions that weighted median filter refinement [16] can handle is limited. Our proposed refinement is also based on the non-local filter. As a result, the non-local refinement method [13] and the proposed refinement method show better accuracy than weighted median filter refinement [16] on most of the test data sets.

X. Huang, Y.-J. Zhang / Pattern Recognition 55 (2016) 198–206

Fig. 4. Relationship between the time of the different refinement methods and the maximum disparity; only the time of the proposed refinement method stays constant, due to its O(1) computational complexity.

203

Compared with the non-local refinement method [13], our proposed refinement method shows better accuracy due to the following three factors.

First, one shortcoming of the MST based non-local refinement method [13] is the small-weight-accumulation problem in textureless regions described in Section 4.1; the MST edge weight clip presented in (15) is adopted by the proposed method.

Second, our proposed belief aggregation is computed on a hybrid MST as shown in (8), whose edge weights are determined by disparity distance and color distance as shown in (6). From (6) and (8), the contributions of disparity distance and color distance are both in exponential form. In the non-local refinement method [13], a new cost volume is computed, then non-locally aggregated on an MST and followed by a winner-take-all operation to select the best choice. The contribution of disparity distance is reflected in the computation of the new cost volume, and the contribution of color distance is reflected in the non-local aggregation. Clearly, the contribution of color distance is in exponential form, while the contribution of

Table 2
Accuracy evaluation on all Middlebury [12] data sets with error threshold 1. Three aggregation methods (guided filter [11], non-local filter [13] and box-filter [1]) followed by three refinement methods (weighted median filter [16], non-local [13] and proposed) are evaluated by the percentages of erroneous pixels. The number in parentheses after each value is the relative rank of the three refinement methods on that data set.

Dataset | Guided: WMF | Guided: Nlocal | Guided: Ours | Non-local: WMF | Non-local: Nlocal | Non-local: Ours | Box: WMF | Box: Nlocal | Box: Ours
Teddy | 12.11 (3) | 11.78 (2) | 11.00 (1) | 11.61 (3) | 10.62 (2) | 10.47 (1) | 13.38 (3) | 13.38 (3) | 12.77 (1)
Cones | 6.60 (3) | 6.26 (2) | 5.89 (1) | 6.92 (3) | 6.40 (1) | 6.82 (2) | 6.73 (2) | 6.75 (3) | 6.55 (1)
Venus | 0.99 (3) | 0.63 (1) | 0.86 (2) | 1.21 (2) | 1.13 (1) | 1.42 (3) | 1.13 (1) | 1.14 (2) | 1.18 (3)
Tsukuba | 4.26 (3) | 3.54 (1) | 3.65 (2) | 5.05 (3) | 3.43 (1) | 4.15 (2) | 4.30 (3) | 3.58 (1) | 3.98 (2)
Sawtooth | 1.94 (2) | 1.58 (1) | 2.01 (3) | 1.70 (2) | 1.51 (1) | 2.33 (3) | 1.64 (1) | 1.80 (2) | 2.23 (3)
Bull | 1.18 (2) | 0.98 (1) | 1.92 (3) | 0.87 (1) | 1.10 (2) | 1.11 (3) | 2.33 (2) | 1.45 (1) | 3.12 (3)
Poster | 1.17 (3) | 1.09 (2) | 0.90 (1) | 0.86 (1) | 1.21 (3) | 1.15 (2) | 1.12 (1) | 2.04 (3) | 2.00 (2)
Barn1 | 2.01 (3) | 1.43 (1) | 1.76 (2) | 2.08 (2) | 1.55 (1) | 2.26 (3) | 2.02 (3) | 1.71 (1) | 1.92 (2)
Barn2 | 2.85 (3) | 2.21 (1) | 2.64 (2) | 4.03 (3) | 2.72 (1) | 3.70 (2) | 2.71 (3) | 2.04 (1) | 2.60 (2)
Map | 5.33 (3) | 3.03 (1) | 4.47 (2) | 7.82 (3) | 4.21 (1) | 7.01 (2) | 4.64 (3) | 2.88 (1) | 3.68 (2)
Art | 15.25 (2) | 15.41 (3) | 13.57 (1) | 16.87 (3) | 15.69 (2) | 15.21 (1) | 16.15 (3) | 15.46 (2) | 14.23 (1)
Books | 17.98 (3) | 17.73 (1) | 17.98 (2) | 19.43 (2) | 18.97 (1) | 19.66 (3) | 18.99 (3) | 17.84 (2) | 17.42 (1)
Dolls | 11.40 (3) | 10.47 (2) | 9.25 (1) | 12.40 (3) | 10.98 (2) | 10.69 (1) | 10.64 (3) | 9.26 (2) | 8.90 (1)
Laundry | 20.01 (3) | 19.10 (2) | 18.13 (1) | 19.10 (3) | 18.24 (2) | 17.71 (1) | 21.28 (3) | 21.08 (2) | 19.48 (1)
Moebius | 16.22 (3) | 14.19 (2) | 13.02 (1) | 15.82 (3) | 12.92 (1) | 13.48 (2) | 16.20 (3) | 14.69 (2) | 13.82 (1)
Reindeer | 9.61 (1) | 12.01 (3) | 10.22 (2) | 14.38 (3) | 14.31 (2) | 13.54 (1) | 8.28 (2) | 11.01 (3) | 7.64 (1)
Aloe | 11.88 (3) | 10.81 (2) | 9.50 (1) | 12.13 (3) | 9.49 (2) | 8.40 (1) | 10.75 (3) | 9.65 (2) | 9.44 (1)
Baby1 | 9.03 (2) | 12.16 (3) | 9.02 (1) | 14.50 (1) | 16.72 (3) | 16.10 (2) | 8.49 (1) | 11.69 (3) | 8.60 (2)
Baby2 | 7.50 (1) | 14.58 (3) | 10.90 (2) | 17.61 (1) | 20.49 (3) | 20.28 (2) | 7.56 (1) | 13.71 (3) | 9.13 (2)
Baby3 | 10.33 (3) | 9.06 (2) | 7.57 (1) | 11.19 (3) | 10.76 (2) | 10.10 (1) | 10.53 (3) | 9.55 (2) | 7.38 (1)
Bowling1 | 22.53 (1) | 28.16 (3) | 22.70 (2) | 29.31 (1) | 31.83 (2) | 33.45 (3) | 25.89 (2) | 28.08 (3) | 21.61 (1)
Bowling2 | 17.32 (2) | 19.71 (3) | 16.89 (1) | 21.35 (1) | 21.40 (2) | 23.12 (3) | 14.60 (2) | 17.32 (3) | 14.06 (1)
Cloth1 | 6.87 (3) | 4.91 (2) | 4.84 (1) | 6.14 (3) | 4.16 (2) | 3.69 (1) | 5.19 (3) | 3.32 (2) | 3.03 (1)
Cloth2 | 11.54 (3) | 9.97 (2) | 9.72 (1) | 11.30 (3) | 9.76 (2) | 9.28 (1) | 9.31 (3) | 7.36 (2) | 7.08 (1)
Cloth3 | 6.41 (3) | 5.34 (1) | 5.40 (2) | 6.19 (3) | 4.94 (1) | 5.08 (2) | 5.33 (3) | 4.34 (1) | 4.35 (2)
Cloth4 | 8.06 (3) | 7.41 (2) | 7.03 (1) | 7.75 (3) | 7.40 (2) | 6.95 (1) | 7.01 (3) | 6.49 (2) | 5.98 (1)
Flowerpots | 21.93 (2) | 24.83 (3) | 20.27 (1) | 30.38 (2) | 31.11 (3) | 29.65 (1) | 20.36 (2) | 23.23 (3) | 18.94 (1)
Lampshade1 | 15.24 (3) | 12.88 (2) | 10.31 (1) | 18.51 (3) | 16.81 (1) | 18.17 (2) | 15.61 (3) | 13.26 (2) | 8.98 (1)
Lampshade2 | 21.78 (2) | 23.31 (3) | 12.21 (1) | 21.16 (3) | 18.02 (2) | 15.67 (1) | 27.06 (3) | 26.31 (2) | 9.40 (1)
Midd1 | 39.54 (3) | 31.63 (2) | 22.73 (1) | 25.47 (3) | 25.43 (2) | 19.93 (1) | 41.29 (3) | 40.50 (2) | 39.32 (1)
Midd2 | 32.07 (3) | 26.68 (2) | 24.83 (1) | 27.88 (2) | 33.11 (3) | 23.76 (1) | 37.86 (1) | 41.82 (3) | 39.64 (2)
Monopoly | 25.58 (3) | 23.54 (2) | 19.05 (1) | 23.32 (3) | 20.63 (2) | 20.04 (1) | 28.00 (2) | 31.28 (3) | 27.02 (1)
Plastic | 40.09 (2) | 40.07 (1) | 43.79 (3) | 50.31 (3) | 47.63 (1) | 48.00 (2) | 46.48 (2) | 48.14 (3) | 46.26 (1)
Rocks1 | 6.72 (2) | 7.18 (3) | 6.11 (1) | 5.59 (2) | 6.13 (3) | 5.17 (1) | 5.71 (2) | 5.81 (3) | 4.99 (1)
Rocks2 | 5.18 (3) | 5.16 (2) | 4.63 (1) | 5.32 (2) | 5.10 (1) | 5.39 (3) | 4.73 (2) | 4.86 (3) | 4.35 (1)
Wood1 | 9.94 (1) | 12.38 (2) | 12.99 (3) | 14.15 (2) | 13.66 (1) | 15.96 (3) | 9.30 (1) | 12.75 (3) | 9.85 (2)
Wood2 | 1.74 (1) | 3.25 (3) | 1.75 (2) | 2.93 (2) | 3.65 (3) | 2.89 (1) | 3.52 (2) | 4.11 (3) | 2.82 (1)
Avg. error | 12.44 (3) | 12.28 (2) | 10.80 (1) | 13.58 (3) | 13.06 (2) | 12.75 (1) | 12.87 (2) | 13.23 (3) | 11.45 (1)
Avg. rank | 2.49 (3) | 2.00 (2) | 1.54 (1) | 2.41 (3) | 1.81 (2) | 1.78 (1) | 2.32 (3) | 2.27 (2) | 1.43 (1)


[Fig. 5 panels, left to right: left image, right image, ground truth, guided agg., WMF, non-local, proposed. Bad-pixel percentages shown under each disparity map:
Teddy: guided agg. 23.17, WMF 12.11, non-local 11.78, proposed 11.00.
Cones: guided agg. 16.12, WMF 6.60, non-local 6.26, proposed 5.89.
Baby3: guided agg. 21.68, WMF 10.33, non-local 9.06, proposed 7.57.
Flowerpots: guided agg. 34.70, WMF 21.93, non-local 24.82, proposed 20.27.]

Fig. 5. Experimental results on some Middlebury data sets [12] with different refinement methods. Results for all data sets are attached as Supplementary material. (a) Left image. (b) Right image. (c) Ground truth of left disparity. (d) Guided filter aggregation [11] followed by left–right check. (e)–(g) Refinement results of (d) with different refinement methods (weighted median filter [16], non-local refinement [13] and proposed). Bad pixels are marked in red and their percentage is shown below each disparity map. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

[Fig. 6 panels, left to right: nonlocal agg., WMF, non-local, proposed, box agg., WMF, non-local, proposed. Bad-pixel percentages shown under each disparity map:
Teddy: nonlocal agg. 24.66, WMF 11.61, non-local 10.62, proposed 10.47; box agg. 25.93, WMF 13.38, non-local 13.38, proposed 12.77.
Cones: nonlocal agg. 18.77, WMF 6.92, non-local 6.40, proposed 6.82; box agg. 17.52, WMF 6.73, non-local 6.75, proposed 6.55.
Baby3: nonlocal agg. 23.71, WMF 11.19, non-local 10.76, proposed 10.10; box agg. 24.79, WMF 10.53, non-local 9.55, proposed 7.38.
Flowerpots: nonlocal agg. 46.78, WMF 30.38, non-local 31.11, proposed 29.65; box agg. 34.24, WMF 20.36, non-local 23.23, proposed 18.94.]

Fig. 6. (a) and (e) Non-local aggregation [13] and box-filter aggregation followed by left–right check. (b)–(d), (f)–(h) Refinement results of (a) and (e) with different refinement methods (weighted median filter [16], non-local refinement [13] and proposed). Bad pixels are marked in red and their percentage is shown below each disparity map. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 7. Non-local refinement [13] results with different computations of the new cost value. (a) and (b) Left image and ground truth of the Lampshade1 dataset. (c) Box-filter [1] aggregation result with a lot of disparity noise (shown in the blue rectangle). (d) and (e) Non-local refinement [13] results with the linear form (16) (error pixels 16.63%) and the exponential form (17) (error pixels 13.77%); more noisy pixels are suppressed by the latter form. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. The proposed refinement results without and with hole-filling. (a)–(c) Left image, right image, and ground truth of the Teddy data set. (d) Box-filter [1] aggregation result (one shadow region is inside the blue rectangle). (e) Non-local refinement [13] result (error pixels 15.14%). (f) The proposed result without hole-filling (error pixels 15.05%); (e) and (f) are comparable. (g) Hole-filling of (d). (h) The proposed refinement result based on hole-filling (error pixels 12.77%); accuracy in the shadow region is better than without hole-filling. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
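The simple hole-filling evaluated in Fig. 8 can be sketched as follows. This is one common variant (a minimal sketch, not the exact procedure of [11]): each invalid pixel takes the lower of the nearest valid disparities to its left and right in the same row, i.e. the background disparity. The function name and the use of −1 for invalid pixels are our own conventions.

```python
import numpy as np

def fill_holes(disp, invalid=-1):
    """Fill invalid pixels with the lower of the nearest valid
    disparities to the left and right along the scanline."""
    disp = disp.astype(float)
    out = disp.copy()
    for y in range(disp.shape[0]):
        row = disp[y]
        valid_x = np.flatnonzero(row != invalid)
        if valid_x.size == 0:
            continue  # nothing to copy from on this scanline
        for x in np.flatnonzero(row == invalid):
            left = valid_x[valid_x < x]
            right = valid_x[valid_x > x]
            cands = []
            if left.size:
                cands.append(row[left[-1]])
            if right.size:
                cands.append(row[right[0]])
            # Prefer the smaller (background) disparity, since holes
            # typically come from occluded background regions.
            out[y, x] = min(cands)
    return out
```

Filling with the background disparity is what improves the shadow regions in Fig. 8(h): occluded pixels belong to the background surface, so copying the smaller neighboring disparity is usually correct.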

In the non-local refinement method [13], the contribution of the disparity distance takes a linear form:

    C_d^new(p) = { |d − D(p)|   p is stable
                 { 0            p is unstable        (16)

where D(p) denotes the disparity of pixel p to be refined and C_d^new(p) denotes the new cost value of pixel p at disparity level d. Compared with this linear form, we find that the exponential form used in the proposed method is more robust on noisy initial disparity maps. To verify this conclusion, we modify the contribution of the disparity distance in the non-local refinement method [13] to the exponential form:

    C_d^new(p) = { exp(−|d − D(p)| / σ_c)   p is stable
                 { 0                        p is unstable        (17)

where σ_c is a smoothing parameter. Intuitively, suppose pixel p is stable but has a noisy initial disparity D(p), while a nearby pixel q is stable with a noise-free initial disparity D(q) and is expected to obtain the refined disparity d. Then |d − D(p)| is usually very large, so C_d^new(p) in (17) is close to zero. In the subsequent non-local aggregation, the contribution from the noisy pixel p to the noise-free pixel q is therefore small, which means the influence of the noisy pixel p is suppressed. In short, the influence of noisy pixels is suppressed more effectively by (17) than by (16). We evaluated non-local refinement [13] with the exponential form (17) and with the linear form (16) on the box-filter aggregation result, using σ_c = 10 and σ = 0.04 with the other parameters following the settings of the corresponding paper; the average percentage of error pixels on the Middlebury stereo benchmark [12] decreased from 12.97% to 12.43%. One example is shown in Fig. 7.

Finally, we perform simple hole-filling as in [11] after the left–right check in the proposed method, whereas hole-filling is not adopted by the non-local refinement method [13]. Hole-filling improves accuracy in shadow regions; one example is shown in Fig. 8.

Since the three factors discussed above lead the proposed refinement method to better accuracy than non-local refinement [13], we can also evaluate the two methods with the influence of these factors removed. That is, we compare the proposed method without weight clip and hole-filling (denoted as ours*) against non-local refinement modified with the exponential form (17) (denoted as Nlocal*). The parameter σ = 0.04 is best for both modified methods; the other parameters are the same as in the previous evaluation. The accuracy of the two modified methods is shown in Table 3 (details are given in the Supplementary material), and the two results are very close. Since both the proposed refinement and non-local refinement [13] are based on an MST-structured filter, this close accuracy performance is consistent with our expectation.

Table 3
Accuracy evaluation between non-local refinement [13] with (17) (denoted as Nlocal*) and the proposed method without weight clip and hole-filling (denoted as ours*); details are given in the Supplementary material. The results of the two modified methods are very close.

           | Guided aggregation | Non-local aggregation | Box aggregation
Dataset    | Nlocal*  | Ours*   | Nlocal*  | Ours*      | Nlocal* | Ours*
Avg. Error | 11.43    | 11.37   | 12.44    | 12.31      | 12.41   | 12.54
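The two cost assignments (16) and (17) discussed above can be sketched in NumPy as follows. This is a minimal sketch; the function names and the use of boolean stability masks are our own conventions, not the paper's.

```python
import numpy as np

def new_cost_linear(d, D, stable):
    # Eq. (16): |d - D(p)| for stable pixels, 0 for unstable pixels.
    return np.where(stable, np.abs(d - D), 0.0)

def new_cost_exponential(d, D, stable, sigma_c=10.0):
    # Eq. (17): exp(-|d - D(p)| / sigma_c) for stable pixels, 0 otherwise.
    return np.where(stable, np.exp(-np.abs(d - D) / sigma_c), 0.0)
```

The difference in robustness follows directly from the shapes of the two functions: a stable pixel whose noisy initial disparity is far from the candidate level d yields an unbounded value under (16) but a value near zero under (17), so in the subsequent aggregation its influence on neighboring pixels is negligible.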

5. Conclusions

We proposed an O(1) disparity refinement method based on belief aggregation and belief propagation. The proposed fast refinement method needs only 2 additions and 4 multiplications for each pixel over all disparity levels. Performance evaluation on the Middlebury stereo benchmark [12] demonstrates that the proposed refinement algorithm outperforms the non-local refinement method [13] and the weighted median filter method [16]. Future work should focus on constructing more effective trees to resolve the shortcomings of the MST-based non-local filter [13], and on a parallel implementation.

Conflict of interest None declared.

Acknowledgment This work was supported by National Natural Science Foundation of China (NNSF: 61171118).

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.patcog.2016.01.025.

References

[1] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. Comput. Vis. 47 (1–3) (2002) 7–42.
[2] S. Birchfield, C. Tomasi, A pixel dissimilarity measure that is insensitive to image sampling, IEEE Trans. Pattern Anal. Mach. Intell. 20 (4) (1998) 401–406.
[3] S. Chambon, A. Crouzil, Similarity measures for image matching despite occlusions in stereo vision, Pattern Recognit. 44 (9) (2011) 2063–2075.
[4] K.-J. Yoon, I.S. Kweon, Adaptive support-weight approach for correspondence search, IEEE Trans. Pattern Anal. Mach. Intell. 28 (4) (2006) 650–656.
[5] C. Richardt, D. Orr, I. Davies, A. Criminisi, N.A. Dodgson, Real-time spatiotemporal stereo matching using the dual-cross-bilateral grid, in: Proceedings of the European Conference on Computer Vision, 2010, pp. 510–523.
[6] M. Gong, Y. Zhang, Y.-H. Yang, Near-real-time stereo matching with slanted surface modeling and sub-pixel accuracy, Pattern Recognit. 44 (10–11) (2011) 2701–2710.
[7] X. Mei, X. Sun, W. Dong, H. Wang, X. Zhang, Segment-tree based cost aggregation for stereo matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 313–320.
[8] C. Pham, J. Jeon, Domain transformation-based efficient cost aggregation for local stereo matching, IEEE Trans. Circuits Syst. Video Technol. 23 (7) (2013) 1119–1130.
[9] J. Sun, N. Zheng, H.Y. Shum, Stereo matching using belief propagation, IEEE Trans. Pattern Anal. Mach. Intell. 25 (7) (2003) 787–800.
[10] Q. Yang, L. Wang, N. Ahuja, A constant-space belief propagation algorithm for stereo matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1458–1465.
[11] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, M. Gelautz, Fast cost-volume filtering for visual correspondence and beyond, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3017–3024.
[12] D. Scharstein, R. Szeliski, Middlebury Stereo Evaluation (Online). Available: 〈http://vision.middlebury.edu/stereo/eval〉.
[13] Q. Yang, A non-local cost aggregation method for stereo matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1402–1409.
[14] S.D. Cochran, G. Medioni, 3-D surface description from binocular stereo, IEEE Trans. Pattern Anal. Mach. Intell. 14 (10) (1992) 981–994.
[15] X. Sun, X. Mei, S. Jiao, M. Zhou, H. Wang, Stereo matching with reliable disparity propagation, in: Proceedings of 3DIMPVT, 2011, pp. 132–139.
[16] Z. Ma, K. He, Y. Wei, J. Sun, Constant time weighted median filtering for stereo matching and beyond, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 49–56.
[17] S. Perreault, P. Hébert, Median filtering in constant time, IEEE Trans. Image Process. 16 (9) (2007) 2389–2394.
[18] D. Cline, K.B. White, P.K. Egbert, Fast 8-bit median filtering based on separability, in: Proceedings of the IEEE International Conference on Image Processing, 2007, pp. 281–284.
[19] M. Kass, J. Solomon, Smoothed local histogram filters, SIGGRAPH, 2010, p. 100.
[20] F. Durand, J. Dorsey, Fast bilateral filtering for the display of high-dynamic-range images, SIGGRAPH, 2002, pp. 257–266.
[21] S. Paris, F. Durand, A fast approximation of the bilateral filter using a signal processing approach, in: Proceedings of the European Conference on Computer Vision, 2006, pp. 568–580.
[22] F. Porikli, Constant time O(1) bilateral filtering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[23] Q. Yang, K.-H. Tan, N. Ahuja, Real-time O(1) bilateral filtering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 557–564.
[24] C. Tomasi, R. Manduchi, Bilateral filtering for gray and color images, in: Proceedings of the IEEE International Conference on Computer Vision, 1998, pp. 839–846.
[25] K. He, J. Sun, X. Tang, Guided image filtering, in: Proceedings of the European Conference on Computer Vision, 2010, pp. 1–14.
[26] E.S. Gastal, M.M. Oliveira, Domain transform for edge-aware image and video processing, SIGGRAPH, 2011, p. 69.
[27] X. Huang, G. Cui, Y. Zhang, A fast non-local disparity refinement method for stereo matching, in: Proceedings of the IEEE International Conference on Image Processing, 2014, pp. 3823–3827.

Xiaoming Huang received the master's and bachelor's degrees from Peking University and Lanzhou University, respectively. He is currently a Ph.D. candidate in the Department of Electronic Engineering at Tsinghua University, Beijing, China. His research interests include image processing, computer vision, and machine learning.

Yu-Jin Zhang received the Ph.D. degree in applied science from the Montefiore Institute at the State University of Liège, Belgium, in 1989. He was a postdoctoral fellow and research fellow with the Department of Applied Physics and the Department of Electrical Engineering at the Delft University of Technology, the Netherlands, from 1989 to 1993. In 1993, he joined the Department of Electronic Engineering at Tsinghua University, Beijing, China, where he has been a professor of image engineering since 1997. He has authored more than 30 books and published more than 400 papers in the areas of image engineering (image processing, image analysis, and image understanding). He is the director of the academic committee of the China Society of Image and Graphics. He is a senior member of the IEEE and a fellow of SPIE.