Automatic objects segmentation with RGB-D cameras

Haowei Liu (corresponding author; e-mail: [email protected]), Matthai Philipose, Ming-Ting Sun
University of Washington, Seattle, WA 98195, United States

Keywords: Boundary detection; Object segmentation; Trilateral filter; Graph Cuts

Abstract

Automatic object segmentation is a fundamentally difficult problem due to issues such as shadow, lighting, and semantic gaps. Edges play a critical role in object segmentation; however, it is almost impossible for the computer to know which edges correspond to object boundaries and which are caused by internal texture discontinuities. Active 3-D cameras, which provide streams of depth and RGB frames, are poised to become inexpensive and widespread. The depth discontinuities provide useful information for identifying object boundaries, which makes automatic object segmentation possible. However, the depth frames are extremely noisy. Also, the depth and RGB information often lose synchronization when the object is moving fast, due to the different response times of the RGB and depth sensors. We show how to use the combined depth and RGB information to mitigate these problems and produce an accurate silhouette of the object. On a large dataset (24 objects with 1500 images), we provide both qualitative and quantitative evidence that our proposed techniques are effective.

1. Introduction

Object segmentation is essential for many vision applications such as object recognition, pose estimation, and touch detection. However, it is a fundamentally difficult problem for conventional approaches, which rely on edges in the RGB images; shadows, for instance, usually move with objects and cause difficulties for automatic object segmentation. The recent advent of the Kinect, an active stereo camera that provides depth frames along with RGB frames at 30 fps, offers hope for automatic object segmentation, as the depth discontinuities can serve as indicators of the object boundaries. Although depth provides useful information for automatic object segmentation, the limitations of the depth frames make them difficult to use directly. Three depth frames from the Kinect are shown in the top row of Fig. 1, where lighter pixels denote farther distances. The second row shows the results (lighter shade superimposed) using the plane removal method [1], while the bottom row shows the results using our proposed approach, which is described in detail in this paper. Abrupt edges with large depth disparities disrupt depth estimates and manifest as noisy boundaries and "holes" (pixels with missing depth information), as do reflective, transparent, or translucent materials. Rapidly moving parts of the image are sporadically misaligned, and narrow moving parts become invisible. To mitigate these problems, we develop techniques based on the observation that the RGB frames contain detailed edges and are relatively impervious to motion artifacts. The RGB frames

may be relatively blurry due to low lighting and motion, but rarely share the noise and motion artifacts of the depth frames. We therefore use an RGB frame aligned with the depth frame to generate a clean segmentation mask. Our contributions fall into three areas. First, we introduce a trilateral filter which incorporates distance, RGB pixel values, and boundary information to smooth the depth map while preserving the object boundaries. We build on joint bilateral filtering, which discourages depth smoothing across color discontinuities (and is therefore regarded as "edge aware") but nevertheless produces artifacts when the colors on both sides of a boundary are similar. We show how to use a separately provided per-pixel boundary prior to add a "boundary constraint" to the bilateral filter and reduce these artifacts. As a specific instance of the boundary prior, we introduce a Pbd (probability-of-boundary with depth) boundary detector, a depth-aware version of the RGB-based Pb (probability-of-boundary) detector [2]. Second, we observe that filtering, which can be viewed as weighted local smoothing, does not recover large missing regions such as fingers. We propose to use the refined mask from the trilateral filtering step as the seed for Graph-Cuts-based segmentation [3], and we extend the traditional RGB-based Graph Cuts to include depth-based terms to produce better results. Third, we propose a technique to mitigate the misaligned depth and RGB data of fast-moving objects by using information from previous aligned frames in which the object motion is relatively slow. We show how to use optical flow in the RGB frames to estimate the transformations from the aligned frames to the misaligned ones, thus allowing us to retrieve the depth maps for the missing parts. Managing errors in optical flow is the key to good performance. We



Fig. 1. (Top row) Depth maps (missing values in black), and (middle row) rough object masks in lighter shade superimposed on RGB images. The hand and the watering can are frames extracted from sequences in which they are moving. Limitations of the sensors result in the loss of much of the shape of the bottle and of the alignment/shape of the pointer finger and the watering can. (Bottom row) Refined object masks using the approaches proposed in this paper. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

show how to use the techniques above to counter the inaccuracies in the optical flow calculations. These techniques produce significant quantitative and visual improvements. We use an extensive dataset (24 objects with 1500 images) to validate the effectiveness of our proposed approach. To our knowledge, this is the first work to provide an in-depth examination of 3-D camera anomalies and propose strategies for mitigating them.

2. Related work

Since the introduction of the Kinect [4], much research utilizing depth data has been reported. For example, Du et al. [5] use a Kinect sensor to construct a 3-D model of an indoor environment on a mobile laptop by computing inter-frame alignment; online visual feedback is provided to the users so they can correct possible failures. Similarly, in [6], Newcombe et al. construct a surface model of an indoor scene using a coarse-to-fine iterative closest point (ICP) algorithm. However, instead of aligning the depth frames to each other, they simultaneously estimate the camera pose and track live depth images against the model under construction. They show reconstruction results with low drift and high accuracy, demonstrating the advantage of tracking against a global model over frame-to-frame alignment. Lai et al. [7] publish a large-scale object dataset and show good performance for object detection and recognition using histograms of oriented gradients over both color and depth images. In [8], Richtsfeld et al. present a framework for detecting 3-D objects in RGB-D images and extracting representations suitable for robotics tasks. They address the problem

of cluttered scenes by estimating surface patches, using a mixture of planes and B-splines fitted to the 3-D point cloud, and then finding the best representation for the given data. A graph is then constructed from the surface patches and the relations among them to derive scene hypotheses. Similarly, Silberman et al. [9] use a Conditional Random Field for scene segmentation. Beyond these higher-level computer vision applications, work has also been carried out in the domain of image processing. For instance, in [10], Qi et al. propose an inpainting method to fill the holes in depth images. To obtain satisfactory results, they modify the inpainting equation by incorporating geometric relationships and structural information from both the color and depth images in the local neighborhood into the inpainting parameters. In [11], Vaiapury et al. propose GrabCutD for object segmentation when stereoscopic or multi-view images are available. In [12], Schiller et al. combine depth thresholding and a mixture of Gaussians for appearance modeling to segment foreground objects. Abramov et al. [13] propose a video segmentation method in which depth is first used to penalize neighboring pixels with large depth differences; correspondences are then established for object segments across temporal frames through optical flow in order to reduce the cost of full segmentation. In [14], a new camera design is set up to capture depth and color images; joint bilateral filtering and temporal filtering are used to smooth the depth image, and finally the depth and color images are used to yield realistic renderings of the 3-D scene. Both [14] and [13] are similar to our work in the sense that they are concerned with smoothing and denoising the depth information. However, [14] focuses more on smoothing the depth image than on segmentation. Although [13] uses temporal


Fig. 2. Pipeline for object segmentation. A, B, C, D, and E are the major components: (A) trilateral filter, (B) boundary-constrained Graph Cuts, (C) warping, (D) error cleaning, and (E) affine mapping. The remaining blocks are the object mask extraction, the Pbd boundary detector, and the Poisson reconstruction. The RGB and depth inputs, the Pbd boundary map, the region map, the denoised depth map, and the rough object masks are the intermediate data flowing between the components; the output is the refined object masks.

information for segmentation, it is used as a means to reduce computation rather than to improve the segmentation quality. In addition, neither of them deals with the color/depth misalignment problem. One commonly used technique in the aforementioned work is the joint bilateral filter [15,16], which has been widely used for depth data processing, e.g., smoothing or upsampling. For example, in [16], Yang et al. propose a customized bilateral filter with a new distance measure to spatially super-resolve depth images by using both color and depth information. In [15], Dolson et al. show that a bilateral filter can also be applied along the temporal axis to generate interpolated depth frames for temporal super-resolution. However, one problem with the bilateral filter is that it produces artifacts when the pixels on both sides of an object boundary have similar colors. Nor does it address the misalignment problem we observe for fast-moving objects. In this paper, we introduce a set of techniques to address these problems and produce accurate segmentation masks for the objects.

3. Object segmentation incorporating depth information

We first give a high-level overview of our approach; Fig. 2 shows the overall block diagram. Using a Kinect, we need to perform registration between the depth and RGB images. Similar to [13], we use the OpenNI toolbox (www.openni.org) for registration; the RGB and depth inputs shown in Fig. 2 are post-registration data. With the temporal correction component inactive, we feed the RGB/depth pair to the probabilistic boundary detector (labeled "Pbd") to derive the boundary prior, which maps each pixel to the probability that a boundary passes through the pixel, together with the most likely orientation. The boundary prior is converted into its dual, a region map, via a Poisson reconstruction step, making it suitable for use as one of the three channels of the trilateral filter. The original RGB and depth channels constitute the other two channels input to the trilateral filter, which then refines the depth image. We then extract rough object masks representing initial segmentations for the


desired objects in the scene. An example of object mask extraction is the one used in [1], which extracts objects sitting on a round tabletop. Other kinds of object mask extraction could be used depending on the desired objects; for example, if the objective is to segment the depth images into regions representing various foreground objects and the background, simple thresholding could be used. The resulting bitmasks are used as the seed for the boundary-constrained Graph Cuts step, which returns further improved bitmasks for the desired objects. We would like to emphasize that our approach can be applied in a general setting: the Pbd detector and the object mask extraction can be customized depending on the specific needs. Even after registration, when the object moves fast, the depth pixels of the object start to lose sync with their color counterparts (other pixels are fine), and hence we need the temporal correction component, which we describe in Section 4. In the remainder of this section, we introduce our proposed trilateral filter, Pbd operator, and boundary constrained Graph Cuts.

3.1. Trilateral filter

As Fig. 1 shows, depth images of objects often have holes where depth information is missing, and jagged boundaries at highly reflective or absorptive patches. In principle, these holes can be filled in based on values from the neighbors; however, given the dramatic noise at depth boundaries, the question arises as to which neighbors should contribute. The idea of using the color channel to refine the depth channel is at the heart of recent work on using joint bilateral filters to upsample depth images using aligned high-resolution color frames [15,16]. The new depth value D'(p) at a pixel p in these systems is simply the proximity- and color-similarity-weighted sum of its neighboring depth values:

D'(p) = \sum_{q \in N(p)} D(q) \, G_s\!\left(\lVert p - q \rVert_2^2\right) G_c\!\left(\lVert I_p - I_q \rVert_2^2\right)    (1)

where G_s and G_c are Gaussian kernels for the spatial and color differences, respectively. The neighborhood N(p) can be set large enough to bridge holes that arise in practice. This filter can be used to fill in holes if it ignores hole pixels in the weighted averaging process. In practice, most holes are filled well. In some cases, however, the method introduces artifacts, as shown in Fig. 3: if the color of a part of the foreground object is similar to that of the background, the filter averages across the object boundary. In this case, the background depth values contribute excessively to the filtered foreground and produce artifacts even if a strong object boundary separates these locations.

Fig. 3. Comparison with the bilateral filter. (a) Color image, (b) corresponding depth image, (c) the green contour is the boundary of the pure-depth segmentation result after applying the bilateral filter. Because pixels of the cereal box are similar in color to the wall (yellow), after filtering they gain depth values from the wall pixels, resulting in a jagged segmentation boundary. The corresponding result using our trilateral filter is shown in (d). Best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


A bilateral-filter-style solution assumes that the relevant differences between two locations can be captured by the difference of the pixel values associated with those locations. To fit the effect of an intervening boundary into this framework, we therefore need to associate pixels with values such that the difference in the values between two pixels correlates with the probability of a boundary between them. In effect, we need to find a field R (which we call the region map) whose gradient matches the boundary prior field \vec{E}. We use the technique in [17] to minimize |\nabla R - \vec{E}| by solving the Poisson differential equation \nabla^2 R = \mathrm{div}\,\vec{E}, involving the Laplace and divergence operators. Given R, we can then suppress smoothing across the boundaries by adding a term to the bilateral filter that penalizes large differences in R:

D'(p) = \sum_{q \in N(p)} D(q) \, G_s\!\left(\lVert p - q \rVert_2^2\right) G_c\!\left(\lVert I_p - I_q \rVert_2^2\right) G_e\!\left(\lVert R_p - R_q \rVert_2^2\right)    (2)
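For concreteness, the sketch below is one minimal (unoptimized) way to evaluate the weighting of Eq. (2) with NumPy; dropping the boundary term G_e recovers the joint bilateral filter of Eq. (1). The function name, parameter values, and window handling are our own illustrative assumptions, not the authors' implementation.

import numpy as np

def trilateral_filter(D, I, R, radius=15, sigma_s=5.0, sigma_c=10.0, sigma_e=0.1):
    """Per-pixel weighted average of neighboring depths, as in Eq. (2).

    D : (H, W) depth map, 0 marks holes (missing depth).
    I : (H, W, C) color image used for the range term G_c.
    R : (H, W) region map from the Poisson reconstruction, used for G_e.
    Setting G_e to a constant (sigma_e very large) reduces this to Eq. (1).
    """
    H, W = D.shape
    out = np.zeros_like(D, dtype=np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    G_s = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))          # spatial kernel
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            d = D[y0:y1, x0:x1].astype(np.float64)
            c = I[y0:y1, x0:x1].astype(np.float64) - I[y, x]
            r = R[y0:y1, x0:x1].astype(np.float64) - R[y, x]
            G_c = np.exp(-np.sum(c**2, axis=-1) / (2 * sigma_c**2))   # color term
            G_e = np.exp(-(r**2) / (2 * sigma_e**2))                  # boundary term
            w = G_s[y0 - y + radius:y1 - y + radius,
                    x0 - x + radius:x1 - x + radius] * G_c * G_e
            w = w * (d > 0)                 # ignore hole pixels when averaging
            if w.sum() > 0:
                out[y, x] = (w * d).sum() / w.sum()
    return out

In the experiments of Section 5 the neighborhood is 30 by 30 pixels, which would correspond to radius = 15 in this sketch.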

Many fast implementations of the bilateral filter have been proposed recently, such as [18]; most of them can be easily extended to the trilateral case.

3.2. Probability-of-boundary using depth

A recent technique developed for finding the boundary prior map \vec{E} is Pb (probability of boundary) [2]. Pb is based purely on color frames. Given a color image I represented in the LAB color space, the Pb operator considers a fixed-size disk centered at each pixel p of I and divides the disk into two halves along a diameter with orientation o. It then computes the differences between the histograms of the two half disks for each of the LAB channels and a texture channel as features. The histogram differences are fed into a logistic regressor to compute the probability of a boundary running through p with orientation o. Since the Pb operator only examines color information, internal edges induced by object texture are often erroneously considered probable boundaries. We therefore augment Pb with depth information. Since we are interested in finding object boundaries, and depth discontinuities usually correspond to object boundaries, we could use the depth edges directly as the boundary prior; however, the depth images are usually very noisy, as shown in Fig. 1. We instead augment the Pb operator with depth as follows (see Fig. 4). Given a depth image D, run an edge detector (e.g., a Canny edge detector) to find the depth edge map M. For each pixel p in M,

for each orientation o, consider a fixed-size disk. Project every depth edge pixel within the disk onto the diameter with orientation o, let the total length of the projection on the diameter be d, and use d as an additional feature in the logistic regressor. The intuition behind the projection is that although the depth edges are noisy, they do correlate with the object boundary; projecting the edge pixels onto the central axis of the disk, which represents the edge direction of the center pixel, therefore provides a useful feature. In addition, the projection provides resistance to noise in the depth image. For example, consider the right-most image in Fig. 4, where we have a horizontal depth edge with jagged noise in the vertical direction. Since we only look at the projected length as the depth feature, the jagged noise will not contribute much when we impose a vertical disk; that is, the depth edge will still give the strongest response when we impose a horizontal disk. The probability of pixel p lying on a boundary of orientation o is then calculated as:

Pbd_o(p) = \left(1 + \exp\left(-\sum_s \left(\alpha_s l_{ps} + \beta_s a_{ps} + \gamma_s b_{ps} + \lambda_s d_{ps}\right)\right)\right)^{-1}    (3)

where l_{ps}, a_{ps}, and b_{ps} are color features computed in the LAB channels and d_{ps} is the depth feature computed at pixel p using a disk of radius s; α_s, β_s, γ_s, and λ_s are weights learned from labeled data distinct from the evaluation data. The orientation of the most likely boundary passing through p is argmax_o Pbd_o(p). For each pixel p, let E_p = max_o Pbd_o(p) and θ_p = argmax_o Pbd_o(p), i.e., the maximum probability of an edge passing through p and its associated orientation. With θ_p and E_p we form a vector field \vec{E} = (E_p cos θ_p, E_p sin θ_p), which serves as the gradient field input to the Poisson reconstruction process [17], from which the region map R is constructed. Note that we drop the texture features of [2] in our Pbd detector, since they are costly to compute and we already incorporate depth as a more reliable feature for boundary detection. In this paper, as in [2], we quantize the orientations to be 22.5° apart from each other.
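As an illustration of the depth feature described above, the sketch below projects the depth-edge pixels inside an oriented disk onto its diameter, measures the covered length, and combines features with a logistic function as in Eq. (3). The Canny pre-step, the unit-length binning used to measure coverage, and all names are our own assumptions, not the reference implementation.

import numpy as np

def depth_projection_feature(edge_map, cy, cx, theta, radius=10):
    """Length of the projection of depth-edge pixels onto the disk diameter.

    edge_map : (H, W) binary map of depth edges (e.g. from a Canny detector).
    (cy, cx) : center pixel p of the disk.
    theta    : orientation o of the diameter, in radians.
    Returns the covered length d used as the depth feature in Eq. (3).
    """
    H, W = edge_map.shape
    y0, y1 = max(0, cy - radius), min(H, cy + radius + 1)
    x0, x1 = max(0, cx - radius), min(W, cx + radius + 1)
    ys, xs = np.nonzero(edge_map[y0:y1, x0:x1])
    if ys.size == 0:
        return 0.0
    dy, dx = ys + y0 - cy, xs + x0 - cx
    inside = dy**2 + dx**2 <= radius**2          # keep only pixels inside the disk
    # Signed position of each edge pixel along the diameter with orientation theta.
    t = dx[inside] * np.cos(theta) + dy[inside] * np.sin(theta)
    if t.size == 0:
        return 0.0
    # Approximate the covered length by counting distinct unit-length bins.
    covered = np.unique(np.round(t).astype(int))
    return float(covered.size)

def pbd(features, weights):
    """Logistic combination of color and depth features across scales, as in Eq. (3)."""
    return 1.0 / (1.0 + np.exp(-np.dot(weights, features)))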

3.3. Boundary constrained Graph Cuts

Given a "seed" foreground segmentation S and a loose bounding box B around it, let p be a pixel within B and l_p its binary label, e.g., 0 for the background and 1 for the foreground. We seek to refine S by estimating l_p. Such seeded segmentation has been performed with good success by iterated Graph Cuts, with the seed typically provided manually [3]. In our case, after the trilateral filter, we derive a rough object mask as the seed S, and we then incorporate depth-based cues for the region boundaries as described below. With the object/background Gaussian color models G_1 and G_0 and the inference region B, our boundary constrained Graph Cuts is formulated as minimizing the following cost function:

C(l, G_i, X, I) = \sum_{p \in B} \left(-\log G_i(I_p \mid l_p = i) + \alpha V(l_p, X, I)\right)    (4)

Fig. 4. Computing depth features at three different orientations. The perimeter of the circle indicates the boundary of the disk, and the central axis with arrowhead indicates the direction of the edges we are looking to detect. When imposing the center of such a disk on an image pixel, we only look at edge pixels within the disk. We then project the edge pixels, e.g., the solid curved lines within the disk, onto the central axis. The (light blue) bold line indicates the projected edge pixels while the dotted lines indicate the projection direction. The directions of the disks are determined by the orientation quantization parameter; for example, in the experiments we quantize the 180° range into 8 orientations, resulting in disks with orientations 22.5° apart from each other. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

where the l_p's are the labels of the pixels p ∈ B we are solving for, the G_i's are the color models, X is the output produced by the Pbd operator (i.e., Pbd_o(p) for every pixel p and every orientation o, not just the maximum value used in the vector field \vec{E}), I is the color image, α is a parameter weighing the smoothness constraint, and V(l_p, X, I) is the function that enforces the smoothness constraint and has the form:

V(l_p, X, I) = \sum_{p \in B} \sum_{q \in N(p)} \delta(l_p \neq l_q) \, \frac{\exp\!\left(-\beta \lVert I_p - I_q \rVert_2^2\right) \exp\!\left(-\gamma \, Pbd_{o_\perp(q)}(p)\right)}{\mathrm{dist}(p, q)}    (5)

where δ is an indicator function taking the value 1 if l_p ≠ l_q and 0 otherwise, N(p) is the set of neighbors of pixel p, and ||I_p − I_q||² and dist(p, q) measure the distances between p and q in squared L2 norm in color space and in Euclidean distance in the image space, respectively; β and γ are the relative weightings.


Fig. 5. The role of the boundary prior. (a) The seed (lighter shade) overlaid on the RGB image; the pointer finger is lost. We apply Graph Cuts on the circled portion of the image. (b) Pure color-based Graph Cuts does not change the boundaries appreciably. (c) The boundary-constrained Graph Cuts detects the finger. (d) The neighborhood potentials of the standard Graph Cuts. (e) The boundary information from the Pbd operator. (f) The neighborhood potentials of the boundary-constrained version, which explain the difference; brighter pixels indicate a higher cost of assigning different segmentation labels to adjacent neighbors.

The function V specifies how the labels of neighboring pixels influence each other. In the original Graph Cuts formulation, it encourages assigning different labels when two neighboring pixels are far apart in both the Euclidean and color spaces. Our boundary constrained Graph Cuts introduces a new term, exp(−γ Pbd_{o⊥(q)}(p)), where Pbd_{o⊥(q)}(p) is the probability of p lying on a boundary with orientation o⊥(q), the direction perpendicular to the line pq; in that case, p and q are not likely to belong to the same object. The Pbd term introduces two extra sources of information to the traditional color-difference based potential. First, it incorporates depth information. Second, it is the result of an independently trained classifier using features derived from patches as opposed to pixels. In general, the boundary prior could even incorporate global constraints such as shape-based edge completion [19] and use arbitrarily sophisticated classifiers. Fig. 5 compares (b) the result of the original Graph Cuts and (c) that of the proposed boundary constrained Graph Cuts on a hand dataset. Note that given an initially imperfect segmentation (a), the original Graph Cuts fails to grow along the finger tip while our proposed version does. This can be explained by how the neighborhood potentials change. In (d), we see no apparent finger boundary, indicating that the neighborhood potentials discourage assigning different labels along the finger boundary due to the color proximity of foreground and background. In our boundary-constrained version, the neighborhood potentials (f) are altered by the addition of the boundary information from the Pbd detector (e), encouraging label flipping at the boundary.
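As a rough illustration of how the extra Pbd factor enters the pairwise term of Eq. (5), the sketch below computes smoothness weights between each pixel and its right and bottom neighbors; these weights, together with the unary terms of Eq. (4), would then be handed to any standard max-flow/min-cut solver. The parameter values, array layout, and function name are illustrative assumptions only.

import numpy as np

def pairwise_potentials(I, pbd_perp, beta=1.0 / (2 * 30.0**2), gamma=5.0):
    """Boundary-constrained smoothness weights of Eq. (5) for 4-connected pixels.

    I        : (H, W, 3) color image.
    pbd_perp : (H, W, 2) Pbd response at each pixel for the orientation
               perpendicular to the link to the right neighbor (channel 0)
               and to the bottom neighbor (channel 1).
    Returns (w_right, w_bottom): the cost of assigning different labels to a
    pixel and its right / bottom neighbor.  A high Pbd response lowers the
    cost, so cuts are encouraged along likely object boundaries.
    """
    I = I.astype(np.float64)
    diff_r = np.sum((I[:, :-1] - I[:, 1:]) ** 2, axis=-1)   # ||I_p - I_q||^2, right neighbor
    diff_b = np.sum((I[:-1, :] - I[1:, :]) ** 2, axis=-1)   # ||I_p - I_q||^2, bottom neighbor
    # dist(p, q) = 1 for 4-connected neighbors, so it is omitted here.
    w_right = np.exp(-beta * diff_r) * np.exp(-gamma * pbd_perp[:, :-1, 0])
    w_bottom = np.exp(-beta * diff_b) * np.exp(-gamma * pbd_perp[:-1, :, 1])
    return w_right, w_bottom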

4. Moving object alignment

We observe that the color and depth pixels of an object start to lose sync when it moves fast. This poses a problem when we create vision-based surface-touch applications (e.g., a virtual whiteboard) or object-manipulation recognition systems [20]. In these applications, it is crucial to detect reliably when the fingertip touches the surface or when the object touches or leaves the table, so that we can display the strokes or understand what object has been picked up or put down. In order to deal with the motion artifacts (see Fig. 1, second row), we incorporate temporal information using pair-wise optical flow. Recall that when objects in the scene move fast, their depth images may contain misaligned parts; we seek to use spatiotemporal consistency requirements to realign these parts. Given a stream of color and depth images over a period T, denoted C_i and D_i, where i is the frame index and takes values in [1, T], we assume there are sporadic aligned depth frames that do not suffer from the misalignment. Since the artifacts usually accompany large motion, an aligned frame can be detected by measuring the motion of the segmented object in the RGB images. To determine whether D_i is aligned, we compute the optical flow between the segmented objects O_{i-1} and O_i in C_{i-1} and C_i, respectively. The speed v_o at which the object is moving can then be estimated as the mean magnitude of the per-pixel flow vectors over the object. When v_o is large, D_i is typically misaligned with C_i, making the object segmentation hard. On the other hand, a small v_o falling below a predefined threshold indicates a depth frame that does not need alignment.
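A minimal version of this alignment test might look as follows, assuming a dense optical flow field of shape (H, W, 2) from any flow routine and a boolean mask of the segmented object; the threshold value is an assumption.

import numpy as np

def depth_frame_is_aligned(flow, object_mask, speed_threshold=2.0):
    """Return True when the mean flow magnitude over the object is small,
    i.e. the depth frame does not need temporal realignment (Section 4)."""
    mags = np.linalg.norm(flow[object_mask], axis=-1)   # per-pixel flow magnitude
    v_o = mags.mean() if mags.size else 0.0
    return v_o < speed_threshold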







Fig. 6. Depth segmentation using temporal information. (a) Color images are indicated as black frames. White frames in the second row are aligned depth frames, which have few motion artifacts, and gray frames are the misaligned ones. (b) At frame t, we compute the optical flow field F_t, indicated by the arrows (top), from C_{t-1} and C_t. Using F_t, we then project an already aligned frame D'_{t-1} (which is the same as D_{t-1} when that frame is aligned) to generate a depth image D'_t, in order to address the misalignment problem in the original depth images. The process of aligning the depth frames with the color frames by pair-wise projection repeats until the next aligned depth frame, D_{t+N}, is detected. As can be seen, after the projection, the misaligned object O_{t-1} at time t-1 (indicated by the white circles) is projected to a new location at time t, which is better aligned with the color frame.

4.1. Handling XY anomalies by warping

Our strategy is to propagate the depth information from the aligned frames to succeeding misaligned ones, using optical flow in the color channel to estimate the required projection in the depth channel. Since optical flow computation does not work well under large displacements [21], we compute pair-wise optical flow, expecting the displacement between adjacent frames to be small enough. Suppose that at frame i = t-1 we are given D_{t-1}, an aligned frame followed by a sequence of misaligned frames D_t, D_{t+1}, D_{t+2}, ..., D_{t+N-1}, and pair-wise optical flow fields F_j computed between color frames C_{j-1} and C_j, for j = [t, t+N-1]. We project D_{t-1} into D'_t using F_t, and then D'_i into D'_{i+1} using F_{i+1}, for i = [t, t+N-2]. The process goes on until another aligned frame D_{t+N} is detected, at which point the process is re-started with D_{t+N}. We call the D'_i the "warped images" and use the term "warping" for the operation of creating a new aligned depth image by projecting a previous depth image based on the pair-wise optical flow. To perform the warping, we project each depth value from the original depth image to the new one according to the computed flow vectors. Due to the inaccurate nature of the optical flow computation, there might be unfilled depth pixels in the new depth image; we perform bilinear interpolation to fill in these pixels and smooth the new depth image. Fig. 6 illustrates the technique. After this process, and without further intervention, the warped depth images D'_i have two remaining drawbacks. First, errors in optical flow accumulate, making the warped images increasingly inaccurate [22]. Second, the warping is purely translational in the two-dimensional images, so motion toward or away from the camera is not handled well (i.e., the depth of the warped object may not be correct, since the depth values change). In this work, we use the optical flow implementation in [23].

4.2. Cleaning-up optical flow errors

Many techniques have been suggested to reduce both the error in optical flow [21] and the error due to the successive composition of flows [22], most notably using the composed optical flow as a starting point for a direct flow estimation from the aligned frame to the current one. We take an alternate approach to mitigating the flow errors. Assuming a background depth frame is available, at each frame i we use the warped image D'_i as the depth input to our proposed segmentation process, which serves as a refinement pipeline, as discussed in Section 3 and shown as components A and B in Fig. 2. That pipeline aligns the boundaries of the object mask derived from the depth map to those in the color frame by applying its combination of boundary-aware filtering and Graph Cuts. We then keep only those depth values that fall inside the resulting refined foreground object mask and use the depth values from the background frame for the background. This operation "masks out" optical flow artifacts from the original depth map D_i to get an updated depth map D'_i, which is now ready to be warped to D'_{i+1}. To summarize, we essentially use the consistency requirements between the depth and visual channels at each time step to remedy inconsistencies introduced by optical flow.
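The warping and clean-up steps can be sketched as below, with forward projection of valid depth values along the flow, a crude neighborhood fill standing in for the bilinear interpolation, and the masking-out step of Section 4.2; all names and simplifications are ours.

import numpy as np

def warp_depth(D_prev, flow):
    """Project each depth value forward along the pair-wise flow (Section 4.1)."""
    H, W = D_prev.shape
    D_new = np.zeros_like(D_prev)
    ys, xs = np.nonzero(D_prev > 0)                  # only move valid depth pixels
    xt = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    D_new[yt, xt] = D_prev[ys, xs]
    return D_new

def fill_holes(D, radius=2):
    """Crude stand-in for the bilinear interpolation used to fill unfilled pixels."""
    filled = D.copy()
    for y, x in np.argwhere(filled == 0):
        patch = D[max(0, y - radius):y + radius + 1, max(0, x - radius):x + radius + 1]
        vals = patch[patch > 0]
        if vals.size:
            filled[y, x] = vals.mean()
    return filled

def mask_out_flow_errors(D_warped, refined_mask, D_background):
    """Section 4.2: keep warped depth only inside the refined object mask and
    take the background depth frame elsewhere."""
    return np.where(refined_mask, D_warped, D_background)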

4.3. Handling Z anomalies by affine mapping

Finally, we address the case of the object moving toward or away from the camera. In this case, the depth values change from the previous frame, so warping the depth values from the previous frame will not give correct depth values for the current frame. The basic idea is that boundary points in a misaligned depth image D_i are matched to the corresponding points in the warped image D'_i, which has been aligned with respect to the RGB image. Further, if the moving object can be assumed rigid during the period between frames, the affine transformation estimated from these matched boundary points captures how to transform all the other warped points. At frame i, given the projected depth image D'_i and a segmentation mask O'_i generated using our previously discussed segmentation technique, we perform pure depth-based segmentation on the incoming depth image D_i, which might contain misalignment or erosion artifacts, to generate O_i. We then align the silhouettes of O'_i and O_i and extract the top k best-matching pairs of points {p_n, p'_n}, n = 1, ..., k, on the silhouette boundaries using the technique in [2]. Let d_n and d'_n be the depth readings of p_n and p'_n in the depth images D_i and D'_i, respectively. We fit an affine transformation d = a·d' + b, solving for the model parameters a and b relating the depth readings d_n and the projected depth readings d'_n by minimizing the least-squares error:

\mathrm{Error} = \sum_{n=1}^{k} \left(a \, d'_n + b - d_n\right)^2    (6)
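Eq. (6) is an ordinary least-squares problem in a and b; a short NumPy sketch (array names are our own) is:

import numpy as np

def fit_depth_affine(d_warped, d_observed):
    """Least-squares fit of d = a * d' + b (Eq. (6)) from k matched boundary points.

    d_warped   : depth readings d'_n taken from the warped image D'_i.
    d_observed : corresponding readings d_n from the incoming depth image D_i.
    """
    A = np.stack([d_warped, np.ones_like(d_warped)], axis=1)   # columns [d'_n, 1]
    (a, b), *_ = np.linalg.lstsq(A, d_observed, rcond=None)
    return a, b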

The projected depth image D'_i is then updated with the depth values from D_i using the fitted affine model and is ready to be fed into the warping step for the next time step.

5. Experiments

We evaluate our approach on a number of datasets of RGB-D images: (1) a static object dataset covering 24 object categories with different camera perspectives (SO), (2) a slow-moving hand dataset (SH), and (3) a body-moving dataset (MB). Each dataset is recorded against two or more different backgrounds and lighting conditions. These datasets are aimed at evaluating our proposed segmentation technique (i.e., the trilateral filter and the boundary-constrained Graph Cuts). In the object dataset, the object usually occupies 1/20 to 1/10 of the entire image; the hand in the hand dataset usually occupies 1/20 of the image, while the human body takes about 1/5 of the image in the body-moving dataset. The images are captured on different tables under different lighting conditions. Besides the image datasets, we also recorded two RGB-D video datasets to evaluate our motion-compensated segmentation: (4) a fast-moving hand dataset (FH), and (5) an object manipulation dataset (OM), both containing motion artifacts such as misalignment or erosion. The fast-moving hand dataset is constructed under the same settings as the slow-moving hand dataset except that the hand movement is faster; the same goes for the object manipulation dataset. All the datasets are recorded at VGA resolution. Fig. 7 shows some snapshots of the datasets.


Fig. 7. Examples of the dataset. Left: hand moving dataset. Middle: object/object manipulation dataset. Right: body moving dataset. The face region is blurred to conceal the identity of the experiment subjects.

To evaluate our proposed approach, we follow the correspondence computation algorithm and evaluation methodology of [2]. Given a human-labeled binary boundary map Mg and a machine-generated map M, the algorithm in [2] computes the false positives and true negatives by corresponding M to Mg with a tolerance for localization errors of d pixels, where d is a tunable parameter. We follow [2] by setting d to a times the diagonal of the test image. In all the subsequent experiments, we show results for a = 0.005 and a = 0.01 (essentially a slack of 4 and 8 pixels, respectively). We train a Pbd detector for each dataset; for example, for the object dataset (SO), we use a single-scale disk for feature extraction, and the learned values of the parameters α_s, β_s, γ_s, and λ_s are 0.51, 0.2, 0.31, and 1.7. For all the experiments, the neighborhood size is set to 30 by 30 pixels for both the regular bilateral filter and our trilateral filter. In this section, we seek to answer the following questions:

- Is our segmentation technique better than reasonable baselines?
- How does motion affect the object segmentation?
- What is the effect of boundary information on the bilateral filter?
- What is the effect of boundary information on Graph Cuts?
- Does the Pb operator benefit from depth?

5.1. Object mask extraction

In the experiments, we focus on the problem of extracting objects on a surface plane such as a table or the ground; the problem finds applications in robot navigation and human-computer interaction [1]. We therefore use the depth-based object extraction method proposed in [1] to convert the depth map into an object mask. Given objects on a surface, the authors of [1] fit a plane in 3-D space to the surface using the depth readings from the table pixels. Due to noise in the depth images, the fitting process is not perfect; we therefore extract pixels that are more than t cm away from the table as the foreground objects. The smaller t is, the more pixels from the surface will be included; when t is large, on the other hand, some pixels from the base of the object could be missing. We set t = 1 for all the experiments. Again, although we use this application to demonstrate the effectiveness of our approach, our approach can be applied in a general setting.
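As a concrete (simplified) sketch of this plane-removal step, the function below fits a plane to back-projected table points by least squares and keeps pixels farther than t cm from it; the back-projection, names, and fitting details are our assumptions and not the implementation of [1].

import numpy as np

def tabletop_object_mask(points, table_mask, t_cm=1.0):
    """Label as foreground the 3-D points lying more than t_cm away from the table.

    points     : (H, W, 3) 3-D coordinates (in cm) back-projected from the depth map.
    table_mask : (H, W) boolean map of pixels known to belong to the table surface.
    """
    # Fit a plane z = c0*x + c1*y + c2 to the table points by least squares.
    P = points[table_mask]
    A = np.column_stack([P[:, 0], P[:, 1], np.ones(len(P))])
    coeffs, *_ = np.linalg.lstsq(A, P[:, 2], rcond=None)
    # Distance of every pixel from the fitted plane.
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    dist = z - (coeffs[0] * x + coeffs[1] * y + coeffs[2])
    dist /= np.sqrt(coeffs[0] ** 2 + coeffs[1] ** 2 + 1.0)
    return np.abs(dist) > t_cm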

5.2. Comparisons with baselines on image datasets

In this experiment, we compare our single-frame segmentation technique (Section 3) with the following baselines: the unrefined depth image, state-of-the-art bilateral filtering [15], and pure color-based segmentation [3] seeded with an object mask derived from the original depth image. Table 1 shows the recall/precision of the different segmentation approaches on the image datasets. We abbreviate the baselines (B.F. for the joint bilateral filter and G.C. for Graph Cuts) and indicate the active components of our proposed method in parentheses.

Table 1. Comparison of our segmentation with baseline segmentation algorithms. Recall/precision.

a = 0.005
Dataset   Original    B.F.        G.C.        Ours(A + B)
SO        0.61/0.71   0.60/0.62   0.83/0.92   0.88/0.93
SH        0.52/0.70   0.35/0.57   0.71/0.72   0.76/0.77
MB        0.57/0.65   0.56/0.65   0.50/0.58   0.64/0.66

a = 0.01
Dataset   Original    B.F.        G.C.        Ours(A + B)
SO        0.78/0.91   0.80/0.82   0.88/0.96   0.92/0.97
SH        0.61/0.82   0.51/0.82   0.79/0.79   0.88/0.89
MB        0.74/0.85   0.74/0.86   0.63/0.73   0.76/0.78

A few points are worth noting. Our approach is competitive in all cases and significantly outperforms the baselines in many. The plain bilateral filter is not much better than the baseline depth maps, for reasons similar to those discussed for Fig. 3. Color-based Graph Cuts is competitive on static objects but noticeably worse on the moving datasets. The underlying problem is that motion blur and shadows make foreground and background hard to distinguish based on local color cues alone (as in Fig. 5).

5.3. Trilateral vs. bilateral filter

In order to understand how much the boundary constraint adds to the basic bilateral filter, we compare segmentation results on depth images filtered with and without our trilateral filter. Table 2 shows that with the additional boundary information, the bilateral filter can be improved by about 5% in both precision and recall. Although objects are not usually the same color as their background, it is surprisingly common for patches within the object to have a color similar to the background, resulting in glitches in purely color-based filtering, as in Fig. 3. Gains are much more dramatic for moving hands because blur and shadow effects cause color-based local boundary cues to deteriorate sharply. Fig. 8 shows four visual examples of de-noised depth images using our proposed trilateral filter. In general, the boundaries are smoother and holes are filled.

Table 2. Bilateral filter (B.F.) vs. trilateral filter (Ours(A)). Recall/precision.

a = 0.005
Dataset   B.F.        Ours(A)
SO        0.60/0.62   0.63/0.68
SH        0.35/0.56   0.53/0.66
MB        0.56/0.65   0.61/0.64

a = 0.01
Dataset   B.F.        Ours(A)
SO        0.80/0.82   0.82/0.90
SH        0.51/0.83   0.61/0.76
MB        0.56/0.64   0.61/0.64


Fig. 8. Four examples of our trilateral filtering results. Top: original color image. Middle: original depth image. Bottom: de-noised depth image. Best viewed in color.

Table 3. Comparison of motion-compensated segmentation with the baseline segmentation algorithms on the video datasets. Recall/precision.

a = 0.005
Dataset   G.C.        Ours(A + B)   Ours(A + B + C + D + E)
FH        0.63/0.62   0.63/0.64     0.79/0.77
OM        0.47/0.63   0.77/0.83     0.88/0.92

a = 0.01
Dataset   G.C.        Ours(A + B)   Ours(A + B + C + D + E)
FH        0.75/0.74   0.80/0.80     0.93/0.91
OM        0.63/0.84   0.85/0.92     0.93/0.98

5.4. Comparison with baselines on video datasets

We evaluate the effectiveness of our motion-compensated segmentation approach on the fast-moving datasets (FH and OM), which contain misalignment and erosion of the fingers or the objects. Using the regular Graph Cuts as the baseline, we first compare our proposed approach without and with alignment. Table 3 shows that incorporating motion information to warp the depth images before segmentation brings about a 10% improvement. Our approaches outperform both baselines, and the gains from incorporating motion cues are clearly pronounced. The performance of color-based Graph Cuts is noticeably worse than for still or slow-moving objects due to large misalignment and erosion artifacts (e.g., the hand in Fig. 1). Next, we evaluate the effectiveness of the depth image correction: we compare motion-compensated segmentation with and without the affine correction. Table 4 shows that with the affine correction the performance is improved substantially, especially on the fast-moving hand dataset, where the hand moves toward and away from the camera.

Table 4. Comparison of motion-compensated segmentation with and without the affine correction, and with pure image warping. Recall/precision.

a = 0.005
Dataset   Ours(A + B + C + D + E)   Ours(A + B + C + D)   Ours(A + B + C)
FH        0.79/0.77                 0.68/0.65             0.57/0.47
OM        0.88/0.92                 0.87/0.89             0.76/0.70

a = 0.01
Dataset   Ours(A + B + C + D + E)   Ours(A + B + C + D)   Ours(A + B + C)
FH        0.93/0.91                 0.83/0.79             0.79/0.66
OM        0.93/0.98                 0.92/0.94             0.90/0.82

Note that by masking out the non-object part of the depth image and applying the affine correction, our approach not only produces good segmentations but also suppresses cumulative optical flow errors over time. To see this, Table 4 also shows the segmentation results based on the warped depth images only, i.e., with no effort to "mask out" the optical flow artifacts described in Section 4.2.

5.5. Impact of boundary constraints on Graph Cuts

We explore the effects of the boundary information on Graph Cuts in this section. Table 5 compares the results for (a) regular and (b) boundary-aware Graph Cuts seeded with the original depth images, and (c) regular and (d) boundary-aware Graph Cuts seeded with depth images filtered using our trilateral filter, on all datasets. For the video datasets, we aligned all the depth frames before applying the segmentation process. Adding boundary information generally improves the performance, but note that the improvement is smaller between (c) and (d), since the target depth images are filtered with our trilateral filter, which has already incorporated the boundary information.

Table 5. Comparison of Graph Cuts with and without boundary information on all the datasets. Recall/precision.

a = 0.005
Dataset   G.C.        Ours(B)     Ours(A) + G.C.   Ours(A + B)
SO        0.84/0.92   0.85/0.90   0.88/0.94        0.88/0.93
SH        0.71/0.72   0.79/0.79   0.61/0.64        0.76/0.77
MB        0.50/0.58   0.56/0.61   0.60/0.66        0.64/0.66
FH        0.63/0.62   0.61/0.61   0.75/0.74        0.79/0.77
OM        0.47/0.63   0.55/0.68   0.84/0.89        0.88/0.92

a = 0.01
Dataset   G.C.        Ours(B)     Ours(A) + G.C.   Ours(A + B)
SO        0.88/0.97   0.91/0.97   0.91/0.98        0.92/0.97
SH        0.79/0.80   0.90/0.90   0.73/0.77        0.88/0.89
MB        0.63/0.73   0.71/0.77   0.71/0.78        0.76/0.78
FH        0.75/0.74   0.80/0.80   0.90/0.89        0.93/0.91
OM        0.63/0.84   0.74/0.90   0.88/0.93        0.93/0.97



Fig. 9. Boundaries of segmented object. (a), (c) show color-based Graph Cuts, and (b), (d) show boundary-constrained Graph Cuts.

Fig. 10. Precision recall curves of the Pbd (lighter) and Pb (darker curves) on datasets SO (blue), SH (green), and BM (red). Y-axis represents precision and X-axis represents recall. Best viewed in color. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

We found that the boundary information helps more when the foreground object has colors similar to the background. One common scenario is a hand or an object moving on the table: the shadow of the hand makes it hard to recover the exact boundary, but adding the boundary information mitigates the problem.

An example is shown in Fig. 9(a) and (b), where, because of the shadow, the regular Graph Cuts extends the object boundary onto the table. Fig. 9(c) and (d) shows another example where adding the boundary information is beneficial: without it, the regular Graph Cuts misclassifies the shadowed region around the handle as part of the foreground.

Fig. 11. (a) User initialization of "GrabCut". The rectangles represent possible regions for the foreground objects; pixels from both the foreground and background are also specified. (b) Left column: original color image. Second to fourth columns: segmentation results using plane fitting [1], GrabCut [3], and the proposed approach.


5.6. Comparison between the Pbd and Pb operators

Since neither the Pb nor the Pbd operator has a notion of background and foreground, we label all the object boundaries and evaluate both operators on the image datasets. Fig. 10 shows the precision-recall curves for both the Pbd and Pb operators. Pbd finds more reliable boundaries than Pb, even without the texture feature. For brevity, we only show the results for a = 0.005, but the results generalize to other values of a.
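For completeness, the sketch below is a simplified stand-in for the tolerance-based boundary scoring used in this section: it dilates each boundary map by d = a times the image diagonal instead of running the full correspondence algorithm of [2], so it only approximates the reported procedure.

import numpy as np
from scipy.ndimage import binary_dilation

def boundary_recall_precision(M, M_gt, alpha=0.005):
    """Score a machine boundary map M against a human-labeled map M_gt,
    tolerating localization errors up to d = alpha * image diagonal.
    This dilation-based matching approximates the correspondence step of [2]."""
    M = M.astype(bool)
    M_gt = M_gt.astype(bool)
    H, W = M_gt.shape
    d = max(1, int(round(alpha * np.hypot(H, W))))
    tol = np.ones((2 * d + 1, 2 * d + 1), dtype=bool)        # square tolerance window
    gt_zone = binary_dilation(M_gt, structure=tol)           # region around true edges
    m_zone = binary_dilation(M, structure=tol)               # region around detections
    precision = (M & gt_zone).sum() / max(M.sum(), 1)        # detections near a true edge
    recall = (M_gt & m_zone).sum() / max(M_gt.sum(), 1)      # true edges near a detection
    return recall, precision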


5.7. Qualitative results

We show qualitative results of our tabletop object segmentation application. Fig. 11 shows some example objects on the turntable, which we seek to segment out. We compare our approach with two baselines: (i) the pure depth-based segmentation in [1] and (ii) "GrabCut" [3], the OpenCV implementation of a pure color-based Graph Cuts segmentation. Note that (i) is an automatic process: the objects are extracted after the parametric form of the table plane is determined. On the other hand, (ii) requires user initialization. Specifically, the user needs to specify a rectangle surrounding the possible target object and can also specify foreground and background pixels in case the foreground object has a color similar to the background. The quality of the segmentation results therefore depends on how good the initialization is. Fig. 11(a) shows the initialization we provide, aiming at making GrabCut as good as possible. Our proposed approach, as described in Section 3, is an automatic process that takes the depth image as input, refines it, extracts the objects based on the refined depth image, and eventually further refines the segmented object mask by fusing color cues. Fig. 11(b) shows the original color images and the segmentation results using the different methods. Note that the pure depth-based segmentation suffers from sensor failures, resulting in missing pieces and jagged boundaries, while the pure color-based segmentation suffers from shadowing effects and yields unclean masks when the foreground object has a color similar to the background. Our proposed approach handles these problems by fusing both depth and color cues whenever possible and generates the cleanest object segmentations.

5.8. Computational aspects

We implement our methods using a mixture of MATLAB and C++ and run the experiments on a Dell Precision workstation with an Intel Xeon T3500 processor. Running our entire pipeline on a VGA image takes about 7 s on average. This includes 1.35 s for the MATLAB implementation of the optical flow computation on a subsampled QVGA image, 3.57 s for the Pbd feature extractor on the original VGA image, 1.87 s for the trilateral filter on the original VGA image, and 0.9 s for Graph Cuts on the region of interest. Further optimizations are possible as future work. For example, we currently run the feature extraction and the trilateral filter on the entire image; we could reduce the computation by focusing on the area surrounding the object of interest. In general, if an object occupies about 1/10 of the image, we would expect roughly a 10x speed-up. We will also explore parallelizing the computation on GPUs.

6. Conclusion

We take a close look at the problem of extracting clean object masks from video with depth maps. We observe the major sources of noise in such footage and propose a framework based on integrating visual, depth, and temporal information to mitigate these problems. We show how to incorporate reliable boundary information through our proposed Pbd operator and boundary-constrained trilateral filter for object segmentation with both color and depth images. We also propose an approach that incorporates temporal information through optical flow to correct the motion artifacts in the depth images. We evaluate our approach using a number of datasets and show substantial improvement. We conclude that by careful integration of visual and temporal information, object masks derived from depth footage can be greatly improved.

References

[1] Kevin Lai, Dieter Fox, Object detection in 3D point clouds using web data and domain adaptation, International Journal of Robotics Research 29 (2010) 1019–1037.
[2] David Martin, Charless Fowlkes, Jitendra Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 26 (2004) 1–20.
[3] Carsten Rother, Vladimir Kolmogorov, Andrew Blake, GrabCut: interactive foreground extraction using iterated Graph Cuts, ACM Transactions on Graphics 23 (2004) 309–314.
[4] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake, Real-time human pose recognition in parts from single depth images, in: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[5] Hao Du, Peter Henry, Xiaofeng Ren, Marvin Cheng, Dan Goldman, Steven Seitz, Dieter Fox, Interactive 3D modeling of indoor environments with a consumer depth camera, in: Ubicomp, 2011.
[6] Richard Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Andrew Fitzgibbon, KinectFusion: real-time dense surface mapping and tracking, in: IEEE International Symposium on Mixed and Augmented Reality, 2011.
[7] Kevin Lai, Liefeng Bo, Xiaofeng Ren, Dieter Fox, A large-scale hierarchical multi-view RGB-D object dataset, in: IEEE International Conference on Robotics and Automation, 2011.
[8] Andreas Richtsfeld, Thomas Mörwald, Johann Prankl, Jonathan Balzer, Towards scene understanding – object segmentation using RGBD-images, in: Computer Vision Winter Workshop, 2012.
[9] Nathan Silberman, Rob Fergus, Indoor scene segmentation using a structured light sensor, in: ICCV Workshops, 2011.
[10] Qi Fei, Han Junyu, Wang Pengjin, Shi Guangming, Fu Li, Structure guided fusion for depth map inpainting, Pattern Recognition Letters 34 (2012).
[11] Karthikeyan Vaiapury, Anil Aksay, Ebroul Izquierdo, GrabCutD: improved GrabCut using depth information, in: Proceedings of the 2010 ACM Workshop on Surreal Media and Virtual Cloning (SMVC '10), 2010.
[12] Ingo Schiller, Reinhard Koch, Segmentation by adaptive combination of depth keying and mixture-of-Gaussians, in: Proceedings of the Scandinavian Conference on Image Analysis, 2012.
[13] Alexey Abramov, Karl Pauwels, Jeremie Papon, Florentin Worgotter, Babette Dellen, Depth-supported real-time video segmentation with the Kinect, in: Workshop on the Applications of Computer Vision, 2012.
[14] Christian Richardt, Carsten Stoll, Neil Dodgson, Hans-Peter Seidel, Christian Theobalt, Coherent spatiotemporal filtering, upsampling and rendering of RGBZ videos, in: Eurographics, vol. 31, 2012.
[15] Jennifer Dolson, Jongmin Baek, Christian Plagemann, Sebastian Thrun, Upsampling range data in dynamic environments, in: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[16] Qingxiong Yang, Ruigang Yang, James Davis, David Nistér, Spatial-depth super resolution for range images, in: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[17] Ramesh Raskar, Adrian Ilie, Jingyi Yu, Image fusion for context enhancement and video surrealism, in: International Symposium on Non-Photorealistic Animation and Rendering, 2004.
[18] Andrew Adams, Natasha Gelfand, Jennifer Dolson, Marc Levoy, Gaussian KD-trees for fast high-dimensional filtering, in: ACM Annual Conference on Computer Graphics (SIGGRAPH), 2009.
[19] Xiaofeng Ren, Charless Fowlkes, Jitendra Malik, Scale-invariant contour completion using conditional random fields, in: IEEE International Conference on Computer Vision (ICCV), 2005.
[20] Jianxin Wu, Adebola Osuntogun, Tanzeem Choudhury, Matthai Philipose, James Rehg, A scalable approach to activity recognition based on object use, in: IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[21] Thomas Brox, Jitendra Malik, Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 33 (2011) 500–513.
[22] Ankit Gupta, Pravin Bhat, Mira Dontcheva, Brian Curless, Oliver Deussen, Michael Cohen, Enhancing and experiencing spacetime resolution with videos and stills, in: International Conference on Computational Photography, 2009.
[23] Ce Liu, Beyond Pixels: Exploring New Representations and Applications for Motion Analysis, Massachusetts Institute of Technology, 2009.
