Continuous plane detection in point-cloud data based on 3D Hough Transform

Rostislav Hulik, Michal Spanel, Pavel Smrz, Zdenek Materna

Brno University of Technology, Faculty of Information Technology, IT4Innovations Centre of Excellence, Bozetechova 2, 61266 Brno, Czech Republic

J. Vis. Commun. Image R. 25 (2014) 86–97

Article history: Available online 9 April 2013

Keywords: RGB-D sensor; Point cloud; Hough Transform; Plane detection; PCL library; RANSAC; Computer vision; Shape extraction

Abstract

This paper deals with shape extraction from depth images (point clouds) in the context of modern robotic vision systems. It presents various optimizations of the 3D Hough Transform used for plane extraction from point-cloud data. The presented enhancements of standard methods address problems related to noisy data, the high memory requirements of the parameter space, and the computational complexity of point accumulation. The resulting robust plane detector benefits from the continuous point-cloud stream generated by a depth sensor over time, which is used to iteratively refine the results. The system is compared to a state-of-the-art RANSAC-based plane detector from the Point Cloud Library (PCL). Experimental results show that it surpasses the PCL alternative in the stability of plane detection and in the number of false detections. This advantage is crucial for robotic applications: for example, when a robot approaches a wall, the wall is recognized consistently. The paper concludes with a discussion of further promising optimisations planned as future work.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

The Hough Transform (HT) [1–3] is a standard method for detecting lines, circles and other parameterized shapes in raster images. It can also be used to detect 3D objects such as planes or cylinders. Both the 2D and the 3D cases are characterized by high computational costs, which is why a wide range of variants and optimizations of the basic HT have been proposed and successfully applied in many fields of computer vision.

This paper deals specifically with HT applications in indoor robotics. Because of the large number of planar objects typically present in human-made indoor environments, there is a clear need for robust plane detection. It can identify dominant planes such as the floor or a table top and provide a basis for polygonal model building. Fig. 1 shows an example of plane detection in a kitchen.

HT-based approaches can cope with noise in the input data. This property is crucial for their use in today's robotics, which frequently employs low-cost depth sensors suffering from structural noise. Although the most widely used devices such as Kinect, PrimeSense or XtionPRO provide RGB visual data in addition to depth data, this paper does not discuss combining the RGB information with the depth. Despite its potentially lower accuracy, this approach leads to general depth-processing methods, easily adaptable to data from other sensors such as LIDARs.


The main motivation for using the HT was the nature of our task – building an environment map from depth data. For this purpose, the HT provides very stable results over long time periods thanks to its iterative refining capabilities. Moreover, it can accumulate the final environment model without any post-processing. The RANSAC method, for example, is also a very robust method for plane extraction from point clouds, but building a whole map with it requires extensive post-processing: merging planes detected in several frames, analysing multiple detections of the same plane, etc. The RANSAC computation time also depends strongly on the number of planes present in the point cloud and on the maximum number of iterations.

This paper also introduces several optimizations of the HT-based plane detection pipeline which significantly reduce memory and time requirements compared to standard implementations. The proposed scheme takes advantage of the continuous character of the data provided by the sensors. The accumulation properties of the parameter space allow detected planes to be refined iteratively over time by composing information from several successive frames.

The devised solution was tested on real data captured by Fraunhofer's Care-O-bot [4], on data generated by a specifically extended simulator based on the Gazebo tool [5], and on frame sequences recorded with our Kinect sensor. Results are discussed along with an analysis of their dependency on various parameter settings. Finally, the implemented system is compared to a plane detector based on the RANSAC algorithm available in the PCL library [6].


Fig. 1. Results of plane detection based on the 3D Hough Transform. From left to right: a sample RGB image; input depth image; identified planes.

2. Related work

This work extends our previous work on plane detection in depth-map data, in which modifications for depth-map segmentation using plane prediction/detection were presented by Hulik et al. [7]. The analysed methods consisted of plane prediction and a RANSAC plane search. The algorithms were computationally inexpensive and usable for depth-map preprocessing, but not accurate enough for environment reconstruction. The results of that experiment showed that a more precise solution needs to combine information from multiple frames.

Three families of methods for plane detection in point clouds can be distinguished:

- region growing techniques [8,9],
- RANSAC-based methods [10–12],
- and methods based on feature space transformations [11,13–16].

2.1. Region growing

An efficient incremental region growing algorithm for identifying regions of points that lie on a plane, extending a randomly chosen point and its nearest neighbour from point-cloud data, is presented in [8]. While the region grows, the mean square error (MSE) from an optimal plane of the region is compared to a threshold. Parameters of the optimal plane are estimated using an eigenvector analysis of the data points in the region.

Another region growing method was designed by Deschaud and Goulette [9]. The authors propose a fast and accurate algorithm able to detect planes in unorganized point clouds using filtered normals and voxel growing. The work is based on a robust estimation of normals at noisy data points. A local plane score is computed at each point and the best local seed plane is identified.

2.2. RANSAC

Random Sample Consensus (RANSAC) is a randomized procedure that iteratively fits an accurate model to observed data without trying all possibilities [17]. The data may contain a large number of outliers. Schnabel et al. adapted RANSAC for plane detection [10]. To detect a plane with RANSAC, three random points are chosen and the plane parameters are computed. A score function determines how well the model fits the remaining points; the score takes into account the number of points close enough to the plane. The plane with the best score is taken as the winner. The authors found that the method provides precise and fast results when properly tuned.

Another RANSAC-based technique, able to detect textured or textureless planar surfaces, has been presented by Heracles et al. [12]. Continuous textured regions are characterized by similar local normals, while textureless regions are identified by a smooth colour transition using a region growing segmentation. Additionally, spatio-temporal coherence between subsequent frames is employed to improve the results and the stability of the detection.

Tarsha-Kurdi et al. compared the 3D HT to the RANSAC algorithm on the detection of roof planes in 3D building point clouds [11]. Their results favour the RANSAC-based solution as it appears to be more efficient.

2.3. Feature space transformation

The Hough Transform is a traditional approach to detecting parameterized shapes and objects in images using transformations into a feature space. Originally, Hough [1] defined the transformation for detecting lines; later it was extended to more complex shapes and generalized to arbitrary patterns [3]. The Hough Transform is defined by a parameterization of the detected shape. Each shape is described by two or more parameters, each point of the input data is mapped to a parameter space (the Hough space), and the voting accumulators corresponding to the matching shapes are increased. A summary of existing variations of the 2D HT, as well as some other optimizations, is presented in [18].

A 3D Hough Transform for plane detection typically describes every plane by its slope along the X and Y axes and by d – the height of the plane at the origin. Every point in this parameter space corresponds to a plane in the object space. After processing the whole input, the accumulators in the parameter space are searched for local maxima above a given threshold; these correspond to the detected shapes of interest in the original space.

Borrmann et al. evaluate different variants of the 3D HT with respect to their applicability to plane detection in 3D point clouds, focusing on the representation of the accumulator [13]. The authors propose a novel accumulator design that guarantees the same size for each cell: the accumulator is a sphere divided into slices which are indexed using polar coordinates. The paper compares this approach to other existing designs.

It is also possible to split the detection into two steps: the detection of the plane normal and the detection of the plane's distance. First, the normal vectors are mapped onto a half-sphere, which can be expressed as a 2-dimensional parameter space. The maximum on the sphere defines the direction of the normal vector. The remaining parameter d can then be calculated from all points with a normal vector similar to this maximum; the d values are mapped to a 1-dimensional parameter space whose maximum determines the distance of the plane from the origin.

Borrmann et al. also compared different Hough methods. Their results show clearly that the Randomized Hough Transform (RHT) [18,16] is the method of choice when dealing with 3-dimensional data due to its exceptional performance. Mochizuki et al. [15] also discuss the performance of the HT, in particular comparing the Randomized Hough Transform with the Three-point Hough Transform [19]. They generalize the randomized HT to an n-point HT to enhance robustness in noisy images while retaining the speed of the randomized approach.


Han et al. [14] present a novel surface fitting approach for 3D dense reconstruction (EDA). EDA transforms points from the geometrical space to a surface model space and applies a deterministic annealing algorithm to partition the surface space into several regions. The reverse transformation is then able to segment the original point cloud.

2.4. Proposed approach

All the described methods share a common problem – accurate plane detection in the noisy, quantized depth images specific to the Kinect sensor. The quantization of the data creates multiple easily detectable artifact planes. The proposed approach benefits from the key feature of the Hough space and the voting mechanism – it detects planes across the whole sequence of incoming frames. The detection can thus be accurate even when an individual frame is of bad quality, provided several consecutive still (or nearly still) camera views are available. Additionally, if a robot localization transformation is available (a transformation from camera coordinates to the world coordinate system), the Hough space can be accumulated over the whole scene by accumulating the transformed points, in world coordinates, into the Hough space, as sketched below. Our tests consist mainly of moving-camera sequences (using our localisation software), but also include still-camera tests.
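As an illustration, the localization step can be expressed with PCL's standard transform utility. The helper below is our sketch, not the authors' code; the camera-to-world pose is assumed to come from the robot's localization module.

#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/common/transforms.h>
#include <Eigen/Geometry>

// Hypothetical helper: bring one localized frame into world coordinates
// before its points are voted into the shared (global) Hough space.
void accumulateLocalizedFrame(const pcl::PointCloud<pcl::PointXYZ>& frame,
                              const Eigen::Affine3f& cameraToWorld)
{
    pcl::PointCloud<pcl::PointXYZ> world;
    pcl::transformPointCloud(frame, world, cameraToWorld);
    // ... points in 'world' are now accumulated into the Hough space,
    // so evidence for each plane builds up across the whole scene.
}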

3. Detection of planes using 3D Hough Transform

The Hough Transform is a robust shape extraction method even on noisy datasets. Its major drawbacks, however, are its memory and computational time demands. This is due to the nature of the HT: in theory, all possible shapes passing through every transformed point need to be examined. We propose several optimizations to minimize these disadvantages – to reduce the memory requirements of the Hough space (HS), the computational cost of the HS construction/accumulation (the voting process), and the cost of the search for peaks. The following enumeration shows the proposed detection pipeline with the enhancements to the standard algorithm; the particular steps are discussed in the following sections.

1. Surface normal estimation
   - PCL's fast integral-image normal computation is used to estimate a plane candidate at each point of the point cloud.
2. Accumulation to the parameter space
   - The parameter space is represented by a hierarchical structure which brings significant memory savings while preserving accuracy.
   - For each point, the parameters of a single plane are estimated, so the point contributes to the Hough space only once (this corresponds to the Randomized Hough Transform [18]).
   - Every point contribution is smoothed by a pre-computed Gaussian function.
   - Hough space caching is used to speed up the accumulation process.
3. Extraction of maxima
   - A sliding window technique is applied to search for maxima in the Hough space.
4. Refining of planes
   - A two-level parameter space computation is used to filter artifacts arising when several noisy frames are accumulated.

3.1. Plane representation in 3D parameter space

In three dimensions, a plane can be defined using the plane equation:

ax + by + cz + d = 0,  (1)

where (a, b, c) defines the plane normal vector and d is the distance from the origin (0, 0, 0). However, this simple definition leads to a 4-dimensional HS which is, due to its memory requirements, practically unusable. Assuming normalized normal vectors, the HS can be represented in only three dimensions:

HS = (α, β, d) ∈ [−π, π] × [−π, π] × ℝ,  (2)

where α, β are the first two Euler angles of the normalized plane normal vector. Fig. 2(a) demonstrates how the α, β angles are extracted. The third Euler angle is redundant since there is no need for information about the rotation around the v axis when transforming the plane normal vector.

The described HS representation saves a large amount of memory, since the angle resolution is static and only the d dimension needs to be extended in order to accumulate planes in a large space. The standard approach to Hough space creation needs to examine all possible planes which pass through the current point in 3D space and to accumulate the parameters of these planes into the HS. The proposed plane detection algorithm uses an optimization step, described in the following section, which significantly reduces the number of examined planes.

3.2. Surface normal estimation

For the estimation of the surface normal, our method adopts an approach inspired by the PCL framework for integral-image point cloud normal estimation, which has proven to be sufficiently fast and precise even on noisy data. There are three ways to compute the surface normal using multiple integral images in the PCL [20]:

- Covariance matrix – computes the normal vector by an eigenvector analysis of the covariance matrix of the point's neighbourhood.
- Average 3D gradient – uses average horizontal and vertical 3D differences around the current point.
- Average depth change – computes horizontal and vertical 3D differences from averaged neighbours.

Our implementation applies the covariance matrix method, computed from a large neighbourhood. The size depends on the amount of noise in the image; an 8 × 8 neighbourhood is used when analysing the Kinect sensor data in the reported experiments. A possible configuration of this step is sketched below.
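The sketch uses PCL's integral-image normal estimation; the parameter values are illustrative, since the text only specifies the covariance-matrix method and an 8 × 8 neighbourhood for Kinect data.

#include <pcl/point_types.h>
#include <pcl/features/integral_image_normal.h>

// Sketch of the normal-estimation step; requires an organized cloud
// (e.g., a Kinect depth image) so that integral images can be used.
pcl::PointCloud<pcl::Normal>::Ptr
estimateNormals(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr& cloud)
{
    pcl::IntegralImageNormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
    ne.setNormalEstimationMethod(ne.COVARIANCE_MATRIX); // eigenvector analysis
    ne.setNormalSmoothingSize(8.0f);      // neighbourhood size (illustrative)
    ne.setMaxDepthChangeFactor(0.02f);    // depth-discontinuity guard (illustrative)
    ne.setInputCloud(cloud);

    pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
    ne.compute(*normals);
    return normals;
}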

3.3. Accumulation to the Hough space

Having the surface normals pre-computed from the depth image, it is easy to estimate the parameters of a plane for each point in 3D space. The HS coordinates are computed from the extracted normal using the equation:

d = −(nx·x + ny·y + nz·z),  (3)

where n = (nx, ny, nz) is the extracted normal vector (|n| = 1) and p = (x, y, z) is the current point. Using all the information computed by the surface normal extractor and Eq. (3), we can express a point in the parameter space as h = (α, β, d). We denote the transformation of a point p to its Hough image h as a function:

h(p) = (α, β, d).  (4)
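A minimal sketch of the mapping h(p) follows. The azimuth/elevation convention used for the two angles is our assumption – the text only states that α, β are the first two Euler angles of the unit normal; d follows Eq. (3).

#include <cmath>

struct HoughCoords { double alpha, beta, d; };

// Sketch of h(p) from Eq. (4) for a point p and its unit normal n.
HoughCoords houghImage(double px, double py, double pz,
                       double nx, double ny, double nz)   // |n| = 1
{
    HoughCoords h;
    h.alpha = std::atan2(ny, nx);                  // rotation in the xy-plane
    h.beta  = std::atan2(nz, std::hypot(nx, ny));  // inclination from the xy-plane
    h.d     = -(nx * px + ny * py + nz * pz);      // Eq. (3)
    return h;
}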


Fig. 2. Illustration of the α, β angles computation for a vector v with respect to the x axis (a); and the hierarchical structure of the 3D parameter space (b).

An example of such an accumulated space is shown in Fig. 3. Two optimizations are applied in the accumulation (or voting) process. The first aims at an efficient HS representation; the second optimizes the voting process itself.

3.3.1. Parameter space data structure

A straightforward implementation represents the Hough space as a simple static array. It can be assumed that a robot will operate in a closed area in indoor scenarios; although the HS can therefore be limited, the 3D parameter space may still be extremely large. This can be demonstrated by the following example. If the resolution of detected planes is to be at least Δ∠ = 0.01 rad and Δd = 0.01 m, it is necessary to represent the Hough space with dimensions of at least 625 × 625 × N, where N is the size of the d dimension. For a scene of 6 × 6 m, the d size of the HS must be at least N = 1200. The dimensions of the Hough space are then 625 × 625 × 1200, which means 1875 MB when using 32-bit floating point arithmetic. Although it is easy to allocate such memory in today's standard computers, it is certainly undesirable for on-board computers in robotic applications.

To reduce the memory requirements, a hierarchical structure is used, tuned on the basis of the experimental results (see Section 4). The structure reflects the sparsity of the parameter space – the number of empty cells in the allocated space is high (90–99%, depending on the amount of noise). Based on an analysis of allocation patterns in a series of experiments, we devised a 2-level hierarchical structure. The top layer consists of a grid of cells, each containing a level-two grid of size 16³ – the chunk. The value was chosen experimentally as a trade-off between the size of the level-two grid and the number of empty chunks. An illustration of the structure is shown in Fig. 2; a minimal sketch of such a structure follows below. The use of the hierarchical structure significantly reduces the amount of allocated memory – the HS takes between 1% and 10% of its original size. Table 1 summarizes the results of the relevant experiments.
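The following sketch shows one possible realisation of the two-level structure: a sparse top level keyed by chunk coordinates, with dense 16³ chunks allocated on first access. All names are ours; the authors' implementation may differ.

#include <array>
#include <cstdint>
#include <unordered_map>

// Illustrative two-level parameter space. Cell indices (a, b, d) are
// assumed non-negative; each chunk index must fit into 21 bits.
class HoughSpace {
public:
    float& at(unsigned a, unsigned b, unsigned d) {
        const std::uint64_t key = chunkKey(a >> 4, b >> 4, d >> 4);
        Chunk& c = chunks_[key];   // zero-initialized chunk allocated on demand
        return c[((a & 15u) * 16u + (b & 15u)) * 16u + (d & 15u)];
    }
private:
    using Chunk = std::array<float, 16 * 16 * 16>;   // 16 KB per chunk
    static std::uint64_t chunkKey(unsigned ca, unsigned cb, unsigned cd) {
        // pack three chunk indices into one hash key (21 bits each)
        return (std::uint64_t(ca) << 42) | (std::uint64_t(cb) << 21) | cd;
    }
    std::unordered_map<std::uint64_t, Chunk> chunks_;
};

Since 90–99% of the cells are empty, only a small fraction of the chunks is ever allocated, which is where the reported 1–10% memory footprint comes from.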

3.3.2. Point accumulation

The last step of the HS creation is the accumulation of points that correspond to the parameters of the estimated planes. As only one plane is considered for each point and there can be a significant amount of noise in the input image, it is not desirable to accumulate the point precisely at its mapped location in the Hough space. To increase the robustness of the plane detection, our method instead deals with blurred point contributions: a smoothing Gaussian is used to spread the contribution of each point around its projection h̄:

G3D(h, h̄, σ) = (1 / (√(2π)·σ)³) · e^(−|h − h̄|² / (2σ²)),  (5)

where h̄ = (α, β, d) is the projected point and σ is the standard deviation of the Gaussian function. The Gaussian kernel in the HS is always the same, so it can be pre-computed and incrementally added to the final Hough space at the position of the current point.

3.3.3. Caching the Hough space

Unfortunately, smoothing each point is expensive: even with a simple 11³ Gaussian kernel, the number of ADD operations rises 1331-fold. Hence we introduce a "cache space" representing an unblurred Hough space. Only one value per point is added when computing this space; the resulting Hough space is then computed by applying a simple Gaussian blur operator of a given size to the cache, as sketched below.

Note that the proposed method does not need to take care of the circular shape of the Hough space (the case when the angle of a detected plane lies near the edge of the parameter space). Such a plane would have two maxima (one at each end of the parameter space). In our implementation, this situation is resolved in the post-processing steps described in the following sections.
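A sketch of the caching idea on a small dense grid follows: raw single votes go into the cache, and one blur pass then replaces the per-point splatting of the kernel of Eq. (5). Applying a 1D Gaussian along each axis in turn is the standard separable equivalent of the 3D kernel; grid sizes and σ are illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

// Dense cache grid, flattened as [a][b][d]. One ADD per voted point.
struct Grid3 {
    int na, nb, nd;
    std::vector<float> v;
    Grid3(int a, int b, int d) : na(a), nb(b), nd(d), v(std::size_t(a) * b * d, 0.f) {}
    float& at(int a, int b, int d) { return v[(std::size_t(a) * nb + b) * nd + d]; }
};

// Pre-computed, normalized 1D Gaussian kernel of radius r.
std::vector<float> gauss1d(int r, float sigma) {
    std::vector<float> k(2 * r + 1);
    float s = 0.f;
    for (int i = -r; i <= r; ++i)
        s += k[i + r] = std::exp(-0.5f * i * i / (sigma * sigma));
    for (float& x : k) x /= s;
    return k;
}

// Blur the cache along the d axis; repeating the same pass along the
// a and b axes completes the separable 3D smoothing.
void blurD(Grid3& g, const std::vector<float>& k) {
    const int r = int(k.size()) / 2;
    std::vector<float> row(g.nd);
    for (int a = 0; a < g.na; ++a)
        for (int b = 0; b < g.nb; ++b) {
            for (int d = 0; d < g.nd; ++d) {
                float acc = 0.f;
                for (int i = -r; i <= r; ++i) {
                    const int j = d + i;
                    if (j >= 0 && j < g.nd) acc += k[i + r] * g.at(a, b, j);
                }
                row[d] = acc;
            }
            for (int d = 0; d < g.nd; ++d) g.at(a, b, d) = row[d];
        }
}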

Fig. 3. Examples of the accumulated Hough space (ab plane). From left to right: noise without any specific plane; detectable plane with noise; clear maximum of a detectable plane.


Table 1
Percentage of empty chunks measured on simulated and real data.

Chunk size    Simulated clean data:    Real noisy data:
              empty chunks (%)         empty chunks (%)
1             99.25                    96.39
2             99.13                    96.17
4             98.86                    95.72
8             98.35                    95.03
16            97.14                    93.54
32            94.46                    90.56
64            88.87                    82.71
128           68.75                    57.03

3.4. Summary of Hough space creation

Several enhancements of the representation and creation of the Hough space have been presented. To clarify the whole process, we present a summary in Algorithm 1. Moreover, the optimized parameter space data structure makes the Hough transform practically usable for 3D plane detection even at higher resolutions.

Algorithm 1. Add a frame to the Hough space

  for each point p
    Estimate normal n
    Estimate coefficient d
    Calculate h(p) (see Eq. (4)) and accumulate it into the cache space
  end for
  Smooth the cache space and add it to the global one – see Algorithm 2

3.5. Extraction of peaks

The second crucial step of HT-based plane detection is identifying significant maxima in the accumulated space. It is the most critical point of the whole detection, since the detected maxima are the candidates for the final detected planes. It is beneficial to consider the accumulated Hough space as a combination of several frames: the fact that most planes are present in several consecutive frames simplifies the search for sharp local maxima, i.e., peaks, in the HS.

The noise in real captured data presents a further obstacle. The structural and transmission noise has already been suppressed by the Gaussian smoothing, but imprecise positioning of the depth sensor causes another difficulty: it results in multiple strong peaks within a small distance of each other – the same plane positioned slightly differently in the analysed point clouds.

The implemented system addresses this problem from two sides. The first is the peak extraction itself. A sliding window of a specified size is used to identify the peaks: a close neighbourhood is analysed for each point and the sum of the neighbouring values is compared to the point. Adjacent extremes can be treated as one peak this way. This situation is frequent when dealing with quantized (discrete) data from a sensor – the image in the parameter space is then formed by close local maxima. The final peak position is extracted by interpolating the values in a sub-voxel manner. The resulting position of the maximum is defined as follows:

m(x) = ( Σ_{p∈W(x)} h(p)·p ) / ( Σ_{p∈W(x)} h(p) ),  (6)

where W(x) is the set of neighbouring points around point x and h(x) is the HS value at point x. A sketch of this refinement step is given below.
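The sub-voxel peak position of Eq. (6) is the accumulator-weighted centroid of the cells inside the window W(x). The data layout below is illustrative; a non-empty window with a positive sum of values is assumed.

#include <array>
#include <vector>

struct Cell { float a, b, d, value; };   // cell centre in HS + accumulator value

// Weighted centroid over the sliding window W(x), per Eq. (6).
std::array<float, 3> refinePeak(const std::vector<Cell>& window)
{
    float wsum = 0.f, a = 0.f, b = 0.f, d = 0.f;
    for (const Cell& c : window) {
        wsum += c.value;        // denominator of Eq. (6)
        a += c.value * c.a;     // numerator, component-wise
        b += c.value * c.b;
        d += c.value * c.d;
    }
    return { a / wsum, b / wsum, d / wsum };
}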

3.6. Refining hypotheses about planes

If a plane is present in several frames, its image in the Hough space is stronger. Small variations in camera position can be overcome thanks to the blurred representation and the sliding-window maxima search. However, another problem appears when several frames are accumulated. Due to the structural noise present in all cheap depth sensors, recurring artifacts are accumulated into the Hough space. Although each individual artifact would be filtered out by the Hough space smoothing process, multiple frames containing the artifact accumulate a false peak in the Hough space (the structural noise cannot be isolated by adding multiple frames). An inner loop is therefore added to the maxima search; Algorithm 2 outlines the procedure.

Algorithm 2. Accumulative artifact countermeasure

  Create the HS for the current frame
  Search for maxima with a low threshold
  Accumulate the maxima into the global HS using the pre-computed Gaussian kernel

A separate HS is created and local maxima are identified for each frame. It is not necessary to introduce a high threshold for the value of a local maximum in this step. The identified maxima are accumulated into the global HS; each plane detected in a frame represents a single smoothed contribution to the HS. This modification is able to filter out the recurring accumulative noise. Fig. 4 demonstrates the effect of the noise suppression step.

As mentioned above, special attention needs to be paid to the circular shape of the parameter space. A post-processing refining tool is introduced to avoid multiple detections of planes at the borders of the Hough space. It takes advantage of a similarity measure for planes which integrates the difference between the normals and the difference between the coefficients d:

e(n₁, n₂) = acos(n₁ · n₂),  (7)

e(d₁, d₂) = |d₁ − d₂|.  (8)

Eq. (7) assumes that the normal vectors are normalized. A similar pair of planes is merged using a weighted sum:

n = (n₁·h(p₁) + n₂·h(p₂)) / (h(p₁) + h(p₂)),  (9)

d = (d₁·h(p₁) + d₂·h(p₂)) / (h(p₁) + h(p₂)),  (10)

where p₁ and p₂ are the corresponding points in the Hough space. A sketch of this merge step follows.
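The sketch below merges two similar detections per Eqs. (9) and (10); the weights are the Hough-space responses h(p₁), h(p₂). The re-normalization of the averaged normal is our addition – a weighted average of unit vectors is generally not unit length.

#include <cmath>

struct Plane { double n[3]; double d; double weight; };  // weight = h(p)

Plane mergePlanes(const Plane& p1, const Plane& p2)
{
    const double w = p1.weight + p2.weight;
    Plane m;
    for (int i = 0; i < 3; ++i)                            // Eq. (9)
        m.n[i] = (p1.n[i] * p1.weight + p2.n[i] * p2.weight) / w;
    const double len = std::sqrt(m.n[0]*m.n[0] + m.n[1]*m.n[1] + m.n[2]*m.n[2]);
    for (int i = 0; i < 3; ++i) m.n[i] /= len;             // keep |n| = 1
    m.d = (p1.d * p1.weight + p2.d * p2.weight) / w;       // Eq. (10)
    m.weight = w;
    return m;
}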


Fig. 4. Noise suppression in local/global parameter spaces – a cut of a local Hough space is displayed on the left; a corresponding image in the global space is shown on the right.

3.7. Thinning of the Hough space

The algorithm is expected to run for a long time in robotics applications. Several problems can arise during the accumulation, including:

- artifact planes accumulating due to strong noise effects,
- biased solutions due to imprecise robot localisation (a slightly different camera-to-world transformation matrix),
- unbounded maxima of the accumulation process.

Moreover, the world is not static, so it is highly probable that several non-permanent detections (e.g., a moving car) will be recorded during the accumulation. This possibility must also be considered.

To overcome these problems, we designed a simple Hough space thinning algorithm which decrements each cell by a specified amount in each frame. Knowing that an accumulated plane has a value of 1 at its projected HT maximum, we propose a per-frame decrement of each cell by 0.01–0.1. We verified experimentally that this decrementation removes unwanted noise while being slow enough to preserve existing planes; moreover, it is also able to delete non-permanent planes. We also propose setting a maximum value for a Hough space cell: if a cell reaches the maximum, it becomes permanent and is no longer decremented. A sketch follows below.
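A minimal sketch of the thinning rule; the decrement follows the 0.01–0.1 range quoted above, while the ceiling value marking a cell as permanent is our placeholder.

#include <vector>

// Per-frame thinning: every non-permanent cell is decremented, and a cell
// that ever reaches 'permanentAt' is frozen and never decremented again.
void thinHoughSpace(std::vector<float>& cells,
                    float decrement = 0.05f,     // paper quotes 0.01-0.1
                    float permanentAt = 10.0f)   // ceiling: our assumption
{
    for (float& c : cells) {
        if (c >= permanentAt) continue;          // permanent cell
        c = (c > decrement) ? c - decrement : 0.0f;
    }
}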

4. Experimental results and discussion

Two sets of tests, evaluating the memory requirements and the precision of the proposed plane detector, were carried out. The first focused on the contribution of the hierarchical structure and its influence on the size of the parameter space. To guarantee an objective evaluation, real robot tests were supplemented by simulations with specifically tailored scenes revealing the strengths and weaknesses of the individual methods. A novel distance-dependent noise model was also implemented; it is introduced in the following section together with a discussion of its effect on the fidelity of the simulated sensor output.

4.1. Simulated RGB-D sensor data

The Gazebo simulation of the Turtlebot (TB) [21] was used in the tests. The simulated TB provides all the necessary inputs, especially the localization of the robot and simulated RGB-D data from the Kinect sensor. The robot was placed into an environment containing planes with known positions and was navigated around to collect test data. The testing environment was a 3D model of a closed environment (a room and a corridor). Noise levels, parameter space setups and trajectories varied across the 50 captured test runs. Each test run consisted of 100–200 frames.

4.2. Real RGB-D sensor data

For real environment testing, the Fraunhofer Care-O-bot recorded sequences [4] were used along with Kinect recorded sequences. All frame sequences were manually annotated and the real positions of the major detectable planes were measured; the final detections were compared against them. 36 test sequences were annotated, each consisting of around 200 frames.

4.3. Distance-dependent noise model

The original model of the Kinect RGB-D sensor in the Gazebo simulator does not take into account the noise characteristics of the depth measurement; the values are noiseless. It is only possible to add Gaussian noise independent of the distance to objects in the environment. Measurements with a real Kinect sensor revealed that the level of noise grows with increasing distance to the sensed obstacle. To improve the faithfulness of the simulation, and thus to get more realistic results from the tests, a novel distance-dependent noise model for the RGB-D camera was proposed. It also employs Gaussian noise, but its parameters are made dependent on the distance.

The average value and the standard deviation of pixel values in the depth image were measured for various distances between the sensor and an obstacle (a smooth white wall) during experiments with the real device. The measured average values were considered the precise distances. A second-order polynomial (11) was fitted using the least squares method to express the dependency between the precise distance and the level of noise (its standard deviation). The coefficient Cnoc controls the amount of generated noise:

σ = Q( 2.6610·10⁻⁶ · d₁² · C²noc − 1.0787·10⁻³ · d₁ · Cnoc + 2.7011 ).  (11)

The noise model uses the idealized data from the Gazebo simulator to produce realistic outputs. For each noiseless pixel of the depth map, a corresponding standard deviation is calculated using Eq. (11). The noiseless value is considered to be the mean of the normal distribution, and the Box-Muller transformation [22] is used to produce a value with the given level of noise. The real values computed with the polynomial are uniformly quantized (the Q function in Eq. (11)) to an 11-bit resolution, since the depth image from a real Kinect also consists of 11-bit integer values. Finally, a depth map or a point cloud is generated that can be used in the same way as the original data from the Gazebo tool or from a real RGB-D device. A sample of the simulated data with varying noise levels is shown in Fig. 5. The standard deviation of the generated depth noise varies from a few millimetres up to 6.3 cm at the maximum range of the Kinect (5 m). A more sophisticated model has been proposed by Khoshelham et al. [23]; its resulting noise levels are, however, quite similar.
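A sketch of the noise generator under stated assumptions: the depth d₁ is taken in millimetres (which reproduces the quoted ~6.3 cm deviation at 5 m), and the 11-bit quantization Q is left to the caller.

#include <cmath>
#include <random>

// One noisy depth sample per Eq. (11) and the Box-Muller transform.
double noisyDepth(double d1, double cnoc, std::mt19937& rng)
{
    const double sigma = 2.6610e-6 * d1 * d1 * cnoc * cnoc
                       - 1.0787e-3 * d1 * cnoc
                       + 2.7011;                       // Eq. (11), in mm

    // Box-Muller: two uniform samples -> one standard normal sample
    std::uniform_real_distribution<double> u(1e-12, 1.0);  // avoid log(0)
    const double z = std::sqrt(-2.0 * std::log(u(rng)))
                   * std::cos(2.0 * 3.14159265358979323846 * u(rng));

    return d1 + sigma * z;   // noiseless depth is the distribution mean
}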


Fig. 5. Example of the simulated data using the distance-dependent noise model. From the top left, the noise levels are: 0.0, 0.3, 1.0, and 4.0.

Fig. 6. Memory usage of the hierarchical representation of the parameter space; 16 detectable planes; size of the environment: 6 × 9 × 3 m; max. HS resolution: 1024³; min. angle step: 0.0061 rad; min. d step: 0.0136 m.

4.4. Error metrics

As a comparison metric for the plane detector, the angle and the distance between a detected plane and the corresponding manually identified plane are estimated.

The estimate compares the normal vectors and the coefficients d of both planes. The error metrics are then computed as:

E∠(n) = acos(n · n_real),  (12)

E_d(d) = |d − d_real|,  (13)

where n_real and d_real are the corresponding plane parameters in the manually annotated model. When a plane with an opposite normal is present, E∠(n) and E_d(d) are computed from the plane with the opposite normal; the error metrics are therefore independent of the normal vector orientation. To characterize how many planes were missed by a method, the following sections also report another metric:

Table 2
Tested setups of the HT-based plane detection algorithm.

Name          Resolution    Min. angular step (rad)    Min. d step (mm)
Hough 512     512³          0.0122                     19.5
Hough 1024    1024³         0.0061                     9.7
Hough 2048    2048³         0.0030                     4.8

E_miss(frame) = |P(frame) \ D(frame)| / |P(frame)|,  (14)

where D is the set of detected planes (the decision is made by thresholding the response in the Hough space) and P is the set of all planes manually identified in a frame. In other words, E_miss(frame) is the ratio of planes that were not detected, i.e., the miss rate.

4.5. Evaluation of the hierarchical representation

To assess the added value of the hierarchical representation of the parameter space and its practical usability, we measured the memory used for storing the structure under different noise levels. Fig. 6 shows the results of the test. It is clear that although rising noise in the input image increases the size of the Hough space, the growth is not critical. The parameter space size obviously depends on the number of detectable planes in the scene. Further tests with different scenes and models revealed that the size did not exceed 50 MB even for scenes of 10 × 10 × 3 m containing more than 100 planes. A standard array-based parameter space representation would need at least 4 GB in this case. The hierarchical HS saved from 98% to 99.7% of the memory in all the tests. This verifies the hypothesis stated in Section 3.3.1.

4.6. Accuracy of the plane detection

To test detection accuracy, the presented technique was compared against the standard RANSAC plane fitting method [17] implemented in the PCL library [6]. The PCL implementation of RANSAC is highly optimized and is further extended by a successive least-squares refinement applied after the random sample consensus fitting. The exact definition of the implemented RANSAC algorithm is described in [24]. The parameters of the RANSAC algorithm are highly specific to the amount of noise present in the testing data, but were always chosen to give the best results. The inlier distance threshold ranges between 0.01 m for noise-free images and 0.05 m for the real Kinect data.

Fig. 7. Detection accuracy of the HT-based plane detection technique against the RANSAC method.


Fig. 8. Stability of the plane detection (the worst case scenario – the wall plane equals y + 3 = 0).

Fig. 9. Quantized noise effect at larger view angles (possible false plane detections).


The maximum number of iterations for the detection of one plane was set to 10,000 on artificial data and to 500 on real data (a more complex environment) to keep the run time usable (using a non-decimated Kinect point cloud). The HT-based methods in the reported tests are denoted by numbers referring to their resolution; details of the setups are given in Table 2.

4.6.1. Different noise levels

The first experiment evaluates the detection accuracy under different amounts of noise in the input data. The results of 60 measurements with different noise levels and different parameter space resolutions are summarized in Fig. 7. The proposed method is comparable to the RANSAC approach in accuracy: the measured errors differ in the order of 0.01 rad for the normal vectors and 0.01 m for the distances. Note that the noise level of the Kinect RGB-D sensor corresponds to approximately Cnoc = 1.0.

4.6.2. Stability of the detection

The second testing scenario evaluates the detection stability by comparing distances from manually located planes. The datasets correspond to several sequences of the robot approaching a specified wall. Only one key plane is to be detected – the plane representing the obstacle facing the sensor. This presents a significant challenge for the detection, as the noise ratio is greatest for a plane orthogonal to the projection axis. All tests were run with Cnoc = 0.5 – a lower value than would characterize the current real sensor. This was chosen to reflect a potential future improvement of the Kinect device as well as to enable a finer granularity of changes. A trajectory of about 5 m was considered for the evaluation (measurements from a distance greater than 5 m are practically unusable with the Kinect sensor).

Fig. 8 proves that the HT approach produces steadier results than the RANSAC algorithm. This follows from the definition of the respective detectors. RANSAC takes into account the current frame only; although it incorporates a refining procedure, its results cannot be stable because of the stochastic computation. The implemented least-squares refinement works only with the detected inlier set; it cannot search for a better position in the whole scene.

Table 3
Comparison of measured errors for the HT approach and RANSAC (absolute values).

Metric                       Hough Transform        RANSAC
E∠(n)                        0.045 ± 0.047 rad      0.047 ± 0.053 rad
E_d                          0.047 ± 0.037 m        0.031 ± 0.030 m
Number of detected planes    53.2%                  33.19%

The difference is especially noticeable in the worst case of the perpendicular wall. Also, the noise significantly reduces the positioning accuracy of the RANSAC detector, while the accuracy of the HT-based approach remains almost unchanged.

It is also interesting to notice that, in the distance error, Hough 1024 shows more precise results than Hough 2048. This is due to the quantization effect of the 3D Hough space raster: Hough 1024 simply has a better cell alignment for this specific case. The difference is about 0.002 rad, which is beyond the resolution capabilities of both methods.

We can conclude that both algorithms are robust with respect to the structural noise of the depth sensor. In addition, the proposed method provides stable results in the time domain.

4.6.3. Real data testing

For all real-data measurements, the Hough 2048 algorithm was used and compared with the RANSAC approach. Real data acquired by the Kinect sensor showed significant quantization noise. This effect increases with the angle between the plane normal and the camera view vector. As measured and described by Khoshelham et al. [23], the quantized depth error is around 8 cm at a distance of 5 m. The quantized behaviour can then lead to multiple false detections; the effect is visible in Fig. 9.

The demonstrated error causes problems especially for the RANSAC algorithm, where the artifact planes lead to a large number of false detections. Conversely, with a bigger inlier threshold, precision is lost. To obtain a relevant comparison, we truncated the incoming depth images to distances below 3 m, where the error does not distort the point cloud significantly. We measured both the angular and the distance error of the planes. Additionally, we measured the ratio of true detections to the number of manually annotated planes in the scene.

Fig. 10. Comparison of measured errors for HT approach and RANSAC.


Fig. 10 and Table 3 present the measured results. They make it evident that the accuracy of both methods is similar. The higher E_d error of the HT approach most likely originates in true detections of planes with higher noise (a large view-plane angle), which are skipped by the RANSAC approach. Even an increased inlier threshold could not raise the number of RANSAC's true detections, especially with the specific noise in Kinect images. We can conclude that we were able to detect 60% more planes than the RANSAC approach while keeping a comparable error rate.

In Fig. 11, we compare both algorithms visually. Both methods were set to almost the same execution speed and their best results. No post-processing (i.e., merging of planes) was applied, except for size filtration. The RANSAC approach detects planes precisely into their corners, but suffers from a large number of false detections (even when normal filtration is applied), mostly due to the specific Kinect noise. The HT, on the contrary, shows strong noise-filtering capabilities, though at the cost of losing detail.

The RANSAC parameters were set to a maximum point-to-plane distance of 0.05 m and a maximum of 500 iterations per plane. A larger number of iterations would increase the computational time several-fold. We chose these parameters to obtain the best results while keeping the computational time comparable to our solution.

4.6.4. Computational time

We measured the execution time of the detection step when running on an Intel Quad-Core i7-2620M CPU at 2.70 GHz. The average execution time over all real-data examples was 531.68 ± 250.46 ms for the RANSAC technique and 453.57 ± 236.46 ms for the Hough Transform. Both implementations were parallelized; it is, however, necessary to mention that the HT algorithm was not specially optimised for speed.

In the case of the Hough 2048 algorithm, almost 50% of the time is spent on preprocessing the point cloud. This effect is more pronounced at lower Hough space resolutions: for Hough 256, the average elapsed time is 312.86 ± 264.21 ms, of which the same point cloud preprocessing takes more than 70% of the time needed for the whole computation.

Fig. 11. Example of real data testing. From the left: RGB image; original depth frame; input point cloud; planes detected by the Hough Transform; planes detected by RANSAC.


The results also show that the run time of the HT algorithm depends strongly on the currently processed scene and on the diversity of its depth segments. The main speed optimisation is the skipping of empty chunks in the Hough space, which means that the worst-case scenario is a scene with a large amount of non-planar data and noise. The same principle, however, also applies to the RANSAC approach.

5. Conclusions

This paper introduced an optimization of the standard Hough Transform for continuous plane detection in point-cloud data. In the experiments, the implemented method performed as accurately as the RANSAC approach while producing more stable output at a similar speed. The performance of the algorithm generally depends on the amount of noise present in the input data. This can be addressed by adjusting the resolution of the parameter space; the designed hierarchical structure guarantees that even a high resolution does not lead to extreme memory consumption. The other proposed modifications also help to make the Hough Transform a practically usable detection tool.

Our future research will focus on code parallelization to speed up the HT computation on multi-core CPUs. We will explore GPU-based acceleration reflecting the limited capabilities of current embedded robotic platforms. Dynamic resizing of the parameter space at run time will also be addressed in our future work.

Acknowledgments

This work was supported by the European Community's 7th Framework Artemis JU grant agreement no. 100233 (R3-COP), grant no. 247772 (SRS) and the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070).

References

[1] P. Hough, Method and means for recognizing complex patterns, US Patent, 1962.
[2] R. Duda, P. Hart, Use of the Hough transform to detect lines and curves in pictures, Commun. ACM 15 (1) (1972) 11–15.
[3] D.H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, in: Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1987, pp. 714–725.
[4] Care-O-bot, Fraunhofer IPA, accessed on: 14/07/2012.
[5] Gazebo, accessed on: 30/07/2012.


[6] Point Cloud Library (PCL), accessed on: 14/07/2012.
[7] R. Hulik, V. Beran, M. Spanel, P. Krsek, P. Smrz, Fast and accurate plane segmentation in depth maps for indoor scenes, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, Algarve, Portugal, 2012.
[8] J. Poppinga, N. Vaskevicius, A. Birk, K. Pathak, Fast plane detection and polygonalization in noisy 3D range images, in: IROS'08, 2008, pp. 3378–3383.
[9] J.-E. Deschaud, F. Goulette, A fast and accurate plane detection algorithm for large noisy point clouds using filtered normals and voxel growing, in: 3DPVT'10, 2010.
[10] R. Schnabel, R. Wahl, R. Klein, Efficient RANSAC for point-cloud shape detection, Comput. Graph. Forum (2007) 214–226.
[11] F. Tarsha-Kurdi, T. Landes, P. Grussenmeyer, Hough-transform and extended RANSAC algorithms for automatic detection of 3D building roof planes from lidar scanner data, in: ISPRS Workshop on Laser Scanning 2007 and SilviLaser 2007, vol. XXXVI, Espoo, Finland, 2007, pp. 407–412.
[12] M. Heracles, B. Bolder, C. Goerick, Fast detection of arbitrary planar surfaces from unreliable 3D data, in: Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS'09, IEEE Press, Piscataway, NJ, USA, 2009, pp. 5717–5724.
[13] D. Borrmann, J. Elseberg, K. Lingemann, A. Nüchter, The 3D Hough transform for plane detection in point clouds: a review and a new accumulator design, 3D Res. 2 (2) (2011) 32:1–32:13. http://dx.doi.org/10.1007/3DRes.02(2011)3.
[14] B. Han, C. Paulson, D. Wu, 3D dense reconstruction from 2D video sequence via 3D geometric segmentation, J. Visual Commun. Image Represent. 22 (5) (2011) 421–431. http://dx.doi.org/10.1016/j.jvcir.2011.03.006.
[15] Y. Mochizuki, A. Torii, A. Imiya, N-point Hough transform for line detection, J. Visual Commun. Image Represent. 20 (4) (2009) 242–253. http://dx.doi.org/10.1016/j.jvcir.2009.01.004.
[16] L. Xu, E. Oja, Randomized Hough transform (RHT): basic mechanisms, algorithms, and computational complexities, CVGIP: Image Understanding 57 (2) (1993) 131–154. http://dx.doi.org/10.1006/ciun.1993.1009.
[17] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, in: Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1987, pp. 726–740.
[18] H. Kälviäinen, P. Hirvonen, L. Xu, E. Oja, Comparisons of probabilistic and non-probabilistic Hough transforms, in: Proceedings of the Third European Conference on Computer Vision, ECCV'94, 1994, pp. 351–360.
[19] T.-C. Chen, K.-L. Chung, A new randomized algorithm for detecting lines, Real-Time Imaging 7 (6) (2001) 473–481. http://dx.doi.org/10.1006/rtim.2001.0233.
[20] D. Holz, S. Holzer, R.B. Rusu, S. Behnke, Real-time plane segmentation using RGB-D cameras, in: Proceedings of the 15th RoboCup International Symposium, Istanbul, Turkey, 2011.
[21] Willow Garage – TurtleBot, accessed on: 30/07/2012.
[22] G.E.P. Box, M.E. Muller, A note on the generation of random normal deviates, Ann. Math. Stat. 29 (2) (1958) 610–611.
[23] K. Khoshelham, S.O. Elberink, Accuracy and resolution of Kinect depth data for indoor mapping applications, Sensors 12 (2) (2012) 1437–1454. http://dx.doi.org/10.3390/s120201437.
[24] O. Chum, J. Matas, Matching with PROSAC – progressive sample consensus, in: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, IEEE Computer Society, Washington, DC, USA, 2005, pp. 220–226. http://dx.doi.org/10.1109/CVPR.2005.221.
Elberink, Accuracy and resolution of kinect depth data for indoor mapping applications, Sensors 12 (2) (2012) 1437–1454, http://dx.doi.org/10.3390/s120201437. http://www.mdpi.com/1424-8220/ 12/2/1437. [24] O. Chum, J. Matas, Matching with prosac progressive sample consensus, in: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1 – vol. 01, CVPR ’05, IEEE Computer Society, Washington, DC, USA, 2005, pp. 220–226, doi:http:// dx.doi.org/10.1109/CVPR.2005.221.