Structure-based object detection from scene point clouds


Wen Hao*, Yinghui Wang

Institute of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China

* Corresponding author. E-mail address: [email protected] (W. Hao).

Neurocomputing (2016), doi:10.1016/j.neucom.2015.12.101

Article history: Received 10 July 2015; Received in revised form 16 December 2015; Accepted 29 December 2015

Keywords: Scene point clouds; Segmentation; Structure analysis; Object detection; Structure encoding

Abstract

We present a novel algorithm for object detection from scene point clouds acquired in complex environments, based on the structure analysis of objects. First, the scene point clouds are partitioned using the Gaussian map, and primitive shapes are extracted from the segments. Second, each primitive shape is represented by a node, and the connection between two shapes is represented by an edge. The topological graph of the scene is reconstructed by defining the node properties, edge properties and connection types, and the structure of the target objects is also analyzed. Then, an "assembly-matching" strategy is proposed to recognize the target objects in the scene: qualified primitive shapes are assembled iteratively until no more suitable shapes are found, and the connection string of the assembled shapes is recorded as a sequence of numbers at the same time. Finally, the target object is detected by comparing connection strings; an object is detected successfully if the structure coding of the assembled shapes is in line with that of the target object. The experimental results show that the proposed method can quickly detect common objects from massive point clouds.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Point clouds are sets of data points that are usually defined by X, Y and Z coordinates and are often intended to represent the external surface of an object. Point clouds are widely used in many fields, such as virtual tourism, industrial manufacturing and the entertainment industry. Recent advances in terrestrial laser scanning (TLS) provide a convenient way to quickly collect 3D point clouds of indoor or outdoor scenes, and point clouds with high density and high accuracy can express the geometric details of objects. Humans have the ability to quickly identify an object in a complex scene. However, it is difficult to achieve the same facility with a computer because of the large variability of real-world objects and the absence of topological relations between points. To address the difficulty of object detection from raw scene point clouds, we propose a novel approach that detects objects composed of assembled shapes based on the structure analysis of an object. Structure-based techniques aim to describe objects with a set of geometrical primitives [1]. The approach is motivated by the consideration that many man-made objects in indoor or outdoor scenes can be represented using a set of primitive shapes, and that these geometrical relationships appear to be sufficient for humans to recognize the objects.


In real data scanning, it is difficult to collect the complete point clouds of a single object due to occlusion, noise and single-view scanning. However, the main structure of a man-made object in the scene (the primitive shapes and the topological relationships between them) tends to remain relatively stable. Recently, several view-based methods for 3D shape analysis [2–4] were proposed. In these algorithms, the 3D object is projected from different viewpoints and represented by a collection of 2D views in the form of depth buffers or binary masks. Instead of representing a 3D object with a set of 2D projections, in this paper the structure of the 3D object is analyzed directly in three-dimensional space. The 3D objects in the scene are first partitioned into primitive shapes, and the geometric relations between the shapes are analyzed. Then, encoding trees are built to represent the structure of the objects based on their geometric relations. Finally, the structure of each object is represented by a connection string. In this paper, we concentrate on detecting common objects composed of primitive shapes, such as cars, stairs, boxes, and cupboards. The main contributions of the paper are summarized as follows: (1) Because the majority of man-made objects are composed of primitive shapes and their structures tend to remain relatively stable, we present a structure-based approach for object representation and detection. (2) In our algorithm, the planes are first divided into sloping, horizontal and vertical planes according to their normal directions. Then, different connection types are defined based on the plane types.


An encoding algorithm is proposed to record the structure of the target object. (3) Instead of matching the graph itself, we propose an "assembly-matching" strategy to recognize the object. A method similar to region growing is used to assemble the shapes: qualified shapes are assembled iteratively into an entire object until no more suitable shapes are found, and the target object is then detected by comparing connection strings. The remainder of the paper is organized as follows. Section 2 presents a brief review of object detection. Section 3 gives an overview of the proposed object detection procedure, and the detailed steps are described in Section 4. Experimental results are presented in Section 5. The limitations of our method and proposals for future research are discussed in the last section.

2. Related work

Object detection from scene point clouds has been one of the most challenging problems in computer vision. Previous works mostly focus on detecting a single type of object, such as a building [5], pole-like object [6], vehicle [7] or tree [8], from cluttered 3D urban point clouds. However, most of the existing work is based on prior knowledge or invariant feature descriptors of the specified object categories, and it is difficult to extend these methods to the detection of multiclass objects. Many methods for multiclass object detection have been introduced in recent years; they fall roughly into the following categories.

2.1. Local descriptors-based method

Many methods have been proposed to detect and recognize free-form objects in 3D point clouds. The most popular ones are spin images [9,10], followed by 3D shape context [11], Intrinsic Shape Signatures [12] and Signature of Histograms of OrienTations [13,14]. Recently, Taati [15] proposed a variable-dimensional local shape descriptor (VD-LSD) for 3D object recognition. First, a set of invariant properties, such as position, direction and dispersion properties, is extracted for each point. Then, histogramming schemes (scalar quantization and vector quantization) are used to obtain the VD-LSD (SQ) and VD-LSD (VQ) features, respectively.

Guo [16] proposed the RoPS feature, obtained by rotationally projecting the neighboring points of a feature point onto three 2D planes and calculating a set of statistics for the distributions of the projected points. More recently, Guo [17] proposed a novel Tri-Spin-Image feature for local surface description: a local reference frame (LRF) is first constructed using a technique similar to that in [16], and spin images are then generated using the x, y and z axes as their local reference axes (LRA), following the spin image (SI) descriptor procedure. Most of these methods have been tested on small datasets that contain few models, and they are usually designed for 3D mesh models; they are hard to adapt to raw scene point clouds, which are large and cluttered.

2.2. Machine learning-based method

Markov Random Fields [18], Conditional Random Fields [19] and Support Vector Machines [20] are commonly used methods for object recognition or classification. Golovinskiy [21] proposed a framework for multiclass object detection: the point clouds are first segmented using a graph-cut algorithm, and the feature vectors are then labeled using an SVM trained on a set of manually labeled objects. Objects such as cars, light standards and traffic lights are recognized and classified; however, this method requires manual localization. Velizhev [22] made use of a part-based model for recognizing 3D objects in LiDAR point clouds obtained in urban environments. This method replaces supervised classification with a 3D implicit shape model that recognizes objects by voting for their center locations. Wang [23] presented a novel rotation-invariant method for object detection from terrestrial 3D laser scanning point clouds: an Implicit Shape Model is used to describe object categories, and the Hough Forest framework is extended for object detection in 3D point clouds. Mattausch [24] proposed an unsupervised algorithm that identifies and consolidates repeated objects across large indoor scans. Kim [25] proposed a method that relies on the presence of repetitions and symmetry to automatically recognize the objects in an indoor scene; it acquires 3D models of frequently occurring objects and captures their variability modes from only a few scans, and the objects are then recognized via a novel Markov Random Field (MRF) formulation. Machine learning-based methods, however, require an elaborate acquisition and analysis process for the training set.

Fig. 1. Overview of the proposed method.


Fig. 2. Extracting planes from scene point clouds 1. (a) Scene point clouds 1 and (b) the planes in scene point clouds 1.

2.3. Graph matching-based method

Schnabel [26] proposed a semantic system for 3D object detection that cannot be directly used in large, cluttered environments. Semantic entities are represented as constrained graphs that describe the configurations of basic shapes, and graph matching is performed to recognize the objects. Somani [27] proposed a shape-based object recognition method in which the model and scene point clouds are decomposed into primitive shapes and represented as primitive shape graphs. A Minimum Volume Bounding Box (MVBB) is computed for each primitive shape, and matching the primitive shapes can be approximated by finding the intersections of their MVBBs. Nieuwenhuisen [28] generated shape compositions from CAD models and performed sub-graph matching with the primitives in the scene to detect and localize objects of interest. Berner [29] extended the previous method [28] by combining 2D contour primitives and 3D shape primitives for detecting objects in point clouds through shape-graph matching. The main drawback of graph matching methods is that the graph complexity becomes high when dealing with non-trivial objects.

It should be noted that most of the methods mentioned above are designed for a confined scenario, such as one composed of several CAD models. In this paper, we evaluate our method using various real-world scans. In addition, compared to graph matching algorithms, our proposed algorithm does not require a manually predefined query graph of the target object, and a variety of objects can be detected even though the objects differ in size.

3. Algorithm overview

The procedures of the object detection approach are described using the flowchart in Fig. 1.

(1) Primitive shape extraction. The scene point clouds (Fig. 1(a)) are partitioned using the rough-detail segmentation algorithm, and the primitive shapes are extracted from the segments (Fig. 1(b)).
(2) Graph construction. Each primitive shape is represented by a node, and the geometrical relationship between two shapes is represented by an edge. The connection types between different surfaces are defined (Fig. 1(c)). The topological graph of the scene is reconstructed by defining the node properties, edge properties and connection types (Fig. 1(d)).
(3) Structure encoding of the target object. For the common objects in a scene, such as the staircase, car, box and cupboard, we analyze the structure of each object. Encoding trees are built to represent the structure of the objects based on their geometric relations (Fig. 1(e)), and the structure coding of each object is recorded.
(4) Shape assembling and object detection. First, we select a qualified primitive shape in the scene as the seed point and connect another qualified shape according to the connection. The newly added qualified shape is defined as the seed point in turn, and the process finds other qualified shapes iteratively until it is unable to find a suitable shape (Fig. 1(f)). The connection string is recorded while assembling the shapes (Fig. 1(g)). The target object is detected by comparing the connection string.

4. Structure-based object detection

4.1. Planar surfaces extraction

The scene point clouds are partitioned using the rough-detail segmentation algorithm [30], and the planar surfaces are extracted based on the properties of the Gaussian map.

(1) Rough segmentation. The normals of the point clouds are first mapped to the Gaussian sphere and then partitioned into groups using a mean shift clustering algorithm.


(2) Detail segmentation. A distance-based clustering method is used to tackle overlapping surfaces; parallel planes are extracted separately.
(3) Primitive shape extraction. The planar surfaces are extracted based on the properties of the Gaussian map.
(4) Refinement. Adjacent shapes are merged based on similarity to guarantee the accuracy of the shape extraction.

Fig. 2(a) shows scene point clouds 1, and Fig. 2(b) shows the planes extracted from scene point clouds 1; different planes are marked with different colors.
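The plane types used for the graph nodes in Section 4.2 (horizontal, vertical and sloping) follow directly from the segment normals obtained here. A minimal C++ sketch of such a classification is given below; the 15-degree tolerance and the function shape are assumptions of this sketch, not values taken from the paper.

// A minimal sketch (not the authors' code): label a planar segment as
// horizontal, vertical or sloping from the direction of its average normal.
#include <cmath>

enum class PlaneType { Horizontal, Vertical, Sloping };

PlaneType classifyPlane(double nx, double ny, double nz)
{
    const double kPi = 3.14159265358979323846;
    const double len = std::sqrt(nx * nx + ny * ny + nz * nz);
    const double cosToZ = std::fabs(nz) / len;             // |cos| of the angle to the z-axis
    const double cosHorz = std::cos(15.0 * kPi / 180.0);   // normal nearly vertical (assumed tolerance)
    const double cosVert = std::cos(75.0 * kPi / 180.0);   // normal nearly horizontal

    if (cosToZ >= cosHorz) return PlaneType::Horizontal;   // e.g. ground, box top
    if (cosToZ <= cosVert) return PlaneType::Vertical;     // e.g. walls, box sides
    return PlaneType::Sloping;                              // e.g. the car roof of Fig. 9
}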

4.2. Graph construction

After segmentation, the scene point clouds are partitioned into several primitive shapes S = {S_1, S_2, …, S_n}. A graph G(S, E, A_S, A_E) is built to represent the target objects, where each shape S_i ∈ S corresponds to a node and each edge e_ij ∈ E represents a geometrical relation between two shapes. A_S denotes the properties of the nodes, and A_E denotes the properties of the edges.

(1) The properties of a node. We compute the centroid of each primitive shape. Nodes with different colors are used to represent the different plane types: a horizontal plane is represented by an orange node, a vertical plane by a blue node, and a slanting plane by a black node. The properties of the primitive shape represented by a node include the following:

① Area. Area_i represents the area of the planar surface:

Area_i = length_i \times width_i

② Average normal. We use Eq. (1) to compute the average normal AvgNorm_i = (\bar{n}_x, \bar{n}_y, \bar{n}_z) of each primitive shape, where (n_{x_k}, n_{y_k}, n_{z_k}) is the normal of the k-th point p_k in shape S_i and N is the total number of points in S_i:

\bar{n}_x = \frac{1}{N}\sum_{k=1}^{N} n_{x_k}, \quad \bar{n}_y = \frac{1}{N}\sum_{k=1}^{N} n_{y_k}, \quad \bar{n}_z = \frac{1}{N}\sum_{k=1}^{N} n_{z_k}    (1)

③ Height difference. DeltZ_i = Z_i^{max} - Z_i^{min} is the difference between the maximum and minimum z values of the shape.

(2) The properties of an edge. Pairwise relationships between each pair of shapes are described by an edge e_ij. The edge properties include the following:

① Connection type. As shown in Fig. 3, the connection types are divided into seven categories according to the surface types:

a. As shown in Fig. 3(a), a horizontal plane is connected to a vertical plane with a green line; the two planes are perpendicular to each other. The connection type is defined as '1'.
b. As shown in Fig. 3(b), a horizontal plane is connected to a sloping plane, and they are not perpendicular to each other. The connection type is defined as '2'.
c. As shown in Fig. 3(c), two vertical planes are connected with a red line, and they are perpendicular to each other. The connection type is defined as '3'.
d. As shown in Fig. 3(d), two sloping planes are connected with a yellow line, and they are not perpendicular to each other. The connection type is defined as '4'.
e. As shown in Fig. 3(e), two vertical planes are connected with a cyan line, and they are not perpendicular to each other. The connection type is defined as '5'.
f. As shown in Fig. 3(f), a sloping plane is connected to a vertical plane with a blue line, and they are not perpendicular to each other. The connection type is defined as '6'.
g. As shown in Fig. 3(g), two sloping planes are connected with a brown line, and they are perpendicular to each other. The connection type is defined as '7'.

Fig. 3. The different types of connection between the nodes. (a) Connection type 1, (b) connection type 2, (c) connection type 3, (d) connection type 4, (e) connection type 5, (f) connection type 6 and (g) connection type 7. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
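As an illustration of the node properties ①–③ above, the following C++ sketch accumulates the centroid, the average normal of Eq. (1) and the height difference of a segment. The bounding-box approximation of Area_i = length_i × width_i is an assumption of this sketch, since the paper does not state how length and width are measured.

// A minimal sketch (assumptions, not the authors' implementation) of the
// per-node properties of Section 4.2 for a non-empty segment.
#include <algorithm>
#include <limits>
#include <vector>

struct PointN { double x, y, z;       // coordinates
                double nx, ny, nz; }; // per-point normal

struct NodeProps {
    double cx, cy, cz;      // centroid
    double anx, any, anz;   // average normal, Eq. (1)
    double deltaZ;          // DeltZ_i = Z_max - Z_min
    double area;            // length * width (bounding-box approximation)
};

NodeProps computeNodeProps(const std::vector<PointN>& shape)
{
    NodeProps p{};                                          // zero-initialised accumulators
    double xmin = std::numeric_limits<double>::max(), xmax = -xmin;
    double ymin = xmin, ymax = -xmin, zmin = xmin, zmax = -xmin;
    for (const PointN& q : shape) {
        p.cx += q.x;  p.cy += q.y;  p.cz += q.z;
        p.anx += q.nx; p.any += q.ny; p.anz += q.nz;
        xmin = std::min(xmin, q.x); xmax = std::max(xmax, q.x);
        ymin = std::min(ymin, q.y); ymax = std::max(ymax, q.y);
        zmin = std::min(zmin, q.z); zmax = std::max(zmax, q.z);
    }
    const double N = static_cast<double>(shape.size());
    p.cx /= N;  p.cy /= N;  p.cz /= N;                      // centroid of the shape
    p.anx /= N; p.any /= N; p.anz /= N;                     // average normal (Eq. (1))
    p.deltaZ = zmax - zmin;                                 // height difference DeltZ_i
    // Crude Area_i = length_i x width_i from the x/y extents; for vertical or
    // sloping planes the extents would have to be measured in the plane itself.
    p.area = (xmax - xmin) * (ymax - ymin);
    return p;
}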


Fig. 4. The diagram of connectivity among primitive shapes. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)


Fig. 5. The definition of the boundary points. (a) The diagram of boundary points and (b) the diagram of non-boundary points. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

② Distance between two nodes. C_i = (x_i, y_i, z_i) and C_j = (x_j, y_j, z_j) are the centroids of the shapes S_i and S_j, respectively. Dist_{ij} measures the distance between the two plane centroids:

Dist_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}    (2)

③ Angle between two planes. AvgNorm_i and AvgNorm_j denote the average normals of the two planes. The angle between the two planes is computed using Eq. (3):

\theta_{ij} = \arccos\!\left(\frac{AvgNorm_i \cdot AvgNorm_j}{\lVert AvgNorm_i \rVert \, \lVert AvgNorm_j \rVert}\right)    (3)

To judge whether two shapes are connected, the spatial proximity between them is computed. An n × n matrix W records the connection relationship between the shapes and is defined as

W(i, j) = \begin{cases} 1 & \text{if } S_i \text{ and } S_j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}    (4)

If S_i and S_j are connected, W(i, j) = 1 and W(j, i) = 1. Suppose S_1, S_2 and S_3 are the primitive shapes extracted from the scene. As shown in Fig. 4, the points in S_1 are colored black (p_i ∈ S_1), the points in S_2 are colored red, and the points in S_3 are colored green. The k-nearest neighbors of p_i are collected using a k-d tree, and the point set P_k within a predefined distance τ is found. If P_k contains any point p_j that belongs to another shape, such as p_j ∈ S_2, then S_1 and S_2 are connected. In Fig. 4, W(1, 2) = 1, W(1, 3) = 1 and W(2, 3) = 1.

To improve the efficiency of the algorithm, we do not traverse every point of a plane to find the connecting planes; we only traverse the boundary points of each plane, find their k-nearest points, and determine whether those points belong to other planes. As shown in Fig. 5, suppose p_i (red point) is a point in shape S_1. We find the k-nearest neighbors of p_i using a k-d tree and collect the point set P_{r-distance} within a predefined radius r, i.e., P_{r-distance} = {{p_i, p_j, d_{ij}} | d_{ij} ≤ r}, i ≠ j. Point c (green point) is the centroid of P_{r-distance}, and m is the point of P_{r-distance} that is farthest from p_i. |p_i c| is the distance between p_i and c, and |p_i m| is the distance between p_i and m. If p_i is a boundary point, the ratio |p_i c| / |p_i m| is large; otherwise it is small. Thus, we determine whether p_i is a boundary point based on the value of |p_i c| / |p_i m|.

After segmentation, the boundary points of each segment are extracted using [31]. Fig. 6 shows the boundary points of the planes in scene point clouds 1. After extracting the boundary points, the k-nearest point set P_n of each boundary point is found using a k-d tree, and the connection type between nodes is determined based on the plane types contained in P_n (Fig. 3). Algorithm 1 shows the steps of judging the connection type between nodes.

Fig. 6. The boundary points of planes in scene point clouds 1.
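A minimal C++ sketch of the boundary-point test just described is shown below. The paper gathers neighbours with a k-d tree; a brute-force radius search is used here only to keep the sketch self-contained, and the helper names are hypothetical.

// A minimal sketch of the boundary-point test (Algorithm 1, lines 3-9).
#include <algorithm>
#include <cmath>
#include <vector>

struct P3 { double x, y, z; };

static double dist(const P3& a, const P3& b)
{
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// Returns true if cloud[i] behaves like a boundary point of its segment.
// r is the neighbourhood radius, delta the ratio threshold (0.3 in Section 5).
bool isBoundaryPoint(const std::vector<P3>& cloud, std::size_t i,
                     double r, double delta)
{
    std::vector<P3> nbrs;                      // brute-force radius search
    for (const P3& q : cloud)
        if (dist(cloud[i], q) <= r) nbrs.push_back(q);
    if (nbrs.size() < 3) return true;          // too few neighbours: treat as boundary

    P3 c{0.0, 0.0, 0.0};                       // centroid of the neighbourhood
    double dMax = 0.0;                         // distance to the farthest neighbour m
    for (const P3& q : nbrs) {
        c.x += q.x; c.y += q.y; c.z += q.z;
        dMax = std::max(dMax, dist(cloud[i], q));
    }
    c.x /= nbrs.size(); c.y /= nbrs.size(); c.z /= nbrs.size();
    if (dMax <= 0.0) return false;             // degenerate neighbourhood

    // Interior points sit near the centroid of their neighbourhood, so the
    // ratio |p_i c| / |p_i m| stays small; boundary points push it up.
    return dist(cloud[i], c) / dMax > delta;
}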


Fig. 7. The topological graph of scene point clouds 1. (a) Scene point clouds 1 and (b) the topological graph of scene point clouds 1. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)


Fig. 8. The topological graph of the staircase. (a) Staircase and (b) the topological graph of the staircase.

Algorithm 1. The detailed steps of judging the connection type between nodes.

Input: The primitive shapes S_1, S_2, …, S_n
Output: The connection types between pairs of shapes
1.  for (int i = 0; i < n; i++)
2.    for (int j = 0; j < S_i.size(); j++)
3.      P_{r-distance} ← findNeighborsInRadius(p_j, r)
4.      Compute the centroid c of the point set P_{r-distance} and find the point m that has the maximum distance from p_j in P_{r-distance}
5.      if (|p_j c| / |p_j m| > δ)
6.        p_j is a boundary point
7.      else
8.        p_j is not a boundary point
9.      end if
10.   end for
11.   for each boundary point p_m
12.     find the k-nearest point set P_n of p_m using a k-d tree
13.     if (there exists a point p_k ∈ S_j in P_n && the distance |p_m p_k| < τ && p_m ∈ S_i)
14.       S_i and S_j are connected, i.e., W(i, j) = 1
15.       The connection type is judged based on the definitions in Fig. 3
16.     end if
17.   end for
18. end for
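The following C++ sketch illustrates how the connection-type codes of Fig. 3 can be assigned once the two plane types and the angle between their average normals are known. Reading types 3 and 5 as the perpendicular and non-perpendicular vertical-vertical cases is consistent with the box and cupboard codings of Section 4.3, and the perpendicularity test |cos θ_ij| < α is one plausible interpretation of the α parameter (0.15 in Section 5); both are assumptions of this sketch.

// A minimal sketch of mapping a pair of connected plane types to the
// connection-type codes 1..7 of Fig. 3 (0 = combination not covered).
#include <cmath>

enum class PlaneType { Horizontal, Vertical, Sloping };

// dot is the normalised dot product of the two average normals (= cos theta_ij).
int connectionType(PlaneType a, PlaneType b, double dot, double alpha = 0.15)
{
    const bool perp = std::fabs(dot) < alpha;   // assumed perpendicularity test
    auto isPair = [&](PlaneType s, PlaneType t)
        { return (a == s && b == t) || (a == t && b == s); };

    if (isPair(PlaneType::Horizontal, PlaneType::Vertical)) return 1;
    if (isPair(PlaneType::Horizontal, PlaneType::Sloping))  return 2;
    if (isPair(PlaneType::Vertical,   PlaneType::Vertical)) return perp ? 3 : 5;
    if (isPair(PlaneType::Sloping,    PlaneType::Sloping))  return perp ? 7 : 4;
    if (isPair(PlaneType::Sloping,    PlaneType::Vertical)) return 6;
    return 0;   // e.g. two horizontal planes: not defined in Fig. 3
}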

Fig. 7 shows the topological graph of scene point clouds 1. The blue nodes denote the vertical planes, the orange nodes denote the horizontal planes, and the black nodes denote the slanting planes. The different connection types between planes are marked in different colors.

4.3. Structure encoding of the target objects

As shown in Fig. 8(a), the staircase is an important part of a building and is composed of several planes. The main characteristics of the stairs are the following: (1) the adjacent planes are perpendicular to each other; (2) the area and the height of each plane are small; and (3) the distance between the centroids of neighboring planes is small, and the centroids of the planes lie approximately on a straight line. The main structure of the stairs is represented in Fig. 8(b), and the connection relationships can be encoded as '11…11'.

The car is an important object in the urban scene that can be regarded approximately as a combination of multiple planes, as shown in Fig. 9(a). Because the point clouds are scanned from a side view, the top, front and side planes of the car are obtained, so the single-view point clouds of the car may be represented as three mutually connected planes, as shown in Fig. 9(b). The main characteristics of the car are as follows:


Fig. 9. The topological graph of the car. (a) The point clouds of the car from a single view and (b) the topological graph of the car.

Fig. 10. The topological graph of the complete box. (a) The diagram of a complete box and (b) the topological graph of the box.

Fig. 11. The topological graph of the cupboard. (a) The point clouds of a cupboard and (b) the topological graph of the cupboard.

(1) the top plane is a sloping surface, and the front and side surfaces are vertical; (2) the z-value of the top plane is higher than that of the front and side surfaces; and (3) the area of the planes is less than a specified threshold. The main structure of the car is represented in Fig. 9(b), and the connection relationships can be encoded as '665'.

A box consisting of six planes is a common object in indoor scenes. Because the lowest plane standing on the floor cannot be scanned, the complete point clouds of a box are composed of five planes (Fig. 10(a)). The main characteristics of the box are as follows: (1) the highest surface is a horizontal plane, and the other surfaces are vertical; (2) the z-value of the top plane is higher than that of the vertical planes; (3) the connecting planes are perpendicular to each other; and (4) the area of the planes is less than a given threshold. The main structure of the box is represented in Fig. 10(b), and the connection relationships can be encoded as '11113333'.

A cupboard is another common object in indoor scenes that consists of many planar surfaces.

As shown in Fig. 11(a), the back surface of the cupboard is incomplete due to occlusions in the original scans; however, the incompleteness does not affect the structure analysis. The main characteristics of the cupboard are as follows: (1) the top plane is a horizontal surface, and the connecting planes are vertical; (2) the z-value of the top plane is higher than that of the other planes; and (3) the cupboard usually contains several shelves. The main structure of the cupboard is represented in Fig. 11(b), and the connection relationships can be encoded as '1…1331…1'.

The encoding tree is built to represent the structure of an object based on its geometric relations. For a target object composed of nodes S_1, S_2, …, S_n, the nodes are first sorted by average z-value in descending order. The node with the highest elevation S_i is selected as the root node, and the nodes S_1, …, S_{i-1}, S_{i+1}, …, S_n connecting to S_i are defined as its child nodes. To ensure the uniqueness of the object coding, the child nodes are also recorded in descending order. Next, the nodes connecting to S_1, …, S_{i-1}, S_{i+1}, …, S_n are recorded recursively until all of the nodes and edges are recorded. Finally, the connection types are recorded using breadth-first traversal; edges that have already been recorded are not recorded again.

Fig. 12 shows the encoding trees of the staircase, box, car and cupboard, respectively. The edges between two nodes denote that they are connected, and the dotted lines indicate edges that have already been recorded. Take the encoding tree of the box as an example. Node S_1 is the highest node among S_1, …, S_5. Assuming that nodes S_2, …, S_5 are already in descending order, node S_1 connects to nodes S_2, …, S_5 and the connection relationship between them is '1111'. For node S_2, the edge between S_1 and S_2 has already been recorded, so we only record the edges e_23 and e_25; the connection string between S_2 and S_3, S_5 is '33'. Then, we record the untraversed edges of nodes S_3, …, S_5 in sequence. As a result, the structure coding of the box is '11113333'. Similar encodings are obtained for the stairs, car and cupboard.
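A compact C++ sketch of this encoding procedure is given below: the root is the node with the highest average elevation, neighbours are visited in descending order of average z, and every edge contributes its connection-type digit the first time it is met. The dense connection-type matrix is an assumption of this sketch; for the box of Fig. 10 it reproduces the coding '11113333'.

// A minimal sketch of the structure encoding of Section 4.3.
#include <algorithm>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

struct ShapeNode { double avgZ; };   // only the property used for ordering

// conn[i][j] holds the connection-type code (1..7) or 0 if i and j are not connected.
std::string encodeStructure(const std::vector<ShapeNode>& nodes,
                            const std::vector<std::vector<int>>& conn)
{
    const std::size_t n = nodes.size();
    std::size_t root = 0;                                   // node with the highest elevation
    for (std::size_t i = 1; i < n; ++i)
        if (nodes[i].avgZ > nodes[root].avgZ) root = i;

    std::string code;
    std::vector<bool> visited(n, false);
    std::vector<std::vector<bool>> recorded(n, std::vector<bool>(n, false));
    std::queue<std::size_t> bfs;
    bfs.push(root);
    visited[root] = true;

    while (!bfs.empty()) {
        const std::size_t cur = bfs.front(); bfs.pop();

        // Children of cur, ordered by descending average z for a unique coding.
        std::vector<std::size_t> nbrs;
        for (std::size_t j = 0; j < n; ++j)
            if (j != cur && conn[cur][j] != 0) nbrs.push_back(j);
        std::sort(nbrs.begin(), nbrs.end(),
                  [&](std::size_t a, std::size_t b) { return nodes[a].avgZ > nodes[b].avgZ; });

        for (std::size_t j : nbrs) {
            if (!recorded[cur][j]) {                        // record each edge only once
                code += static_cast<char>('0' + conn[cur][j]);
                recorded[cur][j] = recorded[j][cur] = true;
            }
            if (!visited[j]) { visited[j] = true; bfs.push(j); }
        }
    }
    return code;   // e.g. "11113333" for the box of Fig. 10
}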



Fig. 12. The encoding trees of different objects. (a) The encoding tree of the staircase, (b) the encoding tree of the box, (c) the encoding tree of the car and (d) the encoding tree of the cupboard.

Fig. 13. The procedures of the shape assembling.

4.4. Primitive shape assembling and object detection

The object detection algorithm proposed in this paper can be regarded as an assembler of primitive shapes. We propose a method similar to region growing to assemble the nodes into a meaningful object. First, a qualified shape is selected as the seed point, and other qualified shapes are assembled according to the connections. The criteria for a qualified plane are as follows. (1) The area Area_i and the height difference DeltZ_i of the plane should be in accordance with the characteristics of the object to be detected; for example, because the planes of a staircase are small and short, their area and height difference should be smaller than the thresholds, i.e., Area_i < σ and DeltZ_i < ε. (2) The angle θ_ij between two nodes should match the perpendicularity required by the object; taking the planes of the stairs as an example, θ_ij should be smaller than a threshold, i.e., θ_ij < α. (3) The plane type should be in accordance with the object characteristics, and the plane must not have been traversed already. Then, the newly qualified shapes are defined as the new seed points in turn, and other qualified shapes are assembled iteratively until no suitable shape can be found. Finally, an assemblage of the primitive shapes is obtained.

Taking the stairs as an example, the whole process of shape assembling is as follows. As discussed in Section 4.3, the planes comprising stairs are small and short. As shown in Fig. 13, the small vertical plane A is selected as a seed point. The seed point A is connected with plane 1 and plane 2 at the same time; due to the characteristics of the stairs, the vertical plane 2 is selected for assembly with the seed A. Then, plane 2 is defined as the new seed B, and another eligible plane is chosen for assembly with plane B. The neighboring qualified shapes are assembled iteratively until no further qualifying shape can be found. Finally, the connection types between the planes are recorded for the object detection.
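The seed-growing step can be sketched in C++ as follows (Algorithm 2 below gives the authors' full pseudocode). The eligibility callback stands in for the area, height-difference, plane-type and angle checks listed above, and the data structures are simplified assumptions; the connection string of the grown assembly is then produced with the same encoding procedure as in Section 4.3 and compared with the target coding.

// A compact sketch of seed-growing shape assembly (not the authors' code).
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

struct Node { double area, deltaZ, avgZ; int planeType; };

// conn[i][j]: connection-type code (1..7), 0 if shapes i and j are not connected.
std::vector<std::size_t> growFromSeed(
        std::size_t seed,
        const std::vector<Node>& nodes,
        const std::vector<std::vector<int>>& conn,
        const std::function<bool(const Node&, const Node&)>& eligible, // (seed, candidate)
        std::vector<bool>& traversed)
{
    std::vector<std::size_t> members{seed};
    std::queue<std::size_t> seeds;             // the ObjSeed queue of Algorithm 2
    seeds.push(seed);
    traversed[seed] = true;

    while (!seeds.empty()) {
        const std::size_t cur = seeds.front(); seeds.pop();
        for (std::size_t j = 0; j < nodes.size(); ++j) {
            if (j == cur || conn[cur][j] == 0 || traversed[j]) continue;
            if (!eligible(nodes[cur], nodes[j])) continue;   // reject unsuitable shapes
            traversed[j] = true;                // qualified shape joins the assembly
            members.push_back(j);
            seeds.push(j);                      // and becomes a new seed in turn
        }
    }
    return members;   // growing stops when no suitable neighbouring shape remains
}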

The structure string is then compared with that of the target object. If the coding is in line with the target object, the assembly of shapes corresponds to the object we are looking for; for example, if the coding of the observed object is '11113333', the object is the box we are looking for. Algorithm 2 shows the process of object detection from the scene point cloud.

Algorithm 2. Object detection from the scene point cloud.

Input: Primitive shapes and the connection types between shapes
Output: The target object
1.  S is the set of shapes extracted from the scene, sorted by average z-value in descending order.
2.  for (int i = 0; i < S.size(); i++)
3.    if (DeltZ_i < ε && Area_i < σ && the plane type of S_i meets the condition && S_i is not traversed)


4.      Select S_i as the seed point: S_seedNum ← S_i, ObjSeed ← S_i, objRecog ← S_i
5.      do
6.        for all nodes j connecting to the seed node S_seedNum
7.          if (Conn(seedNum, j) == -1)   // i.e., the edge e_{seedNum,j} is not traversed
8.            if (DeltZ_j < ε && Area_j < σ && θ_ij < α && S_j is not traversed)
9.              objRecog ← S_j, Conn(seedNum, j) = 1   // assemble node S_j and mark the edge e_{seedNum,j} as traversed
10.             Record the node S_j connecting to the seed node: ObjSeed ← S_j
11.             Determine the connection type between S_seedNum and S_j, and record the connection string
12.           else
13.             j++   // judge the other nodes connecting to S_seedNum for eligibility
14.           end if
15.         else
16.           j++   // judge the other nodes connecting to S_seedNum for eligibility

17.         end if
18.         if (all the nodes connecting to S_seedNum are traversed)
19.           Remove the first element in ObjSeed and set S_seedNum ← ObjSeed.firstElement
20.         end if
21.      while (there exists another node that connects with S_seedNum and is not marked)
22.    end for
23.  Compare the connection string with those of the target objects
24.  if (the coding is eligible)
25.    the detection is complete
26.  else
27.    Judge the connection string of another combination of shapes
28.  end if
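The paper does not spell out how variable-length codings such as '11…11' (stairs) or '1…1331…1' (cupboard) are compared with the recorded string. One plausible reading is a small per-category predicate, sketched below; the exact matching rules are assumptions of this sketch.

// A minimal sketch of comparing a recorded connection string with the
// target codings of Section 4.3 (matching rules assumed, not from the paper).
#include <algorithm>
#include <cstddef>
#include <string>

bool looksLikeBox(const std::string& code) { return code == "11113333"; }
bool looksLikeCar(const std::string& code) { return code == "665"; }

bool looksLikeStairs(const std::string& code)     // '11...11': only type-1 edges
{
    return code.size() >= 2 &&
           std::all_of(code.begin(), code.end(), [](char c) { return c == '1'; });
}

bool looksLikeCupboard(const std::string& code)   // '1...1331...1': 1s around a '33' core
{
    const std::size_t pos = code.find("33");
    if (pos == std::string::npos) return false;
    for (std::size_t i = 0; i < code.size(); ++i)
        if (code[i] != '1' && i != pos && i != pos + 1) return false;
    return true;
}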

5. Experimental results

The experimental datasets (Scenes 1–5) were acquired with a Topcon GLS-1500 scanner. Scene point clouds 6 is one room of OFFICE 2, which can be downloaded from http://www.ifi.uzh.ch/vmml/publications/ObjDetandClas.html. The proposed algorithms are programmed in VC++ with OpenGL for display and rendering. All of the experiments in this paper were carried out on a PC with an Intel(R) Core(TM) 2 CPU at 2.80 GHz and 16 GB of memory. Fig. 14 shows the stairs extracted from scene point clouds 1 using our algorithm.

Fig. 14. The detection of the stairs from scene point clouds 1.

Fig. 15. The detection of the car from scene point clouds 2. (a) Scene point clouds 2, (b) the planes in scene point clouds 2 and (c) the extraction of the car. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)


Fig. 16. The detection of the car from scene point clouds 3. (a) Scene point clouds 3, (b) the planes in scene point clouds 3 and (c) the extraction of the car. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

Our algorithm depends on four parameters. In this paper, the threshold δ of the boundary point detection is set to 0.3 for all experiments. The thresholds ε and σ used in the eligible node selection depend on the object to be detected; for the experimental results described in this paper, ε ranges from 0.2 to 3.0 and σ ranges from 0.1 to 2.0. The parameter α determines whether two nodes are perpendicular or not, and it is fixed to 0.15 for all experiments.

Fig. 15(a) shows scene point clouds 2, scanned at Xi'an University of Technology, and Fig. 15(b) shows the planes extracted from scene point clouds 2 after segmentation. As can be seen, the car in the red box is segmented into three intersecting planes. Fig. 15(c) shows the cars extracted from the scene. However, the second car (green box in Fig. 15(b)) is segmented into only two intersecting planes because its top plane was not scanned, so its topology is represented by two connected nodes. As a result, the car in the green frame is not extracted due to the occlusion in the data scan.

Fig. 16(a) shows scene point clouds 3, scanned from one side view. It can be seen in Fig. 16(b) that the car in the red box is segmented into three intersecting planes, whereas the car in the green box is segmented into two intersecting planes due to the missing scan of the top plane. As a result, the car in the green frame is not detected due to the incompleteness of the point clouds. Fig. 16(c) shows the car extracted from the scene.

Fig. 17 shows scene point clouds 4, an indoor scene that contains several boxes. Fig. 17(b) shows the planes extracted from scene point clouds 4; the walls, ground and boxes composed of planes are extracted separately. Fig. 17(c) shows the detection result of the complete box point clouds from the scene.

As shown in Fig. 18(a), scene point clouds 5 include a building, the ground and low vegetation. Fig. 18(b) shows the planes extracted from scene point clouds 5 using the rough-detail segmentation, and Fig. 18(c) shows the stairs extracted from the scene. Although the two staircases in the scene have different numbers of steps, both are detected.

Fig. 19(a) shows office scene 6, which consists of many planar surfaces and isolated objects (e.g. chairs, desks and cupboards). We reduced the dataset to 50% of its original size for scene 6. Note that, for clarity, we extracted the patches corresponding to the room walls and colored them gray. As shown in Fig. 19(b), each plane cluster in scene point clouds 6 is rendered in a single color. Two cupboards consisting of the same elements but of different heights are both detected (Fig. 19(c)). However, another cupboard (Fig. 19(d)) is undetected, mainly because its back surface is incomplete and is partitioned into several planes due to the large gaps in the scan. The connection string of the undetected cupboard is '1…1331…131…3…', which is not in line with that of the target object.

Table 1 shows the point numbers of the target objects and of the scene point clouds; the last row shows the time needed for shape assembly and connection string comparison. Table 2 shows a quantitative evaluation of the object detection results for the scene point clouds. The columns give the number of actual counted objects, the number of detected objects, the number of false negatives and the corresponding precision values. Note that some of the cars are not recognized due to the missing scan of the top surface.

Fig. 20 shows the recognition result obtained using [26]. Fig. 20(a) shows the query graph for the box; the query graph and the attached constraints must be predefined by hand. Fig. 20(b) shows the boxes recognized from scene point clouds 4. As can be seen, the third and fifth boxes (in red, from left to right) are extracted, but the remaining boxes of different sizes cannot be extracted, mainly because their node constraints and edge constraints differ from those of the query object. The algorithm of [26] cannot detect similar objects of different sizes, whereas our proposed algorithm aims to recognize not only a specific instance of an object but all objects within a category.



Fig. 17. The recognition of the box in indoor scene. (a) Scene point clouds 4, (b) the planes in scene point clouds 4 and (c) the recognition of the boxes.

Fig. 18. The recognition of the stairs in scene point clouds 5. (a) Scene point clouds 5, (b) the planes in scene point clouds 5 and (c) the recognition of the stairs.


Fig. 19. The detection of the cupboards in scene point clouds 6. (a) Scene point clouds 6, (b) the planes in scene point clouds 6, (c) the recognition of the cupboards and (d) the undetected cupboard.

Table 1. Point numbers of the target objects and scenes in the six datasets.

                                        Scene 1    Scene 2    Scene 3    Scene 4    Scene 5      Scene 6
Point number of the scene               972,935    731,114    776,644    535,356    1,876,646    1,661,366
Point number of the target object       11,487     231,100    21,462     15,345     31,436       117,874
Shape assembling and connection
string comparison time (ms)             157        217        287        401        125          439

Table 2. Quantitative evaluation of the object detection results in the datasets.

                       Count    Detect.    F.Neg    Precision (%)
Cars (Scene 2)         7        5          2        71
Cars (Scene 3)         2        1          1        50
Boxes (Scene 4)        5        5          0        100
Stairs (Scene 5)       2        2          0        100
Cupboards (Scene 6)    3        2          1        66.7


Fig. 20. The comparison between our proposed algorithm and [26]. (a) The query graph of the box and (b) the plane extraction result by using graph matching [26]. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

6. Conclusion and future work

In this paper, we present a method for object detection based on primitive shapes and their topological relationships. An "assembly-matching" strategy, which relies on the structure analysis of objects, is proposed to detect the target objects in the scene point clouds. Our method is well suited to man-made objects that can be described by compositions of primitive shapes. The experimental results show that the proposed method can quickly detect the target objects from massive point clouds.

In this work, only objects composed of planar surfaces have been considered (stairs, cars, boxes, and cupboards). In future work, detecting more complex objects and 3D non-rigid objects will be studied. In addition, the method currently ignores the problem of occlusions; we hope to extend its capabilities to tackle the occlusion of objects in the future, which could improve the detection accuracy of the proposed method.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant nos. 61272284, 61472319, 61302135 and 61401355; in part by the Shaanxi Science Research Plan under Grant no. 2014K05-49; and in part by the Doctoral Startup Foundation of Xi'an University of Technology.

References

[1] A. Sgorbissa, D. Verda, Structure-based object representation and classification in mobile robotics through a Microsoft Kinect, Robot. Auton. Syst. 61 (2013) 1665–1679.
[2] X. Bai, C. Rao, X.G. Wang, Shape vocabulary: a robust and efficient shape representation for shape matching, IEEE Trans. Image Process. 23 (9) (2014) 3935–3949.
[3] X. Bai, S. Bai, Z.T. Zhu, L.J. Latecki, 3D shape matching via two layer coding, IEEE Trans. Pattern Anal. Mach. Intell. 37 (12) (2015) 2361–2373.
[4] S. Bai, X. Bai, W.Y. Liu, F. Roli, Neural shape codes for 3D model retrieval, Pattern Recognit. Lett. 65 (11) (2015) 15–21.
[5] J. Niemeyer, F. Rottensteiner, U. Soergel, Contextual classification of lidar data and building object detection in urban areas, ISPRS J. Photogramm. Remote Sens. 87 (2014) 152–165.
[6] C. Cabo, C. Ordoñez, S. García-Cortés, J. Martínez, An algorithm for automatic detection of pole-like street furniture objects from Mobile Laser Scanner point clouds, ISPRS J. Photogramm. Remote Sens. 87 (2014) 47–56.
[7] W. Yao, S. Hinz, U. Stilla, Extraction and motion estimation of vehicles in single-pass airborne LiDAR data towards urban traffic analysis, ISPRS J. Photogramm. Remote Sens. 66 (2011) 260–271.
[8] W. Yao, Y. Wei, Detection of 3D individual trees in urban areas by combining airborne LiDAR data and imagery, IEEE Geosci. Remote Sens. Lett. 10 (6) (2013) 1355–1359.
[9] A.E. Johnson, M. Hebert, Using spin images for efficient object recognition in cluttered 3D scenes, IEEE Trans. Pattern Anal. Mach. Intell. 21 (5) (1999) 433–449.
[10] H. Date, Y. Kaneta, A. Hatsukaiwa, M. Onosato, S. Kanai, Object recognition in terrestrial laser scan data using spin images, Comput. Aided Des. (2011) 1–11.
[11] A. Frome, D. Huber, R. Kolluri, et al., Recognizing objects in range data using regional point descriptors, in: Proceedings of the 8th European Conference on Computer Vision, 2004, pp. 224–237.
[12] Y. Zhong, Intrinsic shape signatures: a shape descriptor for 3D object recognition, in: Proceedings of the 12th IEEE International Conference on Computer Vision, 2009, pp. 689–696.
[13] F. Tombari, S. Salti, L.D. Stefano, Unique signatures of histograms for local surface description, in: Proceedings of the 11th European Conference on Computer Vision, ECCV'10, 2010, pp. 356–369.
[14] S. Salti, F. Tombari, L.D. Stefano, SHOT: unique signatures of histograms for surface and texture description, Comput. Vis. Image Underst. 125 (2014) 251–264.
[15] B. Taati, M. Greenspan, Local shape descriptor selection for object recognition in range data, Comput. Vis. Image Underst. 115 (2011) 681–694.
[16] Y. Guo, F. Sohel, M. Bennamoun, M. Lu, J. Wan, Rotational projection statistics for 3D local surface description and object recognition, Int. J. Comput. Vis. 105 (2013) 63–86.
[17] Y. Guo, F. Sohel, M. Bennamoun, J. Wan, A novel local surface feature for 3D object recognition under clutter and occlusion, Inf. Sci. 293 (2015) 196–213.

[18] M. Najafi, T.S. Namin, M. Salzmann, L. Petersson, Non-associative higher-order Markov networks for point cloud classification, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 500–515.
[19] J. Niemeyer, F. Rottensteiner, U. Soergel, Conditional random fields for LiDAR point cloud classification in complex urban areas, ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. I-3 (2012) 263–268.
[20] H.J. Zhao, Y.M. Liu, X.L. Zhu, Y.P. Zhao, H.B. Zha, Scene understanding in a large dynamic environment through laser-based sensing, in: Proceedings of the International Conference on Robotics and Automation, 2010, pp. 127–133.
[21] A. Golovinskiy, G.K. Vladimir, T. Funkhouser, Shape-based recognition of 3D point clouds in urban environments, in: Proceedings of the IEEE 12th International Conference on Computer Vision, 2009, pp. 2154–2161.
[22] A. Velizhev, R. Shapovalov, K. Schindler, Implicit shape models for object detection in 3D point clouds, ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. I-3 (2012) 179–184.
[23] H.Y. Wang, C. Wang, H. Luo, P. Li, M. Cheng, C.L. Wen, J.T. Li, Object detection in terrestrial laser scanning point clouds based on Hough forest, IEEE Geosci. Remote Sens. Lett. 11 (10) (2014) 1807–1811.
[24] O. Mattausch, D. Panozzo, C. Mura, O. Sorkine-Hornung, R. Pajarola, Object detection and classification from large-scale cluttered indoor scans, Comput. Graph. Forum 33 (2) (2014) 11–21.
[25] Y.M. Kim, N. Mitra, D.M. Yan, L. Guibas, Acquisition of 3D indoor environments with variability and repetition, ACM Trans. Graph. 31 (6) (2012), Article No. 138.
[26] R. Schnabel, R. Wessel, R. Wahl, R. Klein, Shape recognition in 3D point clouds, in: Proceedings of the International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, 2008.
[27] N. Somani, E. Dean, C.X. Cai, A. Knoll, Perception and reasoning for scene understanding in human-robot interaction scenarios, in: Proceedings of the 15th International Conference on Computer Analysis of Images and Patterns, 2013, pp. 1–14.
[28] M. Nieuwenhuisen, J. Stückler, A. Berner, R. Klein, S. Behnke, Shape primitive based object recognition and grasping, in: Proceedings of the 7th German Conference on Robotics, ROBOTIK, 2012, pp. 1–5.
[29] A. Berner, J. Li, D. Holz, J. Stückler, Combining contour and shape primitives for object detection and pose estimation of prefabricated parts, in: Proceedings of the IEEE International Conference on Image Processing, 2013, pp. 3326–3330.
[30] Y.H. Wang, W. Hao, X.J. Ning, et al., Automatic segmentation of urban point clouds based on the Gaussian map, Photogramm. Rec. 28 (144) (2013) 342–361.
[31] L.Q. Liu, Processing Algorithms on Scattered Point Clouds (Master thesis), Northwest University, Xi'an, China, 2010.

Wen Hao received her BS and MS degrees from the Department of Computer Science at Shaanxi Normal University, Xi'an, China, in 2008 and 2011, respectively. During her master's study, she did an internship at the Institute of Automation, Chinese Academy of Sciences. She received her PhD degree from Xi'an University of Technology, Xi'an, China, in 2015. She is currently working at the Institute of Computer Science and Engineering, Xi'an University of Technology, Xi'an, China. Her research interests include pattern recognition and point cloud processing.

Yinghui Wang received his PhD degree from Northwest University, Xi’an, China, in 2002. From 2003 to 2005, he was a postdoctoral fellow at Peking University, Beijing, China. Now he is a professor at the Institute of Computer Science and Engineering, Xi’an University of Technology, China. His research interests include image analysis and pattern recognition.
