Pattern Recognition Letters 14 (1993) 679-687 North-Holland
August 1993
Efficient 3-D object representation and recognition based on CAD

Sam-Chung Hwang, Soon-Ki Jung and Hyun-Seung Yang
Department of Computer Science, Korea Advanced Institute of Science and Technology, Kusong-dong, Yusong-gu, Taejon 305-701, South Korea
Received 21 February 1992
Abstract
Hwang, S.-C., S.-K. Jung and H.-S. Yang, Efficient 3-D object representation and recognition based on CAD, Pattern Recognition Letters 14 (1993) 679-687.

In this paper, a solid modeler by which one can automatically and efficiently build 3-D object models is implemented. The proposed method provides a fast and compact object representation by incorporating view-dependent features into the object-centered representation. To test the efficiency of the proposed object representation scheme for recognition, we have implemented a matching process based on the constrained search paradigm. Experimental results show that the proposed method reduces both the time and the storage space required during the recognition process.

Keywords: Object recognition, 3-D object representation, 3-D object modeling, CAD-based vision.
1. Introduction

In building 3-D object models for the purpose of recognition, one might use conventional methods such as manual construction or construction by examples. However, a systematic way of generating models efficiently and automatically is to exploit a CAD database. Since the term CAD-based computer vision was introduced by Bhanu (1987), many researchers have explored the relationship between CAD and computer vision. Hansen and Henderson (1989) studied the automatic synthesis of a specialized recognition scheme, called a strategy tree, based on the CAGD (computer aided geometric design) model.
Correspondence to: Hyun-Seung Yang, Department of Computer Science, Korea Advanced Institute of Science and Technology, Kusong-dong, Yusong-gu, Taejon 305-701, South Korea.
Hoffman et al. (1989) presented two experiments in CAD-driven object identification based on two types of sensor data: intensity images and range images. Flynn and Jain (1991) developed a system which uses 3-D object descriptions created on a commercial CAD system, expressed both in the industry-standard IGES (initial graphics exchange specification) form and as a polyhedral approximation; geometric inferencing is performed to obtain a relational graph representation of the object, which can be stored in a database of models for object recognition.

Another issue in 3-D object recognition is the model representation scheme. Two model representation schemes are widely used: object-centered representation and viewer-centered representation (Korn and Dyer (1987)). Object-centered representation describes objects by volumetric or surface models whose coordinates are fixed with respect to a reference frame. Since the features extracted for
recognition from an image are represented in the viewer-centered coordinate system, these features do not directly correspond to the object-centered descriptions. Thus a 3-D to 2-D, or 2-D to 3-D, transformation must be performed before the observed features can be matched to the model, and this transformation can be time consuming. In the multiple-view approach, the features can be directly matched with those associated with each member of the multiple-view model set; however, such model sets require a large amount of storage space. We propose a new representation scheme generated by integrating view-independent and view-dependent features.
2. Polyhedral boundary model

The representation or modeling of 3-D objects has been a common concern in both CAD and model-based vision. Although many techniques are shared by the two disciplines, the requirements of each modeling task turn out to be quite different. The main goal of CAD systems is to design new shapes suitable for manufacturing objects automatically, based on the desired specifications, in a cost-effective manner. In contrast, model-based vision systems are used to analyze objects already in existence, for recognition and manipulation. Because of the different modeling requirements of CAD and vision systems, we have built a solid modeler using the polyhedral B-rep instead of CAD models for the acquisition of vision models.

A plane model is a planar directed graph of vertices, edges and polygons. On top of the planar graph, a special topology is defined by identifying edges and vertices of individual polygons. For representing this graph, we use a five-level hierarchical data structure, the so-called half-edge data structure, consisting of nodes of type Solid, Face, Loop, HalfEdge and Vertex, and the face-to-face relational node Edge (see Mantyla (1988) for details); a sketch of this structure follows.
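As an illustration, the five node types and the relational Edge node can be written down as a set of mutually linked records. The following Python rendering is a minimal sketch of such a half-edge structure; the field names are ours, not Mantyla's exact layout:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Vertex:
        x: float
        y: float
        z: float

    @dataclass
    class HalfEdge:
        start: Vertex                        # start vertex of this half-edge
        edge: Optional["Edge"] = None        # full edge shared with the mate
        nxt: Optional["HalfEdge"] = None     # next half-edge around the loop

    @dataclass
    class Edge:                              # face-to-face relational node:
        he1: Optional[HalfEdge] = None       # its two half-edges belong to
        he2: Optional[HalfEdge] = None       # the two faces sharing the edge

    @dataclass
    class Loop:
        halfedges: List[HalfEdge] = field(default_factory=list)

    @dataclass
    class Face:
        loops: List[Loop] = field(default_factory=list)  # outer loop + holes

    @dataclass
    class Solid:
        faces: List[Face] = field(default_factory=list)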
3. Representation of 3-D objects

In general, the description of the solid modeler cannot be directly applied to vision tasks. The reason is the difference between the CAD model and the vision model of a 3-D object, so a geometric inference engine that builds vision models is needed. The conventional model representation schemes involve a trade-off between matching time and storage space: object-centered models can be stored compactly but require significant resources to compute features that can be matched with features extracted from images, whereas viewer-centered models can be searched quickly for matching features but require a large amount of storage space. In this section, we propose a new representation scheme generated by integrating view-independent and view-dependent features.

3.1. View-independent features
View-independent features are defined as those that are rotationally invariant and independent of the viewpoint. They are described by the relational graph, which contains the adjacency relationships between pairs of faces. The relational graph G of an object O has the following properties:
1. Each face f_i of the object O is represented by a node in G.
2. For every edge e_ij between two faces f_i and f_j, there exists an edge connecting the corresponding nodes in G.
In other words, the node set of G corresponds to the faces of O and its edge set encodes the face adjacency of O; a small construction example follows.
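A minimal sketch of how such a graph can be assembled. For brevity we assume each face is summarized by the set of identifiers of its boundary edges, rather than walking the half-edge records directly; faces sharing an edge become adjacent nodes:

    from collections import defaultdict

    def relational_graph(face_edges):
        """Build the view-independent relational graph: one node per face,
        one arc per pair of faces sharing an edge.  `face_edges` maps a face
        id to the set of its (undirected) edge ids -- a simplified stand-in
        for walking the half-edge structure."""
        edge_to_faces = defaultdict(set)
        for f, edges in face_edges.items():
            for e in edges:
                edge_to_faces[e].add(f)
        nodes = set(face_edges)
        arcs = {frozenset(fs) for fs in edge_to_faces.values() if len(fs) == 2}
        return nodes, arcs

    # A unit cube: 6 faces and 12 shared edges give a 6-node, 12-arc graph.
    cube = {
        "top":    {"tn", "te", "ts", "tw"}, "bottom": {"bn", "be", "bs", "bw"},
        "north":  {"tn", "bn", "ne", "nw"}, "south":  {"ts", "bs", "se", "sw"},
        "east":   {"te", "be", "ne", "se"}, "west":   {"tw", "bw", "nw", "sw"},
    }
    nodes, arcs = relational_graph(cube)
    assert len(nodes) == 6 and len(arcs) == 12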
3.2. View-dependent features
View-dependent features such as surface area and simultaneous visibility are important in 3-D object recognition. With view-dependent features, one can directly classify objects by comparing the features from the image against those from reference views stored in a database. However, view-dependent features depend on the model organization, so the selection of the representative views is important, and these features must be calculated from each viewpoint.

3.2.1. Representative views
To generate a set of views, the viewpoint space is partitioned into a finite number of regions, and a representative view is selected in each region. There are two approaches to partitioning the viewpoint space: (1) a uniform, object-independent partitioning, where the number of regions and their shape are fixed in advance, and (2) the aspect graph approach, where an aspect is the topological appearance of the object from a particular viewpoint (Hansen and Henderson (1989)). The formal techniques for aspect graph construction have two major drawbacks: (1) large time and space complexity, and (2) a lack of implemented algorithms for computing the aspect graphs of nonpolyhedral objects. We therefore take representative views from the uniform view sphere. In this approach, a set of viewpoints on the view sphere, which provides a near-uniform sampling of this finite space, is selected. An icosahedral tessellation of a unit sphere is used, and the tessellated sphere is then uniformly scaled to the proper size. A total of 320 points on the view sphere are generated from a subdivided icosahedron with its vertices on the unit sphere (see Ballard and Brown (1982) for details on constructing a subdivided icosahedron); a construction along these lines is sketched below.
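This is our own rendering of the standard construction, not the authors' code: start from the 12 icosahedron vertices, recover the 20 faces as mutually nearest vertex triples, split each spherical triangle twice into 4 (20 x 16 = 320 triangles), and take the normalized centroids as viewing directions:

    import itertools
    import numpy as np

    def icosahedron_vertices():
        """The 12 vertices of a regular icosahedron on the unit sphere."""
        p = (1 + 5 ** 0.5) / 2                       # golden ratio
        v = [(0, s1, s2 * p) for s1 in (-1, 1) for s2 in (-1, 1)]
        v += [(s1, s2 * p, 0) for s1 in (-1, 1) for s2 in (-1, 1)]
        v += [(s2 * p, 0, s1) for s1 in (-1, 1) for s2 in (-1, 1)]
        v = np.array(v, dtype=float)
        return v / np.linalg.norm(v[0])

    def icosahedron_faces(v):
        """The 20 faces, found as mutually adjacent (nearest-neighbour) triples."""
        d = np.linalg.norm(v[:, None] - v[None, :], axis=2)
        adj = d < d[d > 0].min() * 1.01
        return [t for t in itertools.combinations(range(len(v)), 3)
                if adj[t[0], t[1]] and adj[t[1], t[2]] and adj[t[0], t[2]]]

    def subdivide(tri, depth):
        """Split a spherical triangle into 4, recursively: 4**depth triangles."""
        if depth == 0:
            return [tri]
        a, b, c = tri
        ab = (a + b) / np.linalg.norm(a + b)         # midpoints re-projected
        bc = (b + c) / np.linalg.norm(b + c)         # onto the unit sphere
        ca = (c + a) / np.linalg.norm(c + a)
        return [t for s in ((a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca))
                for t in subdivide(s, depth - 1)]

    # 20 faces x 4**2 = 320 triangles; viewpoints are the normalized centroids.
    verts = icosahedron_vertices()
    tris = [t for f in icosahedron_faces(verts)
            for t in subdivide(tuple(verts[i] for i in f), 2)]
    viewpoints = [c / np.linalg.norm(c) for c in (sum(t) / 3 for t in tris)]
    assert len(viewpoints) == 320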
3.2.2. Visibility test and jump edge

To extract view-dependent features, Flynn and Jain (1991) exploit synthetic range images. However, ray-tracing techniques or the Z-buffer algorithm for synthesizing depth images at a large number of viewpoints are obviously time consuming and require a large amount of storage. We therefore calculate the view-dependent features with a visibility test of the faces at each viewpoint.
Figure 1. A solid with an occluding facet (a) and the projected image at a viewpoint V_p (b).
In general, a point on the object is said to be visible in an image whenever light arrives at the viewpoint from the point. A face is invisible when it is a back-facing polygon or is occluded by other faces. When a face is partially occluded and is connected with the occluding face in the projected image, jump edges may be produced. Figure 1 shows the adjacent faces of a face f_i in the image at a viewpoint V_p; a face f_j is occluded by f_i at the jump edge depicted by the darker line in Figure 1. Figure 2 describes the algorithm for testing the visibility of other faces from an edge h_p of the face f_i at an arbitrary viewpoint V_p. An edge h_p of a face f_i is a jump edge if the following conditions are satisfied:
• The face f_i is a fore-facing polygon.
• The edge h_p between the two faces f_i and f_j is a convex edge.
• The face f_j is a back-facing polygon.
An edge h_p satisfying the above conditions falls into one of the following three categories:
• Invisible, because of other occluding faces in the projected image.
• Projected onto other faces. In this case, a beam started from the edge h_p intersects the other faces; the face connected by the jump edge h_p is the closest plane of intersection with the beam.
• Not adjacent to any other face. Such an edge h_p forms part of the boundary of the object in the projected image.
    procedure jump_edge(f_i, h_p, V_p)
    var
        line_list   : linked list of parametric lines;
        points_list : linked list of points;
    begin
        line_list := (0.0, 1.0, f_i);
        f_j := adjacent face with h_p;
        if h_p is convex and f_j is back-facing polygon then
        begin
            for each face f_k do
            begin
                if f_k is fore-facing polygon then
                begin
                    points_list := projection of h_p on the f_k;
                    if points_list is above h_p then
                        subtract points_list from line_list
                    else
                        add points_list and f_k to line_list;
                end
            end
        end;
        if line_list is (0.0, 1.0, f_i) then
            f_i is not adjacent with any other faces
        else if line_list is null then
            f_i is invisible
        else
            f_i is adjacent with faces in line_list;
    end

Figure 2. Algorithm for testing the visibility of other faces from an edge h_p of a face f_i.
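The fore-/back-facing tests at the heart of Figure 2 are plain dot products. A hypothetical Python fragment illustrating only that part (face records carrying a precomputed outward normal and a sample point are our assumption, not the paper's data structure):

    import numpy as np

    def fore_facing(face, viewpoint):
        """A face is fore-facing when its outward normal points toward the
        viewer; `face.normal` and `face.point` are assumed attributes."""
        return np.dot(face.normal, viewpoint - face.point) > 0.0

    def jump_edge_candidate(fi, fj, edge_is_convex, viewpoint):
        """The three conditions of Section 3.2.2 for an edge h_p shared by
        f_i and f_j: f_i fore-facing, h_p convex, f_j back-facing."""
        return (fore_facing(fi, viewpoint)
                and edge_is_convex
                and not fore_facing(fj, viewpoint))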
3.2.3. View-dependent features

The following view-dependent features are used:
• Always occluded. AO_f is 1 if a face f is occluded by other faces, or invisible, for all viewpoints.
• Never occluded. NO_f is 1 if a face f is never occluded by other faces, for all viewpoints.
• Simultaneous visibility. SV_fi,fj. Two faces are never simultaneously visible if, for all views, the dot products of their face normals with the view vector are never simultaneously positive, or if one of the two faces is fully occluded by others.
• Jump edge. J_fi,fj. Two faces f_i and f_j are connected by a jump edge if f_j is partially occluded by f_i at some viewpoint.
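Given two boolean tables computed over the representative views -- fore[v, f] (face f is fore-facing in view v) and occ[v, f] (face f is fully occluded in view v) -- the first three features reduce to simple aggregations. A sketch under that assumption:

    import numpy as np

    def view_dependent_features(fore, occ):
        """fore, occ: boolean arrays of shape (n_views, n_faces)."""
        visible = fore & ~occ        # fore-facing and not occluded
        AO = ~visible.any(axis=0)    # always occluded or back-facing
        NO = ~occ.any(axis=0)        # never occluded in any view
        # SV[i, j]: faces i and j are simultaneously visible in some view
        SV = (visible[:, :, None] & visible[:, None, :]).any(axis=0)
        return AO, NO, SV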
4. Extraction of features from range images

Synthetic range images are used instead of real range images. The type (convex crease, concave crease, occluding or occluded) of each edge can be determined by comparing the range values of the two neighboring pixels of the edge. The adjacency relationships between two regions, and the other binary features, are obtained during this edge tracking. Unary features are the attributes of each segmented region r_k, such as the normal vector N_rk, the area A_rk, the occlusion O_rk (1 if the region r_k is occluded by other regions), the existence of holes or other regions within the region C_rk, and the number of adjacent regions NE_rk. Binary features are attributed arcs between two segmented regions: the difference of orientation θ_ri,rj, the area ratio AR_ri,rj, the edge type E_ri,rj, the distance between the centroids of the two regions D_ri,rj, and so forth.
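For concreteness, these attributes can be carried in plain records; the field names below simply mirror the symbols above (our sketch, not the authors' data layout):

    from dataclasses import dataclass

    @dataclass
    class RegionFeatures:          # unary features of a segmented region r_k
        normal: tuple              # N_rk : surface normal
        area: float                # A_rk : region area
        occluded: bool             # O_rk : occluded by another region
        holes: int                 # C_rk : holes / enclosed regions
        n_adjacent: int            # NE_rk: number of adjacent regions

    @dataclass
    class ArcFeatures:             # binary features between regions r_i, r_j
        angle: float               # theta_ri,rj : difference of orientation
        area_ratio: float          # AR_ri,rj
        edge_types: frozenset      # E_ri,rj : types of the shared boundary
        centroid_dist: float       # D_ri,rj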
5. Matching

The purpose of the matching process is to find the model faces corresponding to the scene regions. In general, any solution to the problem of matching a scene object with a model object can be conceived of as a depth-first tree search (see Chen and Kak (1988) and Grimson (1990)). A traversal through the search tree may be referred to as a matching process: each arc in the traversal represents an attempt at testing the correspondence between a scene feature and a model feature, and each node represents the current state of the matching process. The system first selects, among the unknown regions, those which seem most reliable and useful for recognition, called kernels (Oshima and Shirai (1983)), and then selects the models which include faces corresponding to the kernel; these are the candidate model objects. Next, the system performs a data-driven depth-first search: it selects the most reliable and useful region among the remaining scene regions and searches the faces of the model object corresponding to the regions in the scene. This is processed by hypothesis generation and verification (Chen and Kak (1988)), and the process is represented by the search tree, as sketched below.
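A minimal sketch of that search skeleton, assuming the regions are already ordered by reliability (the kernel first) and that unary_ok / binary_ok encode the constraints of Section 5.1:

    def constrained_search(regions, faces, unary_ok, binary_ok):
        """Depth-first interpretation-tree search: extend a partial assignment
        of scene regions to model faces, pruning inconsistent branches early."""
        solutions = []

        def extend(assignment):
            if len(assignment) == len(regions):
                solutions.append(dict(assignment))      # complete interpretation
                return
            r = regions[len(assignment)]                # next most reliable region
            for f in faces:
                if not unary_ok(r, f):                  # unary constraint prunes f
                    continue
                if all(binary_ok(r, f, r2, f2)          # binary constraints against
                       for r2, f2 in assignment.items()):   # earlier pairings
                    assignment[r] = f                   # hypothesis generation
                    extend(assignment)                  # deeper verification
                    del assignment[r]

        extend({})
        return solutions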
During searching, the features of the scene and of the models are used as constraints on the generation of hypotheses, i.e., for pruning branches of the search tree. Because the search process is inherently exponential, the key to an efficient solution is to use constraints to remove large subtrees from consideration without explicitly exploring them. The following subsections introduce the constraints used in our system.

5.1. Constraints

A unary constraint is defined as the similarity between two surface patches r_i (in the scene) and f_k (in the model), calculated from the attributes of the surface patches. The following predicates are tested:
• O_ri ∧ NO_fk. If a scene region r_i is occluded but a model face f_k is never occluded, then the unary constraint is not satisfied.
• O_ri. If a scene region r_i is occluded, then the unary constraint of r_i is satisfied for all model faces f_k.
• d(A_ri, A_fk) < threshold, where A_ri (A_fk) represents the area of the ith region in the scene (the kth face of the model) and d(x, y) is the difference between x and y.
• C_ri = C_fk, where C_ri (C_fk) represents the number of holes in the region r_i (face f_k).
A binary constraint checks the compatibility between a pair of matched surface patches (r_i, f_k) and another pair of matched surface patches (r_j, f_l). We exploit the following binary constraints to prune inconsistent solution paths:
• Connection consistency: E_ri,rj ⊆ E_fk,fl, where E_ri,rj is the set of edge types between the two regions r_i and r_j.
• Simultaneous visibility: SV_fk,fl. Since the two regions r_i and r_j are visible simultaneously, the two faces f_k and f_l must also be simultaneously visible.
• Direction consistency: d(θ_ri,rj, θ_fk,fl) < threshold, where θ denotes the difference of orientation defined in Section 4.
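Transcribed directly, with hypothetical thresholds, record fields as in Sections 3.2.3 and 4, and scene/model arc tables assumed to be available as dictionaries:

    AREA_T, ANGLE_T = 0.1, 0.2               # hypothetical thresholds

    scene_arcs = {}                          # assumed filled with ArcFeatures,
    model_arcs = {}                          # keyed by (patch, patch) pairs

    def unary_ok(r, f):
        """Similarity of scene region r and model face f; f is assumed to
        carry area, holes and the NO flag of Section 3.2.3."""
        if r.occluded and f.never_occluded:  # O_r with NO_f: not satisfied
            return False
        if not r.occluded:                   # occluded regions skip these tests
            if abs(r.area - f.area) > AREA_T:
                return False
            if r.holes != f.holes:
                return False
        return True

    def binary_ok(r1, f1, r2, f2):
        """Compatibility of the pairing (r1, f1) with an earlier (r2, f2)."""
        s, m = scene_arcs[(r1, r2)], model_arcs[(f1, f2)]
        if not s.edge_types <= m.edge_types: # connection consistency
            return False
        if not m.simultaneously_visible:     # SV: r1, r2 are seen together,
            return False                     # so f1, f2 must be, too
        if abs(s.angle - m.angle) > ANGLE_T: # direction consistency
            return False
        return True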
The sequential constrained tree search is a depth-first search, with downward termination based on constraint consistency. After the tree search, a verification process is carried out by calculating the transformation error between the scene object and the model object.

5.2. Processing example

To show how the matching algorithm works, we present a simple example: the matching between a block model (block_L) and a range image. Figure 3 shows the segmented result of the range image and the block_L model. The scene and model relational graphs are generated from the segmented range image and from the model, as shown in Figure 3, and the matching process compares the two graphs to test the correspondence between the scene regions and the model faces. In this example, the number of faces in the scene is 4 and that of the model is 8. The system selects among the scene regions those which seem most reliable and useful for recognition; the resulting kernel selection sequence is <1, 3, 2, 4>. The system then selects the candidate models from the model database; in this case, we assume that only the block_L object is selected. Figure 4 shows all the possible paths of the depth-first tree search in the matching between the scene regions and the faces of the block_L model. To compare the object-centered representation and the proposed representation, two experiments were made. The first experiment uses only the view-independent features as the features of the model graph.
Figure 3. An example of the segmentation result (a) and the hidden-line-removed image of a model block_L (b).
Figure 4. An example of the matching tree (nodes pair scene regions with model faces).

In the second experiment, we use the view-independent features integrated with the view-dependent features in the matching process; the thick lines in Figure 4 represent this experiment. In this example, the matching based
on the proposed object representation generates fewer nodes than that based on the object-centered one, since the former uses more constraints than the latter during the matching process. For example, the face f_1 of the model object has no jump edges among its view-dependent features, whereas the region r_1 has a jump edge with the region r_3, so that r_1 and f_1 do not match. Since the matching process of our system only uses the adjacency relationships between two faces, the solution path is not unique; the solutions are verified by the transformation error between the scene object and the model object.
6. Experiments

The model library consists of the 8 object models shown in Figure 5. Three object representations are compared: the object-centered representation (method 1), the multiple-view representation (method 2) and the proposed representation (method 3). In method 2, we select the representative views from an icosahedron (20 views) and from a 320-sided polyhedron (320 views). Table 1 shows the storage space of each representation scheme; method 1 and method 3 are clearly more compact than the multiple-view representation. Figure 6 presents the range images of 9 arbitrarily rotated objects.

Figure 5. The models used in the experiment: (a) block_H, (b) block_L, (c) cblock, (d) cblock2, (e) chair, (f) mouse, (g) pblock, (h) sblock.
Table 1
Storage space for each model (bytes (ratio))

Model      Method 1      Method 2 (20 views)   Method 2 (320 views)   Method 3
block_H    2048          20884                 326304                 2968
block_L    928           10784                 156980                 1040
cblock     1252          9508                  149476                 1524
cblock2    1252          12420                 212588                 1468
mouse      1996          17052                 322920                 2820
pblock     1532          15812                 273436                 1796
chair      2608          27468                 432448                 4288
sblock     2372          25168                 393648                 3516

Average    1748.5 (1.0)  17387 (9.94)          283475 (162.12)        2427.5 (1.38)
Figure 6. The input scenes used in the experiment: (a) object-1 to (i) object-9.
Table 2
Results for object-9

           Max depth            Time (ms)            Number of nodes
Model      Method 1  Method 3   Method 1  Method 3   Method 1  Method 3
block_H    4         2          7.24      4.16       23        13
block_L    4         2          1.22      0.68       8         4
cblock     1         1          0.30      0.30       1         1
cblock2    1         1          0.75      0.75       4         4
chair      1         1          0.30      0.30       1         1
mouse      1         1          0.30      0.30       1         1
pblock     4         2          1.56      1.04       8         —
sblock     10        10         14.46     9.64       36        24
Figure 7 depicts the segmentation results of the scenes in Figure 6. Table 2 shows the matching depth, the matching time and the number of generated nodes of the two methods for object-9; in this table, we can observe that the proposed representation is more efficient in matching than the object-centered representation. Table 3 compares the average matching times for each method; models for which no match was found are marked "—" in this table.
Figure 7. The results of the segmentation of the range images: (a) object-1 to (i) object-9.
Table 3
Average matching time for all scenes (ms)

           At match                          At mismatch
Model      Method 1  Method 2  Method 3     Method 1  Method 2  Method 3
Object-1   4.66      1.66      1.99         2.75      1.32      1.19
Object-2   12.83     3.83      9.33         4.51      1.94      2.75
Object-3   13.33     4.83      9.99         0.21      1.89      0.18
Object-4   11.66     1.49      8.66         0.11      1.42      0.13
Object-5   10.16     —         8.16         1.30      1.95      1.26
Object-6   15.99     —         10.16        1.58      1.66      1.04
Object-7   3.83      4.16      3.49         0.66      1.78      0.68
Object-8   22.16     6.16      18.66        0.97      1.66      0.45
Object-9   17.33     4.16      11.16        1.80      1.92      1.06

Average    12.72     3.75      9.18         1.57      1.69      0.93
A number of conclusions can be derived from these data:
• The multiple-view representation is very fast when a model matches a scene object, but it is not efficient at rejecting mismatches.
• The view-dependent constraints dramatically reduce the number of matching nodes; the constraints significantly increase the computation at each node, but decrease the total computation of the matching process.
• The multiple-view representation with 20 views is not sufficient for matching complex objects such as sblock and mouse, and as the number of views increases, the storage space for the model and the matching time increase dramatically.
7. Conclusions

In this paper, we proposed an efficient and automatic CAD-based 3-D object model generation scheme. By integrating the object-centered representation with view-dependent features, the proposed method provides a fast and compact 3-D object representation. The experiments confirm the efficiency of the proposed method for recognition.
References

Ballard, D.H. and C.M. Brown (1982). Computer Vision. Prentice-Hall, Englewood Cliffs, NJ.
Bhanu, B., Ed. (1987). Guest Editor's introduction (Special issue on CAD-Based Robot Vision). Computer 20, 13-16.
Chen, C.H. and A.C. Kak (1988). 3D-POLY: a robot vision system for recognizing objects in occluded environments. Robot Vision Lab., School of Electr. Engrg., Purdue Univ., Tech. Rep. TR-EE 88-48.
Flynn, P.J. and A.K. Jain (1991). CAD-based computer vision: from CAD models to relational graphs. IEEE Trans. Pattern Anal. Machine Intell. 13, 114-132.
Grimson, W.E.L. (1990). The combinatorics of object recognition in cluttered environments using constrained search. Artificial Intelligence 40, 121-165.
Hansen, C. and T.C. Henderson (1989). CAGD-based computer vision. IEEE Trans. Pattern Anal. Machine Intell. 11, 1181-1193.
Hoffman, R., H.R. Keshavan and F. Towfiq (1989). CAD-driven machine vision. IEEE Trans. Syst. Man Cybernet. 19, 1477-1488.
Korn, M.R. and C.R. Dyer (1987). 3-D multiview object representations for model-based object recognition. Pattern Recognition 20, 91-103.
Mantyla, M. (1988). An Introduction to Solid Modeling. Computer Science Press, Rockville, MD.
Oshima, M. and Y. Shirai (1983). Object recognition using three-dimensional information. IEEE Trans. Pattern Anal. Machine Intell. 5, 353-361.