
Image and Vision Computing 19 (2001) 99–118 www.elsevier.com/locate/imavis

Encoding 3D structural information using multiple self-organizing feature maps

M. Takatsuka a,*, R.A. Jarvis b,1

a School of Computing, Curtin University of Technology, Perth, WA 6102, Australia
b Intelligent Robotics Research Center, Monash University, Clayton, Vic. 3168, Australia

Received 24 November 1998; revised 3 April 2000; accepted 28 April 2000

* Corresponding author. Tel.: +1-814-865-5642; fax: +1-814-863-7943.
1 Tel.: +61-3-9905-3470/3454.

Abstract

This paper describes a system which encodes a free-form three-dimensional (3D) object using Artificial Neural Networks. The types of surface shapes which the system is able to handle include not only pre-defined surfaces such as simple piecewise quadric surfaces but also more complex free-form surfaces. The system utilizes two Self-Organizing Maps to encode surface parts and their geometrical relationships. The authors demonstrated the use of this encoding technique on "simple" 3D free-form object recognition systems [M. Takatsuka, R.A. Jarvis, Hierarchical neural networks for learning 3D objects from range images, Journal of Electronic Imaging 7 (1) (1998) 16–28]. This paper discusses the design and mechanism of the multiple SOFMs for encoding 3D information in greater detail, including an application to face ("complex" 3D free-form object) recognition. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Encode; Range image; Self-organizing feature map

1. Introduction

When a person sees objects or scenery, how does his or her brain represent this information? This is probably one of the questions most frequently asked by scientists, and one which motivates them to explore the mechanisms of brain activity. Artificial Intelligence (AI) and Computer Vision (CV) researchers have developed various techniques to encode a three-dimensional (3D) object/world in order to build machine object recognition or image understanding systems. Most of these encoding techniques for 3D objects are classified into two categories according to whether a representation is a set of locally defined primitives (local representation) or a single description of a whole object (global representation). The global representation often has the form of a two-dimensional (2D) pattern so that simple matching schemes can be applied, while the local representation requires extensive searching for matching image and model primitives. Local representation, however, has the advantage of being robust to occlusion. Encoding methods based on local representation are also divided into further

small categories depending on the type of primitives used to describe objects. These primitives are 2D lines and vertices, 3D points, 3D edges, 3D surfaces, and 3D volumetric primitives. Methodologies of object encoding are classified and tabulated in Table 1.

1.1. Local representation

1.1.1. 2D Primitives

2D primitives are the simplest primitives. They are normally obtained through the segmentation of a 2D intensity image. The segmented primitives can be either boundaries or regions. This type of representation is only used in systems that recognize 3D objects via their 2D projections. Such systems do not use 3D features of objects in the recognition process. Hence, encoding methods using 2D primitives are not considered here. Some 3D object recognition systems, however, use 2D primitives to recover 3D information. For instance, Tomita et al. used 2D line segments in their stereo range finder [2]. By using 2D line segments, the complexity of finding the correspondence of elements between left and right images can be reduced.

1.1.2. 3D Points

A range image is a set of range values sampled on object surfaces. The range point is the simplest 3D primitive. One can encode a 3D object using a set of 3D points and their



Table 1
Object encoding methodology classification

Local representation:
  2D: segmented intensity image (lines, vertices, regions) [2]
  3D points: set of 3D points (Splash [3], Point signature [4,5])
  3D edges: boundary representation [6,7]
  3D surfaces: relational graph (planar [10], quadric [10], eight basic surfaces [9])
  3D volumetric primitives: relational graph (generalised cylinders [11,12], geons, superquadrics [13])

Global representation:
  Single intensity image; Fourier image; Eigenspace [14]
  Extended Gaussian image (EGI) [16]; Spherical attribute image (SAI) [17]; Aspect graph [15]

geometrical relationships. The task of recognition with this type of representation would be achieved by finding correspondences of 3D points between the observed range image and the model range image. However, finding correspondences for all range points is computationally expensive and is often impractical. Stein and Medioni [3] proposed the “splash” to represent the immediate surrounding surfaces of range points. Each range point is encoded with this feature. Chua and Jarvis also introduced a similar supporting feature called the Point Signature [4,5]. In contrast to Stein and Medioni’s approach, they used range points on less structured surfaces because local shape features for those points are computed more reliably.

1.1.3. 3D edges

Edge-based local representation describes objects or scenes using information obtained from depth or range discontinuities. If a range image is available in the form of the Monge patch (x, y, z(x, y)), range points which form edges can be found by a filtering process such as Laplacian filtering [6]. When a range image is obtained as a set of 3D triplets (x, y, z), edges can be geometrically calculated [7]. Horaud and Bolles used edge-based representations in their CAD-based recognition system 3DPO [6]. Range points, which are found to be part of edges by Laplacian filtering, are grouped into edge segments. Each edge segment is approximated by 3D lines or curves. In their system, information about surfaces is also used to support the computation of edges, the description of objects or scenes, and the recognition process. Tomita and Kanade also used an edge-based representation [7]. In their system, edge segments are considered to be boundaries of surfaces. Surfaces are detected by tracing computed edge segments. Each surface segment is ranked according to the complexity


of its shape. Objects are then described using those surface segments represented by edges. Encoding 3D objects using edge information is very effective when objects contain many line and/or arc-shaped depth discontinuities—for example, industrial parts. On the other hand, this type of representation lacks surface information within or between boundary edges. Therefore, it is not suitable for objects containing smooth edges, which are very difficult to detect, and free-form surfaces.

1.1.4. 3D surfaces

In most range finding systems, surfaces are measured as sets of range points; therefore, 3D features of surfaces are easily obtained. However, a method to describe a general free-form surface has not been developed yet, so surfaces are normally approximated by simpler surface types such as planar and quadric surfaces [8,9]. Oshima and Shirai [10] developed a recognition system targeting objects with planar and quadric curved surfaces. In their system, range points are first grouped into small regions called kernels. Surface segments are then obtained around those kernels using a region-growing method. A scene description is derived in the form of an attributed relational graph whose nodes correspond to segmented surfaces and whose branches represent adjacency of segments. Fan et al. [9] used a surface-based local representation for more complex objects such as toys (a car, an airplane, a telephone). In order to reliably extract surfaces from a scene, they used edge-based segmentation. Surfaces are extracted by extending and tracing edge segments in a manner similar to Tomita's method [7]. Segmented surfaces are approximated by quadric surfaces using the least squares method. The observed scene is described by an attributed relational graph. The graph is further divided into subgraphs


by examining boundary types: a jump (or limb) edge, a convex roof edge, and a concave edge. Encoding an object or a scene in the form of a surface-based attributed relational graph is very popular among 3D object recognition researchers. This is because surface information provides stronger features than edge information and is more easily obtained than volumetric information. The features of a graph itself can also be used for a simple matching strategy. Fan et al. [9] pre-calculated several features of subgraphs, such as the number of nodes, the number of planar surfaces, and the area of the largest surfaces, in order to select, at most, five candidate models.

1.1.5. 3D volumetric primitives

Volumetric local representation uses descriptions of volume parts of objects rather than the surface or edge parts. These volumetric descriptions form nodes in the graph representation, and their geometrical or hierarchical relationships are used for the branches of the graph. Generally, solid models such as cubes, prisms, and cylinders are considered to be 3D volumetric primitives in some geometric models such as the Boundary representation (B-Rep). Those volumetric primitives, however, are often described by using their edge or boundary information. In other words, the edges are the foundations of these models. Therefore, some geometric models like B-Rep can be categorized into the group of 3D edge primitives. The generalized cylinder (GC) is one well-known volumetric representation of 3D objects. It represents an object as changes to the cross-section along an axis or spine. The axis or spine is a curved line in 3D space and the cross-section is perpendicular to this line. Nevatia and Binford [11], and Brooks [12] encoded 3D objects using the GC representation in their object recognition systems. Both groups, however, used extracted 2D edges to construct 2D versions of GCs. The GCs obtained are then converted to 3D GCs under the assumption that volumetric parts are cylindrical. Generally, this representation is suited to, and is a sufficient representation for, simple man-made objects. In relation to the GC, another volumetric representation called superquadrics is also used in 3D object encoding. In the field of 3D object recognition, superquadrics, specifically the parameters of superquadrics, are reconstructed from sets of range points. This representation suffers from object self-occlusion unless objects are symmetric, and is only suited to recognition systems handling simple objects [13].

1.2. Global representation

The global representation addressed here is a method of encoding a 3D object with a single image. A 2D intensity image is a 2D projection of the illumination of the 3D object. This representation is easily obtained using relatively inexpensive equipment. Object recognition is, in this case, usually carried out using simple template


matching or a correlation measure method. However, 2D intensity images of 3D objects vary depending on the lighting environment and, most importantly, the viewpoint from which the images are obtained. Further, the 3D structural information of objects cannot be extracted and analyzed from a single 2D image. As a result, the system needs to use a number of images, obtained under different lighting and object posing conditions, for each object model. This results in a vast search space for the recognition process. By transforming the intensity image into other types of information such as a Fourier image, which is translation invariant, or an Eigenspace representation [14], which is invariant to fluctuations in illumination and magnification, the system may be able to reduce the search space. The aspect graph, which was originally proposed as the visual potential of an object by Koenderink and van Doorn in 1979 [15], is a method to encode a 3D object using a set of aspects which are characteristic views of the object. Even though the aspect graph has a structure like the attributed relational graph, this representation method is classified as a global representation because each aspect is a global representation of the object. Each node of the aspect graph corresponds to the aspect from a certain viewpoint, and the possibility of transiting from one aspect to another is represented by an edge. Global representations of an object as an image can also be generated from sets of 3D surface features such as surface normals and curvatures. The Extended Gaussian Image (EGI) [16] is one such 3D global representation. It is based on the Gaussian image representation, which maps all surface normals of the object surfaces onto the unit Gaussian sphere. The EGI is obtained by first tessellating the Gaussian sphere and then computing the surface normal densities of the tessellated patches. The EGI also suffers from changes of viewpoint. It is also sensitive to occlusion and can be used only for convex objects, because some convex and concave objects can have the same EGI. Another representation, which belongs to the class of spherical representations like the EGI, is the Spherical Attribute Image (SAI). The EGI has the fundamental problem that several parts of an object may be mapped onto the same point on the Gaussian sphere. Delingette et al. [17] proposed the SAI in order to achieve a one-to-one mapping between a spherical surface and an object surface. The one-to-one mapping is found by deformable surface mapping. Although a system with the SAI has the advantage of being able to handle non-convex objects, it has to carry out iterative and expensive computation to find the deformable surface mapping. Global-representation-based encoding techniques have the advantage of being able to use very simple matching schemes, because an object is usually encoded as a pattern or a feature vector. In this case, however, the knowledge available to the system is often restricted to a very low level (for example, pixels and sets of surface normals). Moreover, the 3D structural information of an object is often encoded only


implicitly. Therefore, it is very difficult to analyze an object in terms of parts and their relationships. In this work, a biologically inspired technique to encode 3D structural information is proposed. This technique utilizes Self-Organizing Feature Maps (SOFMs) to produce a global feature, in other words a pattern, derived from local features of components of an object. Therefore, even though a 3D object is represented by one global feature, each component of the feature explicitly represents 3D structural information.

2. Pre-processing 3D range images

In order to encode 3D objects based on parts and their relationships, those parts need to be extracted and various features of them need to be computed. The 3D objects used in this study consist of various types of surfaces, which include very simple planar surfaces as well as very complex free-form surfaces.

2.1. Local shape feature extraction

In this work an object is measured as a set of range points which represents points on the surfaces of the object. In order to segment and characterize free-form 3D surfaces, a local shape feature has to be able to represent free-form 3D surfaces quantitatively. Gaussian and mean curvatures are often used as local features by CV researchers to support segmentation [18]. They, however, characterize a surface only qualitatively. It is also desirable for the local features to be view-independent, in other words, rigid-motion invariant. For these reasons, a local feature called the "Surface/Sphere Intersection Signature" (SSIS) [1] is used in this study. The SSIS is one type of local feature whose derivation is based upon the Radial Decomposition (RD) technique. The SSIS is calculated as follows. At a given point p on a surface, the surface normal n is calculated as a reference normal. A sphere of radius r is then centered at the point p. A local surface region is then defined by a boundary, which is derived as the intersection of the surface and the sphere. Next, latitudes at the points along the boundary are calculated with respect to the reference normal. By plotting the latitudes $\phi_s$ against the corresponding longitudes $\theta_s$, the SSIS of the point p can be obtained. The SSIS at the ith point $p_i$, $s_{p_i}(\theta)$, represents the boundary of the local surface as a one-dimensional periodic function. Even though the SSIS only describes the boundary shape of the local surface, it can be used to indicate the underlying surface structure to which the point p belongs. In this work, Fourier descriptors of the SSIS are used as a local shape feature of a surface instead of the SSIS itself.

2.2. Segmentation and surface feature extraction

Image segmentation itself is a very complex problem in

CV and has various aspects worthy of study. The main idea of segmenting a 3D object into surface parts was extended from the "part theory" proposed by Hoffman and Richards [19]. If surfaces are segmented at the loci of the concave and convex discontinuities, they will be segmented into concave, planar and convex surfaces. The planarity of a surface is determined by the DC basis component of the SSIS [1]. By examining the DC basis components of the SSISs with respect to the threshold values that divide concave, planar and convex surfaces, all range points are assigned to one of these three surface types. It should be noted that the threshold values can be determined systematically from the level of noise in the range images. Once the surface types of all range points are determined, range points are grouped into surface parts by a region growing and splitting process [20]. After segmentation, the unary (geometric) information of each surface part and the binary information (geometrical relationships) among them are calculated in order to describe objects.

2.2.1. Local features of a surface part

To characterize segmented surface parts, the following features are computed for all surface parts. Each of these features constitutes a surface feature vector for later use.

• Centroid
• Area
• Average Fourier descriptors of SSISs
• Eigenvalues and eigenvectors from the covariance matrix
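As a rough illustration, the sketch below computes several of these unary attributes for one segmented surface part. It is illustrative only, not the authors' implementation; the point array, triangulation and per-point SSIS descriptors are assumed inputs.

```python
import numpy as np

def surface_part_features(points, triangles, ssis_descriptors):
    """Illustrative unary features of one segmented surface part (not the
    authors' implementation).

    points           : (n, 3) range points belonging to the part.
    triangles        : (m, 3) vertex indices of the triangular patches.
    ssis_descriptors : (n, 5) Fourier descriptors of the SSIS at each point.
    """
    centroid = points.mean(axis=0)
    # area: sum of triangular patch areas (half the cross-product norm per triangle)
    a, b, c = (points[triangles[:, k]] for k in range(3))
    area = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1).sum()
    # eigenvalues of the point covariance matrix, largest first (cf. Eq. (7) in Section 3.3)
    eigvals = np.linalg.eigvalsh(np.cov(points.T))[::-1]
    ssis = ssis_descriptors.mean(axis=0)   # average Fourier descriptors over the part
    return centroid, area, eigvals, ssis
```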

2.2.2. Binary features of surface parts

The following features are calculated as binary features for pairs of surface parts in order to describe their geometrical relationships.

• Centroid distance
• Surface normal angle difference
• Angle differences in principal directions

The details of the local features and binary features are described in Sections 3.2 and 5.2, respectively.

3. Encoding a 3D object as a set of surface parts

Recognizing 3D objects or building 3D models by parts is now a widely accepted approach. A variety of part types is used and there is no single, agreed concept of a part. In some systems, a range image is segmented into surfaces and is approximated by 3D volumetric primitives such as generalized cylinders [12], geons [21], and superquadrics [22]. In other systems, an object is simply represented by 3D surfaces [23] or 3D edges [24,25] obtained by the segmentation process. In this study, surface parts, which are


produced by the segmentation process described in the previous section, are used as 3D primitives. In this study a Self-Organizing Feature Map (SOFM) is used to encode 3D surface parts. Artificial neurons, which will be referred to as neurons in the rest of this paper, are trained to classify surfaces with the particular 3D structural features listed previously. With this SOFM, the 3D structural information of surface parts is encoded as states of neuron activities (Fig. 1).

Fig. 1. A hexagonal array of neurons in the SOFM.

3.1. Self-organising feature map

The SOFM is a set of artificial neurons which are ordered in $R^n$ space. A 2D array ($n = 2$) is the most common map and is used to map an input signal in $R^m$ ($m > n$) space onto the 2D space. A SOFM consists of two layers. One is an input layer into which input feature vectors will be fed, and the other is a 2D competitive layer which orders the neurons' responses spatially. Neurons can be arranged on a rectangular map so that they can be implemented using a simple 2D data array. A hexagonally arranged neuron map is, however, often used because it has the advantage that the Euclidean distances between adjacent neurons are equal for all six nearest-neighbor neurons [26].

3.2. Learning algorithm [26]

The SOFM is trained without teacher signals (unsupervised), unlike some other ANNs in which supervised training is used, such as backpropagation networks. The learning algorithm used in this study is the same as Kohonen's algorithm [27] and is described as follows. Let $\mathbf{x}_i = [x_1\ x_2\ \dots\ x_n]^T \in R^n$ be the ith input feature vector which will be fed into the neurons in the input layer. Because the task of the input neurons is just to pass the input feature vector onto the competitive layer, the input vector $\mathbf{x}_i$ will be used as the output vector of the input layer. The neurons in the input layer and the neurons in the competitive layer are fully connected, and the connection weights of the competitive neuron j are denoted by $\mathbf{w}_j = [w_{j1}\ w_{j2}\ \dots\ w_{jn}]^T \in R^n$. The jth competitive neuron computes the similarity of $\mathbf{x}_i$ and $\mathbf{w}_j$ as the output $y_j$:

$$y_j = \|\mathbf{x}_i - \mathbf{w}_j\| = \sqrt{\sum_{k=1}^{n} (x_{ik} - w_{jk})^2}. \qquad (1)$$

There are several ways to measure the similarity [27]; the Euclidean distance is used in this study. By training the connection weights according to the following algorithm, the competitive neurons acquire appropriate weights so that competitive neurons adjacent to each other respond to input vectors that are close in the input feature vector space. First, all competitive neurons compute the match of the input $\mathbf{x}_i$ with their connection weights $\mathbf{w}_j$. Then, the best-matching neuron and its adjacent neurons update their connection weights with the following Hebbian learning rule:

$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \alpha(t)\, h(t)\, [\mathbf{x}_i(t) - \mathbf{w}_j(t)], \qquad (2)$$

where t is the discrete time index, $\alpha(t)$ is a learning rate and $h(t)$ is a neighborhood kernel. The learning rate is a monotonically decreasing function of time ($0 < \alpha(t) < 1$) defined as:

$$\alpha(t+1) = \alpha(t)\left(1 - \frac{t}{T}\right), \qquad (3)$$

where T is the training period. The neighborhood kernel defines the adjacent region of the best-matching neuron c. All neurons within this region update their connection weights. The neighborhood kernel is also a monotonically decreasing function of time:

$$h(t) = \begin{cases} 1 & \text{for } \|\mathbf{r}_j - \mathbf{r}_c\| \le N_c(t) \\ 0 & \text{otherwise} \end{cases}, \qquad (4)$$

$$N_c(t+1) = N_c(t)\left(1 - \frac{t}{T}\right), \qquad (5)$$

where $\mathbf{r}_j$ and $\mathbf{r}_c$ are the position vectors of the jth and the cth neuron, respectively, in the 2D array, so that $\|\mathbf{r}_j - \mathbf{r}_c\|$ represents the Euclidean distance between those neurons in the array. $N_c(t)$ is the radius of the kernel, which defines the kernel region. After this learning process, the neurons ordered in the 2D array preserve the topological information of the original feature space. Neurons geometrically close to each other in the map will represent input features which are close to each other in the input feature space. In practice, the training data $\mathbf{x}_i$ ($i = 1, \dots, n$) are iteratively used during the learning period T. After the training, input feature vectors which have continuous values as elements are topologically mapped onto the 2D map.
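The learning rule of Eqs. (1)–(5) can be sketched compactly in Python. This is a minimal illustration rather than the authors' implementation: a square grid stands in for the hexagonal array of Fig. 1, and the learning rate and kernel radius follow the common linear-decay reading of Eqs. (3) and (5).

```python
import numpy as np

def train_sofm(X, M=10, alpha0=0.05, Nc0=8.0, T=10000, seed=0):
    """Minimal Kohonen SOFM sketch following Eqs. (1)-(5) (illustrative only).

    X : (num_samples, n) array of input feature vectors.
    M : the map is an M x M grid of competitive neurons (a square grid is
        used here for simplicity instead of the hexagonal array of Fig. 1).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.uniform(X.min(axis=0), X.max(axis=0), size=(M, M, n))  # weights w_j
    rows, cols = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
    R = np.stack([rows, cols], axis=-1).astype(float)              # positions r_j

    for t in range(T):
        x = X[rng.integers(len(X))]                    # pick a training vector x_i
        alpha = alpha0 * (1.0 - t / T)                 # decaying learning rate, cf. Eq. (3)
        Nc = Nc0 * (1.0 - t / T)                       # shrinking kernel radius, cf. Eq. (5)
        d = np.linalg.norm(W - x, axis=-1)             # Eq. (1): y_j = ||x_i - w_j||
        c = np.unravel_index(np.argmin(d), d.shape)    # best-matching neuron c
        h = np.linalg.norm(R - R[c], axis=-1) <= Nc    # Eq. (4): 1 inside the radius N_c(t)
        W += alpha * h[..., None] * (x - W)            # Eq. (2): Hebbian update
    return W
```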


Fig. 2. Examples of surface parts and their feature vectors.

This SOFM is used in order to analyze segmented surface parts of objects. In addition, it transforms the feature vectors of surface parts into 2D vectors that represent the positions of neurons.

3.3. Input signals for the self-organizing feature map

As a result of the preprocessing of the 3D range data, several geometrical features of each surface part are computed as its attributes. Input feature vectors $\mathbf{x}_i \in R^9$ are constructed from the Fourier descriptors of the SSIS, three eigenvalues and an area, as follows:

$$\mathbf{x}_i = (S_i(0),\ S_i(1),\ S_i(2),\ S_i(3),\ S_i(4),\ e_{1i},\ e_{2i},\ e_{3i},\ A_i). \qquad (6)$$

The Fourier descriptors of the SSIS $S_i(k)$, where k is a frequency, are obtained by simply averaging Fourier descriptors of SSISs over the points on the surface part. A sphere with a very small radius is chosen to extract SSISs as "local" shape features of surfaces. Hence, range points which are segmented into the same surface by range point grouping and region splitting are assumed to have very similar SSISs. Therefore, simple averaging of Fourier descriptors of SSISs is sufficient to describe surface parts. This local feature describes the underlying structures of the surface part. Each surface part is approximated by a set of small triangular surface patches. The area $A_i$ of each surface part is given by the summation of areas of all triangular patches. Each surface normal on the surface $R_i$ is calculated using these triangular patches so that the length of the normal represents the surface area of patches. Hence, the area $A_i$ of the surface part $R_i$ is computed as the summation of lengths of all surface normals [28]. Three eigenvalues $\{e_1, e_2, e_3\}$ are derived from the covariance matrix C as:

$$C = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{p}_i - \bar{\mathbf{p}})\cdot(\mathbf{p}_i - \bar{\mathbf{p}})^T, \qquad (7)$$

where n is the total number of range points in the surface part, $\mathbf{p}_i = \{x_i, y_i, z_i\}$ is a range point at position i in the surface part and $\bar{\mathbf{p}} = (1/n)\sum_{i=1}^{n}\mathbf{p}_i$ is the mean position vector of the component. The eigenvector associated with

the smallest eigenvalue indicates the direction of the surface normal of the surface component. The other two represent the two principal directions. Therefore, these eigenvalues indicate how the points are distributed in 3D space. The three eigenvectors e1, e2 and e3 corresponding to each eigenvalue are also stored as attributes. The eigenvectors are used to interpret the orientations of surface parts. The area of the part and the three eigenvalues help to discriminate surface parts which have similar surface structures. If range images are dense enough to extract the boundary of the surface part consistently and reliably, boundary information can be added to the feature vector. In this research, the range finder produces 64 × 64 range images and the resolution is not sufficiently fine to include the boundary information reliably. Fig. 2 shows examples of feature vectors obtained from the mask object. In particular, feature vectors from surface parts around the eyebrow, the nose and the cheek region are shown. All three of the parts are convex surfaces. The surface part from the nose region, however, has a distinct DC component of the SSIS's Fourier descriptors and a large eigenvalue e1. The other two surface parts have similar Fourier descriptors and areas, but there is a significant difference in the eigenvalue e2.

3.4. Scaling of input signals' components

The SOFM will be trained to map the $R^9$ space onto its $N^2$ ($N = 1, 2, \dots, M$) space by using the input signals described above, in order to preserve the structural differences in the input vector space. Input feature vectors, however, consist of elements which might be represented at different scales. If that is the case, input signal elements which are represented at larger scales than others are dominant in determining the ordered regression of the weight vectors of each neuron. Therefore, if there are significant differences in the scales of each element, those scales have to be normalized in order to obtain "good" mappings. The quality of the mapping can normally be measured by a quantization error. However, we found that it is difficult to judge the effect of the normalization using the quantization error. In


this study, the goodness of the mapping is analyzed by visually examining how each dimension is mapped onto the $N^2$ space (see Section 4.3). In this study, each element $x_{ij}$ ($i = 0, \dots, N$ and $j = 0, \dots, 8$) of the feature vector $\mathbf{x}_i$ is rescaled by the standard deviation $s_j$ of the elements of the feature vectors in the training data, as:

$$x'_{i,j} = \frac{x_{ij}}{s_j}, \qquad (8)$$

$$s_j = \sqrt{\sum_{i=1}^{N} \frac{(x^t_{i,j} - \bar{x}^t_j)^2}{n-1}}, \qquad (9)$$

where $x^t_{i,j}$ is the jth element of the training feature vector $\mathbf{x}^t_i$ and $\bar{x}^t_j$ is the mean of the jth element over the training data. The effect of this scaling was experimentally tested and will be discussed later.
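A short sketch of this per-component rescaling (Eqs. (8) and (9)), assuming the training feature vectors are stacked into a NumPy array; illustrative only.

```python
import numpy as np

def rescale_features(X_train, X=None):
    """Per-component rescaling of Eqs. (8) and (9): divide each element by the
    standard deviation of that component over the training data (a sketch)."""
    s = X_train.std(axis=0, ddof=1)      # s_j, sample standard deviation per component
    return (X_train if X is None else X) / s
```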

3.5. Encoding a 3D object as a set of surface parts

Fig. 3. Connection of the reset neuron, competitive neurons and SOFM's buffer neurons. The reset neuron is connected to all buffer neurons, but only the connections between the reset neuron and buffer neurons in the front row are shown in this example.

The SOFM which has been described in Section 3.1 can take only one feature vector of a surface part at a time. After analyzing (classifying) all feature vectors of all surface parts, the system will have several classification results, one per surface part. In order to encode an object, the system needs a process which produces a global feature vector or pattern of the object from the set of local feature classification results. A reset neuron and buffer neurons are added to the standard SOFM in order to give the SOFM the capability of extracting the global feature of the object. Fig. 3 shows an example of a 2D SOFM and the interconnections among the competitive neurons, buffer neurons and the reset neuron (interconnections from the reset neuron to only the front row of buffer neurons are shown in the figure). The buffer neurons are arranged on the same grid map as the competitive neurons. The buffer neuron and the competitive neuron which are placed at the same position in the map are connected to each other. The buffer neuron also has a self-feedback loop. The reset neuron is fully connected to the buffer neurons. The update rule to compute the activity $v^b_i(t)$ of the ith buffer neuron at the discrete time t is

$$v^b_i(t) = \neg v^r(t) \wedge (v^c_i(t) \vee v^b_i(t-1)), \qquad (10)$$

where $v^r(t)$ is the activity of the reset neuron, and $v^c_i(t)$ is the activity of the ith competitive neuron. The competitive neuron's activity $v^c_i(t)$ is set to 1 if the neuron is the winner of the competition, i.e. the neuron whose connection weights best match the input feature vector; otherwise it is set to zero. With the above update rule, this modified SOFM processes the given N feature vectors of N surface parts as

Fig. 4. Global feature pattern produced by the SOFM as a combination of surface parts.


Fig. 5. Setup of range image acquisition for training and test data.

follows:

1. At the discrete time t = 0, the activity $v^r(0)$ of the reset neuron is set to 1. The activities of the buffer neurons are calculated according to Eq. (10). Because $v^r(0) = 1$, all buffer neurons' activities are set to 0.
2. Before the feature vectors $\mathbf{x}_m$ ($m = 1, \dots, N$) are sequentially examined by the SOFM, the reset neuron's activity $v^r(t)$ is set to 0.
3. A feature vector $\mathbf{x}_m$ is fed into the SOFM.
4. According to the winner-take-all mechanism, one competitive neuron $n^c_i$, whose connection weights $\mathbf{w}_c$ best match the feature vector $\mathbf{x}_m$, sets its activity to 1, and the rest of the competitive neurons have an activity of zero.
5. The buffer neurons then compute their activities according to Eq. (10).
6. Processes 2 to 5 are repeated until all feature vectors have been examined.
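Taken together, Eq. (10) and steps 1–6 amount to OR-ing the winner-take-all patterns of all surface parts of an object. A minimal sketch, assuming a trained weight array such as the one returned by the earlier train_sofm example:

```python
import numpy as np

def encode_object(W, part_features):
    """Binary global feature pattern (buffer-neuron activities) of one object,
    following Eq. (10) and steps 1-6 (illustrative sketch).

    W             : (M, M, n) trained SOFM weight array.
    part_features : (N, n) feature vectors of the object's N surface parts.
    """
    M = W.shape[0]
    buf = np.zeros((M, M), dtype=bool)          # step 1: reset neuron clears the buffers
    for x in part_features:                     # steps 2-6: reset neuron off, feed each part
        d = np.linalg.norm(W - x, axis=-1)
        winner = np.unravel_index(np.argmin(d), d.shape)
        P = np.zeros((M, M), dtype=bool)        # P_m: winner-take-all activity pattern
        P[winner] = True
        buf |= P                                # Eq. (10): union of the patterns P_m
    return buf
```

Applied to each object's surface-part vectors, this produces a binary M × M pattern of the kind shown in Fig. 13.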

Fig. 4 illustrates the concept of the global feature as a set of surface parts. The feature vector $\mathbf{x}_m$, which represents the mth surface part, is classified by the SOFM at stage 4 and is represented by the winning neuron. For each feature vector, the SOFM produces a corresponding classification result as an activity of the winning neuron. Let a pattern $P_m$ be the whole set of competitive neurons' activities after the feature vector $\mathbf{x}_m$ of the mth surface part has been classified. Each buffer neuron computes the union of the competitive neuron's classification results over the feature vectors of all surface parts. This is equivalent to computing the union of the patterns $P_m$ ($m = 1, \dots, N$). The neuron activities of the buffer neurons ($O_i$) form a new pattern which summarizes the results of the surface part classifications. The activity pattern $O_i$ of the buffer neurons describes the global feature of the object as a combination of surface parts. Therefore, a 3D object is encoded in the form of a binary image in terms of its surface parts.

4. Experimental results of encoding 3D objects as a set of surface parts

4.1. Input range images and surface parts

An experimental setup for range image acquisition is shown in Fig. 5. The camera and the light stripe projector

Fig. 6. Range images ((a)–(g)) of objects used in the experiment. (a) Polyhedron. (b) Dodecahedron. (c) Icosahedron. (d) Pipe. (e) Duck. (f) Mask. (g) Chimp_Mask.



are placed approximately 950 mm away from the center of the measured object. This defines an observation sphere whose center is at the center of the object and whose radius is r = 950 mm. Two circles, whose geodesic radii are $r_{small} = 950(\pi/12)$ and $r_{large} = 950(\pi/6)$, respectively, are defined on the observation sphere. Viewpoints from which to acquire range data are set on those two circles as shown in Fig. 5. Range data used for training and testing the SOFM are obtained at the viewpoints indicated by the black and white circles, respectively. Eight range images were taken of each object in order to train the SOFM. In Fig. 6, range images of seven objects (Polyhedron, Dodecahedron, Icosahedron, Pipe, Duck, Mask and Chimp_Mask) are displayed in (a)–(g). The training data consist of 606 feature vectors of all surface parts in the training range images. In order to rescale all feature vectors in both the training and test data, the standard deviations $s_j$ ($j = 0, \dots, 8$) of each feature vector component were computed using the training data; they are listed in Table 2. The standard deviations of the three eigenvalues and the area are larger than those of the five Fourier descriptors of the SSIS by three to four orders of magnitude. Hence, a good feature map will not be formed without rescaling, because the ordered regression of the weight vectors would depend heavily on the values of the three eigenvalues and the area.

Table 2
Standard deviations of each element of the training input signals

S_i(0): 4.423 × 10⁻²
S_i(1): 1.414 × 10⁻²
S_i(2): 1.562 × 10⁻²
S_i(3): 0.516 × 10⁻²
S_i(4): 0.279 × 10⁻²
e_1i: 19.896
e_2i: 127.933
e_3i: 272.337
A_i: 1521.256

4.2. Training the SOFM

Training the SOFM, whose map dimension is M × M, is carried out in two stages [29]. The aim of the first stage of the training is to roughly order the weight vectors of each neuron in the input vector space. For the first stage, the initial learning rate α(0) is set to 0.05 and the initial size of the neighborhood kernel $N_c$ is set to 80% of the map dimension M. The number of training steps is 10 000. During the second stage of the training, the approximately ordered neurons are fine-tuned. Because all neurons are already approximately ordered, the weight vectors of the neurons need not be modified significantly. Therefore, the initial learning rate α(0) of the second training stage is set to 0.02. The initial size of the neighborhood kernel $N_c$ is set to 20% of the map dimension M in order to increase the stiffness of the map, so that the approximate ordering created in the first stage does not change. The number of steps in the second training stage is 100 000. If the number of types of shape structures were already known, the SOFM could be designed so that it has the same number of neurons as the number of surface types. However, this parameter is not known in this experiment. Therefore, six different sizes of SOFM (M = 5, 10, 15, 20, 25, 30) were tested. Training performance was evaluated by the average quantization error of the SOFM. The quantization errors were computed for all reference vectors

Fig. 7. Training performance: average quantization errors of the SOFM, which is an M × M two-dimensional map: (a) 1st learning stage with the initial learning rate α(0) = 0.05 and the initial radius of the neighborhood kernel N_c = 0.8M; (b) 2nd learning stage with the initial learning rate α(0) = 0.02 and the initial radius of the neighborhood kernel N_c = 0.3M.


Fig. 8. Surface parts of each object and their labels used in feature map analysis.

in the SOFM using the training data. The average quantization error was then calculated by averaging the quantization errors of all reference vectors. Fig. 7 shows the training performances of the six different SOFMs in the two training stages. As shown in the figure, the average quantization errors decrease as a function of the number of steps. This means the weight vectors of the neurons are gradually stabilized in the input feature vector space. Note that the map of size 5 × 5 has a larger error at the end of the training than the other five maps. This is because the number of neurons is much smaller than the number of surface parts in the training set, which is equivalent to setting a very large tolerance on the similarity measure. In the case of a larger map (a 25 × 25 or 30 × 30 map), the number of neurons is larger than the number of training samples. Hence each training sample can be represented by at least one reference vector. However, we assumed that some training samples are very similar and form clusters. Thus we expect the SOFM to find neurons each of which represents a group of training samples. The larger map also has a disadvantage in terms of execution speed when it is

Fig. 9. Labelled distance map of the SOFM trained with original feature vectors. Labels of surface parts are written at the positions where corresponding neurons are placed.

implemented on a computer with a Von Neumann architecture. In the rest of the experiment, therefore, the SOFM of size 10 × 10, the smallest of the remaining five maps, was used.

4.3. Feature map analysis

After training, each neuron in the SOFM should respond to a certain surface shape structure. Moreover, adjacent neurons should be activated by similar surface shapes. Here, the trained SOFMs (one trained with non-scaled feature vectors and the other trained with rescaled feature vectors) are analyzed to see which surface parts are similar, how the $R^9$ space is mapped onto the $N^2$ space and how the rescaling of the feature vector components affects the mapping result. In order to conduct the feature map analysis, several surface parts are selected from the training data. Each selected surface part is labeled. The labels themselves do not have a particular meaning, but they are needed to indicate which neuron responds to which surface part. Fig. 8 shows the selected surface parts and their labels. In the map analysis, the SOFM is first calibrated by assigning the labels of the surface parts to their corresponding neurons. The map analysis is then carried out by examining how those surface parts are represented by neurons.

4.3.1. The SOFM without scaling feature vector components

The SOFM which was trained with the original (unscaled) feature vectors is analyzed in order to compare it to training with rescaling. The trained SOFM is calibrated with the feature vectors of the selected surface parts. For each surface part, the best-matching neuron, which has the weight vector most similar to the feature vector of the part, is found and labeled. Fig. 9 illustrates a labeled distance map of the SOFM trained without rescaling. The distance map displays the distances between the connection weights of adjacent neurons in a gray scale [30–32]. Planar surfaces (Pop1, Pop2, Pop3, Pop4, Itri and Dop) are mapped onto the right-hand side of the map. Convex surfaces (Piedg1, Pedg1, Iedg and Doedg), which have


Fig. 10. Maps displaying the values of weight vector components. The SOFM displayed here was trained without rescaling feature vector components.

sharp edges, are found around the middle of the bottom half. Neurons in the top-left quarter region tend to respond to small smooth convex surfaces (MiL, Dbeak, MebR, MebL, Mnos, etc.). It should be noted that concave surfaces (CnhR, CeR, Pedg2, Piedg2, etc.) are also found in this region and large smooth convex surfaces share the region with planar surfaces near the bottom-right corner. These

conflicts can be clearly seen by displaying the values of each weight vector component. Fig. 10 shows the maps displaying the values of the weight vector components using a gray scale. As a value increases, the color representing it becomes lighter. As explained previously, the standard deviations of the three eigenvalues and the area are much larger than


Fig. 11. Maps displaying the values of weight vector components. The SOFM displayed here was trained with rescaled feature vector components.

those of the Fourier descriptors of the SSIS (see Table 2). Therefore, the eigenvalue and area components control the development of the mapping. Indeed, in the maps of those four components (Fig. 10(f)–(i)), the neurons are regularly ordered. This can be observed as a regular gradient in gray scale. Neurons around the bottom-right corner respond to surfaces which have large areas and volumes, and the area and the volume of the surface gradually decrease towards the top-left corner. The maps of the Fourier descriptors of the SSIS, on the other hand, show that there are clusters, but there is no overall regularity as seen in the maps of the eigenvalues and the area. There are

and volumes, and the area and the volume of the surface gradually decrease towards the top-left corner. Maps of Fourier descriptors of the SSIS, on the other hand, show that there are clusters, but there is no overall regularity as seen in the maps of the eigenvalues and the area. There are no regular gradients of gray scale in the maps.


4.4. The SOFM with scaling feature vector components

Fig. 11 shows the maps which display the values of the weight vector components trained with rescaled feature vectors. Contrary to the case of training without rescaled feature vectors, all maps show that the neurons are regularly arranged so that the values of each component change gradually. For instance, the maps of the first basis component of the SSIS, the three eigenvalues and the area show the values decreasing from the region near the top-left corner towards the bottom-right corner. The other four basis components of the SSIS have larger values around the top-right corner and smaller values towards the bottom-left region. The effect of scaling the feature vector components can be clearly seen in the labeled distance map shown in Fig. 12. In this map, concave surfaces and planar surfaces no longer share regions with smooth convex surfaces. Neurons which respond to the planar surfaces are situated on the left-hand side, and concave surfaces can be found near the bottom-right corner. The rest of the map is occupied by neurons which respond to convex surfaces. Neurons sensitive to convex surfaces with sharp edges are found in the top-left region, which is a subregion of the convex region. Moreover, the volume and the area of the surface parts become smaller along trajectories from the top-left corner to the other three corners. The SOFM is effectively trained so that each neuron is activated by a particular surface shape. In other words, each neuron locally represents a class of surface parts. Furthermore, by rescaling each component of the input feature vector, all individual features contribute equally to the training process, and that results in a good mapping from the input feature vector space $R^9$ to the 2D neuron map without conflicts.

Fig. 12. Labelled distance map of the SOFM trained with rescaled feature vectors.

Table 3
Parameters for training the 10 × 10 SOFM

            Learning rate α    Kernel size N_c    Learning steps
1st stage   0.05               8                  10 000
2nd stage   0.02               3                  100 000

4.6. An object as a set of surface parts

Range images of six more objects were added to the training set. Objects used in the experiment were chosen based on the following three characteristics: (1) simple objects which consist of simple surfaces such as planar and cylindrical surfaces; (2) relatively complicated objects which consist of piecewise smooth curved surfaces; (3) complex objects which share most of their surface parts but whose spatial configurations of those surface parts are different. The SOFM was trained using 1053 feature vectors of surface parts extracted from the training set of range images. The training parameters are listed in Table 3. Each element of all feature vectors was rescaled before training. After training the SOFM, the global features of each object, which are patterns of the buffer neurons' activities, were obtained by feeding each feature vector of the object into the SOFM. In Fig. 13, range images and global features of fourteen objects (Polyhedron, Dodecahedron, Icosahedron, Pipe, Duck, Mask, Dino_Mask, Chimp_Mask, Model1, Model2, Model3, Model4, Model5 and Model6) are displayed.

5. Encoding a 3D object as an articulated set of surface parts

5.1. System structure

The 3D object encoding system described here consists of six stages, as shown in Fig. 14. The first three stages, from the range image acquisition process to segmentation, are the same processes as those used in the system presented in the previous section. The SOFM at the fourth stage is used only for surface part analysis and not to produce global features as was done previously. This network, which will be referred to as the "SOFM_SP" (short for "SOFM for Surface Parts"), is used to classify surface parts and encode them in the form of x–y coordinates in the map. At the fifth stage, the results of the surface part analysis for a pair of surface parts are combined with their geometrical relationships. At the next (sixth) stage, the new feature vectors of pairs of surface parts are used to train the neurons in another SOFM, which will be referred to as the "SOFM_GR" (short for "SOFM for Geometrical Relationships"). After the learning process, each neuron becomes a classifier of a pair of surface parts with a particular geometrical relationship. This SOFM_GR is used not only for the analysis of pairs


Fig. 13. Range images and Surface Parts combinations which are represented as firing patterns of the SOFM.


of surface parts but also for producing global features of objects, including the geometrical relationships of surface parts, in the same manner as the SOFM in the previous system.

5.2. Encoding 3D structural information

In this new encoding system, the global features of objects now include information about the geometrical relationships between surface parts. Here, the details of how those global features are extracted are described. As explained earlier, the geometrical relationship of a surface part $R_i$ is measured with respect to another surface part $R_j$. It is also possible to measure the geometrical relationships of a surface part with respect to a group of other surface parts. We selected three types of geometrical relationships. The centroid distance ($d_{ij}$) is the Euclidean distance between centroids. It describes how far apart surface parts are. The angle between the surface normals of two surface parts ($a_{1,ij}$) is also calculated. It indicates the relative orientation of the tangent planes of the two surface parts. The principal directions of a surface part are computed as the two eigenvectors associated with the two largest eigenvalues. Angle differences in the principal directions ($a_{2,ij}$ and $a_{3,ij}$) are calculated for the pairs of eigenvectors. In this section, a pair of surface parts and their geometrical relationships are considered to be the primitive structural information from which to compose global features.


The first task in constructing each global feature including the structural information is to recognize a pair of surface parts. Surface parts are first analyzed by the SOFM_SP at the fourth stage. The SOFM_SP is the same SOFM as presented in Section 3. Each neuron in this network is trained to represent a particular surface part. The map also develops the topological mapping from the input feature vector space to the 2D space. Therefore, two similar input feature vectors will activate two neurons which are closely placed on the 2D map. With this SOFM_SP, the surface parts' feature vectors $\mathbf{f}_i$, which consist of nine unary attributes, are now represented by 2D feature vectors $\mathbf{f}'_i$ (= $\{x_i, y_i\}$). After the surface parts are classified and encoded into 2D feature vectors, the next step is to construct the primitive structural information, which is a set of geometrical relationships of a pair of surface parts. The new feature vector $\mathbf{g}_{ij}$ is generated from the 2D feature vectors ($\mathbf{f}'_i$ and $\mathbf{f}'_j$ of surface parts $R_i$ and $R_j$) and their geometrical relationships ($d_{ij}$, $a_{1,ij}$, $a_{2,ij}$, and $a_{3,ij}$). The new feature vectors $\mathbf{g}_{ij}$ ($i = 0, \dots, N$ and $j = 0, \dots, N$, where N is the number of surface parts) are used for training the SOFM_GR. The SOFM_GR is the same SOFM introduced in the previous section. It has a reset neuron and buffer neurons which are associated with the competitive neurons. At the end of the learning process, the competitive neurons in this SOFM_GR have connection weights which represent certain pairs of surface parts with particular geometrical relationships. The buffer neurons, at run-time, store the neuron activities of the corresponding competitive neurons by which a set of feature vectors $\mathbf{g}_{ij}$ is analyzed. As a result, the SOFM_GR encodes 3D structural information and produces a global feature vector which describes an object in terms of the pairs of surface parts, with particular geometrical relationships, that are required to construct the object.
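A sketch of the relational features and of assembling $\mathbf{g}_{ij}$ is given below. It assumes each surface part already carries its centroid, surface normal, two principal directions and the SOFM_SP winner coordinates (x_i, y_i); the dictionary layout, the helper names and the unsigned treatment of the principal-direction angles (eigenvector signs are ambiguous) are assumptions, not the authors' code.

```python
import numpy as np

def direction_angle(u, v, unsigned=False):
    """Angle between two 3D direction vectors; 'unsigned' ignores the sign
    ambiguity of eigenvector directions."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    if unsigned:
        c = abs(c)
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def relational_feature(part_i, part_j):
    """g_ij for one pair of surface parts: SOFM_SP winner coordinates plus
    d_ij, a_1,ij, a_2,ij and a_3,ij (a sketch, not the authors' code).

    Each part is assumed to be a dict with keys 'map_xy' (winner coordinates
    in the SOFM_SP), 'centroid', 'normal' and 'principal' (two eigenvectors).
    """
    d_ij = np.linalg.norm(part_i["centroid"] - part_j["centroid"])   # centroid distance
    a1 = direction_angle(part_i["normal"], part_j["normal"])         # normal angle difference
    a2 = direction_angle(part_i["principal"][0], part_j["principal"][0], unsigned=True)
    a3 = direction_angle(part_i["principal"][1], part_j["principal"][1], unsigned=True)
    xi, yi = part_i["map_xy"]
    xj, yj = part_j["map_xy"]
    return np.array([xi, yi, xj, yj, d_ij, a1, a2, a3])
```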

6. Experimental results on encoding a 3D object as an articulated set of surface parts

6.1. Training the SOFMs

Fig. 14. A 3D object encoding system with neural networks. The neural networks contain two Self-Organizing Feature Maps (SOFM_SP and SOFM_GR).

Experiments on 3D object encoding were carried out with the same fourteen objects used before. The training data set contains, for each object, eight range images obtained from different viewpoints. The system used in this experiment consists of two ANNs: the SOFM_SP and the SOFM_GR. The SOFM_SP for classifying surface parts was built with a 10 × 10 array of neurons. The SOFM_GR, which is used to classify pairs of surface parts with geometrical relationships, was constructed with a 15 × 15 array of neurons. The size of this neural network was arbitrarily chosen. The parameters used in the learning processes for both networks are tabulated in Table 4. The SOFM_SP was trained with 1053 feature vectors of surface parts in the same way as described in the previous


Table 4
Parameters for training the SOFM_SP (M_sp × M_sp map: M_sp = 10) and the SOFM_GR (M_gr × M_gr map: M_gr = 15)

            Learning rate α    Kernel size N_c          Learning steps
1st stage   0.05               0.8 M_sp (or M_gr)       10 000
2nd stage   0.02               0.3 M_sp (or M_gr)       100 000

section. The SOFM_GR was trained with the feature vectors $g_{ij}$ of pairs of surface parts. As previously mentioned, there are three types of relational features extracted from the segmented range images. Consequently, there are seven combinations,

$$\sum_{k=1}^{3} \binom{3}{k} = 7,$$

of these relational features. Three typical ones are listed below in order to form a feature vector $g_{ij}$ representing a pair of surface parts.

1. $g_{ij} = \{x_i, y_i, x_j, y_j, d_{ij}\}$
2. $g_{ij} = \{x_i, y_i, x_j, y_j, d_{ij}, a_{1,ij}\}$
3. $g_{ij} = \{x_i, y_i, x_j, y_j, d_{ij}, a_{1,ij}, a_{2,ij}, a_{3,ij}\}$

Here $(x_i, y_i)$ and $(x_j, y_j)$ are the position vectors of the ith and the jth neurons in the SOFM_SP which, respectively, correspond to the ith and the jth surface parts, $d_{ij}$ is the centroid distance, $a_{1,ij}$ is the surface normal angle difference, and $a_{2,ij}$, $a_{3,ij}$ are the angle differences in principal directions. Consequently, the SOFM_GR can produce three different types of global features (binary images) of objects. Fig. 15 shows three representations in binary images of the chimpanzee mask (Chimp_Mask) object using the three types of features.
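The three variants can be read as nested prefixes of the full pair feature vector. The following sketch builds an SOFM_GR training set from all pairs of distinct surface parts under that reading; whether pairs with i = j are included is not stated in the text, so they are skipped here, and pair_feature is any function returning the full eight-element vector (such as the hypothetical relational_feature above).

```python
import numpy as np

def sofm_gr_training_set(parts, pair_feature, variant=3):
    """Training vectors g_ij for the SOFM_GR, built from all pairs of distinct
    surface parts (illustrative sketch).

    variant 1 keeps the first 5 elements, variant 2 the first 6, variant 3 all 8,
    matching the three feature vectors listed above.
    """
    length = {1: 5, 2: 6, 3: 8}[variant]
    G = [pair_feature(pi, pj)[:length]
         for i, pi in enumerate(parts)
         for j, pj in enumerate(parts) if i != j]
    return np.asarray(G)
```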

6.2. Encoding a 3D object as an articulated set of surface parts

In the experiments presented here, an object is described as an articulated set of surface parts rather than a simple set of surface parts. Moreover, all geometrical relationships among surface parts are utilized to produce feature vectors of the pairs of surface parts. By examining all pairs of surface parts with their relational features using the SOFM_SP and SOFM_GR, the system encodes the 3D structural information of each object into a binary image. Fig. 16 shows the range images of the fourteen objects and their encoded binary images.

6.3. Encoding human faces

As an application of this 3D object encoding system, experiments on encoding human faces were carried out. Seven human faces were used for the experiment. The left-hand side of Fig. 17(a)–(g) shows rendered range images of all seven faces. For each face, 26 range images obtained from different viewpoints were prepared. The range images were first submitted to the local feature extraction and segmentation processes. The segmented surface parts were then analyzed by the SOFM_SP and transformed into global features by the SOFM_GR. The parameters of the ANNs' training process are presented in Table 5. Because faces are more complex objects than the objects used in the previous experiment, ANNs of larger size were used in order to represent inductively the greater number of surface parts and relational features compared with those of the previous experiments. The results of encoding are shown in Fig. 16. With this system, a 3D object is now encoded and represented by a simple binary image instead of a complex data structure such as a graph. By using this encoding technique,

Fig. 15. Global features with different relational features. (a) The rendered range image of the Chimp_Mask object, (b) surface parts, (c), (d), (e) global features with different relational features. The relational features used in (c), (d) and (e) are {d_ij}, {d_ij, a_1,ij}, and {d_ij, a_1,ij, a_2,ij, a_3,ij}, respectively, where d_ij is the centroid distance, a_1,ij is the surface normal angle difference, and a_2,ij, a_3,ij are angle differences in principal directions.


Fig. 16. Range images and encoded binary images which represent firing patterns of the SOFMs.


Fig. 17. Rendered range images and encoded binary images of 10 human faces.

3D object recognition can be achieved by a simple correlation method or a pattern matching method. Needless to say, there will be variations in binary images of one object depending on the viewing direction. In such a case, some clustering or machine learning techniques are needed to find appropriate grouping among those binary images. Moreover, the size of the binary image is fixed at the stage of designing ANNs. This allows various Artificial Neural Networks (ANNs) such as Backpropagation ANNs and Learning Vector Quantization ANNs to be applied to a task of 3D object recognition [1]. Such ANNs normally take an input signal, the size of which is determined when the ANNs are designed. If 3D objects are represented in the form of a graph, the number of nodes and arcs varies depending on the viewing direction. Hence it is not suitable for many ANNs. Currently, there is no method to decode 3D structural information from the encoded binary image. However, provided an appropriate decoding method exists, the 3D

structural information encoding system described here will be useful for efficient transmission of such rich information.

6.4. An application for face recognition

In this section, an experimental result of face recognition is presented as an application of the system described in this work. In addition to the seven faces in Fig. 17, three objects (Chimp_Mask, Dino_Mask, and Mask) were used for the experiment. For each face, 26 range images obtained from different viewpoints were prepared; half of them were used for training the ANNs and the other half were used to test the recognition performance. A Learning Vector Quantization (LVQ) ANN [26] is used to perform the recognition tasks, because it has a similar neural structure to, and employs a similar training scheme as, a SOFM. In an LVQ ANN, the input feature space is described by using a limited number of reference vectors. Those reference vectors partition the feature space so that one of the

Table 5
Parameters for training the SOFM_SP (Msp × Msp map: Msp = 15) and the SOFM_GR (Mgr × Mgr map: Mgr = 30)

            Learning rate a    Kernel size Nc        Learning steps
1st stage   0.05               0.8 Msp (or Mgr)      10 000
2nd stage   0.02               0.3 Msp (or Mgr)      100 000
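For readers who wish to reproduce the training schedule of Table 5, the sketch below shows one plausible interpretation: a coarse ordering stage followed by a fine-tuning stage, with the learning rate and neighborhood kernel size decreasing within each stage. The linear decay and Gaussian neighborhood are assumptions of this sketch (similar schedules are used, for example, by the SOM_PAK package [29]); the paper itself does not state the exact decay law or neighborhood function.

```python
import numpy as np

def train_sofm(weights, data, alpha0, nc0, steps, seed=0):
    """One training stage of a square SOFM, with the learning rate and
    neighborhood radius decreasing linearly over the stage.
    `weights` must be a float array of shape (M, M, dim)."""
    rng = np.random.default_rng(seed)
    m = weights.shape[0]
    rows, cols = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    for t in range(steps):
        frac = 1.0 - t / steps
        alpha, radius = alpha0 * frac, max(nc0 * frac, 1.0)
        x = data[rng.integers(len(data))]
        # Best-matching unit on the grid.
        d = np.linalg.norm(weights - x, axis=-1)
        br, bc = np.unravel_index(np.argmin(d), (m, m))
        # Gaussian neighborhood around the BMU shrinks with the radius.
        g = np.exp(-((rows - br) ** 2 + (cols - bc) ** 2) / (2.0 * radius ** 2))
        weights += alpha * g[..., None] * (x - weights)
    return weights

# Two-stage schedule of Table 5 for the 15 x 15 SOFM_SP (illustrative only).
# sp_data would hold the surface-part feature vectors; dim is their length.
# msp = 15
# weights = np.random.default_rng(0).random((msp, msp, dim))
# weights = train_sofm(weights, sp_data, alpha0=0.05, nc0=0.8 * msp, steps=10_000)
# weights = train_sofm(weights, sp_data, alpha0=0.02, nc0=0.3 * msp, steps=100_000)
```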

The range images were first submitted to the local feature extraction and segmentation processes. The segmented surface parts were then analyzed by the SOFM_SP and transformed into binary images by the SOFM_GR, and the resulting binary images were recognized by the LVQ ANN. The parameters used to train the ANNs are presented in Tables 5 and 6. Because faces are more complex than the objects shown in Fig. 16, larger ANNs were used in order to inductively represent the greater number of surface parts and relational features compared with the previous experiments.

The results of face recognition are tabulated in Table 7. Although faces are real free-form and, strictly speaking, deformable 3D objects, the system achieved an overall recognition rate of 80.00%. The face KA, however, was not recognized correctly at all. To analyze this result, the entropies of the co-occurrence matrices were computed for all binary images; Fig. 18 shows the average entropy for each face. According to the figure, the face KA has an average entropy close to those of AR and RJ, which indicates that these three faces have very similar binary images. Furthermore, with the particular training samples used in the experiment, the LVQ ANN predominantly learned AR and RJ rather than KA: only 2 reference vectors were assigned to KA, while 6 and 10 reference vectors were assigned to AR and RJ, respectively.
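The entropy measure used in this analysis can be made concrete with a short sketch: a co-occurrence matrix is accumulated over neighboring pixel pairs of a binary image and its entropy is computed from the normalized counts. The restriction to horizontally adjacent pairs and the function name are assumptions of this sketch; the paper does not specify which displacement vectors were used.

```python
import numpy as np

def cooccurrence_entropy(image):
    """Entropy (in bits) of the 2 x 2 co-occurrence matrix of a binary image,
    accumulated over horizontally adjacent pixel pairs.
    `image` is a 2-D integer array of 0s and 1s (an encoded binary image)."""
    a = image[:, :-1].ravel()          # left pixel of each horizontal pair
    b = image[:, 1:].ravel()           # right pixel of each horizontal pair
    counts = np.zeros((2, 2), dtype=float)
    np.add.at(counts, (a, b), 1.0)     # count each (left value, right value) pair
    p = counts / counts.sum()
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())
```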

Table 6
Parameters for training the LVQ ANN

Initial number of reference vectors   130
Initial learning rate a_i(0)          0.3
Learning steps                        5200

Table 7
Results of face recognition

Face     Recognition rate (%)
AL       83.33
AR       91.67
Chimp    100.0
CKS      91.67
Dino     100.0
KA       0.00
KE       66.67
ML       91.67
Mask     91.67
RJ       83.33
Total    80.00

Fig. 18. Average entropies of co-occurrence matrices for ten faces.

7. Conclusion

In this paper, a 3D object encoding system using ANNs has been described. The proposed encoding system utilizes two SOFMs (the SOFM_SP and the SOFM_GR) to encode surface parts and their geometrical relationships.

Neurons in each of these SOFMs are trained to represent surface parts and pairs of surface parts together with their relational information. These neurons, however, are not exact representations of samples in the training data; instead, they represent clusters in each feature space. The resulting binary images therefore describe 3D objects in terms of generalized primitives. In this system, pairs of surface parts and their geometrical relationships were considered the basic components from which 3D objects are constructed. The system is capable of encoding not only simple or piecewise-smooth surfaces but also free-form surfaces.

The system developed here may not be suitable for tasks such as inspecting and sorting industrial parts, which require highly accurate matching. If it were used in an industrial environment, however, it could serve to reduce the number of candidate objects to be examined by an exact matching process. The system should nevertheless find useful applications in tasks such as face recognition and robot navigation using 3D natural landmarks.

References

[1] M. Takatsuka, R.A. Jarvis, Hierarchical neural networks for learning three-dimensional objects from range images, Journal of Electronic Imaging 7 (1) (1998) 16–28.


[2] F. Tomita, H. Takahashi, Self-calibration of stereo cameras, ETL Technical Report TR-88-21, Electrotechnical Laboratory, Tsukuba-shi, Ibaraki-ken, Japan, May 1988.
[3] F. Stein, G. Medioni, Structural indexing: efficient 3-D object recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (2) (1992) 125–145.
[4] C.S. Chua, R.A. Jarvis, Point signatures: a new representation for 3-D object recognition, International Journal of Computer Vision 25 (1) (1997) 63–85.
[5] C.S. Chua, Free-form three dimensional object recognition, PhD thesis, Department of Electrical and Computer Systems Engineering, Monash University, Clayton 3168, Victoria, Australia, February 1995.
[6] P. Horaud, R.C. Bolles, 3DPO's strategy for matching three-dimensional objects in range data, International Conference on Robotics, IEEE Computer Society Press, Silver Spring, MD, 1984, pp. 78–85.
[7] F. Tomita, T. Kanade, A 3D vision system: generating and matching shape descriptions in range images, The First Conference on Artificial Intelligence Applications, IEEE Computer Society Press, Silver Spring, MD, 1984, pp. 186–191.
[8] M. Oshima, Y. Shirai, A scene description method using three-dimensional information, Pattern Recognition 11 (1) (1979) 9–17.
[9] T.J. Fan, G. Medioni, R. Nevatia, Recognizing 3-D objects using surface descriptions, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (11) (1989) 1140–1157.
[10] M. Oshima, Y. Shirai, Object recognition using three-dimensional information, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-5 (4) (1983) 353–361.
[11] R. Nevatia, T.O. Binford, Description and recognition of curved objects, Artificial Intelligence 8 (1) (1977) 77–98.
[12] R. Brooks, Model-based 3-D interpretations of 2-D images, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-5 (2) (1983) 140–150.
[13] T.E. Boult, A.D. Gross, On the recovery of superellipsoids, Proceedings of the DARPA Image Understanding Workshop, Washington, DC, 1988, pp. 1052–1063.
[14] H. Murase, S.K. Nayar, Visual learning and recognition of 3-D objects from appearance, International Journal of Computer Vision 14 (1) (1995) 5–24.
[15] J.J. Koenderink, A.J. van Doorn, The internal representation of solid shape with respect to vision, Biological Cybernetics 32 (1979) 211–216.
[16] B.K.P. Horn, K. Ikeuchi, The mechanical manipulation of randomly oriented parts, Scientific American 251 (2) (1984) 100–111.
[17] H. Dellingette, M. Hebert, K. Ikeuchi, A spherical representation for the recognition of curved objects, Proceedings of International Conference on Computer Vision, May 1993, pp. 103–112.

[18] P.J. Besl, R.C. Jain, Three-dimensional object recognition, ACM Computing Surveys 17 (1) (1985) 75–145.
[19] D.D. Hoffman, W.A. Richards, Parts of recognition, Cognition: International Journal of Cognitive Psychology 18 (1984) 65–96.
[20] M. Takatsuka, Free-form three dimensional object recognition using artificial neural networks, PhD thesis, Department of Electrical and Computer Systems Engineering, Monash University, Clayton 3168, Victoria, June 1996.
[21] I. Biederman, Human image understanding: recent research and a theory, Computer Vision, Graphics, and Image Processing 32 (1985) 29–73.
[22] A. Pentland, Perceptual organization and the representation of natural form, Artificial Intelligence 28 (1986) 293–331.
[23] T.J. Fan, G. Medioni, R. Nevatia, Segmented descriptions of 3-D surfaces, IEEE Journal of Robotics and Automation RA-3 (6) (1987) 527–538.
[24] R. Bolles, P. Horaud, 3DPO: a three-dimensional part orientation system, The International Journal of Robotics Research 5 (3) (1986) 3–26.
[25] J.S. Beis, D.G. Lowe, Learning indexing functions for 3-D model-based object recognition, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos, CA, 1994, pp. 275–280.
[26] T. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (9) (1990) 1464–1480.
[27] T. Kohonen, Self-Organization and Associative Memory, Springer, New York, 1989.
[28] M. Takatsuka, R.A. Jarvis, Surface/sphere intersection signature: a new local shape feature for range data analysis, Technical Report MECSE-95-2, Intelligent Robotics Research Center, Monash University, Clayton, 3186 Victoria, Australia, October 1995.
[29] J. Hynninen, J. Kangas, T. Kohonen, J. Laaksonen, SOM_PAK: The Self-Organizing Map Program Package, version 3.1, April 1995.
[30] M.A. Kraaijveld, J. Mao, A.K. Jain, A non-linear projection method based on Kohonen's topology preserving maps, Proceedings of the 11th International Conference on Pattern Recognition, IEEE Computer Society Press, Los Alamitos, CA, 1992, pp. 41–45.
[31] A. Ultsch, Self organized feature maps for monitoring and knowledge acquisition of a chemical process, in: S. Gielen, B. Kappen (Eds.), Proceedings of the International Conference on Artificial Neural Networks (ICANN93), Springer, London, 1993, pp. 864–867.
[32] J. Iivarinen, T. Kohonen, J. Kangas, S. Kaski, Visualizing the clusters on the self-organizing map, Multiple Paradigms for Artificial Intelligence, Finnish Artificial Intelligence Society, 1994, pp. 122–126.