Inference of structure: hands

Pattern Recognition Letters 15 (1994) 957-962
Robert A. McLaughlin, Michael D. Alder *, Christopher J.S. deSilva

Centre for Intelligent Information Processing Systems and The Department of Mathematics, The University of Western Australia, Nedlands, W.A. 6009, Australia

Received 28 January 1993; revised 25 March 1994

Abstract

Many images can be viewed in terms of simpler components, such as a hand viewed as a collection of fingers and a palm. Recognition can then be achieved by finding these components and examining the relationships between them. If each component can be further decomposed into sub-components, then this approach may be applied recursively to the image until some low-level primitives are obtained. This decomposition imposes a hierarchy upon the recognition, as the same image is described at several different levels as a collection of increasingly simple parts. This paper explores an attempt to solve a recognition problem using such an approach. Notably, each level of the problem is solved using essentially identical techniques. This uniformity of approach suggests a great potential for generalisation to many different types of recognition problem.

Keywords: Syntactic pattern recognition

1. Introduction

Consider the palm print of Fig. 1 and the problem of how we distinguish left hands from right. It seems reasonable that we first aggregate sets of black pixels into components such as fingers and a palm. These are then formed into a single entity, a hand. It is the relationships between the components of this entity which distinguish left hands from right. This approach of considering an entity as consisting of a set of simpler components, each of which may in turn consist of yet simpler sub-components, has been termed a syntactic or structural one and was explored by Fu (1982). A computer program utilising such a method would be interesting, not because of its ability to distinguish different classes of hands, but because of its possible application to the vast array of images which may be structurally decomposed. Such a process is explored in this paper. The motivation behind choosing hands is that their structure is simple enough to make the issues clear and yet not so simple as to be completely trivial. In order to demonstrate the generality of such an approach, the computer program used to distinguish left hands from right was then used to distinguish a commercial passenger aeroplane from a jet fighter. Thus no prior information specific to the image was contained in the program; the program extracts such information itself.

Fig. 1. Silhouette of a hand.

* Corresponding author. E-mail: [email protected].

2. Finding fingers

Consider the black pixels of Fig. 1 as forming a point set in $\mathbb{R}^2$. The task is to identify those subsets of points that form a finger or a palm. A set of points in a linear space (such as $\mathbb{R}^2$) may be characterised by taking low-order central moments of the set. We shall restrict ourselves to describing a set of points using only first- and second-order central moments, these corresponding to the mean and covariance matrix of the points. As the mean and covariance matrix uniquely define a gaussian probability density function, describing a set of points can be viewed as fitting a gaussian to those points. By assuming that each of the points in $\mathbb{R}^2$ belongs to a set which is either a finger or a palm, the task of identifying the fingers and palm of a hand reduces to the problem of finding a gaussian mixture model of the image, with a separate gaussian positioned over each finger and the palm. The maximum likelihood gaussian mixture model of a point set may be obtained by use of the E.M. algorithm (Dempster et al., 1977). Given an initial set of gaussians, the E.M. algorithm will iteratively modify their parameters until the maximum likelihood model is found. In order to represent this, we shall have each gaussian of mean $m$ and covariance matrix $C$ define the quadratic form

$F : \mathbb{R}^2 \to \mathbb{R}, \qquad x \mapsto (x - m)^{\mathsf{T}} C^{-1} (x - m)\,.$

Fig. 2. Desired decomposition of a hand.

Fig. 3. Result from poorly initialised E.M. algorithm.

The points satisfying $\{x : F(x) = 1\}$ form an ellipse, and a typical gaussian mixture model of the image of Fig. 1 can be represented as the ellipses of Fig. 2. Unfortunately, given some random initial set of six gaussians, the E.M. algorithm will often fail to converge to the desired model. A typical example of convergence is shown in Fig. 3. One possible explanation for such a convergence is the highly non-gaussian nature of the data. A second is that a model with a separate gaussian over each finger and the palm is not a state to which the E.M. algorithm converges. A third explanation suggests that the E.M. algorithm may fail to converge to the maximum likelihood model, possibly as the rate of convergence becomes comparable in magnitude to the inaccuracies of the computer, or because of local maxima in the likelihood of gaussian models.

The first of these possibilities was explored by using a collection of six gaussian probability density functions to generate an image which resembled a hand. The E.M. algorithm was then run on this data. Whilst the results showed a general improvement over those from silhouettes of hands, a poorly initialised set of gaussians was still found to often converge to a less than ideal model, as is shown in Fig. 4.

Fig. 4. A gaussian hand.

The second explanation was tested by initialising a set of gaussians to a state close to that shown in Fig. 2.


Given such an initialisation, the gaussians converged as desired. This left the third explanation, which prompted the conclusion that an algorithm was required which would produce a rough approximation to the desired gaussian mixture model. This could be used to initialise the E.M. algorithm, which would then converge as desired. This algorithm has been labelled the Star Algorithm.
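To make this two-stage strategy concrete, here is a minimal sketch in Python using scikit-learn's GaussianMixture, whose fitting routine is an implementation of the E.M. algorithm. The function name and parameter choices are ours, not the paper's; `means_init` stands in for the rough initialisation that the Star Algorithm of the next section supplies.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_hand_mixture(image, init_means=None, n_components=6):
    """Fit a gaussian mixture to the black pixels of a binary image.

    `image` is a 2-D boolean array, True where a pixel is black.
    `init_means`, if given, seeds the E.M. iteration; as noted above,
    a poor random initialisation often converges to the wrong model.
    """
    # Treat each black pixel as a point in R^2.
    points = np.argwhere(image).astype(float)
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type="full",
        means_init=init_means,  # e.g. the centres found by the Star Algorithm
        max_iter=200,
    )
    gmm.fit(points)
    # One mean and covariance matrix (i.e. one ellipse) per component.
    return gmm.means_, gmm.covariances_
```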

3. The Star Algorithm

A random point is chosen in the image. A set of rays through the point is taken, and those points which lie along a ray and correspond to black pixels are collected. The rays terminate upon intersecting a white pixel, so the resulting set of points is star shaped on the given initial point (Fig. 5). Recall that a set $S$ is star shaped with respect to the point $a \in S$ iff

$\forall b \in S,\ \forall t \in [0, 1], \quad ta + (1 - t)b \in S\,.$

Fig. 5. First iteration of the Star Algorithm.

The centroid of the set thus obtained is computed and this becomes the next centre. The process is iterated until it converges or cycles. There are cases of high symmetry where the centre may leave the set, although this does not happen for the data set of hands. When the centre has stabilised, compute the covariance matrix of the set of points in the star shaped set generated by the rays through the centre. Remove from the image all points within 2.8 standard deviations of the centre. Repeat the entire process until no points are left, or alternatively until the points left have measure less than some threshold fraction of the original image. This yields a set of quadratic forms which decompose the image into ellipses. The parameter of 2.8 was found to be appropriate for hands but may need to be altered for different classes of images.

The above algorithm, when initialised randomly anywhere inside a silhouette of a hand, rapidly converges to a centre on the palm. If it starts at a finger tip, it migrates down the finger quite rapidly on to the palm, where it soon stabilises. It then removes the palm and most of the fingers. The tips of the fingers are left, and they each develop their own ellipses. The result of the Star Algorithm on a hand silhouette is shown in Fig. 6. The ellipses shown represent 2.8 standard deviations of the covariance matrices. That is to say, they are the sets

$\{x \in \mathbb{R}^2 : (x - m_j)^{\mathsf{T}} C_j^{-1} (x - m_j) \leq 2.8\}$

where $m_j$ are the centres and $C_j$ are the corresponding covariance matrices. It is clear that the algorithm has found the component parts, although it has attached rather a lot of weight to the palm. If the given quadratic forms are used as initialisation data for the E.M. algorithm applied to a gaussian mixture model, they yield the satisfactory results of Fig. 2.

Fig. 6. Decomposition generated by the Star Algorithm.
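The following is a minimal sketch of the Star Algorithm as just described, under stated assumptions: the ray count, iteration limits and component cap are ours, and a faithful implementation would use the paper's threshold-fraction stopping rule rather than a fixed cap; 2.8 is the paper's hand-tuned removal parameter.

```python
import numpy as np

def star_set(image, centre, n_rays=360):
    """Collect the black pixels lying along rays from `centre`; each ray
    terminates at the first white pixel, so the result is star shaped
    on `centre` (Fig. 5)."""
    h, w = image.shape
    points = []
    for angle in np.linspace(0.0, 2 * np.pi, n_rays, endpoint=False):
        step = np.array([np.cos(angle), np.sin(angle)])
        pos = np.asarray(centre, dtype=float)
        while True:
            r, c = int(round(pos[0])), int(round(pos[1]))
            if not (0 <= r < h and 0 <= c < w) or not image[r, c]:
                break  # ray stops at a white pixel (or the image edge)
            points.append((r, c))
            pos = pos + step
    return np.array(points, dtype=float)

def star_algorithm(image, threshold=2.8, max_components=10):
    """Peel off one elliptical component at a time, as in Section 3."""
    image = image.copy()
    rng = np.random.default_rng()
    components = []
    while image.any() and len(components) < max_components:
        centre = rng.choice(np.argwhere(image)).astype(float)
        for _ in range(100):  # iterate centre -> centroid until stable
            pts = star_set(image, centre)
            if len(pts) == 0:
                break
            new_centre = pts.mean(axis=0)
            if np.allclose(new_centre, centre, atol=0.5):
                break
            centre = new_centre
        pts = star_set(image, centre)
        if len(pts) < 3:
            continue
        cov = np.cov(pts.T)
        components.append((centre, cov))
        # Remove all points x with (x - m)^T C^{-1} (x - m) <= threshold,
        # the paper's 2.8 standard deviation cut-off.
        black = np.argwhere(image).astype(float)
        diff = black - centre
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        for r, c in black[d2 <= threshold].astype(int):
            image[r, c] = False
    return components
```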

4. The UpWrite

Whilst aggregating points in $\mathbb{R}^2$ into sets which could be labelled as either a finger or a palm, each set was characterised by low-order central moments. We shall use this same principle when aggregating fingers and a palm into either a left hand or a right. In doing so, we shall have achieved a certain amount of uniformity in our approach to the different levels of this recognition problem.


Consider the gaussian mixture model of Fig. 2 and note that each of the gaussians may be uniquely defined by five numbers. There are several ways to parameterise a two-dimensional gaussian, but we have chosen to record the mean $(x, y)$, the eigenvalues of the covariance matrix $(\lambda_1, \lambda_2)$ and the angle of the dominant eigenvector $(\theta_{\lambda_1})$ relative to the x-axis. This angle is somewhat ambiguous, as a given eigenvector could map to any of a number of angles differing by $n\pi$ radians ($n \in \mathbb{Z}$). Also, two eigenvectors, one of angle slightly more than 0 radians and the other of slightly less than $2\pi$ radians, have greatly differing numerical values in spite of being very similar vectors. Thus we define the mapping

$\theta_{\lambda_1} \mapsto [\cos(2\theta_{\lambda_1}), \sin(2\theta_{\lambda_1})]$

transforming an angle to a point on the unit circle, where angles differing by $n\pi$ radians ($n \in \mathbb{Z}$) map to the same point. The six numbers

$[x, y, \lambda_1, \lambda_2, \cos(2\theta_{\lambda_1}), \sin(2\theta_{\lambda_1})]^{\mathsf{T}}$

define an embedding from a 2-D gaussian to a point in $\mathbb{R}^6$. Thus the set of points in $\mathbb{R}^2$ forming a finger or a palm has been converted into a single point in $\mathbb{R}^6$. This process is termed an UpWrite. More generally, an UpWrite is a process whereby a set of points in some space is represented by a single point in a different, generally higher-dimensional space.

The next step is to take the set of six points in $\mathbb{R}^6$ representing the gaussian mixture model of a hand and UpWrite them to a single point in a higher-dimensional space, say $\mathbb{R}^n$. By characterising the regions of $\mathbb{R}^n$ to which left and right hands map, we shall be able to recognise such images. As before, given a set of points we shall characterise them by low-order central moments. The point set is now in $\mathbb{R}^6$ instead of $\mathbb{R}^2$, but this presents no conceptual difficulties. Also as before, we shall consider the first- and second-order central moments, corresponding to the mean and covariance matrix of the points. These may be parameterised by several numbers: six numbers for the mean, six numbers to list the eigenvalues of the covariance matrix, and fifteen angles to define its eigenvectors. Note that whilst the first eigenvector requires five angles to describe it, the second (being orthogonal) requires only four, the third requires only three, etc. Mapping each angle to a point on the unit circle as before,

$\theta \mapsto [\cos(2\theta), \sin(2\theta)]$

we obtain a parameterisation of the low-order moments that requires forty-two numbers. These numbers define an embedding of a set of points in $\mathbb{R}^6$ as a single point in $\mathbb{R}^{42}$. This single point represents an entire image, and the image may be recognised as being either of a left or a right hand depending upon where in $\mathbb{R}^{42}$ it lies.
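The first-level UpWrite is easily made concrete: the sketch below embeds one 2-D gaussian as exactly the six numbers above. The second function is only an illustrative stand-in for the second-level UpWrite, since the paper describes but does not spell out its fifteen-angle eigenvector parameterisation (our flattened eigenvector matrix yields 48 numbers rather than the paper's 42).

```python
import numpy as np

def gaussian_to_r6(mean, cov):
    """First-level UpWrite: embed one 2-D gaussian as the point
    [x, y, lambda1, lambda2, cos(2*theta), sin(2*theta)] in R^6."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    lam1, lam2 = eigvals[1], eigvals[0]     # lam1 is the dominant eigenvalue
    v = eigvecs[:, 1]                       # dominant eigenvector
    theta = np.arctan2(v[1], v[0])          # angle relative to the x-axis
    # Doubling the angle sends eigenvectors differing by pi radians
    # to the same point on the unit circle.
    return np.array([mean[0], mean[1], lam1, lam2,
                     np.cos(2 * theta), np.sin(2 * theta)])

def upwrite_point_set(points):
    """Second-level UpWrite, simplified: characterise a point set in R^6
    (here, six points, one per component) by its first- and second-order
    central moments. We concatenate the mean, the eigenvalues and the
    flattened eigenvector matrix as an illustrative stand-in for the
    paper's angle-based encoding."""
    P = np.asarray(points, dtype=float)
    mean = P.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(P.T))
    return np.concatenate([mean, eigvals, eigvecs.ravel()])
```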

5. Dimension and subspace

At this stage, let us reiterate the motivation for this research. It is to find a general method of recognition applicable to a wide range of images. To this end, we began with a large set of points in $\mathbb{R}^2$ representing the black pixels of a two-tone image. By recursively applying a process referred to as an UpWrite, this point set was compressed to a single point in $\mathbb{R}^{42}$. This represents an extraordinary level of compression. It has been achieved by identifying those subsets of points forming components of the object, and describing each component as a single point in a different space. A single point in $\mathbb{R}^{42}$ represents a set of points in $\mathbb{R}^6$, each of which in turn represents a set of points in $\mathbb{R}^2$, that is, a component of the original object. Thus each point in $\mathbb{R}^{42}$ represents an object in a black and white image. Note that this image need not be of a hand. Images of aeroplanes, tanks, fish or fruit would all map to points in $\mathbb{R}^{42}$. We stress this to allay any fears that forty-two dimensions may be excessive when examining the gross structure of an image such as a hand. To reduce the dimension would require that we build in assumptions about the images being recognised. If we were to allow such assumptions, we could reduce the dimension to one, with left hands corresponding to a positive value and right to a negative. However, such a system would not have the ability to generalise to other classes of images and hence would be of little interest.

As a compromise between the requirements of generality and the demands of computational simplicity, an affine subspace of $\mathbb{R}^{42}$ was found which spanned all points representing hands. By calculating the mean and covariance matrix of a large selection of points, and performing simple Principal Components Analysis, the dimension of this affine subspace was reduced to twenty-four. It should be noted that whilst the lower dimension simplifies any future computations, the program's ability to differentiate between different images is reduced.

In order to differentiate left hands from right, a large number of left hands were mapped to points in $\mathbb{R}^{42}$ and these points were then projected on to the twenty-four-dimensional affine subspace. This formed a cluster of points in $\mathbb{R}^{24}$ which was modelled by a gaussian probability density function. This twenty-four-dimensional gaussian assigns to every point in $\mathbb{R}^{24}$ a likelihood reflecting how well the point represents a left hand. The process was then repeated with a collection of right hands.
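A sketch of the subspace construction and class model in numpy follows; the function name is ours, the input is assumed to be the $\mathbb{R}^{42}$ UpWrites of the training images of one class, and 24 is the paper's empirically chosen dimension.

```python
import numpy as np

def fit_class_model(upwritten_points, subspace_dim=24):
    """Principal Components Analysis on points in R^42, followed by a
    single gaussian fitted to their projections, as in Section 5."""
    X = np.asarray(upwritten_points)            # shape (n_images, 42)
    mu = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
    # Basis of the affine subspace: the top `subspace_dim` principal axes.
    basis = eigvecs[:, ::-1][:, :subspace_dim]  # shape (42, 24)
    Y = (X - mu) @ basis                        # coordinates in R^24
    # The class is modelled by the gaussian with these moments.
    return mu, basis, (Y.mean(axis=0), np.cov(Y.T))
```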

6. Summary

The overall process of recognition may now be summarised as follows. Given a black and white image, the black points are interpreted as a point set in $\mathbb{R}^2$. This set is separated into its component parts (i.e., fingers and a palm) by use of the Star Algorithm and the E.M. algorithm. Each component, being a subset of points in $\mathbb{R}^2$, is characterised by low-order central moments and UpWritten to a point in $\mathbb{R}^6$. The set of six points in $\mathbb{R}^6$ (representing five fingers and a palm) is in turn characterised by low-order central moments and UpWritten to a single point in $\mathbb{R}^{42}$. This point is then considered in relation to a twenty-four-dimensional affine subspace which approximates the space containing all points generated by images of hands. If the point does not lie in or near the affine subspace, it may immediately be discarded as being unlike any hand. If the point lies near but not in the affine subspace, it should not be discarded, as we are not working with the correct subspace but with a lower-dimensional approximation to it. The point is projected on to the affine subspace and then considered in relation to the gaussians representing left and right hands. This returns the likelihood that the original image belonged to either class. If neither likelihood is above some threshold value, the image is considered not to be of a hand.

Such a system was implemented on a Sun Sparc workstation. A set of 83 left hands and 83 right hands was used to train the system. A separate set of 20 left hands and 20 right hands was then used as a test set. The system was able to classify all images used in the training set correctly. All but one of the images in the test set were correctly identified, the exception being a right hand which was identified as a left.
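The decision rule of this summary might be sketched as follows. The two thresholds are placeholders rather than values from the paper, and each class model is the (mean, covariance) pair of the projected training cluster, as returned by the hypothetical `fit_class_model` above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(point, mu, basis, left_model, right_model,
             residual_tol=100.0, likelihood_tol=1e-12):
    """Classify one UpWritten image (a point in R^42) as a left hand,
    a right hand, or not a hand at all."""
    centred = point - mu
    proj = centred @ basis                  # coordinates in R^24
    # Residual distance from the affine subspace: discard points that
    # lie nowhere near it.
    residual = np.linalg.norm(centred - basis @ proj)
    if residual > residual_tol:
        return "not a hand"
    left = multivariate_normal(*left_model).pdf(proj)
    right = multivariate_normal(*right_model).pdf(proj)
    if max(left, right) < likelihood_tol:
        return "not a hand"
    return "left hand" if left > right else "right hand"
```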

7. Generality

The generality of this method is probably best demonstrated by a second example. Using the very same computer program written to differentiate hands, we were able to distinguish a commercial passenger aeroplane from a jet fighter (Fig. 7). Only two parameters of the program were altered.
• The number of ellipses used was changed from six to five.
• After UpWriting the set of points in $\mathbb{R}^6$ representing the gaussian mixture model of the image to a single point in $\mathbb{R}^{42}$, a fifteen-dimensional affine subspace was found to approximate the manifold containing all relevant points. With hands, a twenty-four-dimensional subspace was used.
With the exception of the fine tuning of these parameters, no other alteration to the program was necessary.

Fig. 7. Decomposition of a passenger aeroplane and a jet fighter.

Images typified by those of Fig. 7 were scanned into the computer. The silhouettes were shown at a variety of orientations and a small amount of translational noise was included. Each image was decomposed as shown by the ellipses of Fig. 7. It is clear that for finer discriminations between aircraft, a more detailed decomposition would be required. This does not present other than computational difficulties of a tolerable sort. Each ellipse was then embedded as a point in $\mathbb{R}^6$, and the resulting set of five points in $\mathbb{R}^6$ was characterised by first- and second-order central moments and mapped to a single point in $\mathbb{R}^{42}$. This point was then projected on to a fifteen-dimensional affine subspace of $\mathbb{R}^{42}$ which approximated the manifold containing all points representing these aeroplanes. The reader should note that this process is identical to that used in the recognition of hands.

A single gaussian probability density function was used to model the region of $\mathbb{R}^{15}$ to which the passenger aircraft mapped. This gaussian was found by showing the computer a large number of examples of the passenger aircraft and having it map each to a point in $\mathbb{R}^{15}$. The mean and covariance matrix of these points defined the appropriate gaussian. The process was then repeated with the fighter aircraft, and a second gaussian was used to model the appropriate region. Note again that this is merely a repeat of the process used earlier to model those regions of space corresponding to left and right hands.

Given an image not seen before, the program will decompose it and map it to a point in $\mathbb{R}^{42}$. If the point does not lie near the fifteen-dimensional affine subspace that spans those regions corresponding to the aircraft, the image is judged to resemble neither aircraft. If it does lie near or in the subspace, the point's projection is compared to the gaussian modelling each region. Each gaussian returns a likelihood of how closely the image resembles each type of aircraft.

A training set of 61 images of the passenger aircraft and 58 images of the fighter aircraft was used. The test set consisted of 91 images of the passenger aircraft and 88 images of the fighter aircraft. All images in both sets were identified correctly.
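In terms of the hypothetical sketches above, the adaptation described here amounts to changing two arguments; everything else in the pipeline is untouched.

```python
# `hand_upwrites` / `plane_upwrites`: hypothetical arrays of the
# UpWritten (R^42) training images for each problem.

# Hands: six ellipses, a 24-dimensional affine subspace.
hand_parts = star_algorithm(hand_image, max_components=6)
hand_model = fit_class_model(hand_upwrites, subspace_dim=24)

# Aircraft: five ellipses, a 15-dimensional affine subspace.
plane_parts = star_algorithm(plane_image, max_components=5)
plane_model = fit_class_model(plane_upwrites, subspace_dim=15)
```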

8. Conclusion

The essential points of this method can be summarised as follows. An image is abstracted as a set of points, in this case in $\mathbb{R}^2$. Subsets of these points are identified, each being considered as forming a single entity (e.g. fingers, a fuselage or wings), and each is UpWritten to a single point in a higher-dimensional space. This newly formed set of points constitutes a higher-level description of the image. For hands this will be in terms of fingers and a palm instead of as a collection of black pixels. The process is repeated recursively until only a single point is produced. Having abstracted the entire image to a single point in space, recognition reduces to noting where in space the point lies. By characterising the regions of space occupied by different classes of images, whether they be left and right hands or different aeroplanes, such images can be identified.

It takes little imagination to see how this method may be applied to other silhouetted data. More interesting, however, is that essentially the same methods have been applied to the recognition of hand-drawn cubes and pyramids. This work has been documented elsewhere and awaits publication (McLaughlin and Alder, 1994). Programs and papers demonstrating this work may be obtained by anonymous ftp from ciips.ee.uwa.edu.au.

References

Dempster, A.P., N.M. Laird and D.B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B 39 (1), 1-38.
Fu, K.S. (1982). Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ.
McLaughlin, R.A. and M.D. Alder (1994). Recognising cubes in images. Pattern Recognition in Practice IV, Vlieland, The Netherlands, June 1-3, 1994.