Pattern Recognition Letters 1 (1982) 61-68 North-Holland Publishing Company
October 1982
A paradigm for invariant object recognition of brightness, optical flow and binocular disparity images
Lowell JACOBSON and Harry WECHSLER
Department of Electrical Engineering, University of Minnesota, Minneapolis, Minnesota 55455, U.S.A.
Received 28 May 1982
Abstract: We suggest the Wigner distribution (WD) for the analysis of 2-D images. The WD can be used to rigorously define a local power-spectrum at each point of an image. Furthermore, an invariant representation of a given image can be obtained by applying a complex-logarithmic (CL) conformal mapping to the spatial-frequency domain of the WD. The representation is such that all local spectra are invariant, within a linear shift, with respect to linear transformations of the image. A discrete WD has been implemented and results are shown. We next describe how the same CL-mapped WD of a scalar or vector field could be used for binocular disparity and motion analysis, respectively, where the goal is object recognition.
Key words: Binocular disparity, invariant object recognition, spatial-frequency analysis, optical flow, Wigner distribution.
1. Introduction
The Wigner distribution (WD) is named after Eugene Wigner, the Nobel physicist, who, in 1932, introduced it to characterize the quantum mechanical duality between the position and momentum of a particle. More recently, the WD has received attention in the fields of optics [2,3] and speech analysis [1]. An excellent series of articles has been published [5,6,7] showing the properties, together with examples, of the WD as applied to functions of one variable. These articles show that all previously proposed transformations for time-frequency analysis can be expressed as averages of the WD.
We suggest herein the WD for the analysis of 2-D images. The WD can be used to rigorously define a local power-spectrum at each point of an image; hence, it provides high resolution in both space and spatial-frequency. Furthermore, an invariant representation of a given image can be obtained by applying a complex-logarithmic (CL) conformal mapping to the spatial-frequency domain of the WD. The representation is such that all local spectra are invariant, within a linear shift, with respect to linear transformations of the image.
Note: This work was supported in part by the National Science
Foundation under Grant ECS-8105168 and by funds from the Graduate School at the University of Minnesota.
0167-8655/82/0000-0000/$02.75 © 1982 North-Holland
Volume 1, Number 1, PATTERN RECOGNITION LETTERS, October 1982
1.1. Definition and properties of the WD
Assume an arbitrary image function, f(x, y), defined over the cartesian coordinates x, y, $-\infty < x < \infty$, $-\infty < y < \infty$. Its WD is defined as
$$\begin{aligned} W_f(x, y, u, v) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f\left(x + \tfrac{1}{2}\alpha,\, y + \tfrac{1}{2}\beta\right) f^{*}\left(x - \tfrac{1}{2}\alpha,\, y - \tfrac{1}{2}\beta\right) \exp\{-j(\alpha u + \beta v)\}\, d\alpha\, d\beta \\ &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} R_f(x, y, \alpha, \beta) \exp\{-j(\alpha u + \beta v)\}\, d\alpha\, d\beta, \end{aligned} \qquad (1)$$
where
$$R_f(x, y, \alpha, \beta) = f\left(x + \tfrac{1}{2}\alpha,\, y + \tfrac{1}{2}\beta\right) f^{*}\left(x - \tfrac{1}{2}\alpha,\, y - \tfrac{1}{2}\beta\right),$$
and the asterisk denotes complex conjugation. The WD is similarly defined in terms of the Fourier transform of f(x, y) as
$$\begin{aligned} W_f(x, y, u, v) &= \frac{1}{4\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} F\left(u + \tfrac{1}{2}\eta,\, v + \tfrac{1}{2}\xi\right) F^{*}\left(u - \tfrac{1}{2}\eta,\, v - \tfrac{1}{2}\xi\right) \exp\{j(\eta x + \xi y)\}\, d\eta\, d\xi \\ &= \frac{1}{4\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} S_f(u, v, \eta, \xi) \exp\{j(\eta x + \xi y)\}\, d\eta\, d\xi, \end{aligned} \qquad (2)$$
where
$$S_f(u, v, \eta, \xi) = F\left(u + \tfrac{1}{2}\eta,\, v + \tfrac{1}{2}\xi\right) F^{*}\left(u - \tfrac{1}{2}\eta,\, v - \tfrac{1}{2}\xi\right),$$
and by convention the familiar 2-D Fourier transform and inverse transform pair is defined by
$$F(u, v) = \mathcal{F}[f(x, y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y) \exp\{-j(xu + yv)\}\, dx\, dy, \qquad (3)$$
and
$$f(x, y) = \mathcal{F}^{-1}[F(u, v)] = \frac{1}{4\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} F(u, v) \exp\{j(xu + yv)\}\, du\, dv. \qquad (4)$$
The following properties of the WD are especially relevant to (2-D) image processing applications.
P1: $W_f(x, y, u, v)$ is a strictly real-valued function.
P2: For real f(x, y), $W_f(x, y, u, v) = W_f(x, y, -u, -v)$.
P3: $\frac{1}{4\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} W_f(x, y, u, v)\, du\, dv = |f(x, y)|^2$.
P4: $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} W_f(x, y, u, v)\, dx\, dy = |F(u, v)|^2$.
P5: If $g(x, y) = f(x - x_0, y - y_0)$, then $W_g(x, y, u, v) = W_f(x - x_0, y - y_0, u, v)$. (Shift property)
P6: If $g(x, y) = f(x, y)\exp\{j(x u_0 + y v_0)\}$, then $W_g(x, y, u, v) = W_f(x, y, u - u_0, v - v_0)$. (Modulation property)
P7: If $f(x, y) = g(x, y) * h(x, y)$, then $W_f(x, y, u, v) = W_g(x, y, u, v) \circledast_{s} W_h(x, y, u, v)$, where '$\circledast_{s}$' denotes convolution w.r.t. the spatial variables x, y. (Convolution property)
P8: If $f(x, y) = g(x, y)\, h(x, y)$, then $W_f(x, y, u, v) = \frac{1}{4\pi^2} W_g(x, y, u, v) \circledast_{sf} W_h(x, y, u, v)$, where '$\circledast_{sf}$' denotes convolution w.r.t. the spatial-frequency variables u, v. (Windowing property)
Property 1 implies that the WD, unlike the Fourier transform, has no phase associated with it. Yet, the WD, like the Fourier transform, is a reversible transformation. (Actually, given its WD, an image can only be recovered to within a minus sign.) Thus 'Fourier phase information' is implicitly encoded in the WD. This observation is relevant, because, as Oppenheim and Lim [13] have shown, much information is contained in the Fourier phase function of most ordinary images. Though the WD has an obvious interpretation as defining a local power spectrum at each image point, the analogy with the familiar power spectrum is not complete, since the WD can in fact attain negative values. This should come as no surprise, since the WD in effect tries to extract more information in the combined spatial/spatial-frequency domains than is actually available. For further discussion of this inherent 'uncertainty principle', see De Bruijn [8].
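To make Def. (1) concrete, the following is a minimal Python/NumPy sketch of a windowed, discrete approximation to the WD at a single image point. It is not the implementation used to produce the results in this paper. The function names (`local_correlation`, `pseudo_wigner_at`), the square window of half-width `half`, and the use of integer lags (so that the true lag is twice the lag index, which rescales the frequency axes by a factor of two) are assumptions made for illustration only.

```python
import numpy as np


def local_correlation(img, row, col, half):
    """Discrete analogue of R_f: f(p + lag) * conj(f(p - lag)) for integer lags.

    Integer lags in [-half, half] stand in for (alpha/2, beta/2) of Def. (1).
    """
    win = img[row - half:row + half + 1, col - half:col + half + 1]
    # Reversing both axes puts f(p - lag) at the array position of f(p + lag).
    return win * np.conj(win[::-1, ::-1])


def pseudo_wigner_at(img, row, col, half):
    """Windowed WD at one pixel: 2-D DFT of the local correlation over the lags."""
    corr = local_correlation(img, row, col, half)
    # ifftshift moves zero lag to index [0, 0]; fftshift centres zero frequency.
    return np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(corr)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.standard_normal((80, 80)) + 1j * rng.standard_normal((80, 80))
    W = pseudo_wigner_at(img, 40, 40, half=31)  # 63 x 63 window
    # Discrete counterparts of Property 1 (real-valued) and Property 3
    # (the frequency average recovers |f|^2 at the point).
    print("max |Im W|:", np.abs(W.imag).max())
    print("mean of W:", W.real.mean(), " |f|^2:", abs(img[40, 40]) ** 2)
```

Because the local correlation is conjugate-symmetric in the lags, its DFT is real up to round-off even for a complex-valued image, which is the discrete counterpart of Property 1.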
2. Conformal mapping of the WD spatial-frequency domain
Complex-logarithmic (CL) conformal mapping has been advocated by machine vision researchers as a means for achieving rotation and scale invariance about a single image point [15,19] or about the origin of the Fourier power spectrum [4]. Let's review briefly the mathematics involved in such a transformation. Assume a function f(x, y), defined over the cartesian coordinates x, y, $-\infty < x < \infty$, $-\infty < y < \infty$. We choose to represent points in the cartesian plane by (x, y) = (Re(z), Im(z)), where z = x + jy. Thus, we can write
$$z = r \exp\{j\theta\}, \qquad (5)$$
where
$$r = |z| = \sqrt{x^2 + y^2},$$
and
$$\theta = \arg(z) = \arctan(y/x).$$
Now the CL mapping is simply the conformal mapping of points z onto points w defined by
$$w = \ln(z) = \ln(r \exp\{j\theta\}) = \ln r + j\theta. \qquad (6)$$
Thus, points in the target domain are given by $(\ln r, \theta) = (\mathrm{Re}(w), \mathrm{Im}(w))$. The effect of this mapping is shown in Fig. 1. Both logarithmically spaced concentric rings and radials of uniform angular spacing are mapped into uniformly spaced straight lines. More generally, after CL mapping, rotation (about the origin) and scaling in the cartesian domain correspond to simple linear shifts in the $\theta$ and $\ln r$ directions, respectively.
Now using the CL mapping described above, one conformally maps the (u, v) coordinate plane of the WD, performing this mapping w.r.t. the origin (u, v) = (0, 0). Then, according to the principle of invariance discussed above, any linear transformation T(x, y) on the image domain will affect the WD as follows: For point (x, y) and a given linear transformation T, there is a corresponding point $(\hat{x}, \hat{y})$ such that $(\hat{x}, \hat{y}) = T(x, y)$. Now for both spatial domains, before and after the linear transformation T, one can compute the corresponding Wigner distributions $W_f(x, y, u, v)$ and $W_{Tf}(\hat{x}, \hat{y}, u, v)$, and these distributions will be related as follows:
$$W_{Tf}(\hat{x}, \hat{y}, r, \theta) = W_f(x, y, kr, \theta - \phi), \qquad (7)$$
or equivalently,
$$W_{Tf}(\hat{x}, \hat{y}, e^{\ln r}, \theta) = W_f(x, y, e^{\ln r + \ln k}, \theta - \phi), \qquad (8)$$
where $r = (u^2 + v^2)^{1/2}$, $\theta = \arctan(v/u)$, k is the image scale factor, and $\phi$ is the image rotation angle. If $W_{Tf}$ and $W_f$ are plotted with respect to an orthogonal coordinate system whose axes are $\ln r$ and $\theta$, then the resulting conformally mapped WDs, $\widetilde{W}_{Tf}$ and $\widetilde{W}_f$, are related by
$$\widetilde{W}_{Tf}(\hat{x}, \hat{y}, \ln r, \theta) = \widetilde{W}_f(x, y, \ln r + \ln k, \theta - \phi). \qquad (9)$$
Therefore all local spectra constituting the WD of an image can be made invariant, within a linear shift, to translation, rotation, and scaling of that image.
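As a quick numerical check of the mapping in Eq. (6), the sketch below applies w = ln(z) to a single frequency-plane point before and after scaling and rotation about the origin. The helper names `cl_map` and `scale_and_rotate` are hypothetical. The sign of the resulting shifts depends on whether the transformation is taken to act on the image or on the frequency plane, so only the fact that the effect is a pure shift should be read from this example.

```python
import cmath
import math


def cl_map(u, v):
    """Map a point (u, v) != (0, 0) to (ln r, theta) via w = ln(u + jv)."""
    w = cmath.log(complex(u, v))
    return w.real, w.imag


def scale_and_rotate(u, v, k, phi):
    """Scale by k and rotate by phi (radians) about the origin."""
    z = k * cmath.exp(1j * phi) * complex(u, v)
    return z.real, z.imag


if __name__ == "__main__":
    k, phi = 2.0, math.radians(30.0)
    ln_r0, theta0 = cl_map(3.0, 1.0)
    ln_r1, theta1 = cl_map(*scale_and_rotate(3.0, 1.0, k, phi))
    # Scaling appears as a shift of ln k along ln r, rotation as a shift of phi
    # along theta (modulo 2*pi, since theta wraps at +/- pi).
    print(ln_r1 - ln_r0, math.log(k))
    print(theta1 - theta0, phi)
```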
2.1. Examples of CL-mapped WDs
Figs. 2, 3, 4 show the 'local' correlation function, the WD and the CL (conformal) mapped WD of a circle of radius R = 8, a circle of radius R = 16, and an incomplete circle of radius R = 8, respectively. The correlation functions and the WDs were computed from computer generated images over a 63 × 63 window using a discrete version of the WD. Since the images used were spatially limited, the graphical plots closely approximate in form the continuous WD over the range of spatial frequencies displayed. The CL (conformal) mapping of the WD was implemented using bilinear interpolation. As seen by comparing Fig. 2c and 3c, the conformally mapped WDs are indeed invariant, within a linear shift, to scaling. Furthermore, Fig. 4c illustrates how the conformally mapped WD can be used to detect a shape (as a whole) even when significant parts of the contour are missing (blurred).
Fig. 1. The complex-logarithmic mapping of the cartesian half-plane.
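The CL mapping with bilinear interpolation mentioned above can be sketched as follows. This is an illustrative NumPy version, not the code used for Figs. 2 to 4: the grid sizes `n_r` and `n_theta`, the inner radius `r_min`, the helper names, and the synthetic `ring` pattern (Gaussian in ln r) are all assumptions introduced for the demonstration.

```python
import numpy as np


def bilinear_sample(img, rows, cols):
    """Bilinear interpolation of img at fractional (rows, cols), clipped to the array."""
    rows = np.clip(rows, 0.0, img.shape[0] - 1.0)
    cols = np.clip(cols, 0.0, img.shape[1] - 1.0)
    r0 = np.minimum(np.floor(rows).astype(int), img.shape[0] - 2)
    c0 = np.minimum(np.floor(cols).astype(int), img.shape[1] - 2)
    fr, fc = rows - r0, cols - c0
    top = img[r0, c0] * (1 - fc) + img[r0, c0 + 1] * fc
    bottom = img[r0 + 1, c0] * (1 - fc) + img[r0 + 1, c0 + 1] * fc
    return top * (1 - fr) + bottom * fr


def cl_mapped(spectrum, n_r=64, n_theta=64, r_min=1.0):
    """Resample a centred (u, v) array onto a uniform (ln r, theta) grid."""
    cy = (spectrum.shape[0] - 1) / 2.0
    cx = (spectrum.shape[1] - 1) / 2.0
    ln_r = np.linspace(np.log(r_min), np.log(min(cy, cx)), n_r)
    theta = np.linspace(-np.pi, np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(np.exp(ln_r), theta, indexing="ij")
    mapped = bilinear_sample(spectrum, cy + rr * np.sin(tt), cx + rr * np.cos(tt))
    return mapped, ln_r, theta


def ring(size, radius, width=0.4):
    """Synthetic test pattern (hypothetical): a ring that is Gaussian in ln r."""
    c = (size - 1) / 2.0
    v, u = np.mgrid[0:size, 0:size] - c
    r = np.hypot(u, v)
    r[r == 0] = 1e-6  # avoid log(0) at the centre sample
    return np.exp(-0.5 * (np.log(r / radius) / width) ** 2)


if __name__ == "__main__":
    small, ln_r, _ = cl_mapped(ring(63, 6.0))
    large, _, _ = cl_mapped(ring(63, 12.0))
    step = ln_r[1] - ln_r[0]
    shift = large.mean(axis=1).argmax() - small.mean(axis=1).argmax()
    # Doubling the radius should move the pattern by about ln(2) / step rows.
    print("measured shift (rows):", shift, " predicted:", np.log(2.0) / step)
```

Sampling uniformly in ln r means that a change of scale moves the pattern along the first axis without changing its shape, which is the behaviour compared between Fig. 2c and Fig. 3c.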
Fig. 2. (a) The local correlation function, (b) the WD at the center of a circle of radius R = 8, (c) the CL mapped WD.
Fig. 3. (a) The local correlation function, (b) the WD at the center of a circle of radius R = 16, (c) the CL mapped WD.
Fig. 4. (a) The local correlation function, (b) the WD at the center of an incomplete (blurred) circle of radius R = 8, (c) the CL mapped WD.
3. Binocular vision and motion analysis
Binocular vision, (object) motion analysis, and monocular depth perception via head-movement parallax are all based on the ability to extract disparity information from two images (distinguished with respect to space or time), where disparity is a measure of the relative displacement between 'corresponding' points in the two images. Though the process of extracting disparity information is conceptually simple, in practice a solution to the so-called 'correspondence problem' has been elusive. Nevertheless, a remarkably successful theory for binocular vision in humans was recently proposed by Marr and Poggio [12] and implemented by Grimson [9].
Recognizing the inadequacy of previous techniques for solving the correspondence problem, Marr and Poggio developed their theory (a) to be consistent with neurophysiological and psychophysical knowledge of binocular vision in humans and other primates, and (b) to be formulated in a manner that makes implementation and testing of the theory possible. In regard to the former point, Marr and Poggio were influenced by research that suggests that independent spatial-frequency channels underlie the ability of humans to extract disparity information from stereo image pairs. Concerning the second goal above, Marr and Poggio successfully developed an explicit theory for disparity processing that employs multiple 'spatial-frequency channels' and which basically works as follows: (1) filter both right and left images using masks of four different sizes to yield four smoothed versions of each of the original images; (2) find the zero crossings in each of the filtered images derived in Step (1); and (3) seek and record matches between zero crossings in each pair of filtered images. Two especially important points were raised by Marr and Poggio. First, by employing multiple 'spatial-frequency channels' (the four filtered images in their model), the ambiguity of matches between points in the two images can be reduced. Second, features of the 'spatial-frequency channels' upon which the matching process is based (zero crossings in their model) must be locally defined.
We concur with Marr and Poggio's general approach to disparity processing as outlined above. Nevertheless, we suggest that, rather
than using spatial-frequency-based zero-crossing matches to extract disparity information, one should instead seek matches between the invariant local spectra (ILS's) constituting the respective CL-mapped CPWD (Composite Pseudo Wigner distribution¹) of the stereo image pairs. Such ILS-based disparity matching (a) can be performed using standard correlation techniques, (b) incorporates multiple spatial-frequency channels, and (c) is based upon local image features (the ILS's), where 'locality' is of course 'greatest' for the highest spatial frequencies encoded in the ILS's. In short, an ILS-based disparity processing model would rigorously satisfy the major criteria set forth by Marr and Poggio.
So far we have only discussed the extraction of disparity information from stereo image pairs. However, this is apparently only the very beginning of the stereo vision process [16]. That is, when one generates raw disparity information, using for example the model of Marr and Poggio, one simply assigns one scalar value (a disparity) to each visible² point in a scene, where disparity is measured along a direction parallel to an imaginary line connecting the two eyes (cameras) that recorded the pair of images. The disparity function so obtained is merely a scalar function of two spatial dimensions; like each of the two original images from which it was derived, the disparity function will yield useful information only if it is subjected to further processing. We therefore propose that a 2-D disparity function, like an ordinary (brightness) intensity function, should be transformed into an invariant representation by computing its CL-mapped CPWD. Objects can then be recognized by matching the ILS's of the function (intensity or disparity) with those in memory.
¹ The WD involves integrals over infinite bounds; therefore, it is not a computable function in practice. As a consequence, Claasen and Mecklenbräuker [6] defined the Pseudo Wigner distribution (PWD) for one-dimensional signals. When extended to two-dimensional signals, the PWD definition consists of using bounded integration in Def. (1). This results in a PWD which, relative to the true WD, is smoothed with respect to the spatial-frequency domain only. As discussed in [10], for both computational and theoretical reasons, such a definition is not well suited to image processing applications, and a new definition, the 'Composite PWD' (CPWD), is suggested. The CPWD employs, as an intermediate step, the computation of a patchwork of spatially limited Fourier transforms covering the entire visual field; then, based on these Fourier transforms, Def. (2) using bounded integration is employed to generate a patchwork of 'generalized PWDs' which, taken together, constitute the CPWD. As compared to the true WD, the CPWD is smoothed with respect to both the spatial and spatial-frequency domains.
² Of course some points may be unique to one of the two images; hence at such points the disparity is not defined.
Psychophysical evidence does, in fact, support the notion that, insofar as object recognition is concerned, the human visual system treats disparity functions very much like intensity functions. This is clearly illustrated by random dot stereograms, pioneered by Julesz [11], which demonstrate that objects can be readily identified based on disparity information even when monocular cues are completely absent. Furthermore, as initially suggested by Tyler [17] and more recently corroborated by Schumer and Ganz [16], the disparity function, once obtained within the human brain, is apparently first subjected to spatial-frequency analysis, after which pattern recognition takes place. This is clearly consistent with our proposal that the CL-mapped CPWD of the disparity function is computed within the cortex, followed by pattern matching of the disparity function ILS's.
It is also well known that shape and depth information can be readily perceived by humans when monocularly viewing appropriately moving random dot fields in the absence of non-motion cues. This applies both to movement of individual 'objects' in a random dot scene [18], and to movement of one's head when viewing a random dot scene under conditions that simulate visual movement parallax [14]. Like stereo vision, motion analysis is based on disparity information, where 'motion disparity' is derived from successive images in time. But while stereo disparity defines a scalar function of two variables, motion disparity defines a complex-valued function of two variables, i.e., a vector field.
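The idea of finding disparity by correlating local spectra can be illustrated with a deliberately simplified sketch. It assumes a purely horizontal, integer-pixel displacement between the two images, uses a plain windowed WD in place of the CPWD, and omits the CL mapping (which is not needed when the two views differ only by translation, cf. Property 5). The function names, window half-width and search range are hypothetical choices, not part of the proposal above.

```python
import numpy as np


def local_spectrum(img, row, col, half):
    """Windowed WD at (row, col): DFT of f(p + lag) * conj(f(p - lag)) over integer lags."""
    win = img[row - half:row + half + 1, col - half:col + half + 1]
    corr = win * np.conj(win[::-1, ::-1])
    # The result is real up to round-off, so only the real part is kept.
    return np.fft.fft2(np.fft.ifftshift(corr)).real


def normalized_correlation(a, b):
    """Zero-mean normalized correlation of two equally shaped arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def match_disparity(left, right, row, col, half=8, max_disp=5):
    """Return the horizontal offset whose right-image spectrum best matches the left one."""
    ref = local_spectrum(left, row, col, half)
    scores = {
        d: normalized_correlation(ref, local_spectrum(right, row, col + d, half))
        for d in range(-max_disp, max_disp + 1)
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.standard_normal((64, 64))   # random-dot style test image
    right = np.roll(left, 3, axis=1)       # same pattern displaced by 3 columns
    best, _ = match_disparity(left, right, 32, 32)
    print("estimated disparity:", best)
```

The same matching loop would apply to two successive frames in time, with the search extended to both image axes so that the result is a displacement vector rather than a scalar.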
In analogy with our suggestions for computing stereo disparity functions, we propose that motion disparity functions can be found by seeking matches between corresponding ILS's of successive light intensity functions. Assuming a motion disparity function has been obtained in this manner, how does one proceed to recognize forms based only on motion information, i.e., optical
flow? To answer this question, recall that the WD was defined for complex-valued images. Also recall that, like the WD of a real-valued image, the WD of a complex-valued image is a strictly real-valued function (Property 1). Therefore, complex-valued images (e.g. motion disparity functions), like scalar-valued images (e.g. intensity and binocular disparity functions), can be transformed into an invariant form by computing their CL-mapped CPWD; this permits moving shapes to be recognized based on their corresponding ILS's. Note, however, that unlike the WD of a real-valued image, the WD of a complex-valued image will not, in general, be symmetric with respect to spatial frequency (recall Property 2). Therefore, the ILS's of complex-valued functions must be defined for a full 360° of orientation if they are to be uniquely specified.
We are not aware of any studies which directly indicate that motion disparity images undergo spatial-frequency analysis in the human visual cortex. Nevertheless, using random dot techniques, Rogers and Graham [14] have recently investigated the human visual system's sensitivity to sinusoidal depth modulations that are specified by motion parallax under monocular viewing conditions. In particular, they obtained perceptual thresholds as a function of spatial frequency of depth modulation which were remarkably similar to thresholds obtained using a similar procedure under motionless stereoscopic viewing conditions. They concluded that the mechanisms responsible for depth from motion parallax and depth from stereopsis are more closely related than had previously been thought. This is interesting in view of our proposal that motion disparity functions and binocular disparity functions have in common that they are each separately transformed to yield their respective CL-mapped CPWDs. In fact, our readers should be able to convince themselves that (a) the WD of the complex-valued disparity function derived from head-movement parallax under monocular viewing of an otherwise stationary scene will be equivalent (within a constant multiplicative factor) to (b) the WD of the real-valued binocular disparity function obtained from stationary viewing of the same scene as in (a), provided that both eyes lie along the line of head movement used in (a).
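The remarks above on Properties 1 and 2 for complex-valued fields can be checked numerically. In the sketch below a displacement field is encoded as dx + j dy (an assumed encoding), and a windowed, integer-lag WD stands in for the CPWD. The test fields, a unit vector whose direction turns steadily with x and its real-valued cosine counterpart, are hypothetical examples chosen so that the asymmetry is easy to see.

```python
import numpy as np


def windowed_wd(field, row, col, half):
    """Windowed WD at one point; zero frequency sits at the array centre."""
    win = field[row - half:row + half + 1, col - half:col + half + 1]
    corr = win * np.conj(win[::-1, ::-1])
    return np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(corr)))


def is_symmetric(W):
    """True if W(u, v) equals W(-u, -v) on the centred, odd-sized grid."""
    return bool(np.allclose(W, W[::-1, ::-1]))


if __name__ == "__main__":
    n, half = 80, 31
    cols = np.arange(n)[None, :] * np.ones((n, 1))
    u0 = 2 * np.pi * 5 / (2 * half + 1)
    flow = np.exp(1j * u0 * cols)   # vector field dx + j*dy whose direction turns with x
    scalar = np.cos(u0 * cols)      # real-valued field with the same spatial frequency

    W_flow = windowed_wd(flow, 40, 40, half)
    W_scalar = windowed_wd(scalar, 40, 40, half)
    print("complex field: max |Im W| =", np.abs(W_flow.imag).max(),
          " symmetric:", is_symmetric(W_flow))
    print("real field:    max |Im W| =", np.abs(W_scalar.imag).max(),
          " symmetric:", is_symmetric(W_scalar))
```

For the complex field the energy sits on one side of the frequency origin only, which is why a half-plane of orientations would not be enough to specify its local spectra.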
4. Conclusions
We conclude this paper by emphasizing that the ILS's constituting the CL-mapped CPWDs of disparate light intensity functions appear to be well suited to the generation of both stereo disparity and motion disparity functions. Furthermore, once such disparity functions have been obtained, their respective CL-mapped CPWDs can be computed, yielding a strictly real-valued function. This then permits 'objects' to be recognized based on the ILS's constituting the respective CL-mapped CPWDs of these disparity functions. Accordingly, the CL-mapped CPWD appears to be universally applicable to object recognition tasks, whether 'objects' are defined by light intensity, stereo disparity, or motion disparity functions.
References
[1] Bartelt, H.O., K.H. Brenner and A.W. Lohmann (1980). The Wigner distribution function and its optical production. Optics Comm. 32, 32-38.
[2] Bastiaans, M.J. (1978). The Wigner distribution function applied to optical signals and systems. Optics Comm. 25, 26-30.
[3] Bastiaans, M.J. (1980). Wigner distribution function and its application to first-order optics. J. Opt. Soc. Am. 69, 1710-1716.
[4] Casasent, D., and D. Psaltis (1975). Position, rotation and scale invariant optical correlation. Applied Optics 15, 1795-1799.
[5] Claasen, T.A.C.M., and W.F.G. Mecklenbräuker (1980). The Wigner distribution -- a tool for time-frequency analysis, Part I: Continuous-time signals. Philips J. Res. 35, 217-250.
[6] Claasen, T.A.C.M., and W.F.G. Mecklenbräuker (1980). The Wigner distribution -- a tool for time-frequency analysis, Part II: Discrete-time signals. Philips J. Res. 35, 276-300.
[7] Claasen, T.A.C.M., and W.F.G. Mecklenbräuker (1980). The Wigner distribution -- a tool for time-frequency analysis, Part III: Relations with other time-frequency signal transformations. Philips J. Res. 35, 372-389.
[8] De Bruijn, N.G. (1967). Uncertainty principles in Fourier analysis. In: O. Shisha, ed., Inequalities. Academic Press, New York, pp. 57-71.
[9] Grimson, W.E.L. (1981). A computer implementation of a theory of human stereo vision. Proc. R. Soc. Lond. B292, 217-253.
[10] Jacobson, L., and H. Wechsler (1982). A new paradigm for computational vision based on the Wigner distribution. TR, EE Dept., Univ. of Minnesota.
[11] Julesz, B. (1960). Binocular depth perception of computer-generated patterns. Bell System Techn. J. 39, 1125-1162.
[12] Marr, D., and T. Poggio (1979). A computational theory of human stereo vision. Proc. R. Soc. Lond. B204, 301-328.
[13] Oppenheim, A.V., and J.S. Lim (1981). The importance of phase in signals. Proc. IEEE 69, 529-541.
[14] Rogers, B., and M. Graham (1979). Motion parallax as an independent cue for depth perception. Perception 8, 125-134.
[15] Schenker, P.S., E.G. Cande, K.M. Wong and W.R. Patterson III (1981). New sensor geometries for image processing: Computer vision in the polar exponential grid. Proc. Pattern Recognition and Image Processing. Dallas, Texas, pp. 1144-1148.
[16] Schumer, R., and L. Ganz (1979). Independent stereoscopic channels for different extents of spatial pooling. Vision Res. 19, 1303-1314.
[17] Tyler, C.W. (1975). Stereoscopic tilt and size after effects. Perception 4, 187-192.
[18] Ullman, S. (1979). The Interpretation of Visual Motion. The MIT Press, Cambridge, MA.
[19] Weiman, C.F.R., and G. Chaikin (1979). Logarithmic spiral grids for image processing and display. Computer Graphics and Image Processing 11, 197-226.