Dynamic motion vision

Joachim Heel
Artificial Intelligence Laboratory, Room 826, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Motion vision deals with the analysis of image sequences acquired by a camera moving relative to an environment. The goal is to recover the motion of the camera as well as the structure of the environment. We model the changing structure of the scene perceived in the camera as a dynamical system. The state of this system is the depth, the distance of a scene point from the camera; the measurement is the optical flow, a vector field of image velocities which can be computed from the image brightness arrays acquired by the camera. We use the dynamical systems model in the construction of a Kalman filter which optimally estimates the depth of a point in the image incrementally over the sequence of frames. By using one Kalman filter per image pixel we are able to recover a dense depth map of the environment, i.e. the scene structure. In every filter iteration we estimate the motion parameters of the camera from the optical flow field using the filter's current depth estimate in a least-squares fashion. The resulting algorithm recovers both the camera motion and the scene structure in a systematic way which could not be accomplished previously. We have implemented this algorithm and tested it on real and synthetic images.
Keywords: Motion vision; Kalman filter; Motion and structure estimation; Optical flow.
Joachim Heel graduated from the University of Karlsruhe in West Germany in 1987 with Diplom degrees in Electrical Engineering and Business Administration. He is currently a doctorate student in computer science at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. His research interests include computer vision and robotics, in particular motion vision. His research experiences include projects at Hewlett-Packard, Siemens, BASF and Matsushita Electric.

Robotics and Autonomous Systems 6 (1990) 297-314
1. Introduction

Once a child of robotics, machine vision has become a research area of its own and has developed many subdivisions in itself. The most recent of these subdivisions is motion vision. Motion vision deals with the processing of sequences of images as they may be obtained by a camera moving through an environment or viewing a scene in which objects are in motion. Not only does this field bear interesting applications such as mobile robots, autonomous vehicles, space robots etc. but it was also hoped that it would provide solutions to vision problems which were not easily solvable with conventional techniques, such as the measurement of scene structure. Initially, motion vision research focused on the problem of computing approximations to the motion field, the projection of three-dimensional velocity fields into the image plane. The approximating vector field is commonly referred to as the optical flow. One can think of it as attaching a vector of instantaneous velocity to each point in the image (pixel) indicating where it appears to be currently moving. The key to these algorithms was the assumption of small motion between the frames of a sequence. The first algorithms for the computation of the optical flow were based on cross-correlation of two subsequent frames of the image sequence (see Anandan [3] for an overview). In their well-known paper [14], Horn and Schunck outlined a totally different approach which may be characterized as "differential" as opposed to the integration-based approaches before. Subsequent work focused on the shortcomings of their predecessors, in particular Hildreth's edge-based optical flow [12], multi-grid algorithms [25] and spatio-temporal filtering [10]. Optical flow computation is expensive and in most cases rather noisy. The optical flow itself, however, is only an intermediate representation of information. The ultimate goals in motion vision might be characterized as:
1. Segmentation: Identify regions in the image which belong to independently moving objects.
2. Passive navigation: Recover the relative motion
between the camera and each independently moving object in the scene.
3. Structure: Recover the structure of the scene.
Many of the existing techniques use the optical flow as input data. These items, especially the last two, are the main focus of current research in the area. A common feature of most of the work mentioned below is the fact that the analyses are based on two subsequent frames out of the sequence. Each such pair is processed in an isolated fashion, neglecting the results of computation on previous images. Segmentation is most often based on optical flow. The general approach to the solution of this problem is to cluster optical flow vectors together if they are consistent with the motion of a single rigid object. The Hough transform is often used in this context. One of the best-known and most practical segmentation algorithms is due to Adiv [2], which is also based on two images. Passive navigation and structure recovery are dual problems in the sense that knowledge of one quantity makes the computation of the other rather simple. With the exception of only a few ([19,15,16]), all of the methods for structure from motion (as it is commonly referred to) use the optical flow or the matching of features (precisely localizable image attributes such as corners) as an input. Examples of such methods which use only two frames may be found in [8,26,17,28] and others. A recent focus of interest has been how information from an entire sequence can be used rather than processing pairs of images in isolation. Clearly the structure or motion recovered from frames i and i + 1 should not be ignored when processing frame i + 2. The key question becomes how the processing of a sequence can proceed in a systematic way by exploiting the temporal coherence between the frames. A first class of algorithms [5,24,27] uses intuitive or biologically motivated relationships between frames as the basis for multi-frame processing. In the so-called rigidity scheme [27] for instance, Ullman proposed that the motion between frames should be computed in such a way that the recovered structure model requires the least amount of modification, i.e. its rigidity is maximized. Although many of these methods are intuitively very convincing, failure cases are readily constructible and their performance on real images has not been satisfactory.
Another group of algorithms makes use of multiple frames as a means of accumulating a sufficient number of constraints for the solution of a set of equations for the motion or structure parameters. Examples are [22,21]. These methods do not make explicit use of the temporal relationship between frames. The photogrammetric problem of relative orientation motivated another approach: find the motion parameters which minimize the spatial error of points in the real world corresponding to a given set of extracted features [13,23]. This approach requires the entire sequence of images to be present in the computer's memory and a set of features to have been extracted and matched across the frames. Then a large nonlinear minimization problem over the features can be solved for the motion and structure parameters. In the face of these difficulties, recursive estimation has become popular recently. The idea of iteratively improving an estimate of depth or motion as new frames are acquired by the camera is very appealing since it obviates the need for storage of images which is very expensive in general. The first solutions along this line were due to Stuller and Krishnamurthy [24], and Broida and Chellappa [6,7]. None of these emerged as a major breakthrough because a key point was neglected in the formulation of the approaches: the vital role of the dynamical model. The dynamical models employed were either ad hoc or extremely complex and did not lend themselves to practical implementations. Recently, Matthies, Szeliski and Kanade [18] added a new perspective. Using a simple yet precise dynamical model for temporal changes in depth they implemented a Kalman filter based algorithm for the recovery of structure from real images with promising results. Shortcomings of their approach are the necessity to know the camera motion as well as the restriction that motion may only be translational and parallel to the image plane. Nevertheless, this work has shown that recursive estimation is a practical method for the solution of motion vision problems. We believe that a rigorous formulation of the motion imaging situation in terms of dynamical systems can allow us not only to overcome the restrictions that limit the previous methods for structure and motion recovery, it can also provide a systematic way of addressing other problems in motion vision. In this paper we construct a
dynamical systems model of the motion imaging situation free of any of the previous limitations. We use this model as the basis for a Kalman filter which optimally estimates the scene structure at every point of the image. We show further how the estimation of motion can be interleaved with the stages of the Kalman filter to provide a truly practical structure and motion algorithm. To verify the practicality of the approach we have conducted experiments on synthetic and real images, the results of which are reported at the end of this paper.
2. Dynamical systems for motion vision

2.1 The motion imaging situation
It was the contribution of Longuet-Higgins and Prazdny [17] to first introduce convenient coordinate systems and descriptions of motion which form the basis of current research in this area. As shown in Fig. 1, a coordinate system is chosen so that its origin coincides with the focal point of the camera and the z-axis coincides with the optical axis. Points in the real world are denoted by capital letters R = [X, Y, Z]^T whereas small letters are used for points in the image r = [x, y, f]^T. Note that the distance f between origin and image plane is constant. It is referred to as the focal length of the camera system. The relationship between objects in the real
world and points in the image is given by the equations of perspective projection

r = [x, y, f]^T = (f/Z) [X, Y, Z]^T = (f/Z) R.   (1)
To eliminate the focal length we use the scaled coordinates

x' = x/f and y' = y/f   (2)
to describe locations in the image plane in what follows and omit the primes. We now turn to the description of motion. Obviously there are two cases, namely a moving camera and a moving object. However, for the observer only the relative motion of the object with respect to the camera is apparent and can be recovered. If you have ever been sitting in a train in the station and another train next to you is pulling out you have a vivid image of this phenomenon. Consequently we will describe the relative motion by one set of motion parameters. We will decompose 3-D motion into translation and rotation. Three parameters are needed for each component. By considering only instantaneous motions the representation becomes very compact. Every 3-D motion can be represented by a translation vector t = [U, V, W]^T and a rotation vector ω = [A, B, C]^T. The vector ω points along the axis of rotation which runs through the origin of the coordinate system and its magnitude is proportional to the angle of rotation.
Fig. 1. Coordinate system and projection geometry in the motion imaging situation.
2.2 A state model of environment depth

If P is a point with position vector R on the object moving with t and ω relative to the camera we have that

Ṙ = −t − ω × R.   (3)
The third component of this vector differential equation is of particular interest. It describes the temporal change in Z, the environment depth of a particular real-world feature. Knowledge of Z for every environment point is equivalent to having a "depth map" structure description of the object being viewed. It would be extremely useful to formulate a dynamical system which describes the temporal behavior of this quantity. More formally we find that the third component of (3) is
Ż = −W − (AY − BX).   (4)
Now we make use of the perspective projection equation (1) to obtain

Ż = −W − (Ay − Bx)Z,   (5)
which is a simple state equation for the depth Z. We see that the image location (x, y) as well as the motion parameters A, B, W are parameters in this formulation. Therefore this dynamical system is particular to the feature of which we are observing the depth. For now we will assume that the motion parameters are known and explain how they may be obtained later.
2.3 A measurement model: optical flow
We have formulated a state equation for the depth Z and have established the basis for the use of an observer/filter scheme for the estimation of this quantity. To complete the dynamic modeling we must formulate the output or measurement equation. In general the measurement in an imaging situation is an array of brightness values E(x, y). An output equation should model the dependency between this measurable quantity E and the state Z. This is quite tedious (see the attempt in [11]) and involves many assumptions about lighting conditions, surface structure and reflectance properties which restrict the generality. We therefore take an alternative approach. As mentioned in the introduction, the estimation of optical flow is a problem which has been thoroughly considered in the vision research community. In this process, a displacement or velocity vector [u, v]^T is estimated for every image point [x, y]^T given two images as input. We will assume that such an estimation procedure is available so that we interpret y = [u, v]^T as our measurement quantity. Details of the optical flow estimation procedure are described below. The main question is therefore: what is the relationship between the measurable quantities [u, v]^T and the state Z? Observe that [u, v]^T is the velocity of the projection of a point in the real world in the image. Therefore

[u, v]^T = [ẋ, ẏ]^T.   (6)

Now x and y are related to the real world quantities X, Y, Z via the perspective projection equation (1)

x = X/Z and y = Y/Z.   (7)

If we differentiate these equations with respect to time and make use of the equations of object
motion (3) we obtain

u = (−U + xW)/Z + Axy − B(x² + 1) + Cy   (8)
v = (−V + yW)/Z − Bxy + A(y² + 1) − Cx.   (9)
These equations are often referred to as the motion field equations. Notice that, if we assume the motion t = [U, V, W]^T and ω = [A, B, C]^T as well as the location [x, y]^T in the image plane to be parameters, equations (8) and (9) form the output equation of our dynamical system. A block diagram of the complete dynamical model is shown in Fig. 2.

Fig. 2. The dynamical systems model of motion vision (blocks: kinematics, system model, measurement model, motion field).
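To make the use of the motion field equations concrete, the following C sketch evaluates (8), (9) at a scaled image location (x, y) for a given depth Z and relative motion t = [U, V, W]^T, ω = [A, B, C]^T. The struct and function names are illustrative assumptions only; they are not taken from the original implementation.

```c
#include <stdio.h>

/* Relative motion between scene and camera: translation t = (U, V, W) and
   rotation omega = (A, B, C) about an axis through the origin. */
typedef struct { double U, V, W, A, B, C; } Motion;

/* Motion field equations (8), (9): image velocity (u, v) of the point at
   scaled image coordinates (x, y) and depth Z. */
static void motion_field(double x, double y, double Z, const Motion *m,
                         double *u, double *v)
{
    *u = (-m->U + x * m->W) / Z + m->A * x * y - m->B * (x * x + 1.0) + m->C * y;
    *v = (-m->V + y * m->W) / Z - m->B * x * y + m->A * (y * y + 1.0) - m->C * x;
}

int main(void)
{
    Motion m = { 0.0, 0.0, 1.0, 0.0, 0.0, 0.0 };  /* pure translation along the optical axis */
    double u, v;
    motion_field(0.1, -0.2, 5.0, &m, &u, &v);
    printf("u = %f, v = %f\n", u, v);             /* flow points away from the focus of expansion */
    return 0;
}
```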
2.4 Details of the optical flow measurement

In the previous subsection we have mentioned that the optical flow [u, v]^T can be estimated at every pixel and we have assumed that a procedure that produces these estimates is available. In this subsection we describe in more detail how this can be accomplished and how variances of the measured quantities (essential for the recursive filtering algorithm we will propose later) are to be obtained. Recall first that the optical flow is an estimate of the projection of the 3-D velocity field into the image plane. An optical flow field should therefore contain a vector [u, v]^T for every point in the image describing where this same point is projected in the next image. The main idea that is exploited in the correlation-based or sum-of-squared differences (SSD) optical flow is depicted in Fig. 3. For a given point P = (x, y) in image 1 we wish to determine where it moves to in image 2. We assume two things:
• The interframe displacements do not exceed a certain number of pixels in each dimension.
• The brightness in an area surrounding the point remains approximately unchanged by the motion.
The first assumption leads to the concept of a search area which is an area of pixels in image 2 that we will consider as possible correspondences for P. The second assumption introduces the sum of squared differences. As a measure of how well P corresponds to each candidate point P' in image 2 we will use the difference between a neighborhood surrounding P and the corresponding neighborhood around P'.
Fig. 3. Computation of SSD optical flow.
The measure is computed as the sum of the squared differences of corresponding pixel values over the entire neighborhood. More formally, if we let E(x, y) denote the brightness values at location P = (x, y) in image 1 and E'(x, y) a brightness value in image 2, then we seek for every P = (x, y) in image 1 a P' = (x', y') in image 2 such that

SSD(x', y') = min_{(x', y') ∈ S} Σ_{(Δx, Δy) ∈ N} [E(x + Δx, y + Δy) − E'(x' + Δx, y' + Δy)]².   (10)

N is called the neighborhood, S is referred to as the search area. The vector

[u, v]^T = [x' − x, y' − y]^T   (11)
is called the optical flow vector at point P. The scheme described so far will only determine the displacement to pixel accuracy. This is certainly unsatisfactory for the purpose of estimating depth. To obtain sub-pixel accuracy, we proceed as follows: consider the surface of the sum-of-squared differences. We have samples of this surface at the vertices of our regular image grid given by SSD(x, y) where x, y are integral. Suppose that (x', y') was identified to be the location of the minimum of the samples. In looking for a sub-pixel displacement we are saying that the true minimum of the continuous SSD surface is not at (x', y') but between this point and one
of its neighbors. To find the true minimum, we therefore fit a surface to the given samples and then analytically determine the minimum of this surface. We set out to fit a quadratic surface in x and y to a 3 × 3 neighborhood around the minimum (x', y'). This method was thrown off fairly easily by moderate amounts of noise. We are now using a 1 × 3 neighborhood to fit a one-dimensional quadratic separately in the x and y directions. An example of an SSD surface with the neighborhoods indicated as well as the corresponding interpolation in the x direction is shown in Fig. 4. More formally we fit the quadratic
s(x) = ax² + bx + c   (12)

to the three samples s_{-1}, s_0, s_1 where we have selected the origin to be located at the center of the neighborhood. The coefficients of the best-fitting surface are

a = ½(s_1 + s_{-1}) − s_0   (13)
b = ½(s_1 − s_{-1})   (14)
c = s_0.   (15)

Once these coefficients have been determined we find that the minimum of the surface is located at

x' = −b/2a   (16)

and the value there is

s(x') = c − b²/4a.   (17)
From our formulation, x' should be a value between −1 and 1 and can be used to improve the integral value of the optical flow determined from (10). For the purpose of the recursive estimation procedure which we intend to apply to our dynamical system it is necessary to interpret the measured quantities as stochastic processes. The stochastic nature of the measurements is due to noise which corrupts any measurement of image brightness. The noise in the brightness measurement E propagates through the SSD optical flow algorithm and appears in the values of [u, v]^T. Assuming that the distribution of these quantities is Gaussian the task is to determine the variances. This can be accomplished (see [11]) by using variance propagation on (10) but is rather tedious and therefore omitted here. Under certain assumptions we found that the variances can be approximated as follows

σ_u² ≈ s(x') / (d²s(x')/dx²)  and  σ_v² ≈ s(y') / (d²s(y')/dy²).   (18)

Fig. 4. Subpixel interpolation of the optical flow.
Using our sub-pixel interpolation scheme, these quantities are extremely simple to obtain. The formula for s(x') has already been given in (17) and the curvature is simply

d²s(x')/dx² = 2a.   (19)

Interestingly, the theoretically determined variances correspond closely to "confidence measures" suggested by Anandan [3].

2.5 Discrete time approximations
To use the derived dynamical systems model in the implementation of a recursive filter we will need its discrete time equivalent. Let us adopt the convention x(kT) = x_k where T is the time between frames. We will approximate derivatives by first differences so that

Ż = (Z_{k+1} − Z_k)/T.   (20)
Our state equation now becomes

Z_{k+1} = −TW + (1 − T(A y_k − B x_k)) Z_k   (21)

and the output is simply

u_k = (−U + x_k W)/Z_k + A x_k y_k − B(x_k² + 1) + C y_k   (22)
v_k = (−V + y_k W)/Z_k − B x_k y_k + A(y_k² + 1) − C x_k.   (23)
3. A Kalman filter for structure estimation

3.1 The Kalman filter

We will now exploit the formulation of imaging a scene in relative motion as a dynamical system to estimate depth, i.e. the structure of the object. A Kalman filter is a dynamical system which estimates the value of state variables of another dynamical system from the value of its output variables using the known system dynamics. Most importantly, the estimation proceeds in such a way that the state vector is reconstructed in an optimal way in the presence of Gaussian noise by interpreting the measured quantities as stochastic processes. Due to the nonlinear nature of our dynamical system we make use of the so-called Extended Kalman filter which is explained in detail in [9]. Conceptually, the Kalman filter operates according to the block-diagram of Fig. 5. For our purposes it suffices to note that the filter consists of two stages. The update stage improves the current estimate of the state vector (the depth in our case) in accordance with the incoming measurements. The prediction stage uses the known dynamics of the observed system to project the state expected in the next iteration. The filter also takes the certainty (in the form of variances) of measurements and state estimate into account. To accomplish this, the noise in system and measurement are interpreted as stochastic processes. Our goal is to apply the Kalman filtering methodology to the dynamical system we have constructed for motion vision. Observe that the depth Z is the state of our system but is not contained in the measurable output. This is a perfect task for the filter: estimate the state component Z from the measurement [u, v]^T. We need only apply the filter to our special dynamical system. We discuss the components of the filter separately.
Fig. 5. The Kalman filter principle.
1. Update component. The filter update operates by correcting the current depth estimate Z_k by an amount proportional to the difference between measured and expected optical flow. The proportionality factor (filter gain) depends on the certainty (variance) of both the current depth estimate and the measured optical flow. Simultaneously the variance of the depth estimate is updated. Updating is indicated by the superscript changing from "−" to "+". The update equations are:

Filter gain
K_k = σ_{Z,k}^{2−} c_k^T [c_k c_k^T σ_{Z,k}^{2−} + R_k]^{−1}   (24)

State update
Z_k^+ = Z_k^− + K_k ([u_k, v_k]^T − [u'_k, v'_k]^T)   (25)

Covariance update
σ_{Z,k}^{2+} = (1 − K_k c_k) σ_{Z,k}^{2−}   (26)

We have used the following notation: Z_k and σ_{Z,k}^2 are the depth and its variance, the quantities the filter estimates. Further,

c_k = [−(−U + x_k W)/(Z_k^−)², −(−V + y_k W)/(Z_k^−)²]^T

is the derivative of the motion field (22), (23) with respect to Z evaluated at the predicted depth, and

R_k = diag(σ_u², σ_v²)

is the measurement covariance matrix. Note that we have assumed that the estimates of the optical flow components u and v are independent. Finally, u'_k and v'_k in the state update equation denote the motion field computed according to (22), (23), as opposed to the optical flow (u_k, v_k) measured from the images.

2. Prediction component. In this stage of the filter the current depth estimate is simply plugged into the known dynamic equation of the system to project which depth to expect in the next iteration. The depth variance is adjusted accordingly. Predicted values are indicated by the iteration index k increasing to k + 1. In our case the prediction equations have the following form:

State prediction
Z_{k+1}^− = −TW + [1 − T(A y_k − B x_k)] Z_k^+   (27)

Covariance prediction
σ_{Z,k+1}^{2−} = [1 − T(A y_k − B x_k)]² σ_{Z,k}^{2+}
Recall that we used T to denote the time interval between frames and (x_k, y_k) to denote the image plane location of the point at which we are estimating the depth in iteration k. This is the Kalman filter we will use for depth reconstruction from motion sequences. A block diagram of the filtering algorithm is shown in Fig. 6.

Fig. 6. Conceptual block diagram of the Kalman filter.
If we consider an implementation of this filter we encounter a number of difficulties with the details of the algorithm. These are discussed in the following section.
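As a pointer toward such an implementation, the following C sketch carries out one per-pixel iteration of the update and prediction equations (24)-(27). The type and function names are assumptions for illustration, the measurement covariance is taken to be diagonal as stated above, and no process noise term is included; this is not the original implementation.

```c
/* Relative motion between scene and camera (same convention as before). */
typedef struct { double U, V, W, A, B, C; } Motion;

/* Per-pixel depth filter state: depth estimate Z and its variance P. */
typedef struct { double Z; double P; } DepthEstimate;

/* Motion field (8), (9) at scaled image location (x, y) and depth Z. */
static void motion_field(double x, double y, double Z, const Motion *m,
                         double *u, double *v)
{
    *u = (-m->U + x * m->W) / Z + m->A * x * y - m->B * (x * x + 1.0) + m->C * y;
    *v = (-m->V + y * m->W) / Z - m->B * x * y + m->A * (y * y + 1.0) - m->C * x;
}

/* One iteration of the per-pixel extended Kalman filter (24)-(27): update
   with the measured flow (u, v) and its variances (su2, sv2), then predict
   the depth expected at the next frame after the interval T. */
static void depth_filter_step(DepthEstimate *e, double x, double y,
                              double u, double v, double su2, double sv2,
                              const Motion *m, double T)
{
    double up, vp, cu, cv, s00, s01, s11, det, k0, k1, g;

    /* expected flow u', v' and measurement Jacobian c = d(u,v)/dZ */
    motion_field(x, y, e->Z, m, &up, &vp);
    cu = (m->U - x * m->W) / (e->Z * e->Z);
    cv = (m->V - y * m->W) / (e->Z * e->Z);

    /* innovation covariance S = c P c^T + R, with R = diag(su2, sv2) */
    s00 = cu * cu * e->P + su2;
    s01 = cu * cv * e->P;
    s11 = cv * cv * e->P + sv2;
    det = s00 * s11 - s01 * s01;

    /* filter gain K = P c^T S^-1, equation (24) */
    k0 = e->P * (cu * s11 - cv * s01) / det;
    k1 = e->P * (cv * s00 - cu * s01) / det;

    /* state and covariance update, equations (25), (26) */
    e->Z += k0 * (u - up) + k1 * (v - vp);
    e->P *= 1.0 - (k0 * cu + k1 * cv);

    /* state and covariance prediction, equation (27) */
    g = 1.0 - T * (m->A * y - m->B * x);
    e->Z = -T * m->W + g * e->Z;
    e->P = g * g * e->P;
}
```

Because the state is scalar, the 2 × 2 inversion of the innovation covariance reduces to a few multiplications per pixel, which keeps the per-pixel filter bank cheap.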
3.2 Making the algorithm practical
Let us briefly reflect upon the effect of the technique outlined above. A point P in the environment will appear at P' = (x, y) in the image
plane. The state equation of our system describes how the depth of P will change due to the motion t, ω. The Kalman filter uses this knowledge to reconstruct the depth Z from the optical flow measurement [u, v]^T. Note that the filter produces only the result for Z at P', or in other words: we need one Kalman filter at every pixel in order to obtain a dense depth map.
3.2.1 Depth reinterpolation

This uncovers the main difficulty we are faced with in implementing the technique: the point P' at which the Kalman filter operates also moves in between frames. In practice, this will have the following consequence. Suppose the current depth estimate is stored in an array corresponding to the image pixel array. The predicted value for Z computed at (x, y) is not valid at (x, y) but rather at some (x', y') which is the location to which (x, y) moves due to the relative motion of the object. Fortunately, (x', y') is given to us by the motion field vector [u, v]^T over the time interval T (refer to (22), (23)):

x' = Tu + x,  y' = Tv + y.   (28)
In general, (x', y') need not coincide with the center of a depth array cell. Fig. 7 illustrates what a typical warping of the depth may look like. How can we maintain our depth array after the Kalman filter prediction has warped it in such a way?
Fig. 7. Problems we encounter in warping the depth map.
The basic approach to solving this problem is to consider the computed depth values as samples of a function of two variables taken at the locations (x', y') indicated by the motion field. We resample this surface at the locations of the grid-points by interpolating between the given "samples". Many solutions to this problem are possible (for instance bi-linear and bi-cubic interpolation; see [20,4,1] for some interesting and practically useful techniques), some are discussed in more detail in [11]. We briefly present the scheme we have incorporated into our current implementation which we found to be both efficient and accurate. We compute the value of Z at the grid-point as the weighted average of the estimates which fall into its "vicinity" (neighboring grid cells within some radius).
Z(x', y') = Σ_{i=1}^{n} w_i Z(x_i, y_i).   (29)

If the weights w_i are chosen to be

w_i = (1/d_i²) / Σ_{j=1}^{n} (1/d_j²),   (30)
where d_i is the Euclidean distance of sample (x_i, y_i) from the grid point (x', y'), the scheme has many desirable properties and can be implemented in computational time proportional to the number of samples considered. See the appendix (Section 6) for details. Now we almost have a complete scheme for structure estimation. We observe that the vectors t and ω appear in the filter equations above. Let us for now assume that these parameters are known to us from some source (say some other sensor or the known dynamics of the motion actuator). In this case, our filter has the form shown in Fig. 8.

Fig. 8. First design of the Kalman filter.
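Before turning to the motion parameters, here is a C sketch of the warp-and-reinterpolate step just described. The array dimensions, the eps guard and the function name are assumptions for the sketch, and for simplicity each warped estimate is distributed only over the four surrounding grid points rather than a larger vicinity.

```c
#include <math.h>
#include <string.h>

#define WIDTH  256
#define HEIGHT 256

/* Scatter the predicted depths Zpred, valid at the warped positions
   (x + T*u, y + T*v) of equation (28), onto the regular grid by
   accumulating inverse-distance-squared weights (equations (29), (30)).
   u, v hold the motion field in pixels per frame. */
static void reinterpolate_depth(const double Zpred[HEIGHT][WIDTH],
                                const double u[HEIGHT][WIDTH],
                                const double v[HEIGHT][WIDTH],
                                double T, double Zgrid[HEIGHT][WIDTH])
{
    static double num[HEIGHT][WIDTH], den[HEIGHT][WIDTH];
    const double eps = 1e-6;              /* guards against zero distance */
    int x, y, i, j;

    memset(num, 0, sizeof num);
    memset(den, 0, sizeof den);

    for (y = 0; y < HEIGHT; y++)
        for (x = 0; x < WIDTH; x++) {
            double xp = x + T * u[y][x];  /* warped position (28) */
            double yp = y + T * v[y][x];
            int x0 = (int)floor(xp), y0 = (int)floor(yp);

            for (j = 0; j <= 1; j++)      /* four neighboring grid points */
                for (i = 0; i <= 1; i++) {
                    int gx = x0 + i, gy = y0 + j;
                    double dx, dy, w;
                    if (gx < 0 || gx >= WIDTH || gy < 0 || gy >= HEIGHT)
                        continue;
                    dx = xp - gx;  dy = yp - gy;
                    w = 1.0 / (dx * dx + dy * dy + eps);
                    num[gy][gx] += w * Zpred[y][x];
                    den[gy][gx] += w;
                }
        }

    for (y = 0; y < HEIGHT; y++)
        for (x = 0; x < WIDTH; x++)
            if (den[y][x] > 0.0)          /* keep the old value where no sample landed */
                Zgrid[y][x] = num[y][x] / den[y][x];
}
```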
3.2.2 Motion estimation

So far we have assumed that the motion parameters t and ω are known. In practice this may be the case for a camera mounted on a mobile robot for instance. In many cases, however, the exact motion is unknown and we are in fact interested in estimating it. The problem we face is of the chicken-and-egg type, for if we had an accurate motion estimate, the structure estimation would
be solved at this point and on the other hand if we had an accurate structure estimate, the recovery of motion would be simple. The above conflict is both the problem and the solution. We get an approximation to the real surface structure using the previously described Kalman filter. Using this structure we can compute the motion which best fits this structure and the observed optical flow. This motion in turn can be used in the next iteration of the Kalman filter as an improved estimate of the motion parameters. This takes the form of a least-squares estimation of the motion parameters which will be interleaved with the Kalman filter stages. In essence we obtain an adaptive filter that optimally adjusts the motion parameters while estimating the structure values. Recall the motion field equations that relate the "measurable" optical flow to the motion parameters and depth

u = (−U + xW)/Z + Axy − B(x² + 1) + Cy   (31)
v = (−V + yW)/Z − Bxy + A(y² + 1) − Cx.   (32)
Given at least 3 optical flow measurements (u, v) and corresponding depth values Z we could formulate a least-squares problem for the motion parameters t and ω. However, the depth Z is unknown - it is precisely what the filter is estimating. The key idea is to use the current estimate, in this case Z_k^-, as the depth value. This will of course lead to errors in the motion parameters
which disappear in the course of the estimation procedure. For the least-squares estimation, let us call the optical flow measurements (u_i, v_i), the depth estimates Z_i (which are the output of the prediction stage of the filter) and the locations at which the measurements are taken (x_i, y_i) for i = 1, ..., n. Then we wish to find t and ω such that
E(t, ω) = Σ_{i=1}^{n} (u_i + a_i·t + b_i·ω)² + (v_i + c_i·t + d_i·ω)²   (33)

is minimal, where

a_i = [1/Z_i, 0, −x_i/Z_i]^T,  b_i = [−x_i y_i, x_i² + 1, −y_i]^T,
c_i = [0, 1/Z_i, −y_i/Z_i]^T,  d_i = [−(y_i² + 1), x_i y_i, x_i]^T,

so that u_i + a_i·t + b_i·ω and v_i + c_i·t + d_i·ω are the residuals of the motion field equations (31), (32) evaluated at the depth estimates Z_i. Differentiating with respect to the motion parameters t and ω yields the following linear system of 6 equations as a necessary condition for a minimum

Σ_{i=1}^{n} (a_i a_i^T + c_i c_i^T) t + (a_i b_i^T + c_i d_i^T) ω = −Σ_{i=1}^{n} (u_i a_i + v_i c_i)
Σ_{i=1}^{n} (b_i a_i^T + d_i c_i^T) t + (b_i b_i^T + d_i d_i^T) ω = −Σ_{i=1}^{n} (u_i b_i + v_i d_i).   (34)

Fig. 9. Integrating motion estimation into the Kalman filter.
This system can easily be solved for the desired parameters. We can now replace the mysterious source of the motion parameters in Fig. 8 with this motion estimation procedure. The result is shown in Fig. 9.
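A minimal C sketch of this step is given below. It assembles and solves the 6 × 6 normal equations (34), stacking the vectors a_i, b_i and c_i, d_i of (33) into 6-vectors; the type and function names are hypothetical and the plain Gaussian elimination solver is merely one simple choice for such a small system.

```c
#include <math.h>

/* One flow measurement: image location (x, y), measured flow (u, v) and the
   current depth estimate Z at that location. */
typedef struct { double x, y, u, v, Z; } FlowSample;

/* Solve the 6x6 system M p = r by Gaussian elimination with partial pivoting. */
static void solve6(double M[6][6], double r[6], double p[6])
{
    int i, j, k;
    for (k = 0; k < 6; k++) {
        int piv = k;
        for (i = k + 1; i < 6; i++)
            if (fabs(M[i][k]) > fabs(M[piv][k])) piv = i;
        for (j = 0; j < 6; j++) { double t = M[k][j]; M[k][j] = M[piv][j]; M[piv][j] = t; }
        { double t = r[k]; r[k] = r[piv]; r[piv] = t; }
        for (i = k + 1; i < 6; i++) {
            double f = M[i][k] / M[k][k];
            for (j = k; j < 6; j++) M[i][j] -= f * M[k][j];
            r[i] -= f * r[k];
        }
    }
    for (i = 5; i >= 0; i--) {
        double s = r[i];
        for (j = i + 1; j < 6; j++) s -= M[i][j] * p[j];
        p[i] = s / M[i][i];
    }
}

/* Least-squares motion estimation (33), (34): accumulate the normal equations
   over all flow samples, using the current depth estimates Z_i, and solve for
   the parameter vector p = [U V W A B C]. */
static void estimate_motion(const FlowSample *s, int n, double p[6])
{
    double M[6][6] = {{0}}, r[6] = {0};
    int i, j, k;

    for (i = 0; i < n; i++) {
        double x = s[i].x, y = s[i].y, Z = s[i].Z;
        /* residuals are u + gu.p and v + gv.p, cf. equation (33) */
        double gu[6] = { 1.0 / Z, 0.0, -x / Z, -x * y, x * x + 1.0, -y };
        double gv[6] = { 0.0, 1.0 / Z, -y / Z, -(y * y + 1.0), x * y, x };
        for (j = 0; j < 6; j++) {
            for (k = 0; k < 6; k++)
                M[j][k] += gu[j] * gu[k] + gv[j] * gv[k];
            r[j] -= s[i].u * gu[j] + s[i].v * gv[j];
        }
    }
    solve6(M, r, p);
}
```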
4. Implementation and experiments
4.1 Implementation and experiments To verify the practicality of the proposed approach we have implemented the Dynamic Motion Vision structure and motion estimation from optical flow and present below the results of experiments on synthetic and real images. The implementation was done in C on a Sun 3 computer. All images, optical flow fields and structure plots shown here are screen dumps or file output produced by our program. We subjected our algorithm to three different levels of testing. On the first level we used synthetic optical flow fields so as to suppress the amount of error introduced by the optical flow computation. We expect the algorithm to perform particularly well on such input. In the experiment presented here a sequence of optical fields has been constructed as observed by a an observer
passing over a wedge-shaped structure. A look at the optical flow fields in Fig. I0 confirms the intuitive idea that image points corresponding to the tip of wedge which are closest to the observer have large interframe motion so that the optical flow is largest there. The background was chosen to be fiat so that the optical flow is constant there. The motion was chosen so that the displacements at the tip of the wedge did not exceed 3 pixds. Fig. I0 shows one of the synthetic optical flow fields as well as the structure recovered by the algorithm after 3 iterations (corresponding to 4 images). The total sequence consisted of 10 optical flow fields but the convergence i s extremely fast, i.e. the structure is recovered almost perfectly after only a few iterations and does not change significantly during the remaining iterations. Errors are mainly due to the errors in the unknown motion parameters. We discuss the results of the motion estimation procedure in more detail below. We note however, that with an initially flat structure estimate the initial motion estimate was within 10% of the true motion (recall that motion can be recovered only up to a scale-factor, however, the motion/depth ratio can be used as a measure of quality). This indicates that the motion estimation procedure is quite robust with respect to depth errors. The next step of our experimental testing was to use synthetic images generated by a graphics package by the name of S-Geometry. The package runs on Symbolics Lisp Machines and allows to generate 3-D CAD representations of shapes,
308
J. Heel / Dynamicmotion oision
1
Fig. 10. The synthetic optical flow and the recovered structure for the wedge sequence. select viewing and illumination conditions and then produces the image as seen by the imaginary observer in the chosen setting. We created a rotationaUy symmetric "crater". Fig. 11 shows the
3-D representation of the constructed structure which was used for the generation of the image sequence.

Fig. 11. The structure used by the geometry package to generate the synthetic images for the crater sequence.

The crater's surface was repeatedly subdivided into polygonal regions. The package allows one to randomly perturb the normals of these surface patches by a small amount. This was done to provide variations in brightness over the entire surface of the crater, which is a necessary precondition for the optical flow algorithm to work properly (in other words: regions of uniform brightness lead to garbage optical flow estimates, as can be seen by recalling the procedure for computing SSD optical flow). The first three images of the crater sequence, which consists of 6 images in total, are shown in Fig. 12.

Fig. 12. The first three images of the crater sequence.

In constructing the sequences a constant motion of 50 centimeters between frames was used; the distance to the crater base plane was 10 meters. The latter value was used as the initial value
for the depth estimation (i.e. the initial structure estimate was a plane parallel to the image plane at distance 10 meters). The first three optical flow fields computed for the above sequence are shown in Fig. 13. As we see, the motion and distances have been chosen to limit the maximal displacement between frames to 5 pixels here. Finally, Fig. 14 shows the structure recovered after processing the entire sequence. In comparing it with the initial structure used in the geometric modeler note that the viewing perspectives differ slightly and that only a portion of the crater is visible in the images, so that the crater appears truncated in the recovered structure map.
Fig. 13. The first three optical flow fields from the crater sequence.
Fig. 14. The recovered structure of the crater scene.

The next series of experiments was conducted on real images and designed to test the limits of the algorithm's practicality. A CCD camera with a focal length of 10 mm was moved in pure translational motion over a scene consisting of a small spray bottle on a table before a flat background. The bottle was 730 mm away from the camera, the background was 1000 mm away. Fig. 15 shows the first three images of the sequence used in the bottle sequence and Fig. 16 shows the corresponding optical flow fields. To test the behavior of the motion estimation, a non-uniform motion of the camera was chosen: the camera translated 5 mm between frames except between frames 2 and 3 where the translation was only 2 mm. The distances and motions were chosen so as to limit the components of the optical flow vectors to less than 6 pixels. Fig. 17 shows the relative error in the recovered translational motion. The absolute values of the
translational motion parameters and depth are of course only recovered up to a constant factor which is determined by the choice of the initial value of Z. In all experiments we began with an initially flat depth map of 900 mm (an average of the actual depth in the scene as it may be obtained by a simple range sensor). It may appear strange that the relative error in the estimated motion does not decrease quadratically as usually shown for Kalman filtering schemes. Note, however, that the motion is not estimated by a Kalman filter. The motion estimation is a least-squares procedure which is interlaced with the stages of a Kalman filter. Finally, Fig. 18 shows a complete set of structure estimates as produced by the algorithm from the bottle sequence. We considered this experiment to be a crucial test for the algorithm for the following reasons.
1. The motion was nonuniform and therefore bound to reveal any problems due to the simultaneous estimation of depth and motion.
2. The depth variations in the scene are small. As a consequence the difference between the optical flow on the bottle and on the background would be very small.
3. Due to the absence of texture on the bottle the optical flow there was rather noisy.
The algorithm seems to handle all of these problems reasonably well. Since no assumption about the motion was employed, the non-uniformity was detected, as the plot of the motion parameters in Fig. 17 reveals. We observed that the z-component of the translation vector is particularly sensitive to errors in the optical flow/depth.
Fig. 15. The first three images of the bottle sequence.
Fig. 16. The first three optical flow fields from the bottle sequence.
Fig. 17. (left) The recovered translation V (mm), computed and actual, over the iterations k. (right) The relative error in V (%). The errors in all other motion components were negligibly small.
Fig. 18. The structure recovered after each iteration step of the DMV algorithm, from left to right and top to bottom.
The structure of the bottle, the table on which it stands and the flat background are clearly recovered. As expected the scarcely textured bottle cap's structure is not recovered as accurately. The implementation is also reasonably fast: for the image size of 300 by 300 pixels one iteration takes approximately 2 minutes on a Sun 3/60 if the optical flow has been precomputed. Note, however, that due to the local nature of the Kalman filters this algorithm is ideal for a parallel implementation on a SIMD machine and a considerable speedup should be achievable. A problem we encountered in the course of our experiments arises with motions which involve a focus of expansion in the image. The focus of expansion is a singular case for the structure estimation procedure. A look at the Kalman filter update equation (24) reveals that the filter gain is always zero at the focus of expansion and very small in the vicinity of this point. Hence, the filter never changes the estimate of structure in that area and it remains equal to the initial guess. We are currently conducting theoretical work to determine what the limits of the structure recovery are in this case.

5. Conclusion

We have shown how the theory of dynamical systems can be applied to machine vision. As a result we have obtained a Kalman filter for the recovery of environment structure from image sequences. A motion estimation procedure was interleaved with the stages of the Kalman filter so that our method solves both the structure and motion estimation problem which are considered to be the core problems in motion vision. The most significant feature of this approach, however, is that it is not limited to the application of filters for depth and motion estimation. Once the motion imaging situation has been formulated in terms of a dynamical system other methods from control theory are easily applied. For instance what would it mean to close the feedback loop over one of the dynamical systems presented above? It means that we measure some indication of motion in the image plane and feed it back to control the motion of the camera. This could be used in tracking or some other orientation procedure of the camera carrier such as in mobile robots or aircraft and satellites. Another powerful technique that one could import from control theory is system identification. Sophisticated techniques have been developed to estimate the parameters of dynamical systems which are the constants in the plant and output equations. What does it mean to perform system identification on one of our models? We can use one of these identification procedures to measure parameters in the imaging situation such as the focal length of the camera. It is also possible to formulate the estimation of camera motion as an online parameter identification problem. The main objective in this paper was to introduce a systematic way of dealing with a series of frames in motion vision. We have shown that dynamical systems provide a way for dealing explicitly with time dependency and have formulated the solution to a common motion vision problem in terms of such systems. It is conceivable that this approach can be extended to other problems in which information is acquired over a series of frames such as multiframe edge-detection, segmentation, color, texture etc.

Acknowledgments
Many thanks to Berthold Horn for the many discussions and insights. Tommy Poggio, Hans-Hellmut Nagel, Ellen Hildreth and Jean-Jacques Slotine made valuable comments on earlier versions of the paper, provided me with excellent references on the subject and ensured that the applicability of the algorithms to real images remained the most important concern. I would also like to thank my friends Sundar Narasimhan and Brian Subirana for the inspiring conversations we had about some of the problems in this paper.
References

[1] I.E. Abdou and K.Y. Wong, Analysis of linear interpolation schemes for bi-level image applications, IBM J. Res. Develop. 6 (November 1982).
[2] G. Adiv, Determining 3-d motion and structure from optical flow generated by several moving objects, COINS Technical Report 84-07, University of Massachusetts, Amherst, April 1984.
[3] P. Anandan, Computing dense displacement fields with confidence measures in scenes containing occlusion, COINS Technical Report 84-32, University of Massachusetts, Amherst, December 1984.
[4] R. Bernstein, Digital image processing of earth observation data, IBM J. Res. Develop. (January 1976).
[5] S. Bharwani, E. Riseman and A. Hanson, Refinement of environment depth maps over multiple frames, in: Proc. Workshop on Motion: Representation and Analysis, Charleston, SC (May 1986).
[6] T.J. Broida and R. Chellappa, Estimation of object motion parameters from noisy images, IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-8(1) (January 1986).
[7] T.J. Broida and R. Chellappa, Kinematics and structure of a rigid object from a sequence of noisy images, in: Proc. IEEE Workshop on Motion: Representation and Analysis, Charleston, SC (May 1986).
[8] A.R. Bruss and B.K.P. Horn, Passive navigation, Computer Vision, Graphics, and Image Processing 21 (1983).
[9] A. Gelb (ed.), Applied Optimal Estimation (MIT Press, Cambridge, MA, 1974).
[10] D.J. Heeger, Optical flow using spatiotemporal filters, Internat. J. Computer Vision 1(4) (1987).
[11] J. Heel, Dynamical systems and motion vision, AI Memo 1037, MIT Artificial Intelligence Laboratory, April 1988.
[12] E.C. Hildreth, Computations underlying the measurement of visual motion, Artificial Intelligence 23 (1984).
[13] B.K.P. Horn, Relative orientation, AI Memo 994, MIT Artificial Intelligence Laboratory, September 1987.
[14] B.K.P. Horn and B.G. Schunck, Determining optical flow, Artificial Intelligence 17 (1981).
[15] B.K.P. Horn and E.J. Weldon Jr., Direct methods for recovering motion, Internat. J. Computer Vision 2 (1988).
[16] K. Kanatani, Structure from motion without correspondence: General principle, in: Proc. Image Understanding Workshop, Miami, FL (December 1985).
[17] H.C. Longuet-Higgins and K. Prazdny, The interpretation of a moving retinal image, in: S. Ullman and W. Richards (eds), Image Understanding 1984 (Ablex, 1984).
[18] L. Matthies, R. Szeliski and T. Kanade, Kalman filter-based algorithms for estimating depth from image sequences, in: Proc. DARPA Image Understanding Workshop, Cambridge, MA (April 1988).
[19] S. Negahdaripour and B.K.P. Horn, Direct passive navigation: Analytical solution for planes, in: Proc. IEEE Internat. Conf. Robotics and Automation, San Francisco, CA (April 1986).
[20] S.S. Rifman and D.M. McKinnon, Evaluation of digital correction techniques for ERTS images, Technical Report E74-10792, TRW Systems Group, July 1974.
[21] J.W. Roach and J.K. Aggarwal, Determining the movement of objects from sequences of images, IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-2(6) (November 1980).
[22] H. Shariat and K.E. Price, How to use more than two frames to estimate motion, in: Proc. Workshop on Motion: Representation and Analysis, Charleston, SC (May 1986).
[23] M. Spetsakis and J. Aloimonos, A multi-frame approach to visual motion perception, Technical Report CAR-TR-407, Computer Vision Laboratory, Center for Automation Research, University of Maryland, November 1988.
[24] J. Stuller and G. Krishnamurthy, Kalman filter formulation of low-level television motion estimation, Computer Vision, Graphics and Image Processing 21(2) (February 1983).
[25] D. Terzopoulos, Efficient multiresolution algorithms for lightness, shape from shading, and optical flow, in: Proc. AAAI, University of Texas, Austin (August 1984).
[26] R.Y. Tsai and T.S. Huang, Estimating three-dimensional motion parameters of a rigid planar patch, IEEE Trans. Acoustics, Speech and Signal Processing ASSP-29(6) (December 1981).
[27] S. Ullman, Maximizing rigidity: The incremental recovery of 3-d structure from rigid and rubbery motion, AI Memo 721, MIT Artificial Intelligence Laboratory (June 1983).
[28] A.M. Waxman and S. Ullman, Surface structure and three-dimensional motion from image flow kinematics, Internat. J. Robotics Research 4(3) (1985).
6. Appendix: An efficient pixel interpolation scheme

When computing Kalman filter updates or predictions of depth and variance maps the problem arises that the updated value is no longer valid at a grid-point of the pixel-array because the projection of the point under observation has moved. The proposed solution was to reinterpolate the grid-point values from the warped values. Here, we give a computationally efficient solution for the following problem: Given n points (x_i, y_i) and a function value Z(x_i, y_i) at those points, interpolate Z at a point (x', y'). We will compute the interpolated function value as a weighted sum

Z(x', y') = Σ_{i=1}^{n} w_i Z(x_i, y_i).   (35)

A good weighting function should fulfill the following requirements:
• 0 ≤ w_i ≤ 1.
• Σ_{i=1}^{n} w_i = 1.
• The w_i should decrease as the distance d_i² = (x_i − x')² + (y_i − y')² increases.
We therefore choose the weighting factors to be

w_i = (1/d_i²) / Σ_{j=1}^{n} (1/d_j²).   (36)

This clearly fulfills all of the requirements but some special cases must be considered. Suppose all n estimates involved in the interpolation have equal distance d_i = d from the grid-point. In this case

w_i = (1/d²) / (n · (1/d²)) = 1/n,   (37)

which means that all estimates are weighted equally as one would expect. A more tricky case occurs when some number k of the estimates are actually located at the grid-point, i.e. d_i = 0 for i = 1, ..., k. For estimates on the grid-point we can rearrange the expression for the weights (36) to obtain

w_i = 1 / (Σ_{j=1}^{k} d_i²/d_j² + d_i² Σ_{j=k+1}^{n} 1/d_j²) → 1/k   (38)

as d_i → 0 for i = 1, ..., k. Similarly we rearrange the expression for the weights in the case of an estimate that does not coincide with the grid-point

w_i = (1/d_i²) / (Σ_{j=1}^{k} 1/d_j² + Σ_{j=k+1}^{n} 1/d_j²) → 0   (39)

as d_j → 0 for j = 1, ..., k. In other words we ignore all estimates that are not on the grid-point and obtain the interpolated value as the mean of the estimates located at the grid-point. This may be very easily encoded into an efficient O(n) algorithm which we give in the following pseudocode notation:

k ← 0; W ← 0;
For i ← 1, ..., n
    d_i² ← (x_i − x')² + (y_i − y')²;
    if d_i² = 0 then k ← k + 1 else W ← W + (1/d_i²);
if k > 0 then
    For i ← 1, ..., n
        if d_i² = 0 then w_i ← (1/k) else w_i ← 0;
else
    For i ← 1, ..., n
        w_i ← (1/d_i²)/W;

When dealing with real estimates they will rarely fall directly onto the grid-point and we may wish to replace the test d_i² = 0 in the above algorithm with d_i² < ε where ε is chosen according to the numerical accuracy of our processor.
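For completeness, a direct C transcription of the pseudocode above. The function name, the fixed maximum sample count and the ε tolerance parameter are assumptions of this sketch rather than part of the original implementation.

```c
#include <stdio.h>

/* Interpolate Z at the grid point (xg, yg) from n warped samples
   (x[i], y[i], Z[i]) using the O(n) weighting scheme of the appendix.
   eps replaces the exact test d_i^2 = 0, as suggested in the text. */
static double interpolate_at_grid_point(const double x[], const double y[],
                                        const double Z[], int n,
                                        double xg, double yg, double eps)
{
    double d2[64];           /* assumes n <= 64 for this sketch */
    double Wsum = 0.0, value = 0.0;
    int i, k = 0;

    for (i = 0; i < n; i++) {
        d2[i] = (x[i] - xg) * (x[i] - xg) + (y[i] - yg) * (y[i] - yg);
        if (d2[i] < eps) k++;
        else             Wsum += 1.0 / d2[i];
    }

    if (k > 0) {             /* samples on the grid point: average them, ignore the rest */
        for (i = 0; i < n; i++)
            if (d2[i] < eps) value += Z[i] / k;
    } else {                 /* inverse-distance-squared weights (36) */
        for (i = 0; i < n; i++)
            value += (1.0 / d2[i]) / Wsum * Z[i];
    }
    return value;
}

int main(void)
{
    double x[] = { 0.2, 0.9, -0.4 }, y[] = { 0.1, -0.3, 0.8 };
    double Z[] = { 5.0, 5.5, 4.8 };
    printf("Z at grid point: %f\n",
           interpolate_at_grid_point(x, y, Z, 3, 0.0, 0.0, 1e-12));
    return 0;
}
```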