A SCENE SEGMENTER; VISUAL TRACKING OF MOVING VEHICLES

S.M. SMITH* and J.M. BRADY**

* DRA Chertsey, Chobham Lane, Chertsey, Surrey, England
** Robotics Research Group, Department of Engineering Science, University of Oxford, Oxford, England

Abstract. In this paper the image processing system ASSET (A Scene Segmenter Establishing Tracking) is described. ASSET receives a sequence of video images taken by a possibly moving camera and segments each image into separately moving objects using image motion. The moving objects are tracked, and their outlines are accurately estimated. The ASSET system will provide a useful source of world information, for example, in the area of autonomous vehicle guidance.

Key Words. Image processing; obstacle avoidance; target tracking; automotive driving; obstacle detection

© British Crown Copyright 1993, Defence Research Agency, Farnborough, Hampshire, GU14 6TD, UK

1. INTRODUCTION

This paper describes the ASSET (A Scene Segmenter Establishing Tracking) vision system. ASSET receives a sequence of video images taken by a possibly moving camera and segments each image into separately moving objects using image motion. The system is fast, accurate and reliable, requiring no camera calibration and no knowledge of the camera's motion. The work has been conducted using as a typical real-life situation a vehicle travelling along a road, taking video pictures of what is in front of it. If other vehicles can be seen, they must be "segmented" from the background of the rest of the scene so that their motion can be estimated and the necessary evasive action taken. This situation provides a testbed for the developed program, which is quite general, and ASSET has been tested on other types of sequence containing independent motion. This paper gives a brief review of current image sequence segmentation methods followed by an overview of the ASSET system. Next, short descriptions of each of the sections from which ASSET is made are given. Finally, results of using the ASSET system are shown and discussed.

2. REVIEW OF PAST RESEARCH

A summary of previous work on segmentation of the image flow field is now given. Smith (1992a) gives a more complete review. Spoerri and Ullman (1987) and Waxman and Duncan (1986) segment the image flow field into independently moving objects using local flow discontinuities, giving fairly inaccurate boundaries. Meygret and Thonnat (1990), Thompson (1980), Thompson and Pong (1987), McLauchlan et al. (1992) and Nelson (1991) segment on the basis of simple flow clustering or inconsistency with the background flow. Adiv (1985), Debrunner and Ahuja (1992), Diehl (1990), Burt et al. (1989), Bergen et al. (1990) and Torr and Murray (1992) segment using analytic image transformations; fits are found within the flow field of analytic functions with a number of parameters. Murray and Buxton (1987), Bouthemy and Lalande (1990) and Black (1992) use statistical regularization to perform globally optimal segmentation, utilizing various locally defined cost functionals. Verri et al. (1989) and Rognone et al. (1992) perform segmentation on the basis of global motion clustering. Previous work on segmentation has usually involved making restrictive assumptions about the world, the motion of the camera and so on. ASSET does not depend on this sort of assumption; it is shown later that good results can be obtained without doing so.

Very little work has been undertaken which involves the temporal integration, or tracking, of the segmentation results. Irani et al. (1992) use the segmentation methods of Burt, Bergen et al. with temporal integration of the segmentation results to improve performance. However, this "temporal integration" does not involve any sort of shape tracking or modelling, so that information about scene events is not readily available. Meyer and Bouthemy (1992) use the approach of Bouthemy et al. to achieve segmentation, and then track the moving objects over time. The objects are assumed to be convex, and are matched over time using a polygonal representation. The number of vertices used to track each object is fixed at the instantiation of that object and is not allowed to vary; if the number of vertices becomes unsuitable for accurate representation of the object's image projection then the model must be restarted. Also the scale of the polygon is fixed, so that objects are not allowed to change size in the image. Results are only presented for a static camera.


The research described in this paper covers this area as well as the segmentation; that is, a temporally coherent list of segmented objects is maintained as time proceeds and objects move about in the image.


Fig. 1. Overview of the ASSET system

3. ASSET - OVERVIEW

The ASSET system is built around the method of feature based image motion estimation. The features used are primarily two dimensional, and edges are used to refine the results obtained by using two dimensional feature motion. The main part of ASSET uses two dimensional calculations, not three. This means that feature matching is performed in the image plane, as is object segmentation and tracking. The only area in the main body of ASSET where three dimensional behaviour is acknowledged is in predicting which one of two touching objects will occlude which. The reason for staying in the image plane is that no more information about the world processes can be obtained by working in three dimensions than by staying in the original two. Unless interaction with outside processes and measurements is needed at an early stage in a vision system, there is no need to move to three dimensions until the final interpretation of the processed data.

A brief summary of ASSET is now given, and can be seen graphically in Fig. 1. ASSET is fed a stream of digitized video pictures, usually taken from a moving platform. Each frame taken by the video camera is initially processed to find two dimensional features and edges. The two dimensional feature list is passed to a two frame matcher which matches features from the present frame and the previous one. This list of matches is equivalent to a list of flow vectors. It is passed to a flow segmenter which splits the list of vectors into groups which have similar flow within them and are different to each other.

Next, this list is compared with the temporally filtered list of scene segments, and the filtered list is updated using the newly found segments. This gives good results as it stands, but the boundaries of the segments in the filtered list can be further refined by using image edges. This list of segments which are spatially and temporally significant is finally used to provide information about the motions of objects viewed by the video camera. The three dimensional positions and motions of these objects can be estimated by making simple assumptions about the world. In the case of an autonomous vehicle, this information is of practical value; for example it can enable the avoidance of other vehicles or the following of vehicles.
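As a concrete illustration, this per-frame loop can be sketched in a few lines of Python. Everything below is a hypothetical, much-simplified stand-in for the corresponding ASSET stages (nearest-neighbour pairing in place of the time-symmetric matcher, exact-equality grouping in place of the real flow segmenter), not the original implementation:

```python
def detect_features(frame):
    # Stand-in for the corner detection stage: features are simply
    # (x, y) points supplied with the toy "frame".
    return frame["features"]

def match_two_frames(prev_feats, curr_feats):
    # Stand-in two-frame matcher: pair each current feature with its
    # nearest previous feature, giving (position, flow vector) pairs.
    flows = []
    for p in curr_feats:
        q = min(prev_feats,
                key=lambda f: (f[0] - p[0]) ** 2 + (f[1] - p[1]) ** 2)
        flows.append((p, (p[0] - q[0], p[1] - q[1])))
    return flows

def segment_flow(flows):
    # Stand-in flow segmenter: group features whose flow vectors agree.
    clusters = {}
    for pos, v in flows:
        clusters.setdefault(v, []).append(pos)
    return clusters

def process_frame(prev_feats, frame):
    feats = detect_features(frame)
    return feats, segment_flow(match_two_frames(prev_feats, feats))

# Two toy frames: three features translating right by 2 pixels, one static.
f0 = {"features": [(0, 0), (10, 0), (10, 10), (50, 50)]}
f1 = {"features": [(2, 0), (12, 0), (12, 10), (50, 50)]}
feats, clusters = process_frame(detect_features(f0), f1)
# clusters now separates the moving group of features from the static point.
```

The real system replaces each of these toy stages with the methods described in the following sections.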

4. OPTIC FLOW ESTIMATION

The decision to use image features to find optic flow was made for several reasons. Firstly, optic flow found using two dimensional features should contain as much information about the scene motion as is available, at the places in the image where the process of flow recovery is best conditioned and where the information is most relevant. Directly related to this, flow discontinuities are not smoothed in the proposed method, as they are with most others, and no constraints are necessary in order to find full flow. The use of features to find optic flow naturally leads on to a sensible and simple representation of object shape. Finally, the process of finding image flow is computationally more expensive using other methods of motion estimation than with the feature based method.

The two dimensional features are found using either the SUSAN corner detector (Smith, 1992a, b) or the Plessey corner detector (Harris and Stephens, 1988).

Correctly matching features from one frame to another is the most important part of the image motion estimation. An incorrect match will obviously destroy the value of accurate feature localization. If image motion is smaller than the distance between image features then there is no difficulty in finding a list of correct matches, and the problem is trivial. However, even in the case of stereo matching, where epipolar geometry constrains the possible disparity vectors, ambiguous matching possibilities usually arise. In monocular motion estimation this occurs even more frequently. An intelligent algorithm to disambiguate different possible matches is therefore necessary.

The matching done is based on local operations, using a time-symmetric method (Charnley et al., 1988; Smith, 1992a) which ensures that the feature lists are processed in an order independent way, and that parallelization is simple. Features are matched on the basis of their similarity; the criteria which have been used previously to determine the quality of a match vary greatly in their complexity. Many off-line programs (both stereo and motion) have used correlation of small patches to give a measure of the quality of the match (Barnard and Thompson, 1980; Lawton, 1983; Wang et al., 1990; Shapiro et al., 1992). However, the theoretical justification for using the correlation based matching method is weak if the features are formed from physical corners of objects, where the moving background will make up more than half of the surrounding patch. The method is also in general much more computationally intensive than the one described below.

Another method for matching (for example, used by Charnley et al. (1988)) is to use a smaller amount of information describing the structure of the image to distinguish features. Possible functions suitable for matching include image brightness and local image spatial derivatives. The image functions chosen are typically combined into a vector which is compared with the corresponding vector in the second image. It was found that the best attributes to use were the image brightness (not smoothed) and the x and y components of the position of the USAN centre of gravity. Thus these attributes have been used for matching in ASSET. None of these attributes will be affected by local structure in the background, as they are all derived from information found inside the corner.

Figure 2 shows the set of flow vectors found by the matching stage of ASSET at frame 14 of a sequence taken from ROVA (ROad Vehicle Autonomous) at DRA Chertsey. In this sequence ROVA is travelling forwards and turning slowly to the left with the curve of the road. The ambulance is overtaking ROVA and the Landrover is travelling at almost identical speed to ROVA. The few false matches, particularly in the foreground, are due to the lack of matching validation at this stage. Note the negative divergence in the flow of the ambulance, which can easily be seen by eye.

Fig. 2. An example set of flow vectors found by the two frame matching stage of ASSET

5. FLOW SEGMENTATION

The list of two dimensional flow vectors found by the matching stage is next segmented into regions which have consistent internal flow and which are different from each other. There is no question at this stage of ensuring that regions do not overlap; the regions are found individually, and once a flow vector has been used it is removed from the flow list.
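The grouping idea, clusters defined by the similarity of the flow vectors themselves rather than by image distance, can be sketched as a greedy pass over the flow list. This is a deliberately simplified, hypothetical stand-in for ASSET's actual segmentation (the "not too far apart in the image" gating and the spanning-tree-like construction are omitted):

```python
def cluster_flows(flows, tol=1.5):
    """Greedily group (position, flow) pairs whose flow vectors are
    similar; the running mean flow of each cluster is kept up to date."""
    clusters = []
    for pos, v in flows:
        for c in clusters:
            mx, my = c["mean"]
            if abs(v[0] - mx) <= tol and abs(v[1] - my) <= tol:
                c["members"].append(pos)
                n = len(c["members"])
                # Incremental update of the cluster's mean flow.
                c["mean"] = (mx + (v[0] - mx) / n, my + (v[1] - my) / n)
                break
        else:
            clusters.append({"mean": v, "members": [pos]})
    return clusters

# Two features sharing similar flow, plus one independently moving feature.
vecs = [((0, 0), (5.0, 0.0)), ((3, 1), (5.2, 0.1)), ((40, 40), (-1.0, 0.0))]
groups = cluster_flows(vecs)
```

Each resulting cluster carries its member positions and mean image flow, which is exactly the summary information passed on to the tracking stage.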

The segmentation method used is based on the assumption that the separately moving image regions have "roughly constant" image flow which is different from the flow of adjacent regions. The assumption that the individual regions have roughly constant flow, rather than modelling flow variation across an object, has been found to be good enough for the aims of the ASSET system, as shown in Section 8. An increase in the model complexity and accuracy would be appropriate, for example, in order to find static obstacles, or to succeed in modelling objects with large variation of flow across their visible area, such as an object with large rotation about the axis pointing away from the camera, or an object which nearly fills the field of view and has high velocity in the direction pointing away from the camera.

The actual segmentation following the simple model of uniform motion within each flow cluster is done in a logical way which is similar to the minimum spanning tree (Duda and Hart, 1973), although no spanning tree is actually produced. The "distance" measure between two flow vectors depends simply on the similarity of the vectors themselves, where the similarity is only calculated if the vectors are "not too far" apart in the image. This makes sense because the distribution of the flow vectors will be independent of the motions in the scene, so that the distance between flow vectors should not be used as a good indication of their likelihood to be part of the same object, which could have detailed or bland structure. This distance measure is used to split up the list of vectors into clusters which have "different" mean flow from each other. The list of independent clusters of flow vectors, with their centroids and mean image flows, is passed on to the part of ASSET which tracks clusters over time.

Figure 3 shows the set of clusters found by the flow segmentation stage of ASSET at frame 14 of the ambulance sequence. No cluster has been formed where a flow vector has no neighbours to form one. Note the small errors in the two important clusters, particularly in the Landrover cluster, which should be filtered out over time. The two clusters in the foreground will not appear in the filtered cluster list as "good" clusters (see the following section) because they will not be consistent over time.

Fig. 3. An example set of clusters found by the flow segmentation stage of ASSET

6. CLUSTER TRACKING AND FILTERING

The new list of flow clusters is next matched from the present frame to the main (i.e. filtered) cluster list, using a similar matching procedure to the feature matcher described in Section 4. In this case the attributes used for matching are just the x and y components of the cluster's motion.

The cluster centroid and image velocity are updated using a temporal filter. A weighted sum of the newly measured and expected values is taken to give the predicted value for each of these quantities. The weights change over time so that when a new cluster is initialized the measurements are given less importance relative to the predicted values as time goes on. The shape of the cluster requires more complicated processing, as it is clearly not possible to take an "average" of two shapes which are defined by lists of two dimensional points. The method used is now described. The outline of the cluster is first found by forming a convex hull (Preparata and Shamos, 1985) around the set of points which make up the cluster. Because of the irregular shape of the hull, this representation of the shape is not well suited to averaging between two hulls. Also, at the later stage of shape correction using image edges, the convex hull representation is not easy to work with. In fact, the correction may call for concavities to be represented. Therefore the hull is converted into a radial map. This is an array of distances from the centroid of the hull to the boundary, calculated at equal angle increments; see Fig. 4. In ASSET, 32 radii are used, which is easily enough to match the accuracy with which the convex hull is found.

Fig. 4. Converting a list of image points to a radial map via a convex hull

The radial map representation allows two shapes to be smoothed very easily, using for each radial estimate a similar method to that described for the cluster centroid and motion. The "smoothed" cluster shape is stored in the filtered list and updated at each frame. This overall approach to shape tracking seems much more flexible than that described by Meyer and Bouthemy (1992), which places restrictions on size and shape changes in a tracked object's image projection.

Apart from cluster tracking and filtering, this stage of ASSET also recognizes cluster occlusion, by testing for cluster overlap. If overlap is found, the cluster which will be occluded is predicted using the safe assumption that this cluster will have a higher image position for its lowest point than will the other cluster. This will hold for a "ground based" situation, and could be replaced by more detailed analysis of cluster shape change in a more general situation. Once a cluster has been identified as being occluded, its shape is kept constant as it moves behind the occluding object. Once the object reappears, it is tracked as before.

The filtered list of clusters can be passed to a higher level stage for scene interpretation. Not all clusters in the filtered list are output as reliable objects. All the clusters found in each image are passed to the filtered list, and until they can be discarded as being spurious, they remain in the list. Only clusters which have been seen for longer than a minimum lifetime, and which have been seen for more than a minimum percentage of this lifetime, are counted as reliable objects.

7. BOUNDARY ENHANCEMENT

Boundaries of objects in the real world almost always give rise to image edges. Therefore a section has been built into ASSET which uses image edges found by the SUSAN edge detector (Smith, 1992a) to "fine-tune" the cluster boundaries, so that nearby edges can pull the boundaries into them. This will of course not always be successful, as background edges may confuse the issue. However, in the results presented later, this part of ASSET is a valuable addition to the core processes.

Edge correction of a cluster's boundary only takes place after it has been tracked for a certain number of frames. This is partly to prevent the wastage of computational effort on spurious clusters, and, more importantly, because, until the estimate of the cluster boundary is close to the actual object boundary, edge correction could degrade the object boundary instead of improving it.

This part of ASSET works by taking the list of SUSAN edges and creating a vector field from it which guides the object boundaries towards image edges. A vector field is initialized to (0,0) at every point. Next, each edge point is used to add to the existing field a line of vectors pointing towards that edge point; see Fig. 5. This line of influence extends away from the edge in both directions perpendicular to it.

Fig. 5. The addition to the edge vector field caused by a single edge point

Next, the radial map describing the cluster outline is allowed to "follow the forces" acting at each small section of the map's outline. Each radius is allowed to change its length by a maximum of one pixel (in either direction), depending on the vector "force" on it. This process is repeated ten times. This method was chosen instead of allowing a small number of larger changes to the radii for the following reasons. The procedure allows the shape to change gradually and the radii to affect each other, finding a best global fit to the edge influences. It also solves the problem that the closer a section is to an image edge, the greater the force on it, which would give unstable (or at least unpredictable) oscillations about the correct position if the correction to the radius were made proportional to the force on it.

Finally, the corrected cluster shape is used to improve the original filtered radial map. To provide good temporal stability, a weighted mean of the input to the edge correction algorithm and its output is taken.
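The refinement loop of this section, in which each of the 32 radii is nudged by at most one pixel per iteration over ten iterations, can be sketched as follows. The `force_at` callable is a hypothetical stand-in for sampling the edge vector field radially; the toy field below simply pulls every radius toward an edge lying at radius 20:

```python
def refine_radial_map(radii, force_at, iterations=10):
    """Let each radius of the radial map 'follow the forces': per
    iteration, each radius moves by at most one pixel in the direction
    of the radial edge force acting on it."""
    radii = list(radii)
    for _ in range(iterations):
        radii = [r + max(-1.0, min(1.0, force_at(i, r)))
                 for i, r in enumerate(radii)]
    return radii

# Toy radial force field: an image edge lies at radius 20 in every direction.
force = lambda i, r: 20.0 - r
refined = refine_radial_map([12.0] * 32, force)
# The capped steps let the outline converge on the edge without the
# overshoot or oscillation that a force-proportional step would cause.
```

Note how the one-pixel cap reproduces the stability argument above: however strong the edge force becomes near the edge, the boundary can only creep toward it.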

8. RESULTS

In this section the results of testing ASSET are presented. The image sequences were produced at DRA Chertsey except where indicated otherwise.

Figure 6 shows the tracking of the Landrover in the sequence shown earlier, from the first tracked frame. On the left, the edge correction stage has not been used to "fine-tune" the cluster boundaries. On the right, image edges have been used; they are first used in frame 8. In both cases ASSET is working stably, and the improvement over time of the estimated boundary is evident. By frame 10 the advantage of using edges is clear.

Fig. 6. Seven frames showing the increasing accuracy of the estimated object boundary as time progresses, without (left) and with (right) edge correction

Figure 7 shows frame 17 of the ambulance sequence. The boundaries are by this stage being accurately and stably tracked. The ambulance is object number zero (no identifier) and the Landrover is object number one.

Fig. 7. The output of ASSET at frame 17 of the ambulance sequence

Figure 8 shows sections of frames 35 to 57 of the ambulance sequence. Both vehicles are still being accurately tracked, with the exception of the minor error in the top edge of the Landrover. There is no single isolated image edge corresponding to the top of the Landrover, so there is a slight inaccuracy in the estimation of its boundary here. In frame 37 the ambulance is partially occluded by the Landrover. ASSET shows that it has recognized this occurrence by marking the ambulance's boundary with a darker line. The tracking of the two vehicles is good, with the "unocclusion" of the ambulance being recognized in frame 51.

Fig. 8. Twelve frames showing how ASSET tracks objects before, during and after occlusion

Figure 9 shows a vehicle travelling away from the camera. In this case, the camera is static. This shows ASSET functioning accurately in a situation where there is no background flow.

Fig. 9. The output of ASSET at frame 10 of the Ford sequence

Figures 10 and 11 show an image sequence taken from a moving vehicle at night with an infra-red camera. The quality of the infra-red images is not as good as images taken with normal video cameras; the resolution is not as high, and there are horizontal dark and light stripes superimposed on the image. However, ASSET still functions well; the overtaking vehicle is accurately tracked.

Fig. 10. The output of ASSET at frame 9 of a sequence taken by an infra-red camera at night. The sequence was kindly provided by Pilkington Plc

Fig. 11. The output of ASSET at frame 17 of a sequence taken by an infra-red camera at night

Figure 12 shows a frame from a sequence in which a Landrover fills up a large portion of the image. Even this large object is tracked as a single cluster successfully.

Fig. 12. The output of ASSET, given a sequence taken following a Landrover at close range

Figure 13 shows a frame from a short sequence in which the camera is rotating downwards and the mannequin head is moving to the right. ASSET not only successfully segments out the moving head, but the edge correction stage has found the head's outline well given the image quality, even pulling in the boundary at the neck, having started with a convex description.

Fig. 13. The output of ASSET, given a sequence taken with a rotating camera of a translating mannequin head. The sequence was kindly provided by Philip McLauchlan, Robotics Group, Oxford University

These results show ASSET to be successful in its aims. It has been designed to be a practical processing system; as well as attempting to accurately segment and track moving objects, every effort has been made to minimize the computational time involved in running ASSET. On a SPARC 2 processor ASSET currently runs at a speed of just over one complete frame per second. In the very near future, developments in computing hardware will allow ASSET to run at least an order of magnitude faster. For "real time" use, a processing speed of 10 frames per second is sufficient to give useful information to an autonomous vehicle or other mechanical system.

9. FUTURE WORK AND CONCLUSIONS

In the near future ASSET will be extended to incorporate knowledge of the world so that simple estimates of the three dimensional positions and motions of objects in its field of view can be made. Also it is hoped that more advanced tracking can be developed; for example, sudden changes in object size should be marked as occlusion by the background. Also, the ASSET system will be integrated with a structure-from-motion system which recovers world structure in a static environment. ASSET will be used to segment out and track moving vehicles, and the static part of the scene will then be tracked by the structure-from-motion system.

This paper has presented a system for detecting and tracking moving objects with a moving camera. The system requires no camera calibration or knowledge of camera motion. It performs automatic instantiation of tracking for each new object detected, and any number of objects can be tracked. Occlusion of one object by another is correctly handled. The system is fast, and will soon be implemented on ROVA to give real-time control.

10. REFERENCES

Adiv, G. (1985). Determining three-dimensional motion and structure from optical flow generated by several moving objects. IEEE PAMI, 7(4):384-401.
Barnard, S. and Thompson, W. (1980). Disparity analysis of images. IEEE PAMI, 2(4):333-340.
Bergen, J., Burt, P., Hingorani, R., and Peleg, S. (1990). Computing two motions from three frames. In Proc. 3rd ICCV, pages 27-32.
Black, M. (1992). Combining intensity and motion for incremental segmentation and tracking over long image sequences. In Proc. 2nd ECCV. Springer-Verlag.
Bouthemy, P. and Lalande, P. (1990). Detection and tracking of moving objects based on a statistical regularization method in space and time. In Faugeras, O., editor, Proc. 1st ECCV. Springer-Verlag.
Burt, P., Bergen, J., Hingorani, R., Kolczynski, R., Lee, W., Leung, A., Lubin, J., and Shvaytser, J. (1989). Object tracking with a moving camera; an application of dynamic motion analysis. In IEEE Work. on Vis. Mot., Irvine, CA.
Charnley, D., Harris, C., Pike, M., Sparks, E., and Stephens, M. (1988). The DROID 3D vision system - algorithms for geometric integration. Tech. Rep. 72/88/N488U, Plessey Research Roke Manor.
Debrunner, C. and Ahuja, N. (1992). Motion and structure factorization and segmentation of long multiple motion image sequences. In Proc. 2nd ECCV. Springer-Verlag.
Diehl, N. (1990). Segmentation and motion estimation in image sequences. SPIE 1260: Sensing and Reconstruction of Three-Dim. Objects and Scenes, pages 250-261.
Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Harris, C. and Stephens, M. (1988). A combined corner and edge detector. In Proc. Alvey Vis. Conf., pages 147-151.
Irani, M., Rousso, B., and Peleg, S. (1992). Detecting and tracking multiple moving objects using temporal integration. In Proc. 2nd ECCV. Springer-Verlag.
Lawton, D. (1983). Processing translational motion sequences. CVGIP, 22:116-144.
McLauchlan, P., Reid, I., and Murray, D. (1992). Coarse image motion for saccade control. In Proc. 3rd BMVC.
Meyer, F. and Bouthemy, P. (1992). Region-based tracking in an image sequence. In Proc. 2nd ECCV. Springer-Verlag.
Meygret, A. and Thonnat, M. (1990). Segmentation of optical flow and 3D data for the interpretation of mobile objects. In Proc. 3rd ICCV, pages 238-245.
Murray, D. and Buxton, B. (1987). Scene segmentation from visual motion using global optimization. IEEE PAMI, 9(2):220-228.
Nelson, R. (1991). Qualitative detection of motion by a moving observer. In Proc. Conf. CVPR.
Preparata, F. and Shamos, M. (1985). Computational Geometry. Springer-Verlag, New York.
Rognone, A., Campani, M., and Verri, A. (1992). Identifying multiple motions from optical flow. In Proc. 2nd ECCV. Springer-Verlag.
Shapiro, L., Wang, H., and Brady, J. (1992). A matching and tracking strategy for independently-moving, non-rigid objects. In Proc. 3rd BMVC.
Smith, S. (1992a). Feature Based Image Sequence Understanding. D.Phil. thesis, Robotics Research Group, Department of Engineering Science, Oxford University.
Smith, S. (1992b). A new class of corner finder. In Proc. 3rd BMVC.
Spoerri, A. and Ullman, S. (1987). The early detection of motion boundaries. In Proc. 1st ICCV, pages 209-218.
Thompson, W. (1980). Combining motion and contrast for segmentation. IEEE PAMI, 2(6):543-549.
Thompson, W. and Pong, T.-C. (1987). Detecting moving objects. In Proc. 1st ICCV, pages 201-208.
Torr, P. and Murray, D. (1992). Statistical detection of independent movement from a moving camera. In Proc. 3rd BMVC.
Verri, A., Uras, S., and De Micheli, E. (1989). Motion segmentation from optical flow. In Proc. Alvey Vis. Conf., pages 209-214.
Wang, H., Brady, M., and Page, I. (1990). A fast algorithm for computing optic flow and its implementation on a transputer array. In Proc. 1st BMVC, pages 175-180.
Waxman, A. and Duncan, J. (1986). Binocular image flows: Steps toward stereo-motion fusion. IEEE PAMI, 8(6):715-729.