Signal Processing 15 (1988) 315-334 North-Holland
IMAGE SEGMENTATION BASED ON OBJECT ORIENTED MAPPING PARAMETER ESTIMATION

Michael HÖTTER and Robert THOMA

Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, Universität Hannover, D-3000 Hannover, Fed. Rep. Germany

Received 31 January 1988
Revised 25 April 1988
Abstract. A hierarchically structured segmentation algorithm is presented which is based on the hypothesis that an area to be segmented is defined by a set of uniform motion and position parameters denoted as mapping parameters. Here, eight parameters are used, which can describe any arbitrary three-dimensional motion of planar rigid objects. In a first step, a change detector initializes the segmentation, distinguishing between temporally changed and unchanged regions of two successive fields. Each changed image region is interpreted as one object. The motion and position of each object are described by one set of mapping parameters. Based on the mapping parameters and the temporally preceding field, a reconstruction of the temporally subsequent field can be achieved. In the next step of the hierarchy, those regions of the image which are not correctly described in their mappings are again detected by a change detector and treated like the changed parts of the first step of the hierarchy. Experimental results confirm that the presented segmentation technique separates moving objects from uncovered and covered image regions, detects moving objects in front of moving objects and tracks their motion.

Keywords. Differential motion estimation, change detection, segmentation, model-based image analysis, motion compensating prediction, digital television sequences.

0165-1684/88/$3.50 © 1988, Elsevier Science Publishers B.V. (North-Holland)
1. Introduction

Image segmentation and motion estimation are fundamental problems of dynamic scene analysis which are closely related. In the literature, the use of motion estimation has been proposed for segmenting moving objects by evaluating a displacement vector field [1-3]. On the other hand, image segmenters based on simple change detectors [4, 5, 8] have been applied to improve the motion estimation. The main difficulty of these approaches is the interdependency between motion and segmentation estimation in a scene: for an accurate motion estimation, the object boundaries have to be known, while for a correct segmentation an accurate description of the motion is necessary. Hence, motion estimation and segmentation have to be treated jointly, because they influence each other.

In this contribution a segmentation algorithm is presented that formulates the segmentation task as a hierarchical application of motion and object boundary estimation. This segmentation algorithm is based on the hypothesis that an area to be segmented is defined by a set of uniform motion and position parameters denoted as mapping parameters. In a first step, a change detector initializes the segmentation mask, distinguishing between temporally changed and unchanged regions of two successive fields, k and k+1. Each changed image region is provisionally interpreted as one object. Then, for each object a motion parameter estimation is performed, i.e., the motion and position of each object are described by one set of mapping parameters. By means of the mapping parameters, an examination of the model assumptions is achieved to calculate the boundaries of the objects, i.e., to separate the moving object from the uncovered background within the changed area. The uncovered background and the unchanged region represent the background in the temporally older field and are excluded from the further analysis. The moving objects represent the detected objects of the first step of the hierarchy. If there are moving objects in front of moving objects, the assumption that each changed region consists of
only one moving object is not fulfilled, and hence the mapping description is only correct in a part of the changed region. To handle this problem, the algorithm is applied hierarchically, i.e., those parts of the changed regions that do not obey the mapping description of the first step of the hierarchy are detected and analysed again in further steps of the hierarchy. To detect these parts, the image contents of the temporally older field within the changed regions are first motion-compensated by their mapping parameters. Then, as the second step of the hierarchy, the change detector, using field k+1 and the motion-compensated field k, extracts for further analysis those image parts for which the estimated mappings of the first step of the hierarchy are not valid. The motion-compensated changed regions for which the estimated mappings of the first step of the hierarchy are valid are treated like the unchanged regions of the first step of the hierarchy and excluded from further analysis. Thus, the segmentation mask is hierarchically refined until all objects are described in their mappings and their boundaries.

In Section 2, the structure of the developed segmentation algorithm is described. It consists of a unit of four blocks that are hierarchically applied to a pair of pictures. These four blocks are outlined in Sections 2.1-2.4. In Section 3, the performance of the algorithm is investigated in detail. Applications such as object tracking, detection of uncovered background or background to be covered, and detection of moving objects in front of moving objects are discussed and experimentally investigated. Section 4 summarizes the most significant results and discusses limitations and recommendations for further work.
2. The segmentation algorithm

The segmentation algorithm has to distinguish between stationary background, uncovered image regions, image regions to be covered and moving objects in two successive fields, even if the moving objects are in front of moving objects. For this
purpose a hierarchically structured segmentation is performed, based on the hypothesis that a moving object to be segmented can be defined by mapping parameters. As a first step in this direction, change detection algorithms have been presented [4, 9] that distinguish between temporally changed and unchanged regions in two successive fields, i.e., that separate objects with zero motion (temporally unchanged regions) from objects with non-zero motion (temporally changed regions).

Figure 1 shows the block diagram of the segmentation algorithm, which can be seen as a structure of four interconnected units hierarchically applied to a pair of successive fields. These units are the change detector, the estimator of the mapping parameters, the detector of uncovered background and background to be covered, and the motion compensating predictor. In a first step, the change detector distinguishes between temporally
changed and unchanged regions of two fields, where the unchanged regions are identical to the stationary background. Then, each connected changed region is provisionally interpreted as one moving object. The motion and the position of these objects are described by mapping parameters. Using these parameters, the change detection mask is verified, resulting in a segmentation mask where the moving objects are separated from uncovered background or background to be covered, respectively. Further, the mapping parameters are used to perform a motion compensating prediction in the changed regions. In the next step of the hierarchically structured segmentation, the image regions not correctly described by the mapping parameters are detected by the change detector evaluating the prediction result. These detected regions are treated like the changed regions of the initializing step. This procedure is hierarchically repeated until all
Fig. 1. Block diagram of the hierarchically structured segmentation algorithm. (From the 1st to the N-th step of hierarchy, the change detector with segmentation threshold calculation, the estimation of the mapping parameters of the changed regions, the detection of uncovered background and background to be covered, and the motion compensated prediction of the changed regions are applied to the fields S_k^(i) and S_{k+1}; a memory stores the object parameters. S_k^(i): motion compensated field of the i-th step of hierarchy.)
separately moving objects are described by their mapping parameters. In the following subsections the four blocks of the segmentation algorithm are described in detail.
2.1. Change detector

The aim of the change detection is to distinguish between temporally changed and unchanged regions of two successive fields. The performance of this detector essentially depends on two parameters: first, the choice of a threshold separating the changed from the unchanged luminance pixels, and second, a criterion that eliminates small unchanged regions within changed regions and vice versa. In [4] a change detector is presented where a fixed threshold is used and only singular picture elements are eliminated. However, these two parameters depend on the image contents: the threshold separating the changed from the unchanged luminance pixels is influenced by the image noise, and the elimination of small regions by the sizes of the changed and unchanged regions, respectively. For that reason, the change detection algorithm has been extended so that these thresholds are adapted to the image signal.

2.1.1. The change detection algorithm

The block diagram of the change detector is shown in Fig. 2. In a first step the algorithm calculates pel-wise the frame differences of two successive fields k and k+1. Then, the absolute frame differences are summed up using a measurement window of the size of 3 × 3 picture elements. Comparing the result to a threshold T_ch, the central picture element is assigned either to the changed state (C = 1) if the threshold T_ch is exceeded, or to the unchanged state (C = 0) if the result is below or equal to T_ch. In the next step a median filter is applied, using a measurement window of 5 × 5 picture elements [4]. This filter smoothes the boundaries between the changed and unchanged regions. In the last step of the change detection process, small isolated regions are eliminated. Here, the decision is made whether a region has to be reassigned to the state of the surrounding region. First, the number of changed and unchanged
Fig. 2. Block diagram of the change detector. (Input: frame difference FD; threshold operation |FD| > T_ch with output C = 1: changed, C = 0: unchanged; 5 × 5 median filtering; elimination of small regions; threshold calculation yielding T_ch and the segmentation threshold TS.)
regions, respectively, and the numbers of their picture elements are measured. Then, the numbers of picture elements are sorted in decreasing order in two diagrams, one for the changed and the other for the unchanged regions. By means of the diagrams, the maximum difference of two successive numbers of picture elements is calculated. The mean value of the two successive numbers of picture elements for which the maximum difference has been calculated is used as a threshold to separate dominant regions from small regions. All regions whose sizes are below the calculated threshold are reassigned to the state of the surrounding region. The remaining regions form the change detection mask that is used for further processing.

In an additional block of the change detector the threshold T_ch is calculated. In order to calculate the threshold T_ch, a distinction has to be made between the application for the next step of the segmentation hierarchy and the application for the next successive pair of fields (Fig. 1). The threshold calculation for the hierarchical use of segmentation (threshold T_s) is described in detail in Section 2.2.4, as this threshold depends on the accuracy of the mapping parameter estimation. In the other case, the threshold T_ch is calculated for the next two fields k+1 and k+2, using the change detection mask and the frame differences of the fields k and k+1. As initial threshold the value T_ch = 3/256 has been chosen, where 256 is due to the quantization according to 8 bit per sample. Assuming that the luminance pixels of the unchanged regions of two fields differ due to temporal noise only, the mean squared frame difference in the unchanged regions is a measure of the power of the noise. However, to obtain the necessary threshold as an amplitude in the range 0.0 ≤ T_ch ≤ 255.0, the standard deviation of the frame difference, i.e., the square root of the mean squared frame difference, is calculated and used as the new threshold T_ch. This procedure is repeated every two fields and has been shown to converge very fast, i.e., from the second or third calculation on, the threshold T_ch does not change significantly. In the case of a scene cut, a zoom or a pan, the entire picture content is detected as changed
and the last calculated threshold is stored for further processing. The resulting change detection mask preserves the moving objects in their entirety, as the number and the sizes of the detected regions are controlled automatically by the threshold calculation and the elimination of small regions.
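The change detector can be summarized in a short sketch. The following is a minimal numpy/scipy illustration under simplifying assumptions, not the authors' implementation: the 3 × 3 window sum of absolute frame differences is compared directly with T_ch as in the text, the adapted threshold is taken as the standard deviation of the frame difference in the unchanged regions, and all function names are our own.

```python
# Minimal sketch of the change detector of Section 2.1.1 (assumptions as
# stated above; fields are float luminance arrays of equal shape).
import numpy as np
from scipy.ndimage import label, median_filter, uniform_filter

def change_detect(field_k, field_k1, t_ch):
    fd = field_k1 - field_k                             # pel-wise frame difference
    summed = 9.0 * uniform_filter(np.abs(fd), size=3)   # 3x3 window sum of |FD|
    c1 = summed > t_ch                                  # C = 1 changed, C = 0 unchanged
    c1 = median_filter(c1.astype(np.uint8), size=5).astype(bool)  # smooth boundaries
    mask = eliminate_small_regions(c1)
    # Adapted threshold for the next field pair: standard deviation of the
    # frame difference in the unchanged (noise-only) regions; on a scene
    # cut the whole picture is changed and the old threshold is kept.
    t_next = np.sqrt(np.mean(fd[~mask] ** 2)) if (~mask).any() else t_ch
    return mask, t_next

def eliminate_small_regions(mask):
    """Reassign regions below an automatically derived size threshold."""
    out = mask.copy()
    for state in (True, False):            # changed regions first, then unchanged
        lab, n = label(out == state)
        if n < 2:
            continue
        sizes = np.sort(np.bincount(lab.ravel())[1:])[::-1]  # decreasing sizes
        i = int(np.argmax(sizes[:-1] - sizes[1:]))           # maximum size gap
        t_size = 0.5 * (sizes[i] + sizes[i + 1])             # mean of that pair
        for r in range(1, n + 1):
            if np.sum(lab == r) < t_size:
                out[lab == r] = not state  # flip to the surrounding state
    return out
```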
2.2. The global mapping parameter estimator

The block diagram of the global mapping parameter estimator is shown in Fig. 3. It consists of four blocks: the estimation of the eight mapping parameters, the reliability test of the parameter
Fig. 3. Block diagram of the mapping estimation. (Inputs: luminance data of the fields S_k^(i) and S_{k+1}, the segmentation threshold of the previous step and the parameter vector of the previous iteration. Blocks: estimation of eight mapping parameters for each changed region; reliability test of the parameter vector estimate; model test of the parameter vector estimate; segmentation threshold estimate. Output: the mapping parameter vector of each changed region. S_k^(i): motion compensated field of the i-th step of hierarchy.)
vector estimate, the model test of the parameter vector estimate and the segmentation threshold calculation. First, the estimation of the eight mapping parameters is achieved. Then, the reliability of the estimated parameters is tested and the mapping model is modified if the reliability of the estimate is insufficient. The model test of the parameter vector estimate checks for all luminance pixels the consistency of the mapping model and the image movement. All pixels which do not support the estimated mapping are excluded from the next iteration of the parameter estimation. In the segmentation threshold calculation, the threshold T_s is measured that controls the change detector in the next step of the segmentation hierarchy to detect all luminance pixels which do not obey the mapping of the considered region. These pixels are processed like the changed pixels of the initial segmentation step. First, these blocks are described in detail; then important properties of the whole estimation scheme are discussed.

2.2.1. Estimation of eight mapping parameters
Temporal luminance changes of successive pictures in television sequences essentially originate from motion of objects or motion of the camera. To describe the relation of temporal luminance changes to motion, four models are necessary: the model of mapping the three-dimensional space into the image plane, the motion model, the object model to describe the surface of the objects, and the signal model to approximate the local luminance signal. Spoer [5] and Tsai and Huang [6] derive an eight parameter mapping model, the essential assumptions of which will be briefly discussed in the following. Their signal approximation is extended to a second order Taylor series expansion, adopting a proposal of Bierling [7].

The estimation algorithm of [5, 6] is based on the assumption that the brightness of all physical points in a moving three-dimensional scene does not change temporally. Any changes of luminance in the image plane are thus due to moving objects or motion of the camera. A camera
maps the surfaces of three-dimensional objects of the object space into the two-dimensional image plane. According to the physical properties of the camera, this mapping can be described by central projection (Fig. 4). Consider a particular point P on an object and let (x, y, z) = object-space coordinates of P and (X, Y) = image-space coordinates of P. In the coordinates of Fig. 4 it is obvious that

X = F · x/z,   Y = F · y/z.   (1)
In their motion model it is assumed that the objects undergo translation, rotation and linear deformation. Then

x_n = R · x_{n-1} + T   (2)

holds, with R = rotation and linear deformation matrix, T = translatory motion vector, and x_{n-1}, x_n = coordinates of a point P in the object space before and after motion.
Fig. 4. Illustration of the central projection. (A point with object-space coordinates (x, y, z) is projected through the center of projection onto the image plane at X = F · x/z, Y = F · y/z.)
The object model in [5, 6] restricts considerations to planar rigid objects satisfying the plane equation

ax + by + cz = 1.   (3)
The signal model describes the luminance amplitude of a point (X, Y) as a function of the local coordinates and the mapping parameters. In this paper, the linear signal model of Spoer is extended to the two-dimensional quadratic signal model

S_k(X, Y) = A_1 + A_2 X + A_3 Y + A_4 XY + A_5 X^2 + A_6 Y^2.   (4)
Bierling shows in [7] that for the case of differential displacement estimation it is suitable to apply this signal model, which is based on a second order Taylor series expansion. He proves that the second order derivatives in the Taylor series, corresponding to the quadratic terms of the signal model according to (4), can be substituted by averages of the first order spatial differences of two successive fields. Thus, no second order derivatives have to be calculated. This principle can be generalized to the mapping estimation of [5, 6].

From equations (1)-(3) it follows [5, 6] that the mapping of an arbitrarily moved plane into the image space is described by

(X', Y') = ((a_1 X + a_2 Y + a_3)/(a_7 X + a_8 Y + 1), (a_4 X + a_5 Y + a_6)/(a_7 X + a_8 Y + 1)) =: A(X, Y).   (5)

Assuming that the image plane is infinitely extended, it can be shown [5, 6] that this mapping A(X, Y), considered as a group, satisfies the four group axioms, namely closure, existence of an inverse, existence of an identity, and associativity. The properties of closure and associativity enable the algorithm to estimate iteratively; the resulting mapping parameters consist of a sequence of mapping descriptions, each of which represents a single step of iteration. The existence of the inverse and of the identity permits a motion compensating interpolation or prediction of a field at an arbitrary temporal position.

Attributing all luminance changes to object motion,

S_{k+1}(X, Y) = S_k(X', Y') = S_k(A(X, Y)),   (6)

the frame difference of two successive fields becomes

FD(X, Y) = S_{k+1}(X, Y) - S_k(X, Y) = S_k(A(X, Y)) - S_k(X, Y).   (7)

Using a Taylor series expansion to express the luminance function with respect to the mapping parameters a_i, we obtain

S_k(A(X, Y)) = S_k(X, Y) + Σ_{i=1}^{8} (∂S_k(X, Y)/∂a_i) Δa_i + (1/2) Σ_{i=1}^{8} Σ_{j=1}^{8} (∂S_k(X, Y)/∂a_i)(∂S_k(X, Y)/∂a_j) Δa_i Δa_j + r(X, Y),   (8)

and thus with (7)

FD(X, Y) = Σ_{i=1}^{8} (∂S_k(X, Y)/∂a_i) Δa_i + (1/2) Σ_{i=1}^{8} Σ_{j=1}^{8} (∂S_k(X, Y)/∂a_i)(∂S_k(X, Y)/∂a_j) Δa_i Δa_j + r(X, Y),   (9)

with

Δa = a - e,   e = (1, 0, 0, 0, 1, 0, 0, 0)^T,   a = (a_1, a_2, ..., a_8)^T,

where r(X, Y) denotes the higher order terms of the Taylor series expansion. If the two-dimensional quadratic image model (4) is valid, r(X, Y) equals zero and (9) simplifies to

FD(X, Y) = G_x X Δa_1 + G_x Y Δa_2 + G_x Δa_3 + G_y X Δa_4 + G_y Y Δa_5 + G_y Δa_6 - X (G_x X + G_y Y) Δa_7 - Y (G_x X + G_y Y) Δa_8 =: H · Δa,   (10)
with

G_x = (1/2)(∂S_{k+1}(X, Y)/∂X + ∂S_k(X, Y)/∂X),
G_y = (1/2)(∂S_{k+1}(X, Y)/∂Y + ∂S_k(X, Y)/∂Y),

and Δa, e and a as defined above. The spatial derivatives have to be calculated using the samples of the luminance signal. Adopting a proposal from Cafforio and Rocca [8], the spatial derivatives are approximated as one half of the differences between two adjacent picture elements in the X- and Y-direction, respectively. According to the proposal of [5, 6, 7], the parameter estimation can be formulated as a minimization problem:

E[{FD(X, Y) - F̂D(X, Y)}^2] → Min,   (11)

where FD(X, Y) is the measured and F̂D(X, Y) is the theoretical frame difference due to mapping parameters and model assumptions. The solution of (11) is obtained by the evaluation of (10) at p observation points, resulting in a system of p linear equations:

FD = H · Δa.   (12)

The lines of the matrix H, consisting of the vectors H, and the components of the vector FD are measured according to (10) from the two fields to be evaluated. Solving the overdetermined equation system (12) by linear regression yields the parameter vector Δa as

Δa = (H^T · H)^{-1} · H^T · FD.   (13)

By means of the mapping parameters, a corresponding displacement vector field D(X, Y) can be calculated [5]:

D(X, Y) = (d_x(X, Y), d_y(X, Y)),   (14)

with

d_x(X, Y) = (-a_7 X^2 - a_8 XY + (a_1 - 1) X + a_2 Y + a_3) / (a_7 X + a_8 Y + 1),

d_y(X, Y) = (-a_7 XY - a_8 Y^2 + (a_5 - 1) Y + a_4 X + a_6) / (a_7 X + a_8 Y + 1).

The vectors describe local displacements within the image plane, which are functions of the spatial location (X, Y) and the parameter vector a.
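As an illustration of equations (10)-(14), the following sketch performs one differential estimation step. It is not the authors' code: the fields are assumed to be float arrays indexed as s[y, x], the observation points (at least eight of them, away from the field border) are assumed given, and the gradients follow the half-difference approximation of [8]. All names are our own.

```python
# Minimal sketch of one differential estimation step, eqs. (10)-(14).
import numpy as np

IDENTITY = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)  # e of eq. (9)

def estimate_parameters(s_k, s_k1, points):
    """Least-squares solution (13) of the linear system (12)."""
    rows, fd = [], []
    for y, x in points:               # row y, column x, away from the border
        # G_x, G_y: averaged half-differences of both fields (Section 2.2.1).
        gx = 0.25 * (s_k1[y, x + 1] - s_k1[y, x - 1] + s_k[y, x + 1] - s_k[y, x - 1])
        gy = 0.25 * (s_k1[y + 1, x] - s_k1[y - 1, x] + s_k[y + 1, x] - s_k[y - 1, x])
        X, Y = float(x), float(y)
        rows.append([gx * X, gx * Y, gx, gy * X, gy * Y, gy,
                     -X * (gx * X + gy * Y), -Y * (gx * X + gy * Y)])  # H of (10)
        fd.append(s_k1[y, x] - s_k[y, x])      # measured frame difference
    delta_a, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(fd), rcond=None)
    return IDENTITY + delta_a                  # a = e + delta_a

def displacement(a, X, Y):
    """Displacement vector field D(X, Y) of eq. (14), 0-based parameters."""
    n = a[6] * X + a[7] * Y + 1.0
    dx = (-a[6] * X**2 - a[7] * X * Y + (a[0] - 1.0) * X + a[1] * Y + a[2]) / n
    dy = (-a[6] * X * Y - a[7] * Y**2 + (a[4] - 1.0) * Y + a[3] * X + a[5]) / n
    return dx, dy
```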
2.2.1.1. Motion compensating iteration

As described in detail in [4, 5], a mapping parameter estimate obtained by differential estimation can be far away from the true parameters. This is due to the fact that the actual image signal may differ drastically from the mathematical signal model the algorithm is based on. To overcome this problem, a motion compensating iteration technique is used, as explained in [4, 5, 7]. As the mapping A(X, Y) satisfies the four group axioms [5, 6], an inverse mapping A'(X, Y) exists such that

A(A'(X, Y)) = A'(A(X, Y)) = (X, Y)   (15)

and

S_{k+1}(X, Y) = S_k(A(X, Y))   or   S_{k+1}(A'(X, Y)) = S_k(X, Y)

holds. Therefore, the frame difference FD in (7) is replaced by the displaced frame difference (DFD)

DFD(X, Y) = S_{k+1}(A'(X, Y)) - S_k(X, Y),   (16)

with A(X, Y) = first estimate of the true mapping and A'(X, Y) = the inverse of A(X, Y), such that A'(A(X, Y)) = (X, Y) holds.
Using the DFD, an update

B(X, Y) = ((b_1 X + b_2 Y + b_3)/(b_7 X + b_8 Y + 1), (b_4 X + b_5 Y + b_6)/(b_7 X + b_8 Y + 1))   (17)

is calculated in the second iteration according to

S_{k+1}(A'(X, Y)) = S_k(B(X, Y)).   (18)

All terms belonging to field k+1 have to be taken from the displaced position A'(X, Y), including the spatial derivatives of S_{k+1}(X, Y). The update of the mapping parameters (b_1, b_2, ..., b_8) calculated in this second step is combined with the set of mapping parameters (a_1, a_2, ..., a_8) obtained in the first step of iteration so that the following holds:

S_{k+1}(X, Y) = S_k(C(X, Y))

with

C(X, Y) = B(A(X, Y)) = ((c_1 X + c_2 Y + c_3)/(c_7 X + c_8 Y + 1), (c_4 X + c_5 Y + c_6)/(c_7 X + c_8 Y + 1))   (19)

and

c_1 = (a_1 b_1 + a_4 b_2 + a_7 b_3)/N,   c_2 = (a_2 b_1 + a_5 b_2 + a_8 b_3)/N,
c_3 = (a_3 b_1 + a_6 b_2 + b_3)/N,       c_4 = (a_1 b_4 + a_4 b_5 + a_7 b_6)/N,
c_5 = (a_2 b_4 + a_5 b_5 + a_8 b_6)/N,   c_6 = (a_3 b_4 + a_6 b_5 + b_6)/N,
c_7 = (a_1 b_7 + a_4 b_8 + a_7)/N,       c_8 = (a_2 b_7 + a_5 b_8 + a_8)/N,

with

N = a_3 b_7 + a_6 b_8 + 1.

The parameter vector c = (c_1, c_2, ..., c_8)^T represents the starting point of further iterations. This procedure is repeated until a sufficiently precise estimation is achieved.
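The combination rule (19) amounts to multiplying the homogeneous 3 × 3 matrices of the two mappings and renormalizing. A minimal sketch, assuming the parameter vectors are stored as 0-based numpy arrays of length eight:

```python
# Minimal sketch of the parameter combination (19): C(X, Y) = B(A(X, Y)).
# Text indices are 1-based (a1..a8); array indices here are 0-based.
import numpy as np

def combine(a, b):
    n = a[2] * b[6] + a[5] * b[7] + 1.0        # N = a3*b7 + a6*b8 + 1
    c = np.empty(8)
    c[0] = (a[0] * b[0] + a[3] * b[1] + a[6] * b[2]) / n   # c1
    c[1] = (a[1] * b[0] + a[4] * b[1] + a[7] * b[2]) / n   # c2
    c[2] = (a[2] * b[0] + a[5] * b[1] + b[2]) / n          # c3
    c[3] = (a[0] * b[3] + a[3] * b[4] + a[6] * b[5]) / n   # c4
    c[4] = (a[1] * b[3] + a[4] * b[4] + a[7] * b[5]) / n   # c5
    c[5] = (a[2] * b[3] + a[5] * b[4] + b[5]) / n          # c6
    c[6] = (a[0] * b[6] + a[3] * b[7] + a[6]) / n          # c7
    c[7] = (a[1] * b[6] + a[4] * b[7] + a[7]) / n          # c8
    return c
```

Because the mappings form a group, combine(a, b) is again a valid eight parameter mapping and can serve as the starting point of the next iteration.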
2.2.1.2. Selection of the observation points controlled by the signal statistic

The stability and the convergence velocity of the mapping estimation algorithm are mainly influenced by the position of the observation points. Two different aspects disturbing the estimation are investigated in the following. First, the proposed mapping estimation represents a gradient technique, i.e., it depends on an accurate measurement of the local gradients as well as of the unblurred local frame differences. The smaller the gradient components or the frame differences, the higher the risk that these components are generated by noise. Thus, flat regions, i.e., areas of small local gradients, and regions of small frame differences in successive fields tend to decrease the stability of the algorithm and should be excluded from the linear regression. Second, the proposed mapping estimation method is a global technique, i.e., regions like moving objects which do not support the mapping disturb the mapping parameter measurement and decrease the convergence velocity, and hence should be excluded from the linear regression as well. This task is achieved iteratively, since the recognition of moving areas disturbing the estimation requires knowledge about their motion; this problem is discussed in detail in Section 2.2.3.

First, let us consider the recognition of flat areas and of areas where frame differences are mainly due to noise. Assume that an additive, stationary, zero mean and temporally uncorrelated noise signal is superimposed upon the image signal:

I_k(X, Y) = S_k(X, Y) + N_k(X, Y),
I_{k+1}(X, Y) = S_{k+1}(X, Y) + N_{k+1}(X, Y),   (20)

with S_k, S_{k+1} = the image signals, N_k, N_{k+1} = the additive noise components, and I_k, I_{k+1} = the resulting disturbed image signals. These assumptions generally hold for natural images [9]. With (20), the error of the frame difference measurement becomes

EFD_{k,k+1}(X, Y) = I_{k+1}(X, Y) - I_k(X, Y) - (S_{k+1}(X, Y) - S_k(X, Y)) = N_{k+1}(X, Y) - N_k(X, Y).   (21)

As the noise signals are zero mean and temporally uncorrelated, the variance of EFD results as

σ_EFD^2 = 2 · σ_N^2,   (22)

with σ_N^2 = E[N^2(X, Y)] = variance of the image noise and σ_EFD^2 = E[{N_{k+1}(X, Y) - N_k(X, Y)}^2] = variance of the frame difference noise.

In the same way, the relation between the variance of the gradient measurement error and the image noise can be derived. Assume that an additive, stationary, zero mean and locally uncorrelated noise signal is superimposed upon the image signal:

I(X, Y) = S(X, Y) + N(X, Y),
I(X + ΔX, Y) = S(X + ΔX, Y) + N(X + ΔX, Y),   (23)

with S(X, Y) = the image signal, N(X, Y) = the additive noise component, and I(X, Y) = the resulting disturbed image signal. These assumptions are normally valid for natural images [9]. With (23), the error of the gradient measurement in the X-direction becomes

EDIF(X, Y) = I(X + ΔX, Y) - I(X, Y) - (S(X + ΔX, Y) - S(X, Y)) = N(X + ΔX, Y) - N(X, Y).   (24)

As the noise signals are zero mean and locally uncorrelated, the variance of EDIF results as

σ_EDIF^2 = 2 · σ_N^2,   (25)

with σ_N^2 = E[N^2(X, Y)] = variance of the image noise and σ_EDIF^2 = E[{N(X + ΔX, Y) - N(X, Y)}^2] = variance of the gradient measurement noise. The same relation is valid for the error of the gradient measurement in the Y-direction, respectively. For additive, stationary, zero mean, temporally and locally uncorrelated noise, the relationship between image noise and the error of gradient measurement as well as the relationship between image noise and the error of frame difference measurement can be summed up according to

σ_EDIF^2 = σ_EFD^2 = 2 · σ_N^2.   (26)

This relationship can be used to exclude noisy observation points as follows:

(X, Y) is declared an observation point if FD(X, Y) > √(σ_EFD^2) and G_x(X, Y), G_y(X, Y) > √(σ_EDIF^2);
(X, Y) is no observation point, else.

σ_EFD^2 can be related to the change detector threshold T_ch separating changed and unchanged luminance pixels by

σ_EFD^2 = T_ch^2.   (27)

In the unchanged regions the luminance pixels differ due to noise only; hence, a value of this parameter unequal to zero is caused by noise.
Excluding noisy observation points guarantees two effects: first, the stability of the mapping parameter estimation is improved, and second, the computational load, mainly due to the regression (13), is reduced.
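The selection rule can be stated compactly. The following sketch assumes precomputed arrays fd, gx, gy of equal shape and takes absolute values, which the text leaves implicit; it is an illustration, not the authors' code.

```python
# Minimal sketch of the observation point selection, eqs. (26)-(27):
# since sigma_EFD = sigma_EDIF = T_ch, all three quantities are compared
# with the change detector threshold.
import numpy as np

def select_observation_points(fd, gx, gy, t_ch):
    keep = (np.abs(fd) > t_ch) & (np.abs(gx) > t_ch) & (np.abs(gy) > t_ch)
    return np.argwhere(keep)       # (row, column) pairs of valid points
```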
2.2.2. Reliability test of the parameter vector estimate

An important problem in motion estimation is the question of how to calculate and how to judge the accuracy of the estimate. In [10], these tasks are investigated for gradient based estimation methods: influences of the gradient approximations in local and temporal direction, variations of the optical flow and the conditioning of the equation systems are evaluated and judged by thresholding. In this contribution, the parameter estimate is tested according to the accuracy of the vector field generated from it. First, the reliability test is derived; then, strategies to modify the mapping model are discussed for the case that the reliability of the parameter vector estimate is insufficient. The analysis of the parameter estimation error yields [5]:

- The estimator is unbiased, i.e.,

E[a | a_true] = a_true   (28)

holds, if a_true is the true, known mapping parameter vector.

- The error covariance matrix can be calculated as

σ_a^2 = E[(a - E[a | a_true])(a - E[a | a_true])^T] = E[(H^T · N^{-1} · H)^{-1}],   (29)

with N = covariance matrix of the image noise and H = system matrix of the equation system (12). For locally uncorrelated noise signals with identical variances σ_N^2, (29) becomes

σ_a^2 = σ_N^2 · E[(H^T · H)^{-1}].   (30)

Evaluating equation (30) for a single parameter estimation, the error covariance matrix results as

σ_a^2 = σ_N^2 · (H^T · H)^{-1}.   (31)

Two factors influence the error covariance of the parameter estimation: the conditioning of the equation system, given by the inverse of the system matrix (H^T · H), which is known from the evaluation of the linear regression (13), and the variance σ_N^2 of the image noise. At the initial step, σ_N^2 is calculated according to (22) as

σ_N^2 = (1/2) · σ_EFD^2,   (32)

with √(σ_EFD^2) = threshold to separate changed and unchanged image parts, i.e., as one half of the squared averaged differences of all luminance pixels of the stationary background. In further segmentation steps, σ_EFD^2 is replaced by the squared averaged differences of all luminance pixels of the area on which the actually considered region is lying and which has been motion compensated in previous steps of hierarchy. Thus, measurement inaccuracies of previous steps of segmentation are taken into account: the less accurate the mapping of the initial step, the less accurate a mapping estimation in further segmentation steps can become.

To judge the accuracy of the estimation, the influence of the estimation error on the displacement vector field D(a) corresponding to the parameter vector a is investigated. We create a set of test vectors {b^(j)} consisting of the estimate a and a uniform grid surrounding it. The evaluation of the parameter accuracy yields:

max_{X,Y} {|D(a) - D(b^(j))|}  < t1: estimation accurate,
                               ≥ t1: estimation inaccurate,   (33)

with {X, Y} = set of the coordinates of all pixels belonging to the considered region,

b_i^(j) = a_i + const · √((σ_a^2)_{i,i}),  i = j;   b_i^(j) = a_i,  else,

and

D(a) = displacement vector field corresponding to the mapping parameter vector a,
D(b^(j)) = displacement vector field corresponding to the test parameter vector b^(j).
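A sketch of the reliability test (33) follows, reusing the displacement function of the eq. (14) sketch above. Here cov_diag stands for the diagonal of the error covariance (31); const and t1 are tuning values which the text does not fix numerically, so the defaults are assumptions of this sketch.

```python
# Minimal sketch of the reliability test (33): perturb each parameter by
# a multiple of its standard deviation and check the maximum change of
# the displacement vectors over the region.
import numpy as np

def estimation_accurate(a, cov_diag, coords, t1, const=1.0):
    X = coords[:, 1].astype(float)             # pel columns of the region
    Y = coords[:, 0].astype(float)             # pel rows of the region
    dx0, dy0 = displacement(a, X, Y)
    for j in range(8):
        b = a.copy()
        b[j] += const * np.sqrt(cov_diag[j])   # test vector b^(j)
        dx, dy = displacement(b, X, Y)
        if np.max(np.hypot(dx - dx0, dy - dy0)) >= t1:
            return False                       # estimation inaccurate
    return True                                # estimation accurate
```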
If the reliability of the estimate is insufficient, the model assumptions have to be examined; four cases can be distinguished.

The signal model is not valid. The actual image signal may differ drastically from the mathematical signal model the algorithm is based on. In general, this mismatch of actual image signal and signal model yields a parameter estimate whose direction agrees with the true motion but whose absolute size is too small. This influence of the mismatch is reduced if the estimation is achieved iteratively, i.e., if the mapping is estimated in several steps as described in Section 2.2.1.1. Further, an analysis of the image signal is performed to select all luminance pixels which support the differential mapping estimation; this problem will be discussed in Section 2.2.3 in detail. The iterative technique as well as the selection of luminance pixels increase the convergence and accuracy of the estimation without modifying the image signal. To provide an image signal that is matched to the signal model, Bierling and Thoma propose in [4] a lowpass prefiltering.

The motion model is not valid. If the area to be described in its mapping consists of several differently moving objects, a mapping description based on a single parameter vector is not valid. The considered area has to be split up into smaller regions that support mapping descriptions. This problem will be discussed in Section 2.2.3 in detail.

The equation system to be evaluated is ill conditioned. If the motion is ambiguous with respect to a subset of mapping parameters, the rows or columns of the corresponding equation system are linearly dependent. To guarantee linear independence of the parameters, the mapping description has to be modified.

The optimization criterion is ambiguous. If the optimization criterion for the estimate, i.e., here the expectation of the squared displaced frame difference, is not dominantly peaked, the algorithm risks divergence. Bierling and Thoma propose in [4] to smooth the optimization criterion by a lowpass prefiltering of the image signal. Another possibility is to modify the mapping model the algorithm is based on. This strategy will be discussed in the following.
In [5], Spoer outlines that the two parameter mapping model, i.e., the pure two-dimensional translatory motion model, as well as the six parameter mapping model, describing an affine mapping of the image coordinates, can be interpreted as special cases of the eight parameter mapping model considered in this contribution. The parameters of these simplified mapping models can be directly derived from the equation system (13) without any new linear regression. This property implies the following strategy. If the reliability of the eight parameter mapping model is insufficient, the parameters Δa_7 and Δa_8 are assumed to be zero, i.e., the eight parameter mapping model is modified into a six parameter mapping model. The corresponding rows and columns of (13) are deleted and the new parameter set and its accuracy are calculated. If the reliability remains insufficient, Δa_1, Δa_2, Δa_4 and Δa_5 are additionally set equal to zero, i.e., the mapping model is further simplified into a pure translatory motion model. The corresponding rows and columns are deleted and the new parameter set and its accuracy are calculated. If the reliability still remains insufficient, the considered region cannot be described by the mappings. Thus, the best fitting mapping model is found automatically, adaptively to the image signal.
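The fallback strategy can be sketched as follows, assuming the normal-equation matrix H^T·H and the right-hand side H^T·FD of (13) are available as numpy arrays and that reliable_fn encapsulates a reliability check such as (33); both names are placeholders of this sketch, not the authors' code.

```python
# Minimal sketch of the mapping model adaptation: delete rows and columns
# of (13) and fall back from eight to six (affine) to two (translatory)
# parameters until the estimate is reliable.
import numpy as np

MODELS = [list(range(8)),       # full eight parameter mapping
          [0, 1, 2, 3, 4, 5],   # affine model: delta_a7 = delta_a8 = 0
          [2, 5]]               # pure translation: only delta_a3, delta_a6

def adapt_model(hth, htfd, reliable_fn):
    e = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)  # identity mapping
    for keep in MODELS:
        delta = np.zeros(8)
        delta[keep] = np.linalg.solve(hth[np.ix_(keep, keep)], htfd[keep])
        a = e + delta
        if reliable_fn(a):
            return a            # best fitting mapping model
    return None                 # region cannot be described by a mapping
```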
2.2.3. The model test of the mapping parameter vector

In Section 2.2.2, a control strategy was presented that is based on the evaluation of the equation system (13). No additional knowledge of the image contents is necessary; only properties of the estimation accuracy are considered. In this section, a model test is derived which checks the consistency of the mapping model and the image movement for all luminance pixels. The squared theoretical displaced frame difference (16), i.e., the criterion the optimization of the mapping estimation is based on, measures the validity of the estimated mapping parameters and the model assumptions. First, the residuum r = FD - H · Δa is calculated using picture information, where FD is the measured and H · Δa is the theoretical frame difference due to mapping parameters and model assumptions. Then each observation point is selected dependent on the amount of its residuum by threshold decision:

|r(X_i, Y_i)|  < t2: observation point (X_i, Y_i) used for the next iteration step,
               ≥ t2: (X_i, Y_i) not suitable as observation point,   (34)

with t2 = √(σ_r^2) and σ_r^2 = variance of the residuum r of the observation points. All luminance pixels which do not support the estimated mapping, e.g. moving objects, are excluded from further iterations in order to guarantee the convergence of the algorithm and to reduce the number of necessary estimation iterations. The threshold t2 is suitable to select the observation points, as it takes into account the properties of the residuum, which reflects the validity of the assumptions the algorithm is based on and the accuracy of the estimate. This technique is applied iteratively. Its use for segmentation is discussed in the next section.
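The model test reduces to a one-line residuum threshold. In the sketch below, H is the system matrix and fd the measured frame differences of the current observation points, both float numpy arrays; the function name is our own.

```python
# Minimal sketch of the model test (34): keep only observation points
# whose residuum is below t2 = sqrt(var(r)).
import numpy as np

def model_test(H, fd, delta_a):
    r = fd - H @ delta_a                      # residuum r = FD - H * delta_a
    return np.abs(r) < np.sqrt(np.var(r))    # True: point supports the mapping
```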
2.2.4. The segmentation threshold estimation

The first three blocks of Fig. 3, i.e., the eight mapping parameter estimation, the reliability test and the model test, build up the internal loop of the mapping parameter estimation. This loop is iteratively repeated until a sufficiently accurate measurement is achieved. The segmentation threshold estimation block belongs to the external loop. In this block, the validity of the measurement is verified and regions which are not correctly described in their mappings are detected. These regions are interpreted as objects in the next step of hierarchy.

In the model test of the mapping parameter vector, all luminance pixels which do not support the estimated mapping are excluded from the further linear regression (13) used to calculate the parameter vector. In the segmentation threshold estimation block, an evaluation of the iterative mapping measurement is achieved. First, the expectation of the squared theoretical displaced frame difference is judged by the expectation of the squared frame difference of all luminance pixels which are accepted as observation points of the regression:

E[(FD - H · Δa)^2] / E[FD^2]  ≤ t3: the mapping description of the region is valid,
                              > t3: the mapping description is insufficient,   (35)

with t3 = threshold to judge the validity of the mapping description. If the quotient in (35) exceeds the threshold t3, the optimization achieved by the algorithm is insufficient, i.e., a mapping description of the considered region according to the model assumptions of the algorithm is not possible. Second, all luminance pixels which have been excluded at the last iteration step by the model test of the parameter vector estimate are declared "changed" in the next step of hierarchy, i.e., the changed regions of the second step of hierarchy consist of the excluded pixels of the first step. Thus, the segmentation threshold estimation yields a hierarchically structured, mapping oriented subdivision of the image contents.
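The validity check (35) compares the remaining residual energy with the frame difference energy of the accepted observation points. A minimal sketch follows; the numerical value of t3 is not given in the text, so the default here is an assumption.

```python
# Minimal sketch of the validity test (35).
import numpy as np

def mapping_valid(H, fd, delta_a, t3=0.5):    # t3: assumed tuning value
    ratio = np.mean((fd - H @ delta_a) ** 2) / np.mean(fd ** 2)
    return ratio <= t3        # True: mapping description of the region valid
```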
2.2.5. The properties of the mapping estimation scheme

After the derivation of the single blocks of Fig. 3, let us now sum up the most important properties of the whole mapping estimation scheme.

The control strategies of the global mapping parameter estimation. The reliability test of the parameter vector estimate adapts the mapping model to the image movement. The validity of the parameter estimate is judged by the accuracy of the displacement vector field that is used for further image processing, e.g. motion compensating prediction. The model test of the estimation excludes areas which do not support the mapping and hence disturb the measurement of the parameters. The excluded areas of the last iteration step are interpreted as new objects whose mappings are calculated hierarchically.

The selection of the observation points. The selection of the observation points is controlled by the signal statistic and the estimation accuracy of previous steps of hierarchy (26) (lower bound) and by the threshold t2 of the model test (34) (upper bound). All parameters and thresholds are calculated automatically, dependent on the signal noise, the motion estimation accuracy and the image movement.

The segmentation result. The segmentation result is a subdivision of the image plane into areas of uniform mapping parameters. Knowledge about the accuracy of the mapping estimation, the validity of the mapping models and the position of the moving areas is obtained.

2.3. Detection of uncovered background and background to be covered

As described in [11], a segmentation algorithm has to distinguish at least between four classes of image regions: stationary background, moving objects, background to be covered and uncovered background. In contrast to [12], in this contribution the separation of the four classes is achieved using only one change detection mask of two successive fields. Through a verification of this change detection mask by the corresponding vector field, one segmentation mask for the temporally newer field k+1 and one for the temporally older field k are obtained. The vector field is calculated from the estimated mapping parameters (14) as described in Section 2.2.1.

Fig. 5. The separation of changed areas into moving objects, uncovered background and background to be covered. (Sketch of fields k and k+1: a changed region containing the moving object plus the background to be covered in field k, and the moving object plus the uncovered background in field k+1, connected by displacement vectors.)

As demonstrated in Fig. 5, the changed regions of the change detection mask include the moving objects and the background to be covered for field k, and the moving objects and the uncovered background for field k+1. Assume that the displacement vectors describe the motion of the objects from field k to field k+1, i.e., to every picture element of field k a vector is assigned. Then, these displacement vectors may only connect corresponding luminance pixels of moving objects starting from field k within the changed region. This is assured by assigning zero displacements to the picture elements in field k that belong to the unchanged area (Fig. 5). Now, all picture elements which are to be displaced from the changed region of field k to the unchanged region of field k+1 belong to the background to be covered. The other picture elements within the changed region of field k belong to the moving object. From the inverse situation, i.e., from the vector field describing the motion of objects from field k+1 to k, the uncovered background and the object in field k+1 are obtained by the same procedure. Thereby, it is not necessary to estimate the mapping parameters completely anew to obtain the vector field valid from field k+1 to k: as shown in Section 2.2, the mapping parameters estimated for field k can be converted to field k+1. It should be noted that, through a combination of change detection and motion information using two fields only, a segmentation is achieved that is able to distinguish between stationary background, moving objects and uncovered background or background to be covered, respectively.
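The verification for field k can be sketched as follows, reusing the displacement function of the eq. (14) sketch; nearest-pel rounding of the displaced position is a simplification of this sketch, not the authors' procedure. The symmetric run with the inverse vector field yields the uncovered background for field k+1.

```python
# Minimal sketch of the separation of a changed region of field k into
# moving object and background to be covered (Section 2.3). `changed` is
# the boolean change detection mask of the field pair.
import numpy as np

def split_changed_region(changed, a):
    h, w = changed.shape
    moving = np.zeros_like(changed)
    to_be_covered = np.zeros_like(changed)
    for y, x in np.argwhere(changed):      # unchanged pels keep zero vectors
        dx, dy = displacement(a, float(x), float(y))
        xt = min(max(int(round(x + dx)), 0), w - 1)   # displaced position,
        yt = min(max(int(round(y + dy)), 0), h - 1)   # clipped to the field
        if changed[yt, xt]:
            moving[y, x] = True            # vector stays in the changed region
        else:
            to_be_covered[y, x] = True     # vector ends in the unchanged region
    return moving, to_be_covered
```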
2.4. Motion compensated prediction

After each step of hierarchy in the segmentation scheme, a motion compensating prediction has to be performed, as the frame differences as well as the local gradients for the mapping calculation of further objects have to be taken from displaced positions. Based on the estimated mapping parameter vector and the segmentation mask, the address of the corresponding picture element in the field S_k is determined for each position (X, Y) in the field Ŝ to be predicted, with the help of the mapping parameters of the considered region. The picture element to be predicted results as [5]:

Ŝ(X, Y) = S_k(A(X, Y)),   (36)

with A(X, Y) = mapping of the considered region. This motion compensating prediction is repeated for each step of hierarchy until all segmented objects are motion compensated.
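Prediction (36) fetches every pel of the considered region from its mapped position in field k. The sketch below uses nearest-neighbour sampling for brevity, which is a simplification of this sketch; an interpolating version would sample sub-pel positions.

```python
# Minimal sketch of the motion compensating prediction (36).
import numpy as np

def apply_mapping(a, X, Y):
    """Eight parameter mapping A(X, Y) of eq. (5), 0-based parameters."""
    n = a[6] * X + a[7] * Y + 1.0
    return ((a[0] * X + a[1] * Y + a[2]) / n,
            (a[3] * X + a[4] * Y + a[5]) / n)

def predict(field_k, region_mask, a):
    pred = field_k.copy()
    h, w = field_k.shape
    for y, x in np.argwhere(region_mask):
        xs, ys = apply_mapping(a, float(x), float(y))
        xi = min(max(int(round(xs)), 0), w - 1)   # clip to the image plane
        yi = min(max(int(round(ys)), 0), h - 1)
        pred[y, x] = field_k[yi, xi]              # eq. (36)
    return pred
```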
3. Experimental results
The presented motion estimation technique has been experimentally investigated by means of computer simulations. The test sequence "Miss America" with a reduced field frequency of 7.5 Hz has been used, sampled at 6.75 MHz and quantized
according to 8 bit/sample. Each of the non-interlaced luminance fields consists of 288 lines and 352 picture elements per line. This sequence was provided by British Telecom Research Laboratories on behalf of the COST 211bis project. It is a typical videophone scene showing head and shoulders motion. Figures 6 and 7 show the two successive fields which have been evaluated: closing eyes and an opening mouth are superimposed onto head motion. First, the single steps of the segmentation algorithm are demonstrated; then object tracking as a possible application in computer vision is discussed.

Figure 8 shows the change detection mask obtained in the initializing step of the segmentation. Changed areas and stationary background are displayed with bright and dark gray luminance values. Each connected changed image part is interpreted as one moving object. The mapping of each single object is described by a parameter vector. This mapping vector allows the separation of the uncovered background and the moving object within the changed area. In Fig. 9 the uncovered background and the segmented object are displayed with white and gray luminance values, respectively. In the next step of segmentation, the image parts which have not been correctly motion-compensated in the first step of hierarchy are detected. The result is shown in Fig. 10. These regions are treated like the changed parts of the initializing step, i.e., each region is interpreted as one moving object and described in its mapping by a parameter vector. These changed regions are verified by the motion information, i.e., uncovered background and the objects are extracted. In Fig. 11 the uncovered background of the second step of hierarchy and the segmented objects are displayed with white and gray luminance values, respectively. Figure 12 shows the masks of the segmented objects.

For the evaluation of the next pairs of fields, all essential segmentation information is stored. Maintaining the results of the mapping estimation yields a segmentation algorithm with memory which uses suitable segmentation information to
increase the segmentation efficiency in the processing of the next fields. Figure 13 shows the objects of Fig. 12 which are correctly described in their mappings. The left eye of "Miss America" has not been accepted by the reliability test and hence it is not maintained as an object in the processing of the next fields.

The memory aspect of the segmentation algorithm supports object tracking, which has been investigated in a second experiment. Figures 14-20 demonstrate the tracking performance in a sequence of images. Two properties of the memorized segmentation algorithm can be recognized. First, the boundaries of the main object, i.e., head and shoulders, become smoother and smoother, and more and more details of the object are recognized and included, as shown in Figs. 14-17. These figures show segmentation masks of four successive fields. Second, the tracking works well. Figure 18 shows the 8th, Fig. 19 the 20th and Fig. 20 the 37th segmentation mask of the evaluated sequence. As can be seen, the segmentation masks track the movement of the object while keeping convergence.
Fig. 6. Twelfth original field of the sequence "Miss America".
Fig. 7. Sixteenth original field of the sequence "Miss America".
Fig. 8. Change detection mask of Figs. 6 and 7.
Fig. 9. Separation of the change detection mask of Fig. 8 into moving objects and uncovered background.
Fig. 10. Change detection mask of the second step of segmentation hierarchy.
Fig. 11. Separation of the change detection mask of Fig. 10 into moving objects and uncovered background.
Fig. 12. The final segmentation result evaluating Figs. 6 and 7.
Fig. 13. Objects of Fig. 12 which are correctly described in their mappings.
Fig. 14. Segmentation mask of the first two fields of the test sequence representing a 4:1 field rate reduced version of "Miss America".
Fig. 15. Segmentation mask of the second and third field of the test sequence maintaining reliable segmentation results from Fig. 14.
Fig. 16. Segmentation mask of the third and fourth field of the test sequence maintaining reliable segmentation results from Fig. 15.
Fig. 17. Segmentation mask of the fourth and fifth field of the test sequence maintaining reliable segmentation results from Fig. 16.
Fig. 18. Segmentation mask of the eighth and ninth field of the test sequence maintaining reliable results from previous segmentation evaluations.
Fig. 19. Segmentation mask of the twentieth and twenty-first field of the test sequence maintaining reliable results from previous segmentation evaluations.
Fig. 20. Segmentation mask of the thirty-seventh and thirty-eighth field of the test sequence maintaining reliable results from previous segmentation evaluations.

4. Conclusion

A hierarchically structured segmentation algorithm is presented which is based on the hypothesis that a region to be segmented is defined by a set of uniform motion and position parameters denoted as mapping parameters. Image segmentation and mapping parameter estimation are interpreted as mutually dependent problems of scene analysis: they support, control and verify each other in a hierarchical procedure. The segmentation method is based on the minimization of the mean squared error of the displaced frame difference, as this criterion also represents the optimization criterion of the mapping estimation. An essential property of the presented segmentation algorithm is that no a priori knowledge about the syntactic structure and properties of the objects is required, i.e., their size, their position and their relation to other objects.

Experimental results confirm that the segmentation algorithm is able to detect moving objects in front of moving objects, to separate covered and uncovered image regions and to track the motion of the objects. The performance of the method is mainly due to the presented control strategies, which support both the convergence of the mapping parameter estimation and the segmentation. A further important property of the presented segmentation scheme is its independence from a suitable selection of significant thresholds: all thresholds are calculated by the algorithm itself, adaptively to the signal noise, to the estimation accuracy and to the validity of the models the technique is based on. Additionally, validity and accuracy of the segmentation results are improved if the segmentation and estimation results are continuously updated in a sequence of images to be processed.

The presented segmentation method is restricted to planar rigid objects. The concept of object oriented mapping parameter estimation and segmentation can be extended in further investigations to more complex, three-dimensional object models. Maintaining the principal idea, motion estimation and segmentation are interpreted as mutually dependent problems of image analysis which support, control and verify each other in a hierarchical procedure.
References

[1] G. Adiv, "Determining three-dimensional motion and structure from optical flow generated by several moving objects", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-7, No. 4, July 1985, pp. 384-401.
[2] J.L. Potter, "Velocity as a cue to segmentation", IEEE Trans. on Systems, Man, and Cybernetics, May 1975, pp. 390-394.
[3] S. Ullman, The Interpretation of Visual Motion, M.I.T. Press, Cambridge, MA, 1979.
[4] M. Bierling and R. Thoma, "Motion compensating field interpolation using a hierarchically structured displacement estimator", Signal Processing, Vol. 11, No. 4, Dec. 1986, pp. 387-404.
[5] P. Spoer, "Schätzung der 3-dimensionalen Bewegungsvorgänge starrer, ebener Objekte in digitalen Fernsehbildfolgen mit Hilfe von Bewegungsparametern", Ph.D. Dissertation, Univ. of Hannover, Hannover, 1987.
[6] R.Y. Tsai and T.S. Huang, "Estimating three-dimensional motion parameters of a rigid planar patch", IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, No. 6, Dec. 1981, pp. 1147-1152.
[7] M. Bierling, "A differential displacement estimation algorithm with improved stability", 2nd Internat. Tech. Symposium on Optical and Electro-Optical Applied Science and Engineering, Cannes, December 1985, pp. 170-174.
[8] C. Cafforio and F. Rocca, "The differential method for image motion estimation", in: T.S. Huang, ed., Image Sequence Processing and Dynamic Scene Analysis, Springer, Berlin, 1983, pp. 104-124.
[9] H.C. Bergmann, "Ein schnell konvergierendes Displacement-Schätzverfahren für die Interpolation von Fernsehbildsequenzen", Ph.D. Dissertation, Univ. of Hannover, Hannover, Feb. 1984.
[10] J.K. Kearney, W.B. Thompson and D.L. Boley, "Optical flow estimation: An error analysis of gradient-based methods with local optimization", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 2, March 1987, pp. 229-244.
[11] R. Thoma, "A segmentation algorithm for motion compensating field interpolation", Picture Coding Symposium, Stockholm, Sweden, June 1987, pp. 81-82.
[12] R. Thoma, "A refined structure for a motion compensating field interpolation algorithm", Picture Coding Symposium, Tokyo, Japan, April 1986, pp. 91-92.