Motion compensation using backward prediction and prediction refinement


Signal Processing: Image Communication 18 (2003) 381–400

Kenneth Andersson^a, Mats Andersson^a, Peter Johansson^b, Robert Forchheimer^b, Hans Knutsson^a,*

^a Department of Biomedical Engineering, Linköping University, 581 85 Linköping, Sweden
^b Image Coding Group, Linköping University, 581 83 Linköping, Sweden

Received 1 July 2002; received in revised form 18 December 2002; accepted 30 December 2002

Abstract

This paper presents new methods for the use of dense motion fields for motion compensation of interlaced video. The motion estimation is based on previously decoded field-images. The motion is then temporally predicted and used for motion compensated prediction of the field-image to be coded. The motion estimation algorithm is phase-based and uses two or three field-images to achieve motion estimates with sub-pixel accuracy. To handle non-constant motion and the specific characteristics of the field-image to be coded, the initially predicted image is refined using forward motion compensation based on block-matching. Tests show that this approach achieves higher PSNR than forward block-based motion estimation when coding the residual with the same coder. The subjective performance is also better. © 2003 Elsevier Science B.V. All rights reserved.

Keywords: Backward–forward motion compensation; Phase-based motion estimation; Irregular sampling; Continuous normalized convolution

1. Introduction

Since the advent of television, obtaining high perceived quality using a limited bandwidth has been an important issue. Since those early days, the demand for visual information to be stored or transmitted has progressively increased. Critical video coding components in hybrid coding [10] are motion estimation and motion compensated prediction, which are also the main topics of the present paper.

[Footnote: Supported by the Swedish Foundation for Strategic Research (SSF) and the program Visual Information Technology (VISIT). *Corresponding author. Tel.: +46-13-227294; fax: +46-13-101902. E-mail address: [email protected] (H. Knutsson).]

Standardized hybrid coders such as MPEG-1, MPEG-2, MPEG-4, H.261 and H.263 use block-based motion estimation and motion compensated prediction. The motion parameters and the prediction residual are coded and transmitted to the decoder. The simplicity and the relatively good performance of block-based motion compensation have made it the standard approach for motion compensation. To deal with the unnatural motion compensated prediction inherent in block-based algorithms, a dense motion field has been estimated and used in a forward motion compensated scheme [20]. To use a dense motion field but avoid the overhead to

0923-5965/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved. doi:10.1016/S0923-5965(03)00012-2


code it, backward video coding has been introduced using the assumption of predictability in space [27], in time [26] or in spatial frequency [23]. In backward video coding, motion estimation is performed at both the encoder and the decoder, and only transmission of the coded prediction residual is necessary. In Lim et al. [24], backward video coding using temporal prediction of a dense motion field is selected for a block when the motion compensated prediction error is lower than for forward video coding using H.263. Yang et al. [33] estimate motion for blocks of size 4 × 4 pixels on previously decoded images at a low-to-high frequency scale. To improve the motion estimates at low bit rates, with increasing quantization noise, the backward motion estimates are refined using forward motion estimation. In this approach, block artifacts can still be present since the motion estimation is block-based.

We propose a new backward–forward motion compensation scheme [1]. Initially, a dense motion field is estimated on previously decoded field-images using a new phase-based motion estimator. The motion field is temporally predicted and used for motion compensated prediction of the current field-image. To handle a non-constant motion field and the specific characteristics of the field-image to be coded, the initial backward motion compensated prediction is refined using block-based forward motion compensation. The use of an initial pixel-based motion compensation reduces problems with block artifacts such as those in standard block-based forward motion compensation schemes.

Image sequences consist of a large amount of data. For this reason, methods operating on image sequences must be computationally efficient to be practical. Computational efficiency is therefore a central theme of the work presented here.

This paper is organized as follows. In Section 2, a quadrature filter bank representation of an image is efficiently produced. In Section 3, phase-differences from products of quadrature filter responses are used in a new motion estimator to generate a dense motion vector field. In Section 4, we present a novel approach for motion compensation that uses a dense motion field to get an initial prediction and a sparse motion field for refinement. Experimental results are given in Section 5. Finally, some conclusions are given in Section 6.
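The backward–forward idea can be condensed into a toy sketch. This is not the paper's algorithm (which uses a dense phase-based motion field on interlaced field-images); it is a minimal 1D stand-in in which a single global integer shift plays the role of the motion field, and both function names are hypothetical:

```python
import numpy as np

def estimate_shift(ref, target, max_shift=4):
    # Toy global motion estimator: the integer shift s that best maps ref
    # onto target (a stand-in for the dense phase-based estimator).
    shifts = list(range(-max_shift, max_shift + 1))
    errors = [np.sum((np.roll(ref, s) - target) ** 2) for s in shifts]
    return shifts[int(np.argmin(errors))]

def backward_forward_predict(decoded_t2, decoded_t1, current):
    # Backward step: motion is estimated from previously *decoded* frames
    # only, so the decoder can repeat it without receiving motion data.
    v = estimate_shift(decoded_t2, decoded_t1)    # motion t-2 -> t-1
    prediction = np.roll(decoded_t1, v)           # assume constant motion to t
    # Forward refinement: correct the prediction against the actual current
    # frame; only this correction needs to be signalled to the decoder.
    dv = estimate_shift(prediction, current)
    return np.roll(prediction, dv)

# A pattern translating by 2 samples per frame.
f0 = np.sin(np.linspace(0.0, 6.0 * np.pi, 64))
frames = [np.roll(f0, 2 * t) for t in range(3)]
pred = backward_forward_predict(frames[0], frames[1], frames[2])
```

With constant motion the backward prediction is already exact and the refinement shift is zero; the refinement step matters when the motion changes between t−1 and t.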

2. Quadrature filter bank representation

This section describes an efficient computation of a quadrature filter bank representing the image at different orientations and scales covering the Fourier domain. A fundamental property of quadrature filters is that the magnitude of the output is invariant to the phase of the underlying signal. There are also many physiological and psychophysical experiments consistent with the hypothesis that the visual cortex contains cells with similar behaviour [28]. For these reasons quadrature filters have been extensively used as a basis for image representation and analysis [16].

In the present work the initial image representation has mainly two purposes. In addition to producing local phase estimates for the motion estimator in Section 3, it is essential that the representation supports a video quality estimator adapted to the human visual system. This opens up the possibility to control video coding using video quality estimation.

2.1. Quadrature filters

The phase, $\phi(x)$, and magnitude, $m(x)$, are defined as the phase angle and the absolute value of the complex-valued response, $q(x)$, of the quadrature bandpass filter, $f$:

$$q(x) = (f * I)(x) = q_{\mathrm{real}}(x) + i\,q_{\mathrm{imag}}(x) = m(x)\,e^{i\phi(x)}, \tag{1}$$

$$\phi(x) = \arg\big(q_{\mathrm{real}}(x) + i\,q_{\mathrm{imag}}(x)\big), \tag{2}$$

$$m(x) = \sqrt{q_{\mathrm{real}}(x)^2 + q_{\mathrm{imag}}(x)^2}, \tag{3}$$

where $x$ denotes the spatial coordinates $(x, y)^T$, $*$ the convolution operator, $q_{\mathrm{real}}(x)$ and $q_{\mathrm{imag}}(x)$ the real and imaginary parts of $q(x)$, and $I$ is a real-valued image.

The magnitude $m(x)$ of the quadrature filter response determines how well the image fits the shape of the filter and also indicates the local


energy in the image. The filter's operating range depends on its angular bandwidth and radial frequency bandwidth. The attainable joint spatial and frequency resolution is fundamentally restricted by the well-known uncertainty principle. A large bandwidth gives phase estimates with good spatial localization but, on the other hand, increases Fourier domain interference in the phase estimation.

2.2. Motion estimation

A differential technique similar to that for intensity [13] can be used for phase-based motion estimation [7,12]. Since the phase is close to linear within a filter's operating range we can constrain the motion using phase-differences:

$$\phi_x v_x + \phi_y v_y + \phi_t = 0, \tag{4}$$

where $v = (v_x, v_y)^T$ are the velocity components in the spatial directions $x$ and $y$ respectively, and $\phi_s$ is the partial derivative of $\phi$, where $s$ denotes the direction of differentiation and can be one of the spatial directions or the temporal direction $t$. Eq. (4) is, however, ambiguous; for this reason at least two filters are needed to resolve the ambiguity of the phase constraint equation. Further, for simple neighbourhoods [11], the so-called aperture problem is present when using phase in the same way as when using the original image for estimating local velocity.

2.3. Efficient filter bank for interlaced video

In this paper, we have selected the relation between the radial and the angular frequency response of the filters to make the magnitude of the kernel as isotropic as possible. Typical values of the frequency and orientation bandwidths are 1.5–2 octaves and 70°, respectively. This provides compact spatial support and promotes efficient filtering. The large amount of image data enforces requirements on computationally efficient filtering. To minimize the number of coefficients used in the computation of the quadrature filter responses, a sequential combination of small one-dimensional


(1D) filter kernels is used, i.e. a filter net [4,18]. The filter net is furthermore designed for interlaced video, the most common format for video sequences. This is achieved by designing the filter net with a restricted Fourier domain in the vertical direction (half of the horizontal) and by restricting the spatial positions of the filter coefficients in the vertical direction, see Fig. 1. This means that the final filter net can be applied directly to interlaced images without any prior interpolation and resampling to an image with equal spacing in the vertical and horizontal directions.

2.4. Filter bank design

An ideal set of frequency and orientation selective quadrature filters is designed in the Fourier domain. The frequency responses of the ideal quadrature filters are generated from differences of lowpass filters multiplied with filters with angular selectivity. All filters are more or less 1D. Four directions of filtering are used:

$$\hat{n}_x = (1, 0)^T, \tag{5}$$

$$\hat{n}_y = (0, 1)^T, \tag{6}$$

$$\hat{n}_{xy} = \tfrac{1}{\sqrt{2}}(1, 1)^T, \tag{7}$$

$$\hat{n}_{x\bar{y}} = \tfrac{1}{\sqrt{2}}(1, -1)^T. \tag{8}$$

An ideal quadrature filter, $F_k(u)$, is generated as

$$F_k(u) = D_k(\hat{u})\,BP_k(u), \quad k = [1 \ldots 9], \tag{9}$$

where $\hat{u} = u/\|u\|$ and the angular selectivity function is given by

$$D_k(\hat{u}) = \begin{cases} (\hat{u}^T \hat{n}_k)^2, & \text{when } \hat{u}^T \hat{n}_k > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$

The bandpass filters are generated as

$$BP_k(u) = \begin{cases} LP1_k(u) - LP2_k(u), & k = [2 \ldots 9] \\ \big(LP1_k(u) - LP2_k(u)\big)\,LP3(u), & k = 1, \end{cases} \tag{11}$$


where $\alpha_{jk} = \pi/(2 \ln(u_{hjk}/u_{ljk}))$, with the lowpass filters given by

$$LP1_k(u) = \begin{cases} \min\!\Big(\cos^2\!\Big(\alpha_{1k} \ln\Big|\tfrac{u^T n_{1k}}{u_{l1k}}\Big|\Big),\ \cos^2\!\Big(\alpha_{1k} \ln\Big|\tfrac{u^T n_{1k}^{\perp}}{u_{l1k}}\Big|\Big)\Big), & u_{l1k} \le |u| \le u_{h1k} \\ 1, & |u| < u_{l1k} \\ 0, & |u| > u_{h1k}, \end{cases} \tag{12}$$

where $n_{1k}^{\perp}$ is perpendicular to $n_{1k}$, and

$$LP2_k(u) = \begin{cases} \min\!\Big(\cos^2\!\Big(\alpha_{2k} \ln\Big|\tfrac{u^T n_{2k}}{u_{l2k}}\Big|\Big),\ \cos^2\!\Big(\alpha_{2k} \ln\Big|\tfrac{u^T n_{2k}^{\perp}}{u_{l2k}}\Big|\Big)\Big), & u_{l2k} \le |u| \le u_{h2k} \\ 1, & |u| < u_{l2k} \\ 0, & |u| > u_{h2k}, \end{cases} \tag{13}$$

and finally

$$LP3(u) = \begin{cases} \cos^2\!\Big(\tfrac{\pi}{2 \ln 2} \ln\Big|\tfrac{u^T \hat{n}_y}{\pi/8}\Big|\Big), & \pi/8 \le |u^T \hat{n}_y| \le \pi/4 \\ 1, & |u^T \hat{n}_y| < \pi/8 \\ 0, & |u^T \hat{n}_y| > \pi/4. \end{cases} \tag{14}$$

The parameters for each quadrature filter can be described by the direction of filtering $\hat{n}_k$, $n_{1k}$ and $n_{2k}$, and the low and high cut-off frequencies of the lowpass filters $u_{l1k}$, $u_{h1k}$, $u_{l2k}$ and $u_{h2k}$. The parameters for the ideal quadrature filters are given in Table 1, together with an indication of the centre frequency $\omega_c$. It can be noted that all quadrature filters come in pairs at each scale, with the exception of the horizontal filter $F_1(u)$. Due to interlacing, $F_1(u)$ has no correspondence in the vertical direction, and this filter has to be designed differently. A contour plot of the frequency responses of the ideal quadrature filters at half amplitude is shown in Fig. 1.

2.5. Filter optimization

First, an ideal set of frequency and orientation selective quadrature filters is designed in the Fourier domain.

Table 1
Description of the parameters for generation of the frequency responses of the ideal quadrature filters. $\omega_c$ denotes the spatial centre frequency.

Filter  ω_c (rad)  n̂_k     n_1k    n_2k    u_l1k     u_h1k
F1      0.5π       n̂_x     n̂_x     n̂_x     π/2       π
F2      0.35π      n̂_xy    n̂_x     n̂_xy    π/4       π/2
F3      0.35π      n̂_x̄y    n̂_x     n̂_x̄y    π/4       π/2
F4      0.25π      n̂_x     n̂_xy    n̂_x     π/(4√2)   π/(2√2)
F5      0.25π      n̂_y     n̂_xy    n̂_x     π/(4√2)   π/(2√2)
F6      0.18π      n̂_xy    n̂_x     n̂_xy    π/8       π/4
F7      0.18π      n̂_x̄y    n̂_x     n̂_x̄y    π/8       π/4
F8      0.125π     n̂_x     n̂_xy    n̂_x     π/(8√2)   π/(4√2)
F9      0.125π     n̂_y     n̂_xy    n̂_x     π/(8√2)   π/(4√2)

Fig. 1. Contour plot of the frequency responses of the ideal quadrature filters at half amplitude. Note that the Fourier domain is restricted in the y direction due to interlacing.

The purpose of the optimization is to obtain the set of spatial kernels in the filter net that minimizes the weighted difference between the frequency responses of the ideal and the optimized filters, i.e. to find the minimum of

$$\epsilon^2 = \sum_k \sum_u \big\|W(u)\big(\tilde{F}_k(u) - F_k(u)\big)\big\|^2, \tag{15}$$

where $u$ denotes the Fourier coordinates $u_x$ and $u_y$, and $W(u)$ weights errors according to the expected spectrum. For most natural images the spectrum can, on average, be expected to be independent of scale and to decay as $|u|^{-\alpha}$ [18], where $|u| = \sqrt{u_x^2 + u_y^2}$. No restrictions on the spatial shape of the filters are considered other than the definition of the spatial support of the kernels. It is, however, possible to incorporate spatial criteria in the optimization; see [18] for more details on multiple-domain filter optimization.

With the ideal quadrature filter responses in mind a filter net is designed, see Fig. 2. The nodes denote summation points and the filters are located on the arcs connecting two consecutive levels. Ideally it would, for a given network structure, be desirable to optimize the number of coefficients in each arc and the spatial position of each coefficient for the entire net simultaneously. This is, however, an extremely complex problem and a method for finding an optimal solution has not been found. For this reason the design of the network has to be based on experience. For a given network the values of the filter coefficients can be optimized using Eq. (15).

Fig. 2. Filter net for generation of 9 quadrature filter kernels f̃_k using 141 real-valued kernel coefficients. The dots correspond both to the number of non-zero coefficients in each simple filter and to the positions of these coefficients. The simple filters at the bottom of the filter net have complex-valued kernel coefficients.

The optimization of the filter net is recursive. Each level in the filter net is optimized separately with respect to the final output nodes and the current state of the other levels in the filter net. The convergence of this approach is not guaranteed, but tests show that for educated choices of filter structure and coefficient distribution the network converges after a few iterations to a stable solution. For further details on this optimization see [4].

2.6. Result of the filter network optimization

The optimized filter network requires 141 real-valued multiplications per pixel, which corresponds to 15–16 multiplications for each quadrature filter response. The resulting optimized quadrature filters are shown in Fig. 3. Optimization of a single filter based on $F_6$, with the same frequency weighting function as in Eq. (15), requires 361 filter coefficients for a similar weighted error as the filter net approach. This shows that, in this application, the filter net is more than 20 times faster at generating a quadrature filter response than a filter kernel from a single filter optimization. For comparison, a fast Fourier transform (FFT) approach will on average also be more than 10 times slower. The characteristics of the generated quadrature filters can be visualized through their magnitude and phase characteristics in Fig. 4.

Fig. 3. Contour plot of the frequency responses of the generated quadrature filters f̃_k at half amplitude. Note that the Fourier domain is restricted in the y direction due to interlacing.

Fig. 4. Magnitude times phase of the quadrature filters for a spatial region of 37 × 37 pixels. Based on quadrature filter kernels, from left to right starting at the top: f̃_1, f̃_2, f̃_3, f̃_4, f̃_5, f̃_6, f̃_7, f̃_8 and f̃_9.

3. Motion estimation based on phase-differences

Phase was first introduced in [22], using the Fourier phase for signal registration, i.e. phase correlation. Other approaches based on phase-differences are [32,7,31]. Methods based on phase have been shown to perform well in comparison with other methods, see [5]. Phase variations are nearly linear with respect to displacements, a feature which makes it possible to obtain sub-pixel accuracy without explicit interpolation. Phase is furthermore relatively insensitive to variations in luminance and contrast, and also to small variations in scale and rotation [9]. A drawback of methods based on phase-differences is that they can be completely unreliable at singular points [7]. Such points can, however, be detected and avoided in the computation. Using narrowband filters reduces the risk of singular points; in this case, however, more filters are needed to cover the Fourier domain, and the filters become larger in the spatial domain, increasing the risk of mixing different objects in the motion estimation.

In this paper, the velocity is estimated by a least-squares fit of several local phase constraints computed from responses of the quadrature filters given in Section 2. This kind of minimization has also been used by Fleet and Jepson [7] and Hemmendorff [12]. Many methods are based on a relatively large spatio-temporal support. We focus on rapidly changing motion and use a temporal support of two or three field-images and a limited number of broad-band quadrature filters. Multiple hierarchical motion estimation is used to obtain high spatial resolution of the motion field.

First, in Section 3.1, we describe how phase-differences are computed. Then, in Section 3.2, a velocity is estimated based on a selected set of the available quadrature filter responses. This velocity is post-filtered in Section 3.3 to reduce uncertainty in the motion estimates. In Section 3.4 the complete motion estimation algorithm that produces the final velocity is described.

3.1. Computation of phase-differences

As shown in Eq. (4), the velocity is related to the spatial and temporal phase-differences. Our


main requirement on the measurements of the phase-differences is that they should be local both spatially and temporally, to separate objects with different motions and to support motion estimates with high spatial accuracy. For this reason we use the 2D spatial quadrature filters, $\tilde{f}_k$, from Section 2, in two or three field-images. The phase-differences $\phi_{kx}$, $\phi_{ky}$ and $\phi_{kt}$, and the corresponding magnitudes $m_{kx}$, $m_{ky}$ and $m_{kt}$, based on the quadrature filter response $\tilde{q}_k$, are computed as the arguments and magnitudes of the following complex-valued functions:

$$qq_{kx}(x) = \tilde{q}_k\big(x - (1,0)^T, t\big)\,\tilde{q}_k(x, t)^{*} + \tilde{q}_k(x, t)\,\tilde{q}_k\big(x + (1,0)^T, t\big)^{*}, \tag{16}$$

$$qq_{ky}(x) = \tilde{q}_k\big(x - (0,1)^T, t\big)\,\tilde{q}_k(x, t)^{*} + \tilde{q}_k(x, t)\,\tilde{q}_k\big(x + (0,1)^T, t\big)^{*}, \tag{17}$$

$$qq_{kt}(x) = \tilde{q}_k(x, t-1)\,\tilde{q}_k(x, t)^{*} + \tilde{q}_k(x, t)\,\tilde{q}_k(x, t+1)^{*}, \tag{18}$$

where $^{*}$ denotes the complex conjugate. The phase-differences and corresponding magnitudes are computed as

$$\phi_{kx}(x) = \arg(qq_{kx}(x)), \tag{19}$$

$$\phi_{ky}(x) = \arg(qq_{ky}(x)), \tag{20}$$

$$\phi_{kt}(x) = \arg(qq_{kt}(x)), \tag{21}$$

$$m_{kx}(x) = |qq_{kx}(x)|, \tag{22}$$

$$m_{ky}(x) = |qq_{ky}(x)|, \tag{23}$$

$$m_{kt}(x) = |qq_{kt}(x)|. \tag{24}$$

The use of products of quadrature filter responses for estimation of phase-differences has earlier been used by Wilson and Knutsson [32] and Jepson and Jenkin [15]. Estimating phase-differences as the argument of the complex-valued products handles phase unwrapping automatically, since the complex exponential representation is by definition modulo $2\pi$. Consider a 1D example with the filter responses $a(x_1) = m e^{i\phi}$ and $a(x_2) = m e^{i(\phi - \Delta\phi)}$ of a sine wave at two spatial positions with phases $\phi$ and $\phi - \Delta\phi$. Estimating the phase-difference between the positions using the complex-valued product $a(x_1) a(x_2)^{*}$ gives $m e^{i\phi}\, m e^{-i(\phi - \Delta\phi)} = m^2 e^{i\Delta\phi}$. The phase-difference is independent of the absolute phase of the signal.

Spatial phase-differences $(\phi_x, \phi_y)^T$ projected on the orientation $\hat{n}_k$ of the used filter $\tilde{f}_k$, i.e. the local frequency, should furthermore always be positive. This is due to the fact that the phase in the designed direction of a quadrature filter always rotates in the same way. Negative local frequencies correspond to singular points. Such points are completely unreliable and should not be used. A singular point can occur when two frequencies interfere in the filter response; such points can be detected and avoided in the computation [7], see Eq. (28). Temporal phase-differences, on the other hand, can have different signs depending on the direction of movement. This is possible since, in the temporal direction, the phase of a quadrature filter designed in a spatial direction does not have a defined way of rotation. A large phase-difference in the temporal direction must thus be used carefully, since it might have the wrong sign. This is considered in Eq. (30).

As an additional feature of using these products for estimation of phase-differences, the magnitudes $m_{kx}$, $m_{ky}$ and $m_{kt}$ give a measure of how strong the signal is and how well it fits the filter. This is an important measure of the reliability of the phase estimates and is used in Eq. (27).

Due to the interlaced data, each of the temporally neighbouring quadrature filter responses in Eq. (18) is bi-linearly interpolated to be spatially aligned with the centre quadrature filter response when the products are computed. Bi-linear interpolation is also used for the neighbouring samples in the vertical direction in the computation of Eq. (17).
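The conjugate-product construction of Eqs. (16)–(24) can be illustrated with a short sketch (plain NumPy, with a synthetic complex response standing in for an actual quadrature filter output). Note that with the product ordering of Eq. (16), a phase advancing by $\Delta\phi$ per sample yields an argument of $-\Delta\phi$:

```python
import numpy as np

# Synthetic complex "quadrature response" whose phase advances by 0.3 rad per
# sample; the total phase grows far beyond 2*pi, so naive differencing of
# np.angle(q) would need explicit unwrapping.
x = np.arange(64)
q = np.exp(1j * 0.3 * x)

# Three-point conjugate products as in Eq. (16): the argument is the local
# phase-difference, modulo 2*pi by construction -- no unwrapping needed.
qq = q[:-2] * np.conj(q[1:-1]) + q[1:-1] * np.conj(q[2:])
phi = np.angle(qq)   # phase-difference estimate (-0.3 at every position here)
m = np.abs(qq)       # magnitude: reliability of the estimate
```

For a unit-amplitude response both terms agree and $|qq| = 2$; when the two one-sided differences disagree (e.g. near a singular point), the terms partially cancel and the magnitude, i.e. the reliability, drops.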


Estimation of phase-differences based on three points, as in Eqs. (16)–(18), gives measures of the phase-differences at the centre point. The phase-differences are determined from the magnitudes of the corresponding products: the product with the largest magnitude, i.e. the best fit, has the most influence on the final phase-difference. For temporal phase-differences based on this symmetric temporal support, the motion estimate is centred at time $t$. This estimate is best as long as the motion is consistent in the spatio-temporal region. If the motion is inconsistent, and our particular interest is using the estimated motion for motion compensated prediction as described in Section 4, it can be better to omit the future field-image and rely only on the motion between time $t-1$ and time $t$.

3.2. From phase constraints to velocity

To resolve the ambiguity of the phase constraint equation (Eq. (4)), we determine the velocity by minimizing a least-squares fit of several local phase constraints in a spatial neighbourhood, as given in Eq. (25). For this purpose we use phase-differences from a set of quadrature filter kernels $O$ and a spatial window $W(x)$, where $O$ and $W(x)$ are defined in Section 3.4. Based on phase-differences from coarse filters, i.e. filters with a low centre frequency, a larger motion can be estimated than with fine filters. Using coarse filters, however, means that motions of different objects are more likely to be mixed up in the analysis, resulting in a poor velocity estimate. For this reason we use orientation selective filters with a similar centre frequency to estimate a certain range of velocities. For improved readability we omit the spatial coordinate $(x)$ of many variables in this section. We use a constant motion model, $v$, in the minimization:

$$E_v = \sum_{k \in O} \sum_x W(x)\big(c_k^T(x)\,\bar{v}(x)\big)^2, \tag{25}$$

where $W(x)$ is a window that gives more influence to constraints at the centre of the spatial neighbourhood, $\bar{v} = (v^T, 1)^T = (v_x, v_y, 1)^T$, and $c_k$ are the weighted phase-differences from quadrature filter response $\tilde{q}_k$:

$$c_k(x) = \sqrt{c_{km} c_{k\phi} c_{kv} c_{kt}} \begin{pmatrix} \phi_{kx} \\ \phi_{ky} \\ \phi_{kt} \end{pmatrix} = \sqrt{c_k} \begin{pmatrix} \phi_{kx} \\ \phi_{ky} \\ \phi_{kt} \end{pmatrix}, \tag{26}$$

with the weights, or reliabilities, of the phase-differences defined as

$$c_{km} = \sqrt{m_{kx} m_{ky} m_{kt}}, \tag{27}$$

$$c_{k\phi} = 1 \text{ when } \hat{n}_k^T \begin{pmatrix} \phi_{kx} \\ \phi_{ky} \end{pmatrix} > 0, \text{ else } 0, \tag{28}$$

$$c_{kv} = 1 \text{ when } \|v_{kn}\| < v_{k\max}, \text{ else } 0, \tag{29}$$

$$c_{kt} = \cos^2\!\left(\frac{\phi_{kt}}{2}\right), \tag{30}$$

where $c_{km}$ favours filters according to their magnitude, due to the local structure and the temporal similarity. $c_{k\phi}$ removes points with negative spatial phase-differences, i.e. negative local frequencies, as they are completely unreliable [8,31]. $c_{kv}$ avoids the use of phase estimates whose normal velocity estimate is larger than the maximum velocity $v_{k\max}$ that the filter $\tilde{f}_k$ can handle, with the normal velocity defined as

$$v_{kn} = -\frac{\phi_{kt}}{\phi_{kx}^2 + \phi_{ky}^2} \begin{pmatrix} \phi_{kx} \\ \phi_{ky} \end{pmatrix}. \tag{31}$$

$c_{kt}$ limits problems with the phase wrapping around, i.e. changing sign, for temporal phase-differences, by promoting small phase-differences over larger ones [32]. Minimization of Eq. (25) with respect to $v$ gives

$$v(x) = -\begin{pmatrix} \sum_{k,x} W(x) c_k \phi_{kx} \phi_{kx} & \sum_{k,x} W(x) c_k \phi_{kx} \phi_{ky} \\ \sum_{k,x} W(x) c_k \phi_{ky} \phi_{kx} & \sum_{k,x} W(x) c_k \phi_{ky} \phi_{ky} \end{pmatrix}^{-1} \begin{pmatrix} \sum_{k,x} W(x) c_k \phi_{kx} \phi_{kt} \\ \sum_{k,x} W(x) c_k \phi_{ky} \phi_{kt} \end{pmatrix}. \tag{32}$$
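The least-squares solution of Eq. (32) is a 2 × 2 linear solve per neighbourhood. A minimal single-neighbourhood sketch with synthetic phase constraints and all weights set to 1 (so it omits the full weighting of Eqs. (26)–(30)):

```python
import numpy as np

# Two phase constraints phi_x*vx + phi_y*vy + phi_t = 0 (Eq. (4)) generated
# from a known velocity; two non-parallel spatial gradients resolve the
# aperture ambiguity that a single constraint leaves open.
v_true = np.array([1.5, -0.5])
grads = np.array([[1.0, 0.0],     # spatial phase-differences, filter 1
                  [0.7, 0.7]])    # spatial phase-differences, filter 2
phi_t = -grads @ v_true           # temporal differences implied by Eq. (4)

# Normal equations of Eq. (25) with unit weights: A v = -b, cf. Eq. (32).
c = np.ones(len(grads))           # reliabilities c_k (all 1 here)
A = (grads.T * c) @ grads         # sum_k c_k * grad_k grad_k^T
b = grads.T @ (c * phi_t)         # sum_k c_k * grad_k * phi_t,k
v_est = -np.linalg.solve(A, b)
```

In the full algorithm the weights of Eqs. (26)–(30) and the spatial window $W(x)$ enter as per-constraint multipliers in both the matrix and the right-hand side.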


The similarity between the concept of orientation tensors [17] and the described algorithm is shown in Appendix A.

3.3. Post-filtering of motion estimates

After the velocity $v(x)$ has been estimated using Eq. (32), uncertainties can still remain in the velocity estimates. Such uncertainties can, for example, occur when the local structure is close to 1D. The uncertainties can be described as errors in the magnitude and direction of the velocity, and can be represented with the covariance matrix

$$C(x) = \sum_x \sum_y \big(v(x) - m(x)\big)\big(v(x) - m(x)\big)^T + \epsilon I, \tag{33}$$

where $m(x)$ is the local mean of $v(x)$, determined with a Gaussian lowpass filter, and $\epsilon I$ is an isotropic part that limits the range of the inverse of the uncertainty, i.e. the certainty. At a point where the velocity is larger than what can be estimated with the filters used, the diagonal of the covariance matrix is forced to be huge. The inverse of the uncertainty matrix reflects the certainty of a motion estimate. In a local neighbourhood we then weight the velocity estimates according to this uncertainty using normalized convolution:

$$\tilde{v}(x) = \big((g * C^{-1})(x)\big)^{-1}\big((g * (C^{-1} v))(x)\big), \tag{34}$$

where $g$ is a separable Gaussian filter of size 5 × 5 with $\sigma = 1.2$. These filter parameters are also used in the computation of the local mean and for the weighted summation in Eq. (33). With this post-filtering method, estimates are used in accordance with their spatial consistency: a bad estimate can be repaired by neighbouring, more consistent estimates.

3.4. Multiple hierarchical motion estimation

Hierarchical motion estimation is a commonly used method to handle large velocities and to reduce noise sensitivity. It starts with motion estimation based on low spatial frequencies to compute a coarse motion estimate. The motion estimates are then refined using higher spatial frequencies. The smooth motion field of hierarchical motion estimation leaks motion to the background near moving objects. Motion compensation of an object according to such a motion field moves the background along with the object, which gives a perceptually annoying distortion. One method to improve hierarchical motion estimation is to allow non-zero motion vectors only when the frame difference $I(t) - I(t-1)$ is above a certain threshold [6]. Another method is to propagate several motion vectors from each hierarchical computation level and thereby increase the probability of estimating the true motion [21]. A recent approach for hierarchical block-matching selects the initial motion of a block at a finer resolution as the motion with the least motion compensated prediction error among the motions of neighbouring blocks at the previous resolution [30].

Our method increases the spatial resolution of the estimated motion field by using several parallel local hierarchical motion estimators starting at different scales [2]. The multiple hierarchical motion estimator (MHME) consists of one global and three local hierarchical motion estimators (HME), see Fig. 5. To handle camera panning, i.e. a global shift of the whole image, the local motion estimators are initially compensated according to a global hierarchical motion estimate. Areas not following the global motion are corrected by the local motion estimators.

The basic elements of the estimator are three motion estimation units. Each motion estimation unit determines the velocity by minimization of Eq. (25), for a constant motion model over a spatial region defined by $W(x)$, using a set $O$ of phase-differences based on the responses of the quadrature filters $\tilde{f}_k$. The motion estimation units are based on coarse filters $k \in O = \{6, 7, 8, 9\}$, medium filters $k \in O = \{4, 5, 6, 7\}$ and fine filters $k \in O = \{1, 2, 3\}$. $W(x)$ is a Gaussian lowpass filter with standard deviation $\sigma = 1.2$, 2.4 and 3.6, respectively, described in order from the high-frequency to the low-frequency motion estimation unit. Gaussians are used due to their separability, which allows for efficient filtering. The motion estimation units used for global motion estimation use integration over the whole image.
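The certainty-weighted averaging used in the post-filtering (Eq. (34)), and again in Eq. (38) below, is an instance of normalized convolution. A scalar 1D sketch (the paper applies a matrix-valued version with $C^{-1}$ as certainty):

```python
import numpy as np

def normalized_convolution(signal, certainty, kernel):
    # (g * (c * s)) / (g * c): samples with zero certainty do not drag the
    # average; instead they are "repaired" from reliable neighbours.
    num = np.convolve(certainty * signal, kernel, mode="same")
    den = np.convolve(certainty, kernel, mode="same")
    return num / np.maximum(den, 1e-12)

v = np.full(11, 3.0)            # a locally consistent velocity component...
v[5] = 100.0                    # ...with one outlier estimate
c = np.ones(11)
c[5] = 0.0                      # the outlier is flagged with zero certainty
g = np.array([0.25, 0.5, 0.25])
v_filtered = normalized_convolution(v, c, g)
```

An ordinary convolution of v with g would smear the outlier into its neighbours; weighting by certainty removes it entirely.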


The hierarchical motion estimation implies that the temporal phase differences in Eq. (18) must be re-computed. This is achieved by motion compensation using the estimated motion from the previous computation unit as

Δq_kt(x) = q̃_k(x + ṽ(x), t − 1) q̃_k(x, t)* + q̃_k(x, t) q̃_k(x − ṽ(x), t + 1)*. (35)

Bi-linear interpolation is used to achieve sub-pixel motion compensation. The spatial phase-differences in Eqs. (16) and (17) need not be re-computed. It should be noted that using motion compensation of a motion estimation unit we will get a delta velocity from Eq. (25). Before post-filtering is applied, as in Section 3.3, this delta velocity is accumulated to the previously estimated velocity ṽ(x). The hierarchical motion estimator which uses all scales can estimate velocities up to about 10 pixels per field-image. The algorithm produces four motion estimates ṽ_j with sub-pixel accuracy in each pixel, one global and three local estimates. Each of these motions is then used for motion compensated prediction:

Î_t^j(x) = Ĩ_{t−1}(x − ṽ_j(x)), where j = 1, …, 4, (36)

where Ĩ_{t−1} is a previously decoded field-image and Î_t^j is the predicted image. The motion estimate ṽ_j at each pixel which gives the least local motion compensated prediction error is selected as the output from the MHME:

v(x) = arg min_j ((g * |Î_t^j − Ĩ_t|)(x)), (37)

where g is a Gaussian filter of size 5 × 5 with σ = 1.2, and Ĩ_t is a previously decoded field-image. To reduce noise generated by the selection in Eq. (37), the estimates are further filtered with normalized convolution [19]:

v'(x) = (g * c v)(x) / (g * c)(x), (38)

where g is a lowpass filter of size 3 × 3 and c is a measure of spatial consistency [11]:

c(x) = ||(g * v)(x)|| / (g * ||v||)(x). (39)

Fig. 5. Multiple hierarchical motion estimation (MHME). MC refers to a motion compensated motion estimation unit.

Note that when MHME is used for backward prediction as described in Section 4, the motion estimation is performed on previously decoded field-images. The selected motion is then temporally predicted and used for motion compensated prediction of the field-image to be coded, see Section 4.
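The selection and smoothing of Eqs. (37)–(39) can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: a small box filter stands in for the Gaussian and lowpass filters g, and the array layout (candidate axis first) is our own.

```python
import numpy as np

def lp(a, k=3):
    """k x k box filter with edge replication: a stand-in for the
    small lowpass filters g used in Eqs. (37)-(39)."""
    p = k // 2
    ap = np.pad(a, p, mode='edge')
    out = np.zeros(a.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += ap[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def select_and_smooth(predictions, flows, target):
    """Per pixel, pick the candidate flow whose locally filtered absolute
    prediction error is least (Eq. (37)), then smooth the winner with
    consistency-weighted normalized convolution (Eqs. (38)-(39))."""
    H, W = target.shape
    errs = np.stack([lp(np.abs(p - target), k=5) for p in predictions])
    best = np.argmin(errs, axis=0)                       # winner index map
    idx = np.broadcast_to(best[None, :, :, None], (1, H, W, 2))
    v = np.take_along_axis(flows, idx, axis=0)[0]        # selected flow (H, W, 2)
    # Eq. (39): spatial consistency c = ||g * v|| / (g * ||v||).
    gv = np.stack([lp(v[..., i]) for i in range(2)], axis=-1)
    c = np.linalg.norm(gv, axis=-1) / (lp(np.linalg.norm(v, axis=-1)) + 1e-12)
    # Eq. (38): normalized convolution of v with certainty c.
    num = np.stack([lp(c * v[..., i]) for i in range(2)], axis=-1)
    return num / (lp(c)[..., None] + 1e-12)
```

For a spatially constant winning flow the consistency c is one and the normalized convolution returns the flow unchanged; the smoothing only acts where the selection in Eq. (37) has introduced spatial switching noise.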

4. Backward–forward motion compensated prediction

We use backward motion compensated prediction based on the assumption that the motion is



Fig. 6. Illustration of backward prediction. The image at time t is predicted using a previously decoded image and the motion between two previously decoded images. A moving object causes irregular samples of the moving object at time t. In addition it generates covered and uncovered background. In the covered region there are multiple predictions, and in the uncovered region samples are missing. The illustration is for non-interlaced sampling.
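The scatter-style prediction of Eq. (40), with covered regions receiving multiple predictions and uncovered regions left as holes, can be sketched as below. Nearest-grid rounding and simple averaging of collisions are crude stand-ins for the paper's continuous normalized convolution; the function name and layout are our own.

```python
import numpy as np

def forward_warp(prev_img, flow, dt=1):
    """Scatter prev_img along dt*flow onto the output grid (Eq. (40) style).
    Colliding predictions (covered regions) are averaged; grid positions
    receiving no prediction (uncovered regions) are flagged as holes so
    they can later be filled from the neighbourhood."""
    H, W = prev_img.shape
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    ys, xs = np.mgrid[0:H, 0:W]
    ty = np.rint(ys + dt * flow[..., 1]).astype(int)
    tx = np.rint(xs + dt * flow[..., 0]).astype(int)
    ok = (ty >= 0) & (ty < H) & (tx >= 0) & (tx < W)
    np.add.at(acc, (ty[ok], tx[ok]), prev_img[ok])   # many-to-one: covered
    np.add.at(cnt, (ty[ok], tx[ok]), 1.0)
    holes = cnt == 0                                  # no prediction: uncovered
    out = np.where(holes, 0.0, acc / np.maximum(cnt, 1.0))
    return out, holes
```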

predictable in time, as in previous work [26,24]. For this case it is important to achieve 'true' motion estimates, as the motion, estimated from previously decoded field-images, is used for motion compensated prediction of the field-image to be coded. This is unlike block-based motion estimation, used in standard forward motion compensation, where the displacement with the least motion compensated prediction error is selected for each block. The prediction of the current field-image I_t using a previously estimated motion v'_{t−Δt} can be solved in two different ways:

* Direct prediction of I_t using v'_{t−Δt} and a previously reconstructed image Ĩ_{t−Δt}:

Î_t(x + Δt v'_{t−Δt}(x)) = Ĩ_{t−Δt}(x), (40)

where Δt denotes which previous field-image is used, see Section 4.1.

* Prediction in two steps: first prediction of v_t using v'_{t−Δt},

v̂_t(x + Δt v'_{t−Δt}(x)) = v'_{t−Δt}(x), (41)

and then prediction of I_t using v̂_t and Ĩ_{t−Δt}:

Î_t(x) = Ĩ_{t−Δt}(x − Δt v̂_t(x)). (42)

Some potential problems with prediction in these cases are:

(1) Irregular input sampling. The positions of the predicted signals (x + Δt v'_{t−Δt}(x)) and the sampling pattern of the desired output are not aligned, see Fig. 6. There can also be 'holes' at the desired output positions, because no prediction ends up there. An example is uncovered regions due to object motion.
(2) Covered regions. Many predictions can end up in the same output neighbourhood, i.e. occlusion. A simple example is a moving object on a stationary background.
(3) Non-constant motion. The motion of an object at time t − Δt and t is not the same.
(4) Varying image characteristics. Examples are changing lighting conditions, 3D object rotation and quantization noise.

Continuous normalized convolution (CNC) [3], an extension of normalized convolution (NC) [19], is used here for motion compensation to deal with irregular sampling and the case of covered and uncovered regions. NC can be viewed as locally solving a weighted least squares (WLS) problem to represent a signal using a local signal model. The weights are defined by signal certainties and a spatially localizing function. When estimating regularly sampled output values, here located at the field-image to be coded, from an irregularly sampled input signal, each output sample has to be estimated from a neighbourhood unique for that sample. CNC provides a computationally efficient way to compute such values from a set of fixed regularly sampled filters using a linear model (here a first order spline) and offsets between samples from the regularly and irregularly sampled signals, see Eq. (44). This alleviates the need to work with analytic filter functions that are recalculated for each neighbourhood, which greatly improves processing speed. When estimating output samples it is reasonable that signal samples spatially closer to the output sample should affect the filter output more than samples further away. This spatial localization is here achieved with a function with a centre region whose values decline linearly and an outer region whose values decline fast with the distance from the output sample. The approach assumes that neighbouring spatial regions of the predictions in Eqs. (40) and (41) have similar signal properties as the signals in the desired output positions. In this paper we use Eq. (40) for the prediction. Using a constant signal model, the computations by CNC can be formulated for an output sample as

Î = Σ_j ã(x_j) c_j Î_j / Σ_j ã(x_j) c_j, (43)

where x_j is the sample position of the jth input sample, ã(x_j) is a continuous approximation of the spatial localization function, Î_j is an irregular sample due to Eq. (40) and c_j is the corresponding certainty. A 1D example of how the continuous filter approximation works is given below:

ã(x) = ā + Δx a_x, (44)

where

X = D(⌊x/D⌋ + 0.5), when N is even; X = D round(x/D), when N is odd, (45)

Δx = X − x, (46)

ā = (a(X − D/2) + a(X + D/2)) / 2, (47)

a_x = (a(X − D/2) − a(X + D/2)) / D, (48)

where x ∈ R is the input sample position, N denotes the number of filter coefficients, ⌊x⌋ denotes truncation of x to the nearest integer towards minus infinity, round(x) rounds x to the nearest integer, and D is the spacing between the filter coefficients.

Using CNC as in this paper, output positions with 'holes' are spatially predicted using the neighbourhood. The values of output positions in a region of densely distributed predicted signal values are linearly interpolated from the closest signal samples. To completely solve the problem of covered regions, foreground objects that also end up visible at the next time sample should have certainty one, while objects that end up in the background should be given certainty zero. Currently we use certainty one for all predicted image values.

4.1. Selection of motion and field-image for prediction

Motion estimation based on three decoded field-images, Ĩ_{t−3}, Ĩ_{t−2} and Ĩ_{t−1}, as in Eq. (18), gives an estimate for the centre image Ĩ_{t−2}. If the motion is constant in time, it can be accurately predicted in time. To keep the spatial accuracy in the vertical direction (of the interlaced image sequence) when the motion is constant in time, we use Δt = 2 in Eqs. (40), (41) and (42) for the motion compensated prediction of I_t. If the motion is non-constant in time, a better estimate is achieved if only the two previously decoded field-images, Ĩ_{t−2} and Ĩ_{t−1}, are used in the motion estimation. For this case the current field-image is predicted using Ĩ_{t−1} and v_{t−1}. The approach with the least motion compensated prediction error is selected.

4.2. Motion compensated prediction refinement

The initial motion compensated prediction moves the image characteristics of a previous field-image using a dense motion field. To deal with the problems of a non-constant motion field and varying image characteristics, the initially predicted field-image is refined using forward motion compensation. This new backward–forward motion compensation scheme (BFMC) is shown in Fig. 7. To achieve a low rate overhead for the forward motion compensation, it is based on sub-pixel block-matching with a small search window. A small search window, typically less than ±1 × ±1 pixel, also reduces problems with block artifacts as

those in standard forward motion compensation schemes. The forward motion compensation refinement is only used when the total reconstruction error is lower than when using the backward motion compensation alone. One bit for each field-image is transmitted to the decoder to determine the use of forward refinement. The result of using refinement is shown in Fig. 8. It can be seen that errors around edges of moving objects are much less visible after refinement.

Fig. 7. Backward–forward motion compensation scheme (BFMC). (Diagram blocks: previously decoded field-images, reference field-image, motion estimation, backward estimated motion, motion compensation, backward predicted field-image, motion compensated refinement, refinement motion field, backward–forward predicted field-image, motion field coding, residual coding.)

5. Experimental results

The BFMC is compared against a block-based forward motion compensation scheme (BMA), both using the SPIHT coder [29] for encoding the intra field as well as the motion compensated residuals. We also make a comparison with version 1.2 of the MPEG-2 coder from MPEG [25]. All coding is performed on luminance images only. The first field-image is intra coded (I picture); all other field-images are inter coded (P pictures). The motion compensation is performed on field-images of interlaced video. The coding performance is evaluated using the peak-signal-to-noise-ratio (PSNR):

PSNR = 10 log10 (MAX² / RMS²), (49)

where MAX is the maximum signal value and RMS is the root-mean-squared error.
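Eq. (49) in code form, as an illustrative helper (MAX = 255 for 8-bit luminance; the function name is ours):

```python
import numpy as np

def psnr(ref, rec, max_val=255.0):
    """Peak signal-to-noise ratio, Eq. (49): 10 log10(MAX^2 / RMS^2)."""
    err = np.asarray(ref, dtype=float) - np.asarray(rec, dtype=float)
    rms2 = np.mean(err ** 2)  # squared root-mean-squared error
    return 10.0 * np.log10(max_val ** 2 / rms2)
```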

Fig. 8. Comparison of error using backward motion compensation and backward–forward motion compensation (BFMC) for a field-image from the Rugby sequence. Top left: Original image at t − 1. Top right: Original image at t. Bottom left: Backward motion compensated prediction error (PSNR 24 dB). Bottom right: BFMC prediction error (PSNR 26 dB). Note: The field-images have been bi-linearly interpolated to full resolution.


The BFMC estimates the motion on previously decoded field-images and uses the motion for motion compensated prediction of the field-image to be coded. The predicted field-image is then refined with forward motion compensation using block-matching. The block size (y × x) is 8 × 16 pixels and the search region is ±0.75 pixel. The search is based on mean-squared error (MSE) and uses bi-linear interpolation to achieve quarter-pel accuracy. The BMA uses a block-based motion estimator, as it is the standard approach and is used in the MPEG standards. The block size is 8 × 16 pixels. We use an integer search within (±8, ±16) displacement, followed by half-pel refinement within ±1 half-pel of the best integer match, and finally quarter-pel refinement within ±1 quarter-pel of the best half-pel match, based on MSE and using bi-linear interpolation. The motion estimation is performed both using the previous decoded field-image and the field-image to be coded, and using the second previous decoded field-image and the field-image to be coded. The displacement with the least MSE is selected. As for the BFMC, this selection must be transmitted to the decoder using one bit per field-image. The coding rate of the velocity field of both the BFMC and the BMA is estimated using Huffman coding of each field. The residual coding rate is reduced by the rate for the velocity field coding. To give the BFMC and the BMA the same starting point, the first three field-images are coded the same: the first one is intra field coded, followed by two BMA predicted fields. The same intra and inter coding rate is used for the BFMC and the BMA.
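The sub-pixel refinement search described above can be sketched as an exhaustive MSE search over quarter-pel offsets within ±0.75 pixel, with bi-linear interpolation at fractional positions. The helper names and the scanning order are our own; this is a sketch of the technique, not the paper's implementation.

```python
import numpy as np

def bilinear(img, y, x):
    """Sample img at real-valued (y, x) using bi-linear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, img.shape[0] - 1)
    x1 = min(x0 + 1, img.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x1]
            + dy * (1 - dx) * img[y1, x0] + dy * dx * img[y1, x1])

def refine_block(pred, ref, top, left, h=8, w=16, radius=0.75, step=0.25):
    """Exhaustive MSE search for one h x w block of the field to code
    (ref) against the backward-predicted image (pred), over sub-pixel
    offsets within +/-radius pixel in quarter-pel steps."""
    offsets = np.arange(-radius, radius + step / 2, step)
    best, best_mse = (0.0, 0.0), np.inf
    for oy in offsets:
        for ox in offsets:
            mse = 0.0
            for r in range(h):
                for c in range(w):
                    y, x = top + r + oy, left + c + ox
                    if not (0 <= y <= pred.shape[0] - 1
                            and 0 <= x <= pred.shape[1] - 1):
                        mse = np.inf
                        break
                    d = ref[top + r, left + c] - bilinear(pred, y, x)
                    mse += d * d
                if not np.isfinite(mse):
                    break
            if mse < best_mse:
                best_mse, best = mse, (float(oy), float(ox))
    return best
```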

The MPEG-2 coder is used in field-image coding mode. No bi-directional prediction (B picture) is used. The MPEG-2 coder first makes an integer search within (±8, ±16) displacements, followed by a half-pixel search within ±1 half-pixel displacement of the best integer match. The quantized DCT coefficients are run-length coded using an alternate scan. We use an intra DC precision of 9 bits for intra coded macroblocks. The frame rate is set to 25 Hz. The group of pictures (GOP) is set equal to the total number of field-images in the respective experiment, i.e. only one intra coded field-image is used. The MPEG-2 coder is designed according to test model 5 (TM5) [14]. The motion estimation in this MPEG-2 coder has no hierarchical estimation mode.

5.1. The effect of coding on motion estimation and motion compensation accuracy

Since we are estimating the motion on previously decoded images, it is important that the motion estimation accuracy is relatively independent of coding errors. This is especially true for regions of high contrast, as motion estimation errors in such regions will cause potentially large motion compensation errors.

A commonly used test sequence in the evaluation of optical flow algorithms is the Yosemite sequence. The sequence was generated using a digital terrain map and includes velocities up to about 4 pixels/frame. The progressive Yosemite sequence has been lowpass filtered and downsampled vertically to produce an interlaced sequence, see Fig. 9. The performance on this sequence is often measured in

Fig. 9. Field-image corresponding to frame 9 of the Yosemite sequence (size 126 × 316). Left: Image. Right: Downsampled velocity field. Note: The field-image has been bi-linearly interpolated to full resolution.

angular error [5]:

Δψ = arccos(v̂_est^T v̂_true), (50)

where v̂ = 1/√(v_x² + v_y² + 1) (v_x, v_y, 1)^T is the velocity represented as a 3D vector, for the 2D velocity v = (v_x, v_y)^T.

First we compare the backward motion compensated prediction based on estimates from our motion estimator with motion compensated prediction based on the reference motion field, see Fig. 10. For rates larger than 1.1 bpp our motion estimator gives a motion compensated prediction with higher average PSNR than that obtained by using the reference motion field. This may seem surprising at first; however, the best motion field for prediction is not necessarily the reference motion field. This can be the case since the reference motion field is the true velocity at the instant corresponding to the previous field-image, i.e. the field-image used for prediction. The best motion field for prediction should be at some time instant in between the predicted field-image and the field-image used for prediction. Motion estimates from decoded images at coding rates between 0.1 and 1.1 bpp give a prediction with lower average PSNR than using the reference motion field, but the prediction remains quite stable with decreasing coding rate.

Fig. 10. Motion compensation accuracy versus coding rate, for the field-image corresponding to frame 10 of the Yosemite sequence. Based on prediction of the estimated motion of frame 9 (solid). Based on prediction of the reference motion of frame 9 (dash–dot).

The effect of coding and of limiting the accuracy of the motion estimates on the motion compensated prediction error is shown in Fig. 11. For this sequence, and for rates at or above 0.5 bpp, motion estimates can be quantized to eighth-pel accuracy without any degradation in the PSNR of the motion compensated prediction. For coding rates below 0.25 bpp it is sufficient to have half-pel accuracy of the motion estimates.

Fig. 11. Motion compensation accuracy versus coding rate for varying accuracy of the motion estimates, for the field corresponding to frame 10 of the Yosemite sequence. From bottom to top: integer accuracy, half-pel accuracy, quarter-pel accuracy, eighth-pel accuracy (dotted) and finally full accuracy (solid).

The effect of coding on the accuracy of the motion estimates is shown in Fig. 12. The result for our motion estimator for the field-image corresponding to frame 9, estimated on two reference field-images, with 100% density, is 10.7° angular error (RMS 0.72). The best result reported at 100% density, in Barron et al. [5], is about 10° angular error. Without the sky, which is often removed in evaluations, we get 6.3° angular error (RMS 0.5). Common to all approaches which perform well on the Yosemite sequence is that they use relatively large spatio-temporal support. Our motion estimator is designed with more complex motion fields in mind and uses only two


Fig. 12. Motion estimation accuracy in angular error versus coding rate, for the field corresponding to frame 9 of the Yosemite sequence. With 100% density (solid). Excluding the sky (dash–dot).


Fig. 13. Rate distortion performance on the Yosemite sequence. Reconstruction using BFMC (solid), BMA (dotted), MPEG-2 (dash–dot), and intra coding (dashed). The average PSNRs are 35.1, 34.9, 34.1 and 31.2 dB.

or three fields of interlaced video in the motion estimation.
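The angular error of Eq. (50), used for the evaluation above, can be computed as follows (an illustrative helper; the function name and the vectorization over the trailing axis are ours):

```python
import numpy as np

def angular_error_deg(v_est, v_true):
    """Barron-style angular error, Eq. (50): the angle between the 2D
    velocities embedded as 3D direction vectors (vx, vy, 1)."""
    def embed(v):
        v = np.asarray(v, dtype=float)
        ones = np.ones(v.shape[:-1] + (1,))
        w = np.concatenate([v, ones], axis=-1)
        return w / np.linalg.norm(w, axis=-1, keepdims=True)
    dot = np.clip(np.sum(embed(v_est) * embed(v_true), axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(dot))
```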

5.2. Coding performance

The coding performance is evaluated on our interlaced version of the Yosemite sequence, as in Section 5.1, and on two interlaced video sequences, the Rugby sequence and the Mobile Calendar sequence, at 25 frames per second (PAL). The Rugby sequence contains both fast motions of objects and global motion. The Mobile Calendar sequence contains relatively slow motions but includes highly detailed moving areas. In Fig. 13 the rate distortion performance on the Yosemite sequence is shown. The BFMC has up to 1.6 dB better average PSNR than MPEG-2 and up to 0.4 dB better average PSNR than the BMA. The results on the central part of the Rugby sequence in terms of rate distortion are shown in Fig. 14, and specifically for rate 0.75 bpp in Fig. 15. To focus the evaluation on the moving objects we use the central 288 × 360 part of the Rugby sequence. It is shown that the BFMC has 0.9–1.1 dB better average PSNR than the MPEG-2 coder for rates between 0.25 and 1.5 bpp. The BMA is also better than the MPEG-2 coder, due to the use of quarter-pel accuracy and partly due to


Fig. 14. Rate distortion performance on the central 288 × 360 part of the first 20 frames of the Rugby sequence. Reconstruction using BFMC (solid), BMA (dotted), MPEG-2 (dash–dot) and intra coding (dashed). The average PSNRs are 32.9, 32.6, 32.0 and 30.8 dB.

the use of the SPIHT coder. For the rate 0.125 bpp the MPEG-2 coder is better than both the BFMC and the BMA. Note that the MPEG-2 coder is a complete coder with features such as intra coding on selected blocks. The BFMC has up to 0.5 dB better average PSNR than the BMA. In Fig. 15 it is shown that the use of forward refinement, for


Fig. 15. Coding at 0.75 bpp on the central 288 × 360 part of the first 20 frames of the Rugby sequence. Top: Reconstruction using BFMC (solid), BMA (dotted) and MPEG-2 (dash–dot). The average PSNRs are 34.1, 33.7 and 33.2 dB. Bottom: Motion compensated prediction for BMA (dotted), the proposed coder without forward refinement (dashed) and with forward refinement (BFMC) at 0.035 bpp (solid). The average PSNRs are 26.0, 24.3 and 25.6 dB. The BMA uses 0.066 bpp for coding of the velocity field.

this sequence with a highly non-constant motion field, improves the average PSNR by 1.3 dB compared to using only the backward motion compensation. The BFMC gives a motion compensated prediction with slightly lower PSNR than that of the BMA. However, for velocity field coding, the BFMC uses only 0.035 bpp compared to 0.066 bpp for the BMA. For this sequence, motion estimation based on two field-images, with motion compensation based on the previously decoded field-image, is often selected. A comparison of coded subjective quality on the Rugby sequence is shown in Fig. 18. The BFMC is free from blocky artifacts, which leads to a more pleasant subjective performance around edges of moving objects. However, relatively flat regions should be coded more carefully than regions with high contrast, since flat regions are subjectively more sensitive to coding errors. The MPEG-2 coder considers this by scaling the quantization step size with a measure of spatial activity for each 8 × 8 block, and thus gets a better subjective performance in relatively flat regions. The results on the Mobile Calendar sequence in terms of rate distortion are shown in Fig. 16 and


Fig. 16. Rate distortion performance on the first 20 frames of the Mobile Calendar sequence. Reconstruction using BFMC (solid), BMA (dotted), MPEG-2 (dash–dot) and intra coding (dashed). The average PSNRs are 30.8, 30.7, 30.2 and 26.1 dB.


Fig. 17. Coding at 0.75 bpp of the first 20 frames of the Mobile Calendar sequence. Top: Reconstruction using BFMC (solid), BMA (dotted) and MPEG-2 (dash–dot). The average PSNRs are 32.0, 31.8 and 31.1 dB. Bottom: Motion compensated prediction error for BMA (dotted), the proposed coder without forward refinement (dashed) and with forward refinement (BFMC) (solid). The average PSNRs are 26.5, 26.3 and 26.3 dB.

specifically for rate 0.75 bpp in Fig. 17. The BFMC has 0.4–1.7 dB better PSNR than the MPEG-2 coder for rates at or above 0.5 bpp. For rates at and below 0.25 bpp the MPEG-2 coder is better than both the BFMC and the BMA. The BFMC has up to 0.25 dB better PSNR than the BMA.


Fig. 18. Comparison of coded subjective quality for coding at 0.75 bpp on the central 288 × 360 part of the first 20 frames of the Rugby sequence. A part of field-image 34 is shown. Top left: Reference. Top right: BFMC (PSNR 32.9 dB). Bottom left: BMA (PSNR 32.0 dB). Bottom right: MPEG-2 (PSNR 30.4 dB). Note: The field-images have been bi-linearly interpolated to full resolution.

The coding rate for the velocity field is about 0.006 bpp for the BFMC and about 0.06 bpp for the BMA. Forward refinement is rarely used since the motion field is relatively constant. For this sequence, motion estimation based on three field-images, with motion compensation based on the second previously decoded field-image, is often selected.

6. Conclusions

We have presented new motion estimation and motion compensation techniques intended for video coding. The motion estimation is based on phase and uses two or three decoded fields of interlaced images. The approach uses backward

motion compensation to achieve pixel-based motion compensated prediction. This initial prediction is then refined using forward motion compensation to compensate for non-constant motion and to consider the specific characteristics of the image to be coded. Using block-matching with a small search area, this can be achieved at a lower coding cost than for standard forward motion compensating coders. Experimental results show that this new approach to motion compensation has up to 1.6 dB higher PSNR than MPEG-2 and up to 0.5 dB better PSNR than block-based motion estimation and compensation using the same residual coder as in our approach, for coding of interlaced video. Perhaps most important, the subjective performance of our approach is also better.


Appendix A. Relation to orientation tensors

In the following we show the similarity between the used algorithm and the concept of orientation tensors [17]. A local structure in spatio-temporal orientation n̂ has orientation tensor T = A n̂ n̂^T, where A is a scalar greater than zero representing the local signal amplitude. The motion can be extracted by an eigenvalue analysis of a 3D orientation tensor: the motion for a point is in the direction of the eigenvector with the smallest eigenvalue. Eq. (25), omitting W(x), can be rewritten to be expressed with a 3D orientation tensor T:

E_v = Σ_{k∈O} (c_k^T v̄)² = v̄^T (Σ_{k∈O} c_k c_k^T) v̄ = v̄^T T v̄. (A.1)

Using Eq. (26), the components of the 3D orientation tensor can be written as

T(x) = [t1, t4, t5; t4, t2, t6; t5, t6, t3]
     = [Σ c_k f_kx f_kx, Σ c_k f_kx f_ky, Σ c_k f_kx f_kt;
        Σ c_k f_ky f_kx, Σ c_k f_ky f_ky, Σ c_k f_ky f_kt;
        Σ c_k f_kx f_kt, Σ c_k f_ky f_kt, Σ c_k f_kt f_kt]. (A.2)

Minimization of Eq. (A.1) with respect to v gives

v(x) = −[Σ c_k f_kx f_kx, Σ c_k f_kx f_ky; Σ c_k f_ky f_kx, Σ c_k f_ky f_ky]⁻¹ [Σ c_k f_kx f_kt; Σ c_k f_ky f_kt] = −T_2D⁻¹ [t5; t6], (A.3)

where T_2D (2 × 2) is the 2D orientation tensor describing the spatial structure, found as the upper left part of T. From this it is evident that the 3D orientation tensor can be determined from the 2D orientation tensor as

T(x) = [T_2D, T_2D v; (T_2D v)^T, t3]. (A.4)

We can determine the velocity uniquely using Eq. (A.3) if the neighbourhood consists of a structure with multiple orientations. Lowpass filtering of the estimated 3D tensor makes it more likely to find the true motion, i.e. a solution to Eq. (A.3). This is equivalent to the window function W(x) used in Eq. (32). The lowpass filtering is made separately for each of the six components of the tensor.
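A sketch of the velocity solve in Eq. (A.3): the 2D tensor components are certainty-weighted sums of response products, and the velocity is the solution of the resulting 2 × 2 system. The flat response arrays and the function name are illustrative assumptions, not the paper's filter set.

```python
import numpy as np

def tensor_velocity(fx, fy, ft, c=None):
    """Solve Eq. (A.3): v = -T_2D^{-1} (t5, t6)^T, with tensor components
    built as certainty-weighted sums of filter-response products."""
    fx, fy, ft = (np.asarray(a, dtype=float).ravel() for a in (fx, fy, ft))
    c = np.ones_like(fx) if c is None else np.asarray(c, dtype=float).ravel()
    T2D = np.array([[np.sum(c * fx * fx), np.sum(c * fx * fy)],
                    [np.sum(c * fy * fx), np.sum(c * fy * fy)]])
    t56 = np.array([np.sum(c * fx * ft), np.sum(c * fy * ft)])
    # T2D must be invertible, i.e. the neighbourhood needs multiple orientations
    return -np.linalg.solve(T2D, t56)
```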

References

[1] K. Andersson, P. Johansson, R. Forchheimer, H. Knutsson, Backward–forward motion compensated prediction, in: Advanced Concepts for Intelligent Vision Systems (ACIVS'02), Ghent, Belgium, 2002, pp. 260–267.
[2] K. Andersson, H. Knutsson, Multiple hierarchical motion estimation, in: Proceedings of Signal Processing, Pattern Recognition, and Applications (SPPRA'02), Crete, Greece, 2002, pp. 80–85.
[3] K. Andersson, H. Knutsson, Continuous normalized convolution, in: Proceedings of the International Conference on Multimedia and Expo (ICME'02), Lausanne, Switzerland, 2002, pp. 725–728.
[4] M. Andersson, J. Wiklund, H. Knutsson, Filter networks, in: Proceedings of Signal and Image Processing (SIP'99), IASTED, Nassau, Bahamas, 1999; also as Technical Report LiTH-ISY-R-2245.
[5] J.L. Barron, D.J. Fleet, S.S. Beauchemin, Performance of optical flow techniques, Internat. J. Comput. Vision 12 (1) (1994) 43–77.
[6] M. Bierling, R. Thoma, Motion compensated field interpolation using a hierarchically structured motion estimator, Signal Processing 11 (1986) 387–404.
[7] D.J. Fleet, A.D. Jepson, Computation of component image velocity from local phase information, Internat. J. Comput. Vision 5 (1) (1990) 77–104.
[8] D.J. Fleet, A.D. Jepson, Stability of phase information, in: Proceedings of the IEEE Workshop on Visual Motion, IEEE Society Press, Princeton, USA, 1991, pp. 52–60.
[9] D.J. Fleet, A.D. Jepson, Stability of phase information, IEEE Trans. Pattern Anal. Machine Intell. 15 (12) (1993) 1253–1268.
[10] R. Forchheimer, T. Ericson, Differential transform coding: a new image compression scheme, in: Proceedings of the International Conference on Digital Signal Processing, Florence, Italy, 1981.
[11] G.H. Granlund, H. Knutsson, Signal Processing for Computer Vision, Kluwer Academic Publishers, Dordrecht, 1995, ISBN 0-7923-9530-1.
[12] M. Hemmendorff, Motion estimation and compensation in medical imaging, Ph.D. Thesis, Linköping University, SE-581 85 Linköping, Sweden, Dissertation No. 703, ISBN 91-7373-060-2, 2001.
[13] B.K.P. Horn, B.G. Schunk, Determining optical flow, Artificial Intell. 17 (1981) 185–204.
[14] ISO/IEC JTC1/SC29/WG11, Test model 5 (TM5), 1993.


[15] A.D. Jepson, M. Jenkin, The fast computation of disparity from phase differences, in: Proceedings CVPR, San Diego, CA, USA, 1989, pp. 386–398.
[16] H. Knutsson, Filtering and reconstruction in image processing, Ph.D. Thesis, Linköping University, Sweden, Dissertation No. 88, 1982.
[17] H. Knutsson, Representing local structure using tensors, in: Proceedings of the Sixth Scandinavian Conference on Image Analysis, Oulu, Finland, 1989, pp. 244–251; Report LiTH-ISY-I-1019, Computer Vision Laboratory, Linköping University, Sweden, 1989.
[18] H. Knutsson, M. Andersson, J. Wiklund, Advanced filter design, in: Proceedings of the Scandinavian Conference on Image Analysis (SCIA), Greenland, 1999.
[19] H. Knutsson, C.-F. Westin, Normalized and differential convolution: methods for interpolation and filtering of incomplete and uncertain data, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1993, pp. 515–523.
[20] R. Krishnamurthy, P. Moulin, J.W. Woods, Optical flow techniques applied to video coding, in: Proceedings of the IEEE International Conference on Image Processing, Vol. I, Washington, USA, 1995, pp. 570–573.
[21] T. Kronander, Some Aspects of Perception Based Image Coding, Dissertation No. 203, Linköping University, Linköping, 1989.
[22] C. Kuglin, D. Hines, The phase correlation image alignment method, in: Proceedings of the IEEE International Conference on Cybernetics and Society, IEEE, 1975, pp. 163–165.
[23] H. Li, D.C. Escudero, R. Forchheimer, Motion compensated multiresolution transmission of digital video signals, in: H. Harashima (Ed.), Proceedings of the 1995 International Workshop on Very Low Bit-rate Video, 1995.

[24] K.P. Lim, M.N. Chong, A. Das, Low-bit-rate video coding using dense motion field and uncovered background prediction, IEEE Trans. Image Process. 10 (2001) 164–166.
[25] MPEG Software Simulation Group, MPEG-2 Encoder/Decoder, Version 1.2, 1996, available at ftp://ftp.mpeg.org/pub/mpeg/mssg/.
[26] T. Naveen, J.W. Woods, Motion compensated multiresolution transmission of high definition video, IEEE Trans. Circuits Systems Video Technol. 4 (1994) 29–41.
[27] A.N. Netravali, J.D. Robbins, Motion compensated television coding: Part 1, Bell System Technical J. 58 (1979) 631–670.
[28] D.A. Pollen, S.F. Ronner, Spatial computation performed by simple and complex cells in the visual cortex of the cat, Vision Res. 22 (1982) 101–118.
[29] A. Said, W. Pearlman, A new fast and efficient image codec based on set partitioning in hierarchical trees, IEEE Trans. Circuits Systems Video Technol. 6 (1996) 243–250.
[30] D. Wang, D. Lauzon, Hybrid algorithm for estimating true motion-fields, Opt. Eng. 39 (11) (2000) 2876–2881.
[31] C.-J. Westelius, Focus of attention and gaze control for robot vision, Ph.D. Thesis, Linköping University, SE-581 83 Linköping, Sweden, Dissertation No. 379, ISBN 91-7871-530-X, 1995.
[32] R. Wilson, H. Knutsson, A multiresolution stereopsis algorithm based on the Gabor representation, in: Proceedings of the Third International Conference on Image Processing and its Applications, IEE, Warwick, Great Britain, 1989, pp. 19–22.
[33] X. Yang, K. Ramchandran, Scalable wavelet video coding using aliasing-reduced hierarchical motion compensation, IEEE Trans. Image Process. 9 (5) (2000) 778–791.