Motion compensation using backward prediction and prediction refinement


Signal Processing: Image Communication 18 (2003) 381–400

Kenneth Andersson^a, Mats Andersson^a, Peter Johansson^b, Robert Forchheimer^b, Hans Knutsson^a,*

^a Department of Biomedical Engineering, Linköping University, 581 85 Linköping, Sweden
^b Image Coding Group, Linköping University, 581 83 Linköping, Sweden

Received 1 July 2002; received in revised form 18 December 2002; accepted 30 December 2002

Abstract

This paper presents new methods for the use of dense motion fields for motion compensation of interlaced video. The motion estimation is based on previously decoded field-images. The motion is then temporally predicted and used for motion compensated prediction of the field-image to be coded. The motion estimation algorithm is phase-based and uses two or three field-images to achieve motion estimates with sub-pixel accuracy. To handle non-constant motion and the specific characteristics of the field-image to be coded, the initially predicted image is refined using forward motion compensation based on block-matching. Tests show that this approach achieves higher PSNR than forward block-based motion estimation when coding the residual with the same coder. The subjective performance is also better. © 2003 Elsevier Science B.V. All rights reserved.

Keywords: Backward–forward motion compensation; Phase-based motion estimation; Irregular sampling; Continuous normalized convolution

1. Introduction

Since the advent of television, obtaining high perceived quality using a limited bandwidth has been an important issue. Since those early days, the demand for visual information to be stored or transmitted has progressively increased. Critical video coding components in hybrid coding [10] are motion estimation and motion compensated prediction, which are also the main topics of the present paper.

[Footnote: Supported by the Swedish Foundation for Strategic Research (SSF) and the program Visual Information Technology (VISIT). *Corresponding author. Tel.: +46-13-227294; fax: +46-13-101902. E-mail address: [email protected] (H. Knutsson).]

Standardized hybrid coders such as MPEG-1, MPEG-2, MPEG-4, H.261 and H.263 use block-based motion estimation and motion compensated prediction. The motion parameters and the prediction residual are coded and transmitted to the decoder. The simplicity and the relatively good performance of block-based motion compensation have made it the standard approach for motion compensation. To deal with the unnatural motion compensated prediction inherent in block-based algorithms, a dense motion field has been estimated and used in a forward motion compensated scheme [20]. To use a dense motion field but avoid the overhead to

0923-5965/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved. doi:10.1016/S0923-5965(03)00012-2


code it, backward video coding has been introduced using the assumption of predictability in space [27], in time [26] or in spatial frequency [23]. In backward video coding, motion estimation is performed at both the encoder and the decoder, and only transmission of the coded prediction residual is necessary. In Lim et al. [24], backward video coding using temporal prediction of a dense motion field is selected for a block when the motion compensated prediction error is lower than for forward video coding using H.263. Yang et al. [33] estimate motion for blocks of size 4 × 4 pixels on previously decoded images at a low-to-high frequency scale. To improve the motion estimates at low bit rates, with increasing quantization noise, the backward motion estimates are refined using forward motion estimation. In this approach, block artifacts can still be present since the motion estimation is block-based.

We propose a new backward–forward motion compensation scheme [1]. Initially, a dense motion field is estimated on previously decoded field-images using a new phase-based motion estimator. The motion field is temporally predicted and used for motion compensated prediction of the current field-image. To handle a non-constant motion field and the specific characteristics of the field-image to be coded, the initial backward motion compensated prediction is refined using block-based forward motion compensation. The use of an initial pixel-based motion compensation reduces problems with block artifacts such as those in standard block-based forward motion compensation schemes.

Image sequences consist of a large amount of data. For this reason, methods operating on image sequences must be computationally efficient to be practical. Computational efficiency is therefore a central theme of the work presented here.

This paper is organized as follows. In Section 2, a quadrature filter bank representation of an image is efficiently produced. In Section 3, phase-differences from products of quadrature filter responses are used in a new motion estimator to generate a dense motion vector field. In Section 4, we present a novel approach for motion compensation that uses a dense motion field to get an initial prediction and a sparse motion field for refinement. Experimental results are given in Section 5. Finally, some conclusions are given in Section 6.
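The backward–forward idea can be condensed into a toy sketch. This is not the paper's algorithm (which uses a dense phase-based motion field on interlaced field-images); it is a minimal 1D stand-in in which a single global integer shift plays the role of the motion field, and both function names are hypothetical:

```python
import numpy as np

def estimate_shift(ref, target, max_shift=4):
    # Toy global motion estimator: the integer shift s that best maps ref
    # onto target (a stand-in for the dense phase-based estimator).
    shifts = list(range(-max_shift, max_shift + 1))
    errors = [np.sum((np.roll(ref, s) - target) ** 2) for s in shifts]
    return shifts[int(np.argmin(errors))]

def backward_forward_predict(decoded_t2, decoded_t1, current):
    # Backward step: motion is estimated from previously *decoded* frames
    # only, so the decoder can repeat it without receiving motion data.
    v = estimate_shift(decoded_t2, decoded_t1)    # motion t-2 -> t-1
    prediction = np.roll(decoded_t1, v)           # assume constant motion to t
    # Forward refinement: correct the prediction against the actual current
    # frame; only this correction needs to be signalled to the decoder.
    dv = estimate_shift(prediction, current)
    return np.roll(prediction, dv)

# A pattern translating by 2 samples per frame.
f0 = np.sin(np.linspace(0.0, 6.0 * np.pi, 64))
frames = [np.roll(f0, 2 * t) for t in range(3)]
pred = backward_forward_predict(frames[0], frames[1], frames[2])
```

With constant motion the backward prediction is already exact and the refinement shift is zero; the refinement step matters when the motion changes between t−1 and t.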

2. Quadrature filter bank representation

This section describes an efficient computation of a quadrature filter bank representing the image at different orientations and scales covering the Fourier domain. A fundamental property of quadrature filters is that the magnitude of the output is invariant to the phase of the underlying signal. There are also many physiological and psychophysical experiments consistent with the hypothesis that the visual cortex contains cells with similar behaviour [28]. For these reasons quadrature filters have been extensively used as a basis for image representation and analysis [16].

In the present work the initial image representation has mainly two purposes. In addition to producing local phase estimates for the motion estimator in Section 3, it is essential that the representation supports a video quality estimator adapted to the human visual system. This opens up the possibility to control video coding using video quality estimation.

2.1. Quadrature filters

The phase, $\phi(x)$, and magnitude, $m(x)$, are defined as the phase angle and the absolute value of the complex-valued response, $q(x)$, of the quadrature bandpass filter, $f$:

$$q(x) = (f * I)(x) = q_{\mathrm{real}}(x) + i\,q_{\mathrm{imag}}(x) = m(x)\,e^{i\phi(x)}, \tag{1}$$

$$\phi(x) = \arg\big(q_{\mathrm{real}}(x) + i\,q_{\mathrm{imag}}(x)\big), \tag{2}$$

$$m(x) = \sqrt{q_{\mathrm{real}}(x)^2 + q_{\mathrm{imag}}(x)^2}, \tag{3}$$

where $x$ denotes the spatial coordinates $(x, y)^T$, $*$ the convolution operator, $q_{\mathrm{real}}(x)$ and $q_{\mathrm{imag}}(x)$ the real and imaginary parts of $q(x)$, and $I$ is a real-valued image.

The magnitude $m(x)$ of the quadrature filter response determines how well the image fits the shape of the filter and also indicates the local


energy in the image. The filter's operating range depends on its angular bandwidth and radial frequency bandwidth. The attainable joint spatial and frequency resolution is fundamentally restricted by the well-known uncertainty principle. A large bandwidth gives phase estimates with good spatial localization but, on the other hand, increases Fourier domain interference in the phase estimation.

2.2. Motion estimation

A differential technique similar to that for intensity [13] can be used for phase-based motion estimation [7,12]. Since the phase is close to linear within a filter's operating range we can constrain the motion using phase-differences:

$$\phi_x v_x + \phi_y v_y + \phi_t = 0, \tag{4}$$

where $v = (v_x, v_y)^T$ are the velocity components in the spatial directions $x$ and $y$ respectively, and $\phi_s$ is the partial derivative of $\phi$, where $s$ denotes the direction of differentiation and can be one of the spatial directions or the temporal direction $t$. Eq. (4) is, however, ambiguous; for this reason at least two filters are needed to resolve the ambiguity of the phase constraint equation. Further, for simple neighbourhoods [11], the so-called aperture problem is present when using phase in the same way as when using the original image for estimating local velocity.

2.3. Efficient filter bank for interlaced video

In this paper, we have selected the relation between the radial and the angular frequency response of the filters to make the magnitude of the kernel as isotropic as possible. Typical values of the frequency and orientation bandwidths are 1.5–2 octaves and 70°, respectively. This provides compact spatial support and promotes efficient filtering. The large amount of image data enforces requirements on computationally efficient filtering. To minimize the number of coefficients used in the computation of the quadrature filter responses, a sequential combination of small one-dimensional


(1D) filter kernels is used, i.e. a filter net [4,18]. The filter net is furthermore designed for interlaced video, the most common format for video sequences. This is achieved by designing the filter net with a restricted Fourier domain in the vertical direction (half of the horizontal) and by restricting the spatial positions of the filter coefficients in the vertical direction, see Fig. 1. This means that the final filter net can be applied directly to interlaced images without any prior interpolation and resampling to an image with equal spacing in the vertical and horizontal directions.

2.4. Filter bank design

An ideal set of frequency and orientation selective quadrature filters is designed in the Fourier domain. The frequency responses of the ideal quadrature filters are generated from differences of lowpass filters multiplied with filters with angular selectivity. All filters are more or less 1D. Four directions of filtering are used:

$$\hat{n}_x = (1, 0)^T, \tag{5}$$

$$\hat{n}_y = (0, 1)^T, \tag{6}$$

$$\hat{n}_{xy} = \tfrac{1}{\sqrt{2}}(1, 1)^T, \tag{7}$$

$$\hat{n}_{x\bar{y}} = \tfrac{1}{\sqrt{2}}(1, -1)^T. \tag{8}$$

An ideal quadrature filter, $F_k(u)$, is generated as

$$F_k(u) = D_k(\hat{u})\,BP_k(u), \quad k = [1 \ldots 9], \tag{9}$$

where $\hat{u} = u/\|u\|$ and the angular selectivity function is given by

$$D_k(\hat{u}) = \begin{cases} (\hat{u}^T \hat{n}_k)^2, & \text{when } \hat{u}^T \hat{n}_k > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$

The bandpass filters are generated as

$$BP_k(u) = \begin{cases} LP1_k(u) - LP2_k(u), & k = [2 \ldots 9] \\ \big(LP1_k(u) - LP2_k(u)\big)\,LP3(u), & k = 1, \end{cases} \tag{11}$$


where $\alpha_{jk} = \pi/(2 \ln(u_{hjk}/u_{ljk}))$, with the lowpass filters given by

$$LP1_k(u) = \begin{cases} \min\!\Big(\cos^2\!\Big(\alpha_{1k} \ln\Big|\tfrac{u^T n_{1k}}{u_{l1k}}\Big|\Big),\ \cos^2\!\Big(\alpha_{1k} \ln\Big|\tfrac{u^T n_{1k}^{\perp}}{u_{l1k}}\Big|\Big)\Big), & u_{l1k} \le |u| \le u_{h1k} \\ 1, & |u| < u_{l1k} \\ 0, & |u| > u_{h1k}, \end{cases} \tag{12}$$

where $n_{1k}^{\perp}$ is perpendicular to $n_{1k}$, and

$$LP2_k(u) = \begin{cases} \min\!\Big(\cos^2\!\Big(\alpha_{2k} \ln\Big|\tfrac{u^T n_{2k}}{u_{l2k}}\Big|\Big),\ \cos^2\!\Big(\alpha_{2k} \ln\Big|\tfrac{u^T n_{2k}^{\perp}}{u_{l2k}}\Big|\Big)\Big), & u_{l2k} \le |u| \le u_{h2k} \\ 1, & |u| < u_{l2k} \\ 0, & |u| > u_{h2k}, \end{cases} \tag{13}$$

and finally

$$LP3(u) = \begin{cases} \cos^2\!\Big(\tfrac{\pi}{2 \ln 2} \ln\Big|\tfrac{u^T \hat{n}_y}{\pi/8}\Big|\Big), & \pi/8 \le |u^T \hat{n}_y| \le \pi/4 \\ 1, & |u^T \hat{n}_y| < \pi/8 \\ 0, & |u^T \hat{n}_y| > \pi/4. \end{cases} \tag{14}$$

The parameters for each quadrature filter can be described by the direction of filtering $\hat{n}_k$, $n_{1k}$ and $n_{2k}$, and the low and high cut-off frequencies of the lowpass filters $u_{l1k}$, $u_{h1k}$, $u_{l2k}$ and $u_{h2k}$. The parameters for the ideal quadrature filters are given in Table 1, together with an indication of the centre frequency $\omega_c$. It can be noted that all quadrature filters come in pairs at each scale, with the exception of the horizontal filter $F_1(u)$. Due to interlacing, $F_1(u)$ has no correspondence in the vertical direction, and this filter has to be designed differently. A contour plot of the frequency responses of the ideal quadrature filters at half amplitude is shown in Fig. 1.

2.5. Filter optimization

First, an ideal set of frequency and orientation selective quadrature filters is designed in the Fourier domain.

Table 1
Description of the parameters for generation of the frequency responses of the ideal quadrature filters. $\omega_c$ denotes the spatial centre frequency.

Filter  ω_c (rad)  n̂_k     n_1k    n_2k    u_l1k     u_h1k
F1      0.5π       n̂_x     n̂_x     n̂_x     π/2       π
F2      0.35π      n̂_xy    n̂_x     n̂_xy    π/4       π/2
F3      0.35π      n̂_x̄y    n̂_x     n̂_x̄y    π/4       π/2
F4      0.25π      n̂_x     n̂_xy    n̂_x     π/(4√2)   π/(2√2)
F5      0.25π      n̂_y     n̂_xy    n̂_x     π/(4√2)   π/(2√2)
F6      0.18π      n̂_xy    n̂_x     n̂_xy    π/8       π/4
F7      0.18π      n̂_x̄y    n̂_x     n̂_x̄y    π/8       π/4
F8      0.125π     n̂_x     n̂_xy    n̂_x     π/(8√2)   π/(4√2)
F9      0.125π     n̂_y     n̂_xy    n̂_x     π/(8√2)   π/(4√2)

Fig. 1. Contour plot of the frequency responses of the ideal quadrature filters at half amplitude. Note that the Fourier domain is restricted in the y direction due to interlacing.

The purpose of the optimization is to obtain the set of spatial kernels in the filter net that minimizes the weighted difference between the frequency responses of the ideal and the optimized filters, i.e. to find the minimum of

$$\epsilon^2 = \sum_k \sum_u \big\|W(u)\big(\tilde{F}_k(u) - F_k(u)\big)\big\|^2, \tag{15}$$

where $u$ denotes the Fourier coordinates $u_x$ and $u_y$, and $W(u)$ weights errors according to the expected spectrum. For most natural images the spectrum can, on average, be expected to be independent of scale and to decay as $|u|^{-\alpha}$ [18], where $|u| = \sqrt{u_x^2 + u_y^2}$. No restrictions on the spatial shape of the filters are considered other than the definition of the spatial support of the kernels. It is, however, possible to incorporate spatial criteria in the optimization; see [18] for more details on multiple-domain filter optimization.

With the ideal quadrature filter responses in mind a filter net is designed, see Fig. 2. The nodes denote summation points and the filters are located on the arcs connecting two consecutive levels. Ideally it would, for a given network structure, be desirable to optimize the number of coefficients in each arc and the spatial position of each coefficient for the entire net simultaneously. This is, however, an extremely complex problem and a method for finding an optimal solution has not been found. For this reason the design of the network has to be based on experience. For a given network the values of the filter coefficients can be optimized using Eq. (15).

Fig. 2. Filter net for generation of 9 quadrature filter kernels f̃_k using 141 real-valued kernel coefficients. The dots correspond both to the number of non-zero coefficients in each simple filter and to the positions of these coefficients. The simple filters at the bottom of the filter net have complex-valued kernel coefficients.

The optimization of the filter net is recursive. Each level in the filter net is optimized separately with respect to the final output nodes and the current state of the other levels in the filter net. The convergence of this approach is not guaranteed, but tests show that for educated choices of filter structure and coefficient distribution the network converges after a few iterations to a stable solution. For further details on this optimization see [4].

2.6. Result of the filter network optimization

The optimized filter network requires 141 real-valued multiplications per pixel, which corresponds to 15–16 multiplications for each quadrature filter response. The resulting optimized quadrature filters are shown in Fig. 3. Optimization of a single filter based on $F_6$, with the same frequency weighting function as in Eq. (15), requires 361 filter coefficients for a similar weighted error as the filter net approach. This shows that, in this application, the filter net is more than 20 times faster at generating a quadrature filter response than a filter kernel from a single filter optimization. For comparison, a fast Fourier transform (FFT) approach will on average also be more than 10 times slower. The characteristics of the generated quadrature filters can be visualized through their magnitude and phase characteristics in Fig. 4.

Fig. 3. Contour plot of the frequency responses of the generated quadrature filters f̃_k at half amplitude. Note that the Fourier domain is restricted in the y direction due to interlacing.

Fig. 4. Magnitude times phase of the quadrature filters for a spatial region of 37 × 37 pixels. Based on quadrature filter kernels, from left to right starting at the top: f̃_1, f̃_2, f̃_3, f̃_4, f̃_5, f̃_6, f̃_7, f̃_8 and f̃_9.

3. Motion estimation based on phase-differences

Phase was first introduced in [22], using the Fourier phase for signal registration, i.e. phase correlation. Other approaches based on phase-differences are [32,7,31]. Methods based on phase have been shown to perform well in comparison with other methods, see [5]. Phase variations are nearly linear with respect to displacements, a feature which makes it possible to obtain sub-pixel accuracy without explicit interpolation. Phase is furthermore relatively insensitive to variations in luminance and contrast, and also to small variations in scale and rotation [9]. A drawback of methods based on phase-differences is that they can be completely unreliable at singular points [7]. Such points can, however, be detected and avoided in the computation. Using narrowband filters reduces the risk of singular points; in this case, however, more filters are needed to cover the Fourier domain, and the filters become larger in the spatial domain, increasing the risk of mixing different objects in the motion estimation.

In this paper, the velocity is estimated by a least-squares fit of several local phase constraints computed from responses of the quadrature filters given in Section 2. This kind of minimization has also been used by Fleet and Jepson [7] and Hemmendorff [12]. Many methods are based on a relatively large spatio-temporal support. We focus on rapidly changing motion and use a temporal support of two or three field-images and a limited number of broad-band quadrature filters. Multiple hierarchical motion estimation is used to obtain high spatial resolution of the motion field.

First, in Section 3.1, we describe how phase-differences are computed. Then, in Section 3.2, a velocity is estimated based on a selected set of the available quadrature filter responses. This velocity is post-filtered in Section 3.3 to reduce uncertainty in the motion estimates. In Section 3.4 the complete motion estimation algorithm that produces the final velocity is described.

3.1. Computation of phase-differences

As shown in Eq. (4), the velocity is related to the spatial and temporal phase-differences. Our


main requirement on the measurements of the phase-differences is that they should be local both spatially and temporally, to separate objects with different motions and to support motion estimates with high spatial accuracy. For this reason we use the 2D spatial quadrature filters, $\tilde{f}_k$, from Section 2, in two or three field-images. The phase-differences $\phi_{kx}$, $\phi_{ky}$ and $\phi_{kt}$, and the corresponding magnitudes $m_{kx}$, $m_{ky}$ and $m_{kt}$, based on the quadrature filter response $\tilde{q}_k$, are computed as the arguments and magnitudes of the following complex-valued functions:

$$qq_{kx}(x) = \tilde{q}_k\big(x - (1,0)^T, t\big)\,\tilde{q}_k(x, t)^{*} + \tilde{q}_k(x, t)\,\tilde{q}_k\big(x + (1,0)^T, t\big)^{*}, \tag{16}$$

$$qq_{ky}(x) = \tilde{q}_k\big(x - (0,1)^T, t\big)\,\tilde{q}_k(x, t)^{*} + \tilde{q}_k(x, t)\,\tilde{q}_k\big(x + (0,1)^T, t\big)^{*}, \tag{17}$$

$$qq_{kt}(x) = \tilde{q}_k(x, t-1)\,\tilde{q}_k(x, t)^{*} + \tilde{q}_k(x, t)\,\tilde{q}_k(x, t+1)^{*}, \tag{18}$$

where $^{*}$ denotes the complex conjugate. The phase-differences and corresponding magnitudes are computed as

$$\phi_{kx}(x) = \arg(qq_{kx}(x)), \tag{19}$$

$$\phi_{ky}(x) = \arg(qq_{ky}(x)), \tag{20}$$

$$\phi_{kt}(x) = \arg(qq_{kt}(x)), \tag{21}$$

$$m_{kx}(x) = |qq_{kx}(x)|, \tag{22}$$

$$m_{ky}(x) = |qq_{ky}(x)|, \tag{23}$$

$$m_{kt}(x) = |qq_{kt}(x)|. \tag{24}$$

The use of products of quadrature filter responses for estimation of phase-differences has earlier been used by Wilson and Knutsson [32] and Jepson and Jenkin [15]. Estimating phase-differences as the argument of the complex-valued products handles phase unwrapping automatically, since the complex exponential representation is by definition modulo $2\pi$. Consider a 1D example with the filter responses $a(x_1) = m e^{i\phi}$ and $a(x_2) = m e^{i(\phi - \Delta\phi)}$ of a sine wave at two spatial positions with phases $\phi$ and $\phi - \Delta\phi$. Estimating the phase-difference between the positions using the complex-valued product $a(x_1) a(x_2)^{*}$ gives $m e^{i\phi}\, m e^{-i(\phi - \Delta\phi)} = m^2 e^{i\Delta\phi}$. The phase-difference is independent of the absolute phase of the signal.

Spatial phase-differences $(\phi_x, \phi_y)^T$ projected on the orientation $\hat{n}_k$ of the used filter $\tilde{f}_k$, i.e. the local frequency, should furthermore always be positive. This is due to the fact that the phase in the designed direction of a quadrature filter always rotates in the same way. Negative local frequencies correspond to singular points. Such points are completely unreliable and should not be used. A singular point can occur when two frequencies interfere in the filter response; such points can be detected and avoided in the computation [7], see Eq. (28). Temporal phase-differences, on the other hand, can have different signs depending on the direction of movement. This is possible since, in the temporal direction, the phase of a quadrature filter designed in a spatial direction does not have a defined way of rotation. A large phase-difference in the temporal direction must thus be used carefully, since it might have the wrong sign. This is considered in Eq. (30).

As an additional feature of using these products for estimation of phase-differences, the magnitudes $m_{kx}$, $m_{ky}$ and $m_{kt}$ give a measure of how strong the signal is and how well it fits the filter. This is an important measure of the reliability of the phase estimates and is used in Eq. (27).

Due to the interlaced data, each of the temporally neighbouring quadrature filter responses in Eq. (18) is bi-linearly interpolated to be spatially aligned with the centre quadrature filter response when the products are computed. Bi-linear interpolation is also used for the neighbouring samples in the vertical direction in the computation of Eq. (17).
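The conjugate-product construction of Eqs. (16)–(24) can be illustrated with a short sketch (plain NumPy, with a synthetic complex response standing in for an actual quadrature filter output). Note that with the product ordering of Eq. (16), a phase advancing by $\Delta\phi$ per sample yields an argument of $-\Delta\phi$:

```python
import numpy as np

# Synthetic complex "quadrature response" whose phase advances by 0.3 rad per
# sample; the total phase grows far beyond 2*pi, so naive differencing of
# np.angle(q) would need explicit unwrapping.
x = np.arange(64)
q = np.exp(1j * 0.3 * x)

# Three-point conjugate products as in Eq. (16): the argument is the local
# phase-difference, modulo 2*pi by construction -- no unwrapping needed.
qq = q[:-2] * np.conj(q[1:-1]) + q[1:-1] * np.conj(q[2:])
phi = np.angle(qq)   # phase-difference estimate (-0.3 at every position here)
m = np.abs(qq)       # magnitude: reliability of the estimate
```

For a unit-amplitude response both terms agree and $|qq| = 2$; when the two one-sided differences disagree (e.g. near a singular point), the terms partially cancel and the magnitude, i.e. the reliability, drops.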


Estimation of phase-differences based on three points, as in Eqs. (16)–(18), gives measures of the phase-differences at the centre point. The phase-differences are determined from the magnitudes of the corresponding products: the product with the largest magnitude, i.e. the best fit, has the most influence on the final phase-difference. For temporal phase-differences based on this symmetric temporal support, the motion estimate is centred at time $t$. This estimate is best as long as the motion is consistent in the spatio-temporal region. If the motion is inconsistent, and our particular interest is using the estimated motion for motion compensated prediction as described in Section 4, it can be better to omit the future field-image and rely only on the motion between time $t-1$ and time $t$.

3.2. From phase constraints to velocity

To resolve the ambiguity of the phase constraint equation (Eq. (4)), we determine the velocity by minimizing a least-squares fit of several local phase constraints in a spatial neighbourhood, as given in Eq. (25). For this purpose we use phase-differences from a set of quadrature filter kernels $O$ and a spatial window $W(x)$, where $O$ and $W(x)$ are defined in Section 3.4. Based on phase-differences from coarse filters, i.e. filters with a low centre frequency, a larger motion can be estimated than with fine filters. Using coarse filters, however, means that motions of different objects are more likely to be mixed up in the analysis, resulting in a poor velocity estimate. For this reason we use orientation selective filters with a similar centre frequency to estimate a certain range of velocities. For improved readability we omit the spatial coordinate $(x)$ of many variables in this section. We use a constant motion model, $v$, in the minimization:

$$E_v = \sum_{k \in O} \sum_x W(x)\big(c_k^T(x)\,\bar{v}(x)\big)^2, \tag{25}$$

where $W(x)$ is a window that gives more influence to constraints at the centre of the spatial neighbourhood, $\bar{v} = (v^T, 1)^T = (v_x, v_y, 1)^T$, and $c_k$ are the weighted phase-differences from quadrature filter response $\tilde{q}_k$:

$$c_k(x) = \sqrt{c_{km} c_{k\phi} c_{kv} c_{kt}} \begin{pmatrix} \phi_{kx} \\ \phi_{ky} \\ \phi_{kt} \end{pmatrix} = \sqrt{c_k} \begin{pmatrix} \phi_{kx} \\ \phi_{ky} \\ \phi_{kt} \end{pmatrix}, \tag{26}$$

with the weights, or reliabilities, of the phase-differences defined as

$$c_{km} = \sqrt{m_{kx} m_{ky} m_{kt}}, \tag{27}$$

$$c_{k\phi} = 1 \text{ when } \hat{n}_k^T \begin{pmatrix} \phi_{kx} \\ \phi_{ky} \end{pmatrix} > 0, \text{ else } 0, \tag{28}$$

$$c_{kv} = 1 \text{ when } \|v_{kn}\| < v_{k\max}, \text{ else } 0, \tag{29}$$

$$c_{kt} = \cos^2\!\left(\frac{\phi_{kt}}{2}\right), \tag{30}$$

where $c_{km}$ favours filters according to their magnitude, due to the local structure and the temporal similarity. $c_{k\phi}$ removes points with negative spatial phase-differences, i.e. negative local frequencies, as they are completely unreliable [8,31]. $c_{kv}$ avoids the use of phase estimates whose normal velocity estimate is larger than the maximum velocity $v_{k\max}$ that the filter $\tilde{f}_k$ can handle, with the normal velocity defined as

$$v_{kn} = -\frac{\phi_{kt}}{\phi_{kx}^2 + \phi_{ky}^2} \begin{pmatrix} \phi_{kx} \\ \phi_{ky} \end{pmatrix}. \tag{31}$$

$c_{kt}$ limits problems with the phase wrapping around, i.e. changing sign, for temporal phase-differences, by promoting small phase-differences over larger ones [32]. Minimization of Eq. (25) with respect to $v$ gives

$$v(x) = -\begin{pmatrix} \sum_{k,x} W(x) c_k \phi_{kx} \phi_{kx} & \sum_{k,x} W(x) c_k \phi_{kx} \phi_{ky} \\ \sum_{k,x} W(x) c_k \phi_{ky} \phi_{kx} & \sum_{k,x} W(x) c_k \phi_{ky} \phi_{ky} \end{pmatrix}^{-1} \begin{pmatrix} \sum_{k,x} W(x) c_k \phi_{kx} \phi_{kt} \\ \sum_{k,x} W(x) c_k \phi_{ky} \phi_{kt} \end{pmatrix}. \tag{32}$$
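The least-squares solution of Eq. (32) is a 2 × 2 linear solve per neighbourhood. A minimal single-neighbourhood sketch with synthetic phase constraints and all weights set to 1 (so it omits the full weighting of Eqs. (26)–(30)):

```python
import numpy as np

# Two phase constraints phi_x*vx + phi_y*vy + phi_t = 0 (Eq. (4)) generated
# from a known velocity; two non-parallel spatial gradients resolve the
# aperture ambiguity that a single constraint leaves open.
v_true = np.array([1.5, -0.5])
grads = np.array([[1.0, 0.0],     # spatial phase-differences, filter 1
                  [0.7, 0.7]])    # spatial phase-differences, filter 2
phi_t = -grads @ v_true           # temporal differences implied by Eq. (4)

# Normal equations of Eq. (25) with unit weights: A v = -b, cf. Eq. (32).
c = np.ones(len(grads))           # reliabilities c_k (all 1 here)
A = (grads.T * c) @ grads         # sum_k c_k * grad_k grad_k^T
b = grads.T @ (c * phi_t)         # sum_k c_k * grad_k * phi_t,k
v_est = -np.linalg.solve(A, b)
```

In the full algorithm the weights of Eqs. (26)–(30) and the spatial window $W(x)$ enter as per-constraint multipliers in both the matrix and the right-hand side.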


The similarity between the concept of orientation tensors [17] and the described algorithm is shown in Appendix A.

3.3. Post-filtering of motion estimates

After the velocity $v(x)$ has been estimated using Eq. (32), uncertainties can still remain in the velocity estimates. Such uncertainties can, for example, occur when the local structure is close to 1D. The uncertainties can be described as errors in the magnitude and direction of the velocity, and can be represented with the covariance matrix

$$C(x) = \sum_x \sum_y \big(v(x) - m(x)\big)\big(v(x) - m(x)\big)^T + \epsilon I, \tag{33}$$

where $m(x)$ is the local mean of $v(x)$, determined with a Gaussian lowpass filter, and $\epsilon I$ is an isotropic part that limits the range of the inverse of the uncertainty, i.e. the certainty. At a point where the velocity is larger than what can be estimated with the filters used, the diagonal of the covariance matrix is forced to be huge. The inverse of the uncertainty matrix reflects the certainty of a motion estimate. In a local neighbourhood we then weight the velocity estimates according to this uncertainty using normalized convolution:

$$\tilde{v}(x) = \big((g * C^{-1})(x)\big)^{-1}\big((g * (C^{-1} v))(x)\big), \tag{34}$$

where $g$ is a separable Gaussian filter of size 5 × 5 with $\sigma = 1.2$. These filter parameters are also used in the computation of the local mean and for the weighted summation in Eq. (33). With this post-filtering method, estimates are used in accordance with their spatial consistency: a bad estimate can be repaired by neighbouring, more consistent estimates.

3.4. Multiple hierarchical motion estimation

Hierarchical motion estimation is a commonly used method to handle large velocities and to reduce noise sensitivity. It starts with motion estimation based on low spatial frequencies to compute a coarse motion estimate. The motion estimates are then refined using higher spatial frequencies. The smooth motion field of hierarchical motion estimation leaks motion to the background near moving objects. Motion compensation of an object according to such a motion field moves the background along with the object, which gives a perceptually annoying distortion. One method to improve hierarchical motion estimation is to allow non-zero motion vectors only when the frame difference $I(t) - I(t-1)$ is above a certain threshold [6]. Another method is to propagate several motion vectors from each hierarchical computation level and thereby increase the probability of estimating the true motion [21]. A recent approach for hierarchical block-matching selects the initial motion of a block at a finer resolution as the motion with the least motion compensated prediction error among the motions of neighbouring blocks at the previous resolution [30].

Our method increases the spatial resolution of the estimated motion field by using several parallel local hierarchical motion estimators starting at different scales [2]. The multiple hierarchical motion estimator (MHME) consists of one global and three local hierarchical motion estimators (HME), see Fig. 5. To handle camera panning, i.e. a global shift of the whole image, the local motion estimators are initially compensated according to a global hierarchical motion estimate. Areas not following the global motion are corrected by the local motion estimators.

The basic elements of the estimator are three motion estimation units. Each motion estimation unit determines the velocity by minimization of Eq. (25), for a constant motion model over a spatial region defined by $W(x)$, using a set $O$ of phase-differences based on the responses of the quadrature filters $\tilde{f}_k$. The motion estimation units are based on coarse filters $k \in O = \{6, 7, 8, 9\}$, medium filters $k \in O = \{4, 5, 6, 7\}$ and fine filters $k \in O = \{1, 2, 3\}$. $W(x)$ is a Gaussian lowpass filter with standard deviation $\sigma = 1.2$, 2.4 and 3.6, respectively, described in order from the high-frequency to the low-frequency motion estimation unit. Gaussians are used due to their separability, which allows for efficient filtering. The motion estimation units used for global motion estimation use integration over the whole image.
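The certainty-weighted averaging used in the post-filtering (Eq. (34)), and again in Eq. (38) below, is an instance of normalized convolution. A scalar 1D sketch (the paper applies a matrix-valued version with $C^{-1}$ as certainty):

```python
import numpy as np

def normalized_convolution(signal, certainty, kernel):
    # (g * (c * s)) / (g * c): samples with zero certainty do not drag the
    # average; instead they are "repaired" from reliable neighbours.
    num = np.convolve(certainty * signal, kernel, mode="same")
    den = np.convolve(certainty, kernel, mode="same")
    return num / np.maximum(den, 1e-12)

v = np.full(11, 3.0)            # a locally consistent velocity component...
v[5] = 100.0                    # ...with one outlier estimate
c = np.ones(11)
c[5] = 0.0                      # the outlier is flagged with zero certainty
g = np.array([0.25, 0.5, 0.25])
v_filtered = normalized_convolution(v, c, g)
```

An ordinary convolution of v with g would smear the outlier into its neighbours; weighting by certainty removes it entirely.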


The hierarchical motion estimation implies that the temporal phase differences in Eq. (18) must be re-computed. This is achieved by motion compensation using the estimated motion from the previous computation unit as

Δq_kt(x) = q̃_k(x + ṽ(x), t − 1) q̃_k(x, t)* + q̃_k(x, t) q̃_k(x − ṽ(x), t + 1)*. (35)

Bi-linear interpolation is used to achieve sub-pixel motion compensation. The spatial phase-differences in Eqs. (16) and (17) need not be re-computed. It should be noted that using motion compensation of a motion estimation unit we will get a delta velocity from Eq. (25). Before post-filtering is applied, as in Section 3.3, this delta velocity is accumulated to the previously estimated velocity ṽ(x). The hierarchical motion estimator which uses all scales can estimate velocities up to about 10 pixels per field-image. The algorithm produces four motion estimates ṽ_j with sub-pixel accuracy in each pixel, one global and three local estimates. Each of these motions is then used for motion compensated prediction:

Î_t^j(x) = Ĩ_{t−1}(x − ṽ_j(x)), where j = 1, …, 4, (36)

where Ĩ_{t−1} is a previously decoded field-image and Î_t^j is the predicted image. The motion estimate ṽ_j at each pixel which gives the least local motion compensated prediction error is selected as the output from the MHME:

v(x) = arg min_j ((g * |Î_t^j − Ĩ_t|)(x)), (37)

where g is a Gaussian filter of size 5 × 5 with σ = 1.2, and Ĩ_t is a previously decoded field-image. To reduce noise generated by the selection in Eq. (37), the estimates are further filtered with normalized convolution [19]:

v'(x) = (g * c v)(x) / (g * c)(x), (38)

where g is a lowpass filter of size 3 × 3 and c is a measure of spatial consistency [11]:

c(x) = ||(g * v)(x)|| / (g * ||v||)(x). (39)

Fig. 5. Multiple hierarchical motion estimation (MHME). MC refers to a motion compensated motion estimation unit.

Note that when MHME is used for backward prediction as described in Section 4, the motion estimation is performed on previously decoded field-images. The selected motion is then temporally predicted and used for motion compensated prediction of the field-image to be coded, see Section 4.
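The selection and smoothing of Eqs. (37)–(39) can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: a small box filter stands in for the Gaussian and lowpass filters g, and the array layout (candidate axis first) is our own.

```python
import numpy as np

def lp(a, k=3):
    """k x k box filter with edge replication: a stand-in for the
    small lowpass filters g used in Eqs. (37)-(39)."""
    p = k // 2
    ap = np.pad(a, p, mode='edge')
    out = np.zeros(a.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += ap[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def select_and_smooth(predictions, flows, target):
    """Per pixel, pick the candidate flow whose locally filtered absolute
    prediction error is least (Eq. (37)), then smooth the winner with
    consistency-weighted normalized convolution (Eqs. (38)-(39))."""
    H, W = target.shape
    errs = np.stack([lp(np.abs(p - target), k=5) for p in predictions])
    best = np.argmin(errs, axis=0)                       # winner index map
    idx = np.broadcast_to(best[None, :, :, None], (1, H, W, 2))
    v = np.take_along_axis(flows, idx, axis=0)[0]        # selected flow (H, W, 2)
    # Eq. (39): spatial consistency c = ||g * v|| / (g * ||v||).
    gv = np.stack([lp(v[..., i]) for i in range(2)], axis=-1)
    c = np.linalg.norm(gv, axis=-1) / (lp(np.linalg.norm(v, axis=-1)) + 1e-12)
    # Eq. (38): normalized convolution of v with certainty c.
    num = np.stack([lp(c * v[..., i]) for i in range(2)], axis=-1)
    return num / (lp(c)[..., None] + 1e-12)
```

For a spatially constant winning flow the consistency c is one and the normalized convolution returns the flow unchanged; the smoothing only acts where the selection in Eq. (37) has introduced spatial switching noise.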

4. Backward–forward motion compensated prediction

We use backward motion compensated prediction based on the assumption that the motion is



Fig. 6. Illustration of backward prediction. The image at time t is predicted using a previously decoded image and the motion between two previously decoded images. A moving object causes irregular samples of the moving object at time t. In addition it generates covered and uncovered background. In the covered region there are multiple predictions, and in the uncovered region samples are missing. The illustration is for non-interlaced sampling.
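The scatter-style prediction of Eq. (40), with covered regions receiving multiple predictions and uncovered regions left as holes, can be sketched as below. Nearest-grid rounding and simple averaging of collisions are crude stand-ins for the paper's continuous normalized convolution; the function name and layout are our own.

```python
import numpy as np

def forward_warp(prev_img, flow, dt=1):
    """Scatter prev_img along dt*flow onto the output grid (Eq. (40) style).
    Colliding predictions (covered regions) are averaged; grid positions
    receiving no prediction (uncovered regions) are flagged as holes so
    they can later be filled from the neighbourhood."""
    H, W = prev_img.shape
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    ys, xs = np.mgrid[0:H, 0:W]
    ty = np.rint(ys + dt * flow[..., 1]).astype(int)
    tx = np.rint(xs + dt * flow[..., 0]).astype(int)
    ok = (ty >= 0) & (ty < H) & (tx >= 0) & (tx < W)
    np.add.at(acc, (ty[ok], tx[ok]), prev_img[ok])   # many-to-one: covered
    np.add.at(cnt, (ty[ok], tx[ok]), 1.0)
    holes = cnt == 0                                  # no prediction: uncovered
    out = np.where(holes, 0.0, acc / np.maximum(cnt, 1.0))
    return out, holes
```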

predictable in time, as in previous work [26,24]. For this case it is important to achieve 'true' motion estimates, as the motion, estimated from previously decoded field-images, is used for motion compensated prediction of the field-image to be coded. This is unlike block-based motion estimation, used in standard forward motion compensation, where the displacement with the least motion compensated prediction error is selected for each block. The prediction of the current field-image I_t using a previously estimated motion v'_{t−Δt} can be solved in two different ways:

* Direct prediction of I_t using v'_{t−Δt} and a previously reconstructed image Ĩ_{t−Δt}:

Î_t(x + Δt v'_{t−Δt}(x)) = Ĩ_{t−Δt}(x), (40)

where Δt denotes which previous field-image is used, see Section 4.1.

* Prediction in two steps: first prediction of v_t using v'_{t−Δt},

v̂_t(x + Δt v'_{t−Δt}(x)) = v'_{t−Δt}(x), (41)

and then prediction of I_t using v̂_t and Ĩ_{t−Δt}:

Î_t(x) = Ĩ_{t−Δt}(x − Δt v̂_t(x)). (42)

Some potential problems with prediction in these cases are:

(1) Irregular input sampling. The positions of the predicted signals (x + Δt v'_{t−Δt}(x)) and the sampling pattern of the desired output are not aligned, see Fig. 6. There can also be 'holes' at the desired output positions, because no prediction ends up there. An example is uncovered regions due to object motion.
(2) Covered regions. Many predictions can end up in the same output neighbourhood, i.e. occlusion. A simple example is a moving object on a stationary background.
(3) Non-constant motion. The motion of an object at time t − Δt and t is not the same.
(4) Varying image characteristics. Examples are changing lighting conditions, 3D object rotation and quantization noise.

Continuous normalized convolution (CNC) [3], an extension of normalized convolution (NC) [19], is used here for motion compensation to deal with irregular sampling and the case of covered and uncovered regions. NC can be viewed as locally solving a weighted least squares (WLS) problem to represent a signal using a local signal model. The weights are defined by signal certainties and a spatially localizing function. When estimating regularly sampled output values, here located at the field-image to be coded, from an irregularly sampled input signal, each output sample has to be estimated from a neighbourhood unique for that sample. CNC provides a computationally efficient way to compute such values from a set of fixed regularly sampled filters using a linear model (here a first order spline) and offsets between samples from the regularly and irregularly sampled signals, see Eq. (44). This alleviates the need to work with analytic filter functions that are recalculated for each neighbourhood, which greatly improves processing speed. When estimating output samples it is reasonable that signal samples spatially closer to the output sample should affect the filter output more than samples further away. This spatial localization is here achieved with a function with a centre region whose values decline linearly and an outer region whose values decline fast with the distance from the output sample. The approach assumes that neighbouring spatial regions of the predictions in Eqs. (40) and (41) have similar signal properties as the signals in the desired output positions. In this paper we use Eq. (40) for the prediction. Using a constant signal model, the computations by CNC can be formulated for an output sample as

Î = Σ_j ã(x_j) c_j Î_j / Σ_j ã(x_j) c_j, (43)

where x_j is the sample position of the jth input sample, ã(x_j) is a continuous approximation of the spatial localization function, Î_j is an irregular sample due to Eq. (40) and c_j is the corresponding certainty. A 1D example of how the continuous filter approximation works is given below:

ã(x) = ā + Δx a_x, (44)

where

X = D(⌊x/D⌋ + 0.5), when N is even; X = D round(x/D), when N is odd, (45)

Δx = X − x, (46)

ā = (a(X − D/2) + a(X + D/2)) / 2, (47)

a_x = (a(X − D/2) − a(X + D/2)) / D, (48)

where x ∈ R is the input sample position, N denotes the number of filter coefficients, ⌊x⌋ denotes truncation of x to the nearest integer towards minus infinity, round(x) rounds x to the nearest integer, and D is the spacing between the filter coefficients.

Using CNC as in this paper, output positions with 'holes' are spatially predicted using the neighbourhood. The values of output positions in a region of densely distributed predicted signal values are linearly interpolated from the closest signal samples. To completely solve the problem of covered regions, foreground objects that also end up visible at the next time sample should have certainty one, while objects that end up in the background should be given certainty zero. Currently we use certainty one for all predicted image values.

4.1. Selection of motion and field-image for prediction

Motion estimation based on three decoded field-images, Ĩ_{t−3}, Ĩ_{t−2} and Ĩ_{t−1}, as in Eq. (18), gives an estimate for the centre image Ĩ_{t−2}. If the motion is constant in time, it can be accurately predicted in time. To keep the spatial accuracy in the vertical direction (of the interlaced image sequence) when the motion is constant in time, we use Δt = 2 in Eqs. (40), (41) and (42) for the motion compensated prediction of I_t. If the motion is non-constant in time, a better estimate is achieved if only the two previously decoded field-images, Ĩ_{t−2} and Ĩ_{t−1}, are used in the motion estimation. For this case the current field-image is predicted using Ĩ_{t−1} and v_{t−1}. The approach with the least motion compensated prediction error is selected.

4.2. Motion compensated prediction refinement

The initial motion compensated prediction moves the image characteristics of a previous field-image using a dense motion field. To deal with the problems of a non-constant motion field and varying image characteristics, the initially predicted field-image is refined using forward motion compensation. This new backward–forward motion compensation scheme (BFMC) is shown in Fig. 7. To achieve a low rate overhead for the forward motion compensation, it is based on sub-pixel block-matching with a small search window. A small search window, typically less than ±1 × ±1 pixel, also reduces problems with block artifacts as

those in standard forward motion compensation schemes. The forward motion compensation refinement is only used when the total reconstruction error is lower than when using the backward motion compensation alone. One bit for each field-image is transmitted to the decoder to determine the use of forward refinement. The result of using refinement is shown in Fig. 8. It can be seen that errors around edges of moving objects are much less visible after refinement.

Fig. 7. Backward–forward motion compensation scheme (BFMC). (Diagram blocks: previously decoded field-images, reference field-image, motion estimation, backward estimated motion, motion compensation, backward predicted field-image, motion compensated refinement, refinement motion field, backward–forward predicted field-image, motion field coding, residual coding.)

5. Experimental results

The BFMC is compared against a block-based forward motion compensation scheme (BMA), both using the SPIHT coder [29] for encoding the intra field as well as the motion compensated residuals. We also make a comparison with version 1.2 of the MPEG-2 coder from MPEG [25]. All coding is performed on luminance images only. The first field-image is intra coded (I picture); all other field-images are inter coded (P pictures). The motion compensation is performed on field-images of interlaced video. The coding performance is evaluated using the peak-signal-to-noise-ratio (PSNR):

PSNR = 10 log10 (MAX² / RMS²), (49)

where MAX is the maximum signal value and RMS is the root-mean-squared error.
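Eq. (49) in code form, as an illustrative helper (MAX = 255 for 8-bit luminance; the function name is ours):

```python
import numpy as np

def psnr(ref, rec, max_val=255.0):
    """Peak signal-to-noise ratio, Eq. (49): 10 log10(MAX^2 / RMS^2)."""
    err = np.asarray(ref, dtype=float) - np.asarray(rec, dtype=float)
    rms2 = np.mean(err ** 2)  # squared root-mean-squared error
    return 10.0 * np.log10(max_val ** 2 / rms2)
```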

Fig. 8. Comparison of error using backward motion compensation and backward–forward motion compensation (BFMC) for a field-image from the Rugby sequence. Top left: Original image at t − 1. Top right: Original image at t. Bottom left: Backward motion compensated prediction error (PSNR 24 dB). Bottom right: BFMC prediction error (PSNR 26 dB). Note: The field-images have been bi-linearly interpolated to full resolution.


The BFMC estimates the motion on previously decoded field-images and uses the motion for motion compensated prediction of the field-image to be coded. The predicted field-image is then refined with forward motion compensation using block-matching. The block size (y × x) is 8 × 16 pixels and the search region is ±0.75 pixel. The search is based on mean-squared error (MSE) and uses bi-linear interpolation to achieve quarter-pel accuracy. The BMA uses a block-based motion estimator, as it is the standard approach and is used in the MPEG standards. The block size is 8 × 16 pixels. We use an integer search within (±8, ±16) displacement, followed by half-pel refinement within ±1 half-pel of the best integer match, and finally quarter-pel refinement within ±1 quarter-pel of the best half-pel match, based on MSE and using bi-linear interpolation. The motion estimation is performed both using the previous decoded field-image and the field-image to be coded, and using the second previous decoded field-image and the field-image to be coded. The displacement with the least MSE is selected. As for the BFMC, this selection must be transmitted to the decoder using one bit per field-image. The coding rate of the velocity field of both the BFMC and the BMA is estimated using Huffman coding of each field. The residual coding rate is reduced by the rate for the velocity field coding. To give the BFMC and the BMA the same starting point, the first three field-images are coded the same: the first one is intra field coded, followed by two BMA predicted fields. The same intra and inter coding rate is used for the BFMC and the BMA.
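The sub-pixel refinement search described above can be sketched as an exhaustive MSE search over quarter-pel offsets within ±0.75 pixel, with bi-linear interpolation at fractional positions. The helper names and the scanning order are our own; this is a sketch of the technique, not the paper's implementation.

```python
import numpy as np

def bilinear(img, y, x):
    """Sample img at real-valued (y, x) using bi-linear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, img.shape[0] - 1)
    x1 = min(x0 + 1, img.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x1]
            + dy * (1 - dx) * img[y1, x0] + dy * dx * img[y1, x1])

def refine_block(pred, ref, top, left, h=8, w=16, radius=0.75, step=0.25):
    """Exhaustive MSE search for one h x w block of the field to code
    (ref) against the backward-predicted image (pred), over sub-pixel
    offsets within +/-radius pixel in quarter-pel steps."""
    offsets = np.arange(-radius, radius + step / 2, step)
    best, best_mse = (0.0, 0.0), np.inf
    for oy in offsets:
        for ox in offsets:
            mse = 0.0
            for r in range(h):
                for c in range(w):
                    y, x = top + r + oy, left + c + ox
                    if not (0 <= y <= pred.shape[0] - 1
                            and 0 <= x <= pred.shape[1] - 1):
                        mse = np.inf
                        break
                    d = ref[top + r, left + c] - bilinear(pred, y, x)
                    mse += d * d
                if not np.isfinite(mse):
                    break
            if mse < best_mse:
                best_mse, best = mse, (float(oy), float(ox))
    return best
```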

The MPEG-2 coder is used in field-image coding mode. No bi-directional prediction (B picture) is used. The MPEG-2 coder first makes an integer search within (±8, ±16) displacements, followed by a half-pixel search within ±1 half-pixel displacement of the best integer match. The quantized DCT coefficients are run-length coded using an alternate scan. We use an intra DC precision of 9 bits for intra coded macroblocks. The frame rate is set to 25 Hz. The group of pictures (GOP) is set equal to the total number of field-images in the respective experiment, i.e. only one intra coded field-image is used. The MPEG-2 coder is designed according to test model 5 (TM5) [14]. The motion estimation in this MPEG-2 coder has no hierarchical estimation mode.

5.1. The effect of coding on motion estimation and motion compensation accuracy

Since we are estimating the motion on previously decoded images, it is important that the motion estimation accuracy is relatively independent of coding errors. This is especially true for regions of high contrast, as motion estimation errors in such regions will cause potentially large motion compensation errors.

A commonly used test sequence in the evaluation of optical flow algorithms is the Yosemite sequence. The sequence was generated using a digital terrain map and includes velocities up to about 4 pixels/frame. The progressive Yosemite sequence has been lowpass filtered and downsampled vertically to produce an interlaced sequence, see Fig. 9. The performance on this sequence is often measured in

Fig. 9. Field-image corresponding to frame 9 of the Yosemite sequence (size 126 × 316). Left: Image. Right: Downsampled velocity field. Note: The field-image has been bi-linearly interpolated to full resolution.

angular error [5]:

Δψ = arccos(v̂_est^T v̂_true), (50)

where v̂ = 1/√(v_x² + v_y² + 1) (v_x, v_y, 1)^T is the velocity represented as a 3D vector, for the 2D velocity v = (v_x, v_y)^T.

First we compare the backward motion compensated prediction based on estimates from our motion estimator with motion compensated prediction based on the reference motion field, see Fig. 10. For rates larger than 1.1 bpp our motion estimator gives a motion compensated prediction with higher average PSNR than that obtained by using the reference motion field. This may seem surprising at first; however, the best motion field for prediction is not necessarily the reference motion field. This can be the case since the reference motion field is the true velocity at the instant corresponding to the previous field-image, i.e. the field-image used for prediction. The best motion field for prediction should be at some time instant in between the predicted field-image and the field-image used for prediction. Motion estimates from decoded images at coding rates between 0.1 and 1.1 bpp give a prediction with lower average PSNR than using the reference motion field, but the prediction remains quite stable with decreasing coding rate.

Fig. 10. Motion compensation accuracy versus coding rate, for the field-image corresponding to frame 10 of the Yosemite sequence. Based on prediction of the estimated motion of frame 9 (solid). Based on prediction of the reference motion of frame 9 (dash–dot).

The effect of coding and of limiting the accuracy of the motion estimates on the motion compensated prediction error is shown in Fig. 11. For this sequence, and for rates at or above 0.5 bpp, motion estimates can be quantized to eighth-pel accuracy without any degradation in the PSNR of the motion compensated prediction. For coding rates below 0.25 bpp it is sufficient to have half-pel accuracy of the motion estimates.

Fig. 11. Motion compensation accuracy versus coding rate for varying accuracy of the motion estimates, for the field corresponding to frame 10 of the Yosemite sequence. From bottom to top: integer accuracy, half-pel accuracy, quarter-pel accuracy, eighth-pel accuracy (dotted) and finally full accuracy (solid).

The effect of coding on the accuracy of the motion estimates is shown in Fig. 12. The result for our motion estimator for the field-image corresponding to frame 9, estimated on two reference field-images, with 100% density, is 10.7° angular error (RMS 0.72). The best result reported at 100% density, in Barron et al. [5], is about 10° angular error. Without the sky, which is often removed in evaluations, we get 6.3° angular error (RMS 0.5). Common to all approaches which perform well on the Yosemite sequence is that they use relatively large spatio-temporal support. Our motion estimator is designed with more complex motion fields in mind and uses only two


Fig. 12. Motion estimation accuracy in angular error versus coding rate, for the field corresponding to frame 9 of the Yosemite sequence. With 100% density (solid). Excluding the sky (dash–dot).


Fig. 13. Rate distortion performance on the Yosemite sequence. Reconstruction using BFMC (solid), BMA (dotted), MPEG-2 (dash–dot), and intra coding (dashed). The average PSNRs are 35.1, 34.9, 34.1 and 31.2 dB.

or three fields of interlaced video in the motion estimation.
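The angular error of Eq. (50), used for the evaluation above, can be computed as follows (an illustrative helper; the function name and the vectorization over the trailing axis are ours):

```python
import numpy as np

def angular_error_deg(v_est, v_true):
    """Barron-style angular error, Eq. (50): the angle between the 2D
    velocities embedded as 3D direction vectors (vx, vy, 1)."""
    def embed(v):
        v = np.asarray(v, dtype=float)
        ones = np.ones(v.shape[:-1] + (1,))
        w = np.concatenate([v, ones], axis=-1)
        return w / np.linalg.norm(w, axis=-1, keepdims=True)
    dot = np.clip(np.sum(embed(v_est) * embed(v_true), axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(dot))
```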

5.2. Coding performance

The coding performance is evaluated on our interlaced version of the Yosemite sequence, as in Section 5.1, and on two interlaced video sequences, the Rugby sequence and the Mobile Calendar sequence, at 25 frames per second (PAL). The Rugby sequence contains both fast motions of objects and global motion. The Mobile Calendar sequence contains relatively slow motions but includes highly detailed moving areas. In Fig. 13 the rate distortion performance on the Yosemite sequence is shown. The BFMC has up to 1.6 dB better average PSNR than MPEG-2 and up to 0.4 dB better average PSNR than the BMA. The results on the central part of the Rugby sequence in terms of rate distortion are shown in Fig. 14, and specifically for rate 0.75 bpp in Fig. 15. To focus the evaluation on the moving objects we use the central 288 × 360 part of the Rugby sequence. It is shown that the BFMC has 0.9–1.1 dB better average PSNR than the MPEG-2 coder for rates between 0.25 and 1.5 bpp. The BMA is also better than the MPEG-2 coder, due to the use of quarter-pel accuracy and partly due to


Fig. 14. Rate distortion performance on the central 288 × 360 part of the first 20 frames of the Rugby sequence. Reconstruction using BFMC (solid), BMA (dotted), MPEG-2 (dash–dot) and intra coding (dashed). The average PSNRs are 32.9, 32.6, 32.0 and 30.8 dB.

the use of the SPIHT coder. For the rate 0.125 bpp the MPEG-2 coder is better than both the BFMC and the BMA. Note that the MPEG-2 coder is a complete coder with features such as intra coding on selected blocks. The BFMC has up to 0.5 dB better average PSNR than the BMA. In Fig. 15 it is shown that the use of forward refinement, for


Fig. 15. Coding at 0.75 bpp on the central 288 × 360 part of the first 20 frames of the Rugby sequence. Top: Reconstruction using BFMC (solid), BMA (dotted) and MPEG-2 (dash–dot). The average PSNRs are 34.1, 33.7 and 33.2 dB. Bottom: Motion compensated prediction for BMA (dotted), the proposed coder without forward refinement (dashed) and with forward refinement (BFMC) at 0.035 bpp (solid). The average PSNRs are 26.0, 24.3 and 25.6 dB. The BMA uses 0.066 bpp for coding of the velocity field.

this sequence with a highly non-constant motion field, improves the average PSNR by 1.3 dB compared to using only the backward motion compensation. The BFMC gives a motion compensated prediction with slightly lower PSNR than that of the BMA. However, for velocity field coding, the BFMC uses only 0.035 bpp compared to 0.066 bpp for the BMA. For this sequence, motion estimation based on two field-images, with motion compensation based on the previously decoded field-image, is often selected. A comparison of coded subjective quality on the Rugby sequence is shown in Fig. 18. The BFMC is free from blocky artifacts, which leads to a more pleasant subjective performance around edges of moving objects. However, relatively flat regions should be coded more carefully than regions with high contrast, since flat regions are subjectively more sensitive to coding errors. The MPEG-2 coder considers this by scaling the quantization step size with a measure of spatial activity for each 8 × 8 block, and thus gets a better subjective performance in relatively flat regions. The results on the Mobile Calendar sequence in terms of rate distortion are shown in Fig. 16 and


Fig. 16. Rate distortion performance on the first 20 frames of the Mobile Calendar sequence. Reconstruction using BFMC (solid), BMA (dotted), MPEG-2 (dash–dot) and intra coding (dashed). The average PSNRs are 30.8, 30.7, 30.2 and 26.1 dB.


Fig. 17. Coding at 0.75 bpp of the first 20 frames of the Mobile Calendar sequence. Top: Reconstruction using BFMC (solid), BMA (dotted) and MPEG-2 (dash–dot). The average PSNRs are 32.0, 31.8 and 31.1 dB. Bottom: Motion compensated prediction error for BMA (dotted), the proposed coder without forward refinement (dashed) and with forward refinement (BFMC) (solid). The average PSNRs are 26.5, 26.3 and 26.3 dB.

specifically for rate 0.75 bpp in Fig. 17. The BFMC has 0.4–1.7 dB better PSNR than the MPEG-2 coder for rates at or above 0.5 bpp. For rates at and below 0.25 bpp the MPEG-2 coder is better than both the BFMC and the BMA. The BFMC has up to 0.25 dB better PSNR than the BMA.


Fig. 18. Comparison of coded subjective quality for coding at 0.75 bpp on the central 288 × 360 part of the first 20 frames of the Rugby sequence. A part of field-image 34 is shown. Top left: Reference. Top right: BFMC (PSNR 32.9 dB). Bottom left: BMA (PSNR 32.0 dB). Bottom right: MPEG-2 (PSNR 30.4 dB). Note: The field-images have been bi-linearly interpolated to full resolution.

The coding rate for the velocity field is about 0.006 bpp for the BFMC and about 0.06 bpp for the BMA. Forward refinement is rarely used since the motion field is relatively constant. For this sequence, motion estimation based on three field-images, with motion compensation based on the second previously decoded field-image, is often selected.

6. Conclusions

We have presented new motion estimation and motion compensation techniques intended for video coding. The motion estimation is based on phase and uses two or three decoded fields of interlaced images. The approach uses backward

motion compensation to achieve pixel-based motion compensated prediction. This initial prediction is then refined using forward motion compensation to compensate for non-constant motion and to consider the specific characteristics of the image to be coded. Using block-matching with a small search area, this can be achieved at a lower coding cost than for standard forward motion compensating coders. Experimental results show that this new approach to motion compensation has up to 1.6 dB higher PSNR than MPEG-2 and up to 0.5 dB better PSNR than block-based motion estimation and compensation using the same residual coder as in our approach, for coding of interlaced video. Perhaps most important, the subjective performance of our approach is also better.


Appendix A. Relation to orientation tensors

In the following we show the similarity between the used algorithm and the concept of orientation tensors [17]. A local structure in spatio-temporal orientation n̂ has orientation tensor T = A n̂ n̂^T, where A is a scalar greater than zero representing the local signal amplitude. The motion can be extracted by an eigenvalue analysis of a 3D orientation tensor: the motion for a point is in the direction of the eigenvector with the smallest eigenvalue. Eq. (25), omitting W(x), can be rewritten to be expressed with a 3D orientation tensor T:

E_v = Σ_{k∈O} (c_k^T v̄)² = v̄^T (Σ_{k∈O} c_k c_k^T) v̄ = v̄^T T v̄. (A.1)

Using Eq. (26), the components of the 3D orientation tensor can be written as

T(x) = [t1, t4, t5; t4, t2, t6; t5, t6, t3]
     = [Σ c_k f_kx f_kx, Σ c_k f_kx f_ky, Σ c_k f_kx f_kt;
        Σ c_k f_ky f_kx, Σ c_k f_ky f_ky, Σ c_k f_ky f_kt;
        Σ c_k f_kx f_kt, Σ c_k f_ky f_kt, Σ c_k f_kt f_kt]. (A.2)

Minimization of Eq. (A.1) with respect to v gives

v(x) = −[Σ c_k f_kx f_kx, Σ c_k f_kx f_ky; Σ c_k f_ky f_kx, Σ c_k f_ky f_ky]⁻¹ [Σ c_k f_kx f_kt; Σ c_k f_ky f_kt] = −T_2D⁻¹ [t5; t6], (A.3)

where T_2D (2 × 2) is the 2D orientation tensor describing the spatial structure, found as the upper left part of T. From this it is evident that the 3D orientation tensor can be determined from the 2D orientation tensor as

T(x) = [T_2D, T_2D v; (T_2D v)^T, t3]. (A.4)

We can determine the velocity uniquely using Eq. (A.3) if the neighbourhood consists of a structure with multiple orientations. Lowpass filtering of the estimated 3D tensor makes it more likely to find the true motion, i.e. a solution to Eq. (A.3). This is equivalent to the window function W(x) used in Eq. (32). The lowpass filtering is made separately for each of the six components of the tensor.
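A sketch of the velocity solve in Eq. (A.3): the 2D tensor components are certainty-weighted sums of response products, and the velocity is the solution of the resulting 2 × 2 system. The flat response arrays and the function name are illustrative assumptions, not the paper's filter set.

```python
import numpy as np

def tensor_velocity(fx, fy, ft, c=None):
    """Solve Eq. (A.3): v = -T_2D^{-1} (t5, t6)^T, with tensor components
    built as certainty-weighted sums of filter-response products."""
    fx, fy, ft = (np.asarray(a, dtype=float).ravel() for a in (fx, fy, ft))
    c = np.ones_like(fx) if c is None else np.asarray(c, dtype=float).ravel()
    T2D = np.array([[np.sum(c * fx * fx), np.sum(c * fx * fy)],
                    [np.sum(c * fy * fx), np.sum(c * fy * fy)]])
    t56 = np.array([np.sum(c * fx * ft), np.sum(c * fy * ft)])
    # T2D must be invertible, i.e. the neighbourhood needs multiple orientations
    return -np.linalg.solve(T2D, t56)
```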

References

[1] K. Andersson, P. Johansson, R. Forchheimer, H. Knutsson, Backward–forward motion compensated prediction, in: Advanced Concepts for Intelligent Vision Systems (ACIVS'02), Ghent, Belgium, 2002, pp. 260–267.
[2] K. Andersson, H. Knutsson, Multiple hierarchical motion estimation, in: Proceedings of Signal Processing, Pattern Recognition, and Applications (SPPRA'02), Crete, Greece, 2002, pp. 80–85.
[3] K. Andersson, H. Knutsson, Continuous normalized convolution, in: Proceedings of the International Conference on Multimedia and Expo (ICME'02), Lausanne, Switzerland, 2002, pp. 725–728.
[4] M. Andersson, J. Wiklund, H. Knutsson, Filter networks, in: Proceedings of Signal and Image Processing (SIP'99), IASTED, Nassau, Bahamas, 1999; also as Technical Report LiTH-ISY-R-2245.
[5] J.L. Barron, D.J. Fleet, S.S. Beauchemin, Performance of optical flow techniques, Internat. J. Comput. Vision 12 (1) (1994) 43–77.
[6] M. Bierling, R. Thoma, Motion compensated field interpolation using a hierarchically structured motion estimator, Signal Processing 11 (1986) 387–404.
[7] D.J. Fleet, A.D. Jepson, Computation of component image velocity from local phase information, Internat. J. Comput. Vision 5 (1) (1990) 77–104.
[8] D.J. Fleet, A.D. Jepson, Stability of phase information, in: Proceedings of the IEEE Workshop on Visual Motion, IEEE Society Press, Princeton, USA, 1991, pp. 52–60.
[9] D.J. Fleet, A.D. Jepson, Stability of phase information, IEEE Trans. Pattern Anal. Machine Intell. 15 (12) (1993) 1253–1268.
[10] R. Forchheimer, T. Ericson, Differential transform coding: a new image compression scheme, in: Proceedings of the International Conference on Digital Signal Processing, Florence, Italy, 1981.
[11] G.H. Granlund, H. Knutsson, Signal Processing for Computer Vision, Kluwer Academic Publishers, Dordrecht, 1995, ISBN 0-7923-9530-1.
[12] M. Hemmendorff, Motion estimation and compensation in medical imaging, Ph.D. Thesis, Linköping University, SE-581 85 Linköping, Sweden, Dissertation No. 703, ISBN 91-7373-060-2, 2001.
[13] B.K.P. Horn, B.G. Schunk, Determining optical flow, Artificial Intell. 17 (1981) 185–204.
[14] ISO/IEC JTC1/SC29/WG11, Test model 5 (TM5), 1993.


[15] A.D. Jepson, M. Jenkin, The fast computation of disparity from phase differences, in: Proceedings CVPR, San Diego, CA, USA, 1989, pp. 386–398.
[16] H. Knutsson, Filtering and reconstruction in image processing, Ph.D. Thesis, Linköping University, Sweden, Dissertation No. 88, 1982.
[17] H. Knutsson, Representing local structure using tensors, in: Proceedings of the Sixth Scandinavian Conference on Image Analysis, Oulu, Finland, 1989, pp. 244–251; Report LiTH-ISY-I-1019, Computer Vision Laboratory, Linköping University, Sweden, 1989.
[18] H. Knutsson, M. Andersson, J. Wiklund, Advanced filter design, in: Proceedings of the Scandinavian Conference on Image Analysis (SCIA), Greenland, 1999.
[19] H. Knutsson, C.-F. Westin, Normalized and differential convolution: methods for interpolation and filtering of incomplete and uncertain data, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1993, pp. 515–523.
[20] R. Krishnamurthy, P. Moulin, J.W. Woods, Optical flow techniques applied to video coding, in: Proceedings of the IEEE International Conference on Image Processing, Vol. I, Washington, USA, 1995, pp. 570–573.
[21] T. Kronander, Some Aspects of Perception Based Image Coding, Dissertation No. 203, Linköping University, Linköping, 1989.
[22] C. Kuglin, D. Hines, The phase correlation image alignment method, in: Proceedings of the IEEE International Conference on Cybernetics and Society, IEEE, 1975, pp. 163–165.
[23] H. Li, D.C. Escudero, R. Forchheimer, Motion compensated multiresolution transmission of digital video signals, in: H. Harashima (Ed.), Proceedings of the 1995 International Workshop on Very Low Bit-rate Video, 1995.

[24] K.P. Lim, M.N. Chong, A. Das, Low-bit-rate video coding using dense motion field and uncovered background prediction, IEEE Trans. Image Process. 10 (2001) 164–166.
[25] MPEG Software Simulation Group, MPEG-2 Encoder/Decoder, Version 1.2, 1996, available at ftp://ftp.mpeg.org/pub/mpeg/mssg/.
[26] T. Naveen, J.W. Woods, Motion compensated multiresolution transmission of high definition video, IEEE Trans. Circuits Systems Video Technol. 4 (1994) 29–41.
[27] A.N. Netravali, J.D. Robbins, Motion compensated television coding: Part 1, Bell System Technical J. 58 (1979) 631–670.
[28] D.A. Pollen, S.F. Ronner, Spatial computation performed by simple and complex cells in the visual cortex of the cat, Vision Res. 22 (1982) 101–118.
[29] A. Said, W. Pearlman, A new fast and efficient image codec based on set partitioning in hierarchical trees, IEEE Trans. Circuits Systems Video Technol. 6 (1996) 243–250.
[30] D. Wang, D. Lauzon, Hybrid algorithm for estimating true motion-fields, Opt. Eng. 39 (11) (2000) 2876–2881.
[31] C.-J. Westelius, Focus of attention and gaze control for robot vision, Ph.D. Thesis, Linköping University, SE-581 83 Linköping, Sweden, Dissertation No. 379, ISBN 91-7871-530-X, 1995.
[32] R. Wilson, H. Knutsson, A multiresolution stereopsis algorithm based on the Gabor representation, in: Proceedings of the Third International Conference on Image Processing and its Applications, IEE, Warwick, Great Britain, 1989, pp. 19–22.
[33] X. Yang, K. Ramchandran, Scalable wavelet video coding using aliasing-reduced hierarchical motion compensation, IEEE Trans. Image Process. 9 (5) (2000) 778–791.