SIGNAL PROCESSING: IMAGE COMMUNICATION
ELSEVIER    Signal Processing: Image Communication 11 (1998) 231-254

Stereo image analysis for multi-viewpoint telepresence applications

Ebroul Izquierdo*

Signal Processing Department, Heinrich-Hertz-Institute for Communication Technology (HHI), Einsteinufer 37, 10587 Berlin, Germany

Received 3 June 1996
Abstract

An improved method for combined motion and disparity estimation in stereo sequences to synthesize temporally and perspectively intermediate views is presented. The main problems of matching methods for motion and disparity analysis are summarised. The improved concept is based on a modified block-matching algorithm in which a cost function consisting of feature- and area-based correlation together with an appropriately weighted temporal smoothness term is applied. Considerable improvements have been obtained with respect to the motion and disparity assignments by introducing a confidence measure to evaluate the reliability of estimated correspondences. In occluded image areas, enhanced results are obtained by applying an edge-assisted vector interpolation strategy. Two different image synthesis concepts are presented. The first concept is suitable for processing natural stereo sequences. It comprises the detection of covered and uncovered image areas caused by motion or disparity. This information is used to switch between different interpolation and extrapolation modes during the computation of intermediate views. The second, object-based approach is suitable for processing typical video conference scenes containing extremely large occluded image regions while keeping implementation costs low. A set of stereo sequences has been processed. The performed computer simulations show that a continuous motion parallax can be obtained with good image quality by using sequences taken with stereo cameras having large interaxial distances. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Disparity estimation; Motion estimation; Hierarchical block-matching; Image synthesis; Area and feature correlation; Uncovered image regions
1. Introduction

When images of real-world scenes are seen from two different perspectives, the third dimension can be measured. This process is called stereo vision
*E-mail: [email protected]; tel.: +49-30-31002619; fax: +49-30-3927200.

0923-5965/98/$19.00 Copyright © 1998 Elsevier Science B.V. All rights reserved.
PII S0923-5965(97)00031-3
and is employed in a large variety of applications in computer vision. For example, for video communication with telepresence in the near future, an autostereoscopic multi-viewpoint system capable of offering a very realistic 3D impression to the user at low cost is desirable. This goal can be reached through stereo analysis. The principle at the heart of this technology is to analyze the information supplied by a stereo camera with a large baseline
in order to synthesize intermediate stereo views. These views are then perceived as taken from virtual stereo cameras situated in any position between the two real cameras, so that, on the one hand, the system offers a realistic 3D impression with continuous motion parallax and, on the other hand, for video conferencing the system allows eye contact between the conference participants. The same method can be applied to generate 3DTV with continuous motion parallax. The main problem in stereo vision analysis is to find corresponding points in images taken from different viewpoints, i.e. to detect the differences occurring between images. These differences are called the disparity between the two images and can be seen as a vector field mapping one stereo image into the other. The determination of this vector field has been called the correspondence problem. The solution of the correspondence problem seems necessary for image synthesis or spatial image interpolation in stereo. Moreover, disparity estimation is the most important technique for obtaining depth information in a scene. Once correspondences are determined, the three-dimensional position of an object can be computed by triangulation of corresponding pairs of points, assuming that the camera pick-up parameters are known. Among the different techniques reported for solving the correspondence problem, two powerful classes of methods can be emphasized: feature-based and area-based algorithms.
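For an ideal parallel camera configuration, the triangulation step mentioned above reduces to the classical relation Z = f·B/d between depth, focal length, baseline and horizontal disparity. The following is a minimal illustrative sketch; the function name and all numeric values are assumptions for illustration, not taken from the paper.

```python
# Depth from disparity by triangulation for an ideal parallel stereo rig:
#   Z = f * B / d
# with focal length f (pixels), baseline B (metres) and horizontal
# disparity d (pixels). All values below are illustrative.

def depth_from_disparity(d_pixels, focal_px, baseline_m):
    """Return the depth in metres of a point with disparity d_pixels."""
    if d_pixels <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / d_pixels

if __name__ == "__main__":
    # A large-baseline rig as used for telepresence: B = 0.5 m, f = 1200 px.
    print(depth_from_disparity(40.0, 1200.0, 0.5))  # 15.0 (metres)
```

Note how the large baselines used for multi-viewpoint capture make depth more sensitive to disparity errors at close range, which is one motivation for reliable correspondence estimation.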
Feature-based techniques match features in the left image to those in the right image [4,6,7,22]. Features are selected as the most prominent parts of the scene, such as edges, peaks, line segments, contours, etc. The parallaxes are estimated only for the extracted features, so that the number of matches is extremely small compared with the whole number of pixels in the considered image region. These methods have a moderate computational complexity. Furthermore, the matching process is reliable and less sensitive to photometric variations because features represent properties of a scene. Nevertheless, feature-based algorithms possess an inherent error propagation, because errors in the location of the features influence the disparity estimation. These errors cannot be avoided. For instance, in [2,14,15] errors due to the curvature of the zero-crossings as well as errors due to the mutual influence of two adjacent zero-crossings in Laplacian-of-Gaussian filtered images are discussed. A second drawback is that matching pairs are found only if corresponding features can be extracted in both images. Obviously, this assumption fails in occluded image areas, i.e., in the parts of the scene that are seen by one camera but not by the other. Area-based methods find corresponding points on the basis of the similarity of the corresponding surrounding areas in the left and right images [1,3,5,10,11]. Usually, the surrounding areas have a rectangular form and are called measurement windows. The luminance information within the measurement window in the first image is compared with the displaced version of the luminance information within the measurement window in the second image. To perform this process, a suitable correlation function has to be defined. The correlation function is then forced to obey appropriate constraints. The constrained correlation contributes to forming the cost function. Finally, from a set of candidate vectors, the vector minimizing the cost function is selected as the disparity for the considered image point. The reliability of the estimated disparities depends in part on the measurement window size. Large window sizes supply reliable results, but on the other hand the estimation accuracy decreases [8]. In order to achieve both accuracy and reliability, hierarchical block-matching techniques combine different window sizes in several hierarchical steps, increasing the matching accuracy, reducing the amount of false estimations, and decreasing the computational complexity of the algorithms [10,11]. Nevertheless, such methods tend to fail in areas of the image where the luminance variation is low, and in occluded image areas. In this paper an improved method for disparity estimation and image synthesis in stereo sequences is described.
The goal of the matching method is to overcome the difficulties mentioned above by combining different strategies and exploiting the redundancy of information present in the 3D structure. The employed strategies include the following:
- Global-to-local hierarchical matching procedure.
- A similarity criterion combining area and feature matching.
- Joined motion and disparity estimation is carried out in a coherent way.
- Large homogeneous image areas are detected. Points belonging to these areas are not matched.
- Matched points are subject to consistency tests and reliability checks. Vectors with low reliability are removed. The resulting holes and the sampling positions where no disparities have been estimated are filled by interpolation. The interpolation is forced to preserve disparity discontinuities using the image edges.

The starting point of the research presented here is the method introduced in [10]. This previously reported method is enhanced by the items mentioned above. The novelty of the improved method lies in the sensible combination of new ideas with ideas taken from known methods to produce a robust and accurate correspondence estimator. Because the main application addressed in this paper is image synthesis for 3DTV and videoconferencing, two methods for generating intermediate views are described as well. A system overview comprising both approaches for intermediate view generation is depicted in Fig. 2. The disparity estimator can be divided into three broad modules: preprocessing, hierarchical block matching, and dense disparity field generation. A flowchart describing the interrelation of the different modules is shown in Fig. 1.

Fig. 1. System overview of disparity estimator.

Fig. 2. System overview of image synthesis module.

In the first module homogeneous regions are recognized using a simple variance-based operator; additionally, the image edges and features are extracted. The image edges are selected using low-pass filtering, which is carried out by convolution with Gaussians of increasing variance. The set of edges is then taken from the set of points where the Laplacian of the smoothed signal changes sign. The convolution of the signal with Gaussians is performed for two different scales. According to [23], the main edges are identified at a coarse scale; then they are followed backward through the scale-space toward finer scales. The feature
extraction is performed on the Gaussian-smoothed image. Hereby three gradient-based features are taken into account: edgeness, gradient direction and cornerness. The second module consists of two hierarchical matching steps. In the first step global displacements are detected. The information supplied by this step is used in the local step as a start approximation. In each step the best match is selected using a cost function, which consists of a combined area and feature similarity criterion and a temporal smoothness term involving temporal prediction vectors. The temporal smoothness term is weighted by a factor that is estimated proportionally to the reliability of the temporal prediction vector. The hierarchical block-matching procedure supplies a non-dense disparity field with large holes in homogeneous and occluded image regions. The third module fills these gaps by interpolation. Herein the image edges and the previously estimated disparity fields are used in order to preserve disparity discontinuities at segment borders and obtain smooth fields within such segments. After the interpolation step dense fields are available.

The paper is organized as follows. In Section 2 the relevant notation is introduced, as well as the basic conditions applied in order to constrain the solution of the correspondence problem sufficiently to find matches with low search cost and high reliability. Section 3 deals with the individual elements of the proposed matching procedure: homogeneous region and edge extraction, feature selection, hierarchical block-matching strategy, temporal prediction vector estimation, cost function, reliability measure, and interpolation. Two different approaches for image synthesis are presented in Section 4. Experimental results obtained by applying the method to several natural stereo sequences, demonstrating the effectiveness of the proposed method, are described in Section 5, and the paper closes with a summary and conclusions in Section 6.
2. Problem description

2.1. Definitions and notations

For the sake of completeness we recall some basic notations. Let R_l^(t) and R_r^(t) be the left and right image planes, respectively, at time t. Moreover, let I_l^(t) and I_r^(t) be the real-valued bounded functions on R_l^(t) ⊂ ℝ² and R_r^(t) ⊂ ℝ², respectively, representing the intensity maps of the scene projected onto R_l^(t) and R_r^(t). So the intensity functions at time t are

I_l^(t): R_l^(t) → ℝ   and   I_r^(t): R_r^(t) → ℝ.

The image intensity at the point z = (x, y) in the image plane R_l^(t) at time t is I_l^(t)(z). Two points z₁ = (x₁, y₁) and z₂ = (x₂, y₂) in the images R_l^(t) and R_r^(t), respectively (or R^(t) and R^(t+1)), should be matched if and only if they are image plane projections of the same real-world surface point. In this case the disparity vector d, pointing from left to right, with respect to the sampling position z₁ ∈ R_l^(t) is given by

d_l(z₁, t) = (x₂ − x₁, y₂ − y₁).

Analogously, we denote the motion vectors pointing from R^(t) to R^(t+1) of two matched points in two successive frames t and t + 1 as m_l(z₁, t) and m_r(z₁, t), respectively. If no notational confusion can arise, in the following we omit the indices l and r indicating whether a disparity vector points from left to right or from right to left, as well as the same indices in the motion notation. The disparity field from left to right at time t can be described by the mapping

D_l^(t): R_l^(t) → R_r^(t).

The mapping D_l^(t) ideally assigns one disparity vector to each sampling position z₁ ∈ R_l^(t), i.e., D_l^(t)(z₁) = z₂. Let 𝒮 be the set containing all mappings D_l^(t). To each mapping D_l^(t) we assign a real positive value ℱ(D_l^(t)). So we can define the functional ℱ: 𝒮 → ℝ⁺ as

ℱ(D_l^(t)) = ∫∫_{R_l^(t)} | I_l^(t) − I_r^(t) ∘ D_l^(t) |,

with

(I_l^(t) − I_r^(t) ∘ D_l^(t))(z₁) = I_l^(t)(z₁) − I_r^(t)(D_l^(t)(z₁)) = I_l^(t)(z₁) − I_r^(t)(z₂).

The correspondence problem can be formulated as finding the mapping D_l^(t) which satisfies the relation ℱ(D_l^(t)) ⩽ ℱ(D̃_l^(t)) for all D̃_l^(t) ∈ 𝒮. Consequently,
the correspondence problem is an optimization problem in which the functional ℱ(D_l^(t)) has to be minimized. One powerful technique for solving this problem is to compute similarity values at determined candidate points within a search region, on the basis of the correlation of the neighboring intensity values and/or features, and to find the minimum. The above-formulated problem is a well-known ill-posed inverse problem, in which the available information does not constrain the solution sufficiently; it therefore must be regularized. Several suitable constraints can be applied in order to regularize this optimization problem and obtain an accurate solution. In the following subsection, the constraints used in the proposed correspondence estimator are described.
2.2. Constraints

Several conditions have been used to find the best match at a low search cost. A very well-known strategy uses the epipolar geometry as constraint [9,21]. This constraint demands that epipolar lines in R_l^(t) match the corresponding epipolar lines in R_r^(t). Corresponding epipolar lines are defined as the intersection of a plane, containing both optic centers and a given point in the real-world scene, with the image planes [9]. The simplest case is given when the camera configuration is parallel, i.e., when the image planes are coplanar, the vertical camera coordinates are parallel and the distortion coefficient is equal to zero. In this case, the vertical components of the disparity vectors are zero, so that only a one-dimensional search along the scan lines is necessary. There are several methods of calculating the epipolar lines, but most of them require a stereo calibration step. In [9] an accurate method of finding the equations governing the epipolar lines without calibration is described. Using this method we can determine the epipolar lines and match along them even if the camera geometry is unknown. Smoothness-constraint-based regularization is the technique most often applied in the stereo vision literature [16,18]. It is derived from a practical observation: disparity (motion) values vary
smoothly almost everywhere. It describes one of the three basic rules that Marr and Poggio propose in their famous computational theory of depth perception [17]. In [20], a rigorous mathematical discussion is given about the application of the continuity property for the determination of the displacement vector field in dynamic image sequences. In our matching algorithm we exploit the smoothness constraint using a very simple rule: excepting the region boundaries, if d is the displacement vector of a sampling point z, then the displacement vectors of the surrounding points z + Δz can be searched in a small neighborhood of d.
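The interplay of the epipolar and smoothness constraints can be illustrated for the parallel camera case: the search is one-dimensional along the scan line, and each search interval is centered on the disparity of the preceding pixel. The sketch below uses a plain SAD cost as a stand-in for the paper's combined area/feature cost function; window size, search radius and range are illustrative assumptions. The search radius must cover the initial disparity (or a coarse initialization, as the paper's global step provides, must supply the start value).

```python
import numpy as np

def sad(left, right, x, y, d, w=3):
    """Sum of absolute differences between the window centred at (x, y) in
    the left image and the window centred at (x - d, y) in the right image."""
    a = left[y - w:y + w + 1, x - w:x + w + 1].astype(np.int32)
    b = right[y - w:y + w + 1, x - d - w:x - d + w + 1].astype(np.int32)
    return int(np.abs(a - b).sum())

def match_scanline(left, right, y, d_max=16, radius=3, w=3):
    """1-D disparity search along one scan line (epipolar constraint);
    each search is limited to a neighbourhood of the previous pixel's
    disparity (smoothness constraint)."""
    disparities = []
    d_prev = 0
    for x in range(d_max + w, left.shape[1] - w):
        lo, hi = max(0, d_prev - radius), min(d_max, d_prev + radius)
        d_prev = min(range(lo, hi + 1), key=lambda d: sad(left, right, x, y, d, w))
        disparities.append(d_prev)
    return disparities
```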
Another natural constraint may be derived from the inherent stereo-motion redundancy of the 3D structure. A coherence condition between the motion and the disparities in a stereo sequence may be expressed as a linear combination of four vectors (two motion and two disparity vectors) in two successive frame pairs. For a sampling position z at time t such a relation is given by (see Fig. 4)

d_l(z, t) + m_r(z + d_l(z, t), t) − d_l(z + m_l(z, t), t + 1) − m_l(z, t) = 0.    (2.1)
The relation (2.1) can be used for calculating the disparity vector at time t + 1 from the motion and disparity vectors at time t. Such an estimation yields an accumulation of the errors resulting from the perturbations present in the vectors at time t. Nevertheless, (2.1) can be applied to find a start prediction vector, or as a condition from which to derive a reliability test for the estimated displacements. The uniqueness constraint relies on the assumption that a luminance value in any image may be the image plane projection of only one real-world surface point, because each point has a unique physical position. According to this principle the following constraint can be formulated: each image point may be assigned at most one disparity value. This condition holds for opaque surfaces, but may be violated in the case of transparent objects. Finally, a feature compatibility constraint is applied. Corresponding features must have the same physical cause, because corresponding points are projections of the same surface point in the real world and therefore must have similar geometrical and physical properties.
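The loop-closure test implied by relation (2.1) can be sketched as follows; estimated fields are represented here as callables returning 2-vectors, and the tolerance value is an illustrative assumption rather than a value from the paper.

```python
import numpy as np

# Reliability test based on the stereo-motion coherence relation (2.1):
#   d_l(z, t) + m_r(z + d_l(z, t), t) - d_l(z + m_l(z, t), t + 1) - m_l(z, t) = 0
# For estimated (noisy) fields the residual of the left-hand side is
# compared against a tolerance.

def coherence_residual(z, d_l_t, m_r_t, d_l_t1, m_l_t):
    """Euclidean norm of the left-hand side of (2.1) at position z."""
    z = np.asarray(z, dtype=float)
    d = np.asarray(d_l_t(z))
    m = np.asarray(m_l_t(z))
    residual = d + np.asarray(m_r_t(z + d)) - np.asarray(d_l_t1(z + m)) - m
    return float(np.linalg.norm(residual))

def is_coherent(z, d_l_t, m_r_t, d_l_t1, m_l_t, tol=1.0):
    return coherence_residual(z, d_l_t, m_r_t, d_l_t1, m_l_t) <= tol
```

For a static scene with constant disparity and zero motion the residual vanishes exactly; perturbing any one of the four vectors makes the residual grow accordingly.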
3. Coherent disparity and motion estimation The aim of this section is to describe the computational framework for combined motion and disparity analysis in a variety of stereo sequences taken from real-world scenes. The central matching strategies are applied likewise for disparity and motion estimation. Motion compensated temporal prediction vectors interact with hierarchical prediction vectors and several constraints to enhance the performance and rigor of the matching process. The three modules specified above are presented in the following subsections.
3.1. Preprocessing

The success of the correspondence estimation depends essentially on the information available to be matched. Normally, the input information is processed in order to obtain additional data which can be used to constrain the set of candidates to be explored during the matching search or to enhance the similarity measures used. This task can be considered as preprocessing, and it constitutes an important step in the whole correspondence analysis.

3.1.1. Homogeneous region recognition

In order to select the image points that cannot be distinguished from their neighbors, a simple difference-based interest operator is used. The sums of absolute differences of adjacent pixels in the four principal directions (vertical, horizontal and the two diagonals) are measured over overlapping square windows of small size. For each sampling position the output of the interest operator is defined as the maximum of the sums of absolute differences of pixels adjacent in each of the four directions. Sampling positions whose calculated maximum exceeds a certain threshold are declared interesting points; the remaining points are candidates for belonging to a homogeneous region. The information obtained after this procedure is stored in a binary field containing rough information about the homogeneous regions. Hereby, the binary value 1 is assigned to the interesting points and 0 to the remaining points. Starting from this binary field large connected uni-
form areas are captured as follows. Firstly, small peaks are removed using a window of small size, which is displaced over the image: if the whole window border consists of only one binary value, then this value is assigned to all the sampling positions inside the window. Secondly, connected uniform areas are labeled using an efficient labeling procedure. Sequential numbers are assigned to the different connected regions, and they are stored as a field whose elements contain these numbers. At the same time, the number of pixels contained in each region is counted. All regions whose area does not exceed a given threshold are declared non-homogeneous, i.e., the binary value 1 is assigned to these regions. Finally, the mask so obtained is smoothed using a 3 x 3 max/min median filter.

3.1.2. Edge extraction

The convolution of the image with Gaussians supplies, for different variance values σ, different smoothed images. The amount of smoothing is controlled exclusively by the variance of the Gaussian. The space formed by all smoothed versions of the image, depending on the scale parameter σ, is called scale-space and has been introduced by Koenderink and Witkin [12,23]. It is well known that the edges at coarse scales have an inexact position, whereas with a Gaussian filtering of small variance all the edges keep their correct location [14]. In the latter case, however, the main edges are embedded in a crowd of spurious edges due to noise and texture. Following the theory introduced by Witkin, we identify the main edges at a coarse scale, i.e., using a large variance value, and follow them backward through the scale-space by decreasing the variance value. The edges are detected as the locations where the second derivative of the smoothed image crosses zero. The zero-crossings are extracted applying a predicate-based algorithm similar to the one introduced in [8].
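The single-scale detection step just described can be sketched as below, marking sign changes of the Laplacian-of-Gaussian response between horizontal and vertical neighbors. The multi-scale tracking is omitted, the sigma value is illustrative, and scipy is assumed to be available.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def log_zero_crossings(image, sigma=2.0):
    """Boolean edge map: zero-crossings of the Laplacian-of-Gaussian
    response at one scale sigma."""
    response = gaussian_laplace(image.astype(float), sigma)
    edges = np.zeros(image.shape, dtype=bool)
    # A pixel is marked if its LoG response changes sign against the
    # pixel to its right or the pixel below it.
    edges[:, :-1] |= np.signbit(response[:, :-1]) != np.signbit(response[:, 1:])
    edges[:-1, :] |= np.signbit(response[:-1, :]) != np.signbit(response[1:, :])
    return edges
```

Applied to a synthetic luminance step, the detected crossings cluster around the true edge location, while flat regions produce no response.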
According to Koschan [13], instead of zero-crossings, interval-crossings are inspected by regarding a 3 x 3 neighborhood of each sampling position in the Laplacian-of-Gaussian convolved image. In order to obtain closed and accurate contours, the interval-crossings are detected for different scale values and, as explained above, starting from the lowest resolution, the set of edges is completed and their locations are shifted to the exact
position using the information extracted from the higher resolution levels.

3.1.3. Feature extraction

If, in addition to the similarity of neighboring luminance values of corresponding points, correspondence between features is established as well, then the robustness of the algorithm can be expected to increase. According to this observation three gradient-based features are included in the matching process: edgeness, gradient direction and cornerness. In order to reduce the effects of the ill-posed nature of differentiation and its inherent instability, as well as the repercussion of perturbations due to noise in the image data, the stereo images are regularized using a Gaussian convolution. The basic idea of this regularization step is to smooth the initial data in order to perform noise reduction and at the same time give more stability to the differentiation results. That is, the three features are extracted from the smoothed images in order to obtain results robust against noise. The edgeness is defined as the magnitude of the intensity gradient, namely ||∇I||. The values supplied by this norm function are mapped to an appropriate value range, and zero is assigned to all values of ||∇I|| which do not exceed a given threshold. The gradient direction is given by the relation

arctan( (∂I/∂y) / (∂I/∂x) ),

where ∂I/∂x and ∂I/∂y are the normed partial derivatives of the intensity function. The cornerness is defined according to [22] as the change of the direction of the gradient at two points z_a and z_b near the considered point z, weighted by ||∇I(z)||. z_a and z_b are located on a circle containing nine pixels and centered at z. Moreover, z_a and z_b are chosen such that the directional derivative along the circle reaches the minimum and the maximum, respectively. Let A = ∇I(z_a), B = ∇I(z_b), and let the angle α be the angle from A to B measured in radians counterclockwise, ranging from −π to π. To distinguish a black corner on a white background from a white corner on a black background, positive and negative cornerness are
defined separately. The closer the angle is to π/2, the higher the positive cornerness measure should be; vice versa, the closer the angle is to −π/2, the higher the negative cornerness measure should be. A detailed description of this measure can be found in [22].
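Two of the three features can be sketched compactly: edgeness as the thresholded gradient magnitude and gradient direction via a robust arctangent. The threshold is an illustrative assumption, and the Gaussian pre-smoothing the paper applies is omitted here for brevity.

```python
import numpy as np

def gradient_features(image, threshold=10.0):
    """Return (edgeness, direction) maps for a grayscale image.

    Edgeness: gradient magnitude ||grad I||, clipped to zero below the
    threshold. Direction: arctan((dI/dy)/(dI/dx)), computed with arctan2
    so that a zero horizontal derivative is handled safely."""
    iy, ix = np.gradient(image.astype(float))   # axis 0 = y, axis 1 = x
    magnitude = np.hypot(ix, iy)
    edgeness = np.where(magnitude > threshold, magnitude, 0.0)
    direction = np.arctan2(iy, ix)              # in (-pi, pi]
    return edgeness, direction
```

For a purely horizontal intensity ramp the direction map is uniformly zero and the edgeness equals the constant slope, which is a convenient sanity check.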
3.2. Hierarchical block matching
Using any kind of block-matching strategy, the reliability with which the minimum of the cost function represents the true disparity depends in part on the size of the measurement window. Large window sizes supply reliable but not accurate results, which can be used as start displacements to be refined using smaller window sizes. The basic idea of a hierarchical approach is to start at the highest hierarchical level with a large measurement window and a maximum search range. Only a set of candidate displacement vectors from the higher level is considered at a lower level in order to reduce the number of potential displacements to be taken into account. The dimension of the measurement window is decreased as well. In order to keep implementation costs as low as possible, displacement vectors are normally estimated only with respect to some selected sampling positions at the highest level. At the lowest level, a displacement vector is assigned to each sampling position of the image.

3.2.1. A hierarchical approach in two steps

The proposed matching strategy is based on a hierarchical approach with two levels. Displacement vectors are only estimated if the center of the measurement window belongs to an image area that has been identified as textured (non-homogeneous). This decision is made using the mask generated by the homogeneous area recognition procedure described above. The first step consists of a global displacement estimation. The global vectors are taken in the second step as potential displacements.

Global step: In order to reduce noise sensitivity and simultaneously reach higher efficiency, both the left and right image fields are subsampled applying a Gaussian filter. Thereafter each image is split into rectangular blocks of moderate size. In
each block one sampling position is chosen as the representative point for the entire block. Furthermore, matching is performed only for those blocks containing at least one point belonging to the textured image regions. The first question to be answered is: which point should be selected for matching in each block? Two possibilities can be considered:
- The center of each block is taken as the point to be matched.
- The point which is best distinguished from its neighbors, in the following sense: the output value of a variance-based interest operator is used to quantify the relevance of a given sampling position.
In the first case we use a large measurement window. In order to decrease the computational complexity of the algorithm, the values within the measurement window are subsampled, skipping a certain number of pixels in the horizontal and/or vertical direction. In the second case smaller measurement windows can be used. To select the matching point, the same interest operator used to detect homogeneous image areas is applied. Once the matching points are selected we must define the search regions. For disparity estimation we perform a full search along the epipolar lines within a maximum search range. For motion estimation we perform a texture analysis of the frame difference. According to the method introduced by Seferidis and Ghanbari [19], the direction of displacement and the image activity in the scene can be measured using the absolute frame difference. Herein two statistical features derived from the temporal difference histogram are used: the second moment of inertia and the inverse difference moment. Using these two measures over the absolute frame difference, a dynamic search-range adaptation is performed. After estimation of the search region we proceed to determine the corresponding point on the basis of the cost function described below.
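The global step for one representative point can be sketched as a full 1-D search along the horizontal epipolar line with a large measurement window. A plain SAD cost again stands in for the paper's combined area/feature cost function; window size and search range are illustrative assumptions.

```python
import numpy as np

def global_disparity(left, right, x, y, d_max=16, w=8):
    """Full search along the scan line for the representative point (x, y):
    return the horizontal displacement in [0, d_max] with minimal SAD cost,
    computed with a large measurement window of half-size w."""
    win_l = left[y - w:y + w + 1, x - w:x + w + 1].astype(np.int32)
    best_d, best_cost = 0, None
    for d in range(0, d_max + 1):
        win_r = right[y - w:y + w + 1, x - d - w:x - d + w + 1].astype(np.int32)
        cost = int(np.abs(win_l - win_r).sum())
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

The returned vector would then seed the local step as one of its candidate displacements.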
Let z be a sampling position in the left image R_l^(t) which has been chosen to be matched. For each sampling position within the search interval on R_r^(t), the cost is calculated. The particular sampling position ẑ which minimizes the cost function is designated the corresponding point of z, and the difference d_l(z, t) = ẑ − z is assigned to the disparity field at the sampling position z. Once
d_l(z, t) has been estimated, the same procedure is repeated from right to left, using ẑ as the reference sampling position. A measurement window of the same size is placed on the left image and shifted within the estimated search interval. In addition, the correspondence search is carried out without considering the previously calculated disparity from left to right. Finally, a bi-directional consistency check is performed in order to reject outliers. If the condition

|| d_l(z, t) + d_r(ẑ, t) || < ε
(3.1)
is violated for a given value ε, the two vectors d_l(z, t) and d_r(ẑ, t) are eliminated from both disparity fields. This verification enables a higher reliability of the estimated disparities and is derived from the uniqueness constraint. The backward/forward global motion vectors are estimated analogously for each stereo channel separately.

Local step: In this step the corresponding point is estimated for each sampling position belonging to the textured image regions. The matching process is applied to the full-resolution images, i.e., to the non-subsampled images, but using relatively small measurement windows. Instead of performing a full search as in the global step, only a few candidate vectors are tested. The candidate vectors are selected as follows:
- 6 from the global level;
- 4 from the surrounding sampling positions already calculated;
- 1 from the temporally preceding displacement field projected into the current time level using the estimated motion.
The positions of these candidate vectors are shown in Fig. 3. Each of these candidate vectors is tested within a small search range. In addition, those candidates which point into a homogeneous area of the other image or are not close to the epipolar line are regarded as invalid. The vector minimizing the cost function is selected as the corresponding disparity vector. Local disparity estimation is also performed bi-directionally, in order to apply the cross-consistency check to the estimation results. The estimated left-to-right and right-to-left disparities are again judged inconsistent if condition (3.1) is violated. Inconsistent vectors are removed from both fields. The same procedure is applied in the local step for motion estimation.
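The cross-consistency check of condition (3.1) can be sketched over whole vector fields as follows: a left-to-right vector survives only if the right-to-left vector found at its matched position nearly cancels it. The field layout and the tolerance value are illustrative assumptions.

```python
import numpy as np

def consistency_mask(d_lr, d_rl, eps=1.0):
    """Bi-directional consistency check, condition (3.1).

    d_lr, d_rl: (H, W, 2) vector fields storing (dx, dy) per pixel for the
    left-to-right and right-to-left disparities. Returns a boolean mask of
    left-image positions whose match survives the check; positions whose
    match falls outside the image are rejected."""
    h, w, _ = d_lr.shape
    ok = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            zx = x + int(round(d_lr[y, x, 0]))
            zy = y + int(round(d_lr[y, x, 1]))
            if 0 <= zx < w and 0 <= zy < h:
                # || d_l(z, t) + d_r(z + d_l(z, t), t) || < eps
                ok[y, x] = np.linalg.norm(d_lr[y, x] + d_rl[zy, zx]) < eps
    return ok
```

Vectors failing the mask would be removed from both fields, leaving holes to be filled by the edge-assisted interpolation described earlier.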
Estimation at Time t-1 Estimation at ‘Time t
Current Sampling Position Sampling Position at time r-1 Already Processed Sampling Positions Candidates from the Local Estimation
(f
-i
Candidates from the Global Estimation
’ Motion Compensated Temporal Candidate / Fig. 3. Candidate
3.2.2.
Temporal prediction
vectors
used as start approximation
vectors
The use of temporal prediction vectors is based on the practical observation that motion and disparity vectors in real sequences change only slightly within short time periods. In stereo sequences of real-world scenes it is recommended to exploit the stereo-motion redundancy of the 3D structure over time in order to enhance the estimates. After disparity and motion estimation at time t, prediction fields are formed by projecting the current results onto the next temporal level t + 1. The predicted left-to-right disparity d_p with respect to t + 1 is computed according to the following relation (see Fig. 4):

d_p(z + Δz + m_l(z, t), t + 1) = d_l(z, t).    (3.2)
In order to avoid holes (sampling positions to which no prediction vector can be assigned), each vector is also projected onto adjacent sampling positions at time t + 1, i.e. within a range given by Δz. Analogously, we can obtain prediction vectors for the motion between the frames at times t and t + 1:
Fig. 4. Stereo-motion cooperation and prediction vectors. ε denotes the difference obtained after the bi-directional consistency check according to (3.1), δ the difference given by relation (3.3), and m_p and d_p the temporal prediction vectors.
m_p(z + Δz + m_l(z, t), t + 1) = m_l(z, t).

Using expression (3.2), a conflict arises if a prediction vector is already assigned to the sampling position at time t + 1. Conflicts are solved by

d_p(z + Δz + m_l(z, t), t + 1) = d_p^(1) if F(d_p^(1), 0, 0) < F(d_p^(2), 0, 0), and d_p^(2) else,
where F is the cost function. Conflicts with respect to motion vectors are solved analogously, i.e., by comparing the respective values of the cost function. Obviously, the prediction vector field estimated in this way cannot be dense, because holes arise in the occluded image areas. Moreover, at the image borders the vectors d(z, t) and m(z, t) can point out of the image, so that (3.2) cannot be applied.

3.2.3. Reliability measure

A way of recursively enforcing temporal enhancement of the disparity and motion fields is to calculate the reliability of each estimated vector using an appropriate confidence measure and to weight the temporal smoothness term of the cost function according to the degree of confidence of the temporal prediction vectors. For this purpose the definition of a suitable reliability measure is of the utmost importance. We propose a criterion based on two geometrical constraints together with the analysis of the curvature of the correlation surface, introduced by Anandan [1]. Thus, for a given displacement vector v the reliability function consists of three terms:

f(v) = β₁f₁(v) + β₂f₂(v) + β₃f₃(v).
Here β_i, i = 1, 2, 3, are suitable weight coefficients, and the real-valued functions f_i are defined as follows.

Stereo-motion consistency check: In two successive frame pairs, temporally and spatially corresponding image points are characterized by a closed linked vector chain according to relation (2.1). For a given image point z, let v_i, i = 1, ..., 4, be the four vectors (two motion and two disparity vectors) which form this vector chain. The function value f₁(v) will be large if these four vectors satisfy condition (2.1); on the other hand, this value will be small if the difference

δ_v = ||v₁ + v₂ − v₃ − v₄||    (3.3)

is large. Note that relation (3.3) cannot be defined for all z ∈ R_t^(l), because some vectors can point out of the image, for example at the image borders. Let us denote by S ⊆ R_t^(l) the set of sampling positions for which relation (3.3) can be defined, and by E the set of displacement vectors assigned to the sampling positions belonging to S. Following these observations we define

f₁(v) = g(δ_v) for all v ∈ E, and f₁(v) = 0 else,

where the real-valued function g must satisfy, for given upper bounds A and B of f₁ and δ_v, respectively, the following conditions:
- g is monotonically decreasing, i.e. g(y) ≤ g(x) whenever x ≤ y,
- g(0) = A,
- g(x) = 0 for all x ≥ B.
A simple family of functions fulfilling these conditions is

g(x) = A − (A/Bⁿ)xⁿ if 0 ≤ x ≤ B, and g(x) = 0 else.

Here the exponent n > 0 determines how fast g decreases. If n = 1, the confidence measure depends linearly on the deviation δ_v.

Bi-directional consistency check: An additional criterion for measuring the reliability of the vectors is obtained by performing a disparity verification between left-to-right and right-to-left disparity vectors according to relation (3.1). Compared with (3.3), relation (3.1) describes a weaker condition. Nevertheless, it can provide additional information for the reliability computation, particularly at sampling positions where (3.3) cannot be defined. Analogously to f₁, the function value of f₂ will be large if the deviation ε in (3.1) is small, and f₂ will take small values for large ε values. Let Ẽ be the set of vectors for which relation (3.1) can be defined. f₂ is calculated using

f₂(v) = g(ε_v) for all v ∈ Ẽ, and f₂(v) = 0 else,

where g is the same function defined above, but taking smaller bound values A and B for f₂ and ε.

Reliability measure from the analysis of the correlation surface: Anandan [1] proposed to deduce the reliability of the estimated vector by analyzing the correlation surface generated by testing a search window around a given candidate vector. It is intuitively clear that a flat surface indicates a random choice in a low-textured region, whereas a clear peak in the correlation surface corresponds to a reliable vector. To analyze the variation of the
correlation surface it is necessary to determine its curvature. It is well known that the curvature of a surface at a point along any direction can be calculated if the two principal curvatures at that point are known. Chupeau and Salmon [3] simplify the analysis by using the vertical and horizontal directions instead of the principal directions. In order to reuse the correlation values obtained during the matching process, a one-dimensional analysis along the epipolar lines is performed for disparity vectors. In this case the values of f₃ are calculated from the curvature of the correlation line instead of the correlation surface around the considered disparity vector. For the motion vectors, the values of f₃ are estimated applying the same confidence measure introduced by Anandan [1].

3.2.4. Cost function

To select the best match from a set of displacement candidates, the cost function plays a crucial role. We propose a criterion based on the similarity of corresponding neighboring pixels and features in the two considered images, together with an appropriately weighted temporal smoothness term. The weight coefficients of the smoothness term are estimated according to the reliability of the temporal prediction vectors. The similarity of the luminance and feature values within the reference window placed on the first image and the luminance and feature values within the correspondingly displaced search window in the second image is measured using the mean absolute difference. Let z = (x, y) be a point in the left image, z̃ = (x̃, ỹ) a point in the right image and d_l(z, t) = z̃ − z the considered candidate vector. The mean absolute difference of intensity values (MADI) is given by
MADI(d_l) = (1/(m·n)) Σ_W |I^(l)(z) − I^(r)(z + d_l(z, t))|.
Hereby it is assumed that the points z and z̃ are the centers of measurement windows of size m × n, over which the sum is taken. Analogously, the mean absolute difference of feature values (MADF) is defined. Note that the size of the measurement windows for the MADI does not necessarily have to coincide with the size of the measurement windows for the MADF. Actually, it is reasonable to choose a larger measurement window for the MADI than for the MADF. The cost function F is now defined as

F(d, d_p, α) = MADI(d) + MADF(d) + α·||d − d_p||,    (3.4)
with d as current test vector, d_p as temporal prediction vector at position z, and α an adaptive weight coefficient. α will be large if the respective prediction vector d_p has proved to be reliable, whereas α will be small if d_p is unreliable. In the latter case the matching selection is carried out essentially by the two similarity criteria MADI and MADF. Let C be an upper bound for α, which must be estimated depending on the size m × n of the measurement windows, and let D = max f(d_p) over all prediction vectors. Using this notation, α is defined as

α = (C/D)·f(d_p) if D > 0 and d_p exists, and α = 0 else.
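A minimal numeric sketch of the cost function (3.4) with this adaptive weighting is given below, assuming the 0–256 reliability range used later in the paper; the function name and parameter layout are illustrative.

```python
def cost_F(madi, madf, d, d_pred, rel, rel_max, C=5.0):
    """Cost function (3.4): F = MADI + MADF + alpha * |d - d_pred|.

    The adaptive weight is alpha = (C / D) * f(d_pred), where D = rel_max
    is the largest reliability value encountered in the field. When no
    prediction vector exists (d_pred is None) or D = 0, alpha = 0 and the
    selection relies on the similarity terms MADI and MADF alone.
    """
    if d_pred is None or rel_max <= 0:
        return madi + madf
    alpha = C / rel_max * rel
    return madi + madf + alpha * abs(d - d_pred)
```

For example, with MADI = 10, MADF = 2, a prediction one pixel away and a reliability of 128 out of 256, the smoothness term adds (5/256)·128·1 = 2.5 to the cost.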
3.3. Interpolation

After local matching a non-dense displacement field is available. For instance, in occluded regions or at the image borders the estimated fields can contain holes. In order to obtain dense vector maps these gaps have to be interpolated. In the interpolation process, discontinuities due to depth differences between object edges are preserved using an edge-controlled vector interpolation method. The interpolation is carried out in several steps. Initially, the estimated displacement vectors are used to split the image into several regions of homogeneous disparity. The region boundaries so obtained are combined with the edges extracted by the preprocessing module, in order to eliminate edges due to texture within the objects. The resulting edges are used to form new connected segments, in which the
interpolation process is performed independently. In a second step each segment is approximated by a thin plate constrained by the known displacement vectors. The tightness with which the plate is constrained is controlled by springs of different forces. The force of each spring is calculated proportionally to the reliability of the estimated vectors. Springs of no force are applied to sampling positions without vectors. Finally, the finite element method is used to approximate the thin plate of minimal potential energy [24].
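The paper solves the spring-constrained thin plate by the finite element method. As a much simpler stand-in, the sketch below confines a reliability-weighted average to each segment; it is not the FEM solution, but it illustrates the key property that interpolated values never leak across segment boundaries, so depth discontinuities between segments are preserved. Function and variable names are illustrative.

```python
def fill_segment_holes(disparity, reliability, segment_id):
    """Fill holes (None) per segment with the reliability-weighted mean
    of the known vectors of that segment. The reliabilities play the role
    of the spring forces: a vector with zero reliability contributes
    nothing, like a spring of no force."""
    out = list(disparity)
    for s in set(segment_id):
        num = den = 0.0
        for d, r, sid in zip(disparity, reliability, segment_id):
            if sid == s and d is not None:
                num += r * d              # weight each known vector by its reliability
                den += r
        if den > 0:
            fill = num / den
            for i, sid in enumerate(segment_id):
                if sid == s and out[i] is None:
                    out[i] = fill         # fill holes of this segment only
    return out
```

In the paper the per-segment surface is a thin plate of minimal potential energy rather than a constant; the segment-confined structure of the computation is the same.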
4. Image synthesis

Once corresponding points have been estimated, the synthesis of intermediate views is a relatively simple task in which the resulting image quality depends exclusively on the quality of the estimated disparities. To generate an intermediate image, the assignment of disparity with respect to the desired intermediate position is required. This assignment is performed by projecting the calculated left-to-right and right-to-left disparities onto the intermediate image plane to be synthesized. Furthermore, the recognition of occluded image areas plays a crucial role, because in such regions the available image information has to be extrapolated from only one of the two stereo images, whereas in regular image regions the luminance values provided by both stereo images are interpolated. According to the two addressed applications, two different approaches have been found to be suitable.

4.1. Arbitrary image synthesis

The first approach comprises the computation of occluded image areas by analyzing the two disparity fields in both directions. A sampling position z in the left image is expected to be occluded in the right image if

||d_l(z, t) + d_r(z + d_l(z, t) + Δz, t)|| > T.

Hereby z + Δz is any sampling position within the measurement window W centered at z, and T is a suitable threshold. A small negative value of
d_l(z, t) and a large positive value of d_r(z + d_l(z, t), t) indicate that the sampling position z is visible only in the left image plane. The occluded sampling positions with respect to the left image plane can be computed analogously. However, the average value within a window should be compared to the threshold in order to increase the reliability of the analysis. The occluded areas caused by motion can be computed analogously: instead of comparing disparity vectors from the left and right channels, the motion vector field is analyzed in the temporal domain. In a second processing step, motion and disparity vectors are projected onto the intermediate image plane. Each sampling position in the intermediate image plane is assigned a scaled displacement vector according to the following relations:
ẑ = z + Δp·d_l(z, t) + Δt·m_l(z, t),
d̂(ẑ + Δẑ, t + Δt) = (1 − Δp)·d_l(z, t).    (4.1)
Here Δt (0 < Δt < 1) denotes the temporal position of the intermediate image plane relative to the stereo frames at time t, and Δp (0 < Δp < 1) describes the spatial position of the intermediate image plane relative to the left and right channels. In order to avoid holes (sampling positions to which no displacement vector can be assigned), each vector is also projected onto the adjacent sampling positions contained in a range given by ẑ + Δẑ. Conflicts can occur if a vector is already assigned to the current sampling position. Such conflicts are solved using the formula
d̂(ẑ + Δẑ, t + Δt) = d̂^(1) if ||d̂^(1)|| > ||d̂^(2)||, and d̂^(2) else.    (4.2)
Eq. (4.2) indicates that sampling positions representing world surface points close to the cameras are assumed to occlude sampling positions belonging to a background image area. Finally, the computation of the intermediate view is performed using a three-dimensional time- and space-variant filter based on a system already described in [S]. This filter is controlled by motion vectors as well as disparity vectors. Generally, it uses the four surrounding images (R_t^(l), R_t^(r), R_{t+1}^(l) and R_{t+1}^(r)) to generate the current intermediate view. Areas that become occluded in the left input images are extrapolated using only the luminance information
obtained from the right input images and vice versa. In the remaining areas, interpolation is performed using both channels.
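The projection step of (4.1) with the conflict rule (4.2) can be sketched in 1-D as follows. The sketch assumes a static scene (the motion term of (4.1) is dropped), rounding to the nearest sampling position stands in for the Δẑ spreading, and the function name is illustrative.

```python
def project_to_intermediate(d_l, dp, width):
    """Project left-to-right disparities onto an intermediate image plane
    at relative position dp (0 < dp < 1), following (4.1) without the
    temporal term. Conflicts are resolved in favour of the larger
    disparity magnitude, i.e. the point closer to the cameras (4.2)."""
    raw = [None] * width
    for z, d in enumerate(d_l):
        if d is None:
            continue
        zi = z + round(dp * d)            # position in the intermediate plane
        if 0 <= zi < width and (raw[zi] is None or abs(d) > abs(raw[zi])):
            raw[zi] = d                   # foreground point wins the conflict
    # assign the scaled residual disparity (1 - dp) * d_l, as in (4.1)
    return [None if d is None else (1 - dp) * d for d in raw]
```

Positions that receive no vector remain `None`; in the full system these correspond to the occluded areas handled by the extrapolation modes.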
4.2. Object-based image synthesis
The object-based approach assumes a convex object located in the center of the scene. This assumption is fulfilled by typical videoconferencing situations, where the head-and-shoulder part of a person in front of a uniform background or a previously recorded textured background forms the scene to be processed. The intermediate image plane is computed by a space-variant filter taking the disparity into account. The spatial filter coefficients depend on the fractional sampling position of the sampling value to be interpolated between existing sampling values of the scanning grid [S]. The disparity assignment to the intermediate image plane is performed by applying formula (4.1). The filter coefficients used to weight the spatially interpolated video signals of the left and right channels depend on the horizontal sampling position as well as on the position of the intermediate image plane. Due to the assumed object model, partly occluded image areas are implicitly known. The left side of the intermediate image is preferably reconstructed using only the left stereoscopic channel, and the right side using only the right stereoscopic channel. In a transition area the intermediate image is reconstructed using both stereoscopic channels. To realize this concept, first the object position in the intermediate image plane is estimated and recorded as a binary mask. After this, five different regions within the mask are defined, denoted I, II, III, IV and V. The part of the image to the left of the object is considered to belong to area I, while the area to the right of the object is attached to V (see Fig. 5). Within area I, the luminance values to be generated are extrapolated exclusively from the left image. Within area III, interpolation from both stereo images is performed using weight coefficients depending only on the position of the intermediate image plane.
Within area V, extrapolation from the right image is carried out. Areas II and IV are transition areas, in which interpolation is performed applying spatial weights depending on the horizontal sampling position to be interpolated and the position of the intermediate image plane.

Fig. 5. Definition of five different inter-/extrapolation modes for the image synthesis along any scan line.

The advantages of the object-based approach are its low complexity and its robust performance for typical videoconferencing scenes. Using a coherent disparity and motion analysis at the sender side, the condition of an object located in the center of the image plane can be fulfilled by using a camera-control system. However, for arbitrary scenes artefacts cannot be avoided in those occluded image areas which are not described by the underlying model.
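The five-region weighting along a scan line can be sketched as follows. The region borders, the linear ramp shape in the transition areas, and all names are illustrative assumptions; the paper specifies only that the weights depend on the horizontal position and the position Δp of the intermediate plane.

```python
def left_channel_weight(x, obj_left, obj_right, trans, dp):
    """Weight of the left channel for a pixel at column x, following the
    five-region scheme of Fig. 5 (the right channel gets 1 - weight):
      I   (x < obj_left - trans):   extrapolate from the left  -> 1
      II  (ramp into the object):   from 1 down to 1 - dp
      III (inside the object):      constant 1 - dp
      IV  (ramp out of the object): from 1 - dp down to 0
      V   (x > obj_right + trans):  extrapolate from the right -> 0
    """
    if x < obj_left - trans:                       # region I
        return 1.0
    if x < obj_left:                               # region II
        return (1 - dp) + (obj_left - x) / trans * dp
    if x <= obj_right:                             # region III
        return 1 - dp
    if x <= obj_right + trans:                     # region IV
        return (1 - dp) * (1 - (x - obj_right) / trans)
    return 0.0                                     # region V
```

For a central viewpoint (Δp = 0.5) the weight ramps from 1 through 0.5 inside the object down to 0, so each image side is reconstructed mainly from its own channel.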
5. Results of computer simulations

The performance of the presented methods has been tested with a set of natural stereoscopic test sequences. Internationally known sequences recorded within the framework of the European projects RACE-DISTIMA and ACTS-PANORAMA were used in all computer experiments. The sequences were recorded with stereo-camera baselines from 8.75 cm up to about 50 cm, convergence distances down to about 2 m and an image resolution of 720 × 576 pixels. The robustness of the presented methods is confirmed by processing the six natural stereoscopic sequences SAXO, PIANO, TRAIN, JARDIN, MANEGE and DISCUSSION, as well as the four stereoscopic sequences ANNE, MAN, CLAUDE and ROBERT, which represent typical videoconferencing situations. The baselines of the first six sequences vary between 8.75 and 40 cm, whereas the convergence distances vary between 2.8 and 13.2 m. In these stereo sequences the largest disparity vector can reach 50 pixels. For these types of sequences, intermediate images at any position can be synthesized with overall good image quality using the method described in Section 4.1.

Fig. 6. Detected homogeneous regions (displayed white) in the first left frame of the sequences SAXO (top) and PIANO (below).

Most of the videoconferencing sequences considered contain extremely large occluded image regions. These sequences have been recorded using camera baselines of 50 cm, and the distance between the conferencing person and the camera is approximately 1.2 m, so that the largest disparity vector can reach 115 pixels. Excellent overall image quality has been obtained in these cases if the intermediate images are synthesized applying the method introduced in Section 4.2. The videoconferencing sequences considered were chosen to fulfill the uniform-background assumption, which
simplifies the correspondence estimation and fits in with the proposed object-based image synthesis method. Although the algorithms have been successfully applied to all stereo sequences mentioned above, only selected results are reported, which together seem to represent the different situations encountered in natural images. The first group of experiments demonstrates the performance of the preprocessing module for similar parameter ranges and different scenes. In the procedure for homogeneous-region recognition three parameters can be adapted: the size of the window for the variance measure, the minimum area of a region that will be considered homogeneous, and the variance threshold for deciding whether a sampling position is declared an interest point. To maintain consistency and uniformity between the experiments, the window size for the variance measurement is fixed to 5 × 5 pixels and the minimum area of a homogeneous region is fixed to 800 pixels for all experiments. The variance threshold depends on the image structure and can be calculated automatically through a histogram analysis of the image luminance. In the experiments reported here this parameter varies between 16 and 32. Fig. 6 shows the extracted homogeneous regions from the first left frame of the sequences SAXO and PIANO. Four different resolution levels are used in the edge extraction routine. Each level is controlled by the parameter σ, which determines the width w of the central excitatory region of the LoG operator, defined as w = 2√2·σ. Moreover, the size of the convolution mask is defined as 3w. Using these relations, the four values of w are fixed to 7, 9, 11 and 13. The corresponding variances are shown in Table 1. For the sake of uniformity these four scale parameters were maintained for all performed experiments. The extracted zero-crossings from the first left frame of the sequence SAXO are shown in Fig. 7.
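The variance-based homogeneous-region detection described above can be sketched as follows, using the paper's 5 × 5 window (win = 2) and a threshold from the 16–32 range; the function name, the pure-Python layout, and the fixed threshold (in place of the histogram analysis) are illustrative.

```python
def classify_homogeneous(image, win=2, threshold=20.0):
    """Mark sampling positions whose local luminance variance falls below
    a threshold as homogeneous.

    image: 2-D list of luminance values; returns a 2-D list of booleans
    (True = homogeneous). Windows are clipped at the image borders.
    """
    h, w = len(image), len(image[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [image[j][i]
                    for j in range(max(0, y - win), min(h, y + win + 1))
                    for i in range(max(0, x - win), min(w, x + win + 1))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            mask[y][x] = var < threshold    # low variance -> homogeneous
    return mask
```

Positions left unmarked are the interest points to which the matching process is restricted; in the paper, connected homogeneous components smaller than 800 pixels would additionally be discarded.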
Hereby the edges extracted from the lowest resolution level (σ = 4.596) and the highest resolution level (σ = 2.475), as well as the finally reconstructed edges, are shown. Features have been extracted from the smoothed images using a 5 × 5 Gaussian window. Gradient intensity values which do not exceed the value two are reset to zero, and the resulting values are mapped to the range 0-255. The gradient directions are calculated using
Fig. 7. Detected zero-crossings in the first left frame of the sequence SAXO: using the lowest variance σ = 2.475 (top), using the highest variance σ = 4.596 (middle) and the finally reconstructed edges (below).
the original gradient intensity values. The obtained angles are mapped to even integer values, so that 180 different angles are considered. The edgeness and positive cornerness extracted from the first left frame of the sequence PIANO are shown in Fig. 8. All important parameters for the preprocessing are summarized in Table 1.

To test the performance of the hierarchical block-matching procedure, experiments with several measurement window sizes have been carried out. As expected, larger windows provide more accurate results, although good results are obtained even with windows of moderate size. All experiments reported here have been performed using the parameters shown in Table 2. For the sake of uniformity, the point that is best distinguished from its neighbors is selected as reference point for the matching in the global step. In order to evaluate the
Fig. 8. Extracted edgeness (top) and positive cornerness (below) from the first left frame of the sequence PIANO.
improvements introduced when area and feature matching are combined, correspondences estimated using only the MADI as similarity criterion are compared with correspondences estimated applying both MADI and MADF as described in Section 3.2.4. The reliability of the displacements is calculated using the function f introduced in Section 3.2.3. The ranges for the function values of f₁, f₂ and f₃ are 0-64, 0-64 and 0-64, respectively, whereas β₁ = 1, β₂ = 1 and β₃ = 2, so that any estimated vector can take reliability values between 0 and 256. The choice of these weights has been motivated by the observation that the curvature of the correlation surface is a more robust confidence measure than the other two geometrical measures used to define f.

Table 1
Parameter setting for the preprocessing module

Homogeneous regions recognition
Window size for interest operator: 5 × 5 pixel
Minimum area of a homogeneous region: 800 pixel
Threshold range for interest point decision: 16-32

Edges extraction
Variances: σ₁ = 2.475, σ₂ = 3.182, σ₃ = 3.889, σ₄ = 4.596

Feature extraction
Size of the Gaussian mask: 5 × 5 pixel
Edgeness and cornerness range: 0-255
Total amount of discrete angles: 180
Table 2
Sizes of measurement windows used by the hierarchical block-matching

Similarity criteria | Global step: window size / subsampling factor / total pixels considered | Local step: window size / subsampling factor / total pixels considered
MADI | 13 × 13 / 2 / 49 | 5 × 5 / 1 / 25
MADF | 5 × 5 / 2 / 9 | 3 × 3 / 1 / 9

The remaining parameters involved in the reliability measure and the cost function have been set as follows: n = 1, B₁ = ½||d_max||, B₂ = ¼||d_max||, C_g = 5 and C_l = 2. Hereby n is the exponent determining how fast the function g decreases, d_max the largest displacement encountered in the vector field considered, B₁ the upper bound of δ (see (3.3)), B₂ the upper bound of ε (see (3.1)), C_g the parameter C for the estimation of α in the global step and C_l the parameter C for the estimation of α in the local step. Most of these constants can be seen as empirical values, which have been fixed after several experiments. The computed correspondences after the global step for the first frame pairs of the sequences SAXO and JARDIN are shown in Figs. 9 and 10. The correspondences in Figs. 9(a) and 10(a) have been estimated applying only the MADI, whereas the correspondences in Figs. 9(b) and 10(b) are obtained using area and feature similarity criteria. The resulting matching can be seen to contain more errors in the first case than in the second one. Table 3 indicates the percentages of unreliable matchings in both cases. The numbers shown in Table 3 have been calculated from the average of estimated disparities over the whole sequences, setting α = 0 in (3.4), i.e., without the temporal smoothness term and taking into account only vectors estimated in the global step. A vector has been declared unreliable if its estimated reliability does not exceed 70% of the best reliability value in each frame pair. In Fig. 11 the reliability values corresponding to the estimated disparities for the tenth frame pair of the sequence SAXO are shown. High grey levels represent high reliability values, whereas low grey levels represent low reliability values. Note that, as expected, low reliability values have been assigned to occluded sampling positions.
Fig. 9. Correspondences estimated after the global step for the first frame pair of the sequence SAXO. Left and right stereo images are placed side by side and corresponding points are shown as white picks in each image: (a) using only the MADI as similarity criterion; (b) combining MADI and MADF as similarity criterion.
To demonstrate the effect of the temporal smoothness term in the cost function, results obtained applying the strategy described in Section 3.2.4 have been compared with those obtained by setting α = 0 in (3.4). In Figs. 12 and 13, images representing the horizontal component of dense disparity vector fields pointing from left to right are displayed. Hereby low grey levels represent large negative horizontal vector components, whereas high grey levels represent large positive horizontal vector components. According to this convention, a vector with horizontal component 0 is represented by the grey value 128. The vertical component of the disparity vectors is neglected in this representation. Fig. 12(a) shows the disparity field estimated for the tenth frame of the sequence
ANNE without smoothing and with the standard parameter setting mentioned above. Fig. 12(b) displays the results obtained for the same stereo images but applying (3.4) with smoothing, i.e., adapting α as described in Section 3.2.4. In both cases a constant disparity has been assigned to the background. Comparing these results, it is clear that the temporal smoothness term contributes considerably to their improvement. While the improvements due to the temporal smoothing process are significant inside objects, this strategy has little effect at the object borders. There, the assignment of correct disparities is carried out essentially by the edge-based interpolation strategy. Fig. 13(a) shows the disparity field estimated for the
Fig. 10. Correspondences estimated after the global step for the first frame pair of the sequence JARDIN. Left and right stereo images are placed side by side and corresponding points are shown as white picks in each image: (a) using only the MADI as similarity criterion; (b) combining MADI and MADF as similarity criterion.
Table 3
Percentage of unreliable matchings in the global step using only the MADI as similarity criterion and applying both MADI and MADF as described in Section 3.2.4

Sequence | Total blocks per frame | Total blocks intersecting text. regions | Unreliable vectors combining MADI and MADF | Unreliable vectors using only MADI
SAXO | 960 | 532 | 14 (2.6%) | 63 (11.8%)
JARDIN | 960 | 584 | 41 (7.0%) | 102 (17.5%)
frame 10 of the sequence SAXO before interpolation. Hereby the luminance value 255 has been assigned to the holes remaining after the matching process. In contrast, Fig. 13(b) shows the same disparity field after interpolation. Now the field appears smooth inside the different objects of the scene, while at the same time discontinuities due to depth differences between object edges are well preserved. This effect is clearly visible in the lower right corner of the SAXO scene, where three objects at different depth positions are shown.
Fig. 11. Reliability of estimated left to right disparity field for the tenth frame pair of sequence SAXO. High grey levels represent high reliability values, whereas low grey levels represent low reliability values.
Finally, some results illustrating the performance of the image synthesis method are given. Fig. 14 shows synthesized central viewpoints using the tenth frame pair of the sequences SAXO, PIANO, ANNE and JARDIN. The central viewpoints for the sequences SAXO, PIANO and JARDIN have been generated by applying the method described in Section 4.1, whereas the central viewpoint of ANNE has been computed using the method introduced in Section 4.2. In this representation the computed central viewpoint is displayed between the two original stereo images. In the synthesized central viewpoints of SAXO, PIANO and ANNE no artifacts are perceptible. In these cases, the dominance of homogeneous areas in occluded regions contributes to the excellent performance of the image interpolation methods. Regarding the results obtained for JARDIN, where only textured areas are occluded, some artifacts can be perceived. Especially in regions with large disparity discontinuities (see the background areas near the heads of the two persons in Fig. 14(d)) image degradation is perceptible. Nevertheless, the overall image quality can be considered good.
6. Summary and conclusions

A method for the joint estimation of motion and disparity has been introduced. Additionally, two different approaches for image synthesis according to different applications have been discussed. The presented research addresses the following known problems in motion and disparity analysis and image synthesis:
- Assignment of correct disparities in image areas containing low luminance variation.
- Correspondence estimation in occluded regions and preservation of discontinuities in the displacement maps due to depth differences between object edges.
- Temporally smoothed displacement fields and coherent estimation of motion and disparity.
- Temporally recursive enhancement of the displacement fields, taking into account temporally preceding vectors according to their reliability.
- Synthesis of intermediate views at any position for natural stereoscopic sequences and stereoscopic videoconferencing sequences including extremely large occluded areas.
Fig. 12. Horizontal component of left-to-right disparities estimated for the tenth frame pair of sequence ANNE. Low grey levels represent large negative horizontal vector components, whereas high grey levels represent large positive horizontal vector components: (a) without temporal smoothness term; (b) applying temporal smoothness term.
Fig. 13. Horizontal component of left-to-right disparities estimated for the tenth frame pair of sequence SAXO. Low grey levels represent large negative horizontal vector components, whereas high grey levels represent large positive horizontal vector components: (a) before interpolation (remaining holes are displayed as white patches); (b) dense field generated using the edge-based interpolation strategy.
The goal of estimating accurate and highly reliable displacement maps is achieved by an improved hierarchical block-matching method. The idea at the heart of the presented approach is to combine several strategies to predict and correct each calculated displacement, applying a suitable cost function and evaluating the reliability of each vector by testing several natural constraints. Additionally, two different image synthesis concepts are presented. The first approach is suitable for processing arbitrary scenes containing occluded regions of moderate size. It comprises the detection of occluded image areas in both stereo images in order to switch between different interpolation and extrapolation modes. The second method is an object-based approach, which assumes a convex object located in the center of the scene. This assumption is fulfilled by typical videoconferencing situations, in which the scene usually consists of the head and shoulders of a person in front of a uniform background or a previously recorded textured background. The goal of this method is to generate intermediate images with very good image quality in sequences with extremely large occluded areas
Fig. 14. Synthesized central viewpoint using the tenth frame pair of sequences (a) SAXO, (b) PIANO. Original left view (top), synthesized central viewpoint (middle) and original right view (bottom).
and keeping implementation costs low. The difficulty of this task is solved by adequately exploiting the assumptions mentioned above. The performance of the presented methods was tested in several computer experiments using natural stereoscopic sequences and sequences representing typical videoconferencing situations. The presented joint motion-disparity analysis and image synthesis system is shown to be capable of offering a realistic 3D impression with continuous motion parallax to TV audiences as well as to videoconferencing participants.
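The core of the intermediate-view synthesis summarised above can be illustrated with a simplified sketch: a virtual camera at position alpha between the left view (alpha = 0) and the right view (alpha = 1) is rendered by shifting each left pixel by a fraction of its left-to-right disparity and blending it with the corresponding right pixel. The function below is a minimal assumption-laden sketch; it omits the paper's occlusion detection and extrapolation modes and simply flags unfilled pixels as holes:

```python
import numpy as np

def intermediate_view(left, right, disp_lr, alpha=0.5):
    """Hypothetical disparity-compensated synthesis of a view at
    position alpha between the left and right cameras (greyscale images).
    Holes (typically occlusions) are reported via the 'filled' mask."""
    h, w = left.shape
    out = np.zeros((h, w), dtype=float)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = int(round(float(disp_lr[y, x])))
            xr = x + d                       # corresponding right-view column
            xi = int(round(x + alpha * d))   # column in the virtual view
            if 0 <= xr < w and 0 <= xi < w:
                out[y, xi] = (1 - alpha) * left[y, x] + alpha * right[y, xr]
                filled[y, xi] = True
    return out, filled
```

In the full system the unfilled pixels would be handled by the interpolation/extrapolation mode switching (first concept) or by the object-based assumptions (second concept) rather than left as holes.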
Fig. 14. Continued: (c) ANNE, (d) JARDIN.
Acknowledgements
The author would like to thank Manfred Ernst for valuable discussions and helpful comments, Jens R. Ohm and Thomas Sikora for proofreading a draft of this paper and for their valuable suggestions, and the project group VLTV at the signal processing department of the Heinrich-Hertz-Institute for creating an agreeable research environment. The CCETT provided some of the test sequences used; this is gratefully acknowledged. The research leading to this paper was supported by the
Ministry of Research and Technology of the Federal Republic of Germany, Grant No. 01BK304.