Signal Processing: Image Communication 19 (2004) 771–786
www.elsevier.com/locate/image

Mosaicking images with parallax

F. Dornaika, R. Chung

Department of Automation and Computer-Aided Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong

Received 19 November 2003

Corresponding author: R. Chung ([email protected]); Tel.: +852-2609-8339; fax: +852-2603-6002.
Abstract

In recent years, image mosaicking has been a focus of attention of many researchers and practitioners. Existing mosaicking methods are effective only in limited cases where the camera motion is almost a pure rotation or the viewed scene is planar or very far away. In this paper, we introduce a new methodology of image mosaicking for generic 3D scenes under general camera motions. What puts the mosaicking problem into a class different from those addressed in existing work is the parallax in the images. We develop two heuristics for performing the geometrical image transfer. The first uses a planar motion plus parallax together with a projective reconstruction. The second uses projective planar patches. We also provide a solution to the occlusion problem introduced by the parallax field. Examples of constructing successful mosaics from real images are shown. These experiments demonstrate the feasibility of the developed methods for registering images with parallax.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Image mosaicking; Image registration; Parallax; 3D projective reconstruction
1. Introduction and motivation

Image mosaicking has been the focus of attention of many researchers and practitioners in recent years. It is about merging a collection of images with significant overlap in their fields of view to obtain a new and larger image called the mosaic [7,9,11,23]. Image mosaicking allows one to create a picture of large field of view with the same high resolution obtained for small fields of view. It can be used in a wide variety of applications, including remote-sensed imaging, medical image analysis, virtual reality, and even video compression (as redundancy in the original collection of images is reduced).

There has been some previous work on image mosaicking. The simplest mosaics are created from images whose mutual displacements are image-plane translations. Other mosaics are created by using a parametric 2D transformation between the image frames, such as the affine transformation, the bilinear transformation, and the planar-projective transformation (planar homography).
In [18], the authors used affine parameters inferred from the optical flow; this 2D motion model is used to construct stills with a camera undergoing pan and zoom operations. A more general model given by a planar projective transformation is used by Mann and Picard [10] and Szeliski [17] in creating stills for planar scenes or for purely rotational camera motions. The planar-projective transformation (planar homography), in particular, correctly describes two cases: (i) an arbitrary 3D scene viewed by a camera at a fixed location, free to rotate about its center of projection (a pure camera rotation); (ii) a planar (or very distant) scene viewed by a camera free to move arbitrarily.

It would be more ideal, however, if a mosaic could be formed for a generic scene from a video captured by a hand-held camera undergoing free motion. In this general case, where the scene is arbitrary and the camera motion contains a translation, parallax is present in the images. Constructing a mosaic for such a case is challenging because there is no global transformation between the images that allows the images to be aligned. If 2D global parametric models are used to form the mosaic, the parallax effects are visible in the mosaic as a loss of information, blurring, or duplicated details (ghosting). While parallax is necessary to compute the 3D structure [8] and the epipolar geometry [1], in image mosaicking it creates difficulties. In comparison with the feature-transfer problem (transferring correspondences over two images to a third image), image mosaicking is also more involved, as it has to transfer the entire image, not just a small set of extracted features (on the boundary of a particular object, say).

There have been attempts to reduce the effects of motion parallax induced by camera translation through the use of interpolation techniques over optical flow [16,12]. However, optical flow can only be estimated reliably at places where the intensity gradient is strong, and such places are only sparsely available. In [13,9], the authors proposed a formulation that simultaneously recovers the full correspondence map and the 3D projective structure using pyramidal representations of the original images with a coarse-to-fine scheme. In other words, the formulation of [9,13] combines two problems into a more difficult one: (i) recovering the full correspondence map, and (ii) recovering the 3D projective coordinates of all overlapping pixels. It involves solving a highly complicated non-linear system with a huge number of unknowns, which is not always tractable.

In this paper, we address the problem of image mosaicking for a generic 3D scene under general camera motions. We conquer the parallax problem by the use of a third image. Although this may seem an excessive requirement, one must notice that with only two images (with parallax) the mosaic cannot be built, since there is no global 2D parametric transformation for the registration process (see the next section for an explanation). We perceive the difficulty of constructing mosaics in the most general case, i.e., for images with parallax, as twofold:
Image transfer: In the general case (a generic scene and an arbitrary camera motion), there is no global 2D transformation that can be used to transfer every pixel from one image to the other. In contrast, in the case of a planar scene (or a very distant scene, which can be regarded as planar to a good approximation) or a fixed viewpoint, the global transformation has a global parametric model (a 3×3 homography) which can be inferred from the overlapping areas of the images.

Occlusions: Since the scene is generic and the camera motion is arbitrary, surfaces visible in one image may be occluded in the other image (which is acquired from a different viewpoint). Image mosaicking has to take this problem into account; otherwise the resulting mosaic may contain significant blurring or inconsistent regions.
This article introduces two heuristics for image mosaicking. The first heuristic transfers the pixels of the to-be-registered image to a third image frame using an approximated planar-plus-parallax transformation. Then a 3D projective reconstruction (which is far easier to recover than Euclidean
structure) followed by a projection is used to register the two images. The second heuristic directly recovers the 3D projective structure of all pixels of the image to be registered, by approximating the scene as consisting of a number of small planar patches. Our overall goal is to have efficient and tractable methods that can overcome the parallax. The advantage of the developed approaches is their simplicity, since they use classical computer vision techniques, and pure rotational camera motion is no longer required.

The rest of the paper is organized as follows. Section 2 formulates the problem we focus on and explains why it is challenging. Section 3 gives some background about the projective mappings associated with three images. Section 4 shows how to register two images with parallax using an approximated 2D mapping together with 3D projective reconstruction (the first heuristic). Section 5 shows how to perform the registration using the mapping induced by a number of small planar patches assumed for the scene (the second heuristic). Section 6 provides a solution to the occlusion problem. Section 7 presents experimental results using real image data. Section 8 presents a performance study using synthetic data.
2. Problem formulation

Given two images of a general scene taken from two different viewpoints, we aim at constructing a mosaic of both images. In the sequel, image points are represented by their homogeneous coordinates (a 3-vector). Let R and t be the rotational and translational components of the motion from the first to the second camera coordinate frame. Let K denote the internal parameters of the camera (a 3×3 upper-triangular matrix). For any pair of matches (pixel or feature), $p_1$ and $p_2$, we have the following equation [14,3] ($\simeq$ stands for equality up to a scale factor):

$$p_2 \simeq \underbrace{K R K^{-1} p_1}_{\text{planar}} + \underbrace{\frac{1}{z}\, K t}_{\text{parallax}}, \qquad (1)$$

where z is the Euclidean depth of the corresponding 3D feature. Since $K R K^{-1}$ is the homography at infinity, denoted $H_\infty$, and $K t$ represents the second epipole $e_2$, (1) can be written as

$$p_2 \simeq H_\infty\, p_1 + \frac{1}{z}\, e_2.$$

The 2D motion between corresponding features or pixels can thus be decomposed into two components (1): (i) planar and (ii) parallax. Note that this decomposition can be done with respect to any arbitrary plane $\Pi$ (real or virtual) in the environment [8]. The parallax is the image projection of the deviation of the 3D features from the chosen plane:

$$p_2 \simeq H_\Pi\, p_1 + k\, e_2, \qquad (2)$$
where the scalar k can be considered as the projective depth of the point $p_1$, i.e., its parallax; the parallax is then defined with respect to this plane.

2.1. Why a global mapping is not available

While the planar transformation can be computed by choosing a physical or virtual plane in the scene, the parallax component depends on both the camera translation and the individual depth of the considered pixel. From (2), one notices that knowledge of the correspondence ($p_1$, $p_2$) and knowledge of the scalar k are equivalent, in the sense that knowing one yields the other. However, in general one cannot build an accurate full correspondence map between two arbitrary images, owing to occlusions, depth discontinuities, large disparities, and the presence of regions with uniform intensity. Therefore, the parallax component is not known for the majority of pixels, even though it is known for some of them (the feature correspondences). As a result, an arbitrary pixel in the first image cannot be transferred into the second image frame even when both the planar transformation and the epipolar geometry associated with the two images are known. In other words, one cannot use (2) to register the two images, since the parallax field is not known.
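To make the role of the parallax scalar concrete, the following is a minimal numerical sketch (ours, not from the paper; the function name and all values are illustrative) of the transfer in Eq. (2) when H, $e_2$, and k are all known for a pixel. Dense registration fails precisely because k is available only at the sparse feature matches.

```python
import numpy as np

def transfer_plane_plus_parallax(p1, H, e2, k):
    """Transfer a homogeneous pixel p1 to the second view via Eq. (2):
    p2 ~ H_Pi p1 + k e2.  This needs the per-pixel projective depth k,
    which is known only at sparse feature matches."""
    p2 = H @ p1 + k * e2
    return p2 / p2[2]          # normalize the homogeneous 3-vector

# Toy example with a made-up homography, epipole, and parallax value.
H = np.array([[1.0, 0.02, 5.0],
              [0.01, 1.0, -3.0],
              [1e-5, 0.0, 1.0]])
e2 = np.array([0.3, 0.1, 1e-3])    # epipole in the second image
p1 = np.array([120.0, 80.0, 1.0])  # pixel in the first image
print(transfer_plane_plus_parallax(p1, H, e2, k=0.7))
```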
2.2. Why it is not possible to make mosaics from just two images

Fig. 1 illustrates two images (from two different viewpoints) whose overlapping areas contain the projection of a 3D surface, denoted by $S_1$ in the first image and $S_2$ in the second image. Pixels in the vicinity of the edges (or textured features) in $S_1$ can possibly be put into correspondence with their matches in $S_2$, since the intensity profile provides useful information about individual pixels. However, for pixels far away from those edges, one has difficulty in establishing correspondence, since the intensity gradients there are very small or null. Moreover, in the absence of a global transformation, pixels in the non-overlapping area cannot be transferred from the first image to the second image, and vice versa. This is akin to the fact that one cannot recover 3D information from a single image. It follows that a mosaic cannot be built from two images (with parallax) alone. In summary, image mosaicking is more involved than feature transfer (a well-known problem in computer vision).
For these reasons, most previous work on image mosaicking has used a planar homography, assuming that the parallax field is null. As a result, when parallax is present, blurring or duplicated details are visible in the resulting mosaic, especially over those parts of the scene that are close to the camera. In the following, we propose two heuristics to overcome the parallax problem without having to explicitly compute the disparity field between the original images. Both methods are based on the use of 3D projective reconstruction.
Fig. 1. Mosaicking two images with parallax is an ill-posed problem.

3. Background and methodology

Since in the general case (an arbitrary scene and an arbitrary motion) it is impossible to construct a 2D parametric transformation between two images, a mosaic built from the two images alone is out of the question (see the preceding section). We therefore make use of a third image, which we refer to as the intermediate image. We are thus left with three images (see Fig. 2), which we name as follows. The target image is the image that we wish to register. The reference image is the image to which the target image is registered. The intermediate image is any other image that has an overlapping area with the target image; ideally it should feature the entire non-overlapping area associated with the target image and the reference image. Note that there are two parallax fields: (i) the one associated with the pair target image–reference image (which we are to overcome), and (ii) the one associated with the pair target image–intermediate image.
3.1. Projective mappings

Consider the three images in Fig. 2. It is well known that uncalibrated images can be related to some projective space [3,6]; thus our three images are related to an arbitrary projective space. Each image has a 3×4 projective mapping that maps the 3D projective space onto the image plane. Let M, M′, and M″ be the projective mappings associated with the target image, the intermediate image, and the reference image, respectively. These three matrices can be inferred from feature (point) matches, as follows.
Fig. 2. Our goal is to register the target image into the reference image. The corresponding camera motion is arbitrary and may contain a significant translation. A third image, the intermediate image, is used to aid the image transfer.

Let F be the fundamental matrix between the target image and the intermediate image, and $e'$ their epipole in the intermediate image; both can be computed from feature matches in the two images. It is well known that a solution for the mappings M and M′ is given by [21]

$$M \simeq (I \mid 0), \qquad M' \simeq \big(S(e')\,F + e'\,\mathbf{w}^{\mathsf{T}} \;\mid\; w\,e'\big)$$

for some 3-vector $\mathbf{w}$ and a non-zero scale $w$. Here I is the 3×3 identity matrix, and $S(e')$ is the skew-symmetric matrix associated with the 3-vector $e'$. Once M and M′ are determined, the 3D projective coordinates of all feature matches present in the target image and the intermediate image can be recovered by a simple triangulation. The third mapping M″ is then obtained by imposing that some reconstructed 3D points reproject as closely as possible to their matches in the reference image frame; this can be carried out using linear equations, given at least 6 feature matches across the three images. Therefore, computing the three projective mappings requires that (i) we have at least 7 matches between the target image and the intermediate image (for F, M, and M′), and (ii) we have at least 6 matches across the three images (for M″). To obtain such matches we have used the method developed by Zhang et al. [22].
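As a concrete illustration of this construction, here is a sketch in Python/NumPy (our code, with hypothetical function names) of the camera pair of [21], the linear triangulation, and the linear recovery of M″ from 3D–2D matches. It assumes homogeneous pixels normalized so that the third coordinate is 1; a robust implementation would add data normalization.

```python
import numpy as np

def skew(v):
    """S(v): skew-symmetric matrix such that skew(v) @ x == np.cross(v, x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def camera_pair(F, e_prime, w_vec=np.ones(3), w=1.0):
    """One valid projective camera pair for a fundamental matrix F:
    M ~ (I | 0),  M' ~ (S(e')F + e' w_vec^T | w e').  Any 3-vector w_vec
    and non-zero scale w give an equally valid projective frame."""
    M = np.hstack([np.eye(3), np.zeros((3, 1))])
    M_prime = np.hstack([skew(e_prime) @ F + np.outer(e_prime, w_vec),
                         (w * e_prime).reshape(3, 1)])
    return M, M_prime

def triangulate(M, M_prime, p, p_prime):
    """Linear (DLT) triangulation of one match (p, p'); both pixels are
    homogeneous 3-vectors with last entry 1.  Returns the 4-vector of
    projective coordinates, defined up to scale."""
    A = np.vstack([p[0] * M[2] - M[0],
                   p[1] * M[2] - M[1],
                   p_prime[0] * M_prime[2] - M_prime[0],
                   p_prime[1] * M_prime[2] - M_prime[1]])
    return np.linalg.svd(A)[2][-1]

def resection(X_list, p_list):
    """Linear recovery of M'' from n >= 6 projective points X (4-vectors)
    and their pixels p (homogeneous, last entry 1) in the reference image."""
    rows = []
    for X, p in zip(X_list, p_list):
        rows.append(np.concatenate([np.zeros(4), -X, p[1] * X]))
        rows.append(np.concatenate([X, np.zeros(4), -p[0] * X]))
    return np.linalg.svd(np.asarray(rows))[2][-1].reshape(3, 4)
```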
4. First method: approximated 2D mapping between the target and intermediate images

Our first method relies on the following fact: for any pixel of the image to be registered (the target image), if we know the 2D location of its correspondence in the intermediate image, we are able to transfer this pixel to the reference view using a projective reconstruction followed by a projection, i.e., using the three projective matrices M, M′, and M″. We could, of course, apply existing techniques to recover the full correspondence map between the target image and the intermediate image; however, for the reasons mentioned in Section 2, there is no guarantee that the obtained disparity field will be correct for all pixels. Therefore, we use the following heuristic.

Since the pair target image–intermediate image has its own (unknown) parallax, transferring pixels from the target image to the intermediate image for the entire frame is as difficult as transferring those pixels to the reference image (our goal). Indeed, if we denote by $p_i$ an arbitrary pixel of the target image, the 2D location of its correspondence in the intermediate image is given by

$$p'_i \simeq H\,p_i + d_i\,e', \qquad (3)$$

where H is the planar motion associated with the reference plane (in our implementation, this plane is chosen as the average plane associated with all feature correspondences in the target image and the intermediate image), and $e'$ represents the epipole in the intermediate image. Both H and $e'$ are recoverable from feature matches in the target and intermediate images. Since the parallax $d_i$ is not necessarily known for all pixels, (3) cannot be used to transfer the target image pixels to the intermediate image frame. Therefore, our heuristic consists of setting the parallax $d_i$ to an approximated value $\hat d_i$; we stress the fact that the exact value cannot be recovered, for the reasons mentioned above. Transferring the target pixels to the intermediate image is thus performed using the following equation:

$$\hat p'_i \simeq H\,p_i + \hat d_i\,e', \qquad (4)$$

where $\hat d_i$ is set in one of two ways:

1. First case ($\hat d_i = 0$): This is a good approximation if the distance between the centers of projection associated with the target image and the intermediate image is small compared to the scene depth. One might object that this approximation ($\hat p'_i \simeq H p_i$) would mean the camera motion between the target and intermediate images is a pure rotation, and a pure camera rotation would not allow any 3D notion (Euclidean or projective) about the scene to be recovered from the images. However, the approximation, as a planar homography, also covers the case where there is considerable translation between the images but the scene is planar. We can always take the latter interpretation in introducing the mapping, approximating the scene by a plane (the reference plane) that best fits the available feature correspondences. This kind of approximation is very similar to the one used with affine cameras. After all, our goal here is only to obtain an approximation of the transformation between the target image and the intermediate image.

2. Second case ($\hat d_i$ = closest parallax): This value can be set to the parallax associated with the closest pair of feature matches (recall that the parallax is known at every matched feature in an image pair).

In practice, we found that the first case provides smoother images than the second, which can be explained by the sparseness of the matched features. Once $\hat p'_i$ is obtained, the 3D projective coordinates of the 3D point that projects onto the pixel $p_i$ follow easily from the two projective matrices M and M′ (a simple triangulation). Then, the 2D
location in the reference image (mosaic frame) is computed by projecting the 3D point using the matrix M″. One may notice that since an approximated value $\hat p'_i$ has been used, some error is introduced when the mosaic is built. However, the effect of this error is considerably reduced at the final stages (pixel warping, weighted intensity averaging). Moreover, we believe that an appropriate choice of the intermediate image (i.e., the baseline between the intermediate image and the target image should be a small fraction of that associated with the pair target image–reference image) makes the registration error (target–reference) much smaller than the one obtained with a registration based on a global transformation between these two images, which ignores their parallax field. We point out that the transfer of pixel matches ($p_i$, $\hat p'_i$) into the reference frame could also be carried out by the trifocal tensor associated with the three images instead of the three projective mappings [5,14,15]. The debate over whether one should use the three projective mappings or the trifocal tensor is beyond the scope of this paper.
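The whole first heuristic then reduces to a few lines. Below is a sketch (our code; `transfer_first_heuristic` and the inlined triangulation helper are hypothetical names) for the $\hat d_i = 0$ case: a homography warp into the intermediate frame, a linear triangulation against (M, M′), and a projection by M″.

```python
import numpy as np

def _triangulate(M1, M2, p1, p2):
    """Linear (DLT) triangulation; p1, p2 homogeneous with last entry 1."""
    A = np.vstack([p1[0] * M1[2] - M1[0], p1[1] * M1[2] - M1[1],
                   p2[0] * M2[2] - M2[0], p2[1] * M2[2] - M2[1]])
    return np.linalg.svd(A)[2][-1]

def transfer_first_heuristic(p, H, M, M_prime, M_dprime):
    """First heuristic with d_i ~ 0 (Eq. (4)):
    1) map the target pixel into the intermediate frame with the homography,
    2) triangulate its 3D projective point from (M, M'),
    3) project it into the reference (mosaic) frame with M''."""
    p_int = H @ p                       # approximate match in intermediate image
    p_int = p_int / p_int[2]
    X = _triangulate(M, M_prime, p, p_int)
    q = M_dprime @ X                    # transferred pixel in the mosaic frame
    return q / q[2]
```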
Fig. 3. Transferring original pixels from the target image to the mosaic frame (the reference frame). Each original pixel, represented by the square (a, b, c, d), is mapped into the mosaic frame in two successive steps. First, it is mapped into the intermediate image frame using the approximated transformation (the transferred pixel is represented by the quadrilateral a′, b′, c′, d′). Second, using the reconstruction and projection processes (the three projective mappings M, M′, M″), we obtain the warped pixel (a″, b″, c″, d″) in the mosaic frame. The warped pixel is then digitized in the mosaic frame.
4.1. Image transfer/warping

This section describes how one can obtain a warped version of the target image, expressed in the reference image frame. Once the three projective mappings associated with the three images, as well as the planar motion between the target image and the intermediate image, are determined, registering the target image into the reference image becomes possible. The warping process of a pixel belonging to the target image is illustrated in Fig. 3.

4.2. Dealing with holes

Since the warped version of the original image (the target image) is obtained by forward mapping, some holes may appear within it; that is, some mosaic cells are left with no source pixel in the target image to impose their gray values. In our experiments, we have found that holes are usually isolated or tend to form curves of one-pixel width in the warped image. We address the problem of holes as follows. For each hole in the mosaic frame, we examine its 8 neighboring cells. The cells are checked pair by pair along 4 directions (the horizontal, the vertical, and the two diagonals). The direction that holds the two non-hole cells with the most similar gray values is chosen, and the gray value of the hole is set to the interpolated value of the gray values of the two chosen cells. This minimizes the blurring effect in the final mosaic; it is equivalent to a smoothing operation along a direction perpendicular to the intensity gradient vector in the original image.
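The following sketch (ours; the function name and the boolean-mask convention for holes are assumptions) implements this directional interpolation for a single-channel mosaic stored as a 2D array.

```python
import numpy as np

def fill_holes(mosaic, hole_mask):
    """Fill isolated holes: among the 4 directions through the
    8-neighborhood (horizontal, vertical, two diagonals), pick the
    direction whose two non-hole neighbors have the most similar gray
    values, and set the hole to their average, i.e., interpolate
    perpendicular to the intensity gradient."""
    out = mosaic.astype(float).copy()
    dirs = [((0, -1), (0, 1)), ((-1, 0), (1, 0)),
            ((-1, -1), (1, 1)), ((-1, 1), (1, -1))]
    H, W = mosaic.shape
    for y, x in zip(*np.nonzero(hole_mask)):
        best = None
        for (dy1, dx1), (dy2, dx2) in dirs:
            y1, x1, y2, x2 = y + dy1, x + dx1, y + dy2, x + dx2
            if 0 <= y1 < H and 0 <= x1 < W and 0 <= y2 < H and 0 <= x2 < W \
               and not hole_mask[y1, x1] and not hole_mask[y2, x2]:
                diff = abs(out[y1, x1] - out[y2, x2])
                if best is None or diff < best[0]:
                    best = (diff, (out[y1, x1] + out[y2, x2]) / 2.0)
        if best is not None:
            out[y, x] = best[1]
    return out
```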
4.3. Mosaic construction

The mosaic is constructed by blending two images: the warped version of the target image and the reference image. For the overlapping area, an appropriate combination can be used to avoid brightness discontinuities. To this end, we have down-weighted the blending coefficients of the warped image at locations in the peripheral region of the overlapping area. Note that the two original images may have different illumination even when they feature the same 3D details. More detailed treatments of image blending can be found in [2,4,20].
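The paper does not specify the exact weighting; a common realization, sketched below under that assumption (our code, using SciPy's `distance_transform_edt` for the distance map), feathers the warped image so that its blending weight decays toward the periphery of its support.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def feather_blend(reference, warped, warped_mask):
    """Blend the warped target image into the reference frame.
    warped_mask is True where the warped image has valid pixels.  The
    warped image's weight grows with the distance to the border of its
    support, so the seam at the periphery of the overlap fades out.
    (Pixels covered by only one image should simply copy that image;
    this sketch handles the overlap region.)"""
    w = distance_transform_edt(warped_mask)
    w = w / (w.max() + 1e-9)               # normalize weights to [0, 1]
    return (1.0 - w) * reference + w * warped
```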
5. Second method: approximating the scene as a piecewise planar scene

In the previous section, we showed that the transfer of the target image pixels is done using an estimate of their 2D locations in the intermediate image frame. In this section, we propose an alternative: directly estimating the 3D projective coordinates of the target image pixels, without computing their 2D locations in the intermediate image frame. To this end, the scene viewed by the target image is approximated by a set of planar patches in the projective space.
Fig. 4. The scene viewed by the target image is approximated by a set of planar patches. This set is constructed from the triangle matches between the target image and the intermediate image.
5.1. Piecewise planar scene

We consider the feature matches in the target image and the intermediate image. Such features can be clustered into a set of triangles in the respective images according to their spatial configuration. Building the triangles can be done in two different ways: (i) using a Delaunay triangulation, or (ii) clustering features into triangles using nearest neighbors. Qualitatively, in our experiments, both triangulation methods gave the same mosaic. Each triangle, together with its correspondence in the other image, defines a 3D plane in the projective space (see Fig. 4). The equation of this plane can be inferred from the projective coordinates of the three vertices that compose the triangle (these are computed at the stage of projective reconstruction). At this point, we have a set of planar patches $\Pi_j$, $j = 1, \dots, m$, where m is the number of patches in the target image (roughly the number of feature matches minus two). Each plane is parameterized by a 4-vector $r_j$ such that

$$r_j^{\mathsf{T}} x = 0, \qquad (5)$$

where the 4-vector x represents the 3D projective coordinates (up to a scale factor) of any point belonging to the plane $\Pi_j$.
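Given the projective coordinates of a triangle's three vertices (from the reconstruction stage), $r_j$ is simply the null vector of the matrix that stacks them. A sketch (our helper name):

```python
import numpy as np

def patch_plane(X1, X2, X3):
    """Plane r_j through three reconstructed projective points (Eq. (5)):
    r_j spans the null space of the 3x4 matrix stacking the points, so
    r_j^T X_i = 0 holds for each vertex of the triangle."""
    A = np.vstack([X1, X2, X3])   # each X_i is a homogeneous 4-vector
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                 # right singular vector of the smallest singular value
```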
5.2. Geometrical transfer

Knowing the projective mappings M and M″, as well as the projective representation $r_j$ of all the planar patches, we can transfer any pixel p from the target image into the reference image. We proceed as follows (see Fig. 5).

First, a planar patch is chosen: among all triangle barycenters, we choose the one closest to the pixel p. The pixel p is then considered as belonging to the planar patch associated with this triangle. Note that if the Delaunay triangulation is used, the chosen triangle is the one containing the pixel p.
Fig. 5. Transferring the original pixel p. First, the closest triangle is chosen. Second, a 3D point is reconstructed in the projective space by intersecting the line of sight with the projective plane associated with the selected triangle; this yields the point P. Third, this 3D point is projected using the mapping M″; this yields the transferred pixel p″.
Second, the 3D projective coordinates of the scene feature that projects onto p are computed by intersecting the line of sight with the chosen planar patch. Let P be the resulting 3D point. To compute the 3D projective coordinates of $P = (x, y, z, 1)^{\mathsf{T}}$ we solve the equations

$$p \simeq M\,P, \qquad r_j^{\mathsf{T}} P = 0,$$

which provide three linear equations in $P = (x, y, z, 1)^{\mathsf{T}}$. Since $M = (I \mid 0)$, solving for $(x, y, z, 1)^{\mathsf{T}}$ is very simple. Third, the 2D location of the transferred pixel p″ in the reference frame (mosaic frame) is recovered by projecting the 3D point P into the reference frame as $p'' \simeq M'' P$.

One may notice the difference between our heuristic (which uses three images) and a heuristic that uses homographies between the two images being registered: the registration obtained with the latter has a large error, since planar homographies are available for the overlapping region only.
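Because $M = (I \mid 0)$, this intersection has a closed form: for $p = (u, v, 1)^{\mathsf{T}}$, the sight line is $(x, y, z)^{\mathsf{T}} = \lambda p$, and the patch constraint fixes $\lambda$. A sketch of the per-pixel transfer (our code and names):

```python
import numpy as np

def transfer_second_heuristic(p, r, M_dprime):
    """Second-heuristic transfer of a homogeneous pixel p = (u, v, 1).
    With M = (I | 0), points on the sight line are (x, y, z) = lam * p;
    substituting into r^T (x, y, z, 1) = 0 gives
        lam * (r1*u + r2*v + r3) + r4 = 0,
    so lam = -r4 / (r1*u + r2*v + r3).  The 3D point P is then projected
    by M'' into the mosaic frame."""
    lam = -r[3] / (r[:3] @ p)       # intersection of sight line and patch
    P = np.append(lam * p, 1.0)     # 3D projective point (x, y, z, 1)
    q = M_dprime @ P
    return q / q[2]
```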
Remark. One might be concerned about the mapping of pixels that are far away (by tens of pixels) from the closest barycenters; their transfer into the mosaic frame probably suffers from more error. However, we argue that since such pixels generally belong to nearly uniform regions, their accurate registration is not crucial. It is a fact, though, that the heuristic adopted here is best suited for images with rich texture, where more features can be extracted and matched.
6. Dealing with occlusions

Since the scene is not necessarily planar and the camera motion may contain a translation, some parts of the scene may be hidden in one image but visible in the other. Such occlusions (more precisely, occlusions–disocclusions) have to be dealt with when the image transfer described in the previous sections is performed; if not, they may cause blurring or inconsistent image information in the final mosaic. It is well known [19] that given a moving camera one can infer the 3D shape (up to a scale factor) of the observed scene even when the camera is uncalibrated. One way to do so is to compute the epipolar geometry between the two views; from it, we can recover the camera motion (the rotation, and the translation up to a scale factor), and hence the scaled depth of any correspondence. Our strategy for overcoming the occlusion problem is therefore the following.
First step: From feature matches between the two views, we recover the camera motion (the rotation and the translation up to a scale factor).

Second step: We detect the mosaic pixels where occlusions occur by examining whether they have more than one original pixel, widely separated in the original image.

Third step: Consider a mosaic pixel where occlusion occurs, and all its original pixels. All these candidates (the original pixels) belong to
the same epipolar line. We only need to retain two candidates on the epipolar line: the one closest to the epipole and the one farthest from it, which correspond to the 3D points closest to and farthest from the camera in the second view (the mosaic view). The candidate with the minimum scaled depth is then chosen and imposes its gray value. It is worth noting that the depth information can be bypassed if one knows the order of the camera centers.
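Operationally, this strategy amounts to a depth buffer over the forward-mapped pixels: per mosaic cell, keep the candidate with the smallest scaled depth. A sketch under that reading (our code; the data layout is an assumption):

```python
import numpy as np

def build_mosaic_layer(coords, grays, depths, shape):
    """Forward mapping with occlusion handling: for each mosaic cell keep,
    among all source pixels that land on it, the one with minimum scaled
    depth (the surface closest to the camera occludes the others).
    coords: (x, y) mosaic-frame positions of the warped source pixels;
    grays: their gray values; depths: their scaled depths from (R, t)."""
    mosaic = np.full(shape, np.nan)          # NaN marks still-empty cells
    best_depth = np.full(shape, np.inf)
    for (x, y), g, z in zip(coords, grays, depths):
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < shape[0] and 0 <= xi < shape[1] and z < best_depth[yi, xi]:
            best_depth[yi, xi] = z
            mosaic[yi, xi] = g
    return mosaic
```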
7. Experimental results

In this section, we provide an experimental evaluation of our two image mosaicking methods (see Sections 4 and 5). We have used images of indoor scenes, and the image-matching software of [22] to obtain matches between pairs of images. The image data used in the experiments shown are all benchmark data (for vision applications) downloadable from the INRIA web site ``ftp://krakatoa.inria.fr/pub/''.

7.1. First method

In the first experiment, we used three 512×512 images taken by the same camera (see Fig. 11). The planar homography as well as the first two projective mappings were inferred from 175 matches between the target image and the intermediate image. The third projective mapping was recovered from 78 matches across the three images. Both the homography and the third mapping were estimated using linear algorithms. Using the first method described in Section 4, a mosaic was built from the target and reference images. Fig. 12 displays the resulting mosaic, of size 635×727.

In the second experiment, the planar homography as well as the first two projective mappings were inferred from 167 matches between the target image and the intermediate image. The third mapping was recovered from 95 matches across the three images. The resulting mosaic is of size 565×668 (see Fig. 13). Feature correspondences
between the target and reference images are shown with white crosses.

Since the developed approach allows significant translation between images, we have also used it for constructing mosaics from stereo images. Fig. 14 displays the result of mosaicking a stereo pair (the right image was registered into the left one). In this example, the warped image has some distortions, because the parallax associated with the intermediate image was not small; the quality of the obtained mosaic is still acceptable. Fig. 15 displays the results of mosaicking two stereo images using two different methods. The top shows the mosaic constructed using a planar homography inferred from features belonging to the wall plane; the bottom shows the mosaic constructed using our method. Note that the mosaic obtained with the homography contains some misregistrations: for example, the closet as well as the vertical rectangle are misregistered. These misregistrations do not occur in the mosaic obtained with our method.

7.2. Second method

The fifth experiment dealt with a difficult case in which the 3 views were widely separated; the mutual baseline was about 40 cm. Fig. 16 displays the mosaic obtained using the projective planar patches method (Section 5). The top of this figure displays the original images (two stereo images). Fig. 17(a) displays the triangles inferred using nearest neighbors, and (b) displays the inferred Delaunay triangulation.
8. Performance study and method comparison

In this section, we aim at quantifying the transfer error between images, since this error determines to a large extent the quality of the final mosaic. The transfer error is simply the Euclidean distance in the image plane between the actual location of a 2D feature and its computed location under a given transfer approach.
8.1. Global mapping vs. projective reconstruction-based transfer

We study the transfer error associated with two different transfer methods: (i) the global-mapping-based transfer (used in most existing work), and (ii) the projective reconstruction-based method (the second heuristic, described in Section 5). Since we are studying the geometric transfer using synthetic data, two images (i.e., two camera positions) are considered at a time. The following framework has been built:

1. Nominal values for the intrinsic parameters of a perspective camera are provided. For the sake of simplicity, the image plane is not bounded. The camera matrix is set to

$$K = \begin{pmatrix} 1500 & 0 & 256 \\ 0 & 1500 & 256 \\ 0 & 0 & 1 \end{pmatrix}.$$

2. Three different objects are synthesized: (i) a two-plane surface (the angle between the two planes is 45°), (ii) an elliptic cylindrical surface, and (iii) a spherical surface. Each object is represented by 200 3D points uniformly distributed on its surface. For the curved surfaces (cylindrical and spherical), only the half of the surface facing the camera is considered. The size of all three objects, as well as their depth variation, is roughly the same. Moreover, they are placed at the same depth (100 cm) from the camera position. The axis of symmetry of all surfaces is parallel to the vertical axis of the camera (first position). Note that the projection of the 3D points onto the first camera position provides the first image, which is fixed for each object.

3. Nominal camera translations are then set. The direction of all translations is fixed, but the translation norm increases, giving an increasing baseline. The maximum translation vector is set to $(30\ \text{cm}, 7\ \text{cm}, 0)^{\mathsf{T}}$.

4. The camera is then moved along each translation. For each translation (baseline), the camera coordinate system is rotated 1000 times at random. There is no rotation about the optical axis; only the pan and tilt angles vary randomly in the interval $[-45°, 45°]$. For each baseline and each random rotation, the 200 3D points are projected onto the corresponding image plane, giving the second image. We assume that all points are visible in the obtained image as well as in the first image. Uniform noise within the interval $[-0.5\ \text{pixels}, 0.5\ \text{pixels}]$ is then added to the nominal 2D coordinates of all points in both images.

5. For each random rotation, the 200 noisy 2D points are transferred from the first image to the second image using two different techniques: (i) 2D global mapping based on a homography (inferred from the 200 correspondences), and (ii) 3D projective reconstruction. Once the points are transferred, we compute the average transfer error over the 200 points (the ground truth is known for all points and all images). This average error is then averaged over the 1000 random rotations (1000 random images), giving an estimate of the transfer error for each baseline.

We study the transfer error as a function of the baseline and of the object shape. Figs. 6–8 display the average transfer error as a function of the baseline for the three shapes (two-plane surface, elliptic cylindrical surface, and spherical surface), respectively.
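For reference, the following self-contained sketch (ours) reproduces a simplified slice of this framework for the two-plane surface at the maximum baseline: it fits a homography by an unnormalized DLT to the 200 noisy correspondences and reports the mean transfer error. It omits the random rotations and data normalization, so the exact numbers will differ from the averaged curves reported below.

```python
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[1500.0, 0.0, 256.0],
              [0.0, 1500.0, 256.0],
              [0.0, 0.0, 1.0]])

# 200 points on a two-plane surface (dihedral angle 45 deg), ~100 cm away.
n = 200
s = rng.uniform(-20.0, 20.0, n)             # coordinate along the fold (cm)
u = rng.uniform(0.0, 20.0, n)               # coordinate away from the fold (cm)
half = rng.integers(0, 2, n).astype(bool)   # which of the two planes
X = np.stack([s,
              np.where(half, u * np.cos(np.pi / 4), 0.0),
              100.0 + np.where(half, u * np.sin(np.pi / 4), -u)], axis=1)

def project(R, t, X):
    """Pinhole projection of Nx3 points; returns Nx2 pixel coordinates."""
    x = (K @ (R @ X.T + t[:, None])).T
    return x[:, :2] / x[:, 2:]

def fit_homography(x1, x2):
    """Unnormalized DLT fit of a 3x3 homography to point correspondences."""
    rows = []
    for (a, b), (c, d) in zip(x1, x2):
        rows.append([0, 0, 0, -a, -b, -1, d * a, d * b, d])
        rows.append([a, b, 1, 0, 0, 0, -c * a, -c * b, -c])
    return np.linalg.svd(np.asarray(rows, float))[2][-1].reshape(3, 3)

t = np.array([30.0, 7.0, 0.0])                  # maximum baseline (cm)
x1 = project(np.eye(3), np.zeros(3), X)         # first (fixed) image
x2 = project(np.eye(3), t, X)                   # second image (no rotation here)
x1n = x1 + rng.uniform(-0.5, 0.5, x1.shape)     # +/- 0.5 pixel uniform noise
x2n = x2 + rng.uniform(-0.5, 0.5, x2.shape)

H = fit_homography(x1n, x2n)
p = (H @ np.column_stack([x1n, np.ones(n)]).T).T
err = np.linalg.norm(p[:, :2] / p[:, 2:] - x2, axis=1)
print(f"mean homography transfer error: {err.mean():.1f} pixels")
```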
Fig. 6. Average transfer error associated with the two-plane surface using the two transfer methods (homography: solid; projective reconstruction: dotted). Axes: average transfer error (pixels) versus baseline (%).
Fig. 7. Average transfer error associated with an elliptic cylindrical surface using the two transfer methods. Axes: average transfer error (pixels) versus baseline (%).
Fig. 9. The transfer error averaged over the three shapes.
Fig. 8. Average transfer error associated with a spherical surface using the two transfer methods. Axes: average transfer error (pixels) versus baseline (%).
The solid curves correspond to the global mapping, the dotted ones to the projective reconstruction-based transfer. In these figures, the 100 percent value corresponds to the translation vector $(30\ \text{cm}, 7\ \text{cm}, 0)^{\mathsf{T}}$. Fig. 9 displays the mean curve over the three shapes. At the full baseline of $(30\ \text{cm}, 7\ \text{cm}, 0)^{\mathsf{T}}$, the global-mapping transfer has an average transfer error of 11 pixels for the two-plane surface, 15 pixels for the elliptic cylindrical surface, and 25 pixels for the spherical surface.
Fig. 10. Average transfer error associated with an elliptic cylindrical surface using the first heuristic (solid curve) and the second heuristic (dotted curve). Axes: average transfer error (pixels) versus baseline (%) between the target and intermediate images.
It can be observed that, unlike the global-mapping transfer, the transfer based on projective reconstruction depends on neither the baseline nor the object shape (the associated error is roughly 1 pixel).

8.2. First heuristic vs. second heuristic

In another experiment adopting the same framework described above, we compare the two developed heuristics (see Sections 4 and 5).
Fig. 11. The three images used in the first experiment.
For this purpose, we used the cylindrical object and fixed the baseline between the target image and the reference image at the 100 percent value, i.e., $(30\ \text{cm}, 7\ \text{cm}, 0)^{\mathsf{T}}$. Fig. 10 displays the average transfer error as a function of the baseline between the target image and the intermediate image, expressed as a percentage of the baseline separating the target image from the reference image. The solid curve corresponds to the first heuristic, the dotted one to the second heuristic. As can be seen, the first heuristic has acceptable accuracy only for small baselines between the target image and the intermediate image. In practice, this baseline should be chosen as a small fraction of the baseline separating the two images to be registered (Figs. 11–17).
Fig. 12. The obtained mosaic in our first experiment.
Fig. 13. The obtained mosaic in our second experiment.
9. Discussion and future work

In this paper, we focused on image mosaicking for a general scene under arbitrary camera motion, a problem hardly addressed in the mosaicking literature since it is ill-posed. The camera motion is not limited to pure rotation; the translation component can be non-zero. We demonstrated that under some conditions mosaicking images with parallax is possible, and that the resulting mosaics can be of acceptable quality. We provided solutions to two key problems encountered in such a case: (i) the lack of a global geometrical transformation between the images being registered, and (ii) the occlusion problem. As a result, the developed approach does not suffer as much from misregistrations arising from the presence of parallax.

To solve the problem of geometrical image transfer, we proposed two heuristics that use 3D projective reconstruction. The advantage of these heuristics is their simplicity (conceptual and algorithmic). Instead of computing the full correspondence maps associated with the original images together with the 3D projective structure (their simultaneous or sequential estimation is inefficient, if not infeasible), our methods combine the use of feature matches with classical transformations associated with three different images (two of
Fig. 14. The obtained mosaic from a stereo pair (third experiment).
them are the images to be registered). Thus, the developed approaches have the simplicity of feature transfer yet are able to transfer the entire image. The first heuristic transfers the image to be registered into the frame of an intermediate image: all pixels of the original image are mapped into the third frame (the intermediate frame) using a planar homography plus an approximated parallax, and are then transferred into the other original image (the mosaic frame) using 3D projective reconstruction followed by a projection. In the second heuristic, the entire scene viewed by the image to be registered is segmented into planar patches in the projective space, and the geometrical transfer of the original pixels is performed using projective mappings between the projective structure and the image plane. This heuristic is best used with images having rich texture.

The developed methods can be used to register images taken by the same camera or by different cameras; we point out that, in general, the second heuristic is more efficient. Examples of constructing successful mosaics from real images (monocular and stereo) demonstrate that the developed methods are efficient and reliable for registering images with parallax; they also show that the proposed methods construct mosaics where other methods fail. It is worth noting that if more than two images need to be registered, the process can be repeated with the mosaic image and the remaining images. It is true that the developed methods require an image additional to the two original images, but such an image is generally available if the images come from a video sequence; after all, a mosaic cannot be built from just two images if they have parallax.

The developed methods can also be used to generate new views from existing ones. In our study, since the mosaic is planar and corresponds to a physical viewpoint, the camera motions are not without restrictions. For example, consider a camera that rotates around a person's face. At first, the view includes the right profile,
then the frontal face, and finally the left profile. A planar mosaic would bring no additional information, since it reduces to a synthesized image of the face associated with a certain viewpoint. To display all facial details in a single frame, one should use other warping techniques that allow severe distortions.
Acknowledgements

The work described in this paper was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (RGC Ref. No. CUHK4114/97/E). It was also partially supported by the CUHK Postdoctoral Fellowship Scheme (CUHK Ref. No. 98/15/ERG).

References

[1] B. Boufama, R. Mohr, Epipole and fundamental matrix estimation using virtual parallax, in: IEEE International Conference on Computer Vision, 1995.
[2] P.E. Debevec, C.J. Taylor, J. Malik, Modeling and rendering architectures from photographs: a hybrid geometry- and image-based approach, in: Computer Graphics (SIGGRAPH), August 1996.
[3] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, The MIT Press, Cambridge, MA, 1993.
[4] S.J. Gortler, R. Grzeszczuk, R. Szeliski, M.F. Cohen, The lumigraph, in: Computer Graphics (SIGGRAPH), August 1996.
[5] R.I. Hartley, Lines and points in three views and the trifocal tensor, Internat. J. Comput. Vision 22 (2) (1997) 125–140.
[6] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, 2001.
[7] S. Hsu, H.S. Sawhney, R. Kumar, Automated mosaics via topology inference, IEEE Comput. Graph. Appl. 22 (2) (2002) 44–54.
[8] R. Kumar, P. Anandan, K. Hanna, Direct recovery of shape from multiple views: a parallax based approach, in: IEEE International Conference on Pattern Recognition, 1994.
[9] R. Kumar, P. Anandan, M. Irani, J. Bergen, K. Hanna, Representation of scenes from collections of images, in: IEEE Computer Society Workshop: Representation of Visual Scenes, June 1995.
[10] S. Mann, R. Picard, Virtual bellows: constructing high quality stills from video, in: First IEEE International Conference on Image Processing, 1994.
[11] C. Morimoto, R. Chellappa, Fast 3D stabilization and mosaic construction, in: IEEE Conference on Computer Vision and Pattern Recognition, June 1997.
[12] B. Rousso, S. Peleg, I. Finci, A. Rav-Acha, Universal mosaicing using pipe projection, in: IEEE International Conference on Computer Vision, January 1998.
[13] H.S. Sawhney, S. Ayer, M. Gorkani, Model-based 2D & 3D dominant motion estimation for mosaicing and video representation, in: Fifth IEEE International Conference on Computer Vision, 1995.
[14] A. Shashua, Algebraic functions for recognition, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 779–789.
[15] A. Shashua, M. Werman, On the trilinear tensor of three perspective views and its underlying geometry, in: IEEE International Conference on Computer Vision, June 1995.
[16] H.S. Shum, R. Szeliski, Construction of panoramic mosaics with global and local alignment, Internat. J. Comput. Vision 36 (2) (2000) 101–130.
[17] R. Szeliski, Image mosaicking for tele-reality applications, in: Second IEEE Workshop on Applications of Computer Vision, 1994.
[18] L. Teodosio, W. Bender, Salient video stills: content and context preserved, in: ACM International Conference on Multimedia, 1993.
[19] J. Weng, T.S. Huang, N. Ahuja, Motion and Structure from Image Sequences, Springer, Berlin, 1993.
[20] G. Wolberg, Digital Image Warping, IEEE Computer Society Press, Los Alamitos, CA, 1990.
[21] Z. Zhang, Determining the epipolar geometry and its uncertainty: a review, Internat. J. Comput. Vision 27 (2) (1998) 43–76.
[22] Z. Zhang, R. Deriche, O. Faugeras, Q.-T. Luong, A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry, Artificial Intell. J. 78 (October 1995) 87–119.
[23] J.Y. Zheng, S. Tsuji, Panoramic representation for route recognition by a mobile robot, Internat. J. Comput. Vision 9 (1992) 55–76.