Future Generation Computer Systems 21 (2005) 1223–1234
A simple model generation system for computer graphics
Minh Tran∗, Amitava Datta, Nick Lowe
School of Computer Science and Software Engineering, The University of Western Australia, Perth, WA 6009, Australia
Available online 18 May 2004
Abstract

Most 3D objects in computer graphics are represented as polygonal mesh models. Though techniques like image-based rendering are gaining popularity, a vast majority of applications in computer graphics and animation use such polygonal meshes for representing and rendering 3D objects. High quality mesh models are usually generated through 3D laser scanning techniques. However, even inexpensive laser scanners cost tens of thousands of dollars, and it is difficult for researchers in computer graphics to buy such systems just for model acquisition. In this paper, we describe a simple model acquisition system built from web cams or digital cameras. This low-cost system gives researchers an opportunity to capture and experiment with reasonably good quality 3D models. Our system uses standard techniques from computer vision and computational geometry to build 3D models.
© 2004 Elsevier B.V. All rights reserved.

Keywords: 3D reconstruction; Object modeling; Delaunay triangulation; Intensity-based matching
∗ Corresponding author.
E-mail addresses: [email protected] (M. Tran), [email protected] (A. Datta), [email protected] (N. Lowe).
doi:10.1016/j.future.2004.04.009

1. Introduction

Polygonal mesh models are widely used in computer graphics for representing and rendering complex 3D objects. The surface of a 3D object is usually represented as a triangulated mesh in these models. While most users and researchers in computer graphics routinely use such models, it is often difficult for them to acquire their own models according to their specific requirements. The most popular model acquisition system is the 3D laser scanner, which is usually an expensive device. Even an inexpensive laser scanner may cost tens of thousands of dollars. Hence, it is difficult for researchers in computer graphics to buy such systems for model acquisition. Most researchers depend on the models available from a few research labs, such as the Stanford 3D scanning repository [17] and the Georgia Tech large model archive [6], where high quality models have been acquired through laser scanning. However, this is sometimes too restrictive, since researchers do not have the freedom to experiment with specific models according to their requirements. In many cases, it is necessary to experiment with models with specific topological features for designing efficient data structures and techniques in computer graphics.

A major research area within computer vision is stereo reconstruction from images taken from
monocular (multiple images with a single camera) or polynocular (single images with multiple cameras) views [8]. Three-dimensional reconstruction from multiple images attempts to simulate human perception of our 3D world from disparate two-dimensional images. A realistic representation of objects from camera images can provide an inexpensive means of object or scene modeling compared to specialized hardware such as laser scanners.

1.1. Applications

Retrieving 3D information from image sequences can be used to guide unmanned vehicle exploration through unknown or dangerous terrain, allowing humans to avoid environments that pose ballistic or atmospheric hazards. Automatic object modeling has well-known applications, including the construction of 3D polygonal models for computer graphics applications, which is the focus of this paper. These polygonal models can be used in the construction of virtual worlds to be manipulated on screen or in interactive immersive environments [13]. Industry can also benefit from structure recovery, which enables the volumes of mine lorries and the structure of buildings [3] to be estimated from multiple views.

1.2. Overview

In this paper, we discuss a simple model acquisition system built from web cams and digital cameras that can be used as a low-cost alternative for model generation. This system gives researchers an opportunity to capture and experiment with reasonably good quality 3D models. Our system employs standard techniques from computer vision and computational geometry to build 3D models. Once a camera captures an object, we can retrieve the lost dimension of depth by identifying and exploiting specific scene information in the image. Analyzing areas or features whose characteristics are preserved among disparate views can establish point correspondences between successive images. Matching identical points between images relies on the exploitation of image information including camera parameters, intensity, edge pixels, lines and regions. Camera parameters give information regarding the transformation or projection of the object from the 3D world onto a 2D image.
Knowledge of these parameters improves the accuracy and efficiency of point matching; the parameters can be determined before or during the matching process. Next, we produce a cloud of 3D points by computing depth values from matched pairs of pixels. An appropriate visualization technique is then employed to establish connectivity within the unstructured point cloud and retrieve the object's surface.

The rest of the paper is organized as follows. We discuss related work in Section 2. Our methodology is described in Section 3. The results obtained from our system are discussed in Section 4, along with an example. Finally, we conclude with some remarks and possible future work in Section 5. A preliminary version of this paper can be found in [18].
2. Previous work

We are able to retrieve depth by identifying which points from the two views correspond to the same point in 3D space. Matching identical points between images requires the exploitation of image characteristics such as intensity, orientation and features that uniquely describe corresponding pixels. Intensity-based and feature-based matching are the two most popular approaches.

In intensity-based matching methods, pixels are matched based on the similarity between areas of pixels. This similarity is measured according to some criterion such as the mean square error (MSE) [10] or the sum of squared differences (SSD) [14] between corresponding pixel intensity values in each area. The matching window is the window comparison with the minimum MSE or SSD value, that is, the window that is the closest match in terms of intensity pattern. To reduce the number of false matches, a threshold on the MSE or SSD should be introduced so that the algorithm does not accept a match that is not actually there. Koschan and Rodehorst [10] compare intensity values of colored images, so that three intensity values are correlated instead of one (the gray value), resulting in a 20–25% improvement in matching precision. However, this leaves more data to be processed, so they formulated a parallel algorithm for chromatic block matching suitable for real-time applications.
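To make the window-based comparison concrete, the following sketch (Python with NumPy, our choice of illustration language) scores candidate windows along a scanline by SSD and rejects weak matches with a threshold. The function name, the per-pixel threshold and the assumption of rectified, row-aligned images are ours and are not taken from [10] or [14].

```python
import numpy as np

def ssd_match(left, right, row, col, win=5, max_disp=32, thresh=500.0):
    """Disparity of pixel (row, col) in `left`, found by minimising the sum of
    squared differences (SSD) over a (2*win+1)^2 window in `right`.
    Returns None when the best SSD exceeds thresh per pixel (likely no match)."""
    h, w = left.shape
    if not (win <= row < h - win and win <= col < w - win):
        return None
    patch = left[row - win:row + win + 1, col - win:col + win + 1].astype(float)
    best_d, best_ssd = None, np.inf
    for d in range(max_disp + 1):                      # search along the same scanline
        c = col - d
        if c - win < 0:
            break
        cand = right[row - win:row + win + 1, c - win:c + win + 1].astype(float)
        ssd = np.sum((patch - cand) ** 2)
        if ssd < best_ssd:
            best_ssd, best_d = ssd, d
    return best_d if best_ssd <= thresh * patch.size else None
```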
The Okutomi and Kanade [14] approach calculates the sum of SSD (SSSD) of the intensity values of a pre-processed (bandpass filtered) image. This method exhibits improved precision and reduced ambiguity because the SSSD is calculated over multiple baselines with respect to inverse depth rather than disparity. The improvement is particularly evident in the presence of a periodic intensity function caused by a pattern in the image (for example, a picket fence). One of the disadvantages of intensity-based matching is that it is computationally expensive compared to its feature-based counterpart, since a match is attempted for potentially every pixel in the image.

The first step of feature-based stereo is feature extraction. This approach exploits the homologous characteristics of multiple images. The advantage this approach has over intensity-based stereo is that the amount of computation is reduced to matching extracted features instead of every pixel. However, because of this, the output of this approach is a sparse depth map, which may result in incomplete surface reconstruction. The coarse-to-fine strategy matches prominent features first at a low resolution and uses these results to guide the matching process progressively up to a fine-grained level [11]. Convolving the image with the Laplacian of a Gaussian filter of varying standard deviations and finding the zero-crossings extracts edges at various resolutions, resulting in a pyramid of edge maps. Due to the use of the second derivative operator, this approach amplifies the presence of noise. However, the gradual coarse-to-fine movement remedies this problem: more reliable edges are matched first, and this helps distinguish between false edges introduced by noise and legitimate ones at finer levels.

A multistage feature-based stereo method formulated by Nakayama et al. [12] is based upon the coarse-to-fine method but gives higher priority to reliable matching. Edges are extracted, similarly to the coarse-to-fine method, via zero-crossings. The contrast of edge points, or the sharp change in intensity gradient, is a measure of reliability; therefore, high contrast edges have high accuracy (that is, they are more likely to be correct) and are matched first. Setting the contrast threshold at the highest value isolates only a limited number of distinct edges for matching. The threshold is then successively lowered and the additional edges that are revealed are matched, until the contrast threshold reaches
its lowest limit. Edges are matched according to the similarity of contrast and direction, and the search area is restricted to epipolar lines. This approach is an effective way to reduce ambiguity when matching, but Nakayama et al. restrict their approach to cameras with parallel centers of projection.

The accuracy of stereo matching methods dictates the success of the reconstruction process, and in real-time applications speed may also become an important factor. From the two popular methods described above, it is apparent that there is a trade-off between speed, precision and completeness. Intensity-based matching is computationally expensive but provides a precise, dense depth map, whereas feature-based matching requires less computation but produces a potentially incomplete sparse depth map. For this reason, hybrid methods have been devised that incorporate both stereo matching techniques and improve the speed, precision and completeness of the correspondences matched [15].

Beardsley et al. [1] and Pritchett and Zisserman [15] discuss the problem of extracting a 3D model from multiple stereo images. They use multiple cameras and focus mainly on reconstructing outdoor scenes. Outdoor scenes contain a variety of information that can lead to ambiguous matches. Therefore, a more robust method is needed to remove outlier matches from the target object matches. Both Beardsley et al. and Pritchett and Zisserman use trifocal tensors to refine their matching process. In this paper, we concentrate on extracting 3D models of indoor objects. Hence, our techniques are considerably simpler, since our images are limited to local objects captured in front of a plain background.
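As a rough sketch of the coarse-to-fine edge extraction described above, the code below builds a pyramid of zero-crossing edge maps from Laplacian-of-Gaussian filtered images using SciPy. The scales and function names are illustrative assumptions, not the settings used in [11] or [12].

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def zero_crossings(img):
    """Boolean map of sign changes between horizontally or vertically
    adjacent pixels of a Laplacian-filtered image."""
    zc = np.zeros(img.shape, dtype=bool)
    zc[:, :-1] |= np.signbit(img[:, :-1]) != np.signbit(img[:, 1:])
    zc[:-1, :] |= np.signbit(img[:-1, :]) != np.signbit(img[1:, :])
    return zc

def edge_pyramid(img, sigmas=(8.0, 4.0, 2.0, 1.0)):
    """Edge maps from coarse (large sigma) to fine (small sigma); matches found
    at a coarse level would then guide the search at the next finer level."""
    img = img.astype(float)
    return [zero_crossings(gaussian_laplace(img, sigma=s)) for s in sigmas]
```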
3. Our methodology

In this section, we discuss the techniques we have used for implementing the model acquisition system. Most of these techniques are taken from the existing computer vision and stereo vision literature, with modifications to suit our needs. For standard techniques in stereo vision, see the books by Hartley and Zisserman [8], Faugeras [5], and Horn [9].

The quality of the object surface representation is dependent on the accuracy of depth retrieval. It is possible to reconstruct an object from a single image [4], but this
may involve identifying converging parallel lines and absolute conics [8]. Using multiple images of a scene in the reconstruction process can improve the accuracy of depth retrieval. Identifying matching pixels in a stereo pair allows that object point to be pinpointed in 3D space. This is important due to the progressive nature of this approach: the output obtained from one stage is used as the input of the next, so any errors that occur during one stage are accumulated and carried into the execution of the remaining steps.

When a camera captures an object, a projective transformation from 3D space to 2D takes place. A typical model of the camera is displayed in Fig. 1. In Fig. 1, an object point in 3D space, (X, Y, Z), is mapped onto the image plane along the line joining (X, Y, Z) and the camera's center of projection, P. The image point p coincides with the position where this ray intersects the image plane, (x, y) in Fig. 1. The coordinates of the point p(x, y) can be found through a simple geometric calculation: x = Xf/Z and y = Yf/Z, where f is the focal length of the camera. The image coordinates (u, v) of p can be described in terms of x and y such that u = uc + (x/pw) and v = vc + (y/ph), where pw is the pixel width, ph the pixel height, and (uc, vc) is the principal point formed by the perpendicular projection of the camera center, P, onto the image plane. The mapping of the object from 3D space to 2D can be expressed in homogeneous coordinates as s[u, v, 1]^T = K[X, Y, Z, 1]^T. The intrinsic camera calibration matrix K can
be represented as

K = \begin{bmatrix} f/p_w & 0 & u_c & 0 \\ 0 & f/p_h & v_c & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}    (1)
and s is an arbitrary scalar value. The images acquired in our experiments are assumed to have negligible lens distortion; therefore, no rectification techniques were applied.

The focal point P is located at a perpendicular distance from the image center, i.e., the principal point (uc, vc). This distance is known as the focal length f. The principal point, focal length and P are collectively referred to as the intrinsic parameters of the camera, along with the effective pixel size and any radial distortion coefficients of the lens. Intrinsic parameters are characterized by their invariance to the camera's position and orientation. On the other hand, extrinsic parameters depend on the camera's orientation relative to the world reference frame and consist of a rotation matrix and a translation vector. Object points such as (X, Y, Z) in Fig. 1 are described with respect to some world reference frame. These points must be transformed to coincide with the camera axes, which is achieved by a rotation followed by a translation. The transformation of a 3D point to a 2D image coordinate can now be represented as x = K[R|t]X, where K represents the intrinsic camera parameters, R is a 3 × 3 rotation matrix, and t a 3 × 1 translation vector. The relationship between the projections of the object from world coordinates into image coordinates is contained in the camera parameters, which can be solved for via various camera calibration techniques. Cameras can be calibrated actively using calibration targets or passively from image correspondences.
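A minimal sketch of this projection model, assuming the 3 × 3 intrinsic block of Eq. (1) combined with the extrinsics as x = K[R|t]X; the numeric values in the example are placeholders, not our calibration results.

```python
import numpy as np

def intrinsic_matrix(f, pw, ph, uc, vc):
    """Intrinsic calibration matrix K (the 3 x 3 block of Eq. (1))."""
    return np.array([[f / pw, 0.0,    uc],
                     [0.0,    f / ph, vc],
                     [0.0,    0.0,    1.0]])

def project(K, R, t, X):
    """Project Nx3 world points X to pixel coordinates via x = K[R|t]X."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])   # homogeneous 4-vectors
    C = K @ np.hstack([R, t.reshape(3, 1)])         # 3x4 camera matrix
    x = (C @ Xh.T).T
    return x[:, :2] / x[:, 2:3]                     # divide out the scale s

# Placeholder values only: a 6 mm lens, 10 micron pixels, principal point (320, 240).
K = intrinsic_matrix(f=6e-3, pw=1e-5, ph=1e-5, uc=320, vc=240)
uv = project(K, np.eye(3), np.zeros(3), np.array([[0.1, 0.2, 2.0]]))
```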
Fig. 1. A typical camera model: a point of an object in 3D space, (X, Y, Z), is projected onto the viewing plane at image point p by a viewing ray that converges at the center of projection, P.
3.1. Camera calibration

A typical calibration method uses a calibration target. Depth is inferred by finding image points and using the known camera parameters to solve for (X, Y, Z). Camera calibration requires the known 3D positions of specified pixels on the calibration target, and their respective image coordinates, to solve linearly for C, the 3 × 4 camera calibration matrix composed of K[R|t].
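One standard way to solve linearly for C is the direct linear transform (DLT). The sketch below is a generic formulation assuming at least six known target points; it is not necessarily the exact calibration routine used in our implementation.

```python
import numpy as np

def calibrate_dlt(world_pts, image_pts):
    """Estimate the 3x4 camera matrix C, up to scale, from >= 6 known 3D target
    points (Nx3) and their measured image coordinates (Nx2)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)        # least-squares solution is the singular
    return Vt[-1].reshape(3, 4)        # vector of the smallest singular value
```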
Fig. 2. Two views of an object possess a unique geometric relationship expressed by the epipolar constraint. The plane formed by the two centers of projection and the 3D object point of interest is called the epipolar plane. The intersections of this plane with the viewing planes are the images' respective epipolar lines.
In this paper, the cameras are calibrated actively using a calibration target. This enables a metric reconstruction and also the recovery of the epipolar geometry.

The retrieval of the camera parameters unlocks the epipolar geometry between two images, which can be exploited during the matching process. The relationship between two images and the 3D object is illustrated in Fig. 2. The image epipoles, e and e′, are located at the intersections of the baseline joining the cameras' optical centers with the image planes. An epipolar plane is formed by the cameras' centers of projection and the 3D point of interest, X in Fig. 2. The epipolar line is the intersection of this plane with the image plane. From Fig. 2, it is clear that a point in one image corresponding to a point in 3D space has its matching point contained in the epipolar line in the second image. Therefore, the epipolar constraint not only reduces the match search from the image area to a line, but also improves the robustness of the matching process. However, in 3D reconstruction the value of X is unknown; it is what we are trying to determine. Given the point x, we want to find its corresponding point x′ in the other image. These two values can be used to pinpoint X in 3D space. We identify the epipolar line through the fundamental matrix, described by Hartley and Zisserman [8] as the algebraic representation of epipolar geometry.

3.2. The fundamental matrix
The fundamental matrix encapsulates the relationship between image points in a stereo pair. This relationship is represented by the equation x′^T F x = 0, where x′ is the correspondence of x. In accordance with projective geometry, the dot product between a point x′ located on a line l′ and l′ is 0. Thus, x′ · l′ = x′^T l′ = x′^T F x = 0, resulting in l′ = Fx, where l′ is the epipolar line. Conversely, l = F^T x′. Therefore, the epipolar line in the second image, l′, corresponding to a point in the first image, x, can be identified through the fundamental matrix F, and vice versa.

The camera projects a point in 3D space into 2D image coordinates, and the fundamental matrix can be described in terms of camera matrices. In projective geometry, the cross product of two points returns the line containing the two points. Therefore, l′ = e′ × x′ = [e′]_× x′, where e′ is the epipole and [e′]_× is the 3 × 3 skew-symmetric matrix of e′. Let e′ = (e′_1, e′_2, e′_3); then [e′]_× is defined as

[e']_\times = \begin{bmatrix} 0 & -e'_3 & e'_2 \\ e'_3 & 0 & -e'_1 \\ -e'_2 & e'_1 & 0 \end{bmatrix}    (2)

The cross product of the two vectors e′ and x′ can be expressed as e′ × x′ = [e′]_× x′ = (e′^T [x′]_×)^T. If H represents the homography mapping x to x′ such that x′ = Hx, we obtain the following relationship: l′ = [e′]_× Hx = Fx, and thus F = [e′]_× H. The fundamental matrix can be obtained by identifying point correspondences (x′^T F x = 0) or from the camera matrices (F = [e′]_× C′C^+, where C^+ is the pseudo-inverse of C). See Hartley and Zisserman [8] for full proofs. Since F will be used in the matching process, that is, to find x and x′, F is derived from the camera matrices found from calibration.
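A small sketch of deriving F from the two camera matrices via F = [e′]×C′C+, following the textbook formulation in [8]; the helper names are ours.

```python
import numpy as np

def skew(e):
    """The 3x3 skew-symmetric matrix [e]_x of Eq. (2)."""
    return np.array([[0.0, -e[2], e[1]],
                     [e[2], 0.0, -e[0]],
                     [-e[1], e[0], 0.0]])

def fundamental_from_cameras(C1, C2):
    """F = [e']_x C' C^+ for 3x4 camera matrices C1 (unprimed) and C2 (primed)."""
    _, _, Vt = np.linalg.svd(C1)
    centre1 = Vt[-1]                    # camera 1 centre = null vector of C1
    e2 = C2 @ centre1                   # its image in camera 2 is the epipole e'
    return skew(e2) @ C2 @ np.linalg.pinv(C1)
```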
3.3. Point matching

Exploitation of the epipolar geometry, through the estimation of the fundamental matrix, ideally enables point correspondences to be established robustly while improving the efficiency of each search. A stereo pair of pixels enables the recovery of depth through triangulation. Referring back to Fig. 2, given a point x in one image and that camera's center P, we want to find its corresponding match x′ in the other image. Casting a ray from P′ through any point on l′ will intersect the line formed by P and x. Identifying the wrong pixel along l′ gives the wrong depth value for X in 3D space, so the point matching method must be chosen carefully to prevent as many potential mismatches as possible.

Two popular approaches to stereo matching are the intensity-based and feature-based methods. Typically, feature-based methods are robust to significant changes of viewpoint and depth disparity, but are ineffective in matching areas of smoothly changing intensity values. On the other hand, intensity-based methods work well in textured areas, but large depth discontinuities along with changes of viewpoint can lead to mismatches. Since the depth values of our objects typically do not vary drastically, and in the hope that dense matches will improve structure reconstruction, we use a cross-correlation intensity-based matching method to locate point correspondences.

Cross-correlating matches between image pairs improves the accuracy of the matches identified. Pixels are matched according to the similarity, or correlation, between neighborhood intensity values. To improve the accuracy of the matching process, once a match is found in the second image, a corresponding match is searched for back in the first image (hence the name cross-correlation). If the match identified in the first image is different from the initial pixel, the match is rejected. This was a common occurrence when the algorithm attempted to match featureless areas in our experiments. The user determines the neighborhood size, or window size, centered at the point of interest. A region W1 in image 1 is matched with a region W2 in image 2 according to their correlation coefficient, given by

c(W1, W2) = 2 cov(W1, W2) / (var(W1) + var(W2))    (3)
This measure is referred to by Sara [16] as the modified normalized cross-correlation, which has the advantage of tending to zero when similarly textured areas have different contrasts. The correlation coefficient c(W1, W2) ranges between 0 and 1; the higher the coefficient, the higher the correlation between the two regions W1 and W2. To preserve the structure of the object surface, edge pixels are matched first (using the Canny edge detector [2]) and then point correspondences between the image pairs are searched for at regular intervals.
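The coefficient of Eq. (3) can be evaluated per candidate window as in the sketch below; the acceptance threshold (e.g., the 0.8 value mentioned in Section 4) and the cross-check back into the first image are left to the caller, and the function name is our own.

```python
import numpy as np

def mncc(w1, w2):
    """Modified normalised cross-correlation (Eq. (3)) between two equally
    sized windows; it tends to zero when similar textures differ in contrast."""
    w1 = w1.astype(float).ravel()
    w2 = w2.astype(float).ravel()
    v1, v2 = w1.var(), w2.var()
    if v1 + v2 == 0.0:                  # both windows featureless
        return 0.0
    cov = np.mean((w1 - w1.mean()) * (w2 - w2.mean()))
    return 2.0 * cov / (v1 + v2)
```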
The search is conducted along the pixel's corresponding epipolar line to improve the accuracy and efficiency of each matching attempt. Accepting only matches above a certain threshold, and imposing the similarity constraint that a pixel and its match must have similar intensity values, further improves the accuracy of the matching process.

Accuracy dictates the success of these stereo matching methods, which in turn affects the outcome of the 3D reconstruction. Problems such as ambiguity, occlusion, noise and discontinuity contribute to false matches between images, thus reducing the accuracy of the matching implementation. These problems can be remedied by the use of constraints to refine and improve the matching process, including similarity, uniqueness, continuity, ordering and epipolar geometry. See Faugeras' book for more information on each of these constraints [5]. Errors not corrected at this stage will propagate, and quite possibly be amplified, through successive stages, resulting in an inaccurate reconstruction.

3.4. Depth inference

Since a pair of corresponding points back-projects to the same point in 3D space, its depth can be approximated through geometric triangulation. Linear inference of depth is employed since a metric reconstruction is assumed. The relationship between the common point (X, Y, Z) in 3D space and the point correspondences (x1, y1) and (x2, y2) is expressed homogeneously as

si [xi, yi, 1]^T = Ci [X, Y, Z, 1]^T,   i = 1, 2    (4)
where C1, C2 and s1, s2 are the camera calibration matrices and scale factors of image 1 and image 2, respectively. Each camera calibration matrix Ci is a 3 × 4 matrix. The (X, Y, Z) values can be obtained by substituting C1 and C2 into Eq. (4) and solving. The resulting structure is a cloud of points in 3D space representing the surface of the object.
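A minimal sketch of this step: eliminating the unknown scale factors si in Eq. (4) and solving the resulting homogeneous system by SVD, one correspondence at a time. The function name is ours; the formulation is standard linear triangulation.

```python
import numpy as np

def triangulate(C1, C2, p1, p2):
    """Recover (X, Y, Z) from a correspondence p1 = (x1, y1), p2 = (x2, y2),
    given the 3x4 camera matrices C1 and C2, by stacking the linear
    constraints implied by Eq. (4) and taking the SVD null vector."""
    A = np.vstack([p1[0] * C1[2] - C1[0],
                   p1[1] * C1[2] - C1[1],
                   p2[0] * C2[2] - C2[0],
                   p2[1] * C2[2] - C2[1]])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]               # de-homogenise to (X, Y, Z)
```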
Fig. 3. (a) and (b) A stereo image pair used in one of our reconstructions. Images are reproduced with permission from Frédéric Devernay.
3.5. Data visualization

The topology of an object can be represented by a set of polygons connected at their vertices. We use Delaunay triangulation to generate a triangular mesh from this cloud of points. A Delaunay triangulation satisfies the following two conditions: (i) three points are connected to form a triangle such that no neighboring point is contained within or on the circle circumscribing the three vertices; (ii) the outer edges of the union of all triangles form a convex hull. Delaunay triangulation of a 3D data set involves first 'flattening' the data values into 2D, effectively removing the third dimension of depth; triangulation is then performed on the array of points, determining connectivity between them according to the above conditions; depth is then restored and a mesh topology results. However, this method is dependent on the viewpoint of the 'flattened' values: rotating the axes before removing the depth information will produce a different triangulated mesh.
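The flatten-triangulate-restore procedure can be sketched as follows, here using SciPy's Delaunay implementation; which axis is treated as depth is an assumption left to the caller.

```python
import numpy as np
from scipy.spatial import Delaunay

def mesh_from_points(points, depth_axis=2):
    """Triangulate a cloud of 3D points by 'flattening' away the depth axis,
    running 2D Delaunay triangulation, then restoring depth.
    Returns the vertices and the triangle index array."""
    pts = np.asarray(points, dtype=float)
    keep = [a for a in range(3) if a != depth_axis]
    tri = Delaunay(pts[:, keep])        # connectivity is decided in 2D only
    return pts, tri.simplices           # depth re-attached through the indices
```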
4. Results

We now discuss the results obtained by our simple model acquisition system. Image noise is inevitably introduced through the digitization of continuous scene information during image acquisition. Matching images taken with a 1600 × 1200 digital camera produced denser matches than several 640 × 480 web cameras, given the same matching parameters (i.e., window size, edge detection threshold,
number of images matched and correlation coefficient threshold). However, the results shown in this paper use compressed JPEG images of 512 × 512 pixels (see Fig. 3(a) and (b)). The camera parameters associated with these two images are known a priori.

4.1. Point matching

Finding point correspondences between images is by far the most significant bottleneck in the reconstruction process. Attempting to match four 640 × 480 pictures using epipolar and similarity constraints, an edge detection threshold of 0.1, a window size of 10, and a cross-correlation coefficient threshold of 0.8, matching every fifth pixel, can take hours to complete.

For the web cameras, the object was captured against a uniform or featureless background. Since an intensity-based matching method was employed, featureless areas were not matched. However, the background was not completely featureless and non-object matches were found. These background matches can be eliminated using the cross-correlation method or by finding a 'mask' that isolates the object from the background.

A window-based approach shows some deficiencies when attempting to match depth discontinuities and textureless areas. Mismatches occur where the change in view has changed the neighborhood of the corresponding pixel in the other images. A typical instance of this is where the pixel cannot be seen in
other images because it is occluded by another part of the scene. In particular, if the pixel corresponds to a depth discontinuity in the image, the view disparity can change the neighborhood intensity values or occlude the pixel of interest.

It is clear that area-based matching will fail if many pixels along the epipolar line have neighborhoods with the same intensities. Textureless and patterned areas cause this problem of ambiguity, but it is combated by the cross-correlation approach. Intensity-based matching is also sensitive to significant changes in viewpoint between images: a relatively large disparity between two views results in a large enough change in neighborhood intensity values that a pixel other than the true correspondence can score a higher correlation coefficient. Foreshortening, a consequence of projection, is also not catered for in neighborhood matching. Due to the perspective nature of cameras, a line in one view appears shorter in the other if the line is at an angle to the cameras. This affects an intensity-based method with a fixed search window size and is a common source of error. Similarly, if the cameras are arranged in a straight line, images of the object appear at different relative scales: images taken from the camera in front of the object (closer) appear larger than those taken from cameras further away. This hinders the precision of window-based matching, since neighborhood intensity values are compared within a fixed window size.

Mismatches can be reduced during the matching phase of the reconstruction process by setting the correlation coefficient threshold to a high value in conjunction with the epipolar constraints. A high correlation coefficient threshold prevents the occurrence of false positives but may also reject accurate matches. This is deemed appropriate, since the depth retrieval step is sensitive to errors.

4.2. Depth inference

Even though the matches obtained are roughly accurate, mismatches still occur and the estimated 3D point array contains outliers. Typically, the 3D points reflect the general shape of the object, but the shape is obscured by a litter of outliers. Since object points congregate in a localized cluster, outliers tend to have large distances to their nearest neighbors and are sparsely populated relative to the object point density. Therefore, points that are far away from the majority of the points (assumed to be object points) can be identified and removed from the data set. In our implementation, points whose mean distance to their 11 or so nearest neighbors deviates from the average of this distance by more than one standard deviation are removed. The size of this filter (that is, the number of nearest neighbors) depends on the relative density of the matches found.

Outliers occur due to erroneous depth estimates from inaccurate matches, or from matches that do not correspond to the object being reconstructed (non-object matches). Such outliers often produce spurious results when triangulating the 3D points; for example, recovered object points can be flattened into a sheet of 3D points rather than forming a polygonal mesh. After the removal of spurious outliers, 3D data points that lie near but not on the object surface still exist. Applying a median filter can pull these points back to the object surface without losing the underlying object structure. The result of median-filtering the facial data points, using a neighborhood of the 20 or so nearest points, is displayed in Fig. 4(a).
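The two filters described in this subsection might be approximated as below; the neighborhood sizes follow the values quoted in the text (11 and 20 nearest neighbors), but the exact filters used in the implementation may differ in detail.

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_outliers(points, k=11):
    """Drop points whose mean distance to their k nearest neighbours deviates
    from the overall average of that distance by more than one standard deviation."""
    pts = np.asarray(points, dtype=float)
    d, _ = cKDTree(pts).query(pts, k=k + 1)     # column 0 is the point itself
    mean_d = d[:, 1:].mean(axis=1)
    return pts[np.abs(mean_d - mean_d.mean()) <= mean_d.std()]

def median_filter_depth(points, k=20, depth_axis=2):
    """Replace each point's depth with the median depth of its k nearest
    neighbours, pulling near-surface points back onto the surface."""
    pts = np.asarray(points, dtype=float).copy()
    _, idx = cKDTree(pts).query(pts, k=k + 1)
    pts[:, depth_axis] = np.median(pts[idx][:, :, depth_axis], axis=1)
    return pts
```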
Fig. 4. (a) The initial point cloud obtained after matching. After applying a median filter over the face data points, the outliers located in front of the face are adjusted to the facial surface. (b) The point cloud after applying median and averaging filters. The resulting point cloud represents a smooth surface with fewer outliers than (a). (c) The recovered structure after triangulating point cloud in (a). (d) The recovered structure after triangulating the point cloud in (b).
4.3. Data visualization

Data visualization aims to establish connectivity between an unstructured set of 3D points. Implementing 2D Delaunay triangulation is simpler than the 3D approach, especially if the data points are aligned with the axes. Otherwise, the data set needs to be rotated by an appropriate angle to align the depth axis with one of the axes of the world reference frame. If the object points are not rotated so that the depth values are aligned with a world reference frame axis, 2D triangulation can result in a 'spiky' object structure or a skewed surface representation.

Triangulating the points in 2D can also have a jagged effect once the points are returned to 3D. This occurs when neighboring points that are triangulated in 2D have large discrepancies in depth. The resulting topology, viewed along the depth axis, clearly represents the object structure; once turned on its side, however, it is populated with a series of peaks and troughs. This is illustrated in Fig. 4(c). The serration can be explained by the imposition of the epipolar constraints: in reference to Fig. 2, any mismatch along the epipolar line l′ will result in a depth discrepancy along the line connecting P to X. This can be resolved by applying a median filter that takes the median depth of a point and its 11 or so nearest neighbors; the size of the median filter again depends on the density of the point set. The result of further smoothing the median-filtered 3D point set of Fig. 4(c) with an averaging filter, applied to every point over its connected neighbors, is shown in Fig. 4(d). Applying both a median and an averaging filter improves the quality of the point cloud considerably, as shown
in Fig. 4(b). Additional smoothing of points incurs a loss of structural detail and should only be performed if the smoothed data points form a better or more realistic representation of the object surface than the present form.

The organization of the data set is dependent on the scene information recovered, mainly the number of matches found and thus triangulated. As the density and orientation of localized point sets vary, so does the suitability of Delaunay triangulation as a data visualization technique. The conditions specifying the Delaunay triangulation method can produce a 'cob-web' effect around the object edges. Complex
Fig. 5. (a) The recovered structure of the face from Fig. 4. The face is shown with lighting and shading. (b) Surface details can be enhanced by texture mapping the image onto the retrieved surface. We have used the original image in Fig. 4(a) as the texture.
objects with large surface gradients, or with gaps between object parts, tend to become inter-connected, and sharp arches and curves of the object shape can be concealed within the triangulation. Our approach works well for the results given in Figs. 4 and 5 because a face does not contain any holes or extended limbs. Also, the shape of the face forms a convex hull quite easily and is well suited to 2D Delaunay triangulation. However, more complex objects may require a different data visualization method.
5. Conclusion and future work

We have implemented a system for simple model construction using inexpensive web cams and digital cameras. We have experimented extensively with many objects, and one of our reconstructions is shown in Figs. 4 and 5. This particular reconstruction used a stereo pair of images taken with calibrated cameras. Using the epipolar geometry to improve search efficiency, corresponding points were located using an intensity-based, cross-correlation algorithm. This allowed us to infer depth using well-known computer vision techniques and projective geometry. After some smoothing operations on the 3D point cloud (see Section 4.3), the results were visualized using a Delaunay triangulation. The triangulated mesh in Fig. 4 has several thousand triangles, which is quite good for computer graphics applications. The resulting models can be used for computer graphics applications, as shown in Fig. 5. The
reconstructed model with shading, lighting and texture mapping is shown in Fig. 5.

Our approach is sensitive to errors; therefore, the image sequences are restricted to those with a small baseline between image pairs. A larger disparity between images leads to fewer parts of the image pairs being common to both. This heightens the chance of mismatches and results in fewer matches being found. The reconstruction then yields a sparse set of 3D points and, therefore, a less detailed result.

Currently, our system uses only two cameras. As a result, it is not possible to acquire a complete 3D model; at most, the area common to both cameras' views is recovered. We plan to extend the system so that we can place the object within a circular array of web cams or circumnavigate the object with a digital camera. This will enable us to build the model piecewise from pairs of cameras and then reconstruct a complete model from these partial models. One of the main obstacles here may be fusing pairs of partial reconstructions together to form a complete 3D reconstruction. Automatic camera calibration enables the camera parameters to be retrieved from the images themselves [19,20]. This would remove the prior calibration restriction and enable flexible image acquisition.

Implementing a standard 2D Delaunay triangulation method brought about the problem of interconnecting separate object parts, losing detail of the object structure. This is inevitable, since one of the criteria stated by this method is to form a convex hull from
the entire set of points. A possible solution is to triangulate isolated parts one at a time, but determining these parts automatically proved difficult because the structure and density of the matched points vary with every object. Another possibility is to use the skeleton [7] of the cloud of points to guide the Delaunay triangulation [5]. Yet another approach, such as marching cubes, could be used. This method requires a set of values sampled at regular intervals. The point cloud will most likely not have a regular set of values, but the missing values can be estimated by interpolation, either linearly or via curve fitting.
Acknowledgements

The second author's research is partially supported by the Western Australian Interactive Virtual Environments Centre (IVEC) and the Australian Partnership in Advanced Computing (APAC). The third author's research is supported by an Australian Postgraduate Award and a top-up scholarship from IVEC.
References

[1] P. Beardsley, P. Torr, A. Zisserman, 3D model acquisition from extended image sequences, in: Proceedings of the European Conference on Computer Vision, 1996, pp. 683–695.
[2] J. Canny, Finding edges and lines in images, Master's Thesis, MIT AI Lab, TR-720, 1983.
[3] R. Cipolla, D. Robertson, 3D models of architectural scenes from uncalibrated images and vanishing points, in: Proceedings of the International Conference on Image Processing, IEEE Computer Society, 1999, pp. 824–829.
[4] A. Criminisi, I. Reid, A. Zisserman, Single view metrology, in: Proceedings of the International Conference on Computer Vision, vol. 40, no. 2, 1999, pp. 434–442.
[5] O. Faugeras, Three-dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.
[6] Georgia Institute of Technology, Georgia Tech Large Model Archive. http://www.cc.gatech.edu/projects/large models.
[7] R. Gonzalez, R. Woods, Digital Image Processing, 2nd ed., Prentice-Hall, 2002.
[8] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[9] B. Horn, Robot Vision, MIT Press, 1986.
[10] A. Koschan, V. Rodehorst, Towards real-time stereo employing parallel algorithms for edge-based and dense stereo matching, in: Proceedings of Computer Architectures for Machine Perception, 1995, pp. 234–241.
[11] C. Mandal, H. Zhao, B. Vemuri, J. Aggarwal, 3D shape reconstruction from multiple views, in: Handbook of Video and Image Processing, Academic Press, 2000.
[12] O. Nakayama, A. Yamaguchi, Y. Shirai, M. Asada, A multistage stereo method giving priority to reliable matching, in: Proceedings of the IEEE International Conference on Robotics and Automation, 1992, pp. 1753–1758.
[13] P. Narayanan, P. Rander, T. Kanade, Constructing virtual worlds with dense stereo, in: Proceedings of the International Conference on Computer Vision, 1998, pp. 3–10.
[14] M. Okutomi, T. Kanade, A multiple baseline stereo, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1991, pp. 63–69.
[15] P. Pritchett, A. Zisserman, Matching and reconstruction from widely separated views, Lect. Notes Comput. Sci. 1506 (1998) 78–93.
[16] R. Sara, Accurate natural surface reconstruction from polynocular stereo, in: Proceedings of the NATO Advanced Research Workshop on Confluence of Computer Vision and Computer Graphics, vol. 84, Kluwer Academic Publishers, 2000, pp. 69–86.
[17] Stanford Graphics Laboratory, The Stanford 3D Scanning Repository. http://graphics.stanford.edu/data/3Dscanrep.
[18] M. Tran, A. Datta, N. Lowe, A low-cost model acquisition system for computer graphics applications, Lecture Notes in Computer Science, vol. 2659, Springer-Verlag, 2003, pp. 1054–1063.
[19] Z. Zhang, R. Deriche, O. Faugeras, Q.-T. Luong, A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry, Artif. Intell. 78 (1995) 87–119.
[20] A. Zisserman, P. Beardsley, I. Reid, Metric calibration of a stereo rig, in: Proceedings of the IEEE Workshop on Representations of Visual Scenes, IEEE Computer Society Press, 1995, pp. 93–100.
Minh Tran has been studying Computer Science and Electrical and Electronic Engineering in Western Australia since 1998. She completed her BSc (Honours) in Computer Science at The University of Western Australia in 2002. She began her PhD the following year in the area of computer graphics. Her interests include texture synthesis and computer vision.
Amitava Datta received his MTech (1988) and PhD (1992) degrees in Computer Science from the Indian Institute of Technology, Madras. He did his postdoctoral research at the Max Planck Institute for Computer Science, University of Freiburg and University of Hagen, all in Germany. He joined the University of New England, Australia, in 1995 and subsequently the Department of Computer Science and Software Engineering at University of
Western Australia in 1998, where he is currently an associate professor. In 2001, he was a visiting professor at the Computer Science Institute, University of Freiburg, Germany. His research interests are in parallel processing, optical computing, computer graphics and mobile and wireless computing. He has served as a member of program committees for several international conferences in these areas. He is also a member of the editorial board for Journal of Universal Computer Science, published by Springer.
Nick Lowe completed his Bachelor of Computer and Mathematical Sciences (Honours, 1st class) at the University of Western Australia in 2001. He is currently undertaking his PhD in computer graphics. His thesis is concerned with techniques and applications for arbitrary portal rendering. Nick has a keen interest in real-time rendering and is a founding member of the 60 Hz Real-time Rendering Group and a lead author of their Clean rendering libraries.