
Computer Vision and Image Understanding 80, 111–129 (2000), doi:10.1006/cviu.2000.0821, available online at http://www.idealibrary.com

Localized Scene Interpretation from 3D Models, Range, and Optical Data¹

Mark R. Stevens, Computer Science Department, Worcester Polytechnic Institute, Worcester, Massachusetts 01609

and J. Ross Beveridge, Computer Science Department, Colorado State University, Fort Collins, Colorado 80523

Received May 27, 1998; accepted November 2, 1999

How an object appears in an image is determined in part by interactions with other objects in the scene. Occlusion is the most obvious form of interaction. Here we present a system which uses 3D CAD models in combination with optical and range data to recognize partially occluded objects. Recognition uses a hypothesize, perturb, render, and match cycle to arrive at a scene-optimized prediction of model appearance. This final scene-optimized prediction is based upon an iterative search algorithm converging to the optimal 3D pose of the object. During recognition, evidence of terrain occlusion in range imagery is mapped through the model into the optical imagery in order to explain the absence of model features. A similar process predicts the structure of occluding contours. Highly occluded military vehicles are successfully matched using this approach. © 2000 Academic Press

1. INTRODUCTION

The appearance of an object in a complex scene can be influenced by a number of factors. One of the factors which most complicates the automated recognition of objects is occlusion. Occlusion can greatly hinder the performance of a recognition algorithm because it is not a phenomenon which can be predicted in isolation: occlusion is a function of an object's relationship to the scene in which it is embedded. Traditional recognition techniques either rely on static feature measurements remaining stable in the presence of occlusions [3, 15, 22] or associate a likelihood of finding each feature based on off-line appearance analysis [1, 8, 14, 29].

¹ This work was sponsored by the Defense Advanced Research Projects Agency (DARPA) Image Understanding Program under Grants DAAH04-93-G-422 and DAAH04-95-1-0447, monitored by the U.S. Army Research Office, and the National Science Foundation under Grants CDA-9422007 and IRI-9503366.


In most of these works, some occlusion is tolerated, but it is seldom dealt with explicitly. Instead, a match quality metric ranks potential matches, and matches with missing features are ranked lower. No provisions are made for explaining the absence of a feature in terms of interactions with other objects.

It is our belief that if model-based object recognition algorithms are to improve in domains where occlusion is common, they must begin to use partial scene models to explain object interactions. Scene models may be in terms of multiple instances of modeled objects or more generic structures interacting with modeled objects or a combination of the two [27]. Here we consider how information about a scene can be inferred from heterogeneous sensor data. In this context, unmodeled objects, specifically foreground terrain, can be inferred from range data. More importantly, once inferred from range data, the interaction of the foreground with an object can be used to customize the features predicted to be visible in both range and optical imagery.

Two forms of scene specific customization are important. First, occluded features are removed from the set of features predicted to be visible in both range and optical imagery. Second, new features which arise out of the interaction between the object and the foreground are predicted. These new features represent the boundary where the foreground occludes the object of interest and are typically prominent in imagery.

Our occlusion reasoning process is embedded in an algorithm for recognizing military vehicles in outdoor scenes. A hypothesize, perturb, render, and match cycle adapts a set of stored geometric features to best fit the current scene context. As more information is gained about the scene during search, better predictions of object appearance are made. Results on 12 images show that occlusion reasoning greatly improves object identification in scenes containing a substantial amount of occlusion.

In practice, the algorithm presented here would be used in conjunction with a preprocessing algorithm that hypothesizes which object classes are present and their approximate pose. In prior work, we have presented a preprocessor [16, 24]. However, hypothesis generation will not be treated directly in this paper.

2. PREVIOUS WORK

Traditionally, two distinct sources of declarative knowledge have been incorporated into object recognition algorithms. The first source consists mainly of geometric features of the object being sought, and the second consists of stored views of object appearance. Numerous techniques have been proposed for utilizing both types of knowledge, but we will center on three distinct categories: geometric feature matching, appearance-based matching, and a pattern matching technique based on eigenvector analysis. Our intention is to briefly summarize each, paying particular attention to how occluded objects are handled.

2.1. Geometric Feature Matching

Geometric object recognition centers around the search for correspondences between geometric model features, such as points, lines, and planes, and homogeneous features extracted from sensor data [3, 9, 12, 15, 22, 28]. All of these algorithms seek to construct a valid correspondence set representing a match between model and image features. A correspondence set typically contains tuples of model features matched to one or more data features. To be considered valid, these matches must remain topologically consistent under a single geometric transformation which aligns the model with the data.


2.1.1. Combinatorial search. The search space for these algorithms is combinatoric, and of general importance is the choice of heuristics used to search this space. Lowe and Huttenlocher [15, 22] rely heavily upon distinctive subsets of features to initialize search. Grimson [12] has developed a constraint-based tree search algorithm whose average case computational complexity for 2D problems involving rotation and translation is polynomial when problems involving symmetric models and multiple models are excluded. The complexity bound is O(m^2 d^2), where m is the number of model features and d the number of data features. However, Grimson also shows that if models are symmetric or more than one model instance is present, then complexity becomes exponential in m. In practice, variants upon constraint-based tree search have been shown to perform well on complex problems [28].

Cass' pose equivalence analysis [9] and the closely related work of Breuel [7] combine search in pose and correspondence space. For 2D problems involving rotation, translation, and scale, pose equivalence analysis has a worst-case complexity bound of O(k^4 n^4). Here, n is the product of the number of model features times the number of data features: m × d. The k term is the number of sides on a convex polygon within which corresponding features must appear. The exponent 4 derives from the four degrees of freedom in a 2D similarity transform. While the existence of this bound is significant, the dependence upon n^4 precludes large problems in the worst case, and average case performance has not been reported.

The local search matching work of Beveridge [3] has relied upon a nondeterministic search process to find optimal matches with high probability. The algorithm has been empirically tested over 2D problems of widely varying size. In these problems, models are potentially rotated, translated, and scaled. Symmetric models and multiple model instances are considered. There is empirical evidence [5] to support the hypothesis that the average case computational complexity of random-starts local search is O(m^2 d^2), the same as Grimson's tree search. This bound appears to hold for single nonsymmetric models, symmetric models, and multiple model instances. The similarity is interesting, but it must be kept in mind that the algorithms are very different. While tree search deterministically finds an acceptable match, random-starts local search finds an optimal match with high probability.

2.1.2. Evaluating matches and occlusion. Of specific interest in the context of this paper is how measures are defined to ascertain the quality of each match and how these measures influence performance when objects are occluded. Almost all of the algorithms mentioned above use some measure of match quality, even if it is a simple rule stating that larger matches are better. In many cases, the measure is more formal, and in the case of Wells [28], the measure is formalized in terms of MAP (maximum a posteriori probability) estimation. For almost all of these measures, a pathology arises in the presence of occlusion. When objects become partially occluded, occluded model features disappear from the correspondence set. A question arises: Is it better to obtain a correspondence set which contains a subset of the model matched with high confidence, or to have a subset which contains more of the model matched but with less confidence? In other words, how should occlusion of features alter the ranking of matches?
Several approaches to the problem have been taken. Beveridge has introduced a measure which trades off the importance of finding model features and their quality of match [3]. Both Lowe [22] and Huttenlocher [15] have independently used an explain-away approach: as data features are considered matched they are removed from further consideration. Wells has dealt with occlusion by forcing the measure of match quality to be penalized for missing data features [28].


In most of these cases, confidence in the match measure will degrade as more model and data feature pairings are removed from the match.

2.2. Geometric-Based Appearance Matching

Using just geometric knowledge during recognition may not always be the best approach. In many domains, certain features in a stored model may never actually be found in an image. Thus, continually searching for those features, when there is little likelihood that they will be found, may hinder recognition performance. For instance, consider an edge formed by the junction of two faces on a 3D model. Under certain lighting conditions and viewing angles, that feature may appear prominent when imaged. Under other conditions it may not even be detectable. Based on this belief, methods for incorporating appearance information into a model have arisen [1, 8, 14, 29].

The appearance-based approach derives additional information about stored models from a large set of training examples. In some cases, these images are artificially generated [8, 29], and in others they are real sensor images [1, 14]. In each case, the goal is to extend the model representation to make it more appropriate for matching an object in new imagery based on past experience. In some cases [1, 14] the extended model consists of geometric features and a likelihood that each feature will be observed given its presence in the training data. In others, geometric features are associated with the best algorithm to use for detecting that feature [8, 29].

Unfortunately, most of these appearance-based techniques can be highly sensitive to occlusions in a scene. Problems arise because the algorithms rely upon off-line analysis to predict future object appearance. Over all views in the training set, certain features may or may not be present. Generalizing a likelihood for a feature based on a set of such observations can degenerate to an extremely weak model of occlusion. For example, while some point feature F1 may have a prior probability of being occluded equal to 0.6, and another F2 a prior probability of 0.1, in any given image the feature either is or is not occluded.² Use of these priors implies an implicit bias toward some scene configurations over others. For example, all other things being equal, a system choosing between the situation where F1 is missing versus F2 missing would favor the case where F1 is missing. Some percentage of the time this will be the wrong choice, since there are presumably configurations where F2 is occluded but F1 is not. If occlusion is truly a function of a dynamic scene, as the authors believe, prior knowledge of static appearance is not guaranteed to improve performance when portions of an object are occluded. As just suggested, recognition quality may actually degrade in the cases where features with a high likelihood of being found in an image are in fact occluded.

2.3. View-Based Knowledge

Other methods have completely abandoned the use of geometric models in favor of view-based information [11, 20, 26]. Here, numerous training images are taken of an object. Off-line eigenanalysis of the training set produces a set of orthonormal basis vectors.

² We neglect issues arising out of image sampling and stick to a simple pin-hole model of projection of 3D objects to the continuous 2D image plane. Obviously at the level of pixels a detailed modeling of image generation can lead to local ambiguities. For features with spatial extent, of course only a portion may be occluded. However, the probability a point feature is occluded should not be confused with the fraction of a larger feature which is occluded. These are quite different things.


The lower order vectors, as determined by their eigenvalues, are discarded. When a new image is provided, it is projected into a lower dimensional subspace using these basis vectors. Any one of a number of distance measures can be used to compute a metric of similarity between the new sample and the closest training sample.

During the computation of the eigenvector basis set, each pixel in an image chip is considered. When a new image is presented for on-line matching, each pixel in the new chip is again used. As portions of an object become occluded, incorrect object pixels will be used in the projection. Such incorrect pixels will have the effect of shifting the location of the projected point in eigenspace. Hence, even modest occlusion can have detrimental effects on the algorithm. It is for this reason that Ohba et al. have used smaller regions of the image instead of its entirety [20]. Others have performed off-line analysis in attempts to limit the effects of occlusions [11]. However, it is extremely difficult to predict all possible forms of occlusion. Even under quite modest assumptions, the number of possible ways a single solid object might occlude an image chip grows exponentially as a function of the image chip size. The next section provides a brief look at the combinatorics of pixel occlusion.

3. COMBINATORICS OF PIXEL OCCLUSION

Consider, for the sake of simplicity, that an object fills an image chip W pixels wide and H pixels high. It could be argued that there are 2^{WH} ways to occlude the object: label every pixel as either occluded or not independently. However, this formulation neglects the structure of the occluding objects and thus grossly overestimates the plausible occlusion patterns. At a minimum, it can be expected that an occluding object creates one connected region of occluded pixels.

A more conservative computational model of occlusion considers the ways a single occluding object might progressively cover the object of interest by moving in from two adjacent sides of the image. For example, consider a concave occluding object moving in from the lower left corner. In this context, concave means that if the occluding object covers k pixels on row h, then it can cover only k or fewer pixels in row h + 1. It is assumed that the image origin is the lower left corner. This model of occlusion does not consider convex objects either placed within the image or entering from one of the sides. Consequently, it may be thought of as a lower bound. This model has been chosen because when we consider patterns entering from either the left or the right it becomes a reasonable approximation for the types of occlusion dealt with later in this paper. Specifically, it is a model for how foreground terrain can occlude a vehicle.

A given occlusion pattern satisfying these constraints may be written as a sequence of H integers between 0 and W:

    p = [l_1, l_2, ..., l_H],   l_i ∈ {0, ..., W},   l_i ≥ l_{i+1}.    (1)

For example, if W = 4 and H = 3, then some of the possible patterns are:

    [0, 0, 0],   [4, 1, 1],   [3, 2, 1],   [2, 0, 0].    (2)

The first can be interpreted as zero occluding pixels on each of the three rows. This is the special case of no occlusion. The second sequence is interpreted as four occluding pixels for row one and one occluding pixel for rows two and three. These example occlusions are also depicted in Fig. 1. Occluding pixels always enter the image from the left border.


FIG. 1. Examples of structured occlusion.

For an image chip of given width W and height H, the number of distinct occlusion patterns N_L is the same as the number of possible patterns satisfying Eq. (1). There are a variety of ways to compute N_L. Here, let us think in terms of the possible choices. The number of possible patterns where all the rows are of equal length is W + 1 choose 1, or simply W + 1. Next consider the number of patterns where the rows are of one of two possible lengths. There are W + 1 choose 2 ways to pick the lengths and H − 1 choose 1 ways to place the break point between the two lengths. So, for example, the sequence [2, 0, 0] would be produced by first picking 2 and 0 (out of {0, ..., 4}) and the break point 1 (out of {1, 2}).

The above process can be generalized to represent the number of ways to construct a pattern with l different lengths distributed across H rows. Writing C(n, k) for the binomial coefficient n choose k, let this number be S_l and observe that S_l may be written as

    S_l = C(W+1, l) C(H−1, l−1).    (3)

The total number of patterns is the sum over all possible choices of l:

    N_L = Σ_{l=1}^{H} C(W+1, l) C(H−1, l−1).    (4)

The following identity from Knuth [19, p. 58] enables us to simplify this sum of products:

    Σ_k C(r, k) C(s, n+k) = C(r+s, r+n).    (5)

Equate k with l, r with W + 1, s with H − 1, and n with −1 and note that

    N_L = C(W+H, W).    (6)

Since N_L is the number of patterns for an occluding object entering from the bottom and left, by a like argument the number of patterns for an occluding object entering from the bottom right is the same: N_R = N_L. The patterns for occlusion entering from the left and right are disjoint with the exception of two cases: each can produce zero or complete occlusion. Thus, the total number of patterns N of occlusion entering from either side and the bottom is

    N = 2 N_L − 2 = 2 C(W+H, W) − 2.    (7)


TABLE 1
Example Values for Number of Possible Occlusion Patterns

              H = 5         H = 10           H = 25           H = 50
    W = 10    6,004         369,510          N/A              N/A
    W = 25    285,010       367,158,790      2.528 × 10^14    N/A
    W = 50    6,957,520     1.508 × 10^11    1.052 × 10^20    2.018 × 10^29
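The counts above are easy to reproduce. The following Python sketch (our illustration; it is not part of the original paper) evaluates Eqs. (6) and (7) with exact integer arithmetic and, for small chips, checks the closed form against a brute-force enumeration of the monotone patterns defined in Eq. (1).

```python
from itertools import product
from math import comb


def n_one_sided(w: int, h: int) -> int:
    """Number of monotone occlusion patterns entering from one lower corner, Eq. (6)."""
    return comb(w + h, w)


def n_patterns(w: int, h: int) -> int:
    """Patterns entering from either the lower left or the lower right, Eq. (7)."""
    return 2 * n_one_sided(w, h) - 2


def n_one_sided_brute_force(w: int, h: int) -> int:
    """Directly enumerate sequences [l_1, ..., l_H] with 0 <= l_i <= W and l_i >= l_{i+1}."""
    return sum(
        all(p[i] >= p[i + 1] for i in range(h - 1))
        for p in product(range(w + 1), repeat=h)
    )


if __name__ == "__main__":
    # Small example from the text: W = 4, H = 3 gives 35 one-sided patterns.
    assert n_one_sided(4, 3) == n_one_sided_brute_force(4, 3) == 35
    # Two entries of Table 1.
    assert n_patterns(10, 5) == 6004
    assert n_patterns(25, 10) == 367158790
    print(n_patterns(50, 50))  # roughly 2.018e29, the last entry of Table 1
```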

What this brief look at a mathematical model for terrain occlusion shows is that the number of possible pixel occlusion patterns grows exponentially with object size. Table 1 shows values for N given some typical object widths and heights for the experiments which follow. Clearly, efforts to develop training data for appearance-based methods will be profoundly hampered by these combinatorics, as will any approach which attempts to enumerate and store, either implicitly or explicitly, all the possible occlusion patterns.

One last approach that might be thought of as a solution to the combinatorics of occlusion is to break the model into small parts. Then one might expect those parts which are fully visible to be reliably detected. Unfortunately, as mentioned above, for the problems presented below our experience suggests that predicting the exact structure of the occluding contour is as important as predicting the appearance of the visible parts of the object. Thus, imposing an arbitrary decomposition of the object into independent parts cannot be expected to yield reliable recognition.

4. MOTIVATING THE RENDER, MATCH, AND REFINE APPROACH

Given sufficiently precise knowledge of the sensors, lighting, and objects, one can imagine recognition proceeding as a succession of rendering and appearance matching cycles. For each cycle, the parameters of the scene model would be adjusted in such a way as to bring the prediction ever closer to the observed image. This idea was proposed by Besl in the context of open-loop object recognition [2]. In order for an object recognition algorithm to be truly robust, any relevant information acquired in later phases of recognition should be used to reanalyze the original image data.

As a framework for recognition, a render, match, and refine (RMR) cycle combines some of the best aspects of model-based, appearance-based, and view-based recognition. Unlike view-based methods which rely solely upon training imagery, the potential exists to reason from the underlying 3D geometry to create accurate views of scenes not in the training data [27]. Unlike many geometric matching techniques, it is not necessary to define matching in combinatoric terms. Instead, the search is carried out in the parameter space representing the 3D pose of the object relative to the sensor. Finally, unlike the appearance-based matching techniques, each prediction reasons about model features given the context of the specific scene being interpreted.

To clarify the value of scene specific reasoning, think of the difference in the following terms. Several prior systems have specified a generalized penalty function for failing to find features of a model [1]. Under some assumptions, these penalties arise out of prior probabilities that a model feature will be present, generalized over the space of possible object views. Interpretation then combines evidence and downgrades matches which omit features based upon these prior probabilities.


In contrast, the RMR approach acquires additional information, makes specific conditioning assumptions, and then computes a new match quality estimate. To make this concrete: in [3], failure to find one out of four features dropped the confidence in the match by roughly one quarter. In the RMR approach, the same three out of four features might be treated as a near perfect match given additional evidence that the fourth feature is hidden.

Despite these apparent advantages to the RMR approach, there are a myriad of open issues. To name four:

(1) What precise form should the predicted imagery take?
(2) How is the fidelity between predicted and actual imagery measured?
(3) How is the search through scene configurations controlled?
(4) How is the entire matching process initialized?

In this paper we are concerned with the first three questions. The last question, while critical, is one we will not address directly. Initialization is essentially a matter of indexing [12], and techniques exist for hypothesizing object location based on geometric hashing [21], template matching [10], or probing [6]. Grimson [13] has stated that the indexing task is very difficult. While in prior work we have demonstrated successful indexing on the same data as used in this paper [16, 24], it is not our intent to discuss the indexing problem here.

The RMR approach opens up opportunities to represent and utilize scene specific relations between sensors and objects not available to more traditional approaches. There are many examples of useful relationships, including the ability to reason about the relative placement of different modeled objects and thereby draw inferences about partial occlusion and object interreflection [27]. However, one of the most compelling examples involves model-based fusion of range and optical imagery in order to account for partial terrain occlusion of 3D modeled objects. As our results will demonstrate below, the RMR approach is capable of finding highly occluded objects in colocated range and optical imagery.

5. RMR FOR LOCALIZED SCENE INTERPRETATION

Assume for the moment that the primary source of error in our knowledge of the world is that of object location. This implies knowledge about scene lighting, camera calibration, and object appearance sufficient to render an image which can be meaningfully compared to the observed image. A measure can then be developed which indicates how well the rendered prediction and observed image match. Search in the space of possible object locations, or pose estimates, can then be guided by the quality of the match between predicted and observed images. The expectation is that when all the pieces are properly assembled, this search will refine the object pose estimate until it is a good approximation of the true pose of the object. This RMR cycle is illustrated in Fig. 2.

An algorithm designed around this framework has been built for matching 3D CAD models of vehicles as observed by colocated range and color sensors. The range sensor is a LADAR range finder and the color camera a standard 35-mm SLR camera whose images have been digitized [4]. Figure 3 shows the different coordinate systems used in matching. Triple-arrowed lines indicate explicit transformations between coordinate systems maintained and adjusted by matching. Single-arrowed lines are the intrinsic sensor parameters and are assumed to be known from sensor calibration [18]. Notice that the model provides a link between the two sensors' transformations and that the two are not independent.

LOCALIZED SCENE INTERPRETATION

119

FIG. 2. Interleaving prediction and coregistration.

The two colocated sensors give rise to an eight-degree-of-freedom viewing model [18]. Six parameters encode the object pose relative to the sensor suite and two encode the relative translation of one imaging plane relative to the other. We define the eight-degree-of-freedom parameterization as coregistration space. The term derives from the goal of aligning the object to the data while simultaneously adjusting the image registration between sensors. In this parameterization, sensor fusion takes place in the model's native coordinate system and not at the pixel level. While this may not seem intuitive, the problem can be viewed as optimizing the object's position independently in each sensor subject to an additional constraint upon the relative location of the two sensors. The sensor constraint allows the matching system to establish a pixel level correspondence by projecting pixels from each image onto the model. Formally, a 3D point on a model, (M_x, M_y, M_z), for a given coregistration can be projected into the color image at (c_x, c_y) and the LADAR image at (l_x, l_y). In doing so, we have forced a pixel level correspondence between (l_x, l_y) and (c_x, c_y). Due to differences in sensor resolution this may be a many-to-many relationship.

Let F be the coregistration which positions the model in both the 3D range coordinate system and the 3D color coordinate system,

    F = {θ, φ, γ, L_x, L_y, C_x, C_y, Z},    (8)

where (θ, φ, γ) are the azimuth, colatitude, and tilt for rotating the object about its origin, (L_x, L_y) are 3D translations to move the rotated model into the LADAR coordinate system, and (C_x, C_y) are the corresponding translations for color. The final parameter, Z, is the translation away from both sensors.
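As an illustration only (this container does not appear in the paper), the eight coregistration terms of Eq. (8) can be held in a small structure and flattened into the parameter vector that the search of Section 5.3 manipulates; the field names and vector ordering below are our own assumptions.

```python
from dataclasses import dataclass, astuple

import numpy as np


@dataclass
class Coregistration:
    """The eight coregistration parameters F of Eq. (8)."""
    theta: float  # azimuth rotation about the object origin
    phi: float    # colatitude
    gamma: float  # tilt
    l_x: float    # translation into the LADAR coordinate system (x)
    l_y: float    # translation into the LADAR coordinate system (y)
    c_x: float    # translation into the color coordinate system (x)
    c_y: float    # translation into the color coordinate system (y)
    z: float      # translation away from both sensors

    def as_vector(self) -> np.ndarray:
        """Flatten to the 8-vector searched by the simplex algorithm."""
        return np.array(astuple(self), dtype=float)

    @classmethod
    def from_vector(cls, v: np.ndarray) -> "Coregistration":
        return cls(*map(float, v))
```

Keeping L_x, L_y distinct from C_x, C_y is what lets the search adjust the residual registration between the two image planes while it moves the object.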

FIG. 3. Coordinate systems: Triple arrowed lines are transformations used during matching whereas single arrowed lines are intrinsic sensor parameters obtained through calibration.


FIG. 4. Example data of a color image (left) and a LADAR image (right).

This specific parameterization has been chosen to expedite search in coregistration space. It is these parameters which will be iteratively searched using the algorithm to be presented in Section 5.3.

Figure 4 presents an example recognition problem. The left image is a color image centered on an M60 tank. The right image shows a gray-scale coloring of a LADAR image. Notice that the tank is about 60% occluded by the hillside.³ We will use this example to illustrate how our RMR algorithm is able to refine the coregistration estimate while detecting and accounting for the partial occlusion of the vehicle.

The remainder of this paper is broken down into four distinct parts. First, a method for making rendered predictions of the scene based upon an assumed coregistration is described. Next, a measure is discussed for determining how well a model matches the data given the hypothesized set of eight coregistration parameters (see Section 5.2). The features used in the error measure are derived from the original CAD model and then adjusted for occlusions detected in the current scene context. Section 5.3 defines a search strategy to adjust a hypothesized coregistration state and thus improve the object position relative to both sensors. Finally, results are shown in Section 6 that compare performance with and without the component which takes account of occlusion.

5.1. Generating Model Predictions

How best to predict the appearance of an object is a key question. If all objects in the world were modeled, including possible backgrounds, the prediction process would produce an image of the entire scene. In actuality, for model-based recognition we often have only partial knowledge of the possible objects present in the world. While our goal is to find the best instance or instances of those objects for which we have models, something must be done to deal with the portions of the scene for which we have no model.

One approach is to simply render each object for a given coregistration and measure how well hypothesized model pixels match the data. This approach was taken in our previous work [23]. However, with additional experimentation it became clear that an error based only over the set of predicted model pixels can cause problems when matching objects of different size (e.g., small pickup trucks and larger tanks). There is a tendency to embed smaller instantiations of models within larger objects. In response, we have shifted to a different method for making predictions: that of localized scene predictions.

A localized scene prediction not only examines regions of an image believed to belong to an object, but also a fixed portion of the image surrounding the object. The intuitive justification for such an approach is that the object should match a portion of the image well and should not match well pixels which are not part of the object. Similar concepts have been used successfully in other areas of target recognition [6].

³ The data is from the Fort Carson Dataset, which contains range, IR, and color imagery of military targets against natural terrain. The imagery and models are available at http://www.cs.colostate.edu/~vision.


FIG. 5. Rendered predictions of the model in the scene.

Since we are using two heterogeneous sensors, for a given model the prediction generation stage actually involves rendering two images. The first prediction is the projection of each model face at the resolution of the LADAR sensor image. This rendered image contains a value of 1 for each pixel covered by the model: the 1 labels the pixel as object. All other pixels are given label 0 (labeled as other). The second prediction is the projection of the most likely set of visible model edges at the resolution of the color sensor. These likely edges are found by examining all pairs of adjacent faces. If only one of the two faces is visible, then the adjoining edge is drawn. In general this method produces edges along the occluding contour as well as significant internal edges. The resulting image is at the resolution of the color sensor, and pixels under projected edges are labeled as object edge.

Since these predictions are based on the intrinsic sensor calibration parameters, the correspondence between model features and data features is known. By using pseudo-depth values obtained during rendering, it is also possible to unproject each predicted pixel in the rendered LADAR image into the model coordinate system. These unprojected features can then be reprojected into the color image. This unprojection is possible because of the 3D nature of the LADAR sensor image. In this manner, the correspondence between range and color pixels can be determined. Figure 5 shows the renderings obtained for the ideal coregistration and the ideal object for the image of Fig. 4. The ideal parameters are not used during search; they are only used here to illustrate how a prediction is compared. The next step is to adjust these image predictions based on occlusion information obtained from the LADAR sensor.

5.1.1. Adjusting the prediction based on occlusions. Each pixel in the LADAR prediction is examined. If it is labeled as object, the predicted depth of the model at that pixel is compared to the actual pixel in the sensor image. If the corresponding data pixel is τ meters closer to the sensor than expected, we can assume that the predicted pixel is being occluded. All of the predicted LADAR pixels now have one of three labels: object, occluded, or other. We then examine every pixel in the color prediction. For each pixel, its corresponding LADAR pixel is located. The correspondence is constructed by mapping the LADAR sensor pixels through the model back into the color coordinates. If the LADAR pixel was labeled as occluded, the corresponding color pixel is also labeled as occluded.

One final step is necessary for adjusting the prediction of the object in the color image: adding new significant edges. If it is hypothesized that there is an occluding surface covering a portion of the object in the scene, evidence for that boundary should be found in the image. Furthermore, we cannot expect that boundary to correspond to an edge in the model. Thus, we need to introduce a new edge into our prediction. Adapting the already existing prediction is quite easy. Every color pixel is examined, and if it lies on the model and one of its eight-connected neighbors is labeled as occluded, the current pixel is set to object.
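Before continuing, the labeling rules just described can be summarized in a short sketch (ours, with assumed array conventions; the mapping of occluded LADAR pixels into the color image through the model is omitted here). A pixel predicted as object is relabeled occluded when the measured range is more than τ meters closer than the rendered model depth, and object pixels bordering the occluded region become the new occlusion-boundary edge discussed next.

```python
import numpy as np

OTHER, OBJECT, OCCLUDED = 0, 1, 2


def adjust_ladar_prediction(labels, predicted_depth, measured_depth, tau=1.5):
    """Relabel predicted object pixels as occluded when the LADAR data is more
    than tau meters closer to the sensor than the rendered model depth."""
    labels = labels.copy()
    closer = (predicted_depth - measured_depth) > tau
    labels[(labels == OBJECT) & closer] = OCCLUDED
    return labels


def occlusion_boundary(labels):
    """Object pixels with an eight-connected occluded neighbor; in the color
    prediction these become new edge pixels along the occlusion boundary."""
    occluded = labels == OCCLUDED
    neighbor = np.zeros_like(occluded)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # np.roll wraps at the image border, which is adequate for a sketch.
            neighbor |= np.roll(np.roll(occluded, dy, axis=0), dx, axis=1)
    return (labels == OBJECT) & neighbor
```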


FIG. 6. The adapted predictions modified due to occlusion information. The left image shows the adapted color prediction and the right image shows the adapted LADAR prediction. The labels have been given colors: black is other, white is object, gray is occluded.

Thus a new edge not already in the model has been added to account for the effect of the occlusion boundary. Figure 6 shows the adjusted color and LADAR predictions. The new edge added to the color prediction can be clearly seen to account for the boundary between the object and the hillside.

Our initial work did not incorporate methods for adding such edges to a prediction. In many cases, these edges are very prominent in the gradient images of the object. Figure 7 shows two images of a tank from different viewpoints. In the left image there is a prominent gradient running along a line occluding the back half of the lower left tank track. In the right image, which is shown in color in Fig. 4a, the occluding contour caused by the hillside contains a very strong gradient response. Without incorporating such features, the recognition algorithm often found false matches by trying to match existing model features to those strong gradients.

5.2. Measuring the Fidelity of the Prediction

An error function is now defined to measure the fidelity between the rendered predictions and the actual sensor data. This function, E_M(F_i), determines how well model M matches the range data, r, and the color data, c, for a specific coregistration F_i:

    E_M(F_i) = α E_Mr(F_i) + β E_Mc(F_i),   with α + β = 1,    (9)

where α is a weighting term and was set to 0.6 for the experiments reported in Section 6. The individual error terms for each sensor, E_Mr(F_i) and E_Mc(F_i), are based on how well the individual rendered predictions match the range and color images. These are defined in Sections 5.2.2 and 5.2.1, respectively.

5.2.1. The color error. The error term for the color sensor, E_Mc(F_i), is formed from the prediction of where edges should appear in the color image. This error is based on the gradient magnitude.

FIG. 7. Two different gradient images of a tank. The left image shows a strong gradient response along a line where the back half of the bottom track is occluded by foreground terrain. The right image shows a strong gradient response where an intervening hillside occludes the lower forward portion of the tank.


To estimate the gradient, the following 1 × 9 mask is used:

    A = [ −1  −2  −3  −4   0  +4  +3  +2  +1 ],

where the central 0 marks the pixel at which the gradient is being estimated.

Such a mask provides gradient information for a large area about the true edge. Therefore, even when the coregistration is in error by several pixels, the search algorithm will have the directional information necessary to make improvements. The gradient magnitude is defined as

    G_{x,y} = sqrt( (X · A)^2 + (Y^T · A)^2 ),    (10)

where X is the column vector of pixels in the horizontal direction centered around the current pixel and Y is the analogous row vector of vertical pixels. Since we are dealing with color imagery, each element of these vectors actually contains the V component of the HSV representation of the pixel. The gradient values for the entire image are normalized to lie in the range [0, 1].

Next, two measures are derived. One is a function of the gradient magnitude under predicted edges. The other is a function of the gradient magnitude where our model predicts that edges should not be found. More formally, the first sums the inverse gradient magnitude over the predicted edge pixels:

    E_edge = ( Σ_{(x,y) ∈ edge} (1.0 − G_{x,y}) ) / ( Σ_{(x,y) ∈ edge} 1 ).    (11)

By minimizing E_edge, we are maximizing the gradient under the expected location of the model edges. The other term captures the idea that edge strength should be low where the model predicts no edges. This measure is localized in the area of the image where the model is predicted to appear. This area is determined by fitting a bounding box to the 3D object model and then projecting this box into the image. Formally, the not-edge error E_nonedge is

    E_nonedge = ( Σ_{(x,y) ∈ nonedge} G_{x,y} ) / ( Σ_{(x,y) ∈ nonedge} 1 ),    (12)

where nonedge denotes the pixels within the projected bounding box that are not predicted to be edges. The total error for the color prediction is the weighted sum of the two edge terms:

    E_Mc(F_i) = (0.75)(E_edge) + (0.25)(E_nonedge).    (13)

The weighting terms are required since there are on average three times as many background pixels in a match as predicted edge pixels. Without the weighting terms, the error would be dominated by the contribution of the background or gradient noise. In our earlier work we used only E_edge and did not use E_nonedge. Our experience suggests both terms are important.
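A compact sketch of Eqs. (10)–(13) follows (our illustration under assumed conventions, not the authors' implementation): the 1 × 9 mask is applied to the V channel in both directions, the gradient magnitude is normalized to [0, 1], and the weighted edge/not-edge error is formed from boolean masks assumed to come from the rendered color prediction and the projected bounding box.

```python
import numpy as np

# The 1 x 9 mask A from Section 5.2.1.
A = np.array([-1, -2, -3, -4, 0, 4, 3, 2, 1], dtype=float)


def gradient_magnitude(v):
    """Eq. (10): G = sqrt((X.A)^2 + (Y^T.A)^2), normalized to [0, 1]."""
    gx = np.zeros_like(v, dtype=float)
    gy = np.zeros_like(v, dtype=float)
    for k, w in zip(range(-4, 5), A):
        # np.roll wraps at the border; adequate for a sketch.
        gx += w * np.roll(v, -k, axis=1)   # horizontal neighborhood
        gy += w * np.roll(v, -k, axis=0)   # vertical neighborhood
    g = np.sqrt(gx ** 2 + gy ** 2)
    return g / g.max() if g.max() > 0 else g


def color_error(v, edge, nonedge):
    """Eqs. (11)-(13): weighted sum of the edge and not-edge terms."""
    g = gradient_magnitude(v)
    e_edge = np.mean(1.0 - g[edge])        # Eq. (11)
    e_nonedge = np.mean(g[nonedge])        # Eq. (12)
    return 0.75 * e_edge + 0.25 * e_nonedge   # Eq. (13)
```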


5.2.2. The LADAR error. The error term for the LADAR sensor, E_Mr(F_i), is formed from the LADAR prediction. The LADAR error consists of three measurement terms which assess the goodness of each possible pixel category (label):

    E_Mr(F_i) = (1/T) (E_object + E_other + E_occluded),    (14)

where T is the total number of pixels in the projected 3D bounding box of the model. The first term in this equation assesses how well the pixels predicted as object match the data:

    E_object = Σ_{(x,y) ∈ object} { |P_{x,y} − D_{x,y}| / τ   if |P_{x,y} − D_{x,y}| < τ
                                  { 1                          otherwise,                   (15)

where P_{x,y} is the predicted feature depth and D_{x,y} is the actual LADAR depth. The parameter τ is designed to account for sensor noise by placing a cap on the farthest distance a point can be from its prediction and still be called matched (set to 1.5 m). The next term accounts for how well our prediction of the background pixels matches:

    E_other = Σ_{(x,y) ∈ other} { 0   if |P_{x,y} − Z| > τ
                                { 1   otherwise,                (16)

where Z is the distance of the object from the sensor. The background term encourages coregistration states where pixels labeled as nonobject lie at least τ m from the object center. Similar to the color measure, the set of possible (x, y) pairs is limited to the region within the model bounding box.

The last term computed is the effect of occlusion on the match. We have chosen a simple ramp function so that small amounts of occlusion are allowed, but a penalty is incurred for coregistration states where a significant portion of the model is occluded:

    E_occluded = { 0                          if (Occ/T) < 0.25
                 { ((Occ/T) − 0.25) / 0.75    if (Occ/T) < 0.75
                 { 1                          otherwise,             (17)

where Occ is the total number of pixels in the projected model bounding box labeled as occluded.
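The three range terms can be sketched in the same style (again ours, not the authors' code); `labels` is the adjusted LADAR prediction, `predicted` and `measured` are depth images, `z` is the hypothesized object distance, and `t` is the pixel count of the projected bounding box.

```python
import numpy as np

OTHER, OBJECT, OCCLUDED = 0, 1, 2


def ladar_error(labels, predicted, measured, z, t, tau=1.5):
    """Eqs. (14)-(17): per-category range error normalized by the bounding-box size t."""
    obj = labels == OBJECT
    diff = np.abs(predicted[obj] - measured[obj])
    e_object = np.sum(np.where(diff < tau, diff / tau, 1.0))     # Eq. (15)

    # Eq. (16): penalize background pixels that sit within tau of the object depth.
    other = labels == OTHER
    e_other = np.sum(np.abs(predicted[other] - z) <= tau)

    # Eq. (17): ramp penalty on the occluded fraction of the bounding box.
    occ = np.count_nonzero(labels == OCCLUDED) / t
    if occ < 0.25:
        e_occluded = 0.0
    elif occ < 0.75:
        e_occluded = (occ - 0.25) / 0.75
    else:
        e_occluded = 1.0

    return (e_object + e_other + e_occluded) / t                 # Eq. (14)
```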


5.3. Searching Coregistration Space

Our search strategy is based on the simplex algorithm developed by Nelder and Mead [25]. A simplex is defined as a collection of n + 1 points embedded in the n-dimensional search space. Conceptually, the simplex can be viewed as a geometric amoeba which flows along an error surface until all the points of the amoeba have converged to the same local optimum. In this domain, a different simplex is used for each object. Each point in the simplex represents a coregistration, F, and each iteration of search moves the worst error point in the simplex in order to lower its error. Recall from Section 5 that F encodes both the object's 3D pose and the sensor alignment.

Four possible rules are invoked to improve the worst simplex point. The first is to reflect the simplex point through the simplex. If this reflected point has a lower error, then an expansion is attempted. Expansion moves the point farther along the trajectory used by the reflection. If the reflection did not improve the error, a contraction is used to pull the point toward the center of the simplex. If the contraction fails, all points in the simplex are shrunk toward the best point in the simplex.

The process of adjusting the worst simplex point continues until no further adjustments can be made. Once convergence is reached, the coregistration is set to the simplex point with the lowest error. Typically in this domain, once the simplex has converged the pose points are nearly equivalent. The resulting coregistration represents the best combination of 3D object pose and sensor registration based upon the comparison of the rendered object predictions with the range and color images.

To initialize the simplex, random perturbations are used. Given a single set of coregistration parameters, each dimension is perturbed in succession to generate a new simplex point. The simplex algorithm is run to convergence, then randomly perturbed again and rerun. The process is repeated ten times in the hope of achieving a better solution by avoiding local optima.
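For concreteness, a random-restart Nelder–Mead refinement over the eight coregistration parameters can be sketched with SciPy as below. The `match_error` callable stands in for E_M(F_i) of Eq. (9) (render the predictions for F, then score them against the range and color images); the perturbation scales and tolerances are illustrative placeholders, and SciPy builds its own initial simplex rather than the per-dimension perturbation described above.

```python
import numpy as np
from scipy.optimize import minimize


def refine_coregistration(match_error, f0, restarts=10, seed=0):
    """Random-restart Nelder-Mead search over the 8-D coregistration space.

    match_error: callable mapping an 8-vector F to the scalar error E_M(F).
    f0: initial coregistration estimate (e.g., from an indexing preprocessor).
    """
    rng = np.random.default_rng(seed)
    scale = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5])  # illustrative step sizes
    best_f, best_err = np.asarray(f0, dtype=float), np.inf

    for _ in range(restarts):
        start = best_f + rng.uniform(-1.0, 1.0, size=8) * scale  # perturb, then rerun
        res = minimize(match_error, start, method="Nelder-Mead",
                       options={"xatol": 1e-3, "fatol": 1e-4, "maxiter": 2000})
        if res.fun < best_err:
            best_f, best_err = res.x, res.fun
    return best_f, best_err
```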

6. RESULTS

A set of experiments has been run in order to gain some insight into how important the occlusion reasoning is when recognizing terrain occluded objects. By occlusion reasoning, we are referring to the step in the model prediction algorithm where pixels in the rendered range and color images are marked as occluded based upon the relative depth of pixels compared to the hypothesized distance to the object (Section 5.1.1). Without occlusion reasoning, the rendered pixels belong to either the class object or background, and all rendered range points and color edges derived from the model are predicted to be visible.

There are two parts to the question of whether occlusion reasoning is or is not useful. First, does occlusion reasoning improve recognition on imagery where objects are occluded? Second, and this question is sometimes overlooked, does the inclusion of occlusion reasoning do no harm for objects which are not occluded? To test both questions, the RMR algorithm was run on six images containing a single object with no occlusion and six images with a single object partially occluded. For the occluded objects, 25 to 60% of the object is occluded. For all 12 images, the RMR algorithm was run both with and without the occlusion reasoning component. To test how well matching performs with correct and incorrect object models, the RMR algorithm is run exhaustively over the set of models. In this experiment, there are three models: a tank (M60), an armored personnel carrier (M113), and a variant of the M113 with a missile launcher on the roof (M901).

TABLE 2
Confusion Matrices for Images with and without Occlusion and RMR Algorithms Using and Not Using Occlusion Reasoning
(Rows give the true object; columns give the identified object.)

                                 No occlusion in images      Occlusion in images
                                 M113    M060    M901        M113    M060    M901
    Has Occlusion Reasoning
        M113                      1       0       0           1       0       0
        M060                      0       2       0           0       2       0
        M901                      1       0       2           0       0       3
    No Occlusion Reasoning
        M113                      1       0       0           1       0       0
        M060                      0       2       0           2       0       0
        M901                      0       0       3           1       1       1


FIG. 8. Results for four of the six occlusion images. In each subimage, the model orientation is shown as well as the location in each sensor. For the color image, white features are those predicted to be visible, and gray is the expected occlusion region.

For each model, the RMR algorithm is run using a series of incorrect coregistration estimates derived from the true coregistration. Specifically, the RMR algorithm is initialized independently from 10 distinct coregistration estimates formed by adding noise to the true coregistration. Each of the eight true coregistration terms is perturbed as follows: the azimuth angle by ±25° and the five translation parameters by up to ±0.75 m.⁴ This approximates the magnitude of errors introduced by an indexing algorithm discussed in [16, 24].

⁴ For a sense of scale, at the average depth of the vehicles in all of the images, a 0.75 m translation amounts to 11 pixels in the color image and 5 pixels in the range image. This may not seem like many pixels, but for this dataset the mean number of pixels on the object is 100 in color and 45 in range.


Since there are 12 images, 3 models, and 10 initial coregistration estimates, there are a total of 360 distinct runs of the RMR algorithm. For each image, the model with the lowest match error is said to be the object present in the scene. Consequently, we will evaluate the performance of the RMR algorithm in terms of how well it identifies the correct object both with and without the occlusion reasoning.

Table 2 shows the confusion matrices for identification. When there is no occlusion in the scene, the occlusion reasoning variant of the RMR algorithm misidentified only one vehicle. This misidentification failed to distinguish the highly similar M113 and the M113 variant with the missile launcher on the roof (the M901). When there was occlusion in the scene, the same algorithm did not misidentify any of the vehicles. This is significant since the variant of the RMR algorithm without occlusion reasoning was only able to correctly identify two of the six occluded vehicles.

6.1. Specific Examples

Figure 8 shows the results of the occlusion reasoning variant of the RMR algorithm on four of the six occlusion images. Within each subimage, the object orientation determined is shown as well as the match to the color image and the match to the range image. The white lines show those edges of the object predicted to be visible, and the gray area is the predicted occlusion. It is obvious in several of the scenes that there is a large amount of object occlusion which the algorithm is successfully able to predict and compensate for during matching. Note that these are final coregistrations and are hence of high quality. The search algorithm arrived at these using incorrect initial coregistration hypotheses as outlined above.

7. CONCLUSION

We hope the contribution of this work will be viewed at two levels. On a practical level, we present a system which advances the state of the art for model-based object recognition in the context of multisensor ground-based target recognition. It illustrates two new concepts which combine to allow 70 to 80% correct identification of low resolution objects subject to significant amounts of terrain occlusion [16]:

• Dynamic prediction of object appearance based upon 3D object geometry, multisensor data, and the surrounding terrain environment.
• An iterative refinement algorithm capable of converging upon the optimal 3D scene configuration.

To our knowledge, no other system has demonstrated comparable performance. For at least one alternative approach using only range data there is strong evidence to suggest the recognition problems presented here are unsolvable [17].

Beyond the specifics of the task presented, our goal has been to amplify the call by others [2] for more work on open-loop object recognition.


It is our conviction that in many domains more reliable object identification depends upon the ability to render, match, and refine hypotheses to conform to scene specific constraints. We give again as our example the problem of occlusion. Treating occlusion as a random event averaged over all possible scene configurations is at best a weak method. It misses the fundamental point that in any given scene, for any given point on an object, that point either is or is not occluded. Whenever possible, recognition should seek out evidence that supports a specific scene configuration and reduce uncertainty. As we have illustrated, the presence of colocated range and optical imagery makes this process relatively straightforward. What to do when working only with optical imagery remains an excellent topic of research. Our current work is addressing these issues in the context of multiple object recognition using only color imagery [27].

ACKNOWLEDGMENTS

We thank Pradip Srimani for connecting the equation for the number of occlusion patterns with the sum of products equation in Knuth. We also thank Kris Seijko for his insights into LADAR target recognition.

REFERENCES

1. A. R. Pope and D. G. Lowe, Learning object recognition models from images, in Early Visual Learning (T. Poggio and S. Nayar, Eds.), 1995. Available at http://www.cs.ubc.ca/spider/pope/home.html.
2. P. J. Besl and R. C. Jain, Three-dimensional object recognition, ACM Comput. Surveys 17, 1985, 75–145.
3. J. R. Beveridge, Local Search Algorithms for Geometric Object Recognition: Optimal Correspondence and Pose, Ph.D. thesis, University of Massachusetts at Amherst, May 1993.
4. J. R. Beveridge, D. P. Panda, and T. Yachik, November 1993 Fort Carson RSTA Data Collection Final Report, Technical Report CSS-94-118, Colorado State University, Fort Collins, CO, January 1994.
5. J. R. Beveridge, E. M. Riseman, and C. Graves, Demonstrating polynomial run-time growth for local search matching, in Proceedings: International Symposium on Computer Vision, Coral Gables, Florida, November 1995, pp. 533–538, IEEE Computer Society Press, Los Alamitos, CA.
6. J. E. Bevington, Laser Radar ATR Algorithms: Phase III Final Report, Technical report, Alliant Techsystems, Inc., May 1992.
7. T. M. Breuel, Fast recognition using adaptive subdivision of transformation space, in CVPR, pp. 445–451, June 1992.
8. O. I. Camps, L. Shapiro, and R. Haralick, Image prediction for computer vision, in Three-Dimensional Object Recognition Systems, Elsevier Science Publishers, Amsterdam, 1993.
9. T. A. Cass, Polynomial-time object recognition in the presence of clutter, occlusion, and uncertainty, ECCV 92, 1992, 834–842.
10. R. O. Duda and P. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
11. J. Edwards and H. Murase, Appearance matching of occluded objects using coarse-to-fine adaptive masks, in IEEE Conference on Computer Vision and Pattern Recognition, 1997.
12. W. E. L. Grimson, Object Recognition by Computer: The Role of Geometric Constraints, MIT Press, Cambridge, MA, 1990.
13. W. E. L. Grimson, The effect of indexing on the complexity of object recognition, in Third International Conference on Computer Vision, pp. 644–651, IEEE Computer Society Press, Los Alamitos, CA, December 1990.
14. A. Hoogs and R. Bajcsy, Model-based learning of segmentations, in International Conference on Pattern Recognition, Vienna, August 1996, Vol. 4, pp. 494–499, IAPR, IEEE.
15. D. P. Huttenlocher and S. Ullman, Recognizing solid objects by alignment, in Proc. of the DARPA Image Understanding Workshop, Cambridge, MA, April 1988, pp. 1114–1124, Morgan Kaufmann, San Mateo, CA.


16. J. R. Beveridge, B. Draper, M. R. Stevens, K. Siejko, and A. Hanson, A coregistration approach to multisensor target recognition with extensions to exploit digital elevation map data, in Reconnaissance, Surveillance, and Target Acquisition for the Unmanned Ground Vehicle (O. Firschein, Ed.), pp. 231–265, Morgan Kaufmann, San Mateo, CA, 1997.
17. J. G. Verly and R. T. Lacoss, Automatic target recognition for LADAR imagery using functional templates derived from 3-D CAD models, in Reconnaissance, Surveillance, and Target Acquisition (RSTA) for the Unmanned Ground Vehicle (O. Firschein, Ed.), Morgan Kaufmann, San Mateo, CA, 1997.
18. Z. Zhang, J. R. Beveridge, M. R. Stevens, and M. E. Goss, Approximate Image Mappings between Nearly Boresight Aligned Optical and Range Sensors, Technical Report CS-96-112, Computer Science, Colorado State University, Fort Collins, CO, April 1996.
19. D. E. Knuth, The Art of Computer Programming, 2nd ed., Computer Science and Information Processing, Vol. 1, Addison–Wesley, Reading, MA, 1973.
20. K. Ohba and K. Ikeuchi, Recognition of the Multi-Specularity Objects Using the Eigen-Window, Technical Report CMU-CS-96-105, School of Computer Science, Carnegie Mellon University, February 1996.
21. Y. Lamdan and H. J. Wolfson, Geometric hashing: A general and efficient model-based recognition scheme, in Proc. IEEE Second Int. Conf. on Computer Vision, Tampa, December 1988, pp. 238–249.
22. D. G. Lowe, The viewpoint consistency constraint, Int. J. Comput. Vision 1, 1987, 58–72.
23. M. R. Stevens and J. R. Beveridge, Precise matching of 3-D target models to multisensor data, IEEE Trans. Image Process. 6, 1997, 126–142.
24. M. R. Stevens, C. W. Anderson, and J. R. Beveridge, Efficient indexing for object recognition using large networks, in Proc. 1997 IEEE International Conference on Neural Networks, pp. 1454–1458, June 1997.
25. J. A. Nelder and R. Mead, A simplex method for function minimization, Comput. J., 1965.
26. S. K. Nayar, S. A. Nene, and H. Murase, Real-time 100 object recognition system, in Proceedings of ARPA Image Understanding Workshop, Morgan Kaufmann, San Mateo, CA, 1996. Available at http://www.cs.columbia.edu/CAVE/rt-sensors-systems.html.
27. M. R. Stevens, Reasoning about Object Appearance in the Context of a Scene, Ph.D. thesis, Colorado State University, 1999.
28. W. M. Wells, Statistical Object Recognition, Ph.D. thesis, Massachusetts Institute of Technology, 1993.
29. M. D. Wheeler and K. Ikeuchi, Sensor modeling, Markov random fields, and robust localization for recognizing partially occluded objects, IUW 93, 1993, 811–818.