Signal Processing 32 (1993) 217-255 Elsevier
Distributed spatial reasoning for multisensory image interpretation

G.L. Foresti, V. Murino, C.S. Regazzoni and G. Vernazza
Department of Biophysical and Electronic Engineering (DIBE), University of Genova, via all'Opera Pia 11A, I-16145 Genova, Italy
Received 15 June 1991 Revised 31 January 1992
Abstract. A hierarchical distributed system is presented, which interprets 3D scenes through the fusion of multisensory images. The recognition problem is partitioned into a set of less complex subproblems by associating with each representation level expert processing units that filter out unreliable solutions and focus attention on promising ones. In this way, the search space for possible solutions is limited in a distributed way, as a priori knowledge about observations and constraints is used at multiple levels. Different instances of the same inference mechanism are applied at each level. As a consequence, each processing unit is able to search autonomously for a local solution in order to contribute to obtaining a globally consistent solution. An important characteristic of the system is that it is easy to maintain and extend. The results reported have been obtained by using multisensory images of real scenes considered in the context of an autonomous-driving application. Two examples of the interpretation of 3D road scenes are given, and the distribution of the computational load is discussed.
Keywords. Knowledge-based systems; interpretation; geometric reasoning; data fusion.
Correspondence to: Professor G. Vernazza, DIBE, Facoltà di Ingegneria, Università di Genova, via all'Opera Pia 11A, I-16145 Genova, Italy.
0165-1684/93/$06.00 © 1993 Elsevier Science Publishers B.V. All rights reserved
1. Introduction

Automatic interpretation of 3D images is a data-abstraction process performed by matching features extracted from an image with predefined models. The overall recognition process is often distributed over different abstraction levels (a hierarchical structure), and the final result is a more or less detailed description of the scene considered.

Many researchers have focused their investigations on developing sound theoretical methods for deriving information from images, privileging bottom-up procedures. According to these methods, the vision process is based on sequential modules; each module maps features to a new level at which information is more synthetic and easier to capture [3, 23]. This approach, named the 'computational' approach, has led to interesting numerical mathematical models that, even though they allow the generation of powerful tools for explaining the early human vision process, are currently applicable in real environments only separately, at different steps of an automatic recognition system. This is because the computational approach uses appropriate modules to encompass many realistic situations, and each module is well suited to a specific task; however, the information-flow sequence that can be modelled mathematically is too deterministic, as it is not flexible enough to represent the different strategies necessary to consider complex real scenes.

Other researchers have used tools, like those derived from Artificial Intelligence (AI) [5, 14, 24], for the reasoning process about image features and object characteristics, on the basis of both general and domain-specific knowledge (the Knowledge-Based (KB) approach). Multilevel abstraction provides a natural subdivision of the interpretation process; specialized processes, called knowledge sources (KSs), allow the mapping of information into new information at and between the various levels.
This approach, applied to the vision process, combines image-processing techniques (low-level vision) with knowledge-based manipulation of symbolic structures and models (high-level vision) to measure the matching between data and models (e.g., by fuzziness or certainty factors). The different knowledge sources provided by the symbolic approach for the interpretation process (e.g., heuristic-procedural rules, general knowledge about the visual world, specific knowledge related to the application domain, etc.) can be utilized to generate new hypotheses and to manipulate and compare data and models, thus achieving a more flexible architecture. However, the KB approach, also named the 'symbolic' approach, even when integrated with more conventional ones (i.e., statistical and syntactic methods), shows some drawbacks in terms of processing time: too-frequent backtracking steps are usually required to understand complex images. Moreover, the 'symbolic' input data on which the reasoning is based must be directly related to features of the images to be processed. A KB approach developed without explicitly defining the characteristics of the computational modules involved in feature extraction may often result in inefficient cooperation strategies between system components.

The purpose of this paper is to present a system under development at DIBE which efficiently integrates computational with symbolic methods in the presence of different sensors; the architecture of the system is organized into multilevel data and multilevel models. Such an architecture is mainly justified by the necessity of providing, in the near future, suitable techniques for developing systems to be used in real applications. According to the combined approach, each 'computational module' is controlled and activated by a KB inference control structure. The employment of explicit control knowledge, rather than knowledge contained in the code of a conventional sequential program, allows the realization of a more flexible structure, which is easy to manage and modify. The application chosen to test the proposed architecture is the interpretation of 3D outdoor road scenes.
In spite of the relative simplicity of the images considered, the paper shows that it is feasible to perform distributed spatial reasoning oriented to scene recognition inside the architecture, thus proving the effectiveness of the cooperation strategies chosen for the
management of the computational modules. Moreover, this application has provided interesting results on the matching of CAD models to image data.

To capture the different KSs, we employ a distributed blackboard [5], organized into a hierarchical structure. Each KS applies declarative and procedural domain knowledge in order to manage the different computational modules at a given level, to activate the matching process, and to control the information flow towards a higher level of abstraction. The advantages of employing a distributed blackboard are many and well known (e.g., thanks to modularity, modifications to each KS are easily feasible and do not affect the overall system; see also [24]). Blackboard structures have been widely used to solve many problems involving signal-into-symbol transformations [14, 23, 24]. VISIONS [14] represents a well-known example of an image-understanding system with a blackboard architecture. An improved version (the SCHEMA system [8]) is based on the idea that a general-purpose system should consist of a set of special-purpose systems. In this paper, we do not use this criterion, as we prefer to apply the distributed blackboard model to provide a general scheme able to improve the different computational steps by supporting them with higher degrees of adaptiveness and controllability. This means that the goal of our work is somehow complementary to that of the VISIONS system, as it aims at exploiting general (parameterized) computational strategies so that the same system can be used for the recognition of different objects. An approach more similar to the one proposed in this paper was presented by Pau [21], who proposed a three-level KB decision structure for multisensory data interpretation: it was based on measurements in the vector space, on structural graph matching, and on a rule-based representation of a priori knowledge about the application domain. In comparison with Pau's approach, we have tried to maintain a more uniform control structure at all levels of the system architecture by defining a reasoning mechanism which can be applied to the low-level, middle-level, and high-level processing steps.
Other points addressed in the paper refer to the knowledge required by specific KSs in the hierarchical structure:
- Geometric reasoning [16]: the problem of regularizing the 2D-into-3D transformation when passing from the viewer-centered image level to the object-centered reference system (where object models are described) is addressed in more detail. A hypothesis-and-test mechanism for determining object poses is described as a particular case of the more general problem (faced by any KS) of transforming lower-level observations into local observations.
- Multisensory-data fusion: two simple fusion techniques are proposed, to integrate edges and regions extracted from an image acquired by a single sensor, and to fuse surfaces, acquired by different physical sensors, through geometric reasoning. Such techniques are not as accurate as those developed on a more statistical basis (e.g., [1]), but they can be regarded as examples of the general role of the fusion process in the proposed architecture. In this sense, the most correct technique is based on the use of networks of logical sensors, as proposed by Henderson [15].
Despite the simplicity of the fusion algorithms used, the proposed combined approach is very appropriate for the interpretation of multisensory data when dealing with: (a) 'not easily overlapping' sensors (e.g., IR and TV cameras); (b) 3D environments acquired from multiple views; (c) low-correlated sensors (each channel is considered autonomous up to the highest level).
In Section 2, the multilevel system is presented, and in Section 3 a description of its application to spatial reasoning for road-scene interpretation is provided. Recognition and interpretation performance results are reported in Section 4; finally, general conclusions are drawn in Section 5.
2. Representation levels of data and models

In a system where models are represented at a single level of abstraction, the problem of interpreting scenes by means of a sensor can be defined as
the problem of assigning a label (chosen from among a set of predefined ones) to an observation provided by the sensor. The labelling process implies an image transformation (like the ones provided by computational modules) which allows the system to extract features and to represent them symbolically. The symbolic descriptions obtained after feature extraction are then matched with the a priori knowledge represented by the models. Describing object properties by using propositional lists of symbols makes it easier to fill the knowledge base of the recognition system and, consequently, increases the system's portability. The observations to be matched with the models may be provided either directly by a physical sensor or by an intermediate computational module.

The labelling problem requires a solution space S to be defined. A label l_j can be associated with a specific object O_j by taking into account a priori knowledge about the application domain; we denote by O_j an object model containing a symbolic description of the intrinsic and relational properties of the object in the application domain. The set of object models O = {O_j, j = 1, ..., J} to be recognized for a certain application corresponds (one to one) to the set of labels L = {l_j, j = 1, ..., J}. The symbol z_k stands for a generic element of the set of observations provided by a computational module, e.g., a pixel, an edge, etc. The set of observations is indicated as Z = {z_k, k = 1, ..., K}. An association pair AP = (l_j, z_k) is the link between an object and the observation that supports it. In the most general case, assigning a label l_j to an observation z_k by using only intrinsic object properties means comparing all the elements of O with all the elements of Z (i.e., the complexity is O(J x K)) and selecting the best pair AP* by using the metrics provided by a matching measure.
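As a minimal illustration of this single-level labelling step, the exhaustive O(J x K) comparison can be sketched as follows. The object models, observations, and matching measure below are invented for the example; the system itself uses the fuzzy matching measure described later, not this simple attribute distance.

```python
# Sketch of single-level labelling: compare every object model with every
# observation (O(J x K)) and keep the best association pair AP* = (l_j, z_k).
# Models and observations are hypothetical dicts of normalized attribute values.

def match_score(model, observation):
    """Crude matching measure: 1 / (1 + sum of absolute attribute distances)."""
    shared = set(model) & set(observation)
    distance = sum(abs(model[a] - observation[a]) for a in shared)
    return 1.0 / (1.0 + distance)

def best_pair(models, observations):
    """Exhaustive O(J x K) search for the best association pair (l_j, z_k)."""
    best = (None, None, -1.0)
    for label, model in models.items():        # J object models
        for k, z in enumerate(observations):   # K observations
            s = match_score(model, z)
            if s > best[2]:
                best = (label, k, s)
    return best

models = {"road": {"area": 0.8, "elongation": 0.9},
          "tree": {"area": 0.2, "elongation": 0.4}}
observations = [{"area": 0.75, "elongation": 0.85},   # z_0: road-like region
                {"area": 0.25, "elongation": 0.35}]   # z_1: tree-like region

label, k, score = best_pair(models, observations)
print(label, k)   # prints: road 0
```

Finding a full solution (a consistent set of such pairs, with relational constraints) is what drives the complexity up to O(J^K), as noted next.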
A solution S to a single-level recognition problem can be defined as a set of association pairs, that is, S = {AP_s, s = 1, ..., S}; in such a solution, the relational constraints that depend on the application domain must also be satisfied. Accordingly, the number of configurations to be explored in order to find the optimal solution, according to any matching measure, rises
up to O(J^K). The problem of finding the best solution in a direct way is clearly intractable when O and Z become too large. Usually, recognition systems try to find a sub-optimal solution by means of heuristic strategies, in order to avoid exploring all the possible configurations in the solution space. These strategies can be embedded in the procedural code of a computational algorithm [17]. Otherwise, it is possible to use explicitly represented knowledge about the application domain, as in knowledge-based systems [7, 24].

2.1. The multilevel-data multilevel-model approach
The multiple-data multiple-model (MDMM) approach proposed in this paper allows one to represent both models and signal data at different levels of abstraction, and introduces a local decisional capability at each level. To this end, a distributed semantic description of the outdoor world must be provided at all the possible representation levels of the system. A basic strategy for reaching this goal is to define a uniform representation method and a uniform inference mechanism to be applied at each system level. These simplify communications between levels and facilitate the solution of conflicts generated by autonomous local decisions.

2.1.1. General description
In the MDMM approach, the labelling goal is pursued by dividing the solution space into multiple representation levels. The solution found at each level must be locally consistent in order to be propagated to higher levels. If we denote by p, p = 1, ..., P, each level of the solution space, a local solution LS^p can be defined by analogy with the solutions found in a single-level system. A local solution LS^p = {AP_s^p, s = 1, ..., S^p} consists of a set of association pairs AP_s^p = (l_{j,s}^p, z_{k,s}^p). A model O_j^p can be regarded as a set of relational and intrinsic semantic constraints which must be
satisfied by a set of observations Z^p = {z_k^p, p = 1, ..., P, k = 1, ..., K^p}. A processing unit (module) is associated with each representation level; each module at level p takes as input the observations Z^{p-1} provided as output by the units at level p-1, and transforms them into the simple local observations SZ^p. Then, these observations are matched with the intrinsic constraints associated with the elements of the local set O^p; such object models were activated during a previous top-down prediction phase. The observations satisfying the intrinsic constraints are then submitted to a grouping procedure, which produces the output observations Z^p. Before transmitting such observations to the higher level, a new matching phase is performed, which verifies whether the relational constraints among simple observations are satisfied or not. Only the Z^p that match the object models are proposed to the higher level as a possible solution.

The system is characterized by two information flows: a top-down symbolic information flow contains predictions about the object models that must be activated at each level, and a bottom-up numerical information flow is associated with the observations generated by each module, viewed as a virtual sensor. A distributed blackboard approach [5], using a set of expert subsystems which cooperate by exchanging top-down and bottom-up messages, has been adopted to implement the multilevel system. Knowledge sources (KSs) with a uniform type of knowledge representation and a uniform reasoning scheme are associated with the solution subspaces. Such KSs utilize local expert knowledge (1) to transform lower-level observations into local observations, and (2) to group observations according to local criteria.
Each KS is viewed as an Intelligent Sensor which can be influenced by the constraints imposed by higher modules and by the observations received from lower modules, and which can in turn influence decisions inside the network by performing predictions, groupings, matchings and transformations locally. The local techniques that can be employed at each system level have been outlined in some previous works [4, 9], which also give descriptions of specific modules. More precisely, in [4], low-level visual sensors (i.e., an edge extractor and a region extractor) are described, and attention is focused on the interpretation level and on the various strategies that can be adopted to recognize single objects. In [9], the role of the geometric reasoning module is pointed out: this module uses a priori knowledge and hypothesis-and-test mechanisms to solve the 2D-to-3D conversion from the sensor-centered reference system (i.e., an image) to the object-centered reference system used at the interpretation level. In [6, 19], implementation details are provided about the object-oriented system (called DOORS) used as a shell for the present application.
2.1.2. Local knowledge base

In this section, prototypical descriptions of the elements of the model and observation spaces at each system level p are provided. Figure 1 shows the organization of the basic elements of models and observations.

Fig. 1. Layout of declarative knowledge representation as a multilevel frame tree of complex and simple hints and attributes.

Attributes. A set of intrinsic and relational constraints is associated with each model O_j. Each constraint is represented inside the system as a fuzzy membership function [19, 25] to be applied to the values assumed by a set of attributes F^p. The set F^p = {f} represents the intrinsic and relational properties of each observation z_k^p provided by a computational module at level p. Examples of intrinsic attributes are the area of a region, the length of an edge, the shape of a surface, etc. Relational attributes can be represented as links between two observations (e.g., a shared perimeter and the contrast between two regions). Attributes can be either simple, FS, or complex, F, depending on their dependence on the computation of other attributes.

Intrinsic and relational hints. Hints, h_{u,f}, are semantic descriptions of intrinsic and relational attributes f. A hint is a symbol associated with a membership function mapping the value assumed by an attribute f into a fuzzy membership value in the range [0, 1]. The set of intrinsic hints related to attribute f, say H_f, can be defined as H_f = {h_{u,f} : u = 1, ..., U_f}. In this way, the distance of the value of an attribute f from its semantic description h_{u,f} can be measured by means of the membership function stored as a priori knowledge. For example, the attribute 'area' of a region can be classified as small, medium or large, i.e., U_f = 3, which corresponds to possible object descriptions at the region level. Hints can be either simple or complex. Complex hints operate on the output of the fuzzy membership functions applied by simple ones. In Fig. 1, one can see that hints and attributes form multilevel trees where the a priori knowledge of the system can be stored.

Interlevel projections in the model space. The possible interlevel relationships among the basic elements of the description of object models must be given as a priori knowledge to the system. Basic elements must be provided for each possible virtual sensor inside the distributed system.
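The 'area' example above can be sketched as follows. The trapezoidal membership shapes and breakpoints are invented for illustration; the paper does not specify the form of its membership functions.

```python
# Hypothetical fuzzy hints for the intrinsic attribute 'area' of a region.
# Each hint maps the attribute value into a membership value in [0, 1].

def trapezoid(a, b, c, d):
    """Trapezoidal membership function with feet a, d and shoulders b, c."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)   # rising edge
        return (d - x) / (d - c)       # falling edge
    return mu

# Hint set H_area = {small, medium, large}, i.e. U_f = 3 (breakpoints invented).
H_area = {
    "small":  trapezoid(-1.0, 0.0, 0.1, 0.3),
    "medium": trapezoid(0.1, 0.3, 0.5, 0.7),
    "large":  trapezoid(0.5, 0.7, 1.0, 2.0),
}

area = 0.2  # normalized area of a hypothetical region
memberships = {hint: mu(area) for hint, mu in H_area.items()}
print(memberships)  # 'small' and 'medium' are both partially supported
```

An attribute value near a breakpoint thus supports two hints with intermediate membership, which is what lets the matching phase rank candidate descriptions rather than accept or reject them outright.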
Starting from these relations, it is possible to dynamically build up a multilevel description constituted by a set of virtual objects, that is, O_j = {O_j^p, p = 1, ..., P}. Interlevel relations between a hint, h_{u,f}^p, and one or more intrinsic hint(s) at level p-1 are referred to as Hint-Projection (HP_{u,f}) relations. A Hint-Projection relation is a one-into-many mapping operation of a hint h_{u,f}^p into a set of V_g hints at the lower level, i.e., HP_{u,f} = {h_{x,g}^{p-1} : f = T^{p-1}(g), x = 1, ..., V_g}. HP_{u,f} is a subset of H^{p-1}; V_g is the number of possible mappings of the hint, and g is the feature which is used at level p-1 in order to compute feature f. A real-valued weight w_x, in the range [0, 1], is associated with each possible mapping x of a hint, to represent the mapping probability. Given these basic interlevel relationships, a virtual object model O_j^p = {h_{u,f_i}^p, u in [1, U_f], i = 1, ..., I_j, f_i in F^p} can be mapped into a set of possible descriptions at level p-1, say O_j^{p-1} = {h_{x,g}^{p-1} : h_{x,g}^{p-1} in HP_{u,f}, h_{u,f}^p in O_j^p}. Despite its complexity, each object-model description exhibits the interesting property that the problem of describing attributes and hints at each level is separated from the problem of deciding the hint projections to be used between levels. This means that the same hint can be associated with more than one virtual object model, thus considerably reducing the amount of local knowledge to be represented while maintaining a high descriptive power. Moreover, suitable strategies can be designed that allow one to dynamically change the weights assigned to multiple hint mappings between levels. They can be used, for example, to backtrack from errors when the hypothesis about a viewpoint for the searched object does not lead to recognition of the object.

2.1.3. The reasoning mechanism

Each processing unit inside the hierarchical structure of the recognizer is regarded as a virtual sensor, both when the unit is associated with a physical device and when it uses an intermediate-level algorithm.
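A Hint-Projection table and the weighted one-into-many mapping can be sketched as follows. The hints and weights are invented; selecting the highest-weight mapping per hint (x* = arg max_x w_x) is the choice the top-down rule makes when projecting a virtual object model to the lower level.

```python
# Toy Hint-Projection table: a hint h_{u,f} at level p maps into one or more
# weighted hints at level p-1 (hint names and weights w_x are invented).

HINT_PROJECTIONS = {
    # level-p hint -> list of (level-(p-1) hint, weight w_x in [0, 1])
    "flat-surface":  [("large-region", 0.9), ("low-contrast-region", 0.6)],
    "long-boundary": [("long-edge", 0.8), ("collinear-edge-chain", 0.7)],
}

def project_model(model_hints):
    """Map a virtual object model (its level-p hints) into its most plausible
    level-(p-1) description, keeping x* = argmax_x w_x for each hint."""
    projected = {}
    for hint in model_hints:
        candidates = HINT_PROJECTIONS.get(hint, [])
        if candidates:
            best_hint, _w = max(candidates, key=lambda hw: hw[1])
            projected[hint] = best_hint
    return projected

road_model = ["flat-surface", "long-boundary"]   # hints of a hypothetical model
print(project_model(road_model))
```

Because the alternative mappings and their weights stay in the table, a backtracking strategy can later demote a failed mapping and retry with the next-best one, as the text suggests for wrong viewpoint hypotheses.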
Three types of observations are required by the reasoning mechanism at each level p to symbolically represent the input/output data provided by the computational modules: lower-level observations, Z^{p-1}, simple local observations, SZ^p, and complex observations, Z^p. A virtual sensor at level p takes as input a set of observations Z^{p-1,d}, d = 1, ..., D^{p-1}, from all the lower-level virtual sensors d at level p-1. The virtual sensor obtains as output new local observations Z^p, after performing some processing steps. Two main computational operations are performed to obtain the local observations: transformation and grouping. Each input observation belonging to Z^{p-1} is transformed into a local set of simple observations SZ^p. Such observations are then grouped together. After each operation, a matching phase is activated, which can be viewed as a filter that eliminates uninteresting observations. Intrinsic hints are used to filter the results of the local transformations, and relational hints are used to eliminate uninteresting groups of observations. The output observations Z^p consist of groups of locally transformed observations which have passed both matching phases.
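The per-level cycle just described (transformation, intrinsic-hint filtering, grouping, relational-hint filtering) can be sketched as follows. The observations, hint functions, and threshold are invented for the example; the worst-case min scoring anticipates the fuzzy max-min matching rule defined in Section 2.1.3.2.

```python
# Sketch of one virtual sensor's processing cycle at level p (all data invented):
# Z^{p-1} -> transform -> SZ^p -> intrinsic match -> group -> relational match -> Z^p

def intrinsic_score(obs, intrinsic_hints):
    """Worst-case (min) fuzzy membership over all intrinsic hints."""
    return min(mu(obs) for mu in intrinsic_hints)

def process_level(z_prev, transform, intrinsic_hints, group, relational_ok,
                  threshold=0.5):
    # 1) transformation: lower-level observations -> simple local observations
    sz = [transform(z) for z in z_prev]
    # 2) first matching phase: intrinsic hints filter out unreliable observations
    sz = [o for o in sz if intrinsic_score(o, intrinsic_hints) >= threshold]
    # 3) grouping: simple observations -> candidate complex observations
    groups = group(sz)
    # 4) second matching phase: relational hints filter the groups
    return [g for g in groups if relational_ok(g)]

# Invented example: observations are normalized region areas.
z_prev = [0.05, 0.4, 0.45, 0.9]
result = process_level(
    z_prev,
    transform=lambda z: z,                        # identity transformation
    intrinsic_hints=[lambda o: min(1.0, 2 * o)],  # 'large enough' membership
    group=lambda sz: [sz],                        # one group of all survivors
    relational_ok=lambda g: len(g) >= 2,          # group needs mutual support
)
print(result)  # only the three larger regions survive, as a single group
```

Only `result` (the Z^p observations that passed both filters) would be proposed to the level above; an empty result would instead trigger the negative bottom-up answer and backtracking described below.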
2.1.3.1. The multilevel recognition strategy
Among the possible matching schemes and strategies for integrating the multilevel solutions LS^p, a model-driven strategy has been selected for the current application. This means that each module at level p tries to drive the lower-level modules at level p-1 that are linked to it. To this end, messages are used which contain symbols related to the hints to be locally activated. These hints are utilized to activate local transformations and to filter out uninteresting observations. The reasoning cycle is performed in two steps: a top-down activation and a bottom-up reaction.

Top-down phase. A top-down mechanism controls the hierarchical activation of object models. A message from level p to level p-1 addresses a set of hints which correspond to a possible description of the searched object in terms of the features extracted by a computational module at level p-1. The global top-down inference mechanism is referred to as rule D0. The following steps are performed at each level p (see Fig. 2).

Fig. 2. Flow diagram for the top-down reasoning mechanism.

(1) Activation of a set of local hints, according to the received message, and generation of a bottom-up information flow if observations to be transformed are available from the lower levels. This step consists in activating a virtual object model O_j^p in reply
to the top-down message, which corresponds to the set of locally activated hints.
(2) When input observations are not directly available (i.e., it is not yet possible to activate the local computational module), the top-down message must be propagated to the lower levels. The Hint-Projections attached to the locally activated hints are used to create the virtual object model O_j^{p-1} to be fired, in the following way:

O_j^{p-1} = {h_{x*,g}^{p-1} : h_{x,g}^{p-1} in HP_{u,f}, x* = arg max_x w_x, h_{u,f}^p in O_j^p}.

This process is iterated until observations are found to be available at a level p*.

Bottom-up phase. As a reaction to the top-down mechanism, the computational modules can produce a bottom-up information flow. This flow is controlled by another inference mechanism (see Fig. 3) at each level; this mechanism involves five main steps:
(1) Acquisition of the observations Z^{p-1} from the lower level, and activation of the local computational
module, which transforms the input observations into local symbols SZ^p. This step is modelled by rule D1: Z^{p-1} -> SZ^p.
(2) Generation of a set of simple observations SZ_j^p = {sz_k^p, k = 1, ..., SK_j^p} which match the intrinsic constraints associated with the locally fired virtual object model O_j^p. This is equivalent to generating a set of local association pairs (l_j, sz_k^p) that are candidates for being part of LS^p. This step is modelled by rule D2: SZ^p -> SZ_j^p.
(3) Observations coming from different sensors d, Z^{p-1,d}, d = 1, ..., D^{p-1}, must be suitably fused at level p. In the described system, the observations SZ_j^{p,d}, d = 1, ..., D^{p-1}, that passed the first matching phase are fused. Fusion of the observations after matching improves the robustness of the system by creating alternative paths to reach the solution. Rule D3 transforms simple observations into fused simple observations: D3: SZ_j^{p,d} x SZ_j^{p,e} -> SZ_j^p, d, e = 1, ..., D^p, d != e. The fused observations are matched again with the intrinsic models to verify the fusion results.
(4) Grouping of the SZ_j^p observations to obtain Z^p = {z_k^p, k = 1, ..., K^p}; the grouping step is modelled by rule D4: SZ_j^p -> Z^p.
(5) Matching of the grouped observations Z^p with the relational constraints of O_j^p to find a restricted set Z_j^p of grouped observations. The association pair (l_j, z_{j,k}^p) is added to the local solution LS^p = {(l_j, z_{j,k'})_s, k' = k(s), s = 1, ..., S}. Relational hints are evaluated by considering those observations, z_{i,k} and z_{j,k'}, for which constraints are provided inside either object model O_i^p or O_j^p. The association pairs that best match such constraints are selected as part of the local solution and are passed to the higher level, i.e., they are sent to level p+1. This step can be modelled by rule D5: Z^p -> Z_j^p.
The transformation, grouping and fusion phases are performed by specific computational algorithms at the different system levels; the control of these algorithms is implemented by a set of production rules in LISP, i.e., the language used inside the DOORS [13, 19] system. The computational algorithms at the lower levels are implemented as C functions, and are activated by the system.

2.1.3.2. Matching
Matching is performed at all system levels according to fuzzy-set theory [25]. The following general max-min rule, FM (Fuzzy Match), is used by both rules D2 and D5 to obtain the matching measures:

FM: z_{k*} = arg max_k S(z_k^p), with S(z_k^p) = min_{h_{u,f} in O_j^p} h_{u,f}(f_m^k),
where O;={h"~1 .ae[1, . . ., UAL Z°= (4, k= 1, K) and zk={fl~„m=1, . . , M} . To build up an association pair (1;, zk .), the observation z k. is selected that obtains the highest score when matched with the hints of a virtual object model . The score S(zk .) to be assigned to each observation is computed according to a conservative, worst-case criterion : the degree of matching of the hint hv,f of the virtual object model Op that has obtained the minimum fuzzy membership value as compared with the attributes f5, of the observation 4 is assigned to each observation zk before the maximization step . 3. Spatial reasoning as a multilevel transformation In this section, we show how the problem of spatial reasoning in the presence of multiple sensors can be modelled according to the formulation given in the previous section . This explanation highlights the aspects involved in the projection and transformation operations performed between the 3D level of object descriptions and the 2D level of image features. A description of the dynamic processing performed by the system through all the representation levels can be given by pointing out the links existing between the computational algorithms managed by the KSs and the symbolic features provided by such algorithms to the recognition process . Attention is focused on a specific
Fig . 3. Flow diagram for the bottom-up reasoning mechanism .
Vol . 32, Nos . 1-2, May 1993
Table 1
Specific transformation, fusion, and grouping actions performed at each system level

Level  Transformation                     Fusion                               Grouping
1      Analogic-into-digital conversion   None                                 Feature extraction by grouping pixels
2      Feature characterization           None                                 Feature grouping
3      Closure extraction &               Overlapping in the 2D image space    None
       characterization
4      2D-into-3D geometric               Overlapping in the 3D reference      Dynamic hypothesis tree
       transformation                     system                               formation & maintenance
5      None                               Object subpart semantic labelling    Dynamic hypothesis tree
                                                                              formation & maintenance
6      Object semantic labelling          None                                 Dynamic hypothesis tree
                                                                              formation & maintenance
level of the system : the Geometric Reasoner KS is described, which is devoted to predicting possible aspects of a 3D surface and to performing the 2D-into-3D transformations of the observations obtained from lower-level modules . Two physical visual sensors (e.g., infrared and b/w cameras) provide the input data .

3.1. Representation levels

Six representation levels (i.e., P = 6) are considered in the current application (see Fig . 4) ; the simple observations, complex observations and hints used at each level are presented in Table 2, and the computational algorithms related to transformation, fusion and grouping are given in Table 1 . (1) Measurable physical quantities related to an observed scene are acquired and digitized . The physical devices, which are responsible for transforming external analogical information into digital images, are represented inside the Image Analyzer KS ; this KS also contains the computational algorithms necessary to group pixels into descriptive primitives (i.e., edges and regions) . Two sensors, namely a b/w camera and either an infrared sensor or a synthetic CAD image generator, are the physical devices used for the current experiments . (2) Edges and regions provided by the Image Analyzer are characterized and transformed separately by the Edge and Region Analyzers into simple local observations (i.e., frames) ; symbolically represented descriptive primitives are grouped by using production rules . (3) Groups of edges are transformed into closures by the Virtual Descriptive Primitive Analyzer associated with each physical device . A closure is a symbol that represents a closed polygonal patch supported by different image features ; no grouping occurs at this level, that is, closures are directly identified with a surface of an object, and grouping is performed at higher levels . (4) Closures are transformed into 3D surfaces by the Geometric Reasoner KS ; at this level, fusion
Table 2
Specific simple observations, complex observations, and hint prototypes used at each system level

Level p  Module                                  Simple OBS. SZ^p        Complex OBS. Z^p             Hint type
1        Image Analyzer                          Pixel                   Features (edges & regions)   No hint
2        Descriptive Primitive Analyzer          Characterized features  Compound features            2D edge/region hint
3        Virtual Descriptive Primitive Analyzer  Closure                 Fused closure                2D hint
4        Geometric Reasoner                      Surface                 Group of surfaces            3D hint
5        Object Detector                         Object subpart          Object description           Object subpart
6        Hypothesis Manager                      Object                  Scene description            Object
is performed when multiple surfaces are provided by different physical sensors . The computational algorithms at this level produce simple observations, called 3D-closures, which are then grouped by means of a simplified A* algorithm [7, 20] operating on a search space represented by sets of 3D-closures that must satisfy relational constraints on the searched object [9] . The fusion process consists of two steps : first, the surface characteristics are transformed from the 3D reference system associated with a sensor into a global reference system ; then, a simple overlapping of
Fig . 4. Multilevel organization of representation levels in the current application : Knowledge-Sources (modules), observations and hints are detailed .
resulting surfaces is sufficient to obtain the fused surface required . (5) The Object Detector module transforms groups of surfaces into object subparts by attaching to them a context-dependent semantic label ; object subparts are grouped together, and relational constraints on the searched object have to be satisfied ; an A* algorithm is used at this level, too . (6) Groups of object subparts are labelled as objects at the Hypothesis Manager level ; objects are grouped by the local A* algorithm to form scenes, which correspond to lists of association pairs between objects and sets of multilevel observations satisfying relational constraints among objects . A scene is a global solution to the recognition problem . In addition to the above modules, the Situation Judge KS is shown in Fig . 4 ; it is devoted to selecting different module-activation strategies . There are P = 6 representation levels for which the top-down and bottom-up inference mechanisms outlined in the previous sections have to be devised . However, the Image Analyzer module is not provided with hints to drive and judge acquisition and feature-extraction results (i.e., pixel grouping) . A possible way to do so is to employ intrinsic models of the observed images, where relational constraints between pixels are explicitly used as parts of energy functions to be minimized ; Gibbs-Markov Random Field theory [12] is one of the techniques we are studying for this purpose . Each of the modules can be activated by either a hint or a datum, i.e., each module works according to the type of message received (Top-Down, TD, or Bottom-Up, BU) . When a generic module receives a message containing a list of hints corresponding to a virtual object and no local features are present at the module's level, a new list of lower-level hints is generated according to the top-down reasoning mechanism . When, instead, lower-level observations arrive at a module, the described bottom-up inference mechanism is activated .
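The message-driven activation just described (TD hint propagation when no local features exist, BU matching when observations arrive) can be sketched as follows. This is an illustrative simplification of our own: the function names, the message format and the dictionary-based modules are invented, and do not reproduce the DOORS implementation.

```python
# Hypothetical sketch of a module's control loop: a TD message carrying
# hints triggers top-down hint projection when no local observations
# are available, while BU messages (or TD messages arriving where local
# observations already exist) trigger bottom-up matching.
def handle_message(module, message):
    if message["type"] == "TD":
        if module["observations"]:
            # Local features exist: match them against the received hints.
            return ("BU", module_match(module, module["observations"]))
        # No local features: project the hints one level down.
        return ("TD", project_hints(message["hints"]))
    if message["type"] == "BU":
        return ("BU", module_match(module, message["observations"]))
    raise ValueError("unknown message type")

def project_hints(hints):
    # Placeholder projection: each hint expands to its lower-level hints.
    return [low for h in hints for low in h.get("lower_hints", [])]

def module_match(module, items):
    # Placeholder matching: keep items whose score passes the threshold.
    return [i for i in items if i.get("score", 0.0) >= module["threshold"]]
```

A TD message reaching a module with an empty blackboard thus keeps descending until instantiated observations are met, mirroring the flow of Fig. 3.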
In the following sections, the mapping of this general functioning at the different system levels is detailed .
In Appendix A, an example of virtual objects and observations progressively instantiated in the system is given, which can be used by the reader in order to better understand the information flow .
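To make the information flow more concrete, the max-min matching rule FM of Section 2.1.3.2 can be sketched as follows. The hint names and membership functions below are invented for the example, and the code is our own illustration rather than the system's LISP rules.

```python
# Illustrative sketch of the FM rule: each observation z_k is scored by
# the MINIMUM fuzzy membership over all hints of the virtual object
# model (conservative, worst-case criterion), and the observation with
# the highest score is selected (max-min).
from typing import Callable, Dict, List, Tuple

Hint = Tuple[str, Callable[[float], float]]  # (attribute name, membership fn)

def fuzzy_match(hints: List[Hint],
                observations: List[Dict[str, float]]) -> Tuple[int, float]:
    """Return the index of the best observation and its score S(z_k)."""
    best_idx, best_score = -1, -1.0
    for k, z in enumerate(observations):
        score = min(mu(z[attr]) for attr, mu in hints)  # worst-case hint
        if score > best_score:
            best_idx, best_score = k, score
    return best_idx, best_score

# Hypothetical hints on edge length and direction.
hints = [("length", lambda v: min(1.0, v / 100.0)),
         ("direction", lambda v: max(0.0, 1.0 - abs(v - 90.0) / 90.0))]
obs = [{"length": 80.0, "direction": 85.0},
       {"length": 120.0, "direction": 30.0}]
k, s = fuzzy_match(hints, obs)  # first observation wins with S(z_k) = 0.8
```

The same scoring scheme is reused at every level; only the hints and the observed attributes change.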
3.2. The recognition process

The recognition process starts with a bottom-up phase . The Situation Judge selects the 'initialization' goal when no observations are present at any level of the blackboard, and no messages have been sent to any module . Consequently, a start message is sent to the Image Analyzer to perform the initialization phase . After this phase, the actual searching process starts, i.e., a message is sent to the Hypothesis Manager module . The object to be searched for during the current searching cycle is selected at this level . A list of 3D hints associated with the object is sent to the Object Detector level ; the top-down reasoning mechanism iterates by selecting new hints to be propagated towards lower modules . When the Edge or Region Analyzer level is reached, the BU information flow starts, as observations instantiated during the initialization phase are present at this level .
3.2.1. Initialization phase

The Image Analyzer is a purely bottom-up module which transforms the images provided by the physical sensors into edges and regions, without using any task-dependent hint . This is due to the fact that the transformation process from external physical quantities into measures associated with pixels is performed through the analogical acquisition process inherent in the physical device . Moreover, no hint-driven pixel-grouping algorithm (i.e., region and edge extraction) is currently available . The control rules fired after a message has been received by the Image Analyzer activate Region-Growing (RG) algorithms (i.e., Gibbs MRF region growing [12]) and Edge-Detection (ED) algorithms (i.e., the LoG edge detector) on the images to be processed . Pixels represent the elementary observations SZ^1, and the algorithms utilized represent computational grouping processes applied by type-D4 rules to the images provided by the different sensors . Two input sensors are used in the current application, i.e., D^1 = 2 . The same grouping operators are applied to both sensor channels . Two intrinsic images of edges and regions are the output complex observations Z^1 to be sent to the Edge Analyzer and Region Analyzer modules . Z^1 also contains a pair of label images (i.e., 'label pictures'), where the assignment of each pixel to a given edge or region group is specified by means of an integer label . The Edge and Region Analyzers process the BU messages received during the initialization phase by following the same reasoning ; first, they transform input observations into local simple observations ; then, they send an 'observations ready' message to the Situation Judge . The first step is performed by type-D1 rules, which fire symbolic characterization algorithms . Such rules first activate local algorithms that, taking label pictures as input, characterize each region or edge with a set of attributes . A frame, linked by an is-a relation to a feature prototype present in the local long-term memory, is instantiated for each feature ; it contains all the information about the values of the intrinsic and relational attributes of the feature . Each frame is identified by a name indicating the type of descriptive primitive, the sensor by which the primitive has been obtained, and a progressive number (e.g., ir-edge-25) . For each edge, the attribute characterization algorithm [7] determines : initial and final points, length, direction, etc. ; for each region : barycenter, average grey level, minimum bounding rectangle, etc . The initialization phase ends when the Situation Judge has received a 'ready' message from all the Descriptive Primitive Analyzers . All the information about edges and regions is now available to the system in the form of frames, and a start message can be sent to the Hypothesis Manager module in order to start the object searching cycle .
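The attribute-characterization step just described can be sketched as follows. This is an illustrative reconstruction: the attribute set matches the one listed above, but the function names and data layout are ours.

```python
# Sketch of attribute characterization: an edge, given as a chain of
# pixel coordinates, is summarized by its initial/final points, length
# and mean direction; a region, given as a set of pixels, by its
# barycenter and minimum bounding rectangle.
import math

def characterize_edge(chain):
    (x0, y0), (xn, yn) = chain[0], chain[-1]
    # Curve length: sum of distances between consecutive chain pixels.
    length = sum(math.hypot(x2 - x1, y2 - y1)
                 for (x1, y1), (x2, y2) in zip(chain, chain[1:]))
    # Mean direction approximated by the chord from first to last pixel.
    direction = math.degrees(math.atan2(yn - y0, xn - x0)) % 360.0
    return {"initial-final-points": ((x0, y0), (xn, yn)),
            "length": length,
            "mean-direction": direction,
            "number-of-points": len(chain)}

def characterize_region(pixels):
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return {"barycenter": (sum(xs) / len(xs), sum(ys) / len(ys)),
            "minimum-bounding-rectangle": (min(xs), min(ys), max(xs), max(ys))}

edge = characterize_edge([(0, 0), (3, 4), (6, 8)])       # length 10.0
region = characterize_region([(0, 0), (2, 0), (0, 2), (2, 2)])
```

Each resulting dictionary corresponds to the slots of an instantiated frame such as the one in Table 13.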
3.2.2. Dynamic object search cycle

The search cycle consists of two phases : first, the objects to be searched for are chosen, and a TD hint propagation occurs through the modules .
Table 3
Virtual Object model frame at the Hypothesis Manager level : it is composed of the two 'hints' related to the road and house descriptions

HYP-ROAD-SCENE-1
SLOT                  VALUE                              TYPE
ISA                   (ROAD-SCENE HYPOTHESIZED-SCENE)    link
LEVEL                 6                                  attribute
SON-OF                ROOT                               link
PARENT-OF             nil                                link
HYP-VISUAL-VIEWPOINT  (HYP-VIEWPOINT-1 HYP-VIEWPOINT-2)  link
HYP-CAD-VIEWPOINT     (HYP-VIEWPOINT-3 HYP-VIEWPOINT-4)  link
COMPOSED-BY           (ROAD HOUSE)                       link
OBJ-STATUS            (FOUND SEARCH)                     attribute
HYP-STATUS            OPEN                               attribute
Then, a BU answering phase sends to the higher levels replies about the presence of observations that satisfy certain constraints . TD propagation implies that different hypotheses about virtual object models are instantiated at the various levels, depending on the hint-projection mapping between levels .

3.2.2.1. Top-down phase

The Hypothesis Manager (HM) module starts by choosing the object to be searched for among the ones available for the application considered, i.e., autonomous land driving . The choice criterion is based on the a priori probability assigned to each object within the related frame . After selecting the new object to be searched for, HM instantiates a Hypothesized-Scene frame (see Table 3) in the local blackboard ; such a frame represents a node of a hypothesis tree where track is kept of all the objects already recognized during the interpretation process . A node is said to be active if it contains an object labelled 'searched' in the status slot . A node is said to be open if it is active or contains not-yet-searched objects and its fuzzy recognition value is above a given threshold . Backtracking [7] is performed by activating open nodes within the tree when the belief value of the currently active node is below the recognition threshold . An A* algorithm [20] is used to manage the open and closed nodes of the hypothesis tree . The Object Detector (OD) module is informed about the selected object through a TD message containing the name of the object to be searched for . A Hypothesized-Object frame is instantiated (Table 4) from the prototype stored in the local knowledge base, which represents a virtual object model at this level : this frame contains the symbolic names of the object subparts, each made up of one or more surfaces . A hypothesis tree for object subparts is also maintained at the OD level . The order in which subparts are searched for depends on the related degree of necessity . A message containing a list of surfaces to be found is sent to the Geometric Reasoner (GR) module in order to activate the recognition of each subpart (Table 5) . The GR progressively stores in the blackboard instantiated frames of other local virtual objects, called Hypothesized-Surface (HS) frames, as shown in Table 6 . These frames inherit from the prototype a list of 3D hints constraining the attribute values of the surface in the reference system of the currently searched object . If an object subpart consists of more surfaces, a status frame is maintained in the context of a hypothesis tree similar to the ones for objects and object subparts at the higher levels . A Hypothesized-Viewpoint (HV) frame, as shown in Table 18, is instantiated by selecting the most probable among a set of prototypical object aspects stored in the GR Knowledge Base . An HV frame is a relation between the frames of the reference systems of the object and of the sensor, characterized by means of the corresponding rototranslation matrix . The list of 2D hints with the related
Table 4
Virtual Object related to the hypothesized object 'house' at the Object Detector level : object subparts represent hints at this level

HYP-HOUSE-1
SLOT                  VALUE                                                          TYPE
ISA                   (HOUSE HYPOTHESIZED-OBJECT)                                    link
LEVEL                 5                                                              attribute
SUBPARTS?             T                                                              attribute
SUBPARTS              ((frontal-wall 0.7) (lateral-wall 0.4) (roof 0.1))             link
STATUS                ((frontal-wall found 0.7) (lateral-wall not-yet-searched nil)
                      (roof not-yet-searched nil))                                   attribute
HYP-VISUAL-VIEWPOINT  (HYP-VIEWPOINT-1 HYP-VIEWPOINT-2)                              link
HYP-CAD-VIEWPOINT     (HYP-VIEWPOINT-3 HYP-VIEWPOINT-4)                              link
SURFACES              ((house-frontal-surface) (house-lateral-surface)
                      (roof-right-surface roof-left-surface))                        attribute
REFERENCE-SYSTEM      house-reference-system                                         link
Table 5
Virtual Object received by the Geometric Reasoner : hints are described in terms of surfaces of the subpart

HYP-FRONTAL-WALL-1
SLOT                  VALUE                              TYPE
ISA                   HYP-SET-OF-SURFACES                link
HYP-VISUAL-VIEWPOINT  (HYP-VIEWPOINT-1 HYP-VIEWPOINT-2)  link
HYP-CAD-VIEWPOINT     (HYP-VIEWPOINT-3 HYP-VIEWPOINT-4)  link
SURFACES              3D-FRONTAL-SURFACE                 attribute
weights is automatically stored in the 2d-hints slot of the current HS . To this end, three steps are necessary : first, the 3D patch provided by the selected model is projected onto the 2D image plane of the sensor by using the rototranslation matrix of the current HV, and is stored in the <sensor>-2D-projected-closure slot of the HS frame, where <sensor> indicates the name of the current sensor . Then, the 3D hints of the HS are considered : for each 3D hint (see Table 7), the list of 2D hints contained in its has-lower-<sensor>-hints slot is examined, and the hint related to the same attribute that is characterized by the maximum fuzzy value is stored in the <sensor>-Hint-Projection slot . If a single surface or a group of found surfaces does not satisfy the constraints, either a new hypothesized viewpoint can be instantiated from the list of available ones, or the object is considered not present . Accordingly, either new hints are propagated to all lower-level sensors or a negative answer is sent to the OD module . During the TD cycle, the GR also selects (once for each object) the integration strategy, i.e., it ranks the lower-level sensors according to their possible contributions to the recognition of the current surface . The choice criterion for selecting the guide-sensor is the highest average fuzzy value of the 2D hints stored in the <sensor>-Hint-Projection slots of the current HS . Thanks to this mechanism, a surface is mapped
Table 6
Internal virtual object instantiated at the Geometric Reasoner level : hints describing 3D surface properties are used as basic elements at this level

HYP-3D-FRONTAL-SURFACE-1
SLOT                         VALUE                                                   TYPE
ISA                          3D-FRONTAL-SURFACE                                      link
HYP-VISUAL-VIEWPOINT         (HYP-VIEWPOINT-1 HYP-VIEWPOINT-2)                       link
HYP-CAD-VIEWPOINT            (HYP-VIEWPOINT-3 HYP-VIEWPOINT-4)                       link
LEVEL                        4                                                       attribute
COEFF-OBJ                    ((0 0 1 -7))                                            attribute
COEFF-VISUAL-SENSOR          ((0 0.254 0.9755 -27.254))                              attribute
COEFF-CAD-SENSOR             ((0 0.25 0.96 -27.25))                                  attribute
3D-HINTS                     ((fused-3D-rectangular-surface)
                             (fused-extensive-surface-area))                         link
VISUAL-2D-HINTS              (((2D-vertical-straight-parallel-closure 0.8)
                             (2D-vertical-straight-convergent 0.2))
                             ((2D-fused-medium-area 0.7) (2D-fused-little-area 0.3)))  attribute
CAD-2D-HINTS                 (((2D-straight-horizontal-convergent 0.6)
                             (2D-horizontal-straight-parallel 0.4))
                             ((2D-fused-medium-area 0.8) (2D-fused-little-area 0.2)))  attribute
CAD-2D-PROJECTED-CLOSURE     ((90 130) (90 230) (140 130) (140 230))                 attribute
VISUAL-2D-PROJECTED-CLOSURE  ((75 123) (98 216) (153 107) (141 236))                 attribute
VISUAL-HINT-PROJECTION       (2D-vertical-straight-parallel-closure 0.80
                             2D-fused-medium-area 0.70)                              link
CAD-HINT-PROJECTION          (2D-straight-horizontal-convergent 0.60
                             2D-fused-medium-area 0.70)                              link
STATUS                       SEARCHED                                                attribute
ON-SENSOR                    (VISUAL-SENSOR 0.75 CAD-SENSOR 0.70)                    attribute
Table 7
Example of a 3D hint at the Geometric Reasoner level : the frontal wall surface is described as a rectangular planar surface

FUSED-3D-RECTANGULAR-SURFACE
SLOT                    VALUE                                      TYPE
ISA                     3D-HINT                                    link
LEVEL                   4                                          attribute
JUDGEMENT-ON            shape-factor                               link
FZ-OPERATOR             fz-3D-rectangular-function                 link
FZ-THRESHOLD            0.75                                       attribute
HAS-LOWER-VISUAL-HINTS  (2D-vertical-straight-parallel-closure
                        2D-vertical-straight-convergent)           link
HAS-LOWER-CAD-HINTS     (2D-straight-horizontal-convergent
                        2D-horizontal-straight-parallel)           link

into a set of 2D hints related to each sensor in a dynamic and adaptive way, and a guide sensor is selected too ; then, a virtual object model is sent to the lower level (Table 8) . Modules at the Virtual Descriptive Primitive Analyzer (VDPA) level of all sensor channels consider the received set of hints as virtual object models related to closures . The VDPA maps this
Table 8
Virtual object at the Virtual Descriptive Primitive Analyzer level of the visual sensor

HYP-VISUAL-2D-FRONTAL-CLOSURE-1
SLOT                    VALUE                                     TYPE
ISA                     HYP-CLOSURE                               link
LEVEL                   3                                         attribute
2D-EDGE-HINT            (edge-straight-parallel 0.90)             link
2D-REGION-HINT          (region-straight-parallel 0.85)           link
STATUS                  SEARCHED                                  attribute
2D-HINT                 (2D-vertical-straight-parallel-closure
                        2D-fused-medium-area)                     link
EDGE-HINT-PROJECTION    (edge-straight-parallel 0.90)             link
REGION-HINT-PROJECTION  (region-straight-parallel 0.85)           link
ON-SENSOR               VISUAL-SENSOR                             attribute
Table 9
Example of 2D hint used to describe a characteristic of the virtual object in Table 8

2D-VERTICAL-STRAIGHT-PARALLEL-CLOSURE
SLOT                   VALUE                       TYPE
ISA                    (2D-HINT SIMPLE-HINT)       link
LEVEL                  3                           attribute
HAS-LOWER-EDGE-HINT    (edge-straight-parallel)    link
HAS-LOWER-REGION-HINT  (region-straight-parallel)  link
FZ-OPERATOR            fz-parallel-function        attribute
FZ-THRESHOLD           0.75                        attribute
JUDGEMENT-ON           shape-factor                link
SENSOR                 VISUAL-SENSOR               attribute
description into different hints, depending on the type of Descriptive Primitive Analyzer (DPA) . A priori expectations about the dependability of the computation of a given hint by the Edge Analyzer (EA) and by the Region Analyzer (RA) are described inside the has-lower-edge-hint and has-lower-region-hint slots of the prototypical 2D hint frames at the VDPA level (Table 9) ; they make it possible to decide which Analyzer is to be activated . The system can choose to activate the most appropriate lower-level Analyzer for a given object according to the globally highest dependability ; the other Analyzer module is asked for a confirmation only in a subsequent phase . After receiving a virtual object (Table 10), the selected Descriptive Primitive Analyzer can perform the transformation and grouping operations necessary to satisfy the hints indicated in the received message, i.e., the BU distributed search process can start thanks to the presence of local
Table 10
Virtual object related to the Edge Analyzer module . This object is composed of a complex hint

HYP-VISUAL-EDGES-1
SLOT           VALUE                     TYPE
ISA            HYP-EDGE-CLOSURE          link
LEVEL          2                         attribute
STATUS         SEARCHED                  attribute
2D-EDGE-HINTS  (edge-straight-parallel)  attribute
SENSOR         VISUAL-SENSOR             attribute
simple observations created during the initialization phase, and the matching procedure can be activated . The bottom-up phase of the cycle is described in the next subsection .

3.2.2.2. Bottom-up phase

Descriptive primitive analyzer level . At the EA level, the edges that satisfy the hints contained in a received message are searched for in the local
Table 11
Complex hint contained in the virtual object of Table 10

EDGE-STRAIGHT-PARALLEL
SLOT          VALUE                        TYPE
ISA           (2D-EDGE-HINT COMPLEX-HINT)  link
LEVEL         2                            attribute
FZ-OPERATOR   fz-AND                       attribute
FZ-THRESHOLD  0.75                         attribute
HINTS         (rectilinear parallel)       link
Table 12
Example of a simple relational hint ('parallel'), used to describe the complex hint of Table 11

PARALLEL
SLOT             VALUE                       TYPE
ISA              (2D-EDGE-HINT SIMPLE-HINT)  link
LEVEL            2                           attribute
GROUPING-METHOD  (parallel-rules-task)       attribute
FZ-OPERATOR      fz-parallel-function        attribute
FZ-THRESHOLD     0.75                        attribute
JUDGEMENT-ON     direction                   link
blackboard . The hints received at this level (Table 11) are usually composed of other, simpler hints (Table 12) . This is due to the complexity of the descriptions required to individuate edge and region groups . The order in which simple hints are stored in the has-simple-hints slot of the complex one corresponds to the order used to activate the simple hints . First, hints related to intrinsic edge properties (e.g., rectilinearity, length, etc.) are checked . Then, grouping algorithms are activated on the subset of simple observations satisfying such intrinsic hints ; these algorithms are represented inside the grouping-method slots (Table 13) of the simple relational 2D hints (e.g., collinear, convergent, etc.) contained in the local virtual object ; the resulting complex observations (i.e., edge groups) are matched with the 2D relational hints by means of their fuzzy-membership functions . A fuzzy value is associated with each complex observation ; it represents the degree of matching between the observation and the virtual object model . Concerning edges, hints typically refer to curvatures, whereas regions are examined
Table 13
Example of edge instantiated during the initialization phase within the Edge Analyzer module

VISUAL-EDGE-45
SLOT                  VALUE                   TYPE
ISA                   simple-observation      link
LEVEL                 2                       attribute
SENSOR                visual-edge-analyzer    attribute
SUPPORTING-EDGE       edge-image-1            link
ADJACENT-EDGE         ((62 1) (56 2) (32 5))  attribute
INITIAL-FINAL-POINTS  ((115 54) (149 66))     attribute
MEAN-DIRECTION        285                     attribute
ELONGATION            0.12                    attribute
LENGTH                38.56                   attribute
NUMBER-OF-POINTS      35                      attribute
CURVATURE             0.5                     attribute
IS-RECTILINEAR        true                    attribute
in terms of luminance and shape . For example, in the case of the road search of Fig . 10, the composed hint 'straight-convergent-symmetric' is instantiated at this level ; first, rectilinear edges are extracted ; next, rectilinear edges convergent to the upper central portion of the image are searched for, consistently with the chosen viewpoint ; then, the convergence angle of the edge pairs is checked by means of the relational hint . The resulting edge groups are filtered by symmetry checking . In this way, edges that form an angle with the central image axis and that are consistent with the hypothesized viewpoint are selected . Collinear edges are added by means of another grouping method . Each grouping method is applied by means of a different set of production rules [9] . Rules related to different relational hints (e.g., convergent-edges) are activated in sequence after a sorting operation . Each set of rules can be viewed as a virtual sensor which follows a different grouping strategy (e.g., symmetry with respect to an axis, convergence to an image point or to an image area, parallelism, collinearity, circularity, low contrast, regular shape, etc.) . After each grouping phase, the related hint is checked, and the edge groups that do not satisfy it are discarded . Similar procedures are also used to search for other objects . As shown by the example, the grouping
phase can be split into several phases, which are performed in series so as to obtain a single description of the complex observation that best matches the virtual object model .

Virtual descriptive primitive level . The set of grouped edges is sent to the VDPA of the currently investigated sensor channel, which transforms it into a closure ; to this end, a method stored in the virtual object model prototype is used, which finds a closed polygonal patch starting from an incomplete, piecewise continuous description of its contours . The method has to compute the set of vertexes of the closure by considering the grouped edges coming from the lower module . First, the equations of the straight lines corresponding to the edges are found ; then, different strategies are applied to obtain the vertex points through the intersection of such lines . For example, in the case of the road search, the coordinates of the closure vertexes are computed by using the intersections between the convergent straight lines and between each line and the image borders (see Fig . 5) . The resulting closure is assumed to represent the projection of a plane surface on the image plane, and it is matched with a local virtual object model . Then, the process continues, and the Region Analyzer (RA) module is used to verify whether the obtained closure is adequately supported by the regions contained in it ; to this end, a list of hints is sent to the Region Analyzer module, containing a description of the 2D view of the road in terms of regions . This description is driven by the closure obtained by the EA, in the sense that an enlarged mask based on the vertexes of the obtained closure is also sent to the RA, in order to focus its attention on a precise area of the image . This module activates a local region-grouping algorithm [2, 9] based on a Bayesian labelling of regions, and provides the VDPA with a new set of observations, separately transformed into another closure . A fusion rule is activated, and the new parameters of the fused closure (Table 14) are computed [9] . A weighted average is used to compare the judgements of the sensors and to decide whether the observation is sufficiently supported by both sensors . The result is the instantiation of a frame containing all the information about the closure, i.e., the point coordinates, the area, the edges and regions supporting it, the barycenter, and so on . A new matching phase is activated on the fused closure in order to decide whether the obtained observation satisfies the local hints . If the match is successful, the list of closure points is propagated to the GR module .
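The vertex-computation strategy described above can be sketched with homogeneous 2D line coordinates. This is our own reconstruction of the idea, not the system's code, and the sample border points are invented.

```python
# Sketch of closure-vertex computation: straight lines are represented
# by homogeneous coefficients (a, b, c) with a*x + b*y + c = 0, and a
# closure vertex is the intersection of two such lines (intersections
# with the image borders are handled in exactly the same way).
def line_through(p, q):
    # Line through two points, via the cross product of (x, y, 1) vectors.
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)

def intersect(l1, l2):
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-9:
        return None  # parallel lines: no finite vertex
    return ((b1 * c2 - b2 * c1) / det, (a2 * c1 - a1 * c2) / det)

# Two convergent road borders meeting above the image top edge.
left = line_through((0, 100), (40, 0))
right = line_through((100, 100), (60, 0))
vertex = intersect(left, right)  # (50.0, -25.0)
```

Clipping the two lines against the image-border lines with the same `intersect` routine yields the remaining closure vertexes, as in Fig. 5.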
Table 14
Simple observation instantiated at the VDPA level : regions and edges supporting the obtained closure on the visual channel are shown

DETECTED-VISUAL-CLOSURE-1
SLOT                       VALUE                                        TYPE
ISA                        complex-observation                          link
LEVEL                      3                                            attribute
SENSOR                     visual-sensor                                attribute
SUPPORTING-EDGE-CLOSURE    (visual-edge-45 visual-edge-62)              link
SUPPORTING-REGION-CLOSURE  (visual-region-25 visual-region-52
                           visual-region-63 visual-region-75
                           visual-region-77)                            link
CLOSURE                    ((81 153) (92 234) (121 122) (115 200))      attribute
CONVERGENCE-ANGLE          nil                                          attribute
AREA                       376                                          attribute
BARYCENTER                 (98 172)                                     attribute

Fig . 5. Description of the transformation at the VDPA level (p = 3) in order to obtain a closure from the edges individuated at the Edge Analyzer level .
Remaining levels . No grouping occurs at the VDPA level, as closures are mapped in a one-to-one way with the searched surfaces . Hence, the obtained fused closure, i.e., the list of closure points, is directly sent to the Geometric Reasoner . The presence of a BU message containing a new observation fires the local transformation rule . A priori knowledge about the hypothesized viewpoint, as instantiated during the TD cycle, is used to solve the 2D-to-3D transformation of the 2D closure vertexes into surface vertexes . The coordinates of the vertexes of the obtained surface are expressed in two different reference systems : the object reference system is used to allow a direct match with the surfaces of the object model, while the global one is used as a common reference system in which fusion operations are performed . The simple observation produced by the transformation at the GR level is shown in Table 15 : it contains the edges, the regions, and the coordinates of the vertexes of the polygonal closure region . The computational processing steps performed at the GR level are detailed in the next section . The GR successively applies its processes for all the available sensors (visual, IR or CAD camera) and produces 3D surfaces . A fusion step is necessary to integrate the surfaces extracted from the two sensors into a single 3D surface representation . To this end, local

Table 15
Simple observation representing the detected frontal wall surface at the GR level

DETECTED-3D-SURFACE-1
SLOT              VALUE                                          TYPE
ISA               COMPLEX-OBSERVATION                            link
LEVEL             4                                              attribute
ASSOCIATED-WITH   HYP-3D-FRONTAL-SURFACE-1                       link
3D-COORDINATES    ((24 16 9) (41 70 44) (112 35 11) (99 40 51))  attribute
WIDTH             5.15                                           attribute
HEIGHT            5.90                                           attribute
REFERENCE-SYSTEM  HOUSE-REFERENCE-SYSTEM                         link
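The reference-system changes performed around the fusion step (sensor to global, and back through the stored inverse frame) amount to applying 4x4 homogeneous rototranslation matrices. The sketch below is our own; the matrix reproduces the rounded values of Table 17, so it is only approximately orthonormal.

```python
# Sketch of a reference-system change: a 3D point in sensor coordinates
# is mapped into the global reference system by a 4x4 homogeneous
# rototranslation matrix; the reverse mapping uses the matrix stored in
# the companion RT-from-global-to-<sensor> frame.
def apply_rt(matrix, point):
    x, y, z = point
    vec = (x, y, z, 1.0)
    out = [sum(matrix[r][c] * vec[c] for c in range(4)) for r in range(4)]
    # Normalize by the homogeneous coordinate (1 for rototranslations).
    return (out[0] / out[3], out[1] / out[3], out[2] / out[3])

# Rounded rototranslation values taken from Table 17.
RT_VISUAL_SENSOR_TO_GLOBAL = [
    [1.0,  0.0,   0.0,  0.0],
    [0.0,  0.96, -0.25, 1.0],
    [0.0, -0.25,  0.96, 0.0],
    [0.0,  0.0,   0.0,  1.0],
]

p_global = apply_rt(RT_VISUAL_SENSOR_TO_GLOBAL, (0.0, 1.0, 0.0))
# approximately (0.0, 1.96, -0.25)
```

Composing such a matrix with the one of the Hypothesized Viewpoint yields the object-centered coordinates in which the 3D hints are checked.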
Table 16
Complex observation representing a house subpart sent from the Geometric Reasoner module up to the Object Detector

DETECTED-SUBPART-1
SLOT              VALUE                                          TYPE
ISA               COMPLEX-OBSERVATION                            link
LEVEL             4                                              attribute
ASSOCIATED-WITH   HYP-FRONTAL-WALL-1                             link
3D-COORDINATES    ((24 16 9) (41 70 44) (112 35 11) (99 40 51))  attribute
WIDTH             5.15                                           attribute
HEIGHT            5.90                                           attribute
REFERENCE-SYSTEM  HOUSE-REFERENCE-SYSTEM                         link
simple observations (i.e., surfaces) are projected into the global reference system from the one associated with each sensor . Knowledge of the set-up of the two cameras allows this transformation ; the rototranslation matrix between the reference systems concerned is stored in the rototranslation-matrix slot of the related ref-system-relation frame, called the RT-from-<sensor>-to-global frame (Table 17) . Another step, required after fusion, is the conversion of the fused surface into the 3D reference system related to the object model, where it is possible to directly check the local hints ; to this end, it is necessary to return to the sensor reference system by means of the inverse transformation, RT-from-global-to-<sensor>, and of the Hypothesized Viewpoint . Complex observations related to groups of different surfaces are progressively built in this reference system (Table 16) . The transformation at the Object Detector level attaches to each surface group a semantic label that identifies an object subpart . Moreover, a grouping method based on an A* algorithm [20] manages the hypothesis tree built from the recognized subparts by verifying the consistency of the spatial relations among them . Finally, the transformation at the Hypothesis Manager level (p = 6) labels groups of objects as recognized objects, and a grouping method builds up the recognized scene . Three types
G.L . Foresti et al. / Distributed spatial reasoning
Table 17
Example of ref-system-relation frame containing the rototranslation matrix from the visual sensor to the global reference system

RT-FROM-VISUAL-SENSOR-TO-GLOBAL
SLOT                    VALUE                                                    TYPE
ISA                     REF-SYSTEM-RELATION                                      link
SOURCE-FRAME            VISUAL-SENSOR-REFERENCE-SYSTEM                           link
TARGET-FRAME            GLOBAL-REFERENCE-SYSTEM                                  link
ROTOTRANSLATION-MATRIX  ((1 0 0 0) (0 0.96 -0.25 1) (0 -0.25 0.96 0) (0 0 0 1))  attribute
SENSOR                  VISUAL-SENSOR                                            attribute
INVERSE                 RT-FROM-GLOBAL-TO-VISUAL-SENSOR                          link
3.3. Geometric reasoner

At this level (p=4), the system transforms a closure provided by a VDPA into a surface. The attributes of the obtained surface are matched with the 3D hints of the virtual object that is searched for. As one can see in Fig. 4 (see also Table 2), the module devoted to these tasks is the Geometric Reasoner. In the following, the role of this module is described in more detail.
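The matching of a detected surface's attributes against the 3D hints of a virtual object can be illustrated with a small sketch. The paper describes model surfaces as carrying a fuzzy description of their minimum and maximum dimensions; a trapezoidal membership function is one plausible choice for such a description. All names, bounds, and the threshold below are illustrative, not the system's actual Lisp frames:

```python
def trapezoidal_membership(x, lo_min, lo_full, hi_full, hi_max):
    """Fuzzy membership: 0 outside [lo_min, hi_max], 1 inside
    [lo_full, hi_full], linear ramps in between."""
    if x <= lo_min or x >= hi_max:
        return 0.0
    if lo_full <= x <= hi_full:
        return 1.0
    if x < lo_full:
        return (x - lo_min) / (lo_full - lo_min)
    return (hi_max - x) / (hi_max - hi_full)

def match_surface(surface, hints, threshold=0.5):
    """Accept a detected surface only if every hinted attribute has
    sufficient fuzzy membership; return the overall (minimum) score."""
    scores = [trapezoidal_membership(surface[attr], *bounds)
              for attr, bounds in hints.items()]
    score = min(scores)
    return score if score >= threshold else None

# Width/height of the detected frontal wall (cf. Table 16) checked
# against illustrative fuzzy bounds for a 'house frontal wall' model.
wall_hints = {"width": (3.0, 4.5, 6.5, 8.0), "height": (3.0, 4.0, 7.0, 9.0)}
print(match_surface({"width": 5.15, "height": 5.90}, wall_hints))  # → 1.0
```

Taking the minimum over attribute scores mirrors the conjunctive role of intrinsic hints: a surface that badly violates any one hint is filtered out rather than averaged up.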
3.3.1. Reference systems
The concept of reference system is explicitly
of frame-tree (for the object subparts, the objects, and the scene) are considered in order to maintain a consistent recognition status . In the next section, a detailed description of the procedure to transform a closure into a 3D surface is provided, while the reader is referred to [19] for further details on upper levels .
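The ref-system-relation frames described in this section reduce, computationally, to 4x4 homogeneous rototranslation matrices stored together with their inverses. A minimal sketch in Python (not part of the original Lisp implementation; class and field names are illustrative):

```python
import numpy as np

class RefSystemRelation:
    """Sketch of a ref-system-relation frame: a 4x4 rototranslation
    matrix linking a source to a target reference system, paired with
    its inverse frame (cf. Table 17)."""
    def __init__(self, source, target, matrix):
        self.source, self.target = source, target
        self.rototranslation_matrix = np.asarray(matrix, dtype=float)

    def apply(self, point):
        """Map a 3D point from the source to the target reference system."""
        p = np.append(np.asarray(point, dtype=float), 1.0)
        return (self.rototranslation_matrix @ p)[:3]

    def inverse(self):
        """The paired RT-from-<target>-to-<source> relation."""
        return RefSystemRelation(self.target, self.source,
                                 np.linalg.inv(self.rototranslation_matrix))

# Sensor tilted by an angle theta_s and mounted at height h_s,
# following the special-case matrix form of Fig. 8(b).
theta_s, h_s = np.deg2rad(14.5), 1.70
rt = RefSystemRelation("VISUAL-SENSOR", "GLOBAL", [
    [1, 0, 0, 0],
    [0, np.cos(theta_s), -np.sin(theta_s), 0],
    [0, np.sin(theta_s),  np.cos(theta_s), h_s],
    [0, 0, 0, 1]])
p_global = rt.apply([0.0, 0.0, 0.0])   # sensor origin expressed globally
print(p_global)  # → [0. 0. 1.7]
```

Storing the inverse explicitly, as the paper does, avoids recomputing it on every BU/TD traversal of the same relation.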
represented by a frame prototype within the recognizer. There are two main categories of reference systems, which correspond to different instances of the prototype: 3D (sensor-centered, object-centered and scene-centered) and 2D (image-level) reference systems (see Figs. 6 and 7). The direct relation between reference systems and its inverse are represented as a pair of frames (ref-system-relation) (Table 17): a rototranslation matrix is stored in these frames, together with the indication of the source and target reference systems. The reference systems used are:
- Global reference system (X, Y, Z, scene centered); in this case, it has been chosen coincident with the road reference system.
Fig. 6. Sensor, object (road) and global reference systems for the central viewpoint hypothesis. In this case, global and object reference systems are chosen to be the same.
Signal Processing
- 3D sensor reference systems (Xd, Yd, Zd, sensor centered), with origin at a point (X0, Y0, Z0) of the global reference system (see Table 17). Looking at the reference systems adopted in this paper (Figs. 6 and 7), the rototranslation matrix between the global and the sensor reference systems depends only on the angle Θs between the XY plane and the XdYd plane (Fig. 7), and on the height hs of the sensor above the ground. The complexity of the related rototranslation matrix is thus reduced from the general form in Fig. 8(a) to the one in Fig. 8(b). Each VDPA related to a different sensor has several slots representing intrinsic parameters (e.g., focal length, aperture, zoom, etc.).
- 2D sensor reference systems (i.e., sensor d image planes) (xd, yd, image level). The xd and yd axes have been chosen anti-parallel to the Xd and Yd axes, respectively. The origin of the system (xd, yd) is at (0, 0, -Fd) in the 3D sensor reference system, Fd being the focal length of sensor d. This knowledge is stored in the ref-system-relation frame which relates the 3D (source) with the 2D (target) reference system, together with the perspective transformation equations. This relation has no inverse, indicating that regularizing knowledge is required to solve it.
- Object-model reference systems (Xoj, Yoj, Zoj), object centered. This frame has no a priori known relation with the other reference systems; the Hypothesized Viewpoint frames instantiated during the TD phase provide such relations.

Fig. 7. Relation between the 2D reference system (i.e., the focal plane) and the XY plane, expressed in terms of the inclination angle Θs.

3.3.2. Geometric transformation
The conversion of the 2D closures into 3D data is performed by a 2D-into-3D transformation function that receives as input a point (x, y) on the image plane and provides as output the corresponding point in the 3D sensor reference system. This problem is ill-posed, as infinitely many points in 3D space may correspond to a single point on the image plane. A priori knowledge must be used to regularize the solution. Some concepts play a very important role in regularizing such transformations; they are detailed below.
Viewpoint consistency. The concept of viewpoint consistency has been introduced by Herman et al. [16], and states that all the points of the surface S of a 3D object are projected onto the image plane by means of the same transformation function.
Viewpoint assumption. The selection of the viewpoint is very critical for the functioning of the recognizer. It not only affects the computation of
[ r11  r12  r13  tx ]
[ r21  r22  r23  ty ]
[ r31  r32  r33  tz ]
[ 0    0    0    1  ]
(a)

[ 1    0        0        0  ]
[ 0    cos Θs  -sin Θs   0  ]
[ 0    sin Θs   cos Θs   hs ]
[ 0    0        0        1  ]
(b)

Fig. 8. Special case of the rototranslation matrix for the considered set-up.
the inverse transformation (once the viewpoint has been hypothesized) but also requires an explicit representation of the different probabilities of object poses during the hint-propagation phase. In the current system version, the environmental information [19] provided by a ground map, together with an associative data-driven indexing mechanism [10], is used to fix the probabilities of objects and of their poses. Obviously, unexpected objects are not taken into account by this information.
Object model representation. The representation of a priori knowledge about an object model can be given in a structured way, as a list of object subparts, the surfaces of each subpart (see Fig. 29), and their relations. Here we assume that the surfaces associated with each subpart are given as planar patches in the related object-model reference system, characterized by a fuzzy description of their minimum and maximum dimensions along all directions. This implies that recognized objects can be described as sets of planar patches. Many interesting objects in a real road scene have this property (e.g., roads, houses, cars, etc.). Other choices are possible for objects which cannot be assimilated to surfaces, but they are not considered in detail in this paper; in such cases, the number of system levels would remain the same, but the internal structure of the Geometric Reasoner KS should be changed in order to adapt it to the chosen object-model representation (see Fig. 9). For example, if a precompiled view graph of either an
object or a subpart is available, a set of descriptions in terms of 2D hints could be obtained directly, without considering an explicit symbolic representation of 3D properties. A surface-based object-model description has been preferred here in order to stress the multilevel descriptive nature of the system architecture. This motivation has also driven the presentation of results: consequently, we tried to stress application domains which can easily provide structured descriptions of object models (e.g., road maps, in the case of the second experimentation).
TD hypothesis. During the TD phase, at the GR level, a status frame is associated with the selected surface. The hypothesized-viewpoints slot contains the name of one of the possible ref-system-relation frames between the sensor and the object reference systems associated with the object. Consequently, a rototranslation matrix is available, and the point of view from which the surface may be seen can be hypothesized (see Tables 17-19). The unknown relation between the object and the global reference system can be computed as a composition with other relations provided as a priori knowledge by the system. As plane surfaces have been assumed, the four parameters of the plane equation with respect
Table 18
Example of Hypothesized Viewpoint used in the case of the frontal-wall reconstruction

HYP-VIEWPOINT-1
SLOT                    VALUE                                                             TYPE
ISA                     REF-SYSTEM-RELATION                                               link
SOURCE-FRAME            HOUSE-REFERENCE-SYSTEM                                            link
TARGET-FRAME            VISUAL-SENSOR-REFERENCE-SYSTEM                                    link
VIEWPOINT                                                                                 attribute
ROTOTRANSLATION-MATRIX  ((1 0 0 -15) (0 0.96 0.25 6.96) (0 -0.025 0.96 18.95) (0 0 0 1))  attribute
SENSOR                  VISUAL-SENSOR                                                     attribute
INVERSE                 HYP-VIEWPOINT-1-INV                                               link

Fig. 9. Hint mapping in the case of complex object models described by means of a precompiled view graph.
to the above defined reference systems are of main interest . Such parameters are considered again when the related BU message is received by the Geometric Reasoner module .
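The composition of relations mentioned above (the unknown object-to-global relation obtained by chaining a Hypothesized Viewpoint with the known sensor-to-global matrix) amounts to a product of homogeneous matrices. A hedged sketch, with matrix values loosely modeled on Tables 17 and 18 but otherwise illustrative:

```python
import numpy as np

def compose(rt_a_to_b, rt_b_to_c):
    """Chain two rototranslations: points are mapped A->B, then B->C."""
    return rt_b_to_c @ rt_a_to_b

c, s = 0.96, 0.25
# Known sensor-to-global relation (cf. Table 17; illustrative values).
rt_sensor_to_global = np.array([[1, 0, 0, 0],
                                [0, c, -s, 0],
                                [0, s,  c, 1.70],
                                [0, 0,  0, 1]])
# Hypothesized object-to-sensor viewpoint (cf. Table 18; illustrative).
rt_object_to_sensor = np.array([[1, 0, 0, -15.0],
                                [0, c,  s, 6.96],
                                [0, -s, c, 18.95],
                                [0, 0,  0, 1]])

rt_object_to_global = compose(rt_object_to_sensor, rt_sensor_to_global)
# A point at the object-model origin, expressed in the global system:
origin_global = rt_object_to_global @ np.array([0.0, 0.0, 0.0, 1.0])
print(origin_global)
```

Because the composition is a single matrix product, the system can cache rt_object_to_global once per hypothesized viewpoint instead of re-chaining the frames for every point.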
3.3.3. Regularization using a hypothesized viewpoint

Thanks to the hypotheses made during the TD phase, the closures obtained by the VDPAs can be transformed into consistent plane surfaces in the related object-model reference system for the matching phase. Then the 3D descriptions of the surface, obtained separately by each sensor, are converted into the 3D global reference system (X, Y, Z), where the fusion phase can be performed.
The 2D-into-3D transformation from the image plane to a generic 3D system requires the solution of an ill-posed problem. Knowledge of the viewpoint makes it possible to compute the hypothesized plane equation in the sensor's reference system. This step is performed by using the rototranslation matrix RTs associated with the transformation between the sensor and the object reference systems; i.e., the conversion between the equations

a'Xoj + b'Yoj + c'Zoj + d' = 0   and   aXd + bYd + cZd + d = 0

is obtained as follows: with h' = (a' b' c' d'), Voj = (Xoj Yoj Zoj 1) and Vd = (Xd Yd Zd 1), we have RTs Vd = Voj, so that

h' · RTs · Vd = h · Vd = 0,   where h = (a b c d).

In such a way, starting from the description (a' b' c' d') stored in the slot coeff-obj of the frame regarding a surface, it is possible to compute, by means of a simple change of coordinates, the coefficients (a b c d) and to store them in the slots related to the considered sensors (coeff-sensor-1 and coeff-sensor-2), so obtaining the geometrical representation of the plane considered. Once a constraint equation of the plane in the sensor reference system has been obtained, the following system can be solved by adopting a perspective transformation:

xd = F·Xd/Zd,   yd = F·Yd/Zd,   (1)
aXd + bYd + cZd + d = 0.

The solution of this system is

Xd = -d·xd / (cF + a·xd + b·yd),
Yd = -d·yd / (cF + a·xd + b·yd),   (2)
Zd = -d·F / (cF + a·xd + b·yd).

Singularities may arise for points lying on the straight line in the image plane expressed by the equation cF + a·xd + b·yd = 0, but they are not considered in this paper. By using (2) it is possible to obtain the 3D coordinates of the closure vertices in the 3D reference system (Xd, Yd, Zd). In this system of coordinates it is possible to compute the several attributes of the surface (see Table 16) and to verify, by applying the related fuzzy functions to the selected attributes, whether the detected surface fits the hypothesized one of the model. The matching phase checks the intrinsic hints defined on such attributes and the overall consistency of the transformation. The system operations needed to determine and characterize each surface are performed for each physical channel. Then, both surfaces are converted into the global reference system in order to fuse them and to obtain a single
Table 19
Example of ref-system frame related to the visual sensor

VISUAL-SENSOR-REFERENCE-SYSTEM
SLOT             VALUE                                                  TYPE
ISA              REFERENCE-SYSTEM                                       link
SENSOR           VISUAL-SENSOR                                          attribute
KNOWN-RELATIONS  (RT-FROM-VISUAL-SENSOR-TO-GLOBAL HYP-VIEWPOINT-1-INV)  attribute
3D map. This is done by looking at the information contained in the related frames of the type ref-system-relation (Table 17), where the (fixed) rototranslation matrix between the sensors (Table 21) and the global reference systems is stored . The fusion algorithm aims to obtain a description of the scene in terms of the points on the found surface that can be associated with the object. Even a small error concerning the relative positions of the different sensors, or an error related to image acquisition, is sufficient to obtain an improperly registered 3D image. However, the overall result is that these errors can be tolerated if the surface has sufficient dimensions . The fusion algorithm performs a simple overlap of the surfaces found by each sensor, assigning different weights to each 3D point according to the number of surfaces that support it . The final result is a single surface made up of points with different degrees of membership : the larger the membership value obtained for a single point, the higher the probability that this point may belong to the object examined . Once fusion has been performed, the found surface is described inside the blackboard by means of characterizing parameters . Multiple surfaces are grouped by means of an A* algorithm which checks the consistency of the open node considered by performing relational matching . If the matching result is positive, the group of surfaces is propagated to the Object Detector level .
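The plane back-projection of eq. (2) and the overlap-based fusion step described above can be sketched together. The back-projection assumes plane coefficients (a, b, c, d) already expressed in the sensor reference system and the projection convention xd = F·Xd/Zd; the grid resolution and the data in the example are illustrative, not the paper's:

```python
import numpy as np

def backproject(xd, yd, plane, F):
    """Eq. (2): intersect the viewing ray through image point (xd, yd)
    with the hypothesized plane a*X + b*Y + c*Z + d = 0."""
    a, b, c, d = plane
    denom = c * F + a * xd + b * yd
    if abs(denom) < 1e-9:          # singular line in the image plane
        return None
    return (-d * xd / denom, -d * yd / denom, -d * F / denom)

def fuse(surface_points_per_sensor, cell=0.5):
    """Overlap fusion: weight each quantized 3D cell by the number of
    sensors whose reconstructed surface covers it."""
    weights = {}
    for points in surface_points_per_sensor:
        cells = {tuple(np.round(np.asarray(p) / cell).astype(int))
                 for p in points}
        for cell_idx in cells:
            weights[cell_idx] = weights.get(cell_idx, 0) + 1
    return weights

# Consistency check: a point on the plane, projected onto the image
# plane with eq. (1), must back-project onto itself.
F, plane = 0.05, (0.0, 1.0, 0.0, -1.70)     # horizontal plane Y = 1.70
X, Y, Z = 2.0, 1.70, 10.0
xd, yd = F * X / Z, F * Y / Z
print(backproject(xd, yd, plane, F))
```

Counting supporting sensors per cell reproduces the paper's membership idea: cells supported by both channels (weight 2) correspond to the white agreement areas of Fig. 17, cells supported by one channel (weight 1) to the green conflict areas.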
4. Results

Results are reported on the interpretation of 3D real scenes at fixed time instants. The scenes examined were acquired under different environmental and lighting conditions. Two types of situations are considered, where images acquired by different sensors are evaluated through the recognition process at the same time. In the first case, a simple road scene is examined by considering data provided by a thermocamera and a b/w camera. In the second case, data provided by a real sensor (b/w camera) and by a synthetic image
obtained from a 3D model of the environment (CAD image) are recognized after being integrated. The CAD image makes available to the system the knowledge contained in a cartographical map [19]. The scene considered includes two objects: a road and a house. The house recognition is used as an example to describe the recognition of an object composed of multiple surfaces. In Appendix A, the multilevel object-model description of a generalized road-house scene used for both recognition processes is provided. This description is represented inside the system by means of frame networks. Virtual objects progressively instantiated at each system level during the top-down phase of the dynamic search cycles are also shown in Appendix A for the house recognition case, together with the corresponding simple and complex observations sent back during the bottom-up phase. The proposed examples include aspects related to the fusion of data coming from inhomogeneous sensors. In such cases, it is not possible to find measures that relate data at the lowest levels of a system, as occurs for homogeneous sensors. Instead, it appears more natural to perform fusion at the hypothesis level [4], that is, after recognition has been performed over the separate channels. This approach has the additional advantage of making the system more robust to sensor failures: an interpretation is always feasible, even when a sensor provides observations affected by large errors. Two types of results are reported: (1) recognition results, in order to demonstrate that the system is able to recognize interesting objects and to reconstruct a 3D scene; (2) performance results, obtained by the proposed organization of the system: they make it possible to evaluate the effectiveness of the multilevel representation of models and the degree of processing distribution among the different levels. Recognition results are presented by showing the local solutions progressively obtained inside the system at each level.
Performance results are presented by giving a graphical representation of the amount of processing versus time at the various system levels.
Fig . 10 . Road scene acquired by an infrared (left) and a b/w visual (right) sensor .
4.1. Recognition results

4.1.1. Infrared and b/w camera
The scene acquired by means of an IR and a b/w sensor (Fig. 10) contains a rectilinear, non-asphalt road, bounded by thick vegetation. The sensors are positioned as shown in Fig. 6. The images obtained as output from the two cameras are digitized and undersampled at a 256 x 256 pixel resolution to reduce the total amount of data to be interpreted. Filtering, segmentation, small-region merging, and edge detection are performed by the Image Analyzer during the initialization phase. From the IR image, 23 regions and 70 edges are obtained, and, from the b/w image, 29 regions and 148 edges. Complex observations obtained by the Image Analyzer KS are sent to the Descriptive Primitive Analyzers (Figs. 11 and 12 and Table 13) of both sensor channels, which transform them into
local simple observations, by characterizing their attributes and representing them as frames in the local blackboard. At this point, the Situation Judge module activates a dynamic object search cycle by sending a message to the Hypothesis Manager. A road-house scene is considered to be searched for. Depending on this choice, specific virtual objects are propagated through the KS network by instantiating, at each level, local frames similar to the ones described in Appendix A. In the example considered here, the road is described as an object composed of only one subpart and one surface, and the hypothesis about a central view (i.e., the driver viewpoint) is considered more probable than that about a lateral view. Moreover, the close position used for the two sensors allowed us to consider a single specific mapping from the 3D hints to the 2D ones. A general symbolic description of the virtual objects instantiated at each level is given in Fig. 29,
Fig . 11 . Region-based segmentation related to the infrared (left) and visual (right) sensors .
and Table 22, even though this description is more oriented to the second case considered here. The results related to the complex observations produced as a response to the top-down virtual-object instantiation phase are presented together with a graphical representation of the multilevel architecture, where the module that has produced them is highlighted. The results of closure extraction, as provided by both the Region and Edge Analyzers, are shown in Fig. 12. The closure obtained at the VDPA level is shown in Fig. 14. The results of the 2D-into-3D transformation are shown in Fig. 15, from which one can see that the edge continuity in the 2D representation is not maintained in the 3D representation, due to the different spatial resolutions of the two representations. Consequently, the obtained 3D points are extended (in a continuous way) to obtain the stretches connecting all the sequence points in
order to create a homogeneous patch which can be identified with the 3D surface to be associated with the object `road' on both sensor channels (Fig. 16). Figure 17 shows an aerial view of the surface considered. The image in Fig. 17 exhibits a 25/255 meter/pixel horizontal resolution and a 70/255 meter/pixel vertical resolution; moreover, the size of the light-coloured area is equal to the focal length, and corresponds to the road stretch for which the sensor is blind and hence does not provide any information. In particular, the system knows that the b/w camera is located at a height of around 1.70 meters and that the aperture angle is equal to 40°; accordingly, it estimates an average road width of about 3.80 meters. The result of the fusion operation at the 3D level is shown in Fig. 17, where the white area represents spatial points that belong to the obtained surface and are confirmed by both sensors; the area of
Fig. 12. Edge extraction (LoG operator) related to the infrared (left) and visual (right) sensors .
green colour represents points confirmed by only one sensor. In this case, the white area corresponds to an agreement between the two sensors, while the green area indicates a conflicting situation. The fusion method utilized at this level may appear quite trivial, but it can be effective thanks to the geometric-reasoning processing, which allows each channel to provide 3D observations related to a common reference system. The obtained surface is directly sent to the Object Detector, without any grouping phase, due to the very simple road description which has been adopted. The global solution consists of a frame at the Hypothesis Manager level which collects all the symbolic names of the local observations and virtual objects progressively created during processing.

4.1.2. b/w and CAD images
A second application consists in the integration between the information provided by a real sensor
(b/w camera) and that coming from a virtual sensor which provides 2D synthetic views (CAD image); both types of information concern an outdoor environment through which an autonomous vehicle is moving. The 2D synthetic views are made available to the system by using the knowledge contained in a cartographical map of the environment [19]. The scene taken with the b/w camera was acquired along a road delimited by slight slopes (Fig. 18(a)); on the left side of the road, there was a house, represented in the cartographical map. Therefore, the CAD model (Fig. 18(b)) of this scene is also available: it was obtained by utilizing a camera emulator [19]. The matching process between the two images is justified by the problem of identifying the vehicle's position on the map. In particular, here we suppose that a Mapper [19] system is able to hypothesize the possible position of the vehicle and to provide an expected synthetic view of the scene, obtained on the basis of the
Fig . 13. Infrared (left) and visual (right) edge-groups separately obtained on each channel .
Fig. 14. Closures obtained during the road search cycle (infrared (left) and visual (right) sensors) .
Fig. 15. Edge projection of closure borders on the 3D plane associated with the road surface.
Fig. 16. 2D top views of the 3D surfaces obtained for the `road' object on the infrared sensor (left) and the visual sensor (right) channels.
Fig. 17. Fused 3D interpretation of the `road' object (top view); the blue zone indicates the area where the camera is blind, while green areas indicate areas where the two sensors disagree. The white zone indicates areas recognized as a road by both channels.
cartographical information. In this case, all the information related to the rototranslation matrices between the global reference system and the CAD virtual sensor can be assumed known. Viewpoint hypotheses are necessary only for the b/w camera. The initialization phases of the recognition process are shown first. In the present case, only the results of the edge-detection step are reported (Fig. 19) (28 edges for the CAD image; 176 edges for the b/w image). Edge detection on the synthetic image is performed only to obtain information homogeneous with that of the real sensor, as the exact parameters of object discontinuities could be immediately available from the CAD model. The system searches for objects in a sequential way; once the `road' object has been recognized and characterized, the system tries to search for the `house' object. The house is composed of multiple subparts, each one associated with one or more
surfaces. In Appendix A (Fig. 29, Tables 22 and 23), the frames instantiated at each level during the top-down (i.e., virtual objects) and bottom-up (i.e., observations) phases of the house dynamic search cycle are given. First, the system searches for the main subpart and the main surface among those making up the object (in the present case, the frontal wall and the related surface), according to the position of the viewpoint; then, it searches for the remaining subparts. If the search fails, the viewpoint can be changed by selecting another one among the stable views of the object represented inside the system. When virtual objects have been instantiated at all levels down to the Descriptive Primitive Analyzers, the bottom-up answering phase begins from the Edge Analyzer KS. At this level, the virtual object is formed by an edge-based description of the 2D appearance of the frontal wall as a patch bounded by two straight, parallel
pairs of edges. Then, the system extracts a set of edges parallel along a vertical direction. The polygonal closure area at the VDPA level is then extracted, based on the local transformation method. Regions that are contained inside the obtained closure are searched for by the Region Analyzer. After the CAD channel communicates the obtained closure to the Geometric Reasoner, the b/w camera channel is investigated in order to search for the frontal wall of a real house (Fig. 22). Similar operations are performed on the b/w channel by limiting the search process to an image area obtained by using the surface of the CAD channel and the hypothesized viewpoint as a focus of attention [4]. On the basis of the parameters extracted during the processing, and using the inverse perspective transformation, the Geometric Reasoner obtains the 3D reconstruction of the recognized surfaces, obtained from the CAD surface (Fig. 21) and the b/w camera one (Fig. 22). Figure
Table 20
Dimensions of the detected road and the detected house found by the Geometric Reasoner module for the second scene

Object                    Width     Height
Road                      4.10 m
House (frontal surface)   5.15 m    5.90 m
House (side surface)      7.10 m    5.80 m
23 presents the result of the fused 3D interpretation process: it was obtained by projecting the closure edges on two planes orthogonal to the XZ plane in the house reference system. The sizes estimated by the system for the main objects examined are shown in Table 20. At the Object Detector (OD) level and at the Hypothesis Manager (HM) level, two local solutions are instantiated, which correspond, respectively, to association pairs between the road and house object models and their subparts (OD level), and between the road-house scene and its two component objects (HM level). The global
Fig. 18. Road scene, acquired by a b/w camera (a), and synthetic hypothesis provided by using a cartographical map (CAD model) (b).
Fig . 19 . Edge extraction (CAD (left) and b/w (right) images) .
Fig. 20 . Extraction of edge groups from the CAD channel by using local models obtained starting from the descriptions of the frontal and the lateral walls subparts of the `house' object .
Fig . 21 . Surfaces obtained from Fig . 19 at the GR level (CAD channel), representing the frontal (left) and lateral walls (right) of a house .
Fig. 22 . Corresponding surface of Fig . 19 on the b/w channel .
Fig . 23 . Orthogonal view of 3D surfaces of the frontal (left) and lateral (right) walls at the GR level .
solution obtained is made available to the Mapper module [19], which can use it either to confirm or to discard its previous hypothesis.
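The confirm-or-discard decision made by the Mapper can be sketched as a tolerance check between the dimensions estimated by the recognizer (Table 20) and those predicted from the cartographical map. The map values and the tolerance below are illustrative assumptions, not values from the paper:

```python
def confirm_hypothesis(estimated, expected, rel_tol=0.15):
    """Confirm a position hypothesis only if every estimated object
    dimension lies within rel_tol of the map-predicted value."""
    for obj, dims in estimated.items():
        for name, value in dims.items():
            predicted = expected[obj][name]
            if abs(value - predicted) > rel_tol * predicted:
                return False
    return True

# Dimensions found by the Geometric Reasoner (Table 20) compared with
# illustrative values read off the cartographical map.
estimated = {"road": {"width": 4.10},
             "house-frontal": {"width": 5.15, "height": 5.90}}
from_map  = {"road": {"width": 4.00},
             "house-frontal": {"width": 5.00, "height": 6.00}}
print(confirm_hypothesis(estimated, from_map))  # → True
```

A failed check would lead the Mapper to discard the hypothesized vehicle position and, as the paper describes, trigger the selection of another stable view.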
Table 21
Production rules are grouped at each level according to the task to which they are devoted. In this table, the number of rule groups and rules represented at each system level is shown

LEVEL p   Module                          Rule groups   Rules
1         Image Analyzer                  8             22
2         Descriptive Primitive           9             31
3         Virtual Descriptive Primitive   4             14
4         Geometric Reasoner              5             19
5         Object Detector                 4             12
6         Hypothesis Manager              2             7
4.2 . Performances and discussion
The system has been implemented in Common Lisp on an HP9000/350 computer. Rules D0-D5 are represented by means of production rules, hierarchically organized at each level. Thirty-two groups of production rules (tasks), for a total of 105 production rules, are activated during the whole recognition process, distributed as shown in Table 21. Figures 24-26 show the time percentages spent at each system level for scene 1 during the initialization phase (Fig. 24) and for both scenes during the following dynamic search cycle (Figs. 25 and 26). In Fig. 27, a diagram is presented showing the computational load, in terms of the time spent by each
rule D0-D5 at the higher levels of the system (p = 3, ..., 6). The diagram refers to the road search in the second considered scene. In Fig. 28, the computational load at the Descriptive Primitive level for each searched object is shown. In the current serial implementation, to complete the whole cycle of operations (from the low-level operations to the
final interpretation ones), the system requires 23,483 CPU units in the first case (IR and b/w sensors), and 37,160 in the second one (b/w and CAD images); 3423 (road) and 12,465 (road + house) CPU units are spent for the dynamic-search-cycle phase in the two cases, respectively. This means that the search for a surface belonging to an object requires, respectively, 16% and 18% of the overall time. The greater percentage in the second case is due to the higher complexity of the house description (i.e., a multisurface description). An amount of overhead is also spent to activate each unit, to pass messages, etc.; it is equal to about 120 CPU units for each module activation. Globally, the time required for rule activations is about 24% of the time required for the overall computation. This load could be eliminated in an effective parallel implementation. As one can see, the main bottleneck of the system occurs in the initial bottom-up phase at the lowest level. One dynamic search cycle takes only a few percent of the initialization process (17% and 26% in the considered cases) and is distributed among the levels. A progressive narrowing of the search space is indicated by the peak of the computational load at the Descriptive Primitive Analyzer level during the BU phase of the search cycle (see Figs. 28 and 29). The distribution of the load among the rules inside each level (see Figs. 27 and 28) indicates that there is an inversion of load between higher and
Fig. 25 . Time percentages spent by the system at each level during the road search cycle for scene 1 .
Fig. 26 . Time percentages spent by the system at each level during the house search cycle for scene 2 .
Fig . 24 . Time percentages spent by the system at lower levels during the initialization phase for the first scene .
Fig. 27. Computational load, in terms of the time spent by each rule D0-D5 at the higher levels of the system (p = 3, ..., 6).
Fig. 28. Computational load at the Descriptive Primitive level for each searched object in both scenes.

lower levels: higher levels spend more time on prediction actions (rule D0) than on transformation, grouping and matching actions (rules D1-D5). However, the time spent on prediction serves to limit the growth of load at the lower levels.

5. Conclusions
Fig. 29. Hint mapping between system levels related to the house subpart (frontal wall) detailed in Appendix A. (Figure labels: Geometric Reasoner level; viewpoint- and sensor-dependent; brightness- and sensor-dependent.)
A distributed recognition system for the interpretation of 3D scenes by using multisensory data has been presented. The system consists of a multilevel KB structure controlling hierarchically organized `computational' modules. Two main issues have been addressed: (1) multilevel hierarchical representation of models, oriented toward a model-driven reduction of the solution search space, and (2) integration of the solutions provided by the hierarchical processing units by using a distributed reasoning mechanism. The multilevel object-model representation has been obtained by means of frame networks. Intrinsic and relational constraints have been described as frames and associated with fuzzy membership functions. Interlevel relations between constraints at adjacent levels have been defined to allow for the dynamic projection of hints. A multilevel representation has also been developed for observations. The proposed reasoning mechanism is based on two sets of inference mechanisms to be activated locally, depending on the direction (i.e., top-down or bottom-up) of the information to be processed. The top-down mechanism projects onto lower levels the constraints associated with searched objects. The bottom-up mechanism involves two main steps: (1) transformation of lower-level observations into local ones, by means of computational algorithms, and (2) grouping of local observations into more complex aggregates. Both the intrinsic and the relational hints associated with the virtual object models and fired during the top-down phase are matched with observations during the following bottom-up phase, and allow the system to identify local solutions. Thanks to both inference mechanisms, an efficient integration of multilevel solutions is achieved, and a global solution, composed of the local ones, is reached at the highest level. In addition, specialized a priori knowledge can be embedded at each level to make the transformation and grouping steps more efficient.
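The interplay between the two inference mechanisms can be summarized in code. The sketch below is only an illustration of the transformation-grouping-matching scheme, written in Python for brevity (the actual system used Common Lisp frames and C computational methods); all names, the fuzzy membership function, and the threshold are assumptions, not the authors' implementation.

```python
# Illustrative sketch of one bottom-up step of the dynamic search cycle:
# observations are (1) transformed into local ones, (2) grouped into
# aggregates, and (3) matched against the hints fired top-down.
# All names and values are hypothetical.

class Hint:
    """A constraint projected top-down; matching is fuzzy (score in [0, 1])."""
    def __init__(self, name, membership):
        self.name = name
        self.membership = membership  # fuzzy membership function

    def match(self, value):
        return self.membership(value)

def bottom_up(observations, transform, group, hints, threshold=0.5):
    """Keep the aggregates whose fuzzy match against some hint is high enough."""
    local = [transform(o) for o in observations]   # (1) transformation
    aggregates = group(local)                      # (2) grouping
    return [(a, h.name) for a in aggregates for h in hints
            if h.match(a) >= threshold]            # (3) matching

# Toy usage: a single hint fired top-down, matched against grouped edge lengths.
hints = [Hint("2D-vertical-closure", lambda v: min(1.0, v / 100.0))]
found = bottom_up([30, 40, 50],
                  transform=lambda o: o,
                  group=lambda xs: [sum(xs)],
                  hints=hints)
```

Filtering by a fuzzy threshold at each level is what limits the solution search space in a distributed way: only aggregates compatible with the projected hints survive to the next level.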
The computational complexity of the overall problem can be further reduced by using local bottom-up heuristics . Other
advantages of the proposed approach are related to the characteristics of KB systems, and lie in their high flexibility and ease of maintenance. Spatial reasoning has been chosen as a case study to show how the proposed model can be applied to real cases. The recognition and interpretation of regular objects (i.e., objects characterized by combinations of plane surfaces) in complex environments, starting from multisensory data, have been described. Up to now, only a limited set of objects at the higher levels has been taken into account (i.e., roads, houses and obstacles), but the set of primitives which build up the lower-level descriptions is extensive enough to be easily reused for other applications. Real outdoor scenes acquired by means of multiple (physical and virtual) sensors have been considered as test data provided to the recognition process. The diagrams presented in Section 4.2 can be used as a basis to evaluate the performance of different computational algorithms within the system. This is a further development to be pursued in the future, as it also provides a basis to evaluate the effective distribution of computational load inside the system. The obtained results seem to indicate that the computational algorithms to be used by a symbolic reasoning mechanism oriented to modelling the recognition process can be categorized into a few classes (i.e., transformation, fusion, grouping), at least in the considered case of objects which can be described in terms of planar surfaces. In our opinion, however, this approach can be generalized to other cases, such as objects described in terms of their aspect views, or compound or higher-order surfaces. The main drawback of the present system is the necessity of manually compiling the multilevel object-model descriptions. This drawback can be only partially mitigated by the reusability of the constraints associated with objects in various situations.
A second drawback lies in the fact that the distributed reasoning mechanism is not based on any optimality criterion, hence it does not ensure that the obtained integrated solution will be globally optimal . To overcome the first drawback, we are investigating learning techniques and automatic compilation of multilevel object descriptions based
on the use of information provided by higher-level knowledge sources (e.g., a cartographical map of an outdoor environment). Distributed belief networks, as proposed by Pearl [22], can be useful in defining optimality criteria to eliminate the second drawback. The locally bounded influence between modules seems to us to indicate that global optimization might be attained, for example, by using local energy functions (as in the GMRF case [12]) to manage module communications. Two other problems will be addressed. First, the necessity of speeding up the recognition process by combining associative and symbolic methods [10] will be considered. In the context of the present framework, this means that data-driven criteria to rank hypothesized viewpoints should be explored. Secondly, an open point is the extension of the proposed approach to temporal analysis, i.e., multisensor motion understanding.
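As a toy illustration of the local-energy idea mentioned above (it is future work in the paper, not part of the implemented system), module communications could be managed by an ICM-style update in which each module greedily lowers an energy that depends only on its own data term and on the labels of its neighbours. The sketch below, with invented labels, neighbourhoods and energies, shows the mechanics under those assumptions.

```python
# Toy GMRF-style illustration: each "module" holds a label; its local
# energy combines a data mismatch term with disagreement against its
# neighbours. Iterated conditional modes (ICM) greedily minimizes each
# local energy in turn. All values are invented for illustration.

def local_energy(label, neighbours, data_term):
    """Energy = data mismatch + number of disagreeing neighbours."""
    return data_term(label) + sum(label != n for n in neighbours)

def icm(labels, neighbours_of, data_terms, candidates, sweeps=5):
    """Greedy sweeps: each module picks the label minimizing its local energy."""
    for _ in range(sweeps):
        for i in range(len(labels)):
            nbrs = [labels[j] for j in neighbours_of[i]]
            labels[i] = min(candidates,
                            key=lambda c: local_energy(c, nbrs, data_terms[i]))
    return labels

# Three modules on a chain; the data terms pull the two ends toward label 1.
data = [lambda l, p=p: 2 * abs(l - p) for p in (1, 0, 1)]
labels = icm([0, 0, 0], {0: [1], 1: [0, 2], 2: [1]}, data, candidates=(0, 1))
```

Each update uses only locally available information, which is what makes the scheme compatible with the bounded inter-module influence noted in the text; ICM, however, only guarantees a local minimum of the global energy.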
Acknowledgments

This work has been partially supported by the Pro-Art Italian section of the Prometheus Eureka project. The authors wish to thank Tecnopolis CSATA for providing the images used in the first experiment, and Oto Melara S.p.A. for the b/w image used in the second one. We also wish to thank the anonymous reviewers for their valuable suggestions for making the paper more readable.
Appendix A

In this section, an example of the frames progressively instantiated during the top-down and bottom-up phases of the dynamic search cycle is shown. Frames are implemented in Common Lisp, while all computational methods have been written in C. The example refers to the search process for the house frontal surface described in the second paragraph of the results. Fig. 29 shows the multilevel model, while Tables 22 and 23 give a scheme for the virtual objects and observations progressively instantiated.
Table 22
Multilevel virtual object models and their component hints instantiated during the top-down search cycle related to the house frontal wall (see also Tables 3-12)

Level  Virtual object model                Hints
6      hyp-road-scene-1                    road, house
5      hyp-house-1                         frontal-wall, lateral-wall, roof
4b     hyp-frontal-wall-1                  3D-frontal-surface
4a     hyp-3D-frontal-surface-1            fused-3D-rectangular-surface, fused-extensive-surface-area
3      hyp-visual-2D-frontal-closure-1,    2D-vertical-straight-parallel-closure,
       hyp-cad-2D-frontal-closure-1        2D-fused-medium-area
2      hyp-visual-edges-1, hyp-visual-regions-1,   edge-straight-parallel, ...
       hyp-cad-edges-1, hyp-cad-regions-edges-1
Table 23
Multilevel simple and complex observations produced during the bottom-up phase of the house frontal-wall search cycle (see also Tables 13-16)

Level  Simple observation                          Complex observation
2      visual-edge-45, cad-edge-23,                detected-edge-visual-closure-1
       visual-region-21, cad-region-11
3      visual-edge-closure-1, visual-region-closure-1,   detected-visual-closure-1,
       cad-edge-closure-1, cad-region-closure-1          detected-cad-closure-1
4a     visual-3D-surface-1, cad-3D-surface-1       detected-3D-surface-1
4b     detected-3D-surface                         detected-subpart-1
5      detected-object-subpart                     detected-object
6      object                                      detected-scene
The frames contained in these schemes, which have been referred to in the various sections of the paper, are shown. It is worth noting that in the system implementation, for the sake of efficiency and generality, we have included in the Geometric Reasoner module the management of the virtual objects related both to object subparts and to the surfaces of object subparts: efficiency is improved by
decreasing the communication load between modules, while generality is preserved, as a change in the 3D model representation should not affect the general structure of the system's levels (see also Fig. 9).
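The progressive instantiation illustrated in Tables 22 and 23 can be mimicked with a minimal frame structure. The sketch below is written in Python rather than the authors' Common Lisp, and the class, slot and function names are assumptions made for illustration only.

```python
# Minimal frame sketch mimicking Tables 22 and 23: each hypothesis frame
# instantiated top-down carries the hints it projects onto the level
# below; a matched hypothesis yields a detected- observation bottom-up.
# Names and slots are hypothetical (the original frames were Common Lisp).

class Frame:
    def __init__(self, name, level, hints=()):
        self.name, self.level, self.hints = name, level, list(hints)

def instantiate_top_down(model):
    """Build hypothesis frames from a multilevel model, top level first."""
    return [Frame(name, level, hints) for (level, name, hints) in model]

def detect(frame, score, threshold=0.5):
    """Bottom-up step: a sufficiently matched hyp- frame becomes a
    detected- observation, as in Table 23; otherwise nothing is produced."""
    return frame.name.replace("hyp-", "detected-") if score >= threshold else None

# A fragment of the Table 22 model (levels 6, 5, 4b).
model = [
    (6, "hyp-road-scene-1", ["road", "house"]),
    (5, "hyp-house-1", ["frontal-wall", "lateral-wall", "roof"]),
    ("4b", "hyp-frontal-wall-1", ["3D-frontal-surface"]),
]
frames = instantiate_top_down(model)
```

For example, `detect(frames[2], 0.8)` turns the frontal-wall hypothesis into a detected observation, while a score below the threshold produces none; this mirrors how only hint-compatible observations propagate upward in the real search cycle.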
References

[1] J. Aloimonos and D. Shulman, Integration of Visual Modules: An Extension of the Marr Paradigm, Academic Press, San Diego, CA, 1989.
[2] G. Armano, C.S. Regazzoni, S.B. Serpico and G. Vernazza, "Region growing and merging techniques for accurate image segmentation", Proc. IASTED Internat. Conf., Grindelwald, Switzerland, 7-10 February 1989.
[3] M. Brady, "Computational approaches to image understanding", ACM Computing Surveys, Vol. 14, No. 1, 1982, pp. 3-71.
[4] G. Capocaccia, A. Damasio, C. Regazzoni and G. Vernazza, "Data-fusion approach to obstacle detection and identification", Proc. SPIE, Vol. 1003, Sensor Fusion: Spatial Reasoning and Scene Interpretation, Cambridge, MA, 1988, pp. 409-419.
[5] J. Corkill, "Advanced architectures: Concurrency and parallelism", in: V. Jagannathan, R. Dodhiwala and L.S. Baum, eds., Blackboard Architecture and Applications, Academic Press, Boston, 1989.
[6] P. Coach, D.D. Giusto, C.S. Regazzoni and G. Vernazza, "On the use of multisensor data-fusion in the vision system of an autonomous vehicle", Proc. 1st Prometheus Workshop, Wolfsburg (FRG), 22-23 May 1989, pp. 136-145.
[7] S. Dellepiane, C. Regazzoni, S.B. Serpico and G. Vernazza, "Extension of IBIS for 3D recognition in NMR multislices", Pattern Recognition Letters, November 1988, pp. 65-72.
[8] B. Draper, R.T. Collins, J. Brolio, A. Hanson and E. Riseman, "The schema system", Internat. J. Computer Vision, Vol. 2, 1989, pp. 209-250.
[9] R. Feri, G. Foresti, V. Murino, C.S. Regazzoni and G. Vernazza, "Spatial reasoning by knowledge-based integration of visual and IR fuzzy cues", Proc. 5th EUSIPCO, Barcelona, Spain, September 1990, pp. 1719-1722.
[10] G.L. Foresti, V. Murino, C.S. Regazzoni and R. Zunino, "Map-driven image interpretation by associative model indexing", Proc. IAPR Workshop on Machine Vision and Applications, Tokyo, Japan, 28-30 November 1990, pp. 385-388.
[11] G. Foresti, V. Murino, C. Regazzoni and G. Vernazza, "A numerical and symbolic fusion method for image sequences", Proc. Internat. Conf. Acoust. Speech Signal Process., Toronto, Ontario, Canada, 14-17 May 1991.
[12] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images", IEEE Trans. Pattern Anal. Machine Intell., Vol. 6, No. 6, November 1984, pp. 721-741.
[13] D.D. Giusto, C.S. Regazzoni and G. Vernazza, "Multilevel data-fusion for detection of moving objects", Proc. IEEE Internat. Conf. Systems Man Cybernet., Boston, MA, November 1989, pp. 931-933.
[14] A. Hanson and E. Riseman, "VISIONS: A computer system for interpreting scenes", in: Computer Vision Systems, Academic Press, New York, 1978, pp. 303-333.
[15] T. Henderson, C. Hansen and B. Bhanu, "The specification of distributed sensing and control", J. Robotic Systems, Vol. 2, 1985, pp. 387-396.
[16] D. Kapur and J. Mundy, eds., Geometric Reasoning, MIT Press, Cambridge, MA, 1989.
[17] D. Marr, Vision, Freeman, San Francisco, 1982.
[18] D. Marr and E. Hildreth, "Theory of edge detection", Proc. R. Soc. London, Vol. B-207, 1980, pp. 187-217.
[19] P. Merialdo, C.S. Regazzoni, P.C. Pecollo, G. Vernazza and R. Zunino, "Integration of a territorial system in the vision system of an autonomous land vehicle", Proc. 2nd Internat. Conf. Intelligent Autonomous Systems, Amsterdam, December 1989, pp. 694-704.
[20] N. Nilsson, Principles of Artificial Intelligence, Tioga Press, 1980.
[21] L.F. Pau, "Behavioral knowledge in sensor/data fusion systems", Special Issue on Multisensor Integration and Fusion for Intelligent Robots, J. Robotic Systems, Vol. 7, June 1990.
[22] J. Pearl, "Bayesian decision methods", Encyclopedia of AI, Vol. 1, 1987, pp. 45-48.
[23] I. Pitas and A.N. Venetsanopoulos, "Towards a knowledge-based system for automated geophysical interpretation of seismic data (AGIS)", Signal Processing, Vol. 13, No. 3, October 1987, pp. 229-253.
[24] T.E. Weymouth and A.A. Amini, "Visual perception using a blackboard architecture", in: R. Kasturi and M.M. Trivedi, eds., Image Analysis Applications, Marcel Dekker, New York, 1990, pp. 235-280.
[25] L.A. Zadeh, Fuzzy Sets and Systems, North-Holland, Amsterdam, 1983.