A cognitive architecture for artificial vision


Artificial Intelligence 89 (1997) 73-111

A. Chella a,*, M. Frixione b,1, S. Gaglio a,2

a Dipartimento di Ingegneria Elettrica, Università di Palermo, Viale delle Scienze, 90128 Palermo, Italy
b Istituto Internazionale per gli Alti Studi Scientifici, Via G. Pellegrino 19, 84019 Vietri S.M. (Salerno), Italy

Received December 1994; revised June 1996

Abstract

A new cognitive architecture for artificial vision is proposed. The architecture, aimed at an autonomous intelligent system, is cognitive in the sense that several cognitive hypotheses have been postulated as guidelines for its design. The first is the existence of a conceptual representation level between the subsymbolic level, which processes sensory data, and the linguistic level, which describes scenes by means of a high level language. The conceptual level plays the role of the interpretation domain for the symbols at the linguistic level. A second cognitive hypothesis concerns the active role of a focus of attention mechanism in the link between the conceptual and the linguistic level: the exploration of the perceived scene is driven by linguistic and associative expectations. This link is modeled as a time delay attractor neural network. Results obtained by an experimental implementation of the architecture are reported.

Keywords: Perception; Active vision; Robotics; Conceptual spaces; Spatial reasoning; Geometric reasoning; Representation levels; Hybrid processing

1. Introduction

An artificial vision system for an autonomous agent must be able to build a rich internal representation of the external environment. Such an internal representation should allow the system to effectively draw inferences, make decisions, and, in general, perform reasoning processes concerning its own tasks [4,31].

* Work partially supported by Progetto Finalizzato Robotica and Progetto Coordinato SARI of the CNR (Consiglio Nazionale delle Ricerche), and by MURST 40% of the Ministero per l'Università e la Ricerca Scientifica e Tecnologica.
* Corresponding author. E-mail: [email protected].
1 E-mail: [email protected].
2 E-mail: [email protected].

0004-3702/97/$17.00 Copyright © 1997 Elsevier Science B.V. All rights reserved. PII S0004-3702(96)00039-2


In classical reasoning systems oriented to logic, the meaning of symbols is given by relating them to abstract entities according to model-theoretic semantics. This turns out to be incomplete for an autonomous agent, since it needs to find the meaning for its symbols within its internal representation and in its interaction with the external world, thus overcoming the well-known symbol grounding problem, as discussed in Harnad [36].

We present a cognitive architecture for an artificial vision system, in which an effective internal representation of the environment is built by means of processes defined over a suitable intermediate level that acts as an intermediary between the sensory data and the symbolic level. This architecture is not to be considered as a model of human vision: no hypotheses are made concerning its empirical adequacy from a psychological point of view. However, various cognitive results have been used as sources of inspiration.

According to Marr's model [47], visual perception is modeled as a process in which information and knowledge are represented and processed at different levels of abstraction, from the lowest level, directly related to features of proximal stimuli, to the highest one, where knowledge about the perceived objects is of a symbolic nature. Following Marr's seminal work, research in computer vision has exploded, becoming a discipline in its own right which has provided many working paradigms for object reconstruction and recognition from sensory data (see Besl and Jain [11], Chin and Dyer [24], Binford [14] for a review). A general implicit assumption of research in computer vision has been that the vision process ends with the 3D reconstruction of shapes by means of some suitable geometric primitives, as for instance Marr's generalized cylinders [47]. Models of reasoning about the structure of the reconstructed objects have been proposed only for very special purpose recognition systems (see, e.g., the ACRONYM system [19,20] or the ALVEN system [63]).

In the artificial intelligence community, on the other hand, there has been growing interest in spatial reasoning for planning activities of situated agents in a physical environment [1], and for man-machine interaction [44]. But research in this field has failed to result in an effective interaction with the real world environment by means of a working vision system.

The architecture proposed here aims at providing a general vision model for an autonomous agent that fills the gap between these lines of research by means of a paradigm according to which a reconstructed geometrical scene can be described in symbolic linguistic terms. It also provides a context which can be useful for active vision tasks, as described by Bajcsy [5] and Ballard [7]. This linguistic description should, however, be considered as a first level that is nevertheless sufficient to ground successively higher symbolic reasoning activities.

The three cognitive representation levels proposed by Gärdenfors [33] are the basis of our architectural design: the subsymbolic level, in which the information is strictly related to sensory data; the linguistic level, in which information is expressed by a symbolic language; and an intermediate, prelinguistic conceptual level, where the information is characterized in terms of a metric space defined by a number of cognitive dimensions, independent of any specific language. This level aims at generating the essential representation of the agent's external environment and at providing a precise interpretation of the linguistic level.


The interpretation of the conceptual categories at the linguistic level involves some well-known problems. For instance, perceptual common sense concepts hardly correspond to clear cut, classic categories which can be described in terms of necessary and sufficient conditions. Membership in perceptive categories is not an all-or-nothing affair: it is usually necessary, for example, to consider a prototype of the category. Moreover, the available information depends strictly on the data acquired through measurement processes. As a consequence, knowledge at the conceptual level is affected by measurement errors.

A way of facing these problems is to model the mapping between the conceptual and the linguistic levels in terms of a connectionist device. Neural networks make it possible to avoid an exhaustive description of conceptual categories at the symbolic level: in some sense, prototypes "emerge" from the activity of an associative mechanism during a training phase based on examples. In addition, the measure of similarity between a prototype and a given object is implicit in the behavior of the network and is determined during the learning phase.

A further cognitive aspect is the role of attention processes in the link between the linguistic and the conceptual level. A finite agent with bounded resources cannot carry out a one shot, exhaustive, and uniform analysis of a perceived scene within reasonable time constraints. Furthermore, some aspects of a scene are more relevant than others, and it would be irrational to waste time and computational resources to detect true but useless details. We face these problems by means of a sequential attention mechanism, which suitably scans the internal representation of the scene. The order in which the objects in the scene are analyzed can also be relevant (and, obviously, it becomes crucial in the case of the perception of dynamic scenes).

Our model drives the focus of attention by the knowledge, the hypotheses, the purposes and the expectations of the system, in order to detect the relevant aspects in the perceived scene. Hence, it is a task of the higher level components to use the information acquired through the perceptual system to create expectations or to form contexts in which hypotheses can be verified and, if necessary, adjusted. The link between the linguistic and the conceptual level is therefore bidirectional: the conceptual level defines the interpretation domain for the symbols at the linguistic level, and the linguistic level generates expectations in order to explore the conceptual level suitably. Three focus of attention modes are the basis of the proposed architecture: a reactive mode, in which attention is driven only by the characteristics of the scene; a linguistic mode, in which attention is driven by simple inferences at the linguistic level; and an associative mode, in which attention is driven by free associations among concepts.

In summary, we extend and complete the representation levels proposed by Marr by adding a conceptual and a linguistic level, where understanding takes place. Moreover, the introduced focus of attention provides a systematic and general interaction mechanism among levels, and also extends the active vision paradigm to higher cognitive levels. As a consequence, the limitations of special purpose, goal oriented vision systems like ACRONYM and ALVEN, or more recent systems like TEA-1 [56] and BUSTER [15], are overcome by means of a framework in which general understanding of visual information is modeled in a well-founded manner, and specific goals can be easily expressed.


We are aware of the typical "hard" and not yet solved problems encountered in real scenes at low level vision, such as shadows, poor contrast, occluding objects, and segmentation criteria; in Section 9 we will discuss how our architectural design is a contribution towards possible and unexplored solutions. In the following sections we present the architecture in a detailed manner, also providing simple experimental results aimed at illustrating the functioning of the various components. It should be noted that even with the reduced experimental setup adopted at the low level, which provides only essential information about scenes, the architecture is able to draw many inferences and to build a rich interpretation context.

Specifically, Section 2 delineates the design of the architecture based on the previously exposed principles; Section 3 describes the three levels of representation, while Sections 4 and 5 respectively specify the linguistic level and its interpretation function. Section 6 examines in greater detail the focus of attention mechanism, and Section 7 characterizes the link between the conceptual and the linguistic level in terms of time delay attractor neural networks. Finally, Section 8 describes the employed experimental setup and the obtained results, and Section 9 presents some concluding remarks and hints on future work.

2. The cognitive architecture

The cognitive assumptions introduced in the previous section provide the guidelines for the design and implementation of the proposed architecture for artificial vision. The current implementation concerns the analysis of static scenes. Fig. 1 shows the overall architecture, in which the previously described three levels of representation are pointed out.

Fig. 1. The proposed architecture in which the three levels of representation are pointed out. Block A receives the input from a camera and gives as output the 2½D map images. The maps are sent to block B, which builds a scene description in terms of a combination of 3D geometric primitives. Block C implements the mapping between the conceptual level and the symbolic level. Block D implements the linguistic mode of the focus of attention mechanism, while block E implements the associative mode of the focus of attention.

Block A is the starting block of the subsymbolic level: it receives one or more digitized input images acquired by a camera and gives as output Marr's 2½D description [47] of the input image. This contains information similar to the intrinsic images proposed by Tenenbaum, Fischler and Barrow [59] and by Barrow and Tenenbaum [9], such as relative depth, local orientation and segmentation maps. Several algorithms and methodologies have been proposed in the computer vision literature to extract this information from the pictorial images (see Bertero, Poggio and Torre [10], Lee [45], Aloimonos [2] for a review).

The maps extracted by block A are sent as input to block B, which builds, at the conceptual level, a scene description in terms of a combination of 3D geometric primitives. Several types of 3D primitives have been proposed to generate an object centered description of the scene, like generalized cylinders and cones [20,51], geons [12,26], superquadrics [8,52,57] and deformed superquadrics [53,60,61]. Such primitives can be recovered by many proposed reconstruction methods, which are based mainly on the iterative minimization of suitable nonlinear error functions (see Bolle and Vemuri [17]).

Block C implements the mapping between the conceptual level and the symbolic level; this block aims at recognizing the objects and the situations. The input to block C is a structure at the conceptual level; its output is sent to the linguistic level to produce a sentential description of the scene.

The symbolic knowledge base is the kernel of the linguistic level. The aim of this block is twofold: it describes the perceived scene in a high level language by interpreting the input coming from block C, and it generates, by means of its inference capabilities, the expectations that drive the focus of attention mechanism.

Block D is responsible for the linguistic mode of the focus of attention mechanism. It receives as input the instances of concepts from the knowledge base and it suitably drives the focus of attention, in order to seek the corresponding objects and situations in the acquired scene. Block E is responsible for the associative mode of the focus of attention. Its operation is similar to that of block D, but it drives the focus of attention by looking for the objects in the scene which can be freely associated with the input instances.

The reactive mode of the focus of attention is implemented as an internal mechanism of block D: when the block does not receive any expectations as input, it generates some generic expectations in order to "bootstrap" the operation of the system.
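The fallback from the linguistic mode to the reactive mode inside block D can be illustrated with a minimal sketch. The function name `select_expectations`, the string labels and the list-of-strings encoding of expectations are all hypothetical choices made for illustration; the paper does not specify how block D is implemented.

```python
from typing import List, Tuple


def select_expectations(from_knowledge_base: List[str],
                        generic: List[str]) -> Tuple[str, List[str]]:
    """Choose which expectations drive the focus of attention.

    Hypothetical encoding: an expectation is just a concept name.
    If the knowledge base supplied expectations, they are used
    (linguistic mode); otherwise generic expectations bootstrap
    the system (reactive mode).
    """
    if from_knowledge_base:
        return "linguistic", list(from_knowledge_base)
    return "reactive", list(generic)


# With no input from the knowledge base, block D falls back to generic expectations.
mode, expectations = select_expectations([], ["any-object"])
```

The associative mode of block E is deliberately left out of this sketch, since it relies on the associative mechanism described later in the paper.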

3. The levels of representation

According to Marr's model [47], visual perception is described as an information processing activity at different levels of abstraction. At higher levels visual information is object centered and is related to the 3D characteristics of the scene. In Marr's theory a superior symbolic level is limited to a hierarchically organized catalogue of 3D prototypes.

The mental imagery literature (see Block [16]) has introduced the distinction between mental pictures and propositional mental representations: e.g., Kosslyn [43] distinguishes between a short term memory based on mental images and a propositional long term memory; mental images can be generated and processed starting from the propositional long term information. It has been debated whether mental images are viewer centered or object centered representations, that is, whether a mental image depends on the specific observation point or not. Cognitive evidence exists according to which both these kinds of representation coexist and are integrated in human memory, as described by Tarr and Pinker [58] and by Farah, Hammond, Levine and Calvanio [29].

In the Johnson-Laird theory [39], three levels of representation are hypothesized that, in some sense, summarize the various points of view sketched above. The "highest" level is a propositional representation, i.e., a symbolic representation similar, for example, to a semantic network. The intermediate representation is a mental model, in some respects analogous to an object centered (or spatial) mental image. The "lowest" level is a visual, viewer centered, mental image.

From a slightly different point of view, Gärdenfors [33] proposes three levels of information representation: a linguistic level, a conceptual level and a subsymbolic level.
At the linguistic level the information is described in terms of a symbolic language, e.g., a first order language; at the subsymbolic level information is characterized directly in terms of the perceptual inputs of the system. Between these two levels, a third level is hypothesized: the conceptual level, in which information is described in terms of a conceptual space.

Our model is inspired by the three representation levels proposed by Gärdenfors. The theory of conceptual spaces provides a robust cognitive background for the definition of the internal representations of the agent's external environment. Furthermore, this framework may easily be generalized to incorporate well-founded attentional mechanisms, as we will show in Section 6. Further analogies can be found between the model proposed here and the one proposed by Marr, as well as the models that have emerged from the mental imagery debate.

In Fig. 1, the three gray blocks correspond to Gärdenfors' levels of representation. The first level can also be seen as a visual, viewer centered, mental image (or, in Marr's terminology, a 2½D sketch). The central level embeds within itself an object centered mental image (in Marr's terminology, a 3D model representation). The upper level consists of a propositional, linguistic, knowledge representation. Such a level can be assimilated to Kosslyn's long term memory and to Marr's hierarchical catalogue of models.

According to Gärdenfors, a conceptual space is a metric space consisting of a number of quality dimensions. From a formal point of view, a conceptual space is an n-dimensional space $CS = X_1 \times X_2 \times \cdots \times X_n$, where $X_i$ is the set of values of the ith quality dimension (for $1 \le i \le n$ with $n \in \mathbb{N}$). Examples of such dimensions would be color, pitch, mass, spatial coordinates, and so on. The dimensions should be considered "cognitive" in that they correspond to qualities of the represented environment, without reference to any linguistic descriptions. In this sense, a conceptual space is prior to any symbolic characterization of cognitive phenomena. Some dimensions in a conceptual space are closely linked to the sensorial input of the system; other dimensions can be related to more abstract concepts.
We call a generic point in a conceptual space a knoxel (footnote 3); the term is suggested by the analogy with the term pixel in digital image processing. Knoxels therefore represent epistemological primitives at the considered level of analysis. Formally, a knoxel is a vector $k = (x_1, x_2, \ldots, x_n)$ where $x_i \in X_i$ corresponds to a parameter associated with a quality dimension of the domain of interest. In our architecture, the dimensions of the conceptual space are the parameters of the 3D geometric primitives which compose the scene. In this perspective, the knoxels correspond to simple geometric building blocks, while complex objects or situations are represented as suitable sets of knoxels. Accordingly, each knoxel is related to measurements, obtained via suitable sensors, of the geometric parameters of simple, basic objects in the external environment. A metric function d is defined in CS, which may be considered as a measure of similarity among knoxels in the conceptual space (see Gärdenfors [32]).

Footnote 3: The term knoxel was first introduced by Gaglio, Puliafito, Paolucci and Perotto [30] with a slightly different meaning.

In general terms, a precise characterization of the conceptual space poses some problems. This is the case, in particular, when one has to take into account the qualitative difference in the information being represented in each dimension. It is, for instance, a complex task to find a metric that allows for a suitable quantization of the interesting features. Gärdenfors [34] notes that:


"The main factor preventing a rapid development of a cognitive semantics based on conceptual spaces is the lack of knowledge about the relevant quality dimensions. It is almost only for perceptual dimensions that psychophysical research has succeeded in identifying the underlying topological structures (and, in rare cases, the psychological metric). For example, we only have a very sketchy understanding of how we perceive and conceptualize things according to their shapes. The models developed by Marr and Nishihara [48], Pentland [52], Biederman [13], and Tversky and Hemenway [65] among others, seem to point in the right direction, but there still remains a lot to learn about the 'shape space'."

Nevertheless, we claim that our architecture overcomes these problems, since we have adopted a very simple (but nonetheless useful) conceptual space in which the dimensions correspond to the parameters of suitable 3D geometric primitives. Their boolean composition, according to schemas of constructive solid geometry (CSG; see footnote 4) as described by Requicha [55], permits the representation of a great variety of familiar shapes, particularly those corresponding to human artifacts.

We have found it convenient to adopt superquadrics as the geometric primitives of the CSG schema. They are widely used both in computer graphics [8] and computer vision [52,57,66], as they offer an acceptable compromise between the compression of information in the scene and the necessary computational costs [46,57]. Furthermore, superquadrics provide good expressive power and representational adequacy [52]. Solina and Bajcsy [57], Gupta and Bajcsy [35], Leonardis, Solina and Macerl [46], among others, have proposed working techniques for recovering superquadrics from real scenes, even when the objects are difficult to segment. Techniques aimed at the recovery of superquadrics, also in the presence of occlusions, have been proposed by Whaite and Ferrie [66] and by Maver and Bajcsy [49].

Superquadrics are geometric shapes derived from the quadrics parametric equation with the trigonometric functions raised to two real exponents. The inside/outside function of the superquadric in implicit form is:

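$$F(x, y, z) = \left[ \left( \frac{x}{a_x} \right)^{2/\varepsilon_2} + \left( \frac{y}{a_y} \right)^{2/\varepsilon_2} \right]^{\varepsilon_2/\varepsilon_1} + \left( \frac{z}{a_z} \right)^{2/\varepsilon_1} \qquad (1)$$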

where the parameters $a_x$, $a_y$, and $a_z$ are the lengths of the superquadric axes and the exponents $\varepsilon_1$ and $\varepsilon_2$, called form factors, are responsible for the shape's form: $\varepsilon_1$ acts in terms of the longitude, and $\varepsilon_2$ in terms of the latitude of the object's surface. Eq. (1) returns a value equal to 1 when the point $(x, y, z)$ is a superquadric boundary point, a value less than 1 when it is an inside point, and a value greater than 1 when it is an outside point.

Fig. 2 shows the forms assumed by a superquadric by varying only its form factors $(\varepsilon_1, \varepsilon_2)$. Form factors less than 1 let the superquadric take on a squared form, as in Fig. 2(a), where the values (0.01, 0.01) result in a box shaped superquadric; values approaching 1 render the shape rounded, as in Fig. 2(b), where the form factors (1, 1) make the superquadric an ellipsoid. When the form factors are (0.01, 1), the superquadric assumes a cylindrical shape (see Fig. 2(c)). Finally, values greater than 1, e.g., (5, 5), tend to generate a cuspidate aspect, as in Fig. 2(d).

Fig. 2. Aspects assumed by a superquadric by varying its form factors.

Footnote 4: According to the CSG schema, the geometric primitives can be considered closed compact sets in Euclidean space, and they can be composed through regularized boolean operators (R-AND, R-OR, R-DIFF) to form general 3D structures.

The previous equation describes a superellipsoid in canonical form; the three center coordinates $p_x$, $p_y$, $p_z$ and the three orientation parameters $\varphi$, $\theta$, $\psi$ complete the description of a generically displaced superquadric. The expression of the knoxel describing a generic superquadric is therefore:

$$k = (a_x, a_y, a_z, \varepsilon_1, \varepsilon_2, p_x, p_y, p_z, \varphi, \theta, \psi). \qquad (2)$$
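As a rough illustration of Eqs. (1) and (2), the following sketch stores the eleven knoxel parameters in a small record and evaluates the inside/outside function. The class name `Knoxel`, the field names and the function `inside_outside` are hypothetical, and the sketch assumes the standard superquadric form of the inside/outside function evaluated in the canonical pose only: the position and orientation fields are stored but not applied, so the query point must already be expressed in the superquadric's own frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Knoxel:
    """The 11 superquadric parameters of Eq. (2) (field names are illustrative)."""
    ax: float            # axis lengths
    ay: float
    az: float
    eps1: float          # form factors
    eps2: float
    px: float = 0.0      # center coordinates
    py: float = 0.0
    pz: float = 0.0
    phi: float = 0.0     # orientation parameters
    theta: float = 0.0
    psi: float = 0.0


def inside_outside(k: Knoxel, x: float, y: float, z: float) -> float:
    """Evaluate Eq. (1) in the canonical frame of the superquadric.

    Assumption: position and orientation are ignored. Absolute values
    keep the fractional powers real for negative coordinates.
    """
    xy = abs(x / k.ax) ** (2.0 / k.eps2) + abs(y / k.ay) ** (2.0 / k.eps2)
    return xy ** (k.eps2 / k.eps1) + abs(z / k.az) ** (2.0 / k.eps1)


# Form factors (1, 1) give an ellipsoid with semi-axes 1, 2 and 3.
ellipsoid = Knoxel(ax=1.0, ay=2.0, az=3.0, eps1=1.0, eps2=1.0)
on_boundary = inside_outside(ellipsoid, 1.0, 0.0, 0.0)   # equals 1 on the surface
at_center = inside_outside(ellipsoid, 0.0, 0.0, 0.0)     # less than 1 inside
far_away = inside_outside(ellipsoid, 2.0, 0.0, 0.0)      # greater than 1 outside
```

Handling a displaced superquadric would additionally require transforming the query point by the inverse of the pose encoded in the last six parameters, which is omitted here.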

As an example, let us consider the sample scene in Fig. 3, representing a hammer, a computer mouse and a tennis ball. The knoxels are obtained by approximating each part of the scene by means of the best fitting superquadric (see Fig. 4); details on this operation, as performed by our experimental setup, will be given in Section 8. Each superquadric has been indicated by a tag; the acquired scene is therefore described by the knoxels $k_1$, $k_2$, $k_3$, and $k_4$.

Fig. 3. A sample scene representing a hammer, a computer mouse and a tennis ball.

As previously stated, a knoxel identifies a single superquadric; complex objects and situations are represented by suitable sets of superquadrics according to the CSG schema. It should be noted that the superquadric parameters also code the position and orientation of the superquadric in space: therefore the relative orientation and mutual position of the superquadrics describing a composite object, e.g., the hammer, are implicitly defined. There is no need for mechanisms such as the adjunct relations proposed by Marr [48].

We define a perception cluster $pc = \{k_1, k_2, \ldots, k_l\}$ as a finite set of knoxels corresponding to an object or a situation in CS. Referring to Fig. 4, the perception cluster $pc_1 = \{k_1, k_2\}$ describes a hammer, while the perception cluster $pc_2 = \{k_3\}$ describes a tennis ball. The set PC of all the perception clusters in CS is defined as:

$$PC = \{\{k_1, k_2, \ldots, k_l\} \mid l \in \mathbb{N},\ k_i \in CS \text{ for } 1 \le i \le l\}. \qquad (3)$$
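The definition of a perception cluster as a finite set of knoxels maps naturally onto an immutable set type. The sketch below is only one possible encoding: the tuple representation of a knoxel, the helper `perception_cluster`, the requirement that a cluster be non-empty, and all numeric parameter values are assumptions made for illustration and are not taken from the paper's experiments.

```python
from typing import FrozenSet, Tuple

# Assumption: a knoxel is stored as a plain tuple of the 11 parameters of
# Eq. (2); the values below are made up for illustration only.
Knoxel = Tuple[float, ...]
PerceptionCluster = FrozenSet[Knoxel]


def perception_cluster(*knoxels: Knoxel) -> PerceptionCluster:
    """Build a finite set of knoxels (assumed non-empty in this sketch)."""
    if not knoxels:
        raise ValueError("a perception cluster needs at least one knoxel")
    return frozenset(knoxels)


k1 = (1.0, 1.0, 8.0, 0.1, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)   # first part of a composite object
k2 = (1.5, 1.5, 3.0, 0.1, 0.1, 0.0, 0.0, 8.0, 0.0, 1.57, 0.0)  # second part of the same object
k3 = (2.0, 2.0, 2.0, 1.0, 1.0, 9.0, 0.0, 0.0, 0.0, 0.0, 0.0)   # a single-part object

pc1 = perception_cluster(k1, k2)  # two-knoxel cluster, like the hammer
pc2 = perception_cluster(k3)      # one-knoxel cluster, like the tennis ball
```

Using a frozen set reflects the fact that a cluster is unordered and that the mutual position of its parts is already carried by the knoxel parameters themselves.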

The conceptual level is independent of any linguistic characterization. Indeed, the symbols at the linguistic level are interpreted on configurations at the conceptual level. A suitable interpretation function maps linguistic expressions onto conceptual structures of the appropriate type. In Section 5, we describe how this interpretation function may be “computed”.

4. The linguistic level

The role of the linguistic level is to provide a concise description of the perceived scene in terms of a high level logical language, in itself suitable for symbolic knowledge-based reasoning. In order to describe the symbolic knowledge base, we adopt a hybrid representation formalism, in the sense of Nebel [50]. Accordingly, a hybrid formalism is constituted by two different modules: a terminological component and an assertional component. In our model, the terminological component contains the descriptions of the concepts relevant for the represented domain (e.g., types of objects and of situations to be perceived). The assertional component stores the assertions describing the specific perceived scenes.

Fig. 4. Results obtained by the superquadric approximation of the scene in Fig. 3.

The distinction between terminological and assertional components is useful for maintaining the distinction between the conceptual knowledge, which is largely independent of the specific perceived scene, and the assertions concerning the scene itself. Moreover, terminological formalisms are well suited to our purposes, in that they are centered on conceptual descriptions. This allows for a compact description of concepts, whose instances are to be recognized in the perceived scene.

The adopted formalism is completely monotonic (it is well known that in classic terminological systems concept descriptions in terms of default attributes are not allowed). Nonmonotonic extensions of the conceptual knowledge base would probably prove helpful in further developments of the system. Up to now, however, we have chosen to keep the symbolic knowledge base completely monotonic, in order that the prototypical characterization of concepts might emerge entirely from the properties of the conceptual level and from the associative mechanisms linking it to the linguistic level, as proposed by Gärdenfors [32].


Fig. 5. Graphic description of a fragment of the terminological knowledge base. A generic Object is described as composed of at least one knoxel. A Simple-object is described as an object composed of exactly one knoxel; a Complex-object is an object composed of at least two knoxels. Hammer is an example of a complex object. The role has-part has been differentiated into more distinct roles. The concept Hammer has two roles: has-handle and has-head.

As an example, consider in Fig. 5 a fragment of the terminological knowledge base concerning the description of objects. In the figure, the graphic notation developed by Brachman [18] for the KL-ONE system has been adopted. A generic Object is described as composed of at least one knoxel. A Simple-object is described as an object composed of exactly one knoxel; a Complex-object is an object composed of at least two knoxels. Hammer is an example of a complex object. The role has-part has been differentiated into more distinct roles. For example, the concept Hammer has two roles: a role has-handle with exactly one filler, which must be a knoxel with a cylindrical shape, and a role has-head with exactly one box shaped filler.

The assertional component is based on a first order predicate language, in which the concepts of the terminological component correspond to one argument predicates, and the roles (e.g., has-head or has-handle) correspond to two argument relations. So, for example, in order to assert the existence of an instance Hammer#1 of the concept Hammer, the formula

Hammer(Hammer#1)

is asserted. To express that the filler of the role has-handle for Hammer#1 is a specific knoxel Cylinder-shaped#1, the formula

has-handle(Hammer#1, Cylinder-shaped#1)

is asserted.

As far as situations are concerned, we choose to represent them as concepts in the terminological formalism. In other words, we assume that situations are reified, i.e., that to every specific situation there corresponds an individual in the domain. This solution is analogous to Davidson's proposal for event representation [25]. Since we have no philosophical worries about ontological parsimony, this choice turns out to be simpler and advantageous in many respects. It is well suited for terminological formalisms, and provides great flexibility and expressive power. For example, quantification on situations is allowed.

Fig. 6 shows the network description of the Situation concept, and of two particular types of situation, Above and Next-to. As shown in Fig. 6, every Situation has at least one object as participant. Next-to is described as a particular type of situation, with exactly two participants. To assert that an object O#1 is by the side of a second object O#2, an instance S#1 of Next-to is generated, whose participants are O#1 and O#2. In other words, the following assertions are added:

Next-to(S#1),
participant(S#1, O#1),
participant(S#1, O#2).

Fig. 6. Graphic description of the Situation concept. Every situation has at least one object as participant. Next-to and Above are described as particular types of situations, with exactly two participants.

The situation Above is described by means of two roles, is-above and is-below, both with exactly one filler. These roles are defined as particular differentiations of the role participant.
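The assertions above can be mimicked with a very small fact store, in which concepts are one argument predicates and roles are two argument relations. This is only a toy sketch: the class `AssertionalBase` and its methods are hypothetical, and it does not reproduce KL-ONE or any actual terminological reasoner. The final check simply counts role fillers to test the "exactly two participants" restriction stated for Next-to.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class AssertionalBase:
    """Toy store of one-argument (concept) and two-argument (role) facts."""

    def __init__(self) -> None:
        self.concepts: Dict[str, Set[str]] = defaultdict(set)
        self.roles: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def assert_concept(self, concept: str, individual: str) -> None:
        # Records a formula such as Hammer(Hammer#1).
        self.concepts[concept].add(individual)

    def assert_role(self, role: str, subject: str, filler: str) -> None:
        # Records a formula such as has-handle(Hammer#1, Cylinder-shaped#1).
        self.roles[role].add((subject, filler))

    def fillers(self, role: str, subject: str) -> Set[str]:
        # All fillers of the given role for the given individual.
        return {f for s, f in self.roles.get(role, set()) if s == subject}


abox = AssertionalBase()
abox.assert_concept("Hammer", "Hammer#1")
abox.assert_role("has-handle", "Hammer#1", "Cylinder-shaped#1")

# A reified situation: S#1 is an individual of the concept Next-to.
abox.assert_concept("Next-to", "S#1")
abox.assert_role("participant", "S#1", "O#1")
abox.assert_role("participant", "S#1", "O#2")

# Next-to is described as having exactly two participants.
next_to_ok = len(abox.fillers("participant", "S#1")) == 2
```

Because situations are reified as individuals, the same two methods cover both objects and situations, which is one practical consequence of the design choice discussed in the text.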

5. Interpreting symbols on the conceptual space

As pointed out in the previous section, the linguistic level provides a concise symbolic description of the perceived scene. Obviously, in the perception process this stage comes at the end, since it pertains to the most abstract representational level of visual information. What we need at this point is a denomination function which maps structures within the conceptual space onto linguistic constructs. A possible solution will be presented in the next section, and it will be related to the focus of attention mechanisms.


In order to define a denomination function correctly, we cannot avoid proceeding in the opposite direction; i.e., we need to introduce an internal, cognitively oriented semantic interpretation for the symbols at the linguistic level. In particular, we define a suitable interpretation function that maps the symbolic structures at the linguistic level onto entities in the conceptual space. This is a general methodological issue in artificial intelligence where it is normally assumed that there is a language that needs a semantics. By contrast, in the perspective of the vision context, the main problem is that there is a perceptual representation that needs a language.

The proposed interpretation function $\Phi$ associates any individual constant representing an object or a situation at the linguistic level with a perception cluster in CS, any concept (one-place predicate) with a set of perception clusters, any role (two-place predicate) with a set of pairs of perception clusters, and so on. Therefore, if C is the set of assertional individual constants and $\Phi_C$ is the interpretation function $\Phi$ restricted to C, then $\Phi_C$ has the following type:

$\Phi_C : C \rightarrow PC$    (4)

where PC represents the set of all perception clusters as defined in (3). As an example, referring to the scene in Fig. 4, the interpretation function $\Phi$ associates, among others, an instance of the concept Hammer with the perception cluster $pc_1 = \{k_1, k_2\}$, where $k_1$ and $k_2$ are the superquadrics representing the hammer head and the hammer handle:

$\Phi(\text{Hammer\#1}) = \{k_1, k_2\}$.    (5)

The compositional aspects of the interpretation of symbolic structures at the linguistic level can be defined according to the usual model-theoretic semantics of terminological languages, as described by Nebel [50]. The main difference between the proposed semantics and the usual model-theoretic approach is that in our approach individual constants are not interpreted on unstructured set-theoretical entities (the elements of the domain). On the contrary, perception clusters are objects endowed with a rich internal structure. This fact involves relevant consequences. In the traditional model-theoretic approach the extension of primitive atomic predicates can be only assumed as given, and is, in a certain sense, completely arbitrary. In our approach, the extension of many primitive predicates can be determined on the basis of the structure of the entities in the semantic model itself. As a simple example, consider the has-part role of the Object concept. Given the assertion has-part(Hammer#1, Cylinder-shaped#1) in a purely extensional model-theoretic semantics, its truth is justified exclusively by the fact that the pair of the extensions of Hammer#1 and of Cylinder-shaped#1 belongs to the extension of has-part:

$(\Phi(\text{Hammer\#1}), \Phi(\text{Cylinder-shaped\#1})) \in \Phi(\text{has-part})$.    (6)

In the internal semantics, the truth of the previous assertion can be determined by examining the entities on which Hammer#1 and Cylinder-shaped#1 are interpreted in the conceptual space: the assertion is true if the set of knoxels on which Cylinder-shaped#1 is interpreted is a subset of the set of knoxels on which Hammer#1 is interpreted:

$\Phi(\text{Cylinder-shaped\#1}) \subseteq \Phi(\text{Hammer\#1})$.    (7)
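The contrast between the extensional reading (6) and the internal reading (7) can be made concrete with a small sketch. The dictionary `PHI` and the function `has_part` are hypothetical names; the assignment of the knoxel k1 to Cylinder-shaped#1 and of k3 to Ball#1 is an assumption made only for illustration.

```python
# Hypothetical interpretation of individual constants onto perception
# clusters (sets of knoxel labels). The concrete assignments are assumed.
PHI = {
    "Hammer#1": frozenset({"k1", "k2"}),
    "Cylinder-shaped#1": frozenset({"k1"}),
    "Ball#1": frozenset({"k3"}),
}


def has_part(whole, part, phi=PHI):
    """Internal semantics of has-part: the knoxels of the part must be a
    subset of the knoxels of the whole, as in Eq. (7)."""
    return phi[part] <= phi[whole]
```

No separate table listing the extension of has-part is needed here: the truth value follows from the structure of the perception clusters themselves.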

The assumption according to which the individual constants representing objects are interpreted onto perception clusters is a simplification made possible by the fact that we are dealing with static scenes. To characterize objects independently of position and orientation, the perception clusters should be suitably parametrized with respect to some of their constituents, i.e., they must be projected onto suitable subspaces of the whole conceptual space. Similarly, in a dynamic context, the internal structure of the semantic entities can be articulated further, in order to justify, at the semantic level, the truth of other kinds of atomic sentences. Consider, for example, object categorization. A given object can be recognized as an instance of a concept Flexible-object if, in the set of the perception acts concerning it at different instants, the object itself underwent some kind of deformation, i.e., if the shape factors or the length of axes varied within certain ranges.
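The Flexible-object example can be sketched as a check over the shape parameters observed for the same object at different instants. The function name `is_flexible`, the parameter names `axis_length` and `shape_factor`, and the tolerance value are all assumptions introduced for illustration; the text does not specify how the ranges are chosen.

```python
def is_flexible(observations, tolerance=0.05):
    """Return True if any shape parameter of one object varies by more
    than `tolerance` across its observations (hypothetical criterion).

    observations: list of dicts mapping parameter name -> value,
    one dict per instant.
    """
    keys = observations[0].keys()
    return any(
        max(o[k] for o in observations) - min(o[k] for o in observations) > tolerance
        for k in keys
    )


rigid = [{"axis_length": 1.0, "shape_factor": 0.5}] * 3
bent = [
    {"axis_length": 1.0, "shape_factor": 0.5},
    {"axis_length": 1.2, "shape_factor": 0.5},
]
```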

6. The focus of attention

As mentioned in the introduction, a finite agent with bounded resources cannot carry out a one shot, exhaustive, and uniform analysis of a perceived scene within reasonable time constraints. Some aspects of a scene are more relevant than others, and it would be irrational to waste time and computational resources to detect true but useless details. This is a typical problem of traditional symbolic models: Doyle [27] and Cherniak [22] stress the fact that, in order to avoid the proliferation of insignificant true conclusions, the aims and the purposes of an agent must be taken into account in the modeling of inferential activities. In modeling perception, these problems can be faced by taking into account the fundamental role of attentive phenomena in vision, as described in the work of Yarbus [67]. In the psychological literature, the focus of attention has sometimes been described as a spotlight which scans the visual field, individuating relevant aspects (see Posner [54]). This mechanism is analogous to the scanning of a mental image, as described by Kosslyn [43].

Several models of focus of attention mechanisms have been proposed in the artificial vision literature. An interest in some form of active processes during the recognition process is present in Marr’s work [47] as well. The early focus of attention models aimed at searching for a particular object in the scene, given a static model of the object. The basic purpose of an attentional mechanism is computational efficiency (see Ballard [7]). The subject has become a key point in the field of active vision research; the interest in this topic has been summarized by Bajcsy and Campos [6], who propose the “active and exploratory” framework for perception. According to this framework, the perception process of a living or artificial organism is based on four characteristics: it is an active and flexible task, it must have exploratory capabilities, it is a selective process, and it must be able to learn from the environment.


A strategy adopted by active vision researchers to model the focus of attention mechanism aims mainly at choosing an optimal viewing position for the sensors, in order to improve the interpretation of the image and to minimize uncertainty. According to this strategy, Whaite and Ferrie [66] propose a probabilistic measure of the uncertainty of the superquadrics parameters, with respect to a general view position. The observation point is therefore changed in order to minimize this uncertainty. Maver and Bajcsy [49] propose a similar strategy for reasoning about occlusions, that takes into account the knowledge of the sensor geometry. They plan the next positions of the sensor in order to extract information from regions of missing data.

Another widely used strategy to model the focus of attention (see Burt [21], Tsotsos, Culhane, Wai, Lai, Davis and Nuflo [64]) is based on the pyramidal approach. Accordingly, the image is represented by a hierarchical data structure; “fine-to-coarse” algorithms generate the image measures, while “coarse-to-fine” search strategies are able to locate objects or situations in the scene. A high level control system drives the gathering mechanism.

Other adopted strategies are based on Bayesian and causal models of the focus of attention. Rimey and Brown [56] propose TEA-1, a task oriented system that expends the minimum effort necessary for solving a specific task. The knowledge of the system is structured by Bayesian networks, while the control of action is carried out by a benefit-cost analysis. The system is able to answer questions about table settings, such as “Is this a fancy or an informal meal?”; the system activates the suitable visual actions controlling the focus of attention movements and the image processing tasks in order to answer the question. Birnbaum, Brand and Cooper [15] propose the BUSTER system, which is aimed at developing a causal explanation of the scene. The attention is driven by causal semantics in order to find the causal role of elements in the scene and the causal relationships among the elements. BUSTER encodes, in terms of rules, simple physical knowledge about static scenes made up of stacks of blocks incorporating architraves, cantilevers and balanced structures.

It is well known that at the lower, preattentive levels of visual perception there is a global parallel processing of visual information. The data received in input are concurrently processed, in order to produce a global reconstruction of the perceived scene. All data at this level have the same relevance, and no distinction is made between important and irrelevant information. According to Duncan and Humphreys [28], the goal of preattentive processing is a segmentation of the visual field into regions relevant from a purely perceptual point of view. At the attentive level, on the other hand, there is a sequential processing of visual information. From this point of view there is, in general, no “one shot” recognition of an object or of a scene; objects and scenes are, instead, recognized through a sequential exploration of the perceived image.

In our architecture, the conceptual level described in the previous section acts as a “buffer interface” between subsymbolic and linguistic processing. The information coming up from the subsymbolic level has the effect of simultaneously activating a (possibly very large) set of knoxels in the conceptual space. It is the focus of attention mechanism that imposes a sequential order in the conceptual space according to which the linguistic expressions can be given their interpretation.


Fig. 7. A perception act related to the scene in Fig. 3; the perception act describes the hammer as a sequence of the hammer handle knoxel and the hammer head knoxel.

In order to describe the focus of attention mechanism, we denote as $CS^*$ the set of all the possible sequences of elements belonging to CS, i.e., the set of all the possible sequences of knoxels:

$CS^* = \{k_1, k_1k_2, k_1k_3, \ldots\}$.    (8)

We define a perception act p as a generic sequence of knoxels in the conceptual space:

$p \in CS^*$.    (9)

Considering the scene in Fig. 3, a possible perception act may be: $p_1 = k_1k_2$. This perception act describes a way of perceiving the hammer as a sequence of its handle and its head (see Fig. 7). With reference to a perception cluster pc, we say that a perception act p is associated with the perception cluster pc if $p \in pc^*$, where $pc^*$ is the set of all sequences of knoxels belonging to pc. As an example, the previously introduced perception act $p_1 = k_1k_2$ is associated with the perception cluster $pc_1 = \{k_1, k_2\}$ describing the hammer. The perception acts associated with a perception cluster therefore correspond to specific ways of perceiving an object or situation described by the perception cluster. It should be noted that the sequence of knoxels that makes up a perception act may not include


all the knoxels of the corresponding perception cluster and/or it may include the same knoxels several times. In fact, the perception acts $p_2 = k_1k_2k_1$ or $p_3 = k_2k_1$ may also be considered as associated with the perception cluster $pc_1$. They correspond to other ways of perceiving the hammer present in the scene. We introduce a denomination function $\Theta$ associating perception acts with assertions at the linguistic level:

$\Theta : CS^* \rightarrow Assertion$    (10)

where Assertion is the set of grounded well-formed assertional formulas. Given a perception act p, $\Theta(p)$ is a grounded assertional formula in which a new assertional constant occurs, that denominates p. This new assertional constant is the “name” that the system associates with the perception act p. Block C in Fig. 1 implements the denomination function by means of suitable attractor neural networks, as described in the next section. According to the previous example, the denomination function maps the perception act $p_1 = k_1k_2$ related to the hammer, to the instance Hammer#1 of the concept Hammer:

$\Theta(k_1k_2) = \text{Hammer(Hammer\#1)}$.    (11)
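The notions of perception act, association with a perception cluster, and denomination can be sketched with plain tuples and sets. This is only an illustration: the lookup table `KNOWN_ACTS` stands in for the attractor networks of block C, and the function names and the `index` parameter are hypothetical.

```python
def is_associated(act, cluster):
    """A perception act (a non-empty sequence of knoxels) is associated
    with a perception cluster if every knoxel of the act belongs to it.
    Repetitions and omissions are both allowed."""
    return len(act) > 0 and all(k in cluster for k in act)


# Hypothetical stand-in for the neural denomination block: a table from
# stored knoxel sequences to concept names.
KNOWN_ACTS = {("k1", "k2"): "Hammer"}


def denominate(act, index=1):
    """Map a perception act to a grounded assertion with a new constant,
    or return None when the sequence is not recognized."""
    concept = KNOWN_ACTS.get(tuple(act))
    if concept is None:
        return None
    return f"{concept}({concept}#{index})"


pc1 = {"k1", "k2"}
```

In this toy table only the sequence k1 k2 is stored, so the act k2 k1 is associated with the cluster but is not denominated; which sequences are actually recognized depends on what the networks have learned.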

Analogous considerations can be made for concepts describing spatial situations, such as Above and Next-to. Our proposal for the description of complex concepts in terms of sequences of knoxels is a way of dealing with attentional mechanisms in a well-founded manner, which naturally extends Gärdenfors’ notion of conceptual space. It should be noted that the perception act assumption avoids the need to augment the dimensions of the space in order to describe complex objects or situations made up of several blocks: complex objects can be described by perception acts of arbitrary length.

In order to individuate the grouping paths among knoxels and to generate the most significant perception acts, it is necessary to orient the focus of attention in a suitable manner. In human beings, the focus of attention can be oriented either voluntarily, under the guidance of high level cognitive information and processes, or automatically, in dependence on particular stimuli present in the perceptive field, as described by Posner [54] and Jonides [40]. We assume that the focus of attention is determined by three concurrent modes: the reactive, the linguistic and the associative mode.

The reactive mode is the simplest one: the grouping paths among knoxels are determined only by the characteristics of the visual stimulus, e.g., the volumetric extension of the forms, or the aggregation density of the perceived objects. As an example related to the previous scene, when the architecture is in reactive mode the focus of attention is directed to the hammer handle and to the hammer head because of their volumetric extension, thus generating the perception act $p_1 = k_1k_2$ (see Fig. 7). The knoxels related to this perception act are sent to the denomination block to find the corresponding linguistic constants at the linguistic level. The denomination block correctly denominates the input perception act as an instance of the Hammer concept. The assertions generated at the linguistic level describing the operation of the architecture in reactive mode are reported in Fig. 8.

In the linguistic mode, the focus of attention is driven by the symbolic information explicitly represented at the linguistic level. Consider again the hammer example


Knoxel(#k1)
Knoxel(#k2)
Cylinder-shaped(#k1)
Box-shaped(#k2)
Hammer(Hammer#1)
has-part(Hammer#1,#k1)
has-part(Hammer#1,#k2)

Fig. 8. The assertions generated at the linguistic level related to the perception act represented in Fig. 7.

Knoxel(#k1)
Knoxel(#k2)
Cylinder-shaped(#k1)
Box-shaped(#k2)
Hammer(Hammer#1)
has-handle(Hammer#1,#k1)
has-head(Hammer#1,#k2)

Fig. 9. The assertions generated at the linguistic level related to the linguistic expectations for the perception act represented in Fig. 7.

(Fig. 7). At the linguistic level a hammer is described as composed of a handle of cylindrical shape and a head of box shape. Let us suppose that the denomination block has recognized the knoxel corresponding to the cylinder shape. The description of a hammer at the linguistic level reports that it is made of a cylinder shaped handle and a box shaped head. The linguistic level therefore hypothesizes that the cylinder shaped knoxel may be, among other things, a filler for the role has-handle of the concept Hammer. The linguistic mode of the focus of attention now attempts the identification of the parts of the hammer, in particular its handle and its head, in order to verify the presence of such a hammer in the scene. This corresponds to finding the suitable fillers for the role parts of the object, i.e., a filler for the has-head role and a filler for the has-handle role. Therefore, whenever a possible hammer handle is recognized, the focus of attention tries to identify a hammer by identifying the possible fillers for its head and its handle. When some of the expectations concerning these knoxels are satisfied by some corresponding knoxels in the scene, the perception act made up of the knoxels that satisfy these conditions is sent to the denomination function in order to recognize the object or the situation, as in the reactive mode. It should be noted, however, that if the cylinder has been recognized, and there is no recognizable hammer head, i.e., the linguistic expectations are not satisfied, the architecture cannot recognize the hammer. Block D in Fig. 1 implements the linguistic expectation function by means of suitable attractor neural networks, as described in the next section. The assertions generated at the linguistic level related to the described example are reported in Fig. 9.
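The linguistic mode described above amounts to hypothesizing a concept from one recognized role filler and then searching the scene for the remaining fillers. A minimal symbolic sketch follows; the dictionary `TBOX`, the scene encoding and the function names are hypothetical, and a table lookup replaces the attractor networks of block D.

```python
# Hypothetical terminological fragment: concept -> {role: required shape}.
TBOX = {"Hammer": {"has-handle": "Cylinder-shaped", "has-head": "Box-shaped"}}


def linguistic_expectations(shape, tbox=TBOX):
    """List the (concept, role) pairs that a knoxel of this shape could fill."""
    return [(c, r) for c, roles in tbox.items() for r, s in roles.items() if s == shape]


def verify(concept, scene, tbox=TBOX):
    """Try to fill every role of `concept` with a distinct knoxel of the scene.

    scene: dict mapping knoxel label -> shape. Returns a role -> knoxel
    mapping, or None when some expectation is not satisfied.
    """
    fillers, used = {}, set()
    for role, shape in tbox[concept].items():
        match = next((k for k, s in scene.items() if s == shape and k not in used), None)
        if match is None:
            return None
        fillers[role] = match
        used.add(match)
    return fillers


scene = {"k1": "Cylinder-shaped", "k2": "Box-shaped"}
```

With this scene, the cylinder suggests the Hammer concept through has-handle, and `verify` fills both roles; with the box removed, `verify` returns None, mirroring the case in which the hammer cannot be recognized.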


Fig. 10. The resulting perception act related to the previous scene when the focus of attention is driven by the associative expectation to find a ball and a mouse.

In the associative mode of the focus of attention, the grouping paths are determined by an associative, purely Hebbian mechanism determining the attention on the basis of free associations between concepts. Whenever two objects in the same scene are perceived, the weight of the associative connection between the corresponding concepts is increased. So if hammers and balls have always been present in the same scene, the weight of the association between the concepts Hammer and Ball is strong, and they mutually activate each other. As a consequence, whenever a hammer is recognized, the focus of attention tries to identify some balls in the perceived scene, and vice versa. Let us suppose that the hammer has been recognized by the linguistic mode (see the previous example). At the linguistic level, the concept Hammer is associated by a Hebbian mechanism to the concepts of Ball and Mouse, due to a previous learning phase. The linguistic level therefore hypothesizes the presence of these objects in the scene and the associative expectations block generates the corresponding hypotheses. As in the linguistic mode, when some of these expectations are satisfied by some corresponding knoxels in the scene, the perception act made up of the knoxels found to be so is sent to the denomination block. Fig. 10 shows the resulting perception act, while Fig. 11 shows the corresponding assertions generated at the linguistic level. Block E (Fig. 1) implements the associative expectations by means of attractor neural networks, just as for the linguistic expectations. The implementation will be described in the next section.
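The Hebbian co-occurrence rule can be sketched at the symbolic level as a table of pairwise weights that grows whenever two concepts appear in the same scene. The class `AssociativeMemory`, the unit learning rate and the threshold are assumptions for illustration; in the architecture this role is played by the attractor networks of block E.

```python
from collections import defaultdict
from itertools import combinations


class AssociativeMemory:
    """Hypothetical co-occurrence memory between concepts."""

    def __init__(self, rate=1.0):
        self.rate = rate
        self.weights = defaultdict(float)  # frozenset({a, b}) -> weight

    def observe(self, concepts_in_scene):
        # Strengthen the link between every pair of co-occurring concepts.
        for a, b in combinations(sorted(set(concepts_in_scene)), 2):
            self.weights[frozenset((a, b))] += self.rate

    def expectations(self, concept, threshold=1.0):
        # Concepts whose association with `concept` is strong enough.
        expected = []
        for pair, weight in self.weights.items():
            if concept in pair and weight >= threshold:
                (other,) = pair - {concept}
                expected.append(other)
        return sorted(expected)


memory = AssociativeMemory()
for _ in range(3):
    memory.observe(["Hammer", "Ball", "Mouse"])
memory.observe(["Hammer", "Cup"])
```

After this toy learning phase, recognizing a hammer with a threshold of 2.0 yields expectations for Ball and Mouse but not for Cup, and the association is symmetric, so recognizing a ball also raises the expectation of a hammer.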


Knoxel(#k1)
Knoxel(#k2)
Knoxel(#k3)
Knoxel(#k4)
Cylinder-shaped(#k1)
Box-shaped(#k2)
Hammer(Hammer#1)
has-handle(Hammer#1,#k1)
has-head(Hammer#1,#k2)
Ball-shaped(#k3)
Ball(Ball#1)
has-part(Ball#1,#k3)
Ellipsoid-shaped(#k4)
Mouse(Mouse#1)
has-part(Mouse#1,#k4)

Fig. 11. The assertions generated at the linguistic level related to the associative expectations for the perception act represented in Fig. 10.

Next-to(Next-to#1)
participant(Next-to#1,Hammer#1)
participant(Next-to#1,Ball#1)

Fig. 12. The assertions generated at the linguistic level related to the spatial situations in the previous scene.

The task of recognizing the perception acts in the case of spatial situations is similar to the task of recognizing the objects; Fig. 12 shows the assertions generated for the spatial concept Next-to (described in Section 4), after the recognition steps of the objects present in the previous scene. In particular the assertions state that the hammer and the ball are side by side: the perception act obtained as the sequence of the knoxels of the hammer and the knoxel of the ball ($k_1k_2k_3$) has been recognized by the denomination block as a Next-to situation.

It should be clarified that the distinction between the associative mode and the linguistic mode is a soft one: even the linguistic mode in some sense “associates” the perceived object with some expected objects. As will be explained in the next section, both modes are implemented by attractor neural networks with suitable associative capabilities trained by a careful learning phase. The main difference between the two modes is that the associative mode captures all the free associations not described by the semantic network at the linguistic level, while the linguistic mode associations are driven by the conceptual description at the linguistic level. In the previous scene, for example, the neural networks responsible for the associative mode of the focus of attention have learned to associate a hammer to a ball, but they have not learned to associate the cylinder related to the hammer handle to the box related to the hammer head, even if the two objects are always present in the same scene. This kind of association is in fact managed by the linguistic mode.

Fig. 13. Another perception act related to the previous scene; the focus of attention is directed to the hammer head, the ball, and the hammer handle.

The main goal of the expectation generation process is to obtain the most exhaustive possible interpretation of the acquired scene while avoiding the generation of true but useless assertions. When the associative and linguistic expectations are not activated, the architecture describes the scene only by means of the simple reactive mode. In this case the architecture has no other choice than to build and denominate all the possible perception acts obtained by combining all the knoxels present in the scene. The reactive mode alone therefore generates a combinatorially exploding number of assertions, most of which are true but uninformative. It should be noted, in fact, that the denomination of the objects strictly depends on the particular knoxel sequence found: when the input perception act contains the hammer head, the ball, and the hammer handle (Fig. 13), the denomination block does not recognize the hammer, but it recognizes the three knoxels as three distinct objects: a cylinder, a ball and a box. The generated assertions (Fig. 14) are true but they do not describe the scene exhaustively. Furthermore, as the reactive mode has no access to the descriptions of objects of the terminological component, the architecture is not able to fill the roles for the parts of an object: e.g., the reactive mode is able to recognize the hammer (see the assertions in Fig. 8), but it is not able to

Knoxel(#k1)
Knoxel(#k2)
Knoxel(#k3)
Cylinder-shaped(#k1)
Cylinder(Cylinder#1)
has-part(Cylinder#1,#k1)
Ball-shaped(#k3)
Ball(Ball#1)
has-part(Ball#1,#k3)
Box-shaped(#k2)
Box(Box#1)
has-part(Box#1,#k2)

Fig. 14. The assertions generated at the linguistic level related to the perception act represented in Fig. 13.

recognize the cylinder shaped knoxel as the hammer handle and the box shaped knoxel as the hammer head.

The denomination and attention mechanisms have been described up to now in isolation. As a matter of fact they operate concurrently in a simple recognition process cycle. The process is bootstrapped by the reactive mode of the focus of attention, which enables the denomination block to recognize objects “evident” in the scene: e.g., referring to Fig. 4, the reactive mode identifies the hammer, which is recognized by the denomination block. This allows a balancing between the associative and the linguistic modes of the focus of attention to satisfy their own generated expectations. As a default, the architecture first tries sequentially to recognize the objects anticipated by the linguistic expectations, and then the objects anticipated by the associative expectations. However, it is possible to ignore one or both of them. When no expectations are satisfied, the recognition process restarts through the reactive mode of the focus of attention in search of a new and as yet unrecognized object.

The focus of attention mechanism may be modeled as an expectation function $\Psi$ linking the linguistic to the conceptual level; the function has its domain in the set Assertion of assertional grounded well-formed formulas, and its range in the set of perception acts. Therefore the function $\Psi$ is of the following type:

$\Psi_i : Assertion \rightarrow CS^*$, where $i = 1, 2$    (12)

where $CS^*$, as previously discussed, is the set of all perception acts; the index i indicates the attentive mode: 1 stands for the linguistic mode and 2 stands for the associative mode. The function $\Psi$ generates expectations on the perception acts to be found in the conceptual space, on the basis of the available assertional information. In other words, the focus of attention looks for specific perception acts belonging to the perception clusters corresponding to the “expected” assertional constant in the perceived scene. We


have chosen to model the attentive modes by the same function in order to reinforce the fact that the distinction between the associative mode and the linguistic mode is a soft one, and that they should be considered to be two faces of the same global attentive process. In the linguistic mode, the assertional constants are generated when the system makes linguistic inferences; as a consequence the focus of attention generates the perception acts made up of knoxel samples for these assertional constants. The system globally computes the expectation function by taking into account the information described in the terminological component. In terms of the previous example, when a cylinder shaped knoxel is found, the focus of attention searches for the fillers of the roles has-part of the Hammer concept, e.g., a filler for has-head and a filler for has-handle. Therefore the function $\Psi_1$(Cylinder-shaped(#k1)) generates perception acts made up of sample fillers of these roles. The associative mode is similar to the linguistic mode, except that the focus of attention searches for objects freely associated to the constants introduced at the linguistic level. In the hammer example, when a hammer is found, the function $\Psi_2$(Hammer(Hammer#1)) generates perception acts made up of samples of balls and mice. In the reactive mode, the focus of attention searches for generic objects in the scene. For the purpose of uniformity, this mode is considered a special case of the linguistic mode where the expected object is an instantiation of the most generic class, e.g., an Object.
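One possible reading of the recognition cycle described in this section is sketched below: the reactive mode proposes an act, the denomination block names it, and the linguistic and associative expectation functions are then tried in the default order. The function `recognition_cycle`, its arguments and the toy stubs are all hypothetical; the sketch fixes a control flow that the text only outlines.

```python
def recognition_cycle(reactive, linguistic, associative, denominate, max_steps=10):
    """reactive() returns a perception act or None; linguistic(a) and
    associative(a) return candidate acts for assertion a; denominate(act)
    returns an assertion or None."""
    assertions = []
    for _ in range(max_steps):
        act = reactive()  # bootstrap with an "evident" object
        if act is None:
            break
        seed = denominate(act)
        if seed is None:
            continue
        assertions.append(seed)
        frontier = [seed]
        while frontier:
            current = frontier.pop(0)
            # Default order: linguistic expectations, then associative ones.
            for expect in (linguistic, associative):
                for candidate in expect(current):
                    found = denominate(candidate)
                    if found is not None and found not in assertions:
                        assertions.append(found)
                        frontier.append(found)
    return assertions


# Toy stubs (assumed data) standing in for the blocks of the architecture.
queue = [("k1", "k2")]
names = {("k1", "k2"): "Hammer(Hammer#1)", ("k3",): "Ball(Ball#1)", ("k4",): "Mouse(Mouse#1)"}


def reactive():
    return queue.pop(0) if queue else None


def denominate(act):
    return names.get(tuple(act))


def linguistic(assertion):
    return []


def associative(assertion):
    return [("k3",), ("k4",)] if assertion.startswith("Hammer(") else []


result = recognition_cycle(reactive, linguistic, associative, denominate)
```

When no expectation yields a new assertion, the inner loop ends and control returns to the reactive mode, which corresponds to the restart described above.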

7. The connectionist implementation of the link between conceptual and linguistic levels

A perception cluster, as described in Section 3, is a set of knoxels associated to an object or a situation: $pc = \{k_1, k_2, \ldots, k_l\}$. Each knoxel $k_i$ may be viewed as a point attractor of a suitable energy function associated to the whole perception cluster. In this way, a set of fixed point attractors models and generates the perception cluster: starting from an initial state representing a knoxel imposed, for instance, from the external input, the system state trajectory is attracted in turn to the nearest stored knoxel of the perception cluster. The implementation of a perception cluster by means of an attractor neural network (see Hopfield [38], Amit [3]), characterized by the corresponding energy function, appears to be a natural choice: each knoxel of the cluster is an activation pattern learned by the network. The implementation of the perception acts associated with a perception cluster is built by means of time delayed connections that learn the corresponding temporal sequences of knoxels, as proposed by Kleinfeld [41] and by Kleinfeld and Sompolinsky [42]. This modification allows the attractor neural network both to recognize and to generate all the perception acts corresponding to a concept. Therefore, to implement the denomination and expectation functions mapping the conceptual onto the linguistic level (the blocks C, D and E of Fig. 1), each concept at the linguistic level is associated to a suitable attractor neural network.
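The idea of knoxels as point attractors can be illustrated with a small Hopfield-style network in plain Python. The sketch rests on several assumptions that are not taken from the paper: patterns are coded as +1/-1 vectors, the weights follow the standard Hebbian outer-product rule with zero diagonal, and the state is updated synchronously by the sign of the local field.

```python
def hebbian_matrix(patterns):
    """Outer-product weights scaled by 1/m, with zero diagonal."""
    m = len(patterns[0])
    return [[0.0 if i == j else sum(p[i] * p[j] for p in patterns) / m
             for j in range(m)] for i in range(m)]


def energy(T, k):
    """Energy of state k: minus the double sum of T[i][j] * k[i] * k[j]."""
    m = len(k)
    return -sum(T[i][j] * k[i] * k[j] for i in range(m) for j in range(m))


def recall(T, k, max_iters=10):
    """Synchronous sign updates until a fixed point is reached."""
    state = list(k)
    for _ in range(max_iters):
        new = [1 if sum(T[i][j] * state[j] for j in range(len(state))) >= 0 else -1
               for i in range(len(state))]
        if new == state:
            break
        state = new
    return state


# Two orthogonal toy "knoxels" coded on m = 8 units (assumed coding).
xi1 = [1, 1, 1, 1, -1, -1, -1, -1]
xi2 = [1, 1, -1, -1, 1, 1, -1, -1]
T = hebbian_matrix([xi1, xi2])
noisy = [-1] + xi2[1:]  # xi2 with its first unit flipped
```

Starting from `noisy`, `recall` settles on `xi2`, the nearest stored pattern, and the energy of the recalled state is lower than that of the corrupted one.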


The choice of time delay attractor neural networks offers several advantages. It is based on the well-studied energetic approach; the learning phase is fast, since it is performed at “one shot”. Furthermore, as it allows for a uniform treatment of both the recognition and the generation of perception acts, the denomination functions and the expectation functions introduced in the previous section may be implemented by a uniform neural network architecture design. For the sake of simplicity, we have adopted the binary unit version of the attractor neural network; the coding of the knoxels in terms of the binary activation pattern of the network has been computed by the coarse coding algorithm proposed by Hinton, McClelland and Rumelhart [37]. The general expression of the energy function of an attractor neural network for a perception cluster is:

$E_1(t) = -\sum_{i=1}^{m} \sum_{j=1}^{m} T_{ij} k_i(t) k_j(t)$, with $j \neq i$    (13)

where m is the number of binary units of the network, T is the connection matrix storing the attractors representing the knoxels of the perception cluster, and k(t) is the knoxel representing the current activation pattern of the network. The number m of units depends on the number l of knoxels in the perception cluster according to the low memory load condition discussed in Amit [3]:

$l < \alpha_c m$    (14)

where $\alpha_c \approx 0.3$. The connection matrix T is given by:

$T_{ij} = \frac{1}{m} \sum_{\nu=1}^{l} k_i^{\nu} k_j^{\nu}$, with $j \neq i$    (15)

where $k^{\nu}$ is the $\nu$th knoxel of the perception cluster. In order to describe a perception act associated to the perception cluster, a sequential operation in the corresponding attractor neural network is implemented by introducing time delayed connections among units. These connections store the time sequence of knoxels in the perception act; the resulting energy term is:

$E_2(t) = -\sum_{d=1}^{s} \sum_{i=1}^{m} \sum_{j=1}^{m} D^d_{ij} k_i(t) k_j(t - d\tau)$, with $j \neq i$    (16)

where $\tau$ is the time delay between two subsequent knoxels in the perception act p, s is the amplitude of the time window of interest, $D^d$ is the delayed synapses connection matrix related to the time delay $d\tau$, and $k(t)$ and $k(t - d\tau)$ are respectively the current and the past $(d\tau)$th knoxel of the perception act. The connection matrix $D^d$ is given by:

(17)


where $k^{\xi}$ and $k^{(\xi+d)}$ are respectively the $\xi$th and the $(\xi + d)$th knoxel of the current perception act; h is the length of the considered perception act. The global external input to the network is modeled by the energy term:

$E_3(t) = -\sum_{i=1}^{m} \sum_{j=1}^{m} F_{ij} k_i(t) I_j(t)$, with $j \neq i$    (18)

where F is the external input connection matrix and I(t) is the actual activation input pattern of the network coming from the conceptual space. The connection matrix F is given by:

$F_{ij} = \frac{1}{m} \sum_{\nu=1}^{l} k_i^{\nu} I_j^{\nu}$, with $j \neq i$    (19)

where $I^{\nu}$ is the input corresponding to the knoxel $k^{\nu}$. The global energy function is the sum of (13), (16) and (18):

$$E(t) = E_1(t) + \lambda E_2(t) + \varepsilon E_3(t) \qquad (20)$$

where $\lambda$ and $\varepsilon$ are the weighting parameters of the time delayed synapses and the external input synapses, respectively.

The expectation functions $\Psi$, corresponding to blocks D and E, are implemented by setting the parameters of the energy function $E(t)$ to $\lambda > 1$ and $\varepsilon = 0$. In fact, the task of these blocks is to generate suitable knoxel sequences representing the expected perception acts for the input assertion. This choice of parameters allows the transitions among knoxels to occur “spontaneously” with no external input. Referring to (20), it can be shown that an attractor is stable for a significantly long time period due to the $E_1(t)$ term, so that the output knoxel is easily observed. As $\lambda > 1$, the term $\lambda E_2(t)$ is able, after some $d\tau$, to destabilize the attractor and to carry the activation pattern of the network toward the following attractor of the sequence, which represents the next knoxel of the stored perception act. The neural network therefore visits in sequence all the knoxels of the stored perception act related to the input assertion.

The denomination function $\Theta$, corresponding to block C of Fig. 1, is implemented by setting the parameters of the energy function $E(t)$ to $\lambda < 1$ and $\varepsilon > 0$. The task of this block is the recognition of input knoxel sequences representing the input perception acts. To accomplish this task it is necessary to consider the input term $E_3(t)$, so that the transitions among knoxels are driven by the external input. When $\lambda < 1$, the term $\lambda E_2(t)$ is not able by itself to drive the activation pattern transition among the knoxels of the perception act, but when the term $\varepsilon E_3(t)$ is added, the contribution of both terms makes the transition happen. The neural network therefore recognizes the input perception act as it “resonates” with one of the perception acts previously stored, and generates the corresponding assertion.
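The two operating modes described above can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not the implementation used in the experiments: units take the values +1/-1, the update is synchronous with a sign threshold, a single delay is used (s = 1), the stored knoxels are mutually orthogonal rows of a Hadamard matrix rather than coarse-coded knoxels, the input pattern of each knoxel is the knoxel itself, and the weights use a 1/m Hebbian normalization. All function and variable names are hypothetical.

```python
def hadamard_row(r, m):
    """Row r of the m x m Sylvester Hadamard matrix (m a power of two); entries are +1/-1."""
    return [1 if bin(r & c).count("1") % 2 == 0 else -1 for c in range(m)]


def hebbian(pairs, m):
    """(1/m) * sum over (a, b) of the outer product a b^T, with the diagonal removed (j != i)."""
    return [[0.0 if i == j else sum(a[i] * b[j] for a, b in pairs) / m
             for j in range(m)] for i in range(m)]


def matvec(M, v):
    return [sum(w * x for w, x in zip(row, v)) for row in M]


def energy_e1(T, k):
    """Static energy term E_1 for a +1/-1 state k (T has a zero diagonal)."""
    return -0.5 * sum(ki * hi for ki, hi in zip(k, matvec(T, k)))


def overlap(a, b):
    return sum(x * y for x, y in zip(a, b)) / len(a)


def run(seq, lam, eps, inputs=None, tau=3, steps=20):
    """Synchronous dynamics; returns the indices of the stored knoxels visited, in order."""
    m = len(seq[0])
    T = hebbian([(k, k) for k in seq], m)          # static attractors
    D = hebbian(list(zip(seq[1:], seq[:-1])), m)   # delayed links: knoxel xi -> knoxel xi+1
    F = hebbian([(k, k) for k in seq], m)          # input links, assuming input pattern = knoxel
    history = [seq[0]] * (tau + 1)                 # the network has rested on the first knoxel
    visited = [0]
    for t in range(steps):
        now, past = history[-1], history[-1 - tau]
        ext = inputs[t] if inputs is not None else [0] * m
        field = [a + lam * b + eps * c
                 for a, b, c in zip(matvec(T, now), matvec(D, past), matvec(F, ext))]
        state = [1 if x >= 0 else -1 for x in field]
        history.append(state)
        best = max(range(len(seq)), key=lambda v: overlap(state, seq[v]))
        if best != visited[-1]:
            visited.append(best)
    return visited


M_UNITS = 16                                            # 4 knoxels < 0.3 * 16 units
SEQ = [hadamard_row(r, M_UNITS) for r in (1, 2, 3, 4)]  # four orthogonal toy knoxels
INPUTS = [SEQ[min(t // 4, 3)] for t in range(20)]       # each knoxel presented for four epochs

print(run(SEQ, lam=2.0, eps=0.0))                 # expectation mode: [0, 1, 2, 3]
print(run(SEQ, lam=0.5, eps=0.0))                 # weak delays, no input: [0]
print(run(SEQ, lam=0.5, eps=0.8, inputs=INPUTS))  # denomination mode: [0, 1, 2, 3]
```

With the strong delayed term and no input, the toy network walks through the stored sequence on its own; with the weak delayed term it stays on the first knoxel unless the external input presents the knoxels in the stored order. The orthogonal patterns are chosen only to keep the example free of crosstalk, so the margins seen here say nothing about real coarse-coded knoxels.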
To examine the operations of the neural networks employed, we adopt the %overlap measure of performance (see Amit [3]) during network epochs, where an epoch is an activation cycle of the neural network. This measure of performance is defined with respect to a previously learned knoxel $k$ as the time evolution of the overlap, in terms of the normalized dot product, between the current knoxel output of the network $k(t)$ and the previously learned knoxel $k$:

$$\%\mathit{overlap}(t) = \frac{1}{m}\sum_{i=1}^{m} k_i(t)\,k_i \qquad (21)$$

Fig. 15. The diagrams of the %overlap versus epoch of the neural networks associated to the concepts (a) Hammer and (b) Ball when the input is the perception act of Fig. 7.
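As a small illustration of this measure, the following Python sketch computes the overlap as a dot product divided by the number of units. It assumes +1/-1 units, so that the value lies between -1 and 1; the stored pattern and the three network outputs are invented for the example.

```python
def overlap(current, learned):
    """Normalized dot product between two +1/-1 activation patterns of equal length."""
    if len(current) != len(learned):
        raise ValueError("patterns must have the same number of units")
    return sum(c * k for c, k in zip(current, learned)) / len(learned)


learned = [1, -1, 1, 1, -1, -1, 1, -1]     # hypothetical coding of a stored knoxel
# Hypothetical network outputs over three epochs, converging on the stored pattern.
outputs = [
    [-1, 1, 1, 1, -1, -1, 1, -1],          # two units differ from the stored pattern
    [1, 1, 1, 1, -1, -1, 1, -1],           # one unit differs
    [1, -1, 1, 1, -1, -1, 1, -1],          # identical to the stored pattern
]
trace = [overlap(k_t, learned) for k_t in outputs]
print(trace)  # [0.5, 0.75, 1.0]
```

Plotting such a trace against the epoch index, one line per stored knoxel, gives a diagram of the kind described for the networks in this section.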

Fig. 16. The diagram of the %overlap versus epochs measure of the neural network generating the linguistic expectations for the has-handle filler of the hammer.

Let us consider the operation of the attractor neural network in the situation depicted in Fig. 7: the focus of attention is directed, by the reactive mode, to the hammer handle and the hammer head, and the knoxels related to this perception act are sent to the denomination block to generate the corresponding assertion at the linguistic level. Fig. 15 shows the %overlap versus epochs measures of the neural networks associated with the concepts Hammer (Fig. 15(a)) and Ball (Fig. 15(b)). Each line of the diagrams shows the %overlap of the output activation pattern of the networks with respect to a previously learned knoxel, when the input is the perception act $p_1$ describing the hammer (Fig. 7). It should be noted that the sequence of input knoxels, representing the hammer handle and the hammer head, “resonates” with the previously learned sequence of knoxels s#1 and s#3 of the network associated with the Hammer concept. On the other hand, the overlap of this sequence of knoxels with the sequences stored in the other network is low. Therefore the denomination block correctly denominates the input perception act as an instance of the Hammer, as described in the generated assertions reported in Fig. 8.

Let us consider at this point the operation of the linguistic expectations block (block D of Fig. 1) during the example described in the previous section: a cylinder has been found and the linguistic level hypothesizes the presence of a hammer in the scene. The linguistic expectations block generates the hypothesized instances of the hammer head and of the hammer handle. Fig. 16 shows the %overlap versus epochs measure of the neural network generating the possible expected knoxel instances of the has-handle filler of the hammer. As in the previous diagrams, each line shows the %overlap of the output activation pattern of the network with respect to a previously learned knoxel. It should be noted that the network generates the knoxel hypotheses s#1, s#2 and s#6 as possible hammer handles. The knoxel s#4 does not belong to the hypotheses. After this step, the network generates the possible expected knoxel instances of the has-head fillers of the hammer. When some of these knoxels are satisfied by some knoxels in the scene, the resulting perception act is sent to the denomination block to recognize an instance of the Hammer, thus generating the assertions in Fig. 9.

The operation of the associative expectations block (block E of Fig. 1) in the example considered follows the same guidelines as the linguistic expectations block: at the linguistic level, the Hammer is associated, by a Hebbian mechanism, to the Ball and the Mouse, due to the previous learning phase. The attractor neural network generates the possible expected knoxel instances of the Ball and of the Mouse; at the end of the operation, the assertions of Fig. 11 are generated.

8. Experimental setup

This section describes the setup adopted to obtain the examples presented throughout the theoretical discussion, along with other more complex examples of the operation of the architecture. We have chosen an experimental framework that avoids some typical complex problems encountered in 3D vision. Even in this minimal framework, our architecture is able to draw interesting inferences and to build an interpretation context.

The framework consists of static scenes made up of objects like hammers, tennis balls, computer mice and telephones; all the objects rest on a uniform, visually contrasting planar backdrop. The objects are easy to segment and they are arranged in order to avoid occlusions. Sensory data are 2D images acquired by a video camera (two-dimensional arrays of pixels) representing an orthogonal view of the observed scene, as in Fig. 3.

Starting from the acquired pictorial image, the subsymbolic level (see Fig. 1) computes the segmentation map by means of a region growing algorithm (see Zucker [68]): the image is initially partitioned into elementary regions of uniform brightness, and the adjacent regions for which the contrast difference is low are merged. Fig. 17 shows the segmentation map found after the region growing phase starting from the scene in Fig. 3. The relative depth map is then computed by the Tsai and Shah shape from shading algorithm [62] (see Fig. 18). We do not calculate the local orientation map. Both the depth map and the information about the segmented regions are fed as input to block B of Fig. 1. The first operation of this block is the volumetric representation of the input depth map by a spatial array. The result is a discrete representation of the spatial bulk of the objects present in the scene by voxels, i.e., in terms of primitive volume elements (see Fig. 19).

Fig. 17. The result of the segmentation phase. The regions found after the segmentation phase starting from Fig. 3 are set into relief.

Fig. 18. The depth map of the acquired scene obtained by the shape from shading algorithm.

Fig. 19. The voxel representation of the acquired scene.

Fig. 20. A complex scene made up of a hammer, a cordless telephone, a wood block and a mouse.
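Two of the subsymbolic steps described in this section, region growing and the volumetric representation of the depth map, can be sketched in Python as follows. This is a simplified illustration and not the procedure of [68] or the actual block B. It assumes that every pixel starts as its own elementary region, that two adjacent regions are merged when the brightness difference between neighbouring pixels is at most `max_contrast`, that depth values are normalized to [0, 1], and that each column is filled from the backdrop up to the visible surface. The function names, the threshold and the test data are hypothetical.

```python
def region_growing(image, max_contrast):
    """Label grid in which 4-adjacent pixels of similar brightness share a label."""
    rows, cols = len(image), len(image[0])
    parent = list(range(rows * cols))          # union-find forest, one node per pixel

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]      # path halving
            a = parent[a]
        return a

    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):    # right and lower neighbour
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and abs(image[r][c] - image[rr][cc]) <= max_contrast:
                    parent[find(r * cols + c)] = find(rr * cols + cc)
    return [[find(r * cols + c) for c in range(cols)] for r in range(rows)]


def voxelize(depth, levels):
    """occupancy[z][r][c] is True when voxel z lies below the surface seen at pixel (r, c)."""
    return [[[z < round(d * levels) for d in row] for row in depth] for z in range(levels)]


IMAGE = [
    [10, 10, 10, 10, 10],
    [10, 200, 200, 10, 10],
    [10, 200, 200, 10, 90],
    [10, 10, 10, 10, 90],
]
labels = region_growing(IMAGE, max_contrast=20)
print(len({label for row in labels for label in row}))   # 3: backdrop and two objects

DEPTH = [[0.0, 0.5], [1.0, 0.25]]
occupancy = voxelize(DEPTH, levels=4)
print(sum(v for plane in occupancy for row in plane for v in row))   # 7 occupied voxels
```

Merging on pixel-to-pixel differences is the crudest possible criterion; a region-level contrast measure would be closer to the description in the text, at the cost of a longer example.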

Fig. 21. The superquadric reconstruction of the scene in Fig. 20 along with the focus of attention movements during the exploration of the scene.

In order to describe the scene in terms of superquadric parameters, and therefore in terms of knoxels, each part of the scene that results from the region growing algorithm is approximated by means of the best fitting superquadric. The superquadric approximation operation is carried out by applying a simple two-step algorithm [4]. First, the center $p_x$, $p_y$, $p_z$ and the orientations of the principal axes $\phi$, $\theta$ and $\psi$ of the part under consideration are calculated, by determining the point and the unit vectors with respect to which all the products of inertia are zero, following the algorithm proposed by Chien and Aggarwal [23]. Once the center and the principal axes are known, the computation of the lengths $a_x$, $a_y$ and $a_z$ of the axes of the superquadric approximating the considered part is trivial. In the second step, the form parameters $(\varepsilon_1, \varepsilon_2)$ that best correspond to the squareness features of the object are obtained by minimizing the error function proposed by Solina and Bajcsy [57]. Since the center, orientation and axes are known quantities, the error function depends solely on the form parameters, and its minimum corresponds to the values defining the superquadric that best fits the given part. The approximation of each part therefore requires an optimization procedure in the two-dimensional space of the form parameters.

Fig. 4 shows the results of the recovery of the superquadrics of the acquired scene of Fig. 3: each region of Fig. 17 has been approximated by a superquadric. Fig. 20 shows a more complex scene made up of a hammer, a cordless telephone, a wood block and a mouse. Fig. 21 shows the superquadric reconstruction of the same scene along with the focus of attention movements during the exploration of the scene. Fig. 22 shows the assertions generated at the linguistic level. By analyzing the focus of attention, it is possible to see that it follows two sequences: a sequence in which the attention is focused on the hammer, the block and the mouse, and another sequence in which the attention is focused on the body and the antenna of the telephone. The assertions generated at the linguistic level and the dynamics of the time delay neural networks may therefore be analyzed as a concatenation of these two sequences.

It should be noted that basing the focus of attention mechanism on the expectations generation allows the creation of “attentional contexts” within which an object is analyzed. In fact, during the analysis of the first sequence, the telephone is ignored, because the object does not belong to the current attentional context. The same thing occurs during the second sequence: the block, the hammer and the mouse are ignored because they do not belong to the same attentional context as the telephone. This avoids the “cognitive overload” problem. The architecture is able to discover the relevant paths, and to aggregate the information in order to generate only those linguistic descriptions that are “useful” and “interesting” in the current attentional context.

The scene represented in Fig. 23 demonstrates the same process: the screw and the cylinder make up the first context and the two blocks constitute the second. Fig. 24 shows the superquadric reconstruction of the scene along with the focus of attention movements, and Fig. 25 shows the assertions generated at the linguistic level.

Knoxel(#k1)
Knoxel(#k2)
Knoxel(#k3)
Knoxel(#k4)
Knoxel(#k5)
Knoxel(#k6)
Cylinder_shaped(#k2)
Box-shaped(#k1)
Hammer(Hammer#1)
has-handle(Hammer#1,#k2)
has-head(Hammer#1,#k1)
Box-shaped(#k3)
Block(Block#1)
has-part(Block#1,#k3)
Next-to(Next-to#1)
participant(Next-to#1,Hammer#1)
participant(Next-to#1,Block#1)
Ellipsoid-shaped(#k4)
Mouse(Mouse#1)
has_part(Mouse#1,#k4)
Above(Above#1)
is-above(Above#1,Mouse#1)
is-below(Above#1,Block#1)
Parallelepiped_shaped(#k5)
Thin-cylinder-shaped(#k6)
Telephone(Telephone#1)
has-body(Telephone#1,#k5)
has_antenna(Telephone#1,#k6)

Fig. 22. The assertions generated at the linguistic level related to the perception acts represented in Fig. 21.

Fig. 23. A complex scene made up of a screw, a cylinder, a square block and a rectangular block.
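The second step of the superquadric approximation described earlier in this section, the search for the two form parameters once center, orientation and axis lengths are known, can be sketched in Python as follows. This is a minimal sketch and not the authors' procedure. It assumes the points are already expressed in the superquadric's own reference frame, uses the standard superquadric inside-outside function with the error term commonly attributed to Solina and Bajcsy [57], and replaces the optimization procedure with an exhaustive search over a coarse grid. All names and numerical values are hypothetical.

```python
import math


def surface_point(ax, ay, az, e1, e2, eta, omega):
    """Point on a superquadric surface in its own frame (parametric form)."""
    def spow(v, p):                       # signed power: sign(v) * |v|**p
        return math.copysign(abs(v) ** p, v)
    x = ax * spow(math.cos(eta), e1) * spow(math.cos(omega), e2)
    y = ay * spow(math.cos(eta), e1) * spow(math.sin(omega), e2)
    z = az * spow(math.sin(eta), e1)
    return x, y, z


def inside_outside(p, ax, ay, az, e1, e2):
    """Inside-outside function: equal to 1 on the surface, below 1 inside, above 1 outside."""
    x, y, z = p
    xy = abs(x / ax) ** (2 / e2) + abs(y / ay) ** (2 / e2)
    return xy ** (e2 / e1) + abs(z / az) ** (2 / e1)


def fit_error(points, ax, ay, az, e1, e2):
    """Sum of squared residuals, weighted by the square root of the axis product."""
    scale = math.sqrt(ax * ay * az)
    return sum((scale * (inside_outside(p, ax, ay, az, e1, e2) ** e1 - 1)) ** 2
               for p in points)


def fit_form_parameters(points, ax, ay, az, grid):
    """Grid search over (e1, e2); a stand-in for a proper two-dimensional minimization."""
    return min(((e1, e2) for e1 in grid for e2 in grid),
               key=lambda e: fit_error(points, ax, ay, az, e[0], e[1]))


# Synthetic part: points sampled from a superquadric with known form parameters.
AX, AY, AZ = 3.0, 1.0, 1.0
TRUE_E1, TRUE_E2 = 0.5, 1.0
points = [surface_point(AX, AY, AZ, TRUE_E1, TRUE_E2, eta, omega)
          for eta in (-1.2, -0.6, 0.0, 0.6, 1.2)
          for omega in (-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)]

grid = [i / 10 for i in range(1, 21)]     # candidate values 0.1, 0.2, ..., 2.0
print(fit_form_parameters(points, AX, AY, AZ, grid))   # (0.5, 1.0)
```

On noise-free synthetic points the grid search recovers the generating parameters exactly because they lie on the grid; with real voxel data the minimum would only be approximate, and a continuous optimizer would be the natural replacement.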

Fig. 24. The superquadric reconstruction of the scene in Fig. 23 along with the focus of attention movements during the exploration of the scene.

Knoxel(#k1)
Knoxel(#k2)
Knoxel(#k3)
Knoxel(#k4)
Knoxel(#k5)
Thin-cylinder_shaped(#k3)
Flat-cylinder_shaped(#k4)
Screw(Screw#1)
has-peg(Screw#1,#k3)
has-head(Screw#1,#k4)
Cylinder-shaped(#k2)
Cylinder(Cylinder#1)
has_part(Cylinder#1,#k2)
Next-to(Next-to#1)
participant(Next-to#1,Screw#1)
participant(Next-to#1,Cylinder#1)
Parallelepiped-shaped(#k1)
Block(Block#1)
has-part(Block#1,#k1)
Box-shaped(#k5)
Block(Block#2)
has_part(Block#2,#k5)
Next-to(Next-to#2)
participant(Next-to#2,Block#1)
participant(Next-to#2,Block#2)

Fig. 25. The assertions generated at the linguistic level related to the perception acts represented in Fig. 24.

9. Discussion and conclusions

The main goal of this work is to link together in a principled way two different research traditions: that of computer vision on the one hand, and that of symbolic models of knowledge representation and reasoning on the other. We maintain that this goal can be achieved by taking into account the results obtained in different subfields of cognitive science. The architecture we have described is a first step in this direction. In particular, two main assumptions are critical for our proposal:

(i) The existence of a conceptual level, intermediate between the lower vision level and the high level, symbolic representation. The conceptual level has a nonlinguistic nature (it is independent of any linguistic formulation), and it is modeled in terms of a conceptual space. It is generated starting from the outputs of the vision module, and has the role of providing an interpretation for the symbols of the linguistic level.

(ii) The link between the conceptual level and the linguistic representation is achieved through a focus of attention mechanism, which has the effect of scanning in a sequential way the information processed at the lower levels. This hypothesis stands on the widely shared psychological assumption according to which lower level vision is based on massively parallel information processing, while high level attentive phenomena are of a sequential nature.

Given these basic assumptions, the specific choices that have been made in working out the architecture make it very general. Obviously it can be adjusted easily enough to accommodate more specific choices.

The architecture extends previous work on scene understanding [19,20,63] by providing a cognitive framework in which to embed the 3D reconstruction performed by current artificial vision architectures [11,24]. It provides a well-founded interpretation mechanism that builds a rich linguistic description of the perceived scene. This linguistic description may be considered as the ground level for complex symbolic spatial reasoning activities, which up to now have been modeled without any reference to actual interaction with the external environment [44]. The proposed focus of attention mechanism complements at the cognitive level the current work on active vision, which is mainly modeled in reactive terms [5,6].

It is clear that, at the present stage of development, our architecture does not directly address many of the presently unresolved problems of computer vision, although it may well provide a contribution in some of these areas. Typical problems encountered in real vision systems are: non-optimal image acquisition conditions, poor contrast, shadows, occlusions between objects, segmentation criteria. Nevertheless, our framework offers interesting hints to face them. For example, the hypothesis generation process at the basis of the focus of attention mechanism can be usefully employed to solve the occlusion problem: the linguistic information and the associative mechanism can provide interpretation contexts and high level hypotheses that help in interpreting incomplete structures. As a matter of fact, occlusions, non-optimal image acquisition and segmentation problems can be addressed in a framework in which active vision processes are coupled with our focus of attention mechanism. Symbolic reasoning and attentive processes driven by high level expectations can be essential in orienting low level active processes, in order to acquire new information from the sensors. The fusion of our model in an active vision framework is one of the topics of our future research.

We are presently extending the architecture to the analysis of dynamic scenes. In this case, the subsymbolic level must be able to estimate the motion parameters of the objects in the scene (velocity, acceleration, and so on); the mapping between the conceptual level and the linguistic level must take the dynamic evolution into account; non-rigid objects must be recognized in spite of the modifications to their shape. We maintain that the assumptions at the basis of our model can be easily extended to dynamic scenes. The focus of attention mechanism and the concept of perception act are by nature dynamic: they introduce a dynamic aspect even into the perception of static scenes. Even more so, they are expected to work in dynamic contexts.

Acknowledgements

We would like to thank Luigia Carlucci Aiello and Peter Gärdenfors for the interesting discussions about the topics of the paper. Marco Gori, Pino Spinelli and Carmen Usai carefully read and commented on previous versions of this paper. We would also like to thank the anonymous referees for their suggestions, which helped improve both the presentation and the contents of the paper.

References

[1] P.E. Agre, Computational research on interaction and agency, Artif. Intell. 72 (1995) 1-52.
[2] J. Aloimonos, Visual shape computation, Proc. IEEE 76 (1988) 899-916.
[3] D. Amit, Modeling Brain Function. The World of Attractor Neural Networks (Cambridge University Press, Cambridge, 1988).
[4] E. Ardizzone, S. Gaglio and F. Sorbello, Geometric and conceptual knowledge representation within a generative model of visual perception, J. Intell. Robotic Syst. 2 (1989) 381-409.
[5] R. Bajcsy, Active perception, Proc. IEEE 76 (1988) 996-1005.
[6] R. Bajcsy and M. Campos, Active and exploratory perception, Comput. Vision Graph. Image Process. 56 (1992) 31-40.
[7] D.H. Ballard, Animate vision, Artif. Intell. 48 (1991) 57-86.
[8] A.H. Barr, Superquadrics and angle-preserving transformations, IEEE Comput. Graph. Appl. 1 (1981) 11-23.
[9] H.G. Barrow and J.M. Tenenbaum, Computational vision, Proc. IEEE 69 (1981) 572-595.
[10] M. Bertero, T. Poggio and V. Torre, Ill-posed problems in early vision, Proc. IEEE 76 (1988) 869-889.
[11] P. Besl and R. Jain, Three-dimensional object recognition, ACM Comput. Surv. 17 (1985) 75-145.
[12] I. Biederman, Human image understanding: recent research and a theory, Comput. Vision Graph. Image Process. 32 (1985) 29-73.
[13] I. Biederman, Recognition-by-components: a theory of human image understanding, Psych. Rev. 94 (1987) 115-147.
[14] T. Binford, Survey of model-based image analysis systems, Int. J. Rob. Res. 1 (1982) 18-64.
[15] L. Birnbaum, M. Brand and P. Cooper, Looking for trouble: using causal semantics to direct focus of attention, in: Proceedings ICCV-93, Berlin (1993) 49-56.
[16] N. Block, Imagery (MIT Press, Cambridge, MA, 1981).
[17] R.M. Bolle and B.C. Vemuri, On three-dimensional surface reconstruction methods, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 1-13.
[18] R.J. Brachman and J.C. Schmoltze, An overview of the KL-ONE knowledge representation system, Cognit. Sci. 9 (1985) 171-216.
[19] R. Brooks, Symbolic reasoning among 3D models and 2D images, Artif. Intell. 17 (1981) 285-348.
[20] R. Brooks, Model-based 3-D interpretation of 2-D images, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 140-150.
[21] P.J. Burt, Smart sensing within a pyramid vision machine, Proc. IEEE 76 (1988) 1006-1015.
[22] C. Cherniak, Minimal Rationality (MIT Press, Cambridge, MA, 1986).
[23] C.H. Chien and J.K. Aggarwal, Identification of 3D objects from multiple silhouettes using quadtrees/octrees, Comput. Vision Graph. Image Process. 36 (1986) 256-273.
[24] R. Chin and C. Dyer, Model-based recognition in robot vision, ACM Comput. Surv. 18 (1986) 67-108.
[25] D. Davidson, The logical form of action sentences, in: N. Rescher, ed., The Logic of Decision and Action (University of Pittsburgh Press, Pittsburgh, PA, 1967) 81-95.
[26] S.J. Dickinson, A.P. Pentland and A. Rosenfeld, From volumes to views: an approach to 3-D object recognition, Comput. Vision Graph. Image Process. 55 (1992) 130-154.
[27] J. Doyle, Rationality and its roles in reasoning, in: Proceedings AAAI-90, Boston, MA (1990) 1093-1100.
[28] J. Duncan and G. Humphreys, Visual search and stimulus similarity, Psych. Rev. 96 (1989) 433-458.
[29] M.J.K. Farah, D. Hammond, R. Levine and R. Calvanio, Visual and spatial mental imagery: dissociable systems of representation, Cognit. Psych. 20 (1988) 439-462.
[30] S. Gaglio, P.P. Puliafito, M. Paolucci and P.P. Perotto, Some problems on uncertain knowledge acquisition for rule based systems, Decis. Support Syst. 4 (1988) 307-312.
[31] S. Gaglio, G. Spinelli and V. Tagliasco, Visual perception: an outline of a generative theory of information flow organization, Theoret. Linguist. 11 (1984) 21-43.
[32] P. Gärdenfors, A geometric model of concept formation, in: S. Ohsuga et al., eds., Information Modelling and Knowledge Bases III (IOS Press, Amsterdam, 1992).
[33] P. Gärdenfors, Three levels of inductive inference, in: D. Prawitz, B. Skyrms and D. Westerståhl, eds., Logic, Methodology, and Philosophy of Science IX (Elsevier Science, Amsterdam, 1994).
[34] P. Gärdenfors, Meaning as conceptual structures, Tech. Rept. 40, Lund University Cognitive Studies, Lund (1995).
[35] A. Gupta and R. Bajcsy, Volumetric segmentation of range images of 3D objects using superquadric models, Comput. Vision Graph. Image Process. 58 (1993) 302-326.
[36] S. Harnad, The symbol grounding problem, Physica D 42 (1990) 335-346.
[37] G.E. Hinton, J.L. McClelland and D.E. Rumelhart, Distributed representations, in: D.E. Rumelhart, J.L. McClelland and the PDP Research Group, eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1 (MIT Press, Cambridge, MA, 1986) 282-317.
[38] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. USA 79 (1982) 2554-2558.
[39] P.N. Johnson-Laird, Mental Models (Harvard University Press, Cambridge, MA, 1983).
[40] J. Jonides, Voluntary versus automatic control over the mind’s eye’s movement, in: J.B. Long and A.D. Baddeley, eds., Attention and Performance IX (Erlbaum, Hillsdale, NJ, 1981) 187-203.
[41] D. Kleinfeld, Sequential state generation by model neural networks, Proc. Nat. Acad. Sci. USA 83 (1986) 9469-9473.
[42] D. Kleinfeld and H. Sompolinsky, Associative network models for central pattern generators, in: C. Koch and I. Segev, eds., Methods in Neuronal Modeling (MIT Press/Bradford Books, Cambridge, MA, 1989) 195-246.
[43] S.M. Kosslyn, Image and Mind (Harvard University Press, Cambridge, MA, 1980).
[44] E. Lang, K.U. Carstensen and G. Simmons, Modelling Spatial Knowledge on a Linguistic Basis, Lecture Notes in Artificial Intelligence 481 (Springer, Berlin, 1991).
[45] D. Lee, Some computational aspects of low-level computer vision, Proc. IEEE 76 (1988) 890-898.
[46] A. Leonardis, F. Solina and A. Macerl, A direct recovery of superquadric models in range images using Recover-and-Select paradigm, in: J.O. Eklundh, ed., Proceedings ECCV-94, Lecture Notes in Computer Science 800 (Springer, Berlin, 1994).
[47] D. Marr, Vision (Freeman, New York, 1982).
[48] D. Marr and H.K. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes, Proc. Roy. Soc. London Ser. B 200 (1978) 269-294.
[49] J. Maver and R. Bajcsy, Occlusions as a guide for planning the next view, IEEE Trans. Pattern Anal. Mach. Intell. 15 (1993) 417-433.
[50] B. Nebel, Reasoning and Revision in Hybrid Representation Systems, Lecture Notes in Artificial Intelligence 422 (Springer, Berlin, 1990).
[51] R. Nevatia and T.O. Binford, Description and recognition of complex-curved objects, Artif. Intell. 8 (1977) 77-98.
[52] A.P. Pentland, Perceptual organization and the representation of natural form, Artif. Intell. 28 (1986) 293-331.
[53] A.P. Pentland and S. Sclaroff, Closed-form solutions for physically-based modeling and reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 715-729.
[54] M.I. Posner, Orienting of attention, Quart. J. Exper. Psych. 32 (1980) 2-25.
[55] A.A. Requicha and H.B. Voelcker, Solid modeling: a historical summary and contemporary assessment, IEEE Comput. Graph. Appl. 2 (2) (1982) 9-24.
[56] R.D. Rimey and C.M. Brown, Control of selective perception using Bayes nets and decision theory, Int. J. Comput. Vision 12 (1994) 173-207.
[57] F. Solina and R. Bajcsy, Recovery of parametric models from range images: the case for superquadrics with global deformations, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 131-146.
[58] M.J. Tarr and S. Pinker, Mental rotation and orientation-dependence in shape recognition, Cognit. Psych. 21 (1989) 233-282.
[59] J.M. Tenenbaum, M.A. Fischler and H.G. Barrow, Scene modeling: a structural basis for image description, Comput. Graph. Image Process. 12 (1980) 407-425.
[60] D. Terzopoulos and D. Metaxas, Dynamic 3D models with local and global deformations: deformable superquadrics, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 703-714.
[61] D. Terzopoulos, A. Witkin and M. Kass, Constraints on deformable models: recovering 3D shape and nonrigid motion, Artif. Intell. 36 (1988) 91-123.
[62] P.-S. Tsai and M. Shah, Shape from shading using linear approximation, Tech. Rept. CS-TR-92.24, Department of Computer Science, University of Central Florida, Orlando, FL (1992).
[63] J.K. Tsotsos, Knowledge organisation and its role in representation and interpretation for time-varying data: the ALVEN system, Comput. Intell. 1 (1985) 498-514.
[64] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y. Lai, N. Davis and F. Nuflo, Modeling visual attention via selective tuning, Artif. Intell. 78 (1995) 507-545.
[65] B. Tversky and K. Hemenway, Objects, parts, and categories, J. Exper. Psych. 113 (1984) 169-191.
[66] P. Whaite and F. Ferrie, From uncertainty to visual exploration, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 1038-1049.
[67] D.L. Yarbus, Eye Motion and Vision (Plenum Press, New York, 1967).
[68] S.W. Zucker, Region growing: childhood and adolescence, Comput. Graph. Image Process. 5 (1976) 382-399.