DIGITAL SIGNAL PROCESSING 7, 13–27 (1997) — ARTICLE NO. SP970274
On the Evolution of Parallel Computers Dedicated to Image Processing through Examples of Some French Computers

Edwige E. Pissaloux1 and Patrick Bonnin2

Université de Rouen, PSI/La3I, Faculté des Sciences et des Techniques, 76 821 Mont Saint Aignan, France

1 E-mail: [email protected].
2 E-mail: [email protected].

Pissaloux, E. E., and Bonnin, P., On the Evolution of Parallel Computers Dedicated to Image Processing through Examples of Some French Computers. Digital Signal Processing 7 (1997), 13–27. This paper proposes an overview of the evolution of the architecture of some parallel computers dedicated to image processing. A formal definition of a dedicated computer is proposed, and some requirements of the different levels of image processing/vision, in terms of their algorithmic structures, are pointed out. Examples of a few French computers show hardware structures useful for efficient (real-time and on-board) implementations of image processing/vision tasks. © 1997 Academic Press

1. INTRODUCTION

Image processing is well known to be very time consuming. Therefore, since the advent of parallel computers, several architectures have been proposed [5,9,28]. However, the universal computer for image processing has not been (yet?) designed, despite many interesting existing architectures. This is probably a consequence of the very great complexity of image processing tasks and of the wide variety of image types (grey-level, color, infrared, radar, aerial, satellite). Moreover, an image encompasses information at different syntactic (abstract) and semantic (interpretation) levels. Therefore, the search for data pertinent to the final purpose is a complex task. However, the quality and speed of the extracted information depend upon the image representation in the computer and the operations performed. In other words, the quality of the obtained results is determined by the adequacy between the computer's abstract object and the real object represented, i.e., by the resources (hardware and software) provided by the computer. This paper proposes an analysis of the computer concept in terms of image processing/vision requirements. It is organized as follows: Section 2 addresses the concept of a computer and of a dedicated computer, and some problems inherent to its physical implementation. Section 3 briefly analyzes the requirements of computer vision tasks at different theoretical levels. Section 4 presents examples of some French dedicated computers for low and intermediate levels of image processing through the new concepts integrated in their architecture. Some concluding remarks are given in Section 5.
2. CONCEPT OF A COMPUTER AND ITS IMPLEMENTATION

2.1. Formal Definition

A simple observation of the computer's external behavior justifies the following definition of a computer.

DEFINITION. A computer is a set of different abstract algebras. An abstract algebra (a structure) is an ordered pair (E, F), where E is a nonempty set, and F is a set of composition rules (internal and external laws) defined over E [2]. Some of the composition rules from F are inherent to the algebra; we say that they confer
the given algebraic structure to E (thus E becomes effectively an algebra); they are named structure operations. Other composition rules from F are added because they are frequently used.

EXAMPLE. Among the three ring structures (algebras) fundamental in computer science (the Boolean ring, the ring of the residue classes modulo N, and its image in Z), only the Boolean ring is "naturally" represented in the hardware. It is defined as a set B = {0, 1} with two binary operations:

— addition modulo 2, with 0 as its neutral element;
— multiplication, with 1 as its neutral element.

Usually we define two more operations on B: Boolean addition (with 0 as its neutral element) and negation (also called 1's complement). The set B with these four operations is called a Boolean algebra B. Other operations, such as NAND, NOR, . . . , are added to the Boolean algebra.

Only well-identified operations of F will be performed on elements of E, and all structure properties can be used when designing computers/programs. The most complex (compound) operations (programs) allowed to be performed should correspond to operations involving similar algebras. Two algebras (E, F) and (E′, F′) are similar (or of the same type) if F and F′ have the same number of operations and if, for all k ∈ {1, . . . , n}, fk ∈ F and f′k ∈ F′ have the same arity. The above definition guarantees that there will be no side effects when performing computer operations; however, the reality is different, and the carry flag set to 1 after «5 − 5», for example, is the most classic example of such (computer science) incoherencies. If the set F encompasses operations which are useful for the implementation of a given class of problems, such as those related to signal processing, image processing, matrix calculations, . . . , the corresponding computer is named a dedicated computer.

2.2. Practical Definition

Technological constraints bring physical limits to the abstract definition of a computer; therefore, a practical one could be [22]:

COMPUTER = data structures + operations.

Indeed, elements of the set E and operations of F have to be represented in the hardware (by memorizing elements, such as memories or registers of N bits, for the first, and by calculation elements, ALUs for example, for the second). Sequential computers usually work on elementary data structures, i.e., on subalgebras whose elements are (or can be, without important additional cost) directly (but in finite precision) implemented in the hardware. This is the case for rings, fields, and free finitely generated monoids [19,20]. However, additional information about the validity of the obtained result has to be provided through different flags (carry, overflow). Parallel computers implement more complex data structures, mainly sequences (or series) and graphs. The partial or global orders inherent to these data structures lead to their spatial organization on a support (a computer memory, in particular), so that a new practical definition of a parallel computer could be:

COMPUTER = data structure + communications + operations.

Data structures are represented in finite precision as well; they introduce the same problems as elementary data and add new problems related to the management of the global and partial order concepts at the hardware level. Communications are a generalization of the memory accesses of sequential computers. In parallel computers, their importance and volume have gained independence and significance. In the case of computers dedicated to image processing/vision, the basic data structures are 2D arrays, named images (formally supported by series), linked lists and graphs (both formally supported by graphs), and sets. Spatial communications correspond to the abstract operation of a structure traversal (global or partial). They are: local (involving a short sequence of consecutive2 elements—a substructure), semiglobal (involving a significant number, but not all, elements), and global (involving all structure elements). Section 4 discusses different implementations of the above abstract concepts in a few examples of parallel computers dedicated to image processing.

2 In the sense of some order.

2.3. Some Problems Inherent to the Physical Implementation of the Concept of a Parallel Computer

In this section we give some of the most important parameters which directly influence architectural trade-offs when designing a parallel machine. They will be useful for the machine presentations in Section 4.
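The algebraic view of Section 2.1 — a computer as a set of algebras (E, F), with similarity between algebras defined by matching arities — can be made concrete. The sketch below is only illustrative: the function names and the choice of modulus are our assumptions, not taken from the paper.

```python
# Sketch of the abstract definition of Section 2.1: an algebra is a
# pair (E, F) of a carrier set and operations; two algebras are
# "similar" when their operation lists have pairwise equal arities.

def arity(f):
    # number of arguments an operation takes
    return f.__code__.co_argcount

def similar(F1, F2):
    """Similarity test: same number of operations, and the k-th
    operations of both algebras have the same arity."""
    return (len(F1) == len(F2) and
            all(arity(f) == arity(g) for f, g in zip(F1, F2)))

# the Boolean ring B = ({0, 1}, {+, x}) from the example above
B_ops = [lambda a, b: a ^ b,   # addition modulo 2, neutral element 0
         lambda a, b: a & b]   # multiplication, neutral element 1

# the ring of residue classes modulo N, with the same operation arities
N = 4
ZN_ops = [lambda a, b: (a + b) % N,
          lambda a, b: (a * b) % N]

assert similar(B_ops, ZN_ops)                       # two binary ops each
assert not similar(B_ops, ZN_ops + [lambda a: -a % N])  # extra unary op
```

Only operations listed in F are applied to elements of E, which is exactly the guarantee the formal definition asks for.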
A realization of a real parallel computer should find the best compromise between several, sometimes antagonistic, criteria. They can be split into two classes: structural and performance.

2.3.1. Structural Criteria

According to the abstract definition of a computer, three elements should be investigated when designing and realizing a computer; they are:

— spatial representation of complex data structures;
— topology of the communication (interconnection) network;
— processing capabilities.

2.3.1.1. Spatial representation of the complex data structure. Beside the problem of complex data structure representation mentioned above, two others, namely,

— their spatial representation,
— the granularity of their elementary data,

have to be considered.

(a) Spatial representation of the complex data structure in parallel computers. There are two possible spatial representations of complex data structures:

— one, where the order of elementary elements is implicit, and corresponds to the order defined on memory cells;
— a second, where the order of elementary elements is explicit.

The first representation leads to the shared memory parallel computer model, while the second leads to distributed memory computers; in the latter, the order on elementary elements is that of the processing elements whose memory encompasses the elementary data. These two spatial organizations have led to different parallel machine organizations according to the place of the interconnection network [12]. Shared memory (or PE-to-memory configured3) parallel computers are characterized by an interconnection network which binds the PEs of a parallel computer to the memory (usually physically realized as several banks, one per PE). In distributed memory (or PE-to-PE configured) parallel computers, each processing element has its own (local) memory, and all PEs are interconnected through a network. There exist also hybrid parallel computers with locally distributed and globally shared memory (the Cray T3D, for example). Many of the image processing/vision computers have a distributed memory (DAP/ICL, MPP/NASA Goddard Space Flight Center, CLIP/University College London, NCR's GAPP, MP-1 of MasPar, Sympati/CEA–IRIT).

3 PE (processing element) designates the CPU of a parallel computer; the term PE is used in order to underline its usually small processing power compared to that of a sequential CPU.

(b) Granularity of the elementary data structure. The granularity signifies the size of the operands involved in an operation, especially in the communication process. It can be practically quantified by the relative cost (in number of machine cycles) of a local operation (usually binary addition) compared to a communication operation for data of a fixed size (frequently 32 bits). Three classes can be distinguished:

— fine-grained computers, for which this cost is about a few units (for the CM-200 it is between 2 and 10, depending on the communication mode);
— medium-grained computers, for which this cost is about some tens of units (for the Inmos Transputer, for example, it is between 10 and 100);
— coarse-grained computers, where this cost is of several hundreds (iPSC, PSI).

It is worthwhile to stress that the granularity has a direct impact on the PE internal complexity and on machine instruction usage. The PE architecture complexity is almost the same problem as that for a sequential computer. Regarding communications, in fine-grained computers the use of a local (PE-executed) operation or of a local communication is quite equivalent; therefore, a programmer will use them without restriction, contrary to their use in a coarse-grained computer. The granularity grows with the processing power of the PE.

2.3.1.2. Topologies of the communication (interconnection) networks. The communication network should provide efficient means for fast and reliable execution of the different complex data structure traversal operations. The topology of a communication network is the graph associated with it. The physical implementation of an interconnection network can be:

— static (or fixed);
— dynamic (or switched).

Static networks, which usually have better temporal performance than dynamic ones but very limited reconfiguration flexibility, are most frequently used in image processing/vision computers. Topological characteristics of the interconnection network directly influence the temporal performance of a parallel computer; some of them are:

— the graph degree, the maximal number of links of a PE (for the CM-200, this degree is 12);
— the graph diameter, the maximal distance between any couple of its vertices (for the CM-200, the diameter is 12).

The graph degree determines the number of potential direct communication links between two PEs of a parallel computer, while the graph diameter directly influences the algorithmic complexity of a routing algorithm (for global communications).

2.3.1.3. Processing Capabilities. By processing power, in this context, we mean the type of processing a PE (computer) is able to perform in a unit of time4 (which kind of operation). This is different from the processing speed of a computer, usually expressed in MIPS, MFLOPS, or MLIPS. Consequently, the processing power of a parallel machine is influenced by:

— PE autonomies,
— PE operations.

4 Independently of the physical duration of the unit.

The concept of PE autonomy is inherent to SIMD computers, because MIMD computers, sometimes considered as a set of independent computers, are largely autonomous. The autonomy of a PE is its capability to execute an instruction different from that performed by the other PEs. The four types of autonomy are:

— autonomy of activity, i.e., a PE can decide whether it performs or ignores an instruction; this facilitates the implementation of the conditional statement (IF–THEN–ELSE) (CM-1, MP-1, Sympati);
— autonomy of operation, i.e., a PE can choose the next instruction to be executed (CLIP7 or NAP computers);
— autonomy of addressing, i.e., a local data pointer allows accesses to different memory cells in different PEs; a very convenient facility for supporting linked lists, trees, and look-up tables (Blitzen, Sphinx);
— autonomy of connection, i.e., a PE can locally modify its temporal physical connection with its neighbors; this property is frequently named reconfigurability (Polymorphic Torus, SILT, Sphinx).

2.3.2. Performance Criteria

Performance criteria of a parallel computer depend not only on the technology used for the physical machine realization, but are also very strongly influenced by the interconnection network. The interconnection graph degree and graph diameter influence the complexity of the routing algorithm; hence the execution time of the whole data exchange is also influenced. Fault tolerance is another important parameter—a parallel fault-tolerant computer will work (with decreased temporal performance) despite some faulty PEs. This implies the absence of a master processor in the architecture and the capability of the interconnection network to locally and temporally modify its physical links. Other criteria, such as scalability, PE modularity, physical feasibility, etc., directly influence the potential physical adaptation of a parallel computer to application needs. Each PE performs a predetermined set of operations which defines its processing capabilities. In the case of image processing/vision parallel computers, these should include operations useful for low, intermediate, and high level processings; some of them are pointed out in Section 3.

3. ANALYSIS OF REQUIREMENTS FOR IMAGE PROCESSING OF LOW AND INTERMEDIATE LEVELS

This section stresses some important points for parallel image processing by which its performance directly depends on computer architecture; complete surveys of the topic can be found in [1,29,30]. A grey-level image is usually represented by a two-dimensional array of pixels. The numerical value of the (i, j)th pixel codes (by a natural number) its luminosity, derived from a raster device. Different processings are applied in order to extract the information encompassed in the image. They are frequently subdivided into three levels, determined by the complexity of processing: low, intermediate, and high. However, the boundaries between these classes are not very sharp, and a number of applications do not involve all three levels.

3.1. Low Level Processings

Low level processings (or preprocessings) try to compensate, remove, or minimize the image registration errors and to enhance the visual perception or the behavior of image data for subsequent analysis. In the case of a known image degradation, it is necessary to apply the inverse degradation, frequently named image restoration. It usually corresponds to geometric enhancements which try to compensate for spatial distortion problems, incorrect sensor alignment
(parallax effect), perspective effects, and photometric distortions (optical defocusing and diffraction, sensor nonlinearities, etc.). In the case of an estimated degradation, it has to be compensated for by subjective or objective image quality enhancements such as noise removal. Local functions are mainly linear filterings (average, average with threshold, gradient, Laplacian), nonlinear filterings (median, min/max, Nagao), and morphological operations (erosion, dilation, opening, closing, skeletonizing, thinning). Global processings are performed on the whole image. They can be subdivided into the following three classes: orthogonal transforms, which compute the distribution of signal frequencies with a transfer function (such as Fourier or Haar); spatial filterings (such as noise rejectors (threshold)); and statistical analysis functions (a histogram, for example). Most of these processings use a pixel image (a two-dimensional array) as their input and output data structure; sometimes, they construct the image representation of basic image characteristics, such as interest or edge points, for example. Local filtering can be reduced to the convolution of the initial image with a function extracting the considered syntactic properties of the information represented by the image content (with or without thresholding). The image-result has to be calculated by the same algorithm at every pixel. The value of a pixel-result can be obtained by an approximation of the exact calculus through an equivalent discrete formula of a continuous function which usually involves neighboring pixels. Figure 1 shows the case where a 3 × 3 centered mask chooses the neighborhood involved in the approximate calculus of a considered function at its central point. The mask can vary in size (from 2 × 2 (for the simplest edge-point detectors) to a whole-image-size mask) and in form; the square mask (used by Sobel's, Kirsch's, and Prewitt's operators) is the most popular one. Consequently, such a class of processings requires the data-parallel programming model, which is efficiently supported by fine-grained SIMD computers with distributed memory. However, the use of a mask for neighbor selection supposes a convenient degree of the PE interconnection network, with or without autonomy of connection. Operations involving some global information (such as thresholding, or an unpredictable convergence test (implemented with the algorithmic DO . . . UNTIL . . . and WHILE . . . control structures)) usually require global communications. Therefore, in distributed memory computers the graph diameter of the interconnection network has to be low; otherwise a globally shared memory should be used for data exchanges. The SYMPATI2/SYMPHONIE parallel computers (cf. Section 4.1) offer very attractive hardware structures for convolution-based low-level real-time processings.
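As a concrete illustration of the mask-based local filtering described above, the sketch below applies a 3 × 3 mask where each output pixel is computed by the same formula from its neighborhood. The averaging mask, the border-replication policy, and the function name are illustrative choices, not taken from the paper.

```python
# Minimal sketch of mask-based local filtering (Section 3.1):
# every output pixel is computed by the same algorithm from its
# 3 x 3 neighbourhood, selected by a centred mask.

def convolve3x3(image, mask):
    """Convolve a 2D list-of-lists image with a 3x3 mask.
    Border pixels are handled by replicating the nearest pixel."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    # clamp indices so every pixel has 9 neighbours
                    ni = min(max(i + di, 0), h - 1)
                    nj = min(max(j + dj, 0), w - 1)
                    acc += image[ni][nj] * mask[di + 1][dj + 1]
            out[i][j] = acc
    return out

# 3x3 averaging mask (the 1/9 factor is deferred to stay in integers)
avg = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
img = [[9, 9, 9], [9, 0, 9], [9, 9, 9]]
smoothed = convolve3x3(img, avg)  # smoothed[1][1] == 72 (sum of the 9 neighbours)
```

On a fine-grained SIMD machine the two outer loops disappear: one PE per pixel executes the inner mask loop, which is why the degree of the interconnection network matters for this class of processings.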
FIG. 1. Image convolution with a 3 × 3 mask.

3.2. Intermediate Level Processings

These processings aim to elaborate new complex data structures that are less cumbersome and useful for image understanding/interpretation functions. They construct geometric primitives from the elementary ones detected by the low-level processings. Usually, they transform an image representation of low-level information into new complex data structures such as (simple or double) linked lists, graphs, sets, and their compositions and multiple embeddings, in order to represent lines, circles, ellipses, cubes, and parallelepipeds. Edge and region constructions and their global characterizations are examples of processings from this class. The implementation of the corresponding algorithms requires global data exchanges between all elements of a complex variable (such as the region label, ordered element insertion/deletion, element comparisons with a global value, etc.). Several complex data structures may represent different information encompassed in the same image. However, it can be suitable to apply different processings to each data structure separately, or even, in the case of data dependencies, to apply the same processing iteratively to each of their connected components. Consequently, the implementation of spatial, partially or globally ordered data requires careful design of the interconnection network and of the communication primitives for data exchanges. Operations on connected components have to be efficiently implemented with semi-global communication operations. The global processings, on sets of complex structures, require efficient global and semi-global communications; thus the graph diameter of the interconnection network has to be low. Processings on data set structures and on connected components of the same structures have to be supported by data-parallel and control-parallel programming models, inherent to multi-SIMD or MIMD controlled fine- and medium-grained computers. The SPHINX pyramidal multi-SIMD computer (cf. Section 4.2) offers a very attractive topology of the PE interconnection network, while the MAO associative network and the SYRAR system offer efficient support for some graph- and tree-based processings (cf. Sections 4.3, 4.4).

3.3. High Level Processings

High level processings vary largely with the final application. In the case of object recognition/scene interpretation, they construct descriptors of forms obtained from geometric primitives and perform a recognition procedure (database model matching, image matching, neural network object sorting). These processings require very complex algorithms which search for the global extremum of a (multivariable) discrete function. The graph—a global image descriptor—is the most frequently used data structure. Graphs are dynamically managed. Many graph operations evaluate a new value of some property (associated with a given graph node) as a function of the values associated with adjacent nodes. The evaluation is performed iteratively until convergence is reached. Sometimes, the evaluation can lead to a new graph. Different recognition procedures based upon graph and/or symbolic reasoning are used. These algorithms, implemented at the hardware level, can satisfy very hard real-time constraints. In Section 4.3 we present an SIMD asynchronous computer implementing the graph data structure at the hardware level, while Section 4.4 gives an overview of the µPD circuit dedicated to object/image matching through a modified dynamic programming algorithm.

4. SOME EXAMPLES OF FRENCH PARALLEL COMPUTERS DEDICATED TO IMAGE/VISION PROCESSING

This section briefly presents some French parallel computers which have been at least realized as prototypes; they implement some of the hardware implications of the image processings analyzed in the previous sections. They address two aspects of a computer representation: complex data structures for image representation, and new operations. The SYMPATI2 and SYMPHONIE computers increase the processing speed of image processing applications through a helicoidal memory data implementation and an enhanced interconnection network. Sphinx and MAO propose hardware support for pyramids and graphs. SYRAR offers new architectural mechanisms for the image matching operation.

4.1. SYMPATI2/SYMPHONIE—Computers Dedicated to Low and Intermediate Level Tasks

SYMPATI2 ([13]; Fig. 2) is a parallel computer realized by the Commissariat à l'Energie Atomique
FIG. 2. SYMPATI2 SIMD computer.
(CEA/DEIN CEN), Saclay, and the CERFIA Laboratory, Université de Toulouse. It is commercialized by the Centralp Co. SYMPATI2 is a 32 to 128 PE SIMD fine-grained (16-bit) computer with a very simple PE internal architecture (Fig. 3). The masking module (Work Flags) gives the PE autonomy of execution, and thus an efficient execution of the conditional IF . . . THEN . . . ELSE . . . statement. Indeed, the result of the evaluated condition partitions all PEs into two subsets: one for which the evaluated condition is false, and the other for which the same condition is true. A conditional statement is executed in parallel on all PEs of the same subset. The Index Register is a generic name for the address unit which calculates the memory addresses of helicoidally implemented data. The helicoidal implementation of data in memory (Fig. 4) allows scanning an image in row or column order without border effects, with conflict-free accesses to the different memory banks in 1 machine cycle (Fig. 5). Such an image implementation allows for larger memory accesses by a PE than in classic computers, and for an efficient implementation of convolution (mask-based) operations. Indeed, the image, subimage, and mask concepts are hardware-supported data structures in Sympati. SYMPATI2's interconnection network is an enhanced 1D ring (fixed topology). Memory accesses use the enhanced interconnection network. Indeed, the interconnection graph degree is not 2 (as in a usual ring) but 4 for memory accesses and 6 for PE internal register accesses. Such a topology has been realized through additional links between PEs (Fig. 6). The SYMPATI2 interconnection graph diameter is N/4, where N is the number of PEs.

SYMPHONIE [8], developed since 1993 for the Rafale fighter aircraft, is an enhanced version of SYMPATI2 (Fig. 7). This distributed memory parallel computer has been designed by the CEA/DEIN, Saclay, and SAT (Société Anonyme de Télécommunication). A SYMPHONIE chip encompasses sixteen 32-bit PEs and their memory banks. The whole computer can be considered as two independent, but synchronous, machines: one performing Sympati's operations (PE cells), and the other dedicated to communications, local and global (Com cells) (Fig. 8). The arithmetic and logic unit allows the parallel execution of some operations, which should be explicitly specified by a programmer (such as addition and multiplication, or the parallel load of several registers). There are also hardware structures for an efficient implementation of floating point operations. A dedicated unit for memory accesses performs memory address calculations in parallel. The topology of the SYMPHONIE interconnection network is a pure 1D ring. It provides numerous data exchange schemes between PEs: regular (i.e., the receiver's relative address is common to all PEs), global broadcast, and irregular (Fig. 9). All communication models use the message passing paradigm. All other SYMPHONIE characteristics are similar to those of SYMPATI2. The SYMPHONIE Programming Language (SPL) is a C-like high-level language; therefore all data structures of C are supported. Consequently, SYMPHONIE is well suited for low and intermediate level vision tasks programmed in a high level language.
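The conflict-free row/column access that helicoidal storage provides can be illustrated with the classic skewed-storage scheme, in which pixel (i, j) is assigned to bank (i + j) mod N. Note that this particular skewing function is our illustrative assumption, not the exact SYMPATI2 mapping:

```python
# Illustrative sketch of skewed ("helicoidal") image storage:
# pixel (i, j) of an N-bank memory is placed in bank (i + j) mod N.
# Any full row and any full column then touch each bank exactly once,
# so N PEs can fetch a whole row or column without bank conflicts.

N = 4  # number of memory banks / PEs (illustrative size)

def bank(i, j):
    # skewing function: which bank holds pixel (i, j)
    return (i + j) % N

# every row maps onto all N distinct banks
for i in range(N):
    assert sorted(bank(i, j) for j in range(N)) == list(range(N))

# every column maps onto all N distinct banks as well
for j in range(N):
    assert sorted(bank(i, j) for i in range(N)) == list(range(N))
```

With a naive row-major layout, a column access would hit the same bank N times in a row; the skew is what turns both scan orders into single-cycle parallel accesses.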
4.2. SPHINX—A Pyramidal Machine for Intermediate Level Image Processing Tasks The binary pyramidal computer SPHINX (1993) has been developed at the Institut d’Electronique Fondamentale, University of Paris XI, in conjunction with the ETCA, a Defence Research Laboratory, and the Sodima Co. [15].
FIG. 3. Internal architecture of SYMPATI2 PE.
FIG. 4. Helicoidal implementation of data in the memory of SYMPATI.
This binary pyramid aims to optimize the most frequent data movements when processing images and is very useful for global processings. Indeed, the diameter of the interconnection graph used when performing such operations is that of a binary tree, i.e., log2 N, where N × N is the number of PEs in the pyramid base (instead of N in a 2D mesh). Moreover, the binary pyramid avoids the data serialization on the different nodes of a tree inherent to the quaternary pyramid, and allows us to pipeline the execution of different instructions [5]. Sphinx being a set of meshes of decreasing size, the distributed multi-SIMD control of the overall structure has been designed to be SIMD within a layer and
MIMD between layers (cf. Fig. 10). This control strategy supports efficiently data-parallel and control-parallel programming models [21]. The divideand-conquer technique is one of the best suited strategies for implementation of image processing algorithms on Sphinx [4]. A data structure—a C-graph (a concentration graph)—is hardware/software implemented in the Sphinx [7]. A C-graph is a subgraph of the graph of physical connections of the pyramidal network associated with a plan graph implemented in a pyramidal layer. Figure 11 gives an example of a plan graph and several C-graphs possible to associate with it. A node p of a given C-graph is said to be a centralizing
FIG. 5. Physical implementation of image lines/columns in helicoidal memory.
20
logic instructions of usual computers. In-layer and inter-layers physical communications, C-graph data exchanges, and global reduction (Global OR Sum) have been added to the Sphinx instruction set. Data exchange NEWS operations are performed in the mesh using a 1-bit reconfigurable register (autonomy of interconnection). The mesh data exchanges are synchronous, while the tree (C-graph) data exchanges are asynchronous in order to get maximal benefit from the parallelism of the architecture [3]. The PE autonomy of addressing is realized through the local PE’s pointer; thus it is possible to indirectly access data implemented in the PE internal memory (highly convenient when implementing linked lists).
FIG. 6. SYMPATI2 PE data communication capability.
4.3. Associative Nets The associative net architecture a parallel computer under development at the University Paris 11, is based upon the graph concept. An implementation of the graph at the hardware level has been first investigated in the Calculateur Fonctionnel Project at ETCA, a Defence Research Laboratory [25]. This project has stressed the duality between graph operations and their expression in a functional programming language. The asynchronous execution model and necessary stability detection mechanism (derived from the self-timed concept [31]) in the 2D mesh have led to the associative net definition. Associative nets [16] try to implement at the hardware level several graph theory elementary concepts which can be directly mapped onto a 2D mesh of P processors (Pi, 0 # i # (P 2 1)). They are graph, connect component, successor, and ancestor functions [32]. An oriented graph G 5 (P, E) of degree D is represented by P processors (vertices of G), and by its interconnection network (a subset of the physical interconnection network), which defines the set of graph’s edges E. The data values of P processors are considered as a parallel variable of P elements. G being oriented, there exist sets of ancestors and successors of Pi in G. A set of processors of G is an equivalence class, if it is possible to define over G an equivalence relation Rg such that Pi and Pj are equivalent if and only if ' k, 0 # k # (P 2 1), such that Pi and Pj belong to the k’s ancestor set. Equivalence classes define connected components of a given graph. As a graph can be considered as a set of elements, it is possible to perform the usual set operations on elements of G (arithmetic and logic element-wise operations). The scan f (prefixed- or association-) operation [33] involves a graph G considered as a parallel variable p of pi elements (0 # i # (P 2 1)), and
node, if there is no node in a C-graph ancestor of p. On Fig. 11, the centralizing nodes are represented by black points. The set of all centralizing nodes pyramidally projected on a plan (a pyramidal layer, for example) define the centralizing graph (Fig. 12). Cgraphs are usually associated with image regions, while centralizing graphs represent the connectivity of graph components. C-graph operations are: c a parallel application of the same operation to all graph nodes, c a parallel data exchange between two C-graphs (embedded in the same plan graph) through frontier pixels, c communication primitives — (mesh) plan communications, — send-down (a message), — send-up (a message), with possible local reduction through an associative and commutative operator. The Sphinx PE (Fig. 13) is fine-grained. Its elementary instructions are similar to the arithmetic and
FIG. 7. SYMPHONIE system.
FIG. 8. SYMPHONIE’s computer global organization.
c, one of its subgraphs (the c-graph, or communication graph), which defines the authorized data movements in G. The scan operation computes the result ai (stored on Pi) from the values pj of all successors of Pi. Figure 14 gives examples of implementations of some image processing objects (namely regions, edges, and oriented trees) on an associative net. The c-graph is an elementary parallel data structure. Its basic operations are the usual graph operations, such as c-graph construction and deletion, c-graph topology redefinition, union/intersection/cartesian product of two graphs, or insertion/deletion of one element. An associative net supports tree operations efficiently; therefore, before performing a general graph operation, it is necessary to first embed the corresponding tree in the mesh (which considerably decreases temporal performance).
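In software terms, the scan (association) operation amounts to folding an associative operator over all values reachable through the c-graph. The following is a minimal, purely illustrative Python sketch; the function names and the graph encoding are our own choices, not the associative net's actual instruction set:

```python
# Toy model of the associative-net scan (association) operation:
# each vertex Pi combines, with an associative and commutative
# operator, its own value with the values of every vertex reachable
# through its successors in the c-graph.
def scan(succ, values, op):
    """succ: dict vertex -> list of successors (the c-graph edges).
    values: dict vertex -> local value (the parallel variable p).
    op: associative, commutative binary operator (min, max, or, ...)."""
    def reach(v, seen):
        # Fold over every vertex reachable from v, v included.
        seen.add(v)
        acc = values[v]
        for s in succ[v]:
            if s not in seen:
                acc = op(acc, reach(s, seen))
        return acc

    return {v: reach(v, set()) for v in succ}

# OR-association on a 4-vertex oriented graph: only vertex 2 holds True.
g = {0: [1], 1: [2], 2: [], 3: []}
vals = {0: False, 1: False, 2: True, 3: False}
labels = scan(g, vals, lambda a, b: a or b)
# labels[0] is True (vertex 0 reaches 2); labels[3] remains False
```

With op = min over initial vertex indices, the same fold yields one common label per connected component, which is how region labeling maps onto the net.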
FIG. 9. SYMPHONIE’s data exchange models.
FIG. 10. Binary pyramid SPHINX.
The associative computation model was designed for a mesh computer; the topology of the interconnection network is a 2D mesh. Local mesh communications can be performed thanks to the autonomy of interconnection, while global communications involve routing through intermediate PEs. The internal architecture of a PE (Fig. 15) encompasses the modules which perform OR-association, 1-association, and tree embedding in the physical 2D mesh. To improve temporal performance, the PE includes arbitration logic which detects the stability of the computed results; once stability is reached, the execution of the next instruction begins.
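The stability-detection idea can be modeled in software as repeating a local mesh update in synchronous waves until no processor changes its value, then proceeding to the next instruction. A toy Python sketch follows, using minimum-label propagation as the local operation (an illustrative choice of ours, not the actual PE logic):

```python
# Toy model of stability detection: apply a local mesh update in
# parallel "waves" until no cell changes, then stop -- a software
# analogue of the arbitration logic described above.
def relax_until_stable(grid, mask):
    """Minimum-label propagation over the 4-connected mesh, restricted
    to cells where mask is True (e.g. a binary image region)."""
    h, w = len(grid), len(grid[0])
    stable = False
    while not stable:
        stable = True
        new = [row[:] for row in grid]
        for i in range(h):
            for j in range(w):
                if not mask[i][j]:
                    continue
                best = grid[i][j]
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w and mask[ni][nj]:
                        best = min(best, grid[ni][nj])
                if best != grid[i][j]:
                    new[i][j] = best
                    stable = False   # at least one cell changed
        grid = new
    return grid

# A 2x3 mask holding two separate regions: each converges to the
# minimum label of its own connected component.
mask = [[True, True, False], [False, False, True]]
labels = [[0, 1, 2], [3, 4, 5]]
out = relax_until_stable(labels, mask)
```

In the hardware, the equivalent of the `stable` flag is computed by the arbitration logic rather than by a global software test.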
4.4. SYRAR Computer – A Parallel Computing Structure for Image Matching

The SYRAR parallel computer is under development at the Université de Rouen, in conjunction with the CEA–DAM (Commissariat à l'Énergie Atomique) and University Paris 11. It is dedicated to the matching of 2D signals (thus, in particular, to image comparison) through the orthogonal dynamic programming algorithm, a 2D extension of the basic dynamic programming algorithm [24]. Its heart is the µPD circuit, a massively parallel circuit implementing a parametrized dynamic programming through the basic graph operation: the search (in a graph) for the path of minimal cost. In the case of image processing, graph vertices are pixels of an image, and graph arcs are virtual links
FIG. 11. Example of a plan graph and some C-graphs associated with it.
FIG. 12. C-graph and centralizing graph associated with it.
frequently named vectors (of N elements). The dynamic programming algorithm compares these vectors by:

• calculation of the distance d[i][j] between any two elements U[i] and V[j];
• search for the path s_min of minimal score:

score = min over all paths s of Σ_(i,j)∈s d[i][j] × C(s)
(the path s = ○(i, j), where ○ denotes the concatenation operation and i, j ∈ {1, . . . , N}). C(s) is the local cost of the construction of path s, i.e., the cost associated with the distance between two consecutive elements of the path. Since two vectors are equal if ∀ i ∈ {1, . . . , N}, U[i] = V[i], the cost of a local diagonal path step is set lower than that of a nondiagonal one; d[i][j] is an elastic distance between the compared elements.

FIG. 13. Sphinx PE internal architecture.

The µPD circuit principle is given in Fig. 16. It is an n-ary hypercube of dimension 2 (i.e., a 2D mesh with n processing elements per dimension). The whole µPD machine, a scalable massively parallel computer, follows a MISD (systolic), data-driven execution model. Each PE is 3-connected to its three neighbors in the grid (east, southeast, south), with autonomy of connection. Each PE has its own local memory. Each PE has autonomy of execution; i.e., it can decide to enable or disable the execution of the received instruction according to its internal status. Each PE locally updates the score formula. A PE's internal architecture is very simple:
• a counter for the evaluation of the cost function, one per each of the three possible development directions of path s;
• two programmable registers: one for the cost function, and the other for the development code of path s (necessary for the global construction of path s).
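The score update performed by each PE follows a classic dynamic-programming recurrence. The following minimal sequential Python sketch assumes illustrative costs that favour the diagonal step; the real cost function C(s) and the elastic distance d[i][j] are programmable parameters of the µPD circuit, not the fixed values used here:

```python
# Sequential sketch of the 1D dynamic-programming matching described
# in the text: align two N-element vectors U and V, with a local cost
# that makes a diagonal step cheaper than a nondiagonal one.
def dp_match(u, v, diag_cost=1, skew_cost=2):
    n, m = len(u), len(v)
    INF = float("inf")
    score = [[INF] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(u[i - 1] - v[j - 1])          # elastic distance d[i][j]
            score[i][j] = min(
                score[i - 1][j - 1] + diag_cost * d,  # diagonal step
                score[i - 1][j]     + skew_cost * d,  # vertical step
                score[i][j - 1]     + skew_cost * d,  # horizontal step
            )
    return score[n][m]

# Two equal vectors match along the pure diagonal with score 0.
assert dp_match([3, 1, 4], [3, 1, 4]) == 0
```

On the µPD mesh, the two nested loops disappear: each PE (i, j) evaluates one cell of the score matrix, receiving the three predecessor scores from its grid neighbors.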
between two candidate pixels. The orthogonal dynamic programming algorithm constructs an oriented 1-graph, a path, which encompasses the ordered list of pairs of corresponding pixels, one pixel per matched image. The classic dynamic programming algorithm is applied to two 1D digital signals (thus two lines or two columns of two grey-level images, one per image). They can be represented by oriented 1-graphs, U and V,
A parallel dynamic programming algorithm searches, in parallel, all possible paths s in l_opt steps, where l_opt is the length of the optimal path. On the kth step, 0 ≤ k ≤ l_opt, all active PEs have a spatial (machine) index of the form (·, k) or (k, ·). Figure 17 gives
FIG. 14. Embeddings of region (a), edge (b), and oriented tree (c) on an associative net.
FIG. 15. Associative net PE internal architecture.
an example of the development of all paths s in parallel; the compared 1D signals are represented on the X and Y axes. The shaded area represents all paths developed in parallel; the bold path is the optimal one. Results of the comparison of real signals are contained in the part inside the square (the others are used for µPD simulator purposes).

FIG. 16. µPD circuit organization.

FIG. 17. Parallel development of paths in the SYRAR computer.

5. CONCLUDING REMARKS

The paper has addressed the concept of a computer and a dedicated computer as a materialization of the concept of a universal and a specific algebra. It has stressed the importance of adequacy between abstract mathematical concepts and their physical representations. These differences should be carefully investigated when designing new computers in order to avoid side-effects, i.e., undesirable effects which can occur during the machine's operation. The paper has pointed out the specificities of image processing and discussed the complex data structures (lists, graphs, sets) it uses and the operations performed on them. Examples of some parallel French computers providing different implementations of graphs and their operations, together with some original memory data implementation techniques, have been shown. The hardware structures designed for the above purposes increase the speed of parallel processing of images. Self-timed logic [31] can bring further improvements in temporal performance as well. Further research should investigate new hardware implementations of the graph concept. Indeed, the only operations which are at present effectively executed at the hardware level are those on trees; therefore, many graph-based image processing algorithms still have to be implemented through complex, and error-prone, software. Moreover, other image matching techniques should be implemented in hardware. It should be stressed that the machine definition elements presented in Section 2 can be used as criteria not only for dedicated parallel computers, but for the comparison of any class of parallel computers and for parallel machine benchmarking.
ACKNOWLEDGMENTS

We thank the different research centers of the CEA (Commissariat à l'Énergie Atomique), LETI–DEIN, DAM–Bruyères-le-Châtel, and Limeil-Valenton, and the Institut d'Électronique Fondamentale, Université Paris 11, for their support and discussions related to this paper.

REFERENCES

1. Arbib, M. A., and Hanson, A. R. (Eds.). Vision, Brain, and Cooperative Computation. MIT Press, Cambridge, MA, 1987.
2. Birkhoff, G., and Bartee, T. C. Modern Applied Algebra. McGraw–Hill, New York, 1970.
3. Bouaziz, S., Pissaloux, E., Mérigot, A., and Devos, F. A communication mechanism and its implementation in the multi-SIMD massively parallel computer SPHINX. Euromicro J. 32, No. 1–5 (1991), 39–46.
4. Cartier, S. Vers une implantation automatique de programmes de traitement d'image sur les machines parallèles hétérogènes. Thèse de Doctorat, Université Paris 11, 1996.
5. Cantoni, V., and Ferretti, M. Pyramidal Architectures for Computer Vision. Plenum, New York, 1993.
6. Cantoni, V., and Levialdi, S. Multiprocessor computing for images. Proc. IEEE 76, No. 8 (1988), 959–969.
7. Clérmont, Ph. Méthodes de programmation de machine cellulaire pyramidale: Applications en segmentation d'images. Thèse de Doctorat, Université Paris VII, 1993.
8. Collette, Th., Gramat, Ch., Juvin, D., Larue, J-F., Letellier, L., Schmit, R., and Viala, M. SYMPHONIE, calculateur massivement parallèle: Modélisation et réalisation. In Actes 3-ième Journée Adéquation Algorithme Architecture en traitement du signal et images, CNES, Toulouse, France, 17–19 janvier 1996, pp. 279–286.
9. Dew, P. M., Earnshaw, R. A., and Heywood, T. R. (Eds.). Parallel Processing for Computer Vision and Display. Addison–Wesley, Reading, MA, 1989.
10. Duff, M. J. B. CLIP4, a large scale integrated circuit array parallel processor. In Proc. 3rd Int. Conf. on Pattern Recognition, 1976, pp. 728–733.
11. Haralick, R. M. Some neighbourhood operators. In Real Time Parallel Computing (M. Onoe, K. Preston, and A. Rosenfeld, Eds.). Plenum, New York, 1978, pp. 12–35.
12. Hwang, K. Advanced Computer Architectures. McGraw–Hill, New York, 1993.
13. Juvin, D., Basille, J-L., Essafi, H., and Latil, J. Y. Sympati2, a 1.5D processor array for image applications. In Proc. of the EURASIP'88, Signal Processing IV: Theories and Applications (J. L. Lacoume, A. Chehikian, N. Martin, and J. Malbos, Eds.). Elsevier Science, Amsterdam, pp. 311–314.
14. Lee, D-L. Design of an array processor for image processing. J. Parallel Distrib. Comput. 11 (1991), 163–169.
15. Mérigot, A., Bouaziz, S., Clérmont, Ph., Devos, F., Eccher, M., Méhat, J., and Ni, Y. Sphinx, un processeur pyramidal massivement parallèle pour la vision artificielle. In Proc. of the AFCET 7th RFIA Conf., Paris, 1989, pp. 185–196.
16. Mérigot, A. Associative Nets: A New Parallel Computing Model. Technical Report, Université Paris XI, IEF, 1992.
17. Mérigot, A., and Zavidovique, B. Image analysis on massively parallel computers: An architectural point of view. Int. J. Pattern Recognit. Artif. Intell. 6, No. 2 & 3 (1992), 387–393.
18. Pissaloux, E. Thèse de Doctorat d'État, Université de Paris 7, France, 1987.
19. Pissaloux, E., and Nolin, L. NL1 machine: A concept of a high performance data type architecture. J. Microprocessing Microprogramming 27 (1989), 307–314.
20. Pissaloux, E. A rational methodology for the design of new computer structures. Euromicro J. 3 (1990), 555–560.
21. Pissaloux, E., Bouaziz, S., Mérigot, A., and Devos, F. Coprogramming: A tool for the development of software for massively parallel computers. Euromicro J. 30, No. 1–5 (1990), 569–676.
22. Pissaloux, E., and Bonnin, P. On the adequacy of image processing algorithms and massively parallel computers. In Proc. of the IEEE/Euromicro MPCS'94, Ischia, Italy, May 2–6, 1994, pp. 487–499.
23. Pissaloux, E. E., Bonnin, P., and You, J. On models of parallel computers and models of parallel computation. In Proc. of the Int. Conf. on Parallel Processing and Applied Mathematics, Czestochowa, Poland, Sept. 9–12, 1994, pp. 1–6.
24. Pissaloux, E., Le Coat, F., Bonnin, P., Bezencenet, G., and Durbin, F. A parallel method for matching of aerial images. In SPIE Int. Symp. on Intelligent Systems & Advanced Manufacturing, Boston, November 18–22, 1996, Vol. 2904, pp. 75–81.
25. Quénot, G. The "Orthogonal Algorithm" for optical flow detection using dynamic programming. In Proc. of IEEE ICASSP'92, San Francisco, CA, March 1992.
26. Rosenfeld, A. Computer vision: Basic principles. Proc. IEEE 76, No. 8 (1988), 863–868.
27. Sorel, Y. Massively parallel computing systems with real-time constraints: The "Algorithm Architecture Adequation" methodology. In Proc. of the IEEE/Euromicro MPCS Conference, Ischia, Italy, May 2–7, 1994.
28. Uhr, L. (Ed.). Parallel Computer Vision. Academic Press, San Diego, 1987.
29. Weems, Ch. C. Architectural requirements of image understanding with respect to parallel processing. Proc. IEEE 79, No. 4 (1991), 537–547.
30. Weems, C. C., Riseman, E. M., Hanson, A. R., and Rosenfeld, A. The DARPA image understanding benchmark for parallel computers. J. Parallel Distrib. Comput. 11 (1991), 1–24.
31. Varshavsky, V. (Ed.). Self-Timed Control of Concurrent Processes. Kluwer Academic, Amsterdam, 1990.
32. Cormen, Th. H., Leiserson, Ch. E., and Rivest, R. L. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
33. Blelloch, G. E. Vector Models for Data-Parallel Computing. MIT Press, Cambridge, MA, 1990.