Real-Time Imaging 9 (2003) 297–313

Real-time image processing with dynamically reconfigurable architecture
L. Kessal, N. Abel, D. Demigny
ETIS, UMR 8051 CNRS/ENSEA, Cergy Pontoise University, France

Abstract
During the last few years, many architectures using processors and/or field programmable gate arrays (FPGAs) have been built to accelerate computationally complex problems. Processors allow a high degree of flexibility, whilst an FPGA implementation can be considerably faster. In spite of the possibility of reconfiguring a conventional FPGA an unlimited number of times, many of these architectures were built to compute a single application. If the FPGA is reconfigured several times to execute various algorithms, the configuration time grows and degrades the overall performance. In this paper, an architecture dedicated to real-time image processing using the AT40K reconfigurable FPGA family is presented (the ARDOISE project¹). We discuss Dynamic Reconfiguration (or Run-Time Reconfiguration), a technique based on the reuse of the same device (an FPGA configured on the fly) by scheduling the execution of the different algorithms that build an application. The techniques and the tools developed to test and use the system are described.
© 2003 Elsevier Ltd. All rights reserved.
¹ This work involves 10 French research labs and is supported by the French agency for education, research and technology.

1. Introduction
The growth of multimedia in every domain requires increasingly complex algorithms for coding/decoding and for processing huge data flows. Nowadays, two technologies are used to face these requirements: parallel processing and dedicated circuits. The first solution gives high flexibility, but increases size and power consumption. The second solution results in very fast specialized systems that cannot be upgraded or adapted to other purposes. For more than 10 years, configurable computing systems have demonstrated their efficiency at executing complex algorithms for several applications: convolution, morphology, image filtering, edge extraction and object recognition. The most common devices used for configurable computing are field programmable gate arrays (FPGAs). Over the last 10 years, FPGAs have demonstrated their potential for configurable computing and prototype systems in a wide range of applications. Most of the conventional FPGA technology has a highly fine-grained architecture. FPGAs are

composed of programmable logic arrays which are used as custom hardware that can be configured to implement the required processing. These hardware resources can be used to design custom computing systems highly adapted to specific applications. This approach offers the possibility of exploiting significant data-level parallelism to increase performance. To execute a new processing step on an SRAM-based FPGA, it is necessary to reconfigure the logic functionality and set the connectivity at the bit level. Because the bitstream files are large and the configuration interface is slow, the configuration time can be a significant overhead in configurable systems, which limits performance when an FPGA needs to be reconfigured "on the fly" (run-time reconfiguration, RTR). If the logic resources are targeted at a single processing task, the chip is configured only at start-up and remains unchanged until the application is finished. For example, multiple-FPGA machines such as Splash-2 [1] (each module contains 16 Xilinx XC4010) or the Programmable Active Memory DEC PeRLe-1 [2] (25 Xilinx XC3020) were used to achieve highly parallel processing rates on multiple data; the entire chips are configured once for the target application. One way to reduce the hardware complexity and increase the functional density of the reconfigurable resources is to share the same physical chip to perform

all of the application's stages. For most embedded systems, the ability to reuse hardware resources for several processing steps by reconfiguring the hardware during run-time execution of the application is a promising alternative. This approach, termed Dynamically Reconfigurable Architecture (DRA), can reduce the system's hardware complexity and increase its flexibility, since the most suitable algorithm can be selected at any time. For several years, DRAs have represented a rich area for industry and research laboratories. For the design of DRAs, two levels of reconfigurable element granularity have been used in order to exploit data parallelism:

* Coarse-grained configurable logic. In this case, the reconfigurable resources are organized in a two-dimensional array allowing significant interconnect flexibility. The hardware structure is based on both linear data and control paths (data-path structure). Coarse-grained architectures include multipliers, adders, DSP operators and pipelined registers. Because of their coarse granularity, hardware systems built according to this model require only a few bits to be configured (like DART [3] and the Systolic Ring [4]).

* Fine-grained configurable logic. Unlike coarse granularity, the fine granularity of logic elements offers better flexibility: the reconfigurable resources can be used to implement bit-level operations, data-paths and the basic arithmetic functions needed to build custom embedded systems highly adapted to the specific data of the target applications. Most conventional SRAM-based FPGAs fall into this category. The configuration bits must tell each gate and interconnection element how to behave; consequently, configuration stream sizes and times increase considerably, and this major drawback reduces their suitability for dynamically reconfigurable embedded systems. Some years ago, FPGA families such as the Atmel AT40K [5] and the Xilinx XC6200 [6] were developed to reduce configuration time and to allow partial reconfiguration. Moreover, their fine-grained reconfigurable logic enables custom computing according to the demands of the applications. By accepting a small decrease in performance, it is possible to design basic elements (data-paths, DSP functions, etc.) that advantageously emulate coarse-grained architectures.

In this study, Atmel AT40K FPGAs have been used to build a platform named ARDOISE, whose main goal is to experiment with and demonstrate the potential and performance of dynamically reconfigurable systems.

In the rest of this paper, we describe the ARDOISE architecture and show how it can be used to provide a real-time design swapping capability (hardware sharing). The organization of the paper is as follows. First, a brief review of the configurable computing technology introduced to improve performance on a wide range of applications is given (Section 2). Then the dynamically reconfigurable system, which can be configured entirely or partially, is presented (Section 3). An evaluation and a comparison of the performance of various architectures are discussed in Section 4. To illustrate the use of ARDOISE for real-time image processing, several examples of algorithm implementations are described (Section 5). Today, the ARDOISE platform is operational and is available in all participating laboratories. Section 6 is dedicated to the tools developed to make better use of the prototype and to help designers develop and debug applications. Finally, we summarize and discuss future research.

2. Dynamically reconfigurable hardware
Reconfigurable embedded systems using FPGAs represent an attractive alternative for applications with strong real-time constraints. The basic steps of typical image segmentation algorithms (see Fig. 1) can be classified as: smoothing filter, edge extractor, region labeling, etc. According to the complexity of the target application, one or several FPGA devices are used. In general, the system is statically configured: the entire chips are configured once for the target application. An important advantage of reconfigurable logic is the ability to reconfigure functionality in response to changing application data sets. However, most algorithms need external resources (such as FIFOs and memory buffers), so it becomes difficult to adapt the hardware so that one filter can be chosen rather than another. FPGAs have limited resources and offer lower performance than Application Specific Integrated Circuit (ASIC) solutions. With configuration clock rates and processing clock rates in permanent progress, FPGAs can supply both advantages with limited inconvenience. Commercial FPGAs such as the AT40K series are particularly suited to RTR because they can be rapidly reconfigured: these devices can be totally or partially configured in less than 1 ms and their processing speed is approximately 1/3 of the ASIC speed. This FPGA family has a bus which allows the device to be easily connected to microprocessors, so it is possible to rapidly configure an arbitrary subset of the device through a fast parallel programming interface. The ability to reconfigure a part of the chip during run-time presents an immediate advantage: the size of the configuration data is reduced. Dynamic reconfiguration (DR) offers the possibility of implementing, on the same FPGA and in a time-multiplexed fashion, a set of algorithmic steps (Fig. 2).

Fig. 1. Basic steps in image segmentation: frame grabber, smoothing filter (FPGA-1), edge extractor (FPGA-2), boundary closing etc. (FPGA-3), with FIFO/LIFO memory resources between the stages, and a frame grabber at the output.

Fig. 2. FPGA structure using dynamic and partial reconfiguration: frame grabber for the input pixels, a dynamically reconfigurable area fed with reconfiguration data (smoothing filter, edge detection, contour extraction, ...), memory resources (LIFO, FIFO, ...) and a frame grabber for the output of results.

3. ARDOISE architecture
As in the virtual memory case, the principle of ARDOISE consists in implementing a virtually large design on a small FPGA, as shown in Fig. 2. The realization of such a system requires commercially available FPGAs with: (1) sufficiently fast configuration times, (2) the possibility of reconfiguring subsections of the chip while it is still running, and (3) an important number of I/O pins to obtain the required data bandwidth. Nowadays, many FPGA devices offer a large number of I/O pins. By using fast reconfiguration of small portions of logic (by columns, as in the Virtex [7] series) and a good partitioning strategy, it is possible to design embedded systems that benefit from the partial reconfiguration mechanism. Because no commercial FPGA yet combines all of these characteristics, prototype development became necessary to investigate the hardware issues of DR. Ten French research teams launched the ARDOISE project to build a

dynamically reconfigurable platform dedicated to real-time image processing; its main goal is to help investigate the DR paradigm and its hardware implementation. The basic idea of ARDOISE is to swap the algorithms used in image segmentation on the same hardware structure, by reconfiguring a few devices several times during the processing of each image. This amounts to assigning the same hardware resource, an FPGA device, to the execution of a sequence of algorithms according to a defined schedule (virtual hardware). Using this concept and exploiting temporal parallelism, the final system allows the FPGA to run consistently at its peak rate. Updating an ASIC, by contrast, is longer and more expensive; an algorithm implemented in an ASIC could be replaced by an implementation on a DRA such as ARDOISE with the same computation time, provided the data parallelism is well exploited and the operators are pipelined. Furthermore, a DRA offers a level of flexibility formerly reserved for microprocessors. This architecture exploits the fast dynamic reconfiguration provided by recent FPGA chips. The AT40K family from Atmel, which is particularly suited to RTR, is used because it can be rapidly reconfigured: these devices, with a capacity of 45K equivalent gates, can be totally or partially reconfigured in less than 1 ms.

Fig. 3. Organization view of ARDOISE system (a) and details of one module (b): (a) host processor (ADSP 21061) or host PC (serial port), scheduler/global I/O mother module (FPGA + RAM + Flash + clock generator), configuration bus, global I/O bus, computing modules (1)-(3), Data IN/OUT; (b) one module: an AT40K FPGA connected to two 256k x 32 memories through address and data buses.

It is possible to rapidly reconfigure an arbitrary subset of the device through a fast parallel programming interface, because the configuration bus allows easy connection to microprocessors. The ARDOISE architecture [8-10] is based on three (or more) identical modules (Fig. 3a). Every module (Fig. 3b) includes one FPGA connected to two local memories used to save temporary intermediate results. The modules are interconnected by a 48-bit bus which can carry either addresses or data. The free buses on both sides are used for data input and output (for example, camera and screen). Another module, called the mother module, includes all the hardware for configuration storage in compressed form, scheduling of the configurations and clock generation. The mother module's FPGA is used to decompress the configuration files during the reconfiguration process and to manage the exchange of high-level results with the host processor. ARDOISE can be used in two modes: stand-alone mode and coprocessor mode, the second being used to accelerate algorithm computation. In stand-alone mode, the configuration data of every processing step to be interleaved are initially stored in the mother board's FLASH memory; after power up, the FPGA of the mother board is itself configured. In the second mode, the ARDOISE system is endowed with an external host processor (a SHARC 21061 DSP from Analog Devices). In this configuration the host processor can fill the memories (static memory or flash memory) with the data of the various configurations. The mother board tasks are summarized in the following points (a host-side sketch of this sequence is given after the list):

* it reconfigures the modules by loading the configuration data; this operation, equivalent to writing into a memory through the FPGA fast interface, is performed over the configuration bus;
* it manages the algorithm data by mapping DMA-like operators in the various modules;
* it reduces the configuration delay: the configuration data are stored in memory in compressed form, and the mother board performs decompression on the fly and sends the configuration data to the desired modules;
* it manages the configuration data dynamically by loading them selectively into the modules;
* it distributes the clocks to the modules. To benefit from the whole frame duration (40 ms), the algorithms swapped into the computing module run at various frequencies, greater than the frame data in/out rate; the mother module integrates a block which generates four clocks whose frequencies are programmable.

Some of these points are discussed in the tools section.
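For illustration only, the sketch below shows, in C++, the kind of control sequence the mother module applies within one frame period in stand-alone mode. Every name in it (Configuration, decompress, write_config_bus, select_clock, run_until_done) is hypothetical; the real controller is implemented in the mother module's FPGA and driven by the Flash contents, not by software like this.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical descriptor of one dynamically swapped processing step.
struct Configuration {
    const char* name;
    std::vector<uint8_t> compressed_bitstream;  // stored compressed in the mother board's Flash
    int clock_id;                               // one of the four programmable clocks
};

// Stand-ins for services implemented by the mother module's FPGA (stubs here).
std::vector<uint16_t> decompress(const std::vector<uint8_t>& in) {
    return std::vector<uint16_t>(in.begin(), in.end());     // placeholder expansion
}
void write_config_bus(int module, const std::vector<uint16_t>& words) {
    std::printf("  module %d: %zu configuration words written\n", module, words.size());
}
void select_clock(int module, int clock_id) {
    std::printf("  module %d: clock %d selected\n", module, clock_id);
}
void run_until_done(int /*module*/) { /* wait for the end-of-step signal */ }

// One frame period (40 ms): load and run each configuration in turn on the
// same computing module, i.e. the time-multiplexed 'virtual hardware' scheme.
void process_frame(int module, const std::vector<Configuration>& schedule) {
    for (const Configuration& cfg : schedule) {
        std::printf("loading %s\n", cfg.name);
        write_config_bus(module, decompress(cfg.compressed_bitstream)); // decompression on the fly
        select_clock(module, cfg.clock_id);                             // per-algorithm clock rate
        run_until_done(module);                                         // results stay in local memories
    }
}

int main() {
    std::vector<Configuration> schedule = {
        {"smoothing", std::vector<uint8_t>(64, 0), 0},
        {"gradient",  std::vector<uint8_t>(64, 0), 1},
    };
    process_frame(1, schedule);
}
```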

3.1. ARDOISE utilization and swapping algorithms in the FPGA
Figs. 4a and b show one of the multiple uses of ARDOISE. The two GTI modules are used as frame grabbers. During the computation of frame n, frame n + 1 is being input and stored in memory A, and frame n - 1 is read from memory C and output. Frame n is initially stored in memory B. The first configuration is loaded on the central FPGA (in gray) and the first algorithm is applied to the data previously stored in memory B (Fig. 4a). The FPGA is then configured again to compute the second algorithm (Fig. 4b); D and B are used respectively as input and output memories for this algorithm. For successive algorithms, the scheme is repeated. At the end of the frame duration, several algorithms have been computed on the central FPGA, and the roles of memories A and B (respectively C and D) are exchanged for the next frame computation.
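The buffer scheduling of Fig. 4 can be summarized by the toy model below; it only tracks which memory bank plays which role, and it assumes an odd number of swapped algorithms so that the final result of frame n ends up in the bank that GTI2 reads out during frame n + 1. The memory names follow Fig. 4; everything else is illustrative.

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <utility>

int main() {
    // Memory roles as in Fig. 4: GTI1 alternates between A and B for frame
    // grabbing, GTI2 alternates between C and D for frame output.
    std::array<std::string, 2> grab = {"A", "B"};
    std::array<std::string, 2> out  = {"C", "D"};
    const int algorithms_per_frame = 3;  // odd, so the last result lands in the output bank

    for (int frame = 0; frame < 2; ++frame) {
        std::string compute_in  = grab[(frame + 1) % 2]; // the frame grabbed one period earlier
        std::string compute_out = out[(frame + 1) % 2];  // read out by GTI2 one period later
        std::printf("frame %d: GTI1 writes %s, GTI2 reads %s\n",
                    frame, grab[frame % 2].c_str(), out[frame % 2].c_str());
        for (int a = 1; a <= algorithms_per_frame; ++a) {
            std::printf("  config %d on BC: input %s, output %s\n",
                        a, compute_in.c_str(), compute_out.c_str());
            std::swap(compute_in, compute_out);  // intermediate results ping-pong between banks
        }
    }
}
```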


Fig. 4. Scheme of algorithm execution with dynamic reconfiguration: (a) first configuration (ALGO 1), (b) second configuration (ALGO 2); the central FPGA (BC) is surrounded by the GTI1 and GTI2 modules and their 256k x 32 memories A, B, C and D.

Table 1
Performance measures of the algorithm implementations

Algorithms                     Number of   Pixels computed   Computation   Configuration
                               config.     in one cycle      time (ms)     time (ms)
Deriche's filter               2           2                 7.9           1.08
Sobel's gradient               1           2                 3.9           0.54
Contour closing                2           6                 2             1.08
Region labelling (1st stage)   1           1                 7.9           0.54
Region labelling (2nd stage)   1           2                 4             1.08
Total                          6                             25.7          4.32

The GTI1 and GTI2 modules allow the central computing module (BC) to be desynchronized from the video acquisition system, which often runs at a lower frequency than the maximum the FPGA can sustain. The different processing steps are swapped into the central module and computed at a high clock speed. Intermediate results are stored in the GTI local memories, with different memory models, during computing or reconfiguring. The computing module memories are used to store local data inherent to every algorithm: temporary data, data structures such as FIFOs, and various processing parameters. In addition, because most of the processing steps must deal with non-interlaced images, the GTI module can be used to handle the interlacing of the video data sampled from the acquisition system.

3.2. Example of algorithm implementation
The potential of ARDOISE has been tested on an image segmentation application. As in some digital signal processing chains, the image segmentation implementation includes basic stages: noise filtering, edge detection, contour closing and region labeling. The algorithms studied for noise smoothing are Deriche's filter, Sobel's convolution masks and Nagao's algorithm (see Section 5). For edge detection, Sobel's gradient masks are used. Finally, a solution using a cellular automata architecture [11] for contour closing and Rosenfeld's algorithm for region labeling was chosen. All

of these algorithms had been previously studied on static architectures. At each stage, ARDOISE's FPGA is reconfigured to perform a specific algorithm. The following measures are given for a throughput frequency of 35 MHz and an image resolution of 512 x 512 pixels. The complete application can be executed in less than 31 ms; Table 1 gives the details.

3.2.1. Deriche's noise filter
Through this illustrative example, the efficiency of the ARDOISE architecture, and the need for effective tools to rapidly simulate and synthesize the algorithm stages, are highlighted. The chosen algorithm is FGL's smoothing filter [12], a version of Deriche's filter. This operator has already been studied, optimized and implemented on various technologies (ASIC, DSP and FPGA). The filter has a single parameter (γ, 3 bits in length) which allows the impulse response to be adapted to the input image, for example to choose the resolution of the filter for smoother intensity transitions. Fig. 5 defines the computation model of the filter. The elementary cell is a low-pass filter; the z-transform and the recursion equation are given below:

$$H(z) = \frac{1-\gamma}{1-\gamma z^{-1}} \;\Rightarrow\; Y_n = X_n + \gamma\,(Y_{n-1} - X_n). \qquad (1)$$
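As a minimal software illustration of Eq. (1) (not the hardware implementation), the routine below applies the recursion to one line of pixels; it assumes, purely for illustration, that the 3-bit parameter encodes γ = k/8.

```cpp
#include <cstdint>
#include <vector>

// One causal pass of the FGL/Deriche first-order low-pass, Eq. (1):
// y[n] = x[n] + gamma * (y[n-1] - x[n]), with y[-1] taken as x[0].
std::vector<float> fgl_lowpass(const std::vector<uint8_t>& x, int k /* 3-bit parameter, 0..7 */) {
    const float gamma = static_cast<float>(k) / 8.0f;  // assumed encoding of the 3-bit parameter
    std::vector<float> y(x.size());
    float prev = x.empty() ? 0.0f : static_cast<float>(x[0]);
    for (std::size_t n = 0; n < x.size(); ++n) {
        prev = static_cast<float>(x[n]) + gamma * (prev - static_cast<float>(x[n]));
        y[n] = prev;
    }
    return y;
}
```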

Fig. 5. Deriche's smoothing filter, F. Garcia Lorca's version.

Fig. 6. 2D implementation of FGL's recursive filter: (a) horizontal stage, (b) vertical stage, each built from 2 x FGL filters and a line buffer (LIFO).

3.2.2. 2D image implementation
The Deriche noise-filter implementation developed for real-time image smoothing is presented below. FGL's algorithm, a single-pole recursive low-pass filter, was preferred as the basic stage of the realization. The causal and anti-causal parts of the 1D structure include two stages connected in cascade (Fig. 6a). The data samples of each line are first filtered with the two FGL filters from left to right; the pixels are then processed in the same manner, but with the filter moving from right to left. Finally, the results are stored horizontally. In a software emulation, these stages can be performed separately using an intermediate buffer. The recommended solution to extend the filtering to two dimensions is to insert an intermediate image memory between the row and column passes. During the vertical step (Fig. 6b), the previously calculated results are processed in the same manner as in the horizontal step; the input data are simply taken vertically from the image memory. A compact software emulation of this two-pass organization is sketched below.
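The following sketch is a purely software emulation of this two-stage organization, written under the simplifying assumptions of a single FGL cell per direction (the hardware of Fig. 6 cascades two cells per direction) and of a floating-point intermediate image in place of the line buffers.

```cpp
#include <cstddef>
#include <vector>

// Separable smoothing in the spirit of Fig. 6: a causal then an anti-causal
// first-order pass on every row, then the same two passes on every column,
// reading the vertical input from the horizontally filtered (intermediate) image.
using Image = std::vector<std::vector<float>>;  // [row][col], assumed rectangular

static void smooth_line(std::vector<float>& line, float gamma) {
    if (line.empty()) return;
    float prev = line.front();
    for (std::size_t n = 0; n < line.size(); ++n)        // left-to-right (causal)
        line[n] = prev = line[n] + gamma * (prev - line[n]);
    prev = line.back();
    for (std::size_t n = line.size(); n-- > 0; )         // right-to-left (anti-causal)
        line[n] = prev = line[n] + gamma * (prev - line[n]);
}

Image smooth_2d(Image img, float gamma) {
    for (auto& row : img) smooth_line(row, gamma);       // horizontal stage
    if (img.empty()) return img;
    const std::size_t w = img[0].size();
    std::vector<float> col(img.size());
    for (std::size_t x = 0; x < w; ++x) {                // vertical stage
        for (std::size_t y = 0; y < img.size(); ++y) col[y] = img[y][x];
        smooth_line(col, gamma);
        for (std::size_t y = 0; y < img.size(); ++y) img[y][x] = col[y];
    }
    return img;
}
```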

3.2.3. Implementation of Deriche's filter using ARDOISE
One of the main advantages of using the ARDOISE structure for real-time image processing is the ability to perform the processing independently of the data stream (the computing rate is desynchronized from the data input sampling). This means that several data samples can be processed in parallel. The 2D filtering has been decomposed into two configurations: during the first configuration the horizontal stage (Fig. 7a) is applied, and the computing unit is then reconfigured dynamically to implement the vertical stage (Fig. 7b). The computation is scheduled over two clock cycles. The left frame grabber, GTI1, is in input mode while the right one, GTI2, is in output mode. Several models of data organization in the memories can be implemented in the GTIs' reconfigurable logic in order to optimize the execution of each processing step. During video data input, pixels are stored in the memory controlled by GTI1 in such a way that two pixels of the same column, on two consecutive lines, are stored at the same address. The first cycle begins with the sampling of two pixels from GTI1's memory. The left-to-right filtering step is then applied to the two pixels in parallel and ends with the storage of the results in the intermediate buffer. During the same clock cycles, intermediate results of four pixels of the previous lines are extracted from the buffer to complete the horizontal filtering with the right-to-left step, so two lines are processed in parallel. The 1D processing results are stored in GTI2's memory after a data reorganization procedure. The vertical processing is carried out similarly during the following configuration. Including the configuration time, with a configuration clock rate of 30 MHz, the filtering of one image (512 x 512 pixels) is completed in less than 10 ms.
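As a rough consistency check (an estimate only, ignoring pipeline fill, blanking and the data reorganization steps), the order of magnitude of this figure can be recovered from the parameters given above:

$$t_{comput} \approx 2 \times \frac{512 \times 512}{2 \times 35 \times 10^{6}\ \text{Hz}} \approx 7.5\ \text{ms}, \qquad t_{config} \approx 1.1\ \text{ms},$$

i.e. roughly 8.6 ms for the two configurations of the 2D filter, of the same order as the 7.9 ms + 1.08 ms reported in Table 1 and below the quoted 10 ms bound.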

4. Performances and limits of DRA
The ARDOISE structure was defined to be very flexible in order to easily implement most of the algorithms encountered in real-time image processing. The purpose of the ARDOISE project is to elaborate development methodologies for DRAs. The partitioning strategy, the study of DRA performance and the management of the configurations are the important points that have been studied.

Fig. 7. Implementation of the Deriche's noise filter: (a) horizontal stage, (b) vertical stage; each stage uses pairs of 2 x FGL filters between the GTI1/GTI2 data storage and an intermediate buffer (LIFO structure) built from 32-bit registers R1-R4.

The immediate benefit of dynamically reconfigurable computing is silicon reduction. Certainly, the integration density of silicon is not the critical point today. However, in the case of a system on chip (SoC) combining processors, memories, DSP functions and, why not, a reconfigurable hardware area, it is important to reduce the size of the latter. Nevertheless, the flexibility of choosing the algorithms to be used in a simple manner is even more important: one can perform, in real time and with a hardware process, an image analysis whose result helps to choose the appropriate algorithm for the incoming image. This functionality makes it possible to exploit mechanisms of sequence breaking and decision taking. Because it is possible to adequately pipeline the design and exploit more parallelism per cycle, FPGAs can provide a computational power per unit of area higher than that of conventional processors and can thus complete more work per unit of time. In a target application dedicated to real-time image processing, because of the horizontal and vertical blanking time, the valid samples occupy only a proportion of the total frame duration. If the image size is N pixels, the data input frequency Fi is

$$F_i = \frac{N}{T} = \alpha F_s, \qquad (2)$$

where T is the frame duration and Fs the video sampling clock.

4.1. Hardware usability
The main objective of DR is to allow a system to react during run-time and choose the algorithm best suited to the data of the target application. Compared to a static solution, DR does not improve the execution speed of an application; however, it reduces the reconfigurable logic area in the FPGA and optimizes its usability. For static architectures, Bertin [13] suggested expressing the power Pus needed by an application as the product of the number of gates Gs to be used and the computing frequency (Eq. (3a)). In a real-time implementation without hardware buffering, the computing frequency necessarily depends on the video sampling clock Fs. By analogy with the application power, one can define an architecture power as the product of the number of equivalent gates and the maximum frequency Ft that the system can supply (Eq. (3b)):

$$P_{us} = G_s F_i, \qquad (3a)$$

$$P_{ms} = G_s F_t. \qquad (3b)$$

The quotient of Pus and Pms gives the usability of the hardware resources ηs:

$$\eta_s = \frac{P_{us}}{P_{ms}} = \frac{F_i}{F_t}. \qquad (4)$$


For example, image processing requires more and more computational power and data throughput, and real-time implementations (without hardware buffering) give disappointing figures (the arithmetic is detailed below):

* image size N = 512 x 512 pixels (α = 0.655, Fs = 10 MHz, Ft = 35 MHz) ⇒ ηs = 0.19.
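The 0.19 figure follows directly from Eqs. (2) and (4):

$$\eta_s = \frac{F_i}{F_t} = \frac{\alpha F_s}{F_t} = \frac{0.655 \times 10\ \text{MHz}}{35\ \text{MHz}} \approx 0.19 .$$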

4.2. Performances and silicon reduction
A static implementation which uses Gs gates is partitioned into C stages (or partitions) to be mapped onto a DRA. Within one unit of time T (the frame duration), the C processing steps are computed by time-multiplexing execution and reconfiguration steps: the system reconfigures the same configurable logic to implement the different partitions at successive times. Functional flexibility and data parallelism β are the main goals that drive the FPGA designs. In the rest of the article, let Gd denote the number of reconfigurable gates. To simplify the presentation, one supposes that the partitions are of equivalent size and use the same data parallelism β. The silicon area reduction is

$$R_{s \to d} = \frac{G_s}{\beta G_d} = \frac{C\,G_d}{\beta G_d} = \frac{C}{\beta}. \qquad (5)$$

Maximizing the data parallelism means trying to find the best implementation of a given task on the FPGA-based DRA. The computation time for one image is the product of the number of pixels and the delay required to evaluate one pixel (with Ti = 1/Fi and Tt = 1/Ft):

$$D_{comput} = \frac{T}{T_i}\cdot\frac{T_t}{\beta} = \frac{F_i\,T}{\beta F_t}. \qquad (6)$$

In addition, if Vc denotes the configuration speed of a given FPGA family, expressed in number of gates configured per second, the reconfiguration time is

$$D_{reconfig} = \frac{G_d}{V_c}. \qquad (7)$$

The duration Dstep needed to execute one configuration at frequency Ft, including the configuration overhead, is

$$D_{step} = D_{comput} + D_{reconfig} = \frac{F_i\,T}{\beta F_t} + \frac{G_d}{V_c}. \qquad (8)$$

The temporal-order constraint imposes that the sum of the computation times of all the partitions of the algorithm, plus the additional time required for reconfiguration, be less than or equal to the block acquisition duration T. These overhead times can be minimized separately to improve performance. One deduces, under these conditions, the maximum number of configurations C for a given FPGA family, according to the size of the partitions Gd and the data parallelism rate

β used for each implementation:

$$C\,D_{step} \le T \;\Rightarrow\; C\left(\frac{F_i\,T}{\beta F_t} + \frac{G_d}{V_c}\right) \le T \;\Rightarrow\; C = \left(\frac{F_i}{\beta F_t} + \frac{G_d}{V_c T}\right)^{-1}. \qquad (9)$$

The previous expression contains two terms: the first is the overall processing time and the second is the total time required to reconfigure the hardware. To study the impact they can have on the number of configurations C, consider real-time image processing at a resolution of N = 512 x 512 pixels. For a frame rate of 25 frames/s (frame duration T = 40 ms), the pixel sampling frequency is Fs = 10 MHz and Fi = N/T = 6.55 MHz; the rate of valid pixels is α ≈ 0.655. For illustrative purposes, the working frequency Ft and the configuration speed Vc, which depend heavily on the structure and technology of the FPGA used, are fixed. The AT40K family, for example, offers fast reconfiguration, with Vc = 50 x 10^6 gates/s. Because FPGA implementations of many algorithms used in video image processing have been studied, our experience suggests that a maximum frequency Ft = 35 MHz is a good choice. With these technology and application constants, Eq. (9) becomes

$$C \approx \frac{2\times 10^{6}\,\beta}{0.4\times 10^{6} + \beta G_d} \ \text{(Atmel AT40K family)}, \qquad C \approx \frac{2\times 10^{4}\,\beta}{0.4\times 10^{4} + \beta G_d} \ \text{(Xilinx 4000E family)}. \qquad (10)$$

Fig. 8 plots the relationship of Eq. (10). For a dynamically reconfigurable FPGA, the study of C as a function of Gd and β shows that the size and the configuration time of the physical device have a strong impact on the number of successive reconfigurations. For example, the configuration of Atmel AT40K devices is 100 times faster than that of the Xilinx 4000E family; consequently, the complexity of the application which can be implemented is 100 times greater. In this project the Atmel AT40K family, which allows faster reconfiguration, is used. In the first curve, two zones can be distinguished: in the first one (Gd << 10^5) the number of configurations is sufficient, so the use of DR is completely justified; on the contrary, in the second zone (Gd >> 10^5) the number of configurations decreases quickly. For a given FPGA technology, these curves can help to choose the size Gd of the device to be used for building DRAs. The second, complementary curve, drawn for the Atmel AT40K technology, shows the DR efficiency for a varying data parallelism rate. For a fixed size Gd, one notices that data parallelism is not systematically an advantage: beyond a certain value of β, the total configuration time becomes dominant compared with the computation time.
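As a concrete reading of Eq. (10), taking the device size quoted in Section 4.4 for the AT40K40 used in ARDOISE (Gd ≈ 5 x 10^4 gates) and a data parallelism rate β = 2,

$$C \approx \frac{2\times 10^{6}\times 2}{0.4\times 10^{6} + 2\times 5\times 10^{4}} = \frac{4\times 10^{6}}{5\times 10^{5}} = 8,$$

which is comfortably above the six configurations actually used for the segmentation chain of Table 1.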

Fig. 8. Impact of the FPGA technology on the efficiency of DR: (a) number of configurations C versus partition granularity Gd, comparing the Atmel AT40K and Xilinx XC4000E families for fixed values of β (1, 2, 4); (b) number of configurations versus the data parallelism rate β for fixed values of Gd (40 000, 100 000 and 750 000 gates).

4.3. Architecture power and efficiency

By considering Eq. (9), one can summarize the limit of the architecture power as follows:

$$P_u = G_d F_t \left(1 - C\,\frac{G_d}{V_c T}\right). \qquad (11)$$

As underlined above, the expression of this limit depends essentially on the technological performance of the FPGA device used: the configuration speed Vc and the maximal computing frequency Ft. For a fixed number of configurations, when Gd increases, the architecture power reaches a maximum:

$$P_{u\,max} = \frac{F_t\,G_d}{2} \quad\text{when}\quad G_d = \frac{V_c T}{2C}. \qquad (12)$$

The maximal complexity of the application which can be implemented on the DRA is then

$$G_{s\,max} = \sum_{p=1}^{C} \beta_p\,\frac{V_c T}{2C} \;\ge\; \frac{V_c T}{2}, \qquad (13)$$

where βp is the data parallelism rate at which configuration p is processed. The preceding expression is a function of Vc and T only. This result is very important: the algorithms used for image segmentation, described in Section 3.2, add up to less than 4 x 10^5 gates, which consolidates the choice of the AT40K technology to build ARDOISE.

4.4. How to choose the FPGA device
For a given application (Fi, T, Gs), the useful power is

$$P_{static} = G_s F_i \le P_{dynamic}. \qquad (14)$$

DR allows the best silicon reduction when the data parallelism rate is β = 1; under this condition the silicon gain increases proportionally with Ft. The silicon reduction can then be approximated by

$$\frac{G_s}{G_{d\,min}} \approx \frac{F_t}{F_i}. \qquad (15)$$

The performance obtained is better when the chosen device size Gd is less than 20% of this complexity limit. In this project, Atmel's AT40K40 device is used; it offers about 50K gates, i.e. a good ratio of about 6%.

4.5. Which reconfigurable device should be used to build DR systems?
DR is a concept which is not reserved only for FPGAs that offer a high configuration speed, such as the Xilinx XC6200 and Atmel AT40K series. These two FPGA families allow dynamic partial or complete reconfiguration during run-time through a fast configuration interface. The overall performance can be degraded by the reconfiguration time of the reconfigurable logic, but this does not mean that classical FPGAs cannot be used to design dynamic systems. In recent years, a great number of systems have been built by research teams integrating mechanisms aimed at reducing the impact of the configuration time: configuration data caching, bitstream compression techniques, masking of the configuration time, etc. The configuration-masking strategy was studied by Guermoud in his thesis [14]. In the following, the conclusions of a comparative study of performance are presented. Three typical DR realizations, using one or two configurable devices, are evaluated:
1. Without masking the reconfiguration time.
2. Masking the reconfiguration time: one device is reconfigured while the other one is used to compute.


3. Doubling the reconfiguration speed: the two devices are reconfigured at the same time and then compute together.
The three solutions use the same quantity of hardware resources: total number of gates, memory size and memory bandwidth. To simplify the expressions and the interpretation of the figures, one assumes that the same data parallelism rate is used for each algorithm.

4.5.1. Solution without masking the reconfiguration time
In this case, a single FPGA with a capacity of Gd equivalent gates is used. In order to execute several algorithms, a series of reconfiguration/computation steps is performed alternately.

4.5.2. Solution with masking the reconfiguration time
In this architecture (Fig. 9), multiple reconfiguration/computation steps are applied in alternation to each of the two FPGAs (each providing the equivalent of Gd/2 gates).

4.5.3. DR with double configuration speed
Here, the two FPGA devices are reconfigured simultaneously with different configuration data. After the reconfiguration step, each FPGA starts to execute its own processing. Fig. 10 shows the operation of this architecture; this mechanism doubles the configuration speed. By calculating the silicon reduction, the performance of the three architectures [15,16] can be estimated and compared. Table 2 summarizes the expressions of the silicon area gain for the various solutions, and Fig. 11 plots the three curves: the silicon reduction of the three solutions is drawn against the size normalized by Vc T, for a ratio Ft/Fe = 10. Two zones can be clearly distinguished, according to the importance of the configuration time. In the first zone, masking the configuration time is inadvisable, because the silicon reduction is lower than that of the classical solution. In the second zone, the masking solution is better, because the configuration time is the dominant term. DR is interesting only if the configuration time of the FPGA is around 10% of the block duration T; this feature depends on the technology of the FPGA used, and it is indeed the case for the AT40K device used here.

Fig. 9. Technique of masking the reconfiguration time: while FPGA-1 executes on the data flow, FPGA-2 is reconfigured (step (a)), and the roles are exchanged at the next step (step (b)).

Fig. 10. Technique of doubling the reconfiguration speed: (a) the two FPGAs are reconfigured simultaneously, (b) they then compute together on the data flow.

Table 2
Silicon area gain of three dynamic architecture examples

Architecture without masking configuration delays:   R1 = (Ft/Fe) (1 - Gs/(Vc T))
Architecture with masking configuration delays:      R2 = Ft/(2 Fe)
Architecture with doubling configuration speed:      R3 = (Ft/Fe) (1 - Gs/(2 Vc T))
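The boundary between the two zones mentioned above follows directly from the expressions in Table 2: R1 = R2 exactly when

$$\frac{F_t}{F_e}\left(1 - \frac{G_s}{V_c T}\right) = \frac{F_t}{2 F_e} \;\Longleftrightarrow\; \frac{G_s}{V_c T} = \frac{1}{2},$$

so masking the reconfiguration time only pays off when the total configuration time Gs/Vc exceeds half of the block duration T.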

Fig. 11. Silicon reduction of the three various solutions (R1, R2, R3 and the static architecture) as a function of Gs/(Vc T).

Thus, the technique of masking the configuration time becomes interesting only if the FPGA has a slow reconfiguration speed. The third technique, doubling the configuration speed, gives the best result. Furthermore, it is easier to configure two FPGAs in parallel than in alternation. This shows that the way to increase performance is technological: FPGA devices should be endowed with mechanisms allowing the reconfiguration time to be reduced (a better reconfiguration interface, configuration data caching, partial reconfiguration, etc.). In conclusion, for a given technology the parallel reconfiguration solution offers better performance, because it reduces the configuration time; the solution based on masking the configuration is consequently of limited interest. Moreover, because of the ping-pong management of the two FPGA devices, its implementation presents some difficulties (a more complex PCB).

5. Application example: edge detector implementations
The first step after the hardware architecture development was obviously testing (Fig. 12). Some processing chains, already studied, were selected and mapped onto ARDOISE. The study of these algorithms underlines the advantages of using DR and also answers the question: why is ARDOISE suitable for these applications? This section deals with edge detection algorithms.

5.1. Management of configurations using cache logic
As already explained, one way of using ARDOISE is to build IP libraries. The IPs can be considered as macro-instructions that are executed (i.e. loaded) one after the other on ARDOISE. The first IPs developed were the smoothing filters of Sobel, Deriche and Nagao. These algorithms are not presented here, because they have been widely studied and used by the scientific community for the same purpose: image smoothing.

Fig. 12. ARDOISE prototyping platform.

However, they do not have the same characteristics. For example, Sobel is computed very fast but does not correctly smooth very noisy pictures. Nagao gives better results for images with little noise but takes more time. Finally, Deriche's filter gives the best results because it can be parameterized according to the image's noise level; it can be used unconditionally when the computation time is not critical, but it takes at least twice as long as Nagao. The choice of algorithm has to achieve a good compromise between the global processing time and the required quality. The use of ARDOISE allows these three algorithms to be mapped onto a small portion of programmable logic. While the currently computed algorithm is mapped onto the FPGA, the two other algorithms are stored in cheaper media (SRAM, FLASH memory or a hard disk drive). Similarly to data, logic is thus kept in a cache hierarchy: when a piece of logic is stored in SRAM, it is ready to be loaded onto the device, whereas when it is stored on a hard disk drive, a long time is needed to transfer it from the disk to the SRAM and then configure the FPGA (Fig. 13). Today, the configuration data of the algorithms are stored in the FLASH memory. At start-up, one can decide which of these algorithms will take place in the SRAM (there is enough room for the three IPs and more) and then configure the computing device with the desired IP. All these operations are described with a high-level language: memory transfers and configurations are managed with a C library (a hedged sketch of the kind of interface involved is shown below).
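The sketch below only illustrates the cache-logic idea of Fig. 13; every function and container name in it is invented for the illustration, and the real C library, its bitstream format and its timing are those of the ARDOISE prototype and are not reproduced here.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

using Bitstream = std::vector<uint16_t>;

// Hypothetical storage hierarchy of Fig. 13: slow Flash (or disk) holds every
// IP, SRAM caches the ones likely to be needed, the FPGA holds the active one.
std::map<std::string, Bitstream> flash = {
    {"sobel", Bitstream(1000, 0)}, {"nagao", Bitstream(4000, 0)}, {"deriche", Bitstream(3000, 0)}};
std::map<std::string, Bitstream> sram;   // cached, ready to be loaded quickly
std::string active_ip;                   // what the FPGA is currently configured with

void prefetch_to_sram(const std::string& ip) {            // may run while the FPGA computes
    sram[ip] = flash.at(ip);
    std::printf("cached %s in SRAM (%zu words)\n", ip.c_str(), sram[ip].size());
}

void configure_fpga(const std::string& ip) {              // fast path: SRAM -> configuration bus
    if (!sram.count(ip)) prefetch_to_sram(ip);             // slow path if the IP was not cached
    active_ip = ip;
    std::printf("FPGA reconfigured with %s\n", ip.c_str());
}

int main() {
    prefetch_to_sram("sobel");        // decided at start-up
    configure_fpga("sobel");          // process frames with Sobel...
    prefetch_to_sram("nagao");        // ...while the next IP is brought into the cache
    configure_fpga("nagao");          // swap algorithms without touching the Flash again
}
```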

Fig. 13. Cache logic technique: the hard disk drive of the PC host stores configurations A-F, a subset is cached in the SRAM through the configuration manager, and the currently active configuration is loaded into the FPGA logic.

Of course, it is possible to store a configuration bitstream in the SRAM while the active configuration is being computed in the FPGA. For example, it is possible to transfer the configuration data of Nagao's filter from the Flash memory to the SRAM and, when this transfer is completed, to make the Nagao task active by configuring the FPGA directly from the SRAM. The design placed in the scheduler makes this transfer possible without interrupting the configuration task. For this example, the transfer time is less than the processing time of one image, so the system reacts very quickly to an order to change the configuration. The transfer time can become critical if the configuration data are not in the Flash memory but on a hard drive; the ARDOISE prototype is not well suited to this case, because transfers between the hard drive and the prototype are made over a serial port. Even then, the current process is not interrupted by the transfer, and the new configuration becomes active a few seconds after the order is given.

5.2. Dynamic reconfiguration
The first tests on ARDOISE used the architecture to compute the three smoothing algorithms of Deriche, Nagao and Sobel. The user chooses the algorithm best adapted to the processed picture; the algorithms are swapped at the push of a button. This underlines one of the benefits of dynamic configuration: these three algorithms are exclusive, since nobody wants to use Nagao and Deriche at the same time, yet the system may need each of them at different moments. Of course, an FPGA with 10 million gates would make it possible to map the three algorithms without dynamic configuration, but then the area where the two inactive algorithms are mapped is wasted, and the proportion of wasted logic grows with the number of stages and with the number of algorithms to be mapped per stage.

But this choice can be managed automatically. In this case, one of the configurations has to measure some chosen parameters, which are sent to the host processor. The processor is programmed at high level to choose which algorithm best suits the incoming data, and the FPGA is then configured with the corresponding IP. In this way, tasks are well shared: the hardware computes the low-level algorithms and those algorithms are managed in software. One can talk about dynamic configuration because the FPGA is configured twice during a single data treatment: the first configuration is the parameter estimation and the second is the smoothing algorithm. This high-level management has not been tested yet, but configuring the device twice sequentially for each incoming image has really been tested: the first configuration computes a smoothing filter and the second one computes the gradient's norm of the smoothed image. The time used by these two configurations and computations is less than 25% of the image duration, so it seems possible to map many different IPs onto the hardware. The next algorithms we will develop come from edge detection: thresholding, segmentation, etc. Some picture estimation algorithms will also allow the data-dependent scheduling of the IPs to be tested.

5.3. Partial configuration
This is the last point experimented with on ARDOISE. The FPGA used for this architecture supports fine-grain partial configuration, which means that it is possible to configure a single cell while the rest of the FPGA is working. The first experiment consists in developing a dynamically reconfigured application on a single ARDOISE board. This application is Sobel's smoothing filter (first configuration: dynamic1) followed by the gradient's norm computation (second configuration: dynamic2).


Fig. 14. Partial configuration.

These two processing steps were computed successively using dynamic configuration while a static treatment (image sampling) remained active (Fig. 14). This approach is not classic, and a new methodology was developed to place and route the designs [17]. The second experiment is more practical and concerns the Deriche smoothing filter. As mentioned, this algorithm is parameterized, and there are many ways to select the parameter. The parameter can be wired into the FPGA from the host processor, but this needs a special interface for each parameter, and the resulting interface becomes very large when there are many parameters. The classical way to overcome this problem is to store the parameters in internal registers; this solution needs a special interface to control the registers. When the number of parameters grows linearly, the number of data words stored grows linearly too, but the number of wires needed grows only logarithmically (1 address bit gives two possibilities, two give four, etc.). DR can solve this problem: to each value of the parameter corresponds a configuration. Consider the case of a single-bit parameter: one has to deal with only two different configurations. It is then possible to develop those two configurations independently (this method becomes unusable when the number of parameters grows) or to include the same register used previously in a unique design. The difference is that this register is now read-only and does not need any interface: its value is selected during configuration and can only be changed by another configuration of the FPGA. Partial configuration makes it possible to change this value very quickly because only the register zone has to be reconfigured. For example, 500 ns are needed to change Deriche's 3-bit parameter.

5.4. Sobel's implementation using one board
The ARDOISE architecture is oversized for computing Sobel's algorithm alone. For each incoming image, the two

configurations of Sobel should be the first steps executed on the central module of ARDOISE. The manufacture of the modules took several weeks; during this time, Sobel's algorithm was tested in real time on a single board. This implementation is the first dynamically reconfigured application really executed on ARDOISE, and it validates many concepts of this architecture.

6. Software tools for simulating and hardware debugging
DR is inadequately supported by commercial tools and by the FPGA manufacturers' tools. Previously, the tools used within the framework of this approach were based on the manual specification of various parameters, and the debugging of the prototype and the simulation of the applications relied on a set of tools written in VHDL. Testing DRAs calls for new CAD software and techniques which allow a high development flexibility.

6.1. Design flow
Previously the designs were coded in VHDL. Simulation and synthesis were performed using Synopsys tools, and the design was then placed and routed using the Atmel IDS place and route tool. During the simulation step, the designer uses binary image files as input; consequently, the data must be converted into an intermediate format (a text file) in order to simulate them within the VHDL simulation environment, and the inverse conversion must be applied to the results. As just seen, hardware/software debugging ends up being a heavy process because of the use of two different languages, C and VHDL (Fig. 15a). It is with the aim of simplifying this procedure that new CAD software has been developed. For more flexibility, a new simulation and debugging environment was written entirely in C++ with a SystemC approach (Fig. 15b). For that purpose, a hardware model of the ARDOISE architecture was written in SystemC.

Fig. 15. VHDL testbench environment (a) and novel approach of tools based on SystemC (b): (a) binary input/output images converted to and from text format by C programs around the ARDOISE VHDL simulator; (b) a C/C++/SystemC description and a generic module library feed simulation and testbenches (C++ compiler), FPGA synthesis (CoCentric compiler) down to RTL, and place and route (Figaro, Atmel) targeting the dynamically reconfigurable logic.

This environment was designed as a tool to help the designer simplify the development procedure of applications. SystemC is a C++-based, object-oriented, cycle-based simulation library intended primarily for hardware/software co-simulation. SystemC was launched by the Open SystemC Initiative in 1999 and includes all the language elements necessary to describe the hardware and software functionality of complex systems. The advantage of SystemC is the ability to use the same description language for both the synthesis and the simulation stages. The possibility of interfacing SystemC with GUI libraries (X-Windows or the WIN32 and system APIs) allows effective Windows GUI software to be created easily (Fig. 16). The ARDOISE simulator can be used to simulate both the hardware and the software components. The GUI environment enables the user to simulate and/or execute a processing step, edit memory contents, choose a test image, show the result, display waveforms, and create scripts for synthesis and place/route with a simple click (Fig. 17). To synthesize the SystemC hardware description into Register-Transfer-Level designs accepted directly by the Atmel IDS place and route tool, the Synopsys CoCentric compiler (Fig. 15b) is used. The most important objectives of the development environment are:
1. Easy verification of the designs (using images in their original coding).
2. Simulation steps faster than with a comparable HDL simulator.
3. Helping the user to evaluate a strategy's performance by using this environment as a software platform; indeed, DR is a new paradigm for which development methods and partitioning strategies have still to be created.

4. Testing in a sequential way the processing stages used in the DR steps.
5. Hardware/software co-emulation: it is possible to debug the processing stages and afterwards reconfigure the system through the host interface. A collection of pre-placed and routed circuits resides in the main memory of the host system, ready for rapid download onto the FPGA over the host interface.
6. Step-by-step execution of the configurations on the board and download of the contents of intermediate results (data memories).
7. An evaluation methodology to reduce the resource requirements in partially reconfigurable systems.
8. Measurement of the reconfigurable architecture performance with more precision and refinement.
9. A development environment for System on Chip (SoC), intellectual property (IP) and reusable hardware/software design in digital image applications.

6.2. Constraints on IP for DRA
The first point to remember is that the application programming is based on the parallel or sequential association of coarse-grain blocks, or IPs. The design flow is therefore top-down if all the IPs to be used already exist; if not, the new IPs can be designed with conventional tools. In the context of DRA, specific constraints must be added to the IP description and organization. All the memories used by an IP must be extracted into separate blocks; the IP core then just contains the operator flow used to produce one result of the given algorithm, and the inputs of the IP core are reduced to the data needed to compute this result. The global execution on the data packet can thus be seen as iterations (nested loops) of the one-result process. This allows intelligent data placement in memory and smart address generators for local and inter-configuration storage management to be defined.


Fig. 16. ARDOISE simulation and debugging environment.

Fig. 17. Simulator: designed to be easy to use.

It is also possible to compute the synchronization for the configuration manager. This method is also good practice for the use of IPs in standard SoC designs, where the memory model can change from one design to another (resource mapping, memory partitioning, etc.). For each IP, an extraction of the amount of resources needed (memory, computation), measures of the propagation time (with synthesis, place and

route on the Atmel FPGA) and measures of the memory bandwidths are made. Global partitioning [7] is then used to define the moments where reconfiguration steps take place (either inherent to the IP, or imposed by a lack of resources). Between two configuration steps, data parallelism can then be used to reduce the computation time, taking care of resource availability. It would be a good idea to build a specific tool which could automatically synthesize a data-parallel version of a given IP core.


After this first step, memory allocation is studied and the address generators are defined (specific tools for automatic VHDL and SystemC generation are under construction). The synchronization module of each configuration is elaborated and the global scheduler module of the application (which will be mapped onto the mother module's FPGA) is defined. A functional simulator is then used to verify the global application process on real data. After that, the back-end part of the design flow takes place. First, each individual FPGA configuration is produced using the conventional Atmel place and route tools. The address generators for the local memory accesses during a given configuration are placed in the central FPGA (in gray in Fig. 4), and the address generators for intermediate storage between configurations are placed in the GTI modules' FPGAs (Fig. 4). The configuration files are generated in a compressed form to reduce the size and the bandwidth of the mother module memory.

6.3. Functional simulator including reconfiguration management
In order to simulate a dynamically configured application, the results of the IPs must be saved after each simulated configuration, because these results may be the inputs of the next configuration. For this purpose, a model of the ARDOISE hardware was written in both VHDL and SystemC. These descriptions contain the three daughter boards' FPGAs and the six corresponding RAMs. Each RAM is associated with two ASCII files: the first represents the RAM before simulation and is read at the beginning of the simulation by the RAM module; the second represents the RAM at the end of the simulation and is filled by the RAM module when all computations are over. It is therefore possible to describe the three FPGA modules and to chain different simulations using previous results. The algorithm descriptions used for simulation can be directly synthesized and then placed and routed with the Atmel IDS software.

6.4. About daughter board configuration
The AT40K FPGA family offers a high-speed configuration interface: it is possible to set two configuration bytes in one 33 MHz cycle, by forcing a 24-bit address and a 16-bit data word onto the configuration interface of the FPGA [5]. The MD4 configuration file format is adapted to this task: each 16-bit data word is directly associated with its address. On the one hand, it is really easy to read such a file from a RAM and to send it to the high-speed configuration interface of the AT40K; on the other hand, the file is bigger than if it contained only useful data. The BST file format mostly contains data: in this format the data are arranged in windows. A window is delimited by its beginning and ending addresses, and all the intermediate addresses are generated by an incremental algorithm; the data just have to be stored in the same order as the addresses are computed. BST files are usually smaller than MD4 files because two addresses may be enough to configure many data words, but this format is not directly suitable for high-speed configuration because it has to be unpacked before the data are sent to the FPGA's high-speed interface. The unpacking algorithm has been implemented in the mother board's FPGA. In order to do this efficiently, it proved necessary to modify the BST format; a C program automatically constructs the new format and allows the user to configure the daughter boards' FPGAs as rapidly as with the MD4 format. It is worth underlining that only three cycles are lost at the beginning and at the end of the configuration of each window of the FPGA.
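The sketch below illustrates this window expansion on a simplified, hypothetical record layout; the real BST and MD4 formats are Atmel file formats whose exact encoding is not reproduced here.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical window record: a window carries only its beginning and ending
// addresses; the intermediate addresses are regenerated incrementally.
struct Window {
    uint32_t begin;                 // 24-bit configuration address of the first word
    uint32_t end;                   // address of the last word (inclusive)
    std::vector<uint16_t> data;     // one 16-bit word per address, stored in address order
};

// Expand windows into the (address, data) pairs expected by the FPGA's fast
// configuration interface (one pair per cycle, as with an MD4-style stream).
std::vector<std::pair<uint32_t, uint16_t>> unpack(const std::vector<Window>& windows) {
    std::vector<std::pair<uint32_t, uint16_t>> out;
    for (const Window& w : windows) {
        uint32_t addr = w.begin;
        for (uint16_t word : w.data)
            out.emplace_back(addr++, word);   // incremental address generation
        if (addr != w.end + 1)
            std::fprintf(stderr, "window size mismatch\n");
    }
    return out;
}

int main() {
    std::vector<Window> cfg = {{0x000100, 0x000103, {1, 2, 3, 4}}};
    for (auto [addr, data] : unpack(cfg))
        std::printf("addr 0x%06X <- 0x%04X\n", (unsigned)addr, (unsigned)data);
}
```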

7. Conclusion and future work

In this paper, a real-time reconfigurable architecture using industrial AT40K FPGAs has been presented. The ARDOISE architecture was designed to perform high-speed image processing treatments sequentially. However, ARDOISE provides the flexibility and hardware resources to meet requirements and specifications in digital signal processing, high-speed control, digital communications, etc. Using real design examples, the Sobel masks used in edge detection and the Deriche smoothing filter, we showed that high performance can be obtained when the DR capability of the AT40K is effectively exploited. The results presented in this paper indicate that a short reconfiguration delay makes the DR mechanism attractive for applications that require high performance. The design experiments and the issues related to the development of reusable modules made it possible to estimate the architecture performance. The dynamic approach is not supported by industrial design flows; a typical design environment, whose realization is in progress, has been presented. The ARDOISE design environment was used to simulate and verify the various algorithms that have been implemented. These capabilities will be used to measure the architecture performance accurately and to automate the design process for DRA. The functional simulator made it possible to estimate the computation time of a complete real-time image processing application including the Deriche, Nagao and Sobel filters for the smoothing stage, the computation of the derivatives, the detection of the local maxima of the derivative, edge closing and region labelling. The results show that the seven dynamic configuration steps can be executed within 37 ms, which makes real-time image processing feasible (less than 40 ms for a single frame). All the hardware prototype boards are now tested and operational.
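As a quick check of this real-time budget (assuming the 40 ms frame period corresponds to the standard 25 frames/s video rate), the seven configuration steps leave on average

    37 ms / 7 steps = 5.3 ms per step (reconfiguration plus processing),

with a margin of about 3 ms per frame.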


The real-time implementations described in this paper, Sobel's edge detector and the multi-resolution Deriche filter, both implemented on a single module board, demonstrate the potential of the ARDOISE platform to support the DR paradigm. As a perspective, the priority is to finish the development tool for debugging and monitoring a real dynamically reconfigurable multi-FPGA application. Future work will aim at further progress in algorithm partitioning and at improving the CAD environment: integrating the partial reconfiguration capability, automating the place and route stage of the design flow, making reliable performance measurements, integrating a partitioning strategy and methodology, etc.

Acknowledgements

We wish to acknowledge Atmel Corp. for software, hardware and technical support. R. Bourguiba, S. Pillement, M. Paindavoine, E.B. Bourenanne and S. Weber contributed to the original ideas that led to the architecture design of ARDOISE. E.B. Bourenanne, S. Weber and Y. Berviller contributed to the conception, realization and testing of the hardware prototype. We thank once again R. Bourguiba for the time he spent on the design of the various boards. Finally, we thank S.M. Karabernou for his numerous suggestions and for reading this article.

References

[1] Arnauld J, Buell D, Davis E. Splash II. In: Proceedings of the Fourth ACM Symposium on Parallel Algorithms and Architectures, San Diego, CA, 1992. p. 316–22.
[2] Vuillemin J, Bertin P, Roncin D, Shand M, Touati H, Bucard P. Programmable active memories: reconfigurable systems come of age. IEEE Transactions on VLSI Systems, March 1996.
[3] Sassatelli G, Torres L, Benoit P, Cambon G, Robert M, Galy J. Dynamically reconfigurable architectures for digital signal processing applications. In: SoC design methodology. Dordrecht: Kluwer Academic Publishers; 2002. p. 63–74.
[4] David R, Chillet D, Pillement S, Sentieys O. A dynamically reconfigurable architecture for low-power multimedia terminals. In: SoC design methodology. Dordrecht: Kluwer Academic Publishers; 2002. p. 51–62.
[5] AT40K FPGA with FreeRAM, data sheet. Atmel Inc., 1999.
[6] Xilinx. XC6200 FPGA family, data sheet. Xilinx Inc., 1995.
[7] Virtex data sheet. Xilinx Corporation, San Jose, CA, 2001.
[8] Bourguiba R, Demigny D, Kessal L. Dynamic configuration: a new paradigm applied to real time image analysis. In: Proceedings of the 10th International Conference on Microelectronics, IEEE Electron Device Society, Monastir, Tunisia, December 1998. p. 25–8.
[9] Demigny D, Kessal L, Bourguiba R, Boudouani N. How to use high speed reconfigurable FPGA for real time image processing? In: Proceedings of the International Conference on Computer Architecture for Machine Perception, IEEE Circuits and Systems, Padova, September 2000. p. 240–6.
[10] Kessal L, Demigny D, Boudouani N, Bourguiba R. Reconfigurable hardware for real time image processing. In: Proceedings of the International Conference on Image Processing, IEEE ICIP, vol. 3, Vancouver, September 2000. p. 159–73.
[11] Demigny D, Quesne JF, Devars J. Boundary closing with asynchronous cellular automata. In: Proceedings of the IEEE Conference on Computer Architecture for Machine Perception, IEEE Circuits and Systems, vol. 1, Paris, December 1991. p. 81–8.
[12] Lorca FG, Kessal L, Demigny D. Efficient ASIC and FPGA implementations of IIR filters for real time edge detection. In: Proceedings of the International Conference on Image Processing, Santa Barbara, October 1997. p. 406–9.
[13] Bertin P, Roncin D, Vuillemin J. Programmable active memories: a performance assessment. In: Meyer auf der Heide F, Monien B, Rosenberg AL, editors. Parallel architectures and their efficient use, Lecture Notes in Computer Science. Berlin: Springer; 1992. p. 119–30.
[14] Guermoud H. Architectures reconfigurables dynamiquement dédiées aux traitements en temps réel des signaux vidéo. Thesis, Faculty of Nancy I, France, 1997.
[15] Bourguiba R. Conception d'architectures matérielles reconfigurables dynamiquement dédiées au traitement d'images temps réel. Thesis, Faculty of Cergy Pontoise (Jury: P. Bertin, D. Demigny, L. Kessal, M. Paindavoine, R. Tourki, S. Weber), France, July 2000.
[16] Kessal L, Bourguiba R, Demigny D, Boudouani N. Reconfigurable hardware using high speed FPGA. In: International Conference on Very Large Scale Integration, IFIP VLSI-SOC'01, Montpellier, France, December 2001.
[17] Abel N, Boudouani N, Kessal L, Demigny D. Reconfiguration partielle sur l'architecture reconfigurable ARDOISE. In: Journées Francophones sur l'Adéquation Algorithme Architecture, Monastir, Tunisie, December 2002.