Real-Time Imaging 9 (2003) 297–313

Real-time image processing with dynamically reconfigurable architecture
L. Kessal, N. Abel, D. Demigny
ETIS, UMR 8051 CNRS/ENSEA, Cergy Pontoise University, France

Abstract
During the last few years, many architectures using processors and/or field programmable gate arrays (FPGAs) have been built to accelerate computationally complex problems. Processors allow a high degree of flexibility, whilst an FPGA implementation can be considerably faster. In spite of the possibility of reconfiguring a conventional FPGA an unlimited number of times, many of these architectures were built to compute a single application. If the FPGA is reconfigured several times to execute various algorithms, the configuration time grows and degrades the overall performance. In this paper, an architecture dedicated to real-time image processing using the AT40K reconfigurable FPGA family is presented (the ARDOISE project¹). We discuss Dynamic Reconfiguration (or Run-Time Reconfiguration), a technique based on the reuse of the same device (an FPGA configured on the fly) by scheduling the execution of the different algorithms that build an application. The techniques and the tools developed to test and use the system are described.
© 2003 Elsevier Ltd. All rights reserved.
¹ This work involves 10 French research labs and is supported by the French agency for education, research and technology.

1. Introduction
The growth of multimedia in every domain requires increasingly complex algorithms for coding/decoding and for processing huge data flows. Nowadays, two technologies are used to face these requirements: parallel processing and dedicated circuits. The first solution gives high flexibility, but increases size and power consumption. The second solution results in very fast specialized systems that cannot be upgraded or adapted to other purposes. For more than 10 years, configurable computing systems have demonstrated their efficiency at executing complex algorithms for several applications: convolution, morphology, image filtering, edge extraction and object recognition. The most common devices used for configurable computing are field programmable gate arrays (FPGAs). Over the last 10 years, FPGAs have demonstrated their potential for configurable computing and prototype systems in a wide range of applications. Most of the conventional FPGA technology has a highly fine-grained architecture. FPGAs are

composed of programmable logic arrays which are used as custom hardware that can be configured to implement the required processing. These hardware resources can be used to design custom computing systems highly adapted to specific applications. This approach offers the possibility of exploiting significant data-level parallelism to increase performance. To execute a new processing step on an SRAM-based FPGA, it is necessary to reconfigure the logic functionality and set the connectivity at the bit level. Because the bitstream files are large and the configuration interface is slow, the configuration time can be a significant overhead in configurable systems, which limits performance when an FPGA needs to be reconfigured "on the fly" (run-time reconfiguration, RTR). If the logic resources are targeted at a single processing task, the chip is configured only at start-up and remains unchanged until the application is finished. For example, multiple-FPGA machines such as Splash-2 [1] (each module contains 16 Xilinx XC4010) or the Programmable Active Memory DEC PeRLe-1 [2] (25 Xilinx XC3020) were used to achieve highly parallel processing rates on multiple data; the entire chips are configured once for the target application. One way to reduce the hardware complexity and increase the functional density of the reconfigurable resources is to share the same physical chip to perform

all of the application's stages. For most embedded systems, the ability to reuse hardware resources for several processing steps by reconfiguring the hardware during run-time execution of the application is a promising alternative. This approach, termed Dynamically Reconfigurable Architecture (DRA), can reduce the system's hardware complexity and increase its flexibility, since the most suitable algorithm can be selected at any time. For several years, DRAs have represented a rich area for industry and research laboratories. For the design of DRAs, two levels of reconfigurable element granularity have been used in order to exploit data parallelism:

* Coarse-grained configurable logic. In this case, the reconfigurable resources are organized in a two-dimensional array allowing significant interconnect flexibility. The hardware structure is based on both linear data and control paths (data-path structure). Coarse-grained architectures include multipliers, adders, DSP operators and pipelined registers. Because of their coarse granularity, hardware systems built according to this model require only a few bits to be configured (like DART [3] and the Systolic Ring [4]).

* Fine-grained configurable logic. Unlike coarse granularity, the fine granularity of logic elements offers better flexibility: the reconfigurable resources can be used to implement bit-level operations, data-paths and the basic arithmetic functions needed to build custom embedded systems highly adapted to the specific data of the target applications. Most conventional SRAM-based FPGAs fall into this category. The configuration bits must tell each gate and interconnection element how to behave; consequently, configuration stream sizes and times increase considerably, and this major drawback reduces their suitability for dynamically reconfigurable embedded systems. Some years ago, FPGA families such as the Atmel AT40K [5] and the Xilinx XC6200 [6] were developed to reduce configuration time and to allow partial reconfiguration. Moreover, their fine-grained reconfigurable logic enables custom computing according to the demands of the applications. By accepting a small decrease in performance, it is possible to design basic elements (data-paths, DSP functions, etc.) that advantageously emulate coarse-grained architectures.

In this study, Atmel AT40K FPGAs have been used to build a platform named ARDOISE, whose main goal is to experiment with and demonstrate the potential and performance of dynamically reconfigurable systems.

In the rest of this paper, we describe the ARDOISE architecture and show how it can be used to provide a real-time design swapping capability (hardware sharing). The organization of the paper is as follows. First, a brief review of the configurable computing technology introduced to improve performance on a wide range of applications is given (Section 2). Then the dynamically reconfigurable system, which can be configured entirely or partially, is presented (Section 3). An evaluation and a comparison of the performance of various architectures are discussed in Section 4. To illustrate the use of ARDOISE for real-time image processing, several examples of algorithm implementations are described (Section 5). Today, the ARDOISE platform is operational and is available in all participating laboratories. Section 6 is dedicated to the tools developed to make better use of the prototype and to help designers develop and debug applications. Finally, we summarize and discuss future research.

2. Dynamically reconfigurable hardware
Reconfigurable embedded systems using FPGAs represent an attractive alternative for applications with strong real-time constraints. The basic steps of typical image segmentation algorithms (see Fig. 1) can be classified as: smoothing filter, edge extractor, region labeling, etc. According to the complexity of the target application, one or several FPGA devices are used. In general, the system is statically configured: the entire chips are configured once for the target application. An important advantage of reconfigurable logic is the ability to reconfigure functionality in response to changing application data sets. However, most algorithms need external resources (such as FIFOs and memory buffers), so it becomes difficult to adapt the hardware so that one filter can be chosen rather than another. FPGAs have limited resources and offer lower performance than Application Specific Integrated Circuit (ASIC) solutions. With configuration clock rates and processing clock rates in permanent progress, FPGAs can supply both advantages with limited inconvenience. Commercial FPGAs such as the AT40K series are particularly suited to RTR because they can be rapidly reconfigured: these devices can be totally or partially configured in less than 1 ms and their processing speed is approximately 1/3 of the ASIC speed. This FPGA family has a bus which allows the device to be easily connected to microprocessors, so it is possible to rapidly configure an arbitrary subset of the device through a fast parallel programming interface. The ability to reconfigure a part of the chip during run-time presents an immediate advantage: the size of the configuration data is reduced. Dynamic reconfiguration (DR) offers the possibility of implementing, on the same FPGA and in a time-multiplexed fashion, a set of algorithmic steps (Fig. 2).

Fig. 1. Basic steps in image segmentation: frame grabber, smoothing filter (FPGA-1), edge extractor (FPGA-2), boundary closing etc. (FPGA-3), with FIFO/LIFO memory resources between the stages, and a frame grabber at the output.

Fig. 2. FPGA structure using dynamic and partial reconfiguration: frame grabber for the input pixels, a dynamically reconfigurable area fed with reconfiguration data (smoothing filter, edge detection, contour extraction, ...), memory resources (LIFO, FIFO, ...) and a frame grabber for the output of results.

3. ARDOISE architecture
As in the virtual memory case, the principle of ARDOISE consists in implementing a virtually large design on a small FPGA, as shown in Fig. 2. The realization of such a system requires commercially available FPGAs with: (1) sufficiently fast configuration times, (2) the possibility of reconfiguring subsections of the chip while it is still running, and (3) an important number of I/O pins to obtain the required data bandwidth. Nowadays, many FPGA devices offer a large number of I/O pins. By using fast reconfiguration of small portions of logic (by columns, as in the Virtex [7] series) and a good partitioning strategy, it is possible to design embedded systems that benefit from the partial reconfiguration mechanism. Because no commercial FPGA yet combines all of these characteristics, prototype development became necessary to investigate the hardware issues of DR. Ten French research teams launched the ARDOISE project to build a

dynamically reconfigurable platform dedicated to real-time image processing; its main goal is to help investigate the DR paradigm and its hardware implementation. The basic idea of ARDOISE is to swap the algorithms used in image segmentation on the same hardware structure, by reconfiguring a few devices several times during the processing of each image. This amounts to assigning the same hardware resource, an FPGA device, to the execution of a sequence of algorithms according to a defined schedule (virtual hardware). Using this concept and exploiting temporal parallelism, the final system allows the FPGA to run consistently at its peak rate. Updating an ASIC, by contrast, is longer and more expensive; an algorithm implemented in an ASIC could be replaced by an implementation on a DRA such as ARDOISE with the same computation time, provided the data parallelism is well exploited and the operators are pipelined. Furthermore, a DRA offers a level of flexibility formerly reserved for microprocessors. This architecture exploits the fast dynamic reconfiguration provided by recent FPGA chips. The AT40K family from Atmel, which is particularly suited to RTR, is used because it can be rapidly reconfigured: these devices, with a capacity of 45K equivalent gates, can be totally or partially reconfigured in less than 1 ms.

Fig. 3. Organization view of ARDOISE system (a) and details of one module (b): (a) host processor (ADSP 21061) or host PC (serial port), scheduler/global I/O mother module (FPGA + RAM + Flash + clock generator), configuration bus, global I/O bus, computing modules (1)-(3), Data IN/OUT; (b) one module: an AT40K FPGA connected to two 256k x 32 memories through address and data buses.

It is possible to rapidly reconfigure an arbitrary subset of the device through a fast parallel programming interface, because the configuration bus allows easy connection to microprocessors. The ARDOISE architecture [8-10] is based on three (or more) identical modules (Fig. 3a). Every module (Fig. 3b) includes one FPGA connected to two local memories used to save temporary intermediate results. The modules are interconnected by a 48-bit bus which can carry either addresses or data. The free buses on both sides are used for data input and output (for example, camera and screen). Another module, called the mother module, includes all the hardware for configuration storage in compressed form, scheduling of the configurations and clock generation. The mother module's FPGA is used to decompress the configuration files during the reconfiguration process and to manage the exchange of high-level results with the host processor. ARDOISE can be used in two modes: stand-alone mode and coprocessor mode, the second being used to accelerate algorithm computation. In stand-alone mode, the configuration data of every processing step to be interleaved are initially stored in the mother board's FLASH memory; after power up, the FPGA of the mother board is itself configured. In the second mode, the ARDOISE system is endowed with an external host processor (a SHARC 21061 DSP from Analog Devices). In this configuration the host processor can fill the memories (static memory or flash memory) with the data of the various configurations. The mother board tasks are summarized in the following points (a host-side sketch of this sequence is given after the list):

* it reconfigures the modules by loading the configuration data; this operation, equivalent to writing into a memory through the FPGA fast interface, is performed over the configuration bus;
* it manages the algorithm data by mapping DMA-like operators in the various modules;
* it reduces the configuration delay: the configuration data are stored in memory in compressed form, and the mother board performs decompression on the fly and sends the configuration data to the desired modules;
* it manages the configuration data dynamically by loading them selectively into the modules;
* it distributes the clocks to the modules. To benefit from the whole frame duration (40 ms), the algorithms swapped into the computing module run at various frequencies, greater than the frame data in/out rate; the mother module integrates a block which generates four clocks whose frequencies are programmable.

Some of these points are discussed in the tools section.
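For illustration only, the sketch below shows, in C++, the kind of control sequence the mother module applies within one frame period in stand-alone mode. Every name in it (Configuration, decompress, write_config_bus, select_clock, run_until_done) is hypothetical; the real controller is implemented in the mother module's FPGA and driven by the Flash contents, not by software like this.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical descriptor of one dynamically swapped processing step.
struct Configuration {
    const char* name;
    std::vector<uint8_t> compressed_bitstream;  // stored compressed in the mother board's Flash
    int clock_id;                               // one of the four programmable clocks
};

// Stand-ins for services implemented by the mother module's FPGA (stubs here).
std::vector<uint16_t> decompress(const std::vector<uint8_t>& in) {
    return std::vector<uint16_t>(in.begin(), in.end());     // placeholder expansion
}
void write_config_bus(int module, const std::vector<uint16_t>& words) {
    std::printf("  module %d: %zu configuration words written\n", module, words.size());
}
void select_clock(int module, int clock_id) {
    std::printf("  module %d: clock %d selected\n", module, clock_id);
}
void run_until_done(int /*module*/) { /* wait for the end-of-step signal */ }

// One frame period (40 ms): load and run each configuration in turn on the
// same computing module, i.e. the time-multiplexed 'virtual hardware' scheme.
void process_frame(int module, const std::vector<Configuration>& schedule) {
    for (const Configuration& cfg : schedule) {
        std::printf("loading %s\n", cfg.name);
        write_config_bus(module, decompress(cfg.compressed_bitstream)); // decompression on the fly
        select_clock(module, cfg.clock_id);                             // per-algorithm clock rate
        run_until_done(module);                                         // results stay in local memories
    }
}

int main() {
    std::vector<Configuration> schedule = {
        {"smoothing", std::vector<uint8_t>(64, 0), 0},
        {"gradient",  std::vector<uint8_t>(64, 0), 1},
    };
    process_frame(1, schedule);
}
```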

3.1. ARDOISE utilization and swapping algorithms in the FPGA
Figs. 4a and b show one of the multiple uses of ARDOISE. The two GTI modules are used as frame grabbers. During the computation of frame n, frame n + 1 is being input and stored in memory A, and frame n - 1 is read from memory C and output. Frame n is initially stored in memory B. The first configuration is loaded on the central FPGA (in gray) and the first algorithm is applied to the data previously stored in memory B (Fig. 4a). The FPGA is then configured again to compute the second algorithm (Fig. 4b); D and B are used respectively as input and output memories for this algorithm. For successive algorithms, the scheme is repeated. At the end of the frame duration, several algorithms have been computed on the central FPGA, and the roles of memories A and B (respectively C and D) are exchanged for the next frame computation.
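The buffer scheduling of Fig. 4 can be summarized by the toy model below; it only tracks which memory bank plays which role, and it assumes an odd number of swapped algorithms so that the final result of frame n ends up in the bank that GTI2 reads out during frame n + 1. The memory names follow Fig. 4; everything else is illustrative.

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <utility>

int main() {
    // Memory roles as in Fig. 4: GTI1 alternates between A and B for frame
    // grabbing, GTI2 alternates between C and D for frame output.
    std::array<std::string, 2> grab = {"A", "B"};
    std::array<std::string, 2> out  = {"C", "D"};
    const int algorithms_per_frame = 3;  // odd, so the last result lands in the output bank

    for (int frame = 0; frame < 2; ++frame) {
        std::string compute_in  = grab[(frame + 1) % 2]; // the frame grabbed one period earlier
        std::string compute_out = out[(frame + 1) % 2];  // read out by GTI2 one period later
        std::printf("frame %d: GTI1 writes %s, GTI2 reads %s\n",
                    frame, grab[frame % 2].c_str(), out[frame % 2].c_str());
        for (int a = 1; a <= algorithms_per_frame; ++a) {
            std::printf("  config %d on BC: input %s, output %s\n",
                        a, compute_in.c_str(), compute_out.c_str());
            std::swap(compute_in, compute_out);  // intermediate results ping-pong between banks
        }
    }
}
```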


Fig. 4. Scheme of algorithm execution with dynamic reconfiguration: (a) first configuration (ALGO 1), (b) second configuration (ALGO 2); the central FPGA (BC) is surrounded by the GTI1 and GTI2 modules and their 256k x 32 memories A, B, C and D.

Table 1
Performance measures of the algorithm implementations

Algorithms                     Number of   Pixels computed   Computation   Configuration
                               config.     in one cycle      time (ms)     time (ms)
Deriche's filter               2           2                 7.9           1.08
Sobel's gradient               1           2                 3.9           0.54
Contour closing                2           6                 2             1.08
Region labelling (1st stage)   1           1                 7.9           0.54
Region labelling (2nd stage)   1           2                 4             1.08
Total                          6                             25.7          4.32

The GTI1 and GTI2 modules allow the central computing module (BC) to be desynchronized from the video acquisition system, which often runs at a lower frequency than the maximum the FPGA can sustain. The different processing steps are swapped into the central module and computed at a high clock speed. Intermediate results are stored in the GTI local memories, with different memory models, during computing or reconfiguring. The computing module memories are used to store local data inherent to every algorithm: temporary data, data structures such as FIFOs, and various processing parameters. In addition, because most of the processing steps must deal with non-interlaced images, the GTI module can be used to handle the interlacing of the video data sampled from the acquisition system.

3.2. Example of algorithm implementation
The potential of ARDOISE has been tested on an image segmentation application. As in some digital signal processing chains, the image segmentation implementation includes basic stages: noise filtering, edge detection, contour closing and region labeling. The algorithms studied for noise smoothing are Deriche's filter, Sobel's convolution masks and Nagao's algorithm (see Section 5). For edge detection, Sobel's gradient masks are used. Finally, a solution using a cellular automata architecture [11] for contour closing and Rosenfeld's algorithm for region labeling was chosen. All

of these algorithms had been previously studied on static architectures. At each stage, ARDOISE's FPGA is reconfigured to perform a specific algorithm. The following measures are given for a throughput frequency of 35 MHz and an image resolution of 512 x 512 pixels. The complete application can be executed in less than 31 ms; Table 1 gives the details.

3.2.1. Deriche's noise filter
Through this illustrative example, the efficiency of the ARDOISE architecture, and the need for effective tools to rapidly simulate and synthesize the algorithm stages, are highlighted. The chosen algorithm is FGL's smoothing filter [12], a version of Deriche's filter. This operator has already been studied, optimized and implemented on various technologies (ASIC, DSP and FPGA). The filter has a single parameter (γ, 3 bits in length) which allows the impulse response to be adapted to the input image, for example to choose the resolution of the filter for smoother intensity transitions. Fig. 5 defines the computation model of the filter. The elementary cell is a low-pass filter; the z-transform and the recursion equation are given below:

$$H(z) = \frac{1-\gamma}{1-\gamma z^{-1}} \;\Rightarrow\; Y_n = X_n + \gamma\,(Y_{n-1} - X_n). \qquad (1)$$
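As a minimal software illustration of Eq. (1) (not the hardware implementation), the routine below applies the recursion to one line of pixels; it assumes, purely for illustration, that the 3-bit parameter encodes γ = k/8.

```cpp
#include <cstdint>
#include <vector>

// One causal pass of the FGL/Deriche first-order low-pass, Eq. (1):
// y[n] = x[n] + gamma * (y[n-1] - x[n]), with y[-1] taken as x[0].
std::vector<float> fgl_lowpass(const std::vector<uint8_t>& x, int k /* 3-bit parameter, 0..7 */) {
    const float gamma = static_cast<float>(k) / 8.0f;  // assumed encoding of the 3-bit parameter
    std::vector<float> y(x.size());
    float prev = x.empty() ? 0.0f : static_cast<float>(x[0]);
    for (std::size_t n = 0; n < x.size(); ++n) {
        prev = static_cast<float>(x[n]) + gamma * (prev - static_cast<float>(x[n]));
        y[n] = prev;
    }
    return y;
}
```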

Fig. 5. Deriche's smoothing filter, F. Garcia Lorca's version.

Fig. 6. 2D implementation of FGL's recursive filter: (a) horizontal stage, (b) vertical stage, each built from 2 x FGL filters and a line buffer (LIFO).

3.2.2. 2D image implementation
The Deriche noise-filter implementation developed for real-time image smoothing is presented below. FGL's algorithm, a single-pole recursive low-pass filter, was preferred as the basic stage of the realization. The causal and anti-causal parts of the 1D structure include two stages connected in cascade (Fig. 6a). The data samples of each line are first filtered with the two FGL filters from left to right; the pixels are then processed in the same manner, but with the filter moving from right to left. Finally, the results are stored horizontally. In a software emulation, these stages can be performed separately using an intermediate buffer. The recommended solution to extend the filtering to two dimensions is to insert an intermediate image memory between the row and column passes. During the vertical step (Fig. 6b), the previously calculated results are processed in the same manner as in the horizontal step; the input data are simply taken vertically from the image memory. A compact software emulation of this two-pass organization is sketched below.
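The following sketch is a purely software emulation of this two-stage organization, written under the simplifying assumptions of a single FGL cell per direction (the hardware of Fig. 6 cascades two cells per direction) and of a floating-point intermediate image in place of the line buffers.

```cpp
#include <cstddef>
#include <vector>

// Separable smoothing in the spirit of Fig. 6: a causal then an anti-causal
// first-order pass on every row, then the same two passes on every column,
// reading the vertical input from the horizontally filtered (intermediate) image.
using Image = std::vector<std::vector<float>>;  // [row][col], assumed rectangular

static void smooth_line(std::vector<float>& line, float gamma) {
    if (line.empty()) return;
    float prev = line.front();
    for (std::size_t n = 0; n < line.size(); ++n)        // left-to-right (causal)
        line[n] = prev = line[n] + gamma * (prev - line[n]);
    prev = line.back();
    for (std::size_t n = line.size(); n-- > 0; )         // right-to-left (anti-causal)
        line[n] = prev = line[n] + gamma * (prev - line[n]);
}

Image smooth_2d(Image img, float gamma) {
    for (auto& row : img) smooth_line(row, gamma);       // horizontal stage
    if (img.empty()) return img;
    const std::size_t w = img[0].size();
    std::vector<float> col(img.size());
    for (std::size_t x = 0; x < w; ++x) {                // vertical stage
        for (std::size_t y = 0; y < img.size(); ++y) col[y] = img[y][x];
        smooth_line(col, gamma);
        for (std::size_t y = 0; y < img.size(); ++y) img[y][x] = col[y];
    }
    return img;
}
```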

3.2.3. Implementation of Deriche's filter using ARDOISE
One of the main advantages of using the ARDOISE structure for real-time image processing is the ability to perform the processing independently of the data stream (the computing rate is desynchronized from the data input sampling). This means that several data samples can be processed in parallel. The 2D filtering has been decomposed into two configurations: during the first configuration the horizontal stage (Fig. 7a) is applied, and the computing unit is then reconfigured dynamically to implement the vertical stage (Fig. 7b). The computation is scheduled over two clock cycles. The left frame grabber, GTI1, is in input mode while the right one, GTI2, is in output mode. Several models of data organization in the memories can be implemented in the GTIs' reconfigurable logic in order to optimize the execution of each processing step. During video data input, pixels are stored in the memory controlled by GTI1 in such a way that two pixels of the same column, on two consecutive lines, are stored at the same address. The first cycle begins with the sampling of two pixels from GTI1's memory. The left-to-right filtering step is then applied to the two pixels in parallel and ends with the storage of the results in the intermediate buffer. During the same clock cycles, intermediate results of four pixels of the previous lines are extracted from the buffer to complete the horizontal filtering with the right-to-left step, so two lines are processed in parallel. The 1D processing results are stored in GTI2's memory after a data reorganization procedure. The vertical processing is carried out similarly during the following configuration. Including the configuration time, with a configuration clock rate of 30 MHz, the filtering of one image (512 x 512 pixels) is completed in less than 10 ms.
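As a rough consistency check (an estimate only, ignoring pipeline fill, blanking and the data reorganization steps), the order of magnitude of this figure can be recovered from the parameters given above:

$$t_{comput} \approx 2 \times \frac{512 \times 512}{2 \times 35 \times 10^{6}\ \text{Hz}} \approx 7.5\ \text{ms}, \qquad t_{config} \approx 1.1\ \text{ms},$$

i.e. roughly 8.6 ms for the two configurations of the 2D filter, of the same order as the 7.9 ms + 1.08 ms reported in Table 1 and below the quoted 10 ms bound.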

4. Performances and limits of DRA
The ARDOISE structure was defined to be very flexible in order to easily implement most of the algorithms encountered in real-time image processing. The purpose of the ARDOISE project is to elaborate development methodologies for DRAs. The partitioning strategy, the study of DRA performance and the management of the configurations are the important points that have been studied.

Fig. 7. Implementation of the Deriche's noise filter: (a) horizontal stage, (b) vertical stage; each stage uses pairs of 2 x FGL filters between the GTI1/GTI2 data storage and an intermediate buffer (LIFO structure) built from 32-bit registers R1-R4.

The immediate benefit of dynamically reconfigurable computing is silicon reduction. Certainly, the integration density of silicon is not the critical point today. However, in the case of a system on chip (SoC) combining processors, memories, DSP functions and, why not, a reconfigurable hardware area, it is important to reduce the size of the latter. Nevertheless, the flexibility of choosing the algorithms to be used in a simple manner is even more important: one can perform, in real time and with a hardware process, an image analysis whose result helps to choose the appropriate algorithm for the incoming image. This functionality makes it possible to exploit mechanisms of sequence breaking and decision taking. Because it is possible to adequately pipeline the design and exploit more parallelism per cycle, FPGAs can provide a computational power per unit of area higher than that of conventional processors and can thus complete more work per unit of time. In a target application dedicated to real-time image processing, because of the horizontal and vertical blanking time, the valid samples occupy only a proportion of the total frame duration. If the image size is N pixels, the data input frequency Fi is

$$F_i = \frac{N}{T} = \alpha F_s, \qquad (2)$$

where T is the frame duration and Fs the video sampling clock.

4.1. Hardware usability
The main objective of DR is to allow a system to react during run-time and choose the algorithm best suited to the data of the target application. Compared to a static solution, DR does not improve the execution speed of an application; however, it reduces the reconfigurable logic area in the FPGA and optimizes its usability. For static architectures, Bertin [13] suggested expressing the power Pus needed by an application as the product of the number of gates Gs to be used and the computing frequency (Eq. (3a)). In a real-time implementation without hardware buffering, the computing frequency necessarily depends on the video sampling clock Fs. By analogy with the application power, one can define an architecture power as the product of the number of equivalent gates and the maximum frequency Ft that the system can supply (Eq. (3b)):

$$P_{us} = G_s F_i, \qquad (3a)$$

$$P_{ms} = G_s F_t. \qquad (3b)$$

The quotient of Pus and Pms gives the usability of the hardware resources ηs:

$$\eta_s = \frac{P_{us}}{P_{ms}} = \frac{F_i}{F_t}. \qquad (4)$$


For example, image processing requires more and more computational power and data throughput, and real-time implementations (without hardware buffering) give disappointing figures (the arithmetic is detailed below):

* image size N = 512 x 512 pixels (α = 0.655, Fs = 10 MHz, Ft = 35 MHz) ⇒ ηs = 0.19.
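The 0.19 figure follows directly from Eqs. (2) and (4):

$$\eta_s = \frac{F_i}{F_t} = \frac{\alpha F_s}{F_t} = \frac{0.655 \times 10\ \text{MHz}}{35\ \text{MHz}} \approx 0.19 .$$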

4.2. Performances and silicon reduction
A static implementation which uses Gs gates is partitioned into C stages (or partitions) to be mapped onto a DRA. Within one unit of time T (the frame duration), the C processing steps are computed by time-multiplexing execution and reconfiguration steps: the system reconfigures the same configurable logic to implement the different partitions at successive times. Functional flexibility and data parallelism β are the main goals that drive the FPGA designs. In the rest of the article, let Gd denote the number of reconfigurable gates. To simplify the presentation, one supposes that the partitions are of equivalent size and use the same data parallelism β. The silicon area reduction is

$$R_{s \to d} = \frac{G_s}{\beta G_d} = \frac{C\,G_d}{\beta G_d} = \frac{C}{\beta}. \qquad (5)$$

Maximizing the data parallelism means trying to find the best implementation of a given task on the FPGA-based DRA. The computation time for one image is the product of the number of pixels and the delay required to evaluate one pixel (with Ti = 1/Fi and Tt = 1/Ft):

$$D_{comput} = \frac{T}{T_i}\cdot\frac{T_t}{\beta} = \frac{F_i\,T}{\beta F_t}. \qquad (6)$$

In addition, if Vc denotes the configuration speed of a given FPGA family, expressed in number of gates configured per second, the reconfiguration time is

$$D_{reconfig} = \frac{G_d}{V_c}. \qquad (7)$$

The duration Dstep needed to execute one configuration at frequency Ft, including the configuration overhead, is

$$D_{step} = D_{comput} + D_{reconfig} = \frac{F_i\,T}{\beta F_t} + \frac{G_d}{V_c}. \qquad (8)$$

The temporal-order constraint imposes that the sum of the computation times of all the partitions of the algorithm, plus the additional time required for reconfiguration, be less than or equal to the block acquisition duration T. These overhead times can be minimized separately to improve performance. One deduces, under these conditions, the maximum number of configurations C for a given FPGA family, according to the size of the partitions Gd and the data parallelism rate

β used for each implementation:

$$C\,D_{step} \le T \;\Rightarrow\; C\left(\frac{F_i\,T}{\beta F_t} + \frac{G_d}{V_c}\right) \le T \;\Rightarrow\; C = \left(\frac{F_i}{\beta F_t} + \frac{G_d}{V_c T}\right)^{-1}. \qquad (9)$$

The previous expression contains two terms: the first is the overall processing time and the second is the total time required to reconfigure the hardware. To study the impact they can have on the number of configurations C, consider real-time image processing at a resolution of N = 512 x 512 pixels. For a frame rate of 25 frames/s (frame duration T = 40 ms), the pixel sampling frequency is Fs = 10 MHz and Fi = N/T = 6.55 MHz; the rate of valid pixels is α ≈ 0.655. For illustrative purposes, the working frequency Ft and the configuration speed Vc, which depend heavily on the structure and technology of the FPGA used, are fixed. The AT40K family, for example, offers fast reconfiguration, with Vc = 50 x 10^6 gates/s. Because FPGA implementations of many algorithms used in video image processing have been studied, our experience suggests that a maximum frequency Ft = 35 MHz is a good choice. With these technology and application constants, Eq. (9) becomes

$$C \approx \frac{2\times 10^{6}\,\beta}{0.4\times 10^{6} + \beta G_d} \ \text{(Atmel AT40K family)}, \qquad C \approx \frac{2\times 10^{4}\,\beta}{0.4\times 10^{4} + \beta G_d} \ \text{(Xilinx 4000E family)}. \qquad (10)$$

Fig. 8 plots the relationship of Eq. (10). For a dynamically reconfigurable FPGA, the study of C as a function of Gd and β shows that the size and the configuration time of the physical device have a strong impact on the number of successive reconfigurations. For example, the configuration of Atmel AT40K devices is 100 times faster than that of the Xilinx 4000E family; consequently, the complexity of the application which can be implemented is 100 times greater. In this project the Atmel AT40K family, which allows faster reconfiguration, is used. In the first curve, two zones can be distinguished: in the first one (Gd << 10^5) the number of configurations is sufficient, so the use of DR is completely justified; on the contrary, in the second zone (Gd >> 10^5) the number of configurations decreases quickly. For a given FPGA technology, these curves can help to choose the size Gd of the device to be used for building DRAs. The second, complementary curve, drawn for the Atmel AT40K technology, shows the DR efficiency for a varying data parallelism rate. For a fixed size Gd, one notices that data parallelism is not systematically an advantage: beyond a certain value of β, the total configuration time becomes dominant compared with the computation time.
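As a concrete reading of Eq. (10), taking the device size quoted in Section 4.4 for the AT40K40 used in ARDOISE (Gd ≈ 5 x 10^4 gates) and a data parallelism rate β = 2,

$$C \approx \frac{2\times 10^{6}\times 2}{0.4\times 10^{6} + 2\times 5\times 10^{4}} = \frac{4\times 10^{6}}{5\times 10^{5}} = 8,$$

which is comfortably above the six configurations actually used for the segmentation chain of Table 1.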

Fig. 8. Impact of the FPGA technology on the efficiency of DR: (a) number of configurations C versus partition granularity Gd, comparing the Atmel AT40K and Xilinx XC4000E families for fixed values of β (1, 2, 4); (b) number of configurations versus the data parallelism rate β for fixed values of Gd (40 000, 100 000 and 750 000 gates).

4.3. Architecture power and efficiency

By considering Eq. (9), one can summarize the limit of the architecture power as follows:

$$P_u = G_d F_t \left(1 - C\,\frac{G_d}{V_c T}\right). \qquad (11)$$

As underlined above, the expression of this limit depends essentially on the technological performance of the FPGA device used: the configuration speed Vc and the maximal computing frequency Ft. For a fixed number of configurations, when Gd increases, the architecture power reaches a maximum:

$$P_{u\,max} = \frac{F_t\,G_d}{2} \quad\text{when}\quad G_d = \frac{V_c T}{2C}. \qquad (12)$$

The maximal complexity of the application which can be implemented on the DRA is then

$$G_{s\,max} = \sum_{p=1}^{C} \beta_p\,\frac{V_c T}{2C} \;\ge\; \frac{V_c T}{2}, \qquad (13)$$

where βp is the data parallelism rate at which configuration p is processed. The preceding expression is a function of Vc and T only. This result is very important: the algorithms used for image segmentation, described in Section 3.2, add up to less than 4 x 10^5 gates, which consolidates the choice of the AT40K technology to build ARDOISE.

4.4. How to choose the FPGA device
For a given application (Fi, T, Gs), the useful power is

$$P_{static} = G_s F_i \le P_{dynamic}. \qquad (14)$$

DR allows the best silicon reduction when the data parallelism rate is β = 1; under this condition the silicon gain increases proportionally with Ft. The silicon reduction can then be approximated by

$$\frac{G_s}{G_{d\,min}} \approx \frac{F_t}{F_i}. \qquad (15)$$

The performance obtained is better when the chosen device size Gd is less than 20% of this complexity limit. In this project, Atmel's AT40K40 device is used; it offers about 50K gates, i.e. a good ratio of about 6%.

4.5. Which reconfigurable device should be used to build DR systems?
DR is a concept which is not reserved only for FPGAs that offer a high configuration speed, such as the Xilinx XC6200 and Atmel AT40K series. These two FPGA families allow dynamic partial or complete reconfiguration during run-time through a fast configuration interface. The overall performance can be degraded by the reconfiguration time of the reconfigurable logic, but this does not mean that classical FPGAs cannot be used to design dynamic systems. In recent years, a great number of systems have been built by research teams integrating mechanisms aimed at reducing the impact of the configuration time: configuration data caching, bitstream compression techniques, masking of the configuration time, etc. The configuration-masking strategy was studied by Guermoud in his thesis [14]. In the following, the conclusions of a comparative study of performance are presented. Three typical DR realizations, using one or two configurable devices, are evaluated:
1. Without masking the reconfiguration time.
2. Masking the reconfiguration time: one device is reconfigured while the other one is used to compute.


3. Doubling the reconfiguration speed: the two devices are reconfigured at the same time and then compute together.
The three solutions use the same quantity of hardware resources: total number of gates, memory size and memory bandwidth. To simplify the expressions and the interpretation of the figures, one assumes that the same data parallelism rate is used for each algorithm.

4.5.1. Solution without masking the reconfiguration time
In this case, a single FPGA with a capacity of Gd equivalent gates is used. In order to execute several algorithms, a series of reconfiguration/computation steps is performed alternately.

4.5.2. Solution with masking the reconfiguration time
In this architecture (Fig. 9), multiple reconfiguration/computation steps are applied in alternation to each of the two FPGAs (each providing the equivalent of Gd/2 gates).

4.5.3. DR with double configuration speed
Here, the two FPGA devices are reconfigured simultaneously with different configuration data. After the reconfiguration step, each FPGA starts to execute its own processing. Fig. 10 shows the operation of this architecture; this mechanism doubles the configuration speed. By calculating the silicon reduction, the performance of the three architectures [15,16] can be estimated and compared. Table 2 summarizes the expressions of the silicon area gain for the various solutions, and Fig. 11 plots the three curves: the silicon reduction of the three solutions is drawn against the size normalized by Vc T, for a ratio Ft/Fe = 10. Two zones can be clearly distinguished, according to the importance of the configuration time. In the first zone, masking the configuration time is inadvisable, because the silicon reduction is lower than that of the classical solution. In the second zone, the masking solution is better, because the configuration time is the dominant term. DR is interesting only if the configuration time of the FPGA is around 10% of the block duration T; this feature depends on the technology of the FPGA used, and it is indeed the case for the AT40K device used here.

Fig. 9. Technique of masking the reconfiguration time: while FPGA-1 executes on the data flow, FPGA-2 is reconfigured (step (a)), and the roles are exchanged at the next step (step (b)).

Fig. 10. Technique of doubling the reconfiguration speed: (a) the two FPGAs are reconfigured simultaneously, (b) they then compute together on the data flow.

Table 2
Silicon area gain of three dynamic architecture examples

Architecture without masking configuration delays:   R1 = (Ft/Fe) (1 - Gs/(Vc T))
Architecture with masking configuration delays:      R2 = Ft/(2 Fe)
Architecture with doubling configuration speed:      R3 = (Ft/Fe) (1 - Gs/(2 Vc T))
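The boundary between the two zones mentioned above follows directly from the expressions in Table 2: R1 = R2 exactly when

$$\frac{F_t}{F_e}\left(1 - \frac{G_s}{V_c T}\right) = \frac{F_t}{2 F_e} \;\Longleftrightarrow\; \frac{G_s}{V_c T} = \frac{1}{2},$$

so masking the reconfiguration time only pays off when the total configuration time Gs/Vc exceeds half of the block duration T.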

Fig. 11. Silicon reduction of the three various solutions (R1, R2, R3 and the static architecture) as a function of Gs/(Vc T).

Thus, the technique of masking the configuration time becomes interesting only if the FPGA has a slow reconfiguration speed. The third technique, doubling the configuration speed, gives the best result. Furthermore, it is easier to configure two FPGAs in parallel than in alternation. This shows that the way to increase performance is technological: FPGA devices should be endowed with mechanisms allowing the reconfiguration time to be reduced (a better reconfiguration interface, configuration data caching, partial reconfiguration, etc.). In conclusion, for a given technology the parallel reconfiguration solution offers better performance, because it reduces the configuration time; the solution based on masking the configuration is consequently of limited interest. Moreover, because of the ping-pong management of the two FPGA devices, its implementation presents some difficulties (a more complex PCB).

5. Application example: edge detector implementations
The first step after the hardware architecture development was obviously testing (Fig. 12). Some processing chains, already studied, were selected and mapped onto ARDOISE. The study of these algorithms underlines the advantages of using DR and also answers the question: why is ARDOISE suitable for these applications? This section deals with edge detection algorithms.

5.1. Management of configurations using cache logic
As already explained, one way of using ARDOISE is to build IP libraries. The IPs can be considered as macro-instructions that are executed (i.e. loaded) one after the other on ARDOISE. The first IPs developed were the smoothing filters of Sobel, Deriche and Nagao. These algorithms are not presented here, because they have been widely studied and used by the scientific community for the same purpose: image smoothing.

Fig. 12. ARDOISE prototyping platform.

However, they do not have the same characteristics. For example, Sobel is computed very fast but does not correctly smooth very noisy pictures. Nagao gives better results for images with little noise but takes more time. Finally, Deriche's filter gives the best results because it can be parameterized according to the image's noise level; it can be used unconditionally when the computation time is not critical, but it takes at least twice as long as Nagao. The choice of algorithm has to achieve a good compromise between the global processing time and the required quality. The use of ARDOISE allows these three algorithms to be mapped onto a small portion of programmable logic. While the currently computed algorithm is mapped onto the FPGA, the two other algorithms are stored in cheaper media (SRAM, FLASH memory or a hard disk drive). Similarly to data, logic is thus kept in a cache hierarchy: when a piece of logic is stored in SRAM, it is ready to be loaded onto the device, whereas when it is stored on a hard disk drive, a long time is needed to transfer it from the disk to the SRAM and then configure the FPGA (Fig. 13). Today, the configuration data of the algorithms are stored in the FLASH memory. At start-up, one can decide which of these algorithms will take place in the SRAM (there is enough room for the three IPs and more) and then configure the computing device with the desired IP. All these operations are described with a high-level language: memory transfers and configurations are managed with a C library (a hedged sketch of the kind of interface involved is shown below).
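The sketch below only illustrates the cache-logic idea of Fig. 13; every function and container name in it is invented for the illustration, and the real C library, its bitstream format and its timing are those of the ARDOISE prototype and are not reproduced here.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

using Bitstream = std::vector<uint16_t>;

// Hypothetical storage hierarchy of Fig. 13: slow Flash (or disk) holds every
// IP, SRAM caches the ones likely to be needed, the FPGA holds the active one.
std::map<std::string, Bitstream> flash = {
    {"sobel", Bitstream(1000, 0)}, {"nagao", Bitstream(4000, 0)}, {"deriche", Bitstream(3000, 0)}};
std::map<std::string, Bitstream> sram;   // cached, ready to be loaded quickly
std::string active_ip;                   // what the FPGA is currently configured with

void prefetch_to_sram(const std::string& ip) {            // may run while the FPGA computes
    sram[ip] = flash.at(ip);
    std::printf("cached %s in SRAM (%zu words)\n", ip.c_str(), sram[ip].size());
}

void configure_fpga(const std::string& ip) {              // fast path: SRAM -> configuration bus
    if (!sram.count(ip)) prefetch_to_sram(ip);             // slow path if the IP was not cached
    active_ip = ip;
    std::printf("FPGA reconfigured with %s\n", ip.c_str());
}

int main() {
    prefetch_to_sram("sobel");        // decided at start-up
    configure_fpga("sobel");          // process frames with Sobel...
    prefetch_to_sram("nagao");        // ...while the next IP is brought into the cache
    configure_fpga("nagao");          // swap algorithms without touching the Flash again
}
```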

Fig. 13. Cache logic technique: the hard disk drive of the PC host stores configurations A-F, a subset is cached in the SRAM through the configuration manager, and the currently active configuration is loaded into the FPGA logic.

Of course, it is possible to store a configuration bitstream in the SRAM while the active configuration is being computed in the FPGA. For example, it is possible to transfer the configuration data of Nagao's filter from the Flash memory to the SRAM and, when this transfer is completed, to make the Nagao task active by configuring the FPGA directly from the SRAM. The design placed in the scheduler makes this transfer possible without interrupting the configuration task. For this example, the transfer time is less than the processing time of one image, so the system reacts very quickly to an order to change the configuration. The transfer time can become critical if the configuration data are not in the Flash memory but on a hard drive; the ARDOISE prototype is not well suited to this case, because transfers between the hard drive and the prototype are made over a serial port. Even then, the current process is not interrupted by the transfer, and the new configuration becomes active a few seconds after the order is given.

5.2. Dynamic reconfiguration
The first tests on ARDOISE used the architecture to compute the three smoothing algorithms of Deriche, Nagao and Sobel. The user chooses the algorithm best adapted to the processed picture; the algorithms are swapped at the push of a button. This underlines one of the benefits of dynamic configuration: these three algorithms are exclusive, since nobody wants to use Nagao and Deriche at the same time, yet the system may need each of them at different moments. Of course, an FPGA with 10 million gates would make it possible to map the three algorithms without dynamic configuration, but then the area where the two inactive algorithms are mapped is wasted, and the proportion of wasted logic grows with the number of stages and with the number of algorithms to be mapped per stage.

But this choice can be managed automatically. In this case, one of the configurations has to measure some chosen parameters, which are sent to the host processor. The processor is programmed at high level to choose which algorithm best suits the incoming data, and the FPGA is then configured with the corresponding IP. In this way, tasks are well shared: the hardware computes the low-level algorithms and those algorithms are managed in software. One can talk about dynamic configuration because the FPGA is configured twice during a single data treatment: the first configuration is the parameter estimation and the second is the smoothing algorithm. This high-level management has not been tested yet, but configuring the device twice sequentially for each incoming image has really been tested: the first configuration computes a smoothing filter and the second one computes the gradient's norm of the smoothed image. The time used by these two configurations and computations is less than 25% of the image duration, so it seems possible to map many different IPs onto the hardware. The next algorithms we will develop come from edge detection: thresholding, segmentation, etc. Some picture estimation algorithms will also allow the data-dependent scheduling of the IPs to be tested.

5.3. Partial configuration
This is the last point experimented with on ARDOISE. The FPGA used for this architecture supports fine-grain partial configuration, which means that it is possible to configure a single cell while the rest of the FPGA is working. The first experiment consists in developing a dynamically reconfigured application on a single ARDOISE board. This application is Sobel's smoothing filter (first configuration: dynamic1) followed by the gradient's norm computation (second configuration: dynamic2).


Fig. 14. Partial configuration.

These two processing steps were computed successively using dynamic configuration while a static treatment (image sampling) remained active (Fig. 14). This approach is not classic, and a new methodology was developed to place and route the designs [17]. The second experiment is more practical and concerns the Deriche smoothing filter. As mentioned, this algorithm is parameterized, and there are many ways to select the parameter. The parameter can be wired into the FPGA from the host processor, but this needs a special interface for each parameter, and the resulting interface becomes very large when there are many parameters. The classical way to overcome this problem is to store the parameters in internal registers; this solution needs a special interface to control the registers. When the number of parameters grows linearly, the number of data words stored grows linearly too, but the number of wires needed grows only logarithmically (1 address bit gives two possibilities, two give four, etc.). DR can solve this problem: to each value of the parameter corresponds a configuration. Consider the case of a single-bit parameter: one has to deal with only two different configurations. It is then possible to develop those two configurations independently (this method becomes unusable when the number of parameters grows) or to include the same register used previously in a unique design. The difference is that this register is now read-only and does not need any interface: its value is selected during configuration and can only be changed by another configuration of the FPGA. Partial configuration makes it possible to change this value very quickly because only the register zone has to be reconfigured. For example, 500 ns are needed to change Deriche's 3-bit parameter.

5.4. Sobel's implementation using one board
The ARDOISE architecture is oversized for computing Sobel's algorithm alone. For each incoming image, the two

configurations of Sobel should be the first steps executed on the central module of ARDOISE. The manufacture of the modules took several weeks; during this time, Sobel's algorithm was tested in real time on a single board. This implementation is the first dynamically reconfigured application really executed on ARDOISE, and it validates many concepts of this architecture.

6. Software tools for simulating and hardware debugging
DR is inadequately supported by commercial tools and by the FPGA manufacturers' tools. Previously, the tools used within the framework of this approach were based on the manual specification of various parameters, and the debugging of the prototype and the simulation of the applications relied on a set of tools written in VHDL. Testing DRAs calls for new CAD software and techniques which allow a high development flexibility.

6.1. Design flow
Previously the designs were coded in VHDL. Simulation and synthesis were performed using Synopsys tools, and the design was then placed and routed using the Atmel IDS place and route tool. During the simulation step, the designer uses binary image files as input; consequently, the data must be converted into an intermediate format (a text file) in order to simulate them within the VHDL simulation environment, and the inverse conversion must be applied to the results. As just seen, hardware/software debugging ends up being a heavy process because of the use of two different languages, C and VHDL (Fig. 15a). It is with the aim of simplifying this procedure that new CAD software has been developed. For more flexibility, a new simulation and debugging environment was written entirely in C++ with a SystemC approach (Fig. 15b). For that purpose, a hardware model of the ARDOISE architecture was written in SystemC.

Fig. 15. VHDL testbench environment (a) and novel approach of tools based on SystemC (b): (a) binary input/output images converted to and from text format by C programs around the ARDOISE VHDL simulator; (b) a C/C++/SystemC description and a generic module library feed simulation and testbenches (C++ compiler), FPGA synthesis (CoCentric compiler) down to RTL, and place and route (Figaro, Atmel) targeting the dynamically reconfigurable logic.

This environment was designed as a tool to help the designer simplify the development procedure of applications. SystemC is a C++-based, object-oriented, cycle-based simulation library intended primarily for hardware/software co-simulation. SystemC was launched by the Open SystemC Initiative in 1999 and includes all the language elements necessary to describe the hardware and software functionality of complex systems. The advantage of SystemC is the ability to use the same description language for both the synthesis and the simulation stages. The possibility of interfacing SystemC with GUI libraries (X-Windows or the WIN32 and system APIs) allows effective Windows GUI software to be created easily (Fig. 16). The ARDOISE simulator can be used to simulate both the hardware and the software components. The GUI environment enables the user to simulate and/or execute a processing step, edit memory contents, choose a test image, show the result, display waveforms, and create scripts for synthesis and place/route with a simple click (Fig. 17). To synthesize the SystemC hardware description into Register-Transfer-Level designs accepted directly by the Atmel IDS place and route tool, the Synopsys CoCentric compiler (Fig. 15b) is used. The most important objectives of the development environment are:
1. Easy verification of the designs (using images in their original coding).
2. Simulation steps faster than with a comparable HDL simulator.
3. Helping the user to evaluate a strategy's performance by using this environment as a software platform; indeed, DR is a new paradigm for which development methods and partitioning strategies have still to be created.

4. Testing in a sequential way the processing stages used in the DR steps.
5. Hardware/software co-emulation: it is possible to debug the processing stages and afterwards reconfigure the system through the host interface. A collection of pre-placed and routed circuits resides in the main memory of the host system, ready for rapid download onto the FPGA over the host interface.
6. Step-by-step execution of the configurations on the board and download of the contents of intermediate results (data memories).
7. An evaluation methodology to reduce the resource requirements in partially reconfigurable systems.
8. Measurement of the reconfigurable architecture performance with more precision and refinement.
9. A development environment for System on Chip (SoC), intellectual property (IP) and reusable hardware/software design in digital image applications.

6.2. Constraints on IP for DRA
The first point to remember is that the application programming is based on the parallel or sequential association of coarse-grain blocks, or IPs. The design flow is therefore top-down if all the IPs to be used already exist; if not, the new IPs can be designed with conventional tools. In the context of DRA, specific constraints must be added to the IP description and organization. All the memories used by an IP must be extracted into separate blocks; the IP core then just contains the operator flow used to produce one result of the given algorithm, and the inputs of the IP core are reduced to the data needed to compute this result. The global execution on the data packet can thus be seen as iterations (nested loops) of the one-result process. This allows intelligent data placement in memory and smart address generators for local and inter-configuration storage management to be defined.


Fig. 16. ARDOISE simulation and debugging environment.

Fig. 17. Simulator: designed to be easy to use.

It is also possible to compute the synchronization for the configuration manager. This method is also good practice for the use of IPs in standard SoC designs, where the memory model can change from one design to another (resource mapping, memory partitioning, etc.). For each IP, an extraction of the amount of resources needed (memory, computation), measures of the propagation time (with synthesis, place and

route on the Atmel FPGA) and measures of the memory bandwidths are made. Global partitioning [7] is then used to define the moments where reconfiguration steps take place (either inherent to the IP, or imposed by a lack of resources). Between two configuration steps, data parallelism can then be used to reduce the computation time, taking care of resource availability. It would be a good idea to build a specific tool which could automatically synthesize a data-parallel version of a given IP core.


After this first step, memory allocation is studied and the address generators are defined (specific tools for automatic VHDL and SystemC generation are under construction). The synchronization module of each configuration is elaborated and the global scheduler module of the application (which will be mapped onto the mother module's FPGA) is defined. A functional simulator is then used to verify the global application process on real data. After that, the back-end part of the design flow takes place. First, each individual FPGA configuration is produced using the conventional Atmel place and route tools. The address generators for the local memory accesses during a given configuration are placed in the central FPGA (in gray in Fig. 4), and the address generators for intermediate storage between configurations are placed in the GTI modules' FPGAs (Fig. 4). The configuration files are generated in a compressed form to reduce the size and the bandwidth of the mother module memory.

6.3. Functional simulator including reconfiguration management
In order to simulate a dynamically configured application, the results of the IPs must be saved after each simulated configuration, because these results may be the inputs of the next configuration. For this purpose, a model of the ARDOISE hardware was written in both VHDL and SystemC. These descriptions contain the three daughter boards' FPGAs and the six corresponding RAMs. Each RAM is associated with two ASCII files: the first represents the RAM before simulation and is read at the beginning of the simulation by the RAM module; the second represents the RAM at the end of the simulation and is filled by the RAM module when all computations are over. It is therefore possible to describe the three FPGA modules and to chain different simulations using previous results. The algorithm descriptions used for simulation can be directly synthesized and then placed and routed with the Atmel IDS software.

6.4. About daughter board configuration
The AT40K FPGA family offers a high-speed configuration interface: it is possible to set two configuration bytes in one 33 MHz cycle, by forcing a 24-bit address and a 16-bit data word onto the configuration interface of the FPGA [5]. The MD4 configuration file format is adapted to this task: each 16-bit data word is directly associated with its address. On the one hand, it is really easy to read such a file from a RAM and to send it to the high-speed configuration interface of the AT40K; on the other hand, the file is bigger than if it contained only useful data. The BST file format mostly contains data: in this format the data are arranged in windows. A window is delimited by its beginning and ending addresses, and all the intermediate addresses are generated by an incremental algorithm; the data just have to be stored in the same order as the addresses are computed. BST files are usually smaller than MD4 files because two addresses may be enough to configure many data words, but this format is not directly suitable for high-speed configuration because it has to be unpacked before the data are sent to the FPGA's high-speed interface. The unpacking algorithm has been implemented in the mother board's FPGA. In order to do this efficiently, it proved necessary to modify the BST format; a C program automatically constructs the new format and allows the user to configure the daughter boards' FPGAs as rapidly as with the MD4 format. It is worth underlining that only three cycles are lost at the beginning and at the end of the configuration of each window of the FPGA.
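The sketch below illustrates this window expansion on a simplified, hypothetical record layout; the real BST and MD4 formats are Atmel file formats whose exact encoding is not reproduced here.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical window record: a window carries only its beginning and ending
// addresses; the intermediate addresses are regenerated incrementally.
struct Window {
    uint32_t begin;                 // 24-bit configuration address of the first word
    uint32_t end;                   // address of the last word (inclusive)
    std::vector<uint16_t> data;     // one 16-bit word per address, stored in address order
};

// Expand windows into the (address, data) pairs expected by the FPGA's fast
// configuration interface (one pair per cycle, as with an MD4-style stream).
std::vector<std::pair<uint32_t, uint16_t>> unpack(const std::vector<Window>& windows) {
    std::vector<std::pair<uint32_t, uint16_t>> out;
    for (const Window& w : windows) {
        uint32_t addr = w.begin;
        for (uint16_t word : w.data)
            out.emplace_back(addr++, word);   // incremental address generation
        if (addr != w.end + 1)
            std::fprintf(stderr, "window size mismatch\n");
    }
    return out;
}

int main() {
    std::vector<Window> cfg = {{0x000100, 0x000103, {1, 2, 3, 4}}};
    for (auto [addr, data] : unpack(cfg))
        std::printf("addr 0x%06X <- 0x%04X\n", (unsigned)addr, (unsigned)data);
}
```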

7. Conclusion and future work

In this paper, a real-time reconfigurable architecture using industrial AT40K FPGAs has been presented. The ARDOISE architecture was designed to perform high-speed image processing treatments sequentially. However, ARDOISE provides the flexibility and hardware resources to meet requirements and specifications in digital signal processing, high-speed control, digital communications, etc. Using real design examples, the Sobel masks used in edge detection and the Deriche smoothing filter, we showed that high performance can be obtained when the DR capability of the AT40K is effectively exploited. The results presented in this paper indicate that a short reconfiguration delay makes the DR mechanism attractive for applications that require high performance. The design experiments and the issues related to the development of reusable modules made it possible to estimate the architecture performance. The dynamic approach is not supported by industrial design flows; a typical design environment, whose realization is in progress, has been presented. The ARDOISE design environment was used to simulate and verify the various algorithms that have been implemented. These capabilities will be used to measure the architecture performance accurately and to automate the design process for DRA. The functional simulator made it possible to estimate the computation time of a complete real-time image processing application including the Deriche, Nagao and Sobel filters for the smoothing stage, the computation of the derivatives, the detection of the local maxima of the derivative, edge closing and region labelling. The results show that the seven dynamic configuration steps can be executed within 37 ms, which makes real-time image processing feasible (less than 40 ms for a single frame). All the hardware prototype boards are now tested and operational.
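As a quick check of this real-time budget (assuming the 40 ms frame period corresponds to the standard 25 frames/s video rate), the seven configuration steps leave on average

    37 ms / 7 steps = 5.3 ms per step (reconfiguration plus processing),

with a margin of about 3 ms per frame.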


The real-time implementations described in this paper, Sobel's edge detector and the multi-resolution Deriche filter, both implemented on a single module board, demonstrate the potential of the ARDOISE platform to support the DR paradigm. As a perspective, the priority is to finish the development tool for debugging and monitoring a real dynamically reconfigurable multi-FPGA application. Future work will aim at further progress in algorithm partitioning and at improving the CAD environment: integrating the partial reconfiguration capability, automating the place and route stage of the design flow, making reliable performance measurements, integrating a partitioning strategy and methodology, etc.

Acknowledgements

We wish to acknowledge Atmel Corp. for software, hardware and technical support. R. Bourguiba, S. Pillement, M. Paindavoine, E.B. Bourenanne and S. Weber contributed to the original ideas that led to the architecture design of ARDOISE. E.B. Bourenanne, S. Weber and Y. Berviller contributed to the conception, realization and testing of the hardware prototype. We thank once again R. Bourguiba for the time he spent on the design of the various boards. Finally, we thank S.M. Karabernou for his numerous suggestions and for reading this article.

References

[1] Arnauld J, Buell D, Davis E. Splash II. In: Proceedings of the Fourth ACM Symposium on Parallel Algorithms and Architectures, San Diego, CA, 1992. p. 316–22.
[2] Vuillemin J, Bertin P, Roncin D, Shand M, Touati H, Bucard P. Programmable active memories: reconfigurable systems come of age. IEEE Transactions on VLSI Systems, March 1996.
[3] Sassatelli G, Torres L, Benoit P, Cambon G, Robert M, Galy J. Dynamically reconfigurable architectures for digital signal processing applications. In: SoC design methodology. Dordrecht: Kluwer Academic Publishers; 2002. p. 63–74.
[4] David R, Chillet D, Pillement S, Sentieys O. A dynamically reconfigurable architecture for low-power multimedia terminals. In: SoC design methodology. Dordrecht: Kluwer Academic Publishers; 2002. p. 51–62.
[5] AT40K FPGA with FreeRAM, data sheet. Atmel Inc., 1999.
[6] Xilinx. XC6200 FPGA family, data sheet. Xilinx Inc., 1995.
[7] Virtex data sheet. Xilinx Corporation, San Jose, CA, 2001.
[8] Bourguiba R, Demigny D, Kessal L. Dynamic configuration: a new paradigm applied to real time image analysis. In: Proceedings of the 10th International Conference on Microelectronics, IEEE Electron Device Society, Monastir, Tunisia, December 1998. p. 25–8.
[9] Demigny D, Kessal L, Bourguiba R, Boudouani N. How to use high speed reconfigurable FPGA for real time image processing? In: Proceedings of the International Conference on Computer Architecture for Machine Perception, IEEE Circuits and Systems, Padova, September 2000. p. 240–6.
[10] Kessal L, Demigny D, Boudouani N, Bourguiba R. Reconfigurable hardware for real time image processing. In: Proceedings of the International Conference on Image Processing, IEEE ICIP, vol. 3, Vancouver, September 2000. p. 159–73.
[11] Demigny D, Quesne JF, Devars J. Boundary closing with asynchronous cellular automata. In: Proceedings of the IEEE Conference on Computer Architecture for Machine Perception, IEEE Circuits and Systems, vol. 1, Paris, December 1991. p. 81–8.
[12] Lorca FG, Kessal L, Demigny D. Efficient ASIC and FPGA implementations of IIR filters for real time edge detection. In: Proceedings of the International Conference on Image Processing, Santa Barbara, October 1997. p. 406–9.
[13] Bertin P, Roncin D, Vuillemin J. Programmable active memories: a performance assessment. In: Meyer auf der Heide F, Monien B, Rosenberg AL, editors. Parallel architectures and their efficient use, Lecture Notes in Computer Science. Berlin: Springer; 1992. p. 119–30.
[14] Guermoud H. Architectures reconfigurables dynamiquement dédiées aux traitements en temps réel des signaux vidéo. Thesis, Faculty of Nancy I, France, 1997.
[15] Bourguiba R. Conception d'architectures matérielles reconfigurables dynamiquement dédiées au traitement d'images temps réel. Thesis, Faculty of Cergy Pontoise (Jury: P. Bertin, D. Demigny, L. Kessal, M. Paindavoine, R. Tourki, S. Weber), France, July 2000.
[16] Kessal L, Bourguiba R, Demigny D, Boudouani N. Reconfigurable hardware using high speed FPGA. In: International Conference on Very Large Scale Integration, IFIP VLSI-SOC'01, Montpellier, France, December 2001.
[17] Abel N, Boudouani N, Kessal L, Demigny D. Reconfiguration partielle sur l'architecture reconfigurable ARDOISE. In: Journées Francophones sur l'Adéquation Algorithme Architecture, Monastir, Tunisie, December 2002.