
Microprocessors and Microsystems 27 (2003) 115–130 www.elsevier.com/locate/micpro

Temporal partitioning methodology optimizing FPGA resources for dynamically reconfigurable embedded real-time system

C. Tanougast*, Y. Berviller, P. Brunet, S. Weber, H. Rabah

Laboratoire d'Instrumentation Electronique de Nancy, Université de Nancy 1, BP 239, Vandoeuvre-lès-Nancy 54506, France

Received 2 September 2002; revised 22 November 2002; accepted 27 November 2002

Abstract

In this paper we present a new temporal partitioning methodology for the data-path part of an algorithm, intended for reconfigurable embedded system design. This temporal partitioning is based on assessing trade-offs between the time constraint, the design size and the field programmable gate array device parameters (circuit speed, reconfiguration time). The originality of our method is that we use dynamic reconfiguration in order to minimize the number of cells needed to implement the data-path of an application under a time constraint. Our method consists, by taking the target technology into account, in evaluating the algorithm area and the operator execution times from the data flow graph. From this, we deduce the right number of reconfigurations and the algorithm partitioning for a run-time reconfiguration implementation. This method avoids an oversizing of the implementation resources needed. This optimizing approach can be useful for the design of an embedded device or system. Our approach is illustrated by various reconfigurable implementations of real time image processing data-paths. © 2003 Elsevier Science B.V. All rights reserved.

Keywords: Field programmable gate arrays; Embedded system; Temporal partitioning; Run-time reconfiguration; Dynamically reconfigurable systems; Image processing; Design implementation; High-level synthesis

1. Introduction

Since the introduction of field programmable gate arrays (FPGA), the process of digital system design has changed radically [1,2]. Indeed, FPGAs occupy an increasingly significant place in the realization of real time applications and have allowed the appearance of a new paradigm: hardware as flexible as software. Dynamically reconfigurable computing consists in the successive execution of a sequence of algorithms on the same device. The objective is to swap different algorithms on the same hardware structure, by reconfiguring the FPGA array in hardware several times within a constrained time and with a defined partitioning and scheduling [3,4]. Dynamic reconfiguration offers important benefits for the implementation of designs. Several architectures have been designed and have validated the dynamically reconfigurable computing concept for real time processing [5-9]. However,

* Corresponding author. Tel.: +33-383-68-41-59; fax: +33-383-68-41-53. E-mail addresses: [email protected] (C. Tanougast), [email protected] (Y. Berviller), [email protected] (P. Brunet), [email protected] (S. Weber), [email protected] (H. Rabah).

the optimal decomposition (partitioning) of an algorithm exploiting run-time reconfiguration (RTR) is a domain in which much work remains to be done. Existing work on temporal partitioning and logic synthesis exploiting dynamic reconfiguration generally focuses on the application development approach [10]. We observe, firstly, that the efficiency obtained is not always optimal with respect to the available spatio-temporal resources; secondly, that the choice of the number of partitions is never specified; and thirdly, that the efficiency can be improved by a judicious temporal partitioning [11]. We discuss here the partitioning problem for RTR. In the task of implementing an algorithm on reconfigurable hardware, we can distinguish two approaches [10]. The most common is what we call the application development approach and the other is what we call the system design approach. In the first case, we have to fit an algorithm with an optional time constraint into an existing system made of a host CPU connected to a reconfigurable logic array. In this case, the goal of an optimal implementation is to minimize one or more of the following criteria: processing time, memory bandwidth, number of reconfigurations and power consumption. In the second case, we have to implement an


algorithm with a required time constraint on a system throughout the design exploration phase. The design parameter is the size of the logic array that is used to implement the data-path part of the algorithm. Here an optimal implementation is the one that leads to the minimal area of the reconfigurable array. Previous advanced works in the field of temporal partitioning and synthesis for RTR architectures [12-19] focus on the application development approach targeting already designed reconfigurable architectures. These methodologies are used in the domain of existing reconfigurable accelerators or reconfigurable processors. All these approaches assume the existence of a resource constraint. The most important point here is that the number of reconfigurable resources is a predefined constant (implementation constraint). In this strategy, the associated tools capture the algorithm and the characteristics of the target platform on which the algorithm will be implemented. In this case, the goal is to minimize the processing time and/or the memory bandwidth requirement. Moreover, none of these approaches is yet capable of solving practical DRL (dynamically reconfigurable logic) synthesis problems. These techniques generally use simplified models of dynamically reconfigurable systems, which often ignore the impact of routing or reconfiguration resource sharing in order to reduce the complexity of the DRL design space search. Among them is the GARP project [12]. The goal of GARP is the hardware acceleration of loops in a C program by the use of the data-path synthesis tool GAMA [13] and the GARP reconfigurable processor. GARP is a processor tightly coupled to a custom FPGA-like array and designed specially to speed up the execution of general case loops. The logic array has a DMA feature and is tailored to implement 32 bit wide arithmetic and logic operations with the control logic, all of which allows the reconfiguration overhead to be minimized. GAMA is a fast mapping and placement tool for data-path implementation in FPGAs. It is based on a library of patterns for all possible data-path operators. The SPARCS project [14,15] is a CAD tool suite tailored for application development on multi-FPGA reconfigurable computing architectures. Such architectures need both spatial and temporal partitioning; a genetic algorithm is used to solve the spatial partitioning problem. The main cost function used here is the data memory bandwidth. Other works propose a strategy to automate the design process that considers all possible optimizations (partitioning and task scheduling) that can be carried out for a particular reconfigurable system [16,17]. Shirazi et al. and Luk et al. [18,19] propose both a model and a methodology to take advantage of common operators in successive partitions. A simple model for specifying, visualizing and developing designs that contain elements reconfigurable at run-time has been proposed. This judicious approach allows a reduction of the configuration time and thus of the application execution time. But additional logic resources (area) are

required to realize an implementation with this approach. Furthermore, this model does not include the timing aspects needed to satisfy real time constraints and it does not specify the partitioning of the implementation. Indeed, the algorithm partitioning must be known beforehand to determine the elements that do not need to be reconfigured for the next step. However, this concept is interesting for DRL design simulation, as in Dynamic circuit switching (DCS) [20]. This simulation uses virtual multiplexers, demultiplexers and switches, which are implemented to simulate the dynamically reconfigured design. These interesting works do not pursue the same goal as we do. Our priority is to find the minimal area that allows meeting the time constraint. This is different from searching for the minimal memory bandwidth or execution time that allows meeting a resource constraint. Here, we propose a temporal partitioning that uses the dynamic reconfiguration of FPGAs (also called DRL scheduling) to minimize the implementation logic area. Each partition corresponds to a temporal floorplan for DRL embedded systems (Fig. 1) [21]. We search for the minimal floorplan area that successively implements a particular algorithm. This approach improves the performance and efficiency of the design implementation. In contrast to previous work, our aim is to obtain, from an algorithm description, a target technology and implementation constraints, the characteristics of the platform to design or to use. This avoids an oversizing of the implementation resources. For example, by summarizing the sparse information found in some articles [22-24], we can assume the following. Suppose we have to implement a design requiring P equivalent gates and taking an area SFC of silicon in the case of a full custom ASIC design. Then we will need about 10 × SFC in the case of a standard cell ASIC approach and about 100 × SFC if we decide to use an FPGA. But the large advantage of the FPGA is, of course, its great flexibility and the speed of the associated design flow. This is probably the main reason to include an FPGA array on System on Chip platforms. Suppose that a design requires that 10% of the gates be implemented as full custom, 80% as standard cell ASIC and 10% in FPGA cells. By roughly estimating the areas, we come to the following results: the FPGA array will require more than 55% of the die area, the standard cell part more than 44% and the full custom part less than 1% (a short arithmetic check is given at the end of this section). In such a case it could make sense to try to reduce the equivalent gate count needed to implement the FPGA part of the application. This is interesting because the regularity of the FPGA part of the mask leads to a quite easy modularity of the platform with respect to this parameter. Embedded system design can take several advantages of the use of our approach based on RTR FPGAs. The most obvious is the possibility to frequently update the digital hardware functions. But we can also use the dynamic resource allocation feature to instantiate each operator


Fig. 1. Temporal partitioning with a minimized floorplan area.

only for the minimal required time. This permits enhancing the silicon efficiency by reducing the reconfigurable array's area [25,26]. Our goal is the definition of an RTR methodology, included in the architectural design flow, which allows minimizing the FPGA resources needed for the implementation of a time constrained algorithm. So, the challenge is twofold: firstly, to find trade-offs between flexibility and algorithm implementation efficiency through the programmable logic array; secondly, to obtain computer-aided design techniques for optimal synthesis which include dynamic reconfiguration in an implementation. The rest of this paper is organized as follows. Section 2 provides a formal definition of our partitioning problem. In Section 3 we present the partitioning strategy. In Section 4 we illustrate the application of our method with two image-processing algorithms. In the first example, we carry out the method in an automatic way, exactly as we describe it. In the second example, we apply our method while showing the possibility of further evolution; the aim is to point out possible improvements of the suggested method. In Sections 5 and 6 we discuss the approach compared to architectural synthesis, conclude and present future works.
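For the record, the die-area split quoted above can be recovered by simple arithmetic. The following sketch is ours, not the authors', and simply assumes the 1x / 10x / 100x area ratios stated earlier for full custom, standard cell and FPGA gates.

    # Rough die-area split for a design with 10% full-custom, 80% standard-cell
    # and 10% FPGA gates, assuming the 1x/10x/100x area ratios quoted above.
    P, a = 1.0, 1.0                      # normalized gate count and full-custom area per gate
    full_custom = 0.10 * P * a           # 10% of the gates at 1x area
    std_cell    = 0.80 * P * (10 * a)    # 80% of the gates at ~10x area
    fpga        = 0.10 * P * (100 * a)   # 10% of the gates at ~100x area
    die = full_custom + std_cell + fpga
    print(fpga / die, std_cell / die, full_custom / die)   # ~55%, ~44%, <1%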

2. Problem formulation

Currently, the partitioning for RTR is often made at the boundaries of algorithm operators. This means that, for example, the partitioning of an image processing algorithm is made at image operators like filters, edge detection operators, and so on. In contrast with this decomposition, we work on the global algorithm independently of the image operators. Our method is based on the elementary arithmetic and logic operators of the algorithm (adders, subtractors, multiplexers, registers, etc.). The analysis of the operators leads to a register transfer level (RTL) decomposition of the DFG. This partitioning is then independent of high-level operators (macro-operators). The RTR partitioning for a real time application can be classified as a spatio-temporal problem, that is, a time constrained problem with dynamic resource allocation, in contrast with the scheduling for RTR [27]. We then make the following formulations about the application. Firstly, the algorithm can be modeled as an acyclic data flow graph (DFG), denoted here by G(V, E), where the set of vertices V = {O_1, O_2, ..., O_m} corresponds to the arithmetic and logical operators and the set of directed edges E = {e_1, e_2, ..., e_p} represents the data dependencies between operations. Secondly, the application has a critical time constraint T. The problem to solve is the following: for a given FPGA family, we have to find the set {P_1, P_2, ..., P_n} of subgraphs of G such that

\bigcup_{i=1}^{n} P_i = G,    (1)

which allows the algorithm to be executed while meeting the time constraint T and the data dependencies modeled by E, and which requires the minimal amount of FPGA cells. The number of FPGA cells used, which is an approximation of the area of the array, is given by Eq. (2), where P_i is one among the n partitions:

\max_{i \in \{1, \ldots, n\}} Area(P_i).    (2)

The FPGA resources needed by a partition i are given by Eq. (3), where M_i is the number of elementary operators in partition P_i and Area(O_k) is the amount of resources needed by operator O_k:

Area(P_i) = \sum_{k \in \{1, \ldots, M_i\}} Area(O_k).    (3)
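To make the formulation concrete, here is a minimal sketch (ours, not part of the paper) of how the annotated DFG and Eqs. (2) and (3) can be encoded; the field names and helper functions are illustrative assumptions.

    # Hypothetical encoding of the annotated DFG of Section 2 (names are assumptions).
    from dataclasses import dataclass, field

    @dataclass
    class Operator:            # a vertex O_k of G(V, E)
        name: str
        area: int              # Area(O_k), in FPGA cells
        delay: float           # t_k, maximal path delay in ns

    @dataclass
    class Partition:           # a subgraph P_i of G
        ops: list = field(default_factory=list)

        def area(self) -> int:
            # Eq. (3): Area(P_i) is the sum of the areas of its operators.
            return sum(op.area for op in self.ops)

    def array_area(partitions: list) -> int:
        # Eq. (2): the FPGA array must be able to hold the largest partition.
        return max(p.area() for p in partitions)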

3. Temporal partitioning

A typical search sequence of temporal partitions includes the following steps:

1. Definition of the constraints and of the type of design (dynamically configurable platform with fixed resources, or reconfigurable embedded design): time constraint, data-block size, bandwidth bottleneck, memory size, power consumption and target technology.
2. DFG capture using design entry.
3. Determination of the temporal partitioning.
4. Generation of the configuration bitstream of each temporal floorplan for the final design.

Here, we are only interested in dynamically configurable embedded designs. In our case, the temporal partitioning methodology in the design flow is depicted in Fig. 2. Our method, which determines the minimal floorplan area, is structured in three parts. First, from the target technology library and the constraint parameters, we compute an approximation of the number of partitions; we then deduce the boundaries of each temporal floorplan; finally, we refine the final partitioning when it is possible. The constraint parameters are used to adjust the temporal partitioning solution until all design constraints are met. This method can be seen as a heuristic approach.

3.1. Number of partitions

In order to reduce the search domain, we first estimate the minimum number of partitions that we can make and the resources per partition. To do this, we use an operator library that is target dependent. This library allows the association of two attributes with each vertex of the graph G. These attributes are t_i and Area(O_i): respectively, the maximal path delay and the number of elementary FPGA cells needed for operator O_i. The symbol i represents the index of a particular operator in the DFG. These two quantities are functions of the size (number of bits) of the data to process. If we know the size of the initial data to process, it is easy to deduce the size at each node by a 'software execution' of the graph with the maximal value for the input data. Furthermore, we make the following assumptions: the data to process are grouped in blocks of N data; the number of operations to apply to each data item of a block is deterministic (i.e. not data dependent); there are pipeline registers between all nodes of the graph; and the reconfiguration time of the FPGA technology used can be approximated by a linear function of the area of the functional units being downloaded, which is realistic in practice. The configuration speed is a constant and a characteristic of the FPGA (see Section 4). Thus, the minimal operating time period t_omax is given by:

t_omax = \max_{i \in \{1, \ldots, m\}} (t_i)    (4)

where {1, ..., m} is the set of all operators of the data-path G and t_i is the execution time of operator O_i. The total number C of cells used by the application is given by:

C = \sum_{i \in \{1, \ldots, m\}} Area(O_i)    (5)

Fig. 2. General outline of the temporal partitioning in the design flow.

The main constraint is the need for real time processing. We suppose that the algorithm is partitioned in n steps corresponding to n execution-reconfiguration pairs. For a given FPGA family characterized by a reconfiguration speed V, the working frequency of each step must verify the following inequality:

\sum_{i=1}^{n} s_i + N \cdot \sum_{i=1}^{n} t_{e_i} \le T - \frac{1}{V} \cdot \sum_{i=1}^{n} Area(P_i)    (6)

where Area(P_i) is the number of FPGA logic cells needed by P_i, N is the number of data to process, n the number of reconfigurations (n ∈ ℕ), T is the upper limit of the processing time (time constraint), s_i corresponds to the prologue and epilogue of the pipeline in partition P_i, and t_{e_i} is the elementary processing time of one data item in the ith step. In one partition, the elementary processing time corresponds to the longest execution time among all operators present in the partition P_i. So, we have:

t_{e_i} = \max_{k \in \{1, \ldots, M_i\}} (t_k)    (7)

Using Eqs. (4)-(6), we obtain the minimum number of partitions n as given by Eq. (8) and the corresponding optimal size C_n (number of cells) of each partition by Eq. (9):

n = \mathrm{Round}\left( \frac{T}{(N + s) \cdot t_{omax} + \frac{C}{V}} \right)    (8)

where V is the configuration speed expressed in cells/s, C the number of cells required to implement the entire DFG, N the number of data in a block, s the total latency (in cycles) of the whole data-path and t_omax the propagation delay of the slowest operator in the DFG. This time is fixed by the maximum delay between two successive vertices of the graph G, thanks to the fully pipelined processing.

C_n = \frac{C}{n}    (9)

The first term of the denominator in Eq. (8) is the effective processing time of one data block and the second term is the time consumed to load the n configurations (total reconfiguration time of G). In most application domains, like image processing (see Section 4), we can neglect the impact of the latency in comparison with the processing time (N ≫ number of pipeline stages s). So, we can approximate Eq. (8) by Eq. (10):

n \approx \mathrm{Round}\left( \frac{T}{N \cdot t_{omax} + \frac{C}{V}} \right)    (10)


A limit in the use of the dynamic reconfiguration of FPGAs has been exhibited [28], because the impact of the reconfiguration time depends on the size of the data block to treat. Two conditions must be satisfied to allow dynamic configuration. First, the sample frequency of the data must be less than the maximum hardware computation frequency. Secondly, the data must be computed in blocks of significant size to reduce the reconfiguration overhead. In this case, a silicon reduction can be achieved, but the major advantage of dynamic reconfiguration is the possibility of changing the algorithms (which can be data dependent) in real time and of optimizing the implementation area of designs. From the analysis of the value of n, we can extract some information on the parameters characterizing the system needed to realize the implementation. We describe these different cases below:

1) n > 1: it is possible to realize an RTR partitioning. In this case, an optimization is possible and allows a reduction of the logic area of the implementation with the technology used; n corresponds to the number of partitions that we are sure to obtain.

2) n ≤ 1: in this case, an RTR partitioning cannot ensure the time constraint. Here, only a static implementation, with or without parallel processing, allows respecting the time constraint. If n = 1, a static implementation is sufficient to meet the constraints with the used technology. If n < 1, meeting the constraint requires a static implementation with processing parallelism; 1/n gives the degree of processing parallelism (in practice 1/n ∈ ℕ). In this case, it is necessary to know whether the target technology is suitable for the implementation of the application.

The pseudo-algorithm for the determination of n and C_n is given below. We annotate the DFG by using the operator library; we cover all the nodes of the DFG, accumulate the area and search for the maximal execution time among the operators of the DFG.

    V, N, T <- constants      // capture the constant parameters of the target and the constraints
    G <- DFG                  // DFG capture of the application
    C <- 0                    // total area variable
    TO <- 0                   // maximal operator time variable
    for each node Ni in G
        TO <- max(TO, Ni.ti)  // keep the current maximum of the execution times
        C <- C + Ni.Area      // add the area of the current node
    end for
    n <- Round(T / (N * TO + C / V))   // compute n (Eq. (10)); C/V is the total reconfiguration time
    Cn <- C / n
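As a numerical cross-check of Eq. (10), the following sketch (ours, not the authors' tool) reproduces the estimate used in Section 4 for the optical flow example; the constants are taken from Tables 5 and 6 and from Section 4.2.1.

    # Cross-check of Eq. (10) with the optical flow figures of Section 4 (our sketch).
    V = 1365e3            # configuration speed, cells per second (AT40K, Section 4.2.1)
    N = 512 * 512         # pixels per image, i.e. one data block
    T = 40e-3             # time constraint: one frame period, in seconds
    C = 863               # total cells of the optical flow DFG (Table 5)
    t_omax = 44.3e-9      # slowest operator delay (15-bit subtractor), in seconds

    ratio = T / (N * t_omax + C / V)   # about 3.27, the step number reported in Table 6
    Cn = C / ratio                     # about 264 cells per step, as in Table 6
    print(round(ratio, 2), round(Cn))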


3.2. Initial partitioning

A pseudo-algorithm of the partitioning scheme is given below; a runnable sketch of this covering is given at the end of this section.

    G <- data flow graph of the application
    P1, P2, ..., Pn <- empty partitions
    for i in {1...n}
        C <- 0
        while C < Cn
            append(Pi, First_Leave(G))
            C <- C + First_Leave(G).Area
            remove(G, First_Leave(G))
        end while
    end for

We cover the graph again from the leaves to the root(s), accumulating the sizes of the covered nodes until the sum is as close as possible to C_n. These covered vertices make the first partition. We remove the corresponding nodes from the graph (command 'remove(G, First_Leave(G))'). The goal of this function is to avoid considering the same node twice and to ensure the data dependencies. Then, we iterate the covering until the remaining graph is empty. The partitioning is then finished. There is a great degree of freedom in the implementation of the First_Leave() function, because there are usually many leaves in a DFG. The only strong constraint is that the choice must be made so as to guarantee the data dependencies across the whole partition. The choice of a node among the leaves of the DFG can be carried out in several ways. Since the leaves are terminal nodes, they meet the data dependency constraints. So, the reading of the leaves of the DFG can be random or ordered, for example. In our case it is ordered: we consider a two-dimensional table containing the parameters relating to the operators of the DFG, and First_Leave() follows the reading order of the table containing the operator arguments of the DFG (left to right) (Fig. 3). In our case, the pseudo-algorithm of the First_Leave() function is given below.

    M[][] <- G                 // DFG capture
    for M[i][j] in M
        if M[i][j] != 0
            First_Leave(G).area <- M[i][j].area
            First_Leave(G).tk <- M[i][j].tk
        end if
    end for

3.3. Refinement after implementation

After the placement and routing of each partition obtained in the initial phase, we are able to compute the exact processing time. It is also possible to take into account the value of the synthesized frequency, which is close to the maximal processing frequency of each partition.

Fig. 3. Labeling and reading of the DFG.

The analysis of the gap between the total processing time (configuration and execution) and the time constraint allows a decision to be made about the partitioning. If it is necessary to reduce the number of partitions, or possible to increase it, we return to the step described in Section 3.2 with a new value for n. Otherwise the partitioning is considered final (see Fig. 2). In this case, if some time remains after the execution of all partitions, but not enough to add one more, we consume this margin by reducing the working frequency in one or more partitions. Thus, we limit the power consumption of the application. This heuristic partitioning is the best that we can obtain with this straightforward method. However, we do not have enough criteria to compare our heuristic approach with existing methods based, for example, on mathematical resolution. For this reason we cannot conclude that our method always leads to an optimal solution.
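For illustration, here is the runnable sketch of the greedy covering of Section 3.2 announced above. It is our own simplification, not the authors' tool: leaves are consumed in the ordered, left-to-right reading of the operator table, and a partition is closed as soon as its accumulated area reaches C_n.

    # Simplified greedy covering of the DFG (assumed data layout: a dict of nodes with
    # their area and their list of predecessors, plus an ordered reading list of names).
    def first_leave(graph, consumed, order):
        # Next node, in table reading order, whose predecessors are all already consumed:
        # such a node is a 'leaf' of the remaining graph and respects the data dependencies.
        for node in order:
            if node not in consumed and all(p in consumed for p in graph[node]["preds"]):
                return node
        return None

    def temporal_partition(graph, order, cn):
        partitions, consumed = [], set()
        current, area = [], 0
        while len(consumed) < len(graph):
            node = first_leave(graph, consumed, order)
            current.append(node)
            consumed.add(node)
            area += graph[node]["area"]
            if area >= cn:            # close the partition near the target size Cn
                partitions.append(current)
                current, area = [], 0
        if current:
            partitions.append(current)
        return partitions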

4. Applications to image processing

In this section, we illustrate and model the parameters needed to apply our method in the field of image processing. This application area is a good choice for our approach because the processing is generally characterized by regular operators (data-paths) and the data are naturally organized in blocks (the images). Large blocks of data to process lead to a reduced overhead of the reconfiguration time; otherwise, the time dedicated to the processing would be mainly consumed by reconfigurations. Moreover, there are many low level processing algorithms that can be modeled by a data flow graph, and the time constraint is usually the image acquisition period. We assume that the images are taken at a rate of 25 per second with a spatial resolution of 512 × 512 pixels and that each pixel gray level is an eight bit value. Then, we have a time constraint of 40 ms per image to ensure real time processing.

4.1. Algorithms

We illustrate our method with two image processing algorithms. The first algorithm used here is a 3 × 3 median


filter followed by an edge detector; its general view is given in Fig. 4. In this example, we consider a separable median filter [29] and a Sobel operator. The median filter provides the median value of three vertically successive horizontal median values. Each horizontal median value is simply the median value of three successive pixels in a line. This filter allows the elimination of impulse noise while preserving the edge quality. The principle of the implementation is to sort the pixels of the 3 by 3 neighborhood by their gray level value and then to use only the median value (the one in 5th position out of 9). This operator is constituted of eight bit comparators and multiplexers. A Sobel operator achieves the gradient computation. This corresponds to a convolution of the image by the successive application of two monodimensional filters. These filters are the vertical and the horizontal Sobel operators, respectively. The final gradient value of the central pixel is the maximum absolute value of the vertical and horizontal gradients. The line delays are made with components external to the FPGA (Fig. 4).

Fig. 4. General view of the images edge detector.

The second algorithm is an estimator of the normal optical flow and is frequently used in an apparent motion estimator in a Log-Polar image sequence [30]. This normal optical flow estimator algorithm is composed of Gaussian and averaging filters, followed by temporal and spatial derivatives and an arithmetic divider. The general view of this data-path algorithm is given in Fig. 5.

4.2. Temporal partitioning

4.2.1. Reconfiguration speed evaluation

The FPGA family targeted in these examples is the Atmel AT40K family of dynamically reconfigurable FPGAs. These devices can be totally configured in less than 1 ms. The reconfiguration speed is expressed in area per unit time. In practice it is a proportionality constant between the configuration time and the number of used logic cells in the FPGA. We express the reconfiguration speed by the following equation:

V = \frac{C_{max}}{T_G}    (11)

C_max is the number of logic cells in the FPGA and T_G is the time for a full reconfiguration of the FPGA. Each configuration time depends on the quantity of logic cells used for each step. We express the area in terms of logic cells of the used FPGA. In our case, the AT40K20's capacity of 819 cells leads to a total reconfiguration time lower than 0.6 ms at 33 MHz with 8 bits of configuration data [31] (see Section 4.3). These FPGAs have a configuration speed of about 1365 cells per millisecond and have a partial reconfiguration mode (called 'mode 4').

4.2.2. Dataflow graph annotation

The analysis of the FPGA datasheet allows us to obtain the characteristics given in Table 1 for some operator types processing data of D bits [31]. In this table, Tcell is the propagation delay of one cell, Trout is the intra-operator routing delay and Tsetup is the flip-flop setup time. The same considerations apply to other dynamically reconfigurable device technologies or systems. From the characteristics given in the datasheet [31], we obtain the values in Table 2 as a first estimation of the execution time of usual elementary operators.
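As a quick sanity check of Eq. (11) with the figures quoted above (our arithmetic, not the paper's):

    C_max = 819        # logic cells in the AT40K20
    T_G = 0.6e-3       # upper bound on the full reconfiguration time, in seconds
    V = C_max / T_G    # Eq. (11): 1.365e6 cells/s, i.e. about 1365 cells per millisecond
    print(V)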

Fig. 5. General view of the normal optical flow estimator.


Table 1
Usual operator characterization (AT40K)

D-bit operator | Number of cells | Estimated execution time
Multiplication or division by 2^k | 0 | 0
Adder or subtractor | D + 1 | D(Tcell + Trout) + Tsetup
Multiplexer | D | Tcell + Tsetup
Comparator | 2D | (2D - 1)Tcell + 2Trout + Tsetup
Absolute value (two's complement) | D - 1 | D(Tcell + Trout) + Tsetup
Additional synchronization register | D | Tcell + Tsetup

Table 2
Estimated execution time of some eight bit operators in AT40K technology

Eight bit operator | Estimated execution time (ns)
Comparator | 27.34
Multiplexer | 5
Absolute value | 22.07
Adder, subtractor | 16.46
Combinatory logic with inter-propagation logic cell | 17
Combinatory logic without inter-propagation logic cell | 5

Fig. 6. Estimated time versus real execution time of some operators in AT40K technology.

In practice, there is a linear relationship between the estimated individual execution time and the real execution time. Indeed, when an operator is included in a daisy chain of operators, a routing time is needed between two successive nodes. Fig. 6 shows a plot of the estimated individual execution time versus the real execution time for several usual low-level operators. Those operators have been implemented individually in the FPGA array between registers; we then measured the obtained real execution times and rounded them to the nearest integer (Fig. 6). From this observation, the estimated individual execution time allows obtaining an approximation of the real execution times of the operators contained in the data-path. With the FPGA technology used, this coefficient has a value of about 1.5 (see Fig. 6). The results are more exact when the algorithm is regular, such as a data-path consisting of a simple cascade of operators. The evaluation of the routing in the general case is a more difficult task. The execution time after implementation of a regular graph does not depend on the type of operator. A weighting coefficient binds the real execution time to the estimated one. This coefficient approximates the routing delay between operators based on the estimated execution time. So, these routing delays are a constant portion of the duration of each step. This is realistic and easy to exploit analytically. With these estimations and by taking into account the increase of data size caused by processing, we can annotate

the dataflow graph. Then, we can deduce the number and the characteristics of all the operators. Table 3 gives these data for the first algorithm example. In this table, the execution time is an estimation of the real execution time. From these data, we deduce the number of partitions needed to implement a dedicated data-path in an optimized way. Thus, for the edge detector, among all the operators of the data-path, the slowest operator is an eight-bit comparator and we have to reconfigure 467 cells. Using Eq. (10), we obtain a value of three for n. The size of each partition (C_n) should then be about 156 cells. Table 4 summarizes the estimation for an RTR implementation of the algorithm. For the data-path part of the second algorithm (normal optical flow), we can again estimate the logic resources required for the implementation (see Table 5). Among all the operators of this data-path, the slowest operator is a fifteen bit subtractor and we have to reconfigure 863 cells. Using Eq. (10), we obtain a value greater than three for n. Here, it is possible to implement this global data-path with about 264 cells in each partition (C_n). Table 6 summarizes the estimation for an RTR implementation of the algorithm. In practice, the partitioning of the data-path must also take into account the memory bandwidth bottleneck. That is why the best practical partitioning needs to keep the data throughput in accordance with the performance of the memory used.
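To illustrate the use of the routing coefficient of about 1.5, the following sketch (ours) recovers the annotated execution times of Table 3 from the individual estimates of Table 2:

    # Estimated individual times (Table 2, eight bit operators, in ns) scaled by the
    # empirical AT40K routing coefficient of about 1.5 observed in Fig. 6.
    ROUTING_COEFF = 1.5
    estimated_ns = {
        "comparator": 27.34,        # 27.34 * 1.5 = 41 ns, the value used in Table 3
        "multiplexer": 5.0,         # -> 8 ns, as in Table 3
        "absolute value": 22.07,    # -> 33 ns; Table 3 lists 34 ns for the 11-bit operator
        "adder/subtractor": 16.46,  # -> about 25 ns, as in Table 3
    }
    for name, t in estimated_ns.items():
        print(name, round(t * ROUTING_COEFF))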

Table 3
Number and characteristics of the operators of the edge detector (on AT40K)

Operator | Quantity | Size (bits) | Area (cells) | Execution time (ns)
Comparator | 7 | 8 | 16 | 41
Multiplexer | 9 | 8 | 8 | 8
Absolute value | 2 | 11 | 10 | 34
Subtractor | 1 | 8 | 9 | 25
Subtractor | 1 | 10 | 11 | 30.5
Adder | 1 | 8 | 9 | 25
Adder | 2 | 9 | 10 | 27.5
Adder | 1 | 10 | 11 | 30.5
Multiplication by 2 | 2 | 8 | 0 | Routing
Division by 4 | 2 | 9-11 | 0 | Routing
Register (pipeline or delay) | 13 | 8 | 8 | 8
Register (pipeline or delay) | 4 | 9 | 9 | 8
Register (pipeline or delay) | 5 | 10 | 10 | 8
Register (pipeline or delay) | 1 | 11 | 11 | 8

4.2.3. Optimizing memory bandwidth

The first aim of our methodology is to create partitions with areas as homogeneous as possible. While doing so, the designer must also avoid exceeding technology constraints such as the memory bandwidth. To help the designer, the tool provides the data size requirement for each level of the DFG. In this way, we are able to see where the partitioning might produce a failure or an optimal solution. Moreover, in order to automate this refinement, the program can defer or anticipate the partitioning of the DFG so as to make it where the memory bandwidth requirement is minimal. The search is restricted to a neighborhood of the theoretical partitioning points. This neighborhood is adjustable to keep the compromise between homogeneous areas and memory bandwidth minimization (Fig. 7). The size of the memory bandwidth (data size and real working frequency) along the DFG is known between each line of operators. When we create the DFG, the edges receive a size argument that represents the data size of the bus between two nodes (operators). So, when we add the size arguments of all the edges crossing a line of nodes, we obtain the total data size in use at this point of the DFG. By applying the method described in Section 3, we obtain a first partitioning for our two examples. Fig. 8 shows the boundaries of each partition for the edge detector. Fig. 9 represents the first partitioning of the normal optical flow estimator.

Table 4
Resources estimation for the image edge detector

Total area (cells) | Operator execution time (ns) | Step number (n) | Area by step (cells) | Reconfiguration time by step (μs)
467 | 41 | 3 | 156 | 114

¹ This project involves ten French research labs, including our laboratory, and is supported by the French agency for education, research and technology.

4.3. Implementation results

4.3.1. Hardware platform

In order to illustrate our method, we tested this partitioning methodology on the ARDOISE¹ architecture [6,28]. This platform is constituted of three identical modules: two frame grabber modules and a dynamic calculator module. Each Ardoise module is constituted of one FPGA and two 1 MB SRAM memory banks used as scratch memory. Here, we only use one Ardoise module (Fig. 10). It is made of a dynamically configurable AT40K20 FPGA and contains two local memories. This FPGA can be configured in less than 1 ms and includes an equivalent of 20 K gates. Each local memory is connected to the FPGA with 32 data bits. The configuration can be made from a Flash memory and through a 'controller' FPGA. This FPGA module (computing unit) is dedicated to performing intensive computing. The temporal partitions will be implemented in this FPGA. Partial results will be stored

in the local memories while computing and reconfiguring. For this purpose, the computing unit receives the input pixels from the image acquisition system, processes them and writes them into one of its local memories. Next, after a new configuration, it reads and processes the data from the last memory and writes the results into the other local memory. The duration of this procedure is the duration of the current image processing, which should not exceed the frame period. Our method is not aimed at targeting such architectures with a resource constraint. Nevertheless, the results obtained in terms of used resources and working frequency remain valid for any AT40K-like array.

4.3.2. Implementation results

Table 7 summarizes the implementation results of the edge detector algorithm. We notice that a dynamic execution in three steps can be achieved in real time. This is in accordance with our estimation (Table 4). We can note that a fourth partition is not feasible, because the allowed maximal operator execution time would be less than 34 ns. If we analyze the remaining time we find that one more partition does not allow the real time processing. The maximal number of cells per partition allows determining the functional density gain factor obtained by the RTR

Table 5
Number and characteristics of the operators of the normal optical flow estimator (on AT40K)

Operator | Quantity | Size (bits) | Area (cells) | Execution time (ns)
Multiplexer | 7 | 15 | 15 | 8
Absolute value | 2 | 9 | 8 | 30
Subtractor | 2 | 9 | 10 | 27.5
Subtractor | 8 | 15 | 16 | 44.3
Adder | 2 | 9 | 10 | 27.5
Adder | 1 | 10 | 11 | 30.5
Adder | 2 | 11 | 12 | 33.5
Multiplication by 2 | 3 | 8 | 0 | Routing
Multiplication by 128 | 1 | 8 | 0 | Routing
Division by 2 | 7 | 9-15 | 0 | Routing
Division by 8 | 1 | 12 | 0 | Routing
Register (pipeline or delay) | 57 | 1 | 1 | 8
Register (pipeline or delay) | 16 | 8 | 8 | 8
Register (pipeline or delay) | 6 | 9 | 9 | 8
Register (pipeline or delay) | 2 | 10 | 10 | 8
Register (pipeline or delay) | 2 | 11 | 11 | 8
Register (pipeline or delay) | 2 | 12 | 12 | 8
Register (pipeline or delay) | 2 | 13 | 13 | 8
Register (pipeline or delay) | 2 | 14 | 14 | 8
Register (pipeline or delay) | 12 | 15 | 15 | 8


Table 6
Resources estimation for the normal optical flow estimator

Total area (cells) | Operator execution time (ns) | Step number (n) | Area by step (cells) | Reconfiguration time by step (μs)
863 | 44.3 | 3.27 | 264 | 200

implementation [25,26]. In this example, the gain factor in terms of functional density is approximately 3. Fig. 8 represents each partition successively implemented in the reconfigurable array of FPGA resources for the edge detector. Table 8 summarizes the implementation results of the normal optical flow estimator. Here, we obtain a gain in functional density of about 2.44 with respect to the global implementation of this data-path (static implementation) for real time processing. We also note that a static implementation of this algorithm is not possible with our Ardoise module: the available logic area is insufficient (only 819 cells). So, this approach can also be used on a platform whose fixed-size FPGA resources are insufficient. Fig. 9 represents each partition implemented successively in the reconfigurable array for the normal optical flow estimator. Here it is possible to refine the partitioning of this algorithm (see Fig. 2). We can note that a fourth partition is feasible because the remaining time after processing, in comparison with the time constraint, is greater than 12 ms. So, if we consider one more partition with the slowest operator execution time estimate of the data-path (44.3 ns, see Table 6), we obtain a processing time of 11.6 ms and a remaining time of 1.2 ms for the reconfiguration. Thus, a partitioning in four steps is possible. In this case, the size of each partition (C_n) should be about 216 cells. We keep partition number one identical to the one described in the first implementation in order to obtain the same processing time; moreover, the resources of this step are close to the new size estimated for each partition (C_n). We then decomposed the two last partitions into three parts. The divider has been split in two parts in order to homogenize the number of resources of each partition. This way of proceeding allows real time processing.
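The functional density gains quoted in this subsection can be recovered as the ratio between the area of the static implementation and the area of the largest partition, both meeting the same frame period; this is our reading of the metric of [25,26], summarized in the sketch below.

    # Functional density gain = static DFG area / largest RTR partition area (our reading).
    static_cells = {"edge detector": 467, "optical flow (3 steps)": 863, "optical flow (4 steps)": 863}
    largest_rtr  = {"edge detector": 159, "optical flow (3 steps)": 354, "optical flow (4 steps)": 294}
    for case in static_cells:
        # edge detector: 467/159 ~ 2.9; optical flow: 863/354 ~ 2.44; refined: 863/294 ~ 2.9
        print(case, round(static_cells[case] / largest_rtr[case], 2))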

Table 9 summarizes the implementation results of the normal optical flow estimator algorithm with four partitions. We notice that a dynamic execution with four steps can be achieved in real time. This is in accordance with our estimation. We obtain a total processing time of about 38 ms and a functional density gain close to 3. However, an implementation partitioned in five steps leads to a timing budget too tight for real time operation. Moreover, with these partitions, we would need a bigger memory to store the intermediate results than for the previous partitioning. Indeed, there are many ways to partition the algorithm with our strategy. Obviously, the best solution is to find the partitioning that leads to the same number of cells used in each step while taking the memory bandwidth into account (search for the minimal memory bandwidth) (see Section 4.2). In our applications we needed a maximum bandwidth of 30 bits. The Ardoise module provides a local bandwidth of 32 bits, which is sufficient to carry out the theoretical partitioning. Nevertheless we have to adopt a final partitioning that has the lowest bandwidth and that preserves the homogeneity of the resources in each partition. Fig. 11 represents the four partitions successively implemented in the reconfigurable array for the normal optical flow estimator. From our analysis of the data flow graph, we deduced the resource requirements and the speed of the various operators. This allows determining the total processing time, from which we deduce the optimized partitioning. This allows an RTR implementation of the DFG that enhances the functional capacity. However, in order to know the real cost of our method, it is necessary to take into account the memory size needed to store the intermediate results, and the resources needed by the read and write counters (pointers), the configuration controller and the small associated state machine.

4.3.3. Dynamic configuration and memory controller

In order to know the real resources needed and to compare our method with an architectural synthesis (AS) (see Section 5), it is necessary to quantify the cost of reconfiguration in terms of control and memory management. This controller can read/write data in the local memories and allows loading the configuration of the next temporal floorplan into the FPGA. In our case, this controller is a static

Fig. 7. Final refined partitioning which takes the memory bandwidth into account.


Fig. 8. Partitioning used to implement the images edge detector dataflow graph.

design module. This controller is mapped in the first partition and it remains during all the processing steps. The use of partial reconfiguration allows keeping the controller in place while it manages the execution of the application in the remaining space of the FPGA. Generally, if we have enough memory bandwidth, we can estimate the cost of the control part in the following

way. The memory resources must be able to store two images (we assume constant-flow processing), i.e. a memory size of 256 Kpixels per image. We have supposed that the memory access time is shorter than the slowest operator execution time. That is why it is preferable that the images are stored in on-chip memory, in order not to affect the execution performance of the partitions.

Fig. 9. Partitioning used to implement the normal optical flow estimator.

Table 8
Implementation results of the optical flow estimator with an AT40K

Partition number | Number of cells | Operator execution time (ns) | Partition reconfiguration time (μs) | Partition processing time (ms)
1 | 209 | 27.1 | 160 | 7.1
2 | 354 | 38.7 | 260 | 10.15
3 | 336 | 37.8 | 250 | 9.91

The controller needs two counters to address the memories, a state machine for the control of the RTR, and the management of the memories for read or write access (Fig. 12). In our case, the controller consists of two 18 bit counters (images of 512 × 512 pixels), a state machine with five states, a 4 bit register to capture the number of partitions (we assume a number of reconfigurations lower than 16), a counter indicating the current partition number, a 4 bit comparator and a toggle flip-flop indicating which of the two alternating buffer memories is read and which is written. With the targeted FPGA structure (see Section 4), the logic area of the controller in each configuration stage requires fewer than 50 logic cells. So, 50 cells must be added to each partition area given in Section 4.3.2.

4.3.4. Summary of needed resources

The resource characteristics of the platform to use or to design for the first example algorithm are: memory size, 256 Kwords of 30 bits; computing area (reconfigurable computing area + controller area), 209 cells; memory bandwidth, 19 bits with a minimal read/write frequency of 24.8 MHz. For the second algorithm, the memory size and the controller logic area are the same as for the previous algorithm. However, in this case, the computing area needs 344 cells and the memory bandwidth is 30 bits with a minimal read/write frequency of 25.8 MHz.

Fig. 10. Ardoise module.

Table 7
Implementation results of the edge detector with an AT40K

Partition number | Number of cells | Operator execution time (ns) | Partition reconfiguration time (μs) | Partition processing time (ms)
1 | 152 | 40.1 | 111 | 10.5
2 | 156 | 40.3 | 114 | 10.6
3 | 159 | 36.7 | 116 | 9.6
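As a check of inequality (6) on the figures of Table 7 (our arithmetic, pipeline latency neglected):

    # Total edge detector time: processing of the three passes plus the three reconfigurations.
    N, T = 512 * 512, 40e-3
    exec_ns   = [40.1, 40.3, 36.7]     # per-partition operator execution times (Table 7)
    reconf_us = [111, 114, 116]        # per-partition reconfiguration times (Table 7)
    total = N * sum(exec_ns) * 1e-9 + sum(reconf_us) * 1e-6
    print(total)                       # about 0.031 s, comfortably below the 40 ms frame period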

5. Discussion

The objective of the Algorithm/Architecture Adequation is to find an efficient matching between an algorithm and an architecture. The aim is to realize an optimal implementation that satisfies the constraints (real time, logic area, etc.). In this section, we analyse the gains and costs of the use of dynamic reconfiguration. We can compare our method to the more classical architectural synthesis, which is based on the reuse of operators by adding control at different times [32,33]. Indeed, the goal of the two approaches is the minimization of hardware resources with respect to a time constraint. If we consider our method with respect to an architectural synthesis, the RTR can be illustrated by Fig. 13, which represents the needed area with respect to the processing time for RTR. The optimization with AS can divide the logic resources by two at the cost of an additional cycle time for the processing. This observation applies only if we do not take into account the controller and memory resources that are needed for AS. These additional resources are difficult to estimate in AS because they depend heavily on the algorithm. In the case of RTR, if we consider critical times identical to those of AS, we notice the same optimization level. Nevertheless, the RTR control and memory management (hidden or not in the FPGA) logic resources are easy to deduce. Moreover, in contrast with AS, which sees its control grow with the level of reuse of the operators, these control resources remain relatively

Table 9
Implementation results of the normal optical flow estimator after refined partitioning

Partition number | Number of cells | Operator execution time (ns) | Partition reconfiguration time (μs) | Partition processing time (ms)
1 | 209 | 27.1 | 160 | 7.1
2 | 241 | 38.7 | 180 | 10.15
3 | 248 | 38.7 | 180 | 10.15
4 | 294 | 37.8 | 190 | 9.91


Fig. 11. Implementation of the normal optical flow estimator with four partitions.


Fig. 12. General view of the configuration and memory controller.

identical whatever the number of partitions may be. On the other hand, the additional time for the reconfigurations of the FPGA between each partition is added to the processing time. However, with the use of partial reconfiguration, this time is also constant for a given algorithm. Thus, according to the application and the time constraint, it is worth determining when it is better to use RTR or AS for an implementation on FPGA. We can make several observations. When architectural synthesis is applied, the operators must be dimensioned for the largest data size even if such a size is rarely processed (generally only after many processing passes). Similarly, even if an operator is not frequently used it must be present (and thus consumes resources) for the whole processing duration. These drawbacks, which no longer exist for a run-time reconfigurable architecture, generate an increase in the logic resources needed. Furthermore, the resource reuse can lead to increased routing delays compared to a fully spatial data-path, and thus decrease the global architecture efficiency. But, if we use the dynamic resource allocation features of FPGAs, we instantiate only the needed operators at each instant (temporal locality [10]) and ensure that the relative placement of the operators is optimal for the current processing (functional locality [10]). Nevertheless, this approach also has some costs. Firstly, if we consider the silicon area, an FPGA needs between five and ten times more silicon than a standard cell ASIC (the ideal target for architectural synthesis) for the same equivalent gate count, and with lower speed. But this cost is not too important if we consider the ability to make big modifications of the hardware functions without any change to the hardware part. Secondly, in terms of memory throughput, with respect to a fully static implementation, our approach requires an increase by a factor of at least the number of partitions n. And thirdly, in terms of power consumption, both approaches are equivalent if we neglect both the over-clocking needed to compensate for the reconfiguration durations and the consumption outside the FPGA. Indeed, to a first approximation, power consumption scales linearly with processing frequency and functional area

(number of toggling nodes), and we multiply the first by n and divide the second by n.
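In other words, to a first approximation and under the assumptions above, the factor n cancels out in the dynamic power; this is our rewriting of the argument, not a formula from the paper:

    P \propto f \cdot A, \qquad P_{\mathrm{RTR}} \propto (n \cdot f) \cdot \frac{A}{n} = f \cdot A \propto P_{\mathrm{static}}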

6. Conclusion and future works

We proposed a method for the temporal partitioning of a dataflow graph that minimizes the array size of an FPGA by using the dynamic reconfiguration feature. This approach increases the silicon efficiency by processing at the maximal allowed frequency on the smallest area that satisfies the real time constraint. The method is based, among other steps, on an estimation of the number of possible partitions by use of a characterized (speed and area) library of operators for the target FPGA. Our method takes into account the memory bandwidth bottleneck and the memory size for the storage of the intermediate results. This approach allows obtaining the main parameters characterizing the architecture model which implements a particular algorithm under the given constraints. This is very interesting for the development and optimization of embedded designs. This methodology for the RTR implementation of a design is adapted to reflect the layout constraints imposed by the target technology and avoids an oversizing of the resources needed. We illustrated the method by applying

Fig. 13. Evolution of spatio-temporal resources in RTR implementation.


it on two image processing algorithms, with real implementations on the Atmel AT40K FPGA target technology. Our approach must still be compared with other, more powerful optimization tools such as simulated annealing, genetic algorithms and so on. Currently we are working on a more accurate resource estimation which takes into account the memory management part of the data-path and also checks whether the available memory bandwidth is sufficient. We have also started to automate the partition search procedure, called DAGARD (which stands, in French, for automatic partitioning of DFGs for dynamically configurable systems), which is roughly a graph covering function. Currently, only the partition estimation part and the generation of the boundaries between partitions are automated. Our perspective is an automatic synthesis of the memory and configuration controller by our tool. Indeed, from the computed partitions, it is easy to automatically generate a fixed-resources controller for a particular implementation. Moreover, it is necessary to take into account the power consumption, which is an important parameter for embedded system design. This estimation depends on the target technology and on the working environment, such as the working frequency and the resources needed for each partition [34,35]. However, it is mainly on the data-path part (regular processing) that it is necessary to judge the 'adequation' of RTR for an implementation, because a controller is more effectively produced by a programmable sequencer like a microprocessor. We are also studying the possibility of including an automatic exploration of architectural solutions for the implementation of arithmetic operators. Another future work is to find when it is better to use dynamic configuration or architectural synthesis for the optimization of an implementation (see Section 5).

References [1] S. Hauck, The role of FPGA in reprogrammable systems, Proceedings of IEEE 86 (4) (1998) 615– 638. [2] S. Brown, J. Rose, FPGA and CPLD architectures: a tutorial, IEEE Design and Test of Computers 13 (2) (1996) 42– 57. [3] S.A. Guccione, D. Levi, The advantages of run-time reconfiguration, in: J. Schewel, P.M. Athanas, S.A. Guccione, S. Ludwig, J.T. McHenry (Eds.), Reconfigurable Technology: FPGAs for Computing and Applications, Proceedings of SPIE 3844, SPIE—The International Society for Optical Engineering, Bellingham, WA, 1999, pp. 87–92. [4] P. Lysaght, J. Dunlop, Dynamic reconfiguration of FPGAs, in: W. Moore, W. Luk (Eds.), More FPGA’s, Oxford, Abingdon, England, 1994, pp. 82–94. [5] S.C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, R. Taylor, PipeRench: a reconfiguration architecture and compiler, IEEE Computer April (2000).


[6] L. Kessal, D. Demigny, N. Boudouani, R. Bourguiba, Reconfigurable hardware for real time image processing, Proceedings of the International Conference on Image Processing, IEEE ICIP, Vancouver 3 (2000) 159–173. [7] B. Kastrup, A. Bink, J. Hoogerbrugge, ConCISe: a compiler-driven CPLD-based instruction set accelerator, in: K.L. Pocek, J.M. Arnold (Eds.), Proceedings of the Seventh Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, 1999, pp. 92 –101. [8] J.E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H.H. Touati, P. Boucard, Programmable active memories: reconfigurable systems come of age, IEEE Transactions on VLSI Systems 4 (1) (1996) 56–69. [9] M.J. Wirthlin, B.L. Hutchings, A dynamic instruction set computer, IEEE Symposium on FPGA’s Custom Computing Machines, Napa, CA April (1995) 99–107. [10] X. Zhang, K.W. Ng, A review of high-level synthesis for dynamically reconfigurable FPGA’s, Microprocessors and Microsystems 24 (2000) 199–211. Elsevier. [11] C. Tanougast, Me´thodologie de partitionnement applicable aux syste`mes sur puce a` base de FPGA, pour l’implantation en reconfiguration dynamique d’algorithmes flot de donne´es, PhD Thesis, Faculte´ des Sciences, Universite´ de Nancy I, 2001. [12] T.J. Callahan, J. Hauser, J. Wawrzynek, The GARP architecture and C compiler, IEEE Computer 33 (4) (2000) 62–69. [13] T.J. Callahan, P. Chong, A. DeHon, J. Wawrzynek, Fast module mapping and placement for data paths in FPGA’s, Proceedings of the ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, Monterey, CA February (1998) 123–132. [14] S.V. Srinivasan, S. Govindarajan, R. Vemuri, Fine-grained and coarse-grained behavioral partitioning with effective utilization of memory and design space exploration for multi-FPGA architectures, IEEE Transactions on VLSI Systems 9 (1) (2001) 140–158. [15] M. Kaul, R. Vemuri, Optimal temporal partitioning and synthesis for reconfigurable architectures, International Symposium on FieldProgrammable Custom Computing Machines April (1998) 312–313. [16] R. Maestre, F. Kurdahi, M. Fernadez, R. Hermida, N. Bagherzadeh, H. Singh, A framework for reconfigurable computing: task scheduling and context manegement, IEEE Transactions on VLSI Systems 9 (6) (2001) 858–873. [17] M. Karthikeya, P. Gajjala, B. Dinesh, Temporal partitioning and scheduling data flow graphs for reconfigurable computer, IEEE Transactions on Computers 48 (6) (1999) 579 –590. [18] N. Shirazi, W. Luk, P.Y.K. Cheung, Automating production of runtime reconfiguration designs, in: K.L. Pocek, J. Arnold (Eds.), Proceedings of IEEE Symposium on FPGA’s Custom Computing Machines, IEEE Computer Society Press, 1998, pp. 147 –156. [19] W. Luk, N. Shirazi, P.Y.K. Cheung, Modeling and optimizing runtime reconfiguration systems, in: K.L. Pocek, J. Arnold (Eds.), Proceedings of IEEE Symposium on FPGA’s Custom Computing Machines, IEEE Computer Society Press, 1996, pp. 167 –176. [20] P. Lysaght, J. Stockwood, A simulation tool for dynamically reconfigurable field programmable gare arrays, IEEE Transactions on VLSI Systems 4 (3) (1996) 381–390. [21] M. Valsiko, DYNASTY: a temporal floorplanning based CAD framework for dynamically reconfigurable logic systems, in: P. Lysaght, J. Irvine, R. Hartenstein (Eds.), Lecture Notes in Computer Science, vol. 1673, 1999, pp. 124–133, Glasgow, UK. [22] A. Cataldo, Hybrid architecture embeds Xilinx FPGA core into IBM ASICs, EE Times Jun 24, 2002, http://www.eetimes.com/story/ OEG20020624S0016. [23] J. Becker, R. Hartenstein, M. Herz, U. 
Nageldinger, Parallelization in co-compilation for configurable accelerators, proceedings of Asia and South Pacific Design Automation Conference, ASP-DAC’98 Yokohama, Japan (1998).



[24] A. Dehon, Comparing computing machines, Proceedings of the SPIE 3526 (1998) 124 –133. [25] M.J. Wirthlin, B.L. Hutchings, Improving functional density using run-time circuit reconfiguration, IEEE Transactions on VLSI Systems 6 (2) (1998) 247–256. [26] J.G. Eldredge, B.L. Hutchings, Run-time reconfiguration: a method for enhancing the functional density of SRAM-based FPGA’s,, Journal of VLSI Signal Processing 12 (1996) 67–86. [27] M. Vasilko, D. Ait-Boudaoud, Scheduling for dynamically reconfigurable FPGAs, In Proceeding of International Workshop on Logic and Architecture Synthesis, IFIP TC10 WG10.5, Grenoble, France (1995) 328–336. [28] D. Demigny, L. Kessal, R. Bourguiba, N. Boudouani, How to use high speed reconfigurable fpga for real time image processing?, Proceedings of IEEE Conference on Computer Architecture for Machine Perception, IEEE Circuit and Systems, Padova September (2000) 240–246. [29] N. Demassieux, Architecture VLSI pour le traitement d’images: Une contribution a` l’e´tude du traitement mate´riel de l’information, PhD Thesis, Ecole nationale supe´rieure des te´le´communications (ENST), 1991. [30] C. Tanougast, Y. Berviller, S. Weber, Optimization of motion estimator for run-time reconfiguration implementation, in: J. Rolim (Ed.), Parallel and Distributed Processing, Lecture Notes in Computer Science, vol. 1800, Springer, 2000, pp. 959– 965. [31] Atmel AT40k Datasheet. [32] N. Togawa, M. Ienaga, M. Yanagisawa, T. Ohtsuki, An area/time optimizing algorithm in high-level synthesis for control-base hardwares, Proceedings of the ASP-DAC 2000, IEEE circuits and systems, Yokohama, Japan (2000) 309 –312. [33] A. Sharma, R. Jain, Estimating architectural resources and performance for high-level synthesis applications, IEEE Transactions on VLSI Systems 1 (2) (1993) 175–190. [34] A. Bogliolo, R. Corgnati, E. Macii, M. Poncino, Parameterized RTL power models for soft macros, IEEE Transactions on VLSI Systems 9 (6) (2001) 880–887. [35] A. Garcia, W. Burleson, J.L. Danger, Power modelling in field programmable gate arrays (FPGA), in: P. Lysaght, J. Irvine, R. Hartenstein (Eds.), Lecture Notes in Computer Science, vol. 1673, 1999, pp. 396–404, Glasgow, UK.

Camel Tanougast received his PhD degree in Microelectronics and Electronic Instrumentation from the University of Nancy 1, France, in 2001. He is currently a researcher at the Electronic Instrumentation Laboratory of Nancy (LIEN). His research interests include the design and implementation of real time processing architectures, FPGA design and Terrestrial Digital Television (DVB-T).

Yves Berviller received the PhD degree in electronics in 1998 from the Henri Poincaré University, Nancy, France. He is currently an assistant professor at the Henri Poincaré University. His research interests include computer vision, system on chip development and research, FPGA design and Terrestrial Digital Television (DVB-T).

Philippe Brunet received an MSc degree from the University of Dijon, France, in 2001. He is currently a PhD student in Electronic Engineering at the Electronic Instrumentation Laboratory of Nancy (LIEN), University of Nancy 1. His main interests concern FPGA design and computer vision.

Serge Weber received the PhD degree in electronics in 1986 from the University of Nancy, France. In 1988 he joined the Electronics Laboratory of Nancy (LIEN) as an Associate Professor. Since September 1997 he has been Professor and Manager of the Electronic Architecture group at LIEN (University Henri Poincaré Nancy I). His research interests are reconfigurable and parallel architectures for image and signal processing and for intelligent sensors.

Hassan Rabah received the PhD degree in electronics in 1992 from the Henri Poincaré University, Nancy, France. He is currently an assistant professor at the Henri Poincaré University. His research interests include system on chip development and research, digital signal processing and sensor applications.