Dynamically reconfigurable dataflow architecture for high-performance digital signal processing


Journal of Systems Architecture 56 (2010) 561–576


S. Voigt *, M. Baesler, T. Teufel
Hamburg University of Technology, Schwarzenbergstrasse 95, 21073 Hamburg, Germany

Article history: Received 15 October 2009; received in revised form 2 April 2010; accepted 26 July 2010; available online 5 August 2010.

Keywords: Dataflow architecture; Hardware reconfiguration; Digital signal processing; Multi-FPGA platform; Parallel FFT

Abstract

In this paper a dataflow architecture is introduced that maps efficiently onto multi-FPGA platforms and is composed of communication channels which can be dynamically adapted to the dataflow of the algorithm. The reconfiguration of the topology can be accomplished within a single clock cycle while DSP operations are in progress. Finally, the programmability and scalability of the proposed architecture is demonstrated by a high-performance parallel FFT implementation.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

In comparison to ASICs, FPGAs are characterized by a 10- to 100-fold logical overhead in chip area due to their ability to be reconfigured. Moreover, the regularly arranged configurable logic cells have to be interconnected via programmable routing switches, and the overall performance heavily depends on the routing results delivered by the design tools. Particularly in complex designs where routing resources become scarce, it is difficult to find good solutions, and the interconnection delay dominates the delay within configurable logic cells. This results in poor clock rates that are usually about 20 times lower than those of general-purpose processors [1]. To overcome this problem, the foremost issue for FPGAs is the need to extract massive amounts of parallelism. Additionally, today's FPGA vendors integrate highly optimized embedded multipliers, fast carry chains, large amounts of on-chip RAM, and dedicated arithmetic routing, all of which facilitate DSP operations. Coupling these features with the massive parallelism provided by FPGAs, the resulting systems can outperform the fastest DSP processors by one to two orders of magnitude. While this can be easily achieved, e.g., for matrix multiplication, it can be difficult in other cases, particularly when more data dependencies exist, e.g., in computing parallel fast Fourier transforms (FFTs) [2–4]. Furthermore, the dedicated DSP resources in FPGAs are strongly limited. Thus, the maximum achievable computational performance highly depends on how efficiently the system architecture scales to multi-FPGA platforms and is bounded by the total communication bandwidth between embedded DSP units. Therefore, modern FPGA devices are usually equipped with a large number of high-speed serial transceivers, which are characterized by high noise tolerance, clock data recovery, and error detection, all of which enable reliable transfer rates of several gigabits per second. Moreover, this allows the easy setup of arbitrary network topologies. Another advantage of FPGAs is their ability to be reconfigured. For this reason, dynamic reconfiguration of FPGA architectures has become increasingly attractive [5,6]. The idea is to map DSP algorithms efficiently onto hardware [7] and modify parts in real time to switch from one function to another, e.g., by loading different filters in multimedia applications or a coprocessor on demand. However, one major drawback is that it takes up to milliseconds to partially reconfigure FPGA architectures. In this paper a dataflow architecture is introduced that can be efficiently mapped onto modern FPGAs. In this architecture, the topology of the interconnection between computational units can be dynamically reconfigured. In contrast to the concept of partially reconfiguring FPGAs, our approach is to connect DSP resources via a dynamically variable topology, so that the reconfiguration can be achieved within a single clock cycle and is done while arithmetic operations are in progress. Hence, the proposed dataflow architecture combines the basic idea of reconfiguration with the performance of scalable parallel processing.

* Corresponding author. Tel.: +49 4042878462; fax: +49 40428784013. E-mail addresses: [email protected] (S. Voigt), [email protected] (M. Baesler), [email protected] (T. Teufel). URL: http://ti3.tu-harburg.de/english (S. Voigt). doi:10.1016/j.sysarc.2010.07.010


2. Background

In the following, the typical characteristics of parallel algorithms are pointed out. Moreover, a concept is explained of how the topology of the architecture can be adapted to the dataflow of the algorithm to provide direct inter-processor communication at all times and to maximize the computational throughput.

2.1. Dataflow in parallel algorithms

One major goal in parallel processing is to minimize the dataflow between computational clusters. Obviously, the difficulty of this optimization increases with the amount of data dependencies involved. For instance, Fig. 1 shows the signal-flow graph of an 8-point fast Fourier transform (FFT) with decimation in frequency (DIF) methodology [3]. The input signal x(n) contains the signal being decomposed, while the output signal X(n) contains the amplitudes of the component sine and cosine waves. The number of samples in the time domain is represented by N, i.e., n = 0, …, N − 1, and in this particular example N = 8. However, in typical applications the number of FFT points is chosen between 32 and 4096. The signal-flow graph shown in Fig. 1 can easily be converted into a graph of functional elements as depicted in Fig. 2, where each square block represents one basic butterfly operation. The lines indicate the dataflow between butterfly units in consecutive stages of the FFT algorithm. If the FFT algorithm is parallelized by assigning the butterfly operations in each row to one computational unit, it is evident that the connections of one particular unit to the others must be the union of the interconnections from all stages to achieve direct communication at all times (see Fig. 3). However, the union of all topologies is usually too complex, because the number of links per unit increases nearly proportionally to the total number of butterfly units. For this reason, it is more efficient to continuously modify the topology of the connections and adapt it to the dataflow of the algorithm. Of course, the time required to reconfigure the topology must be very short compared to the duration of each butterfly computation. In conclusion, this concept can be generalized in the sense that for every algorithm an optimal sequence of topologies exists, which minimizes the inter-processor communication [8].

2.2. Dynamically reconfigurable network topology

In 1990, the MULTITOP multiprocessor system was introduced, in which the inter-processor communication is performed via switchable point-to-point links instead of a shared memory [9]. The MULTITOP network provides M inputs and M outputs and enables the architecture to be matched optimally to the algorithm's dataflow. It is quite similar to the Benes [10] and Lee [11] networks and is composed of S switches given by Eq. (1), each of which can be configured individually in two different ways: parallel (0) or crossed (1).



S = (M/2) · (2·log2 M − 1)    (1)

The number of MULTITOP network switches is of the same order of magnitude as in the Benes and Lee networks, and the topology is free of blocking for all permutations [9]. In contrast to the Lee network, however, the relatively strong meshing is avoided. Moreover, because of its self-similarity the MULTITOP network can be modularly extended (see Section 3.2.2). Mathematically, the purpose of this network can be regarded as a machine that accepts a permutation vector of M numbers and outputs a sorted sequence specified by its switch configurations.
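The stage-dependent dataflow described in Section 2.1 can be made concrete with a short sketch (illustrative Python, not part of the paper): in a radix-2 DIF FFT, the pairing of data indices changes in every stage, so no single fixed point-to-point topology serves all stages.

```python
# Illustrative sketch (not from the paper): which data indices are paired
# by the butterflies of an N-point radix-2 DIF FFT in each stage.
def butterfly_pairs(n, stage):
    """Index pairs (i, i + half) processed together in the given stage."""
    half = n >> (stage + 1)              # distance between paired indices
    return [(i, i + half) for i in range(n) if not (i & half)]

n = 8
for s in range(3):                       # log2(8) = 3 stages, cf. Fig. 1
    print(s, butterfly_pairs(n, s))
# stage 0 pairs (0,4),(1,5),... while stage 2 pairs (0,1),(2,3),...
```

Because every stage produces a different set of pairs, the union of all stage topologies grows with the transform size, which is the motivation for reconfiguring the topology between stages instead.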

3. Dynamically reconfigurable dataflow architecture

The new contribution of this paper is the efficient mapping of a dynamically variable topology onto modern FPGA architectures and, in particular, the development of a scalable extension of this concept for multi-FPGA platforms [12].

3.1. Overview

In the following sections the components of the architecture are described in detail: starting bottom-up with the basic switching unit (SU), of which the dynamically variable topology (DVT) mainly consists, the reconfiguration of the topology and how it can be achieved within a single clock cycle is explained comprehensively. Subsequently, the computational cluster (CC) is defined, which is composed of the DVT and computational units (CUs). Finally, the


Fig. 1. Signal-flow graph of an 8-point FFT.



Fig. 2. 8-point FFT graph using functional blocks.

distribution unit (DU) is introduced that implements the scalable extension for multi-FPGA platforms. The dynamically reconfigurable dataflow architecture (DRDA) can be regarded as a coprocessor that is integrated in a system-on-a-chip (SOC) and controlled by the embedded PowerPC (PPC) core as shown in Fig. 4. All components in the SOC are connected by the IBM CoreConnect™ bus architecture [13]. While the embedded Linux kernel from MontaVista [14] is executed in the double data rate (DDR) SDRAM, the quad data rate (QDR) SRAM holds the data to be computed. On multi-FPGA platforms the dataflow between several DRDAs that reside in different FPGAs is transferred via optical transceivers. Initially, the data to be processed is transmitted from QDR memory to Xilinx Block SelectRAM (BRAM) [15] by the PPC core via the processor local bus (PLB) and the on-chip peripheral bus (OPB). The PLB/OPB controllers depicted in Fig. 5 are combined and mapped into the address space of the PPC core, so that the BRAMs can be easily accessed by memory-mapped I/O. Of course, the address mappings are individual for different sets of DRDA parameters. For example, the number S of topology switches depends on the DVT dimension M (see Section 3.2.2), i.e., the corresponding configuration vectors are stored in different locations. However, all memory mappings are generated automatically by the implemented design generation tools, so that the application programming is completely abstracted from these details (see Section 6). It is important to note that between computational steps, the intermediate results stay within the DRDA and are not written back to memory, avoiding the well-known memory bottleneck. Thereby, all outputs become direct inputs either to the same dynamically reconfigurable dataflow cluster (DRDC) or to others as shown in Fig. 5. The DRDC comprises a computational cluster and distribution units that route the computed data to itself or other clusters. If DRDCs are reused for different sets of data (we call this virtualization), the intermediate results are buffered using BRAMs (see Section 3.2.5). The data can also be received via the optical links from a host PC. The DRDCs are controlled by a dedicated parallel topology controller (PTC). The PTC reconfigures the S topology switches while computations are in progress via a simple handshake protocol (see Section 4.2). In turn, its operations are monitored by the embedded operating system on the PPC core.

Fig. 3. Dataflow between butterfly units in subsequent stages of a parallelized FFT.

3.2. Components

In the following, all components of the dynamically reconfigurable dataflow architecture (DRDA) are comprehensively described bottom-up. In this way the complete DRDA is subsequently built up.

3.2.1. Switching unit

The basic switching unit (SU) is depicted in Fig. 6 and simply consists of two 4-input look-up tables (LUTs) that operate in parallel. The signal S_sel asynchronously controls the output of the switching unit (see Fig. 6c) and is routed to the data bus of a dual-ported BRAM, which is controlled by the parallel topology controller (PTC) as shown at the top of Fig. 5. The routing of the SU is quite simple. Whenever the input S_sel equals '0', the upper input S_in(0) is routed to the upper output S_out(0) and at the same time the lower input S_in(1) is routed to the lower output S_out(1). Likewise, if S_sel equals '1', the upper input S_in(0) is routed to the lower output S_out(1) and, in turn, the lower input S_in(1) is routed to the upper output S_out(0). The complete mapping is summarized in Fig. 6b, where 'x' on input I3 denotes a "don't care" at each LUT. Optionally, this input can be used to generate a specific bit sequence whenever I3 is asserted. The implementation described in this paper is based on 4-input LUTs because our multi-FPGA platform is composed of Xilinx Virtex-II Pro FPGA devices (see Section 5). However, one basic SU can also be efficiently implemented by a single 6-input LUT with two outputs (see Fig. 6d), as used in next-generation FPGAs, e.g., Xilinx Virtex-5 devices. Hence, the approach is not restricted to FPGAs with 4-input LUTs only.

3.2.2. Dynamically variable topology

The dynamically variable topology (DVT) of dimension M (see Fig. 7) is based on the MULTITOP network (see Section 2.2) and consists of switching units (SUs). The corresponding symbol of


Fig. 4. Integration of the dynamically reconfigurable dataflow architecture.


Fig. 5. Multi-FPGA dynamically reconfigurable dataflow architecture.

the SU is depicted in Fig. 6a. Hence, the DVT provides M serial communication channels, which are free of blocking for all permutations and can be dynamically adapted to the dataflow of the algorithm. Depending on the number of computational units (CUs; see Section 3.2.4), the DVT can also be extended modularly. However, depending on the FPGA technology, each SU introduces a specific propagation delay, e.g., about 0.28 ns for a Xilinx Virtex-II Pro FPGA device with the highest speed grade (SG) of "7". In addition, the extra net delay between connected SUs must be considered. Furthermore, it is important to note that the DVT consists of M fully asynchronous paths because no flip-flops are integrated. Given that the depth of the topology increases logarithmically with M, the overall propagation delay increases correspondingly. The delay for different FPGA speed grades is summarized in Table 1 and plotted in Fig. 8. The results are estimated by the Xilinx synthesis tool (XST) of the Integrated Software Environment (ISE) Foundation™ version 9.2i (service pack 4).
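As a behavioral illustration (our own sketch, not the paper's HDL), the SU of Section 3.2.1 and the switch count of Eq. (1) can be modeled as follows:

```python
import math

# Behavioral sketch of the 2x2 switching unit (SU): S_sel = 0 routes the
# inputs straight through ("parallel"), S_sel = 1 swaps them ("crossed").
def switching_unit(s_in, s_sel):
    a, b = s_in
    return (a, b) if s_sel == 0 else (b, a)

# Eq. (1): S = M/2 * (2*log2(M) - 1) switches for an M-channel DVT.
def num_switches(m):
    return (m // 2) * (2 * int(math.log2(m)) - 1)

print(switching_unit(("x0", "x1"), 1))            # -> ('x1', 'x0')
print([num_switches(m) for m in (2, 4, 8, 16)])   # -> [1, 6, 20, 56]
```

The computed counts reproduce the SU column of Table 1.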


Fig. 6. Switching unit (SU).

Fig. 7. Dynamically variable topology (DVT).

Table 1
DVT: resource usage and propagation delay.

M   | SUs | LUTs | Delay SG "6" (ns) | Delay SG "7" (ns)
2   | 1   | 2    | 0.320             | 0.280
4   | 6   | 12   | 2.258             | 1.990
8   | 20  | 40   | 4.060             | 3.568
16  | 56  | 112  | 5.862             | 5.147
32  | 144 | 288  | 7.664             | 6.725
64  | 352 | 704  | 9.466             | 8.304
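The Table 1 figures can be cross-checked against the claimed logarithmic growth; this short sketch (our own, using the SG "7" column) shows a roughly constant delay increment per doubling of M, i.e., per additional network stage:

```python
# Delay values taken from Table 1 (speed grade "7"); the increment per
# doubling of M is roughly constant, i.e. delay grows ~linearly in log2(M).
delays_ns = {2: 0.280, 4: 1.990, 8: 3.568, 16: 5.147, 32: 6.725, 64: 8.304}
ms = sorted(delays_ns)
increments = [round(delays_ns[b] - delays_ns[a], 3) for a, b in zip(ms, ms[1:])]
print(increments)   # each doubling of M adds roughly 1.6 ns
```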

Since interconnection delays dominate logic delays in modern silicon chips, particularly in FPGA architectures, the extra routing delays introduced by the regular arrangement of logic cells cannot be neglected and have to be carefully investigated after "place & route". The meshing of the DVT, however, is quite local, which yields fairly acceptable routing efficiency. Moreover, all switching units (SUs) can be easily distributed over the entire FPGA chip. By contrast, routing wide parallel buses is far more difficult. Thus, the "place & route" tools deliver much better results for serial communication channels.


Fig. 8. DVT: maximum frequency (Xilinx Virtex-II Pro FPGA).

Another advantage of a serial dataflow implementation is that the architecture is not restricted to a specified operand precision. Only a slight modification of the computational units (CUs), along with a suitable adjustment of the corresponding virtualization buffer (VB; see Section 3.2.4), is necessary. Nevertheless, it can be advantageous to replicate the communication channels to be of width W. We come back to this in Section 4.

Fig. 9. Dynamically reconfigurable dataflow cluster (DRDC).
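Section 3.2.4 transfers each operand of precision B over width-W channels in B/W successive parts; the serialization can be sketched as follows (illustrative Python with arbitrary example values, not the paper's carry-chain implementation):

```python
# Sketch (not the paper's implementation): a B-bit operand is transferred
# in B//W successive W-bit slices (LSB first) and reassembled afterwards.
def to_slices(value, b, w):
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(b // w)]

def from_slices(slices, w):
    return sum(s << (i * w) for i, s in enumerate(slices))

x = 0xBEEF                            # arbitrary 16-bit example operand
slices = to_slices(x, b=16, w=4)      # four 4-bit transfers on a W = 4 channel
assert from_slices(slices, w=4) == x  # B-to-W conversion is the exact inverse
```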


Fig. 10. Universal architecture of the computational unit (CU).

3.2.3. Computational cluster

The computational cluster (CC) comprises one DVT of dimension M together with M/2 computational units (CUs; see Section 3.2.4). Initially, the dataflow channel width W is serial, i.e., W = 1. However, the parallel extension (W > 1) can be advantageous [16] and is introduced in Section 4. In Fig. 9 the CC and its components are highlighted in grey. Depending on the DSP algorithm to be implemented, the CCs can be composed of either identical or different CUs. For example, when mapping the parallel FFT algorithm onto the dataflow architecture (see Section 7), all CCs are assembled from identical CUs, each implementing a butterfly operator (with different complex phase factors to be multiplied). The CC combines the M unidirectional handshake signals from all contained CUs, which are controlled by the parallel topology controller (PTC) as shown in Figs. 5 and 9. By this signal, each CU indicates its readiness to transfer the results. After the PTC has acknowledged the request, the transfer is started and simultaneously new data to be computed is received. When finished, the CU must deassert its request line for at least one cycle. Immediately after the PTC has detected the reset of all CU request signals, the DVT is reconfigured (see Section 4.2).

3.2.4. Computational unit

The computational unit (CU) is application-specific and provides two inputs and two outputs. Both inputs and outputs of the CU have a data width of W. Because the operands' precision B is fully user-definable and usually an integer multiple of W, the operands must be transferred in B/W successive parts. The arithmetic logic unit (ALU) is typically implemented with the full data width B. Therefore, the CU is mainly composed of three different blocks as shown in Fig. 10. The W-to-B conversion can be efficiently implemented by using the dedicated fast carry chains of the slices. This is applicable when the first operation on both operands is a summation. For example, this is the case for the FFT in DIF methodology (see Section 2.1). Then the W-to-B conversion comes with the addition/subtraction at no extra cost. The implementation of the ALU is completely application-specific. For this reason, it is described in detail later in Section 7. Finally, after the computation is completed, a B-to-W conversion must be accomplished. This is efficiently implemented by a dual-ported BRAM memory block with an asymmetric aspect ratio.

3.2.5. Distribution unit

The distribution unit (DU) is used to connect a number L of physical CCs with each other as depicted in Fig. 11. The input of each DU can be routed either to the same CC (upper path) or, via a cascade of demultiplexers, to other CCs (lower path). In this way several CCs can be arbitrarily interconnected and mapped either on a single FPGA or transparently distributed over several FPGAs. Fig. 9 shows how the DU is integrated into the proposed dataflow architecture. Let N be the number of FFT points (e.g., N = 16 in a 16-point FFT); then the required number of physical CCs would be Lmax = N/M. These CCs have to be connected via Lmax demultiplexers in the DUs, and the additional delay introduced by the demultiplexers is directly proportional to the number of physically interconnected CCs. However, when only L < Lmax CCs are available, these have to be reused V = Lmax/L = N/(M·L) times. In this case V − 1 additional virtualization buffers (VBs) are required in the DU to buffer intermediate results. These are implemented by one dual-ported BRAM memory block per dataflow channel, which can also be exploited to implement a B-to-W conversion at no extra cost (see Section 7). In summary, the number L of CCs and their dimension M can be chosen before the dataflow architecture is mapped onto FPGA architectures. Depending on these parameters, the latency caused by the virtualization buffers and the interconnection delay through the cascaded demultiplexers change (the latter are implemented as cascaded LUTs at the inputs of each CC). Therefore, based on the available resources, an application-specific optimal choice must be found.

4. Parallel extension

In the previous sections the dataflow channels of the DRDA were defined to be serial only, i.e., W = 1. However, a parallel extension of the transfer width W is motivated by the logarithmic increase of the propagation delay mentioned in Section 3.2.2. While a serial dataflow can be compensated by high operating frequencies for small values of M, it is no longer acceptable for M ≥ 16, i.e., when the maximum operating frequency falls below 200 MHz (see Fig. 8). In particular, when operands with single (32-bit) or double precision (64-bit) have to be transferred between computational stages, a much higher bandwidth is essential for high-performance digital signal processing.

4.1. DVT replication

In order to increase the channel width W, we replicate the dynamically variable topology (DVT) W times, where W denotes the desired dataflow channel width (see Fig. 9). Since all data bits are routed to the same destination, the S control signals of the PTC can be used for all replicates. Hence, no extra control logic is necessary and the total LUT resource usage scales linearly with W. Since all SUs can be easily distributed over the entire FPGA chip, this still yields fairly acceptable routing efficiency. Moreover, each DVT can be operated at nearly the same maximum frequency estimated for serial transmissions, i.e., when W = 1. However, a maximum parallel transfer width W equal to the precision B of the operands is not always practically feasible because the total number of required LUTs is proportional to W. Moreover, some application-specific CU component implementations further limit the advantages of a maximum parallel transfer width. For example, when implementing the parallel FFT in DIF methodology, the propagation delay of the carry chains also increases proportionally with the number of bits to be summed up in parallel. Therefore, the transfer width W is user-definable.

Fig. 11. Distribution unit (DU).

4.2. Parallel topology controller

The reconfiguration of the DRDA is dynamically controlled by a simple state-based parallel topology controller (PTC). The PTC is usually assigned to one DRDC and its replicates. However, depending on the algorithm, one PTC can also control several DRDCs simultaneously (when instantiated on the same FPGA). This is possible, e.g., for parallel FFT computations. In Fig. 12 the corresponding state machine of the PTC is depicted. It consists of only four states that implement, on the one hand, the handshake protocol between the PTC and its assigned CUs and, on the other hand, the control of the DVT reconfiguration. In the INIT state all components of the DRDA are initialized. After the PPC core has started the DRDA operation, the PTC waits for all CUs to complete their individual initialization phase. After successful completion, i.e., when all CUs have asserted their REQ signal, the PTC state machine switches to the next state, namely DATA_DVT. On entering this state, the corresponding dataflow is always started immediately. Depending on the operand precision B and the data channel width W, the PTC controls the data transfer by counting the number of fixed transfer cycles as well as by asserting the required enable signals of the input/output memory blocks and the virtualization buffer (VB), respectively. After the data transfer has been completed, the DVT is reconfigured in one cycle and the state machine reaches its last state, CU_REQ, in which the PTC waits again for the CU request signals to be asserted. This time, however, the CUs indicate that the computations are completed and the next data transfer is ready to start.

4.3. DRDA implementation results

The propagation delay in the dataflow channels increases logarithmically with the number L of DRDCs and their corresponding dimension M. While the replication linearly increases the overall bandwidth, the increase of logical resources is also proportional to the dataflow channel width W. Hence, the bandwidth-to-resource ratio is independent of W and decreases logarithmically with L and M. In conclusion, an application-specific trade-off between the DRDA parameters L, M, and W must be found.


5. Multi-FPGA hardware design

To prove the scalability of the proposed dataflow architecture, we developed a hardware board [17] that comprises two Xilinx Virtex-II Pro FPGAs connected at board level via six high-speed RocketIO™ transceivers. Moreover, multi-board computing platforms can be easily composed via four optical links and have been successfully tested, each link operating at up to 3.125 Gb/s. Fig. 13 shows the block diagram of our printed circuit board (PCB) hardware design. We chose the Xilinx XC2VP30 FPGA [15], which integrates more than 30,000 logic cells, two IBM PowerPC® 405 RISC processor cores (operating at up to 400 MHz), and eight serial RocketIO™ embedded multi-gigabit transceivers (MGTs) [18]. The PCB design is based on the PCI revision 2.3 standard and fully complies with the corresponding electrical and mechanical specification [19]. The PCI target I/O accelerator chip PCI 9030 from PLX Technology [20] is used to bridge the PCI bus to a 32-bit local bus that, in turn, connects to both FPGAs. The PCI bus is employed


Fig. 12. PTC state machine.
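The four states of Fig. 12 can be summarized in a small transition sketch (hypothetical signal names and simplified single-cycle timing; not the paper's HDL):

```python
# Simplified model of the PTC state machine of Fig. 12. The signal names
# (ppc_start, all_cu_req, transfer_done) are illustrative assumptions.
def ptc_next(state, ppc_start=False, all_cu_req=False, transfer_done=False):
    if state == "INIT":           # wait for PPC start and all CU requests
        return "NOTIFY" if (ppc_start and all_cu_req) else "INIT"
    if state == "NOTIFY":         # CU notification takes one cycle
        return "DATA_DVT"
    if state == "DATA_DVT":       # transfer data; DVT reconfigured in the
        return "CU_REQ" if transfer_done else "DATA_DVT"  # cycle after it
    if state == "CU_REQ":         # wait for CUs to finish their computation
        return "DATA_DVT" if all_cu_req else "CU_REQ"
    raise ValueError(state)

s = ptc_next("INIT", ppc_start=True, all_cu_req=True)
print(s)   # -> NOTIFY
```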



Fig. 13. PCB hardware design: block diagram.

for FPGA configuration, the transfer of control and status information, as well as power supply. Additionally, the data to be computed by the multi-FPGA platform can be transferred via the PCI bus from the host to on-board memory. However, it is important to note that during DSP computations the dataflow between boards is accomplished solely via the optical links rather than the PCI bus. In the dynamically reconfigurable dataflow architecture (DRDA) the dataflow is controlled by a PowerPC (PPC) core and a dedicated parallel topology controller (PTC) in the background. In this way, the computational throughput is not affected by the control flow, because the dynamically variable topology (DVT) is reconfigured while DSP computations are in progress. Whereas this can be efficiently achieved at FPGA level, on multi-FPGA platforms the overall performance highly depends on the total inter-FPGA communication bandwidth. Therefore, our PCB prototype is equipped with four high-speed optical links that operate at up to 3.125 Gb/s each, as depicted in Fig. 13. Moreover, both FPGAs are connected at board level by six RocketIO™ embedded MGTs (which also operate at up to 3.125 Gb/s) as shown in Fig. 14, which yields an overall communication bandwidth of more than 30 Gb/s.

6. Programming model


In recent years, significant increases in the silicon and algorithmic complexity of today's highly integrated embedded hardware and software systems have triggered a rise in design and verification costs. For this reason, the need for powerful development approaches has emerged, and a new paradigm known as electronic system level (ESL) design promises to usher in a new era in FPGA design. The term ESL refers to tools and methodologies that raise design abstraction to levels above the current register transfer level


Fig. 14. Dataflow topology of the on-board RocketIO transceivers.



(RTL), because RTL development is still characterized by time-consuming and error-prone design cycles [21].

6.1. Electronic system level design

The early ESL tools on the market offer domain-specific synthesis to hardware from languages like C/C++ or MATLAB, aimed at everything from accelerating software algorithms to creating high-performance digital signal processing (DSP) engines, in particular for algorithms that demand more than what conventional von Neumann processor architectures can currently deliver. The MathWorks has demonstrated that model-based design with Simulink® [22] produces dramatic reductions in development time, cost, and risk. In addition, MATLAB® [23] is a high-level technical computing language and interactive environment for analyzing data and developing algorithms. In 2004, Xilinx released the System Generator for DSP development tool, which fully supports the MATLAB and Simulink software packages [24]. It automates the design, debugging, and deployment of FPGA-based DSP systems with push-button performance. Thereby, for the first time, system architects, DSP engineers, and hardware designers could model complete DSP systems and subsequently use a direct path to implementation by automatic HDL code and testbench generation, including test vectors. Hence, Xilinx extended MATLAB/Simulink, the state-of-the-art tool for accelerating engineering and science tasks, with an integrated system-level design environment to build sophisticated DSP systems and, at the same time, to drastically reduce time-to-market design cycles. A typical approach in System Generator is to import an HDL module and use it as a component. The System Generator black box block allows VHDL, Verilog, and electronic data interchange format (EDIF) designs to be imported and behaves exactly like other System Generator blocks. In this way, the DRDA top-level file, which is written in VHDL, has been imported into System Generator.
By means of VHDL generics, the DRDA architecture is fully customizable. Hence, based on the application-specific requirements, the appropriate DRDA parameters (see Section 3) can be

manually chosen or automatically optimized for certain design constraints. Because the computational unit (CU) is also included, DRDA blocks have to be chosen from a library for the different DSP algorithms. Fig. 15 shows an example of a 32-point FFT as described in the following Section 7. All parameters of the DRDA can be easily configured within a so-called masking dialog that is opened by a simple right-click on the DRDA box (see Fig. 15a).

6.2. Coprocessor model

As previously described, in the proposed architecture the dataflow is controlled by a PowerPC (PPC) core and a dedicated parallel topology controller (PTC) in the background. This means that the computational performance is not affected by the control flow, because the topology of the dataflow communication channels between computational units (CUs) is reconfigured while DSP computations are in progress. For this reason the switch configurations of the dynamically variable topology (DVT) must be continuously reloaded into the associated BRAM blocks. Of course, the data format depends on the number of switches. However, based on the DVT dimension M, the final DVT switch configuration vector is automatically written in the correct data format to a BRAM memory file. In this way, it is made transparent to the application programmer. In conclusion, the multi-FPGA platform can be regarded as a high-performance coprocessor board [25]. The idea is that the application engineer simply chooses a DSP algorithm in a graphical
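The paper does not specify the BRAM data format, but the idea of packing the DVT switch configuration vector into correctly sized memory words can be sketched as follows (a behavioral illustration only; the function and parameter names are hypothetical, not part of the DRDA tool flow):

```python
def pack_switch_vector(switches, word_width=32):
    """Pack a list of boolean switch settings (a DVT configuration
    vector) into fixed-width memory words, least significant bit first.

    `switches` and `word_width` are illustrative; the real data format
    depends on the number of switches, i.e., on the DVT dimension M.
    """
    words = []
    for base in range(0, len(switches), word_width):
        word = 0
        for bit, on in enumerate(switches[base:base + word_width]):
            if on:
                word |= 1 << bit
        words.append(word)
    return words

# 40 switch bits fit into two 32-bit memory words.
config = pack_switch_vector([True, False, True] + [False] * 37)
```

Writing such words to a memory initialization file is what makes the switch format transparent to the application programmer.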

Fig. 16. Radix-2 butterfly computation (DIF): the inputs a and b of width B yield a + b (width B + 1) and (a − b)·W_N^k (width B + 1 + B_W), with the twiddle factor W_N^k = e^(−j2πk/N).

Fig. 15. Interfacing Xilinx System Generator Blocks in Simulink.

Fig. 17. CU FFT implementation: overview. The computational unit (CU) comprises an adder and a subtractor as summation operators (SOs) with W-to-B converters, a complex multiplier fed with phase factors from a BRAM, optional scaling and rounding units, and B-to-W converters at the outputs.

7. Application

Generally, the dynamically reconfigurable dataflow architecture (DRDA) is suitable for any kind of distributed high-performance digital signal processing (DSP). To demonstrate its effectiveness, the operational principle is explained using a high-performance parallel fast Fourier transform (FFT).

7.1. Fast Fourier transform

Fig. 18. CU FFT implementation: summation operator (SO).

user interface (GUI) on the host; based on the available resources and optional criteria, the DRDA design tools then automatically generate the best-suited FPGA configuration files, which can subsequently be downloaded to the multi-FPGA coprocessor platform.

The Cooley-Tukey FFT algorithm [4] is a well-known algorithm to compute the DFT in O(N log N). The FFT algorithm decomposes a DFT with N points into N DFTs with a single point each. The second step is to calculate the N frequency spectra corresponding to these N time-domain signals. Lastly, the N spectra are synthesized into a single frequency spectrum. Because the decomposition of this FFT is done in the time domain rather than in the frequency domain, it is commonly referred to as the decimation in time (DIT) algorithm. There also exists a very popular variant of the Cooley-Tukey FFT algorithm, the so-called Gentleman-Sande FFT [3]. This algorithm carries out the decomposition in the frequency domain rather than in

Fig. 19. CU FFT implementation: W-bit adder/subtractor. LUT-based MUX/XOR carry-chain logic with inputs a[W−1:0] and b[W−1:0], ADD and carry-in (CI) controls, sum output s[W−1:0], and carry out (CO).


Fig. 20. CU FFT implementation: W-to-B converter (shift register stages #0 to #B/W−1).

time domain. The corresponding signal-flow graph of an 8-point decimation in frequency (DIF) FFT has already been shown in Fig. 1 in Section 2.1. After the conversion into a graph of functional elements (see Fig. 2), each block represents one butterfly operator as depicted in Fig. 16, where the multiplication by the twiddle factor W_N^k is performed after the complex point b has been subtracted from a. We chose the Gentleman-Sande FFT, since it is advantageous for the CU implementation, as explained in the following section.
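The Gentleman-Sande scheme can be captured in a few lines: every stage forms a + b and (a − b)·W, and the results fall out in bit-reversed order. The following is a minimal software reference model only, not the hardware implementation, in which the CUs compute the same butterflies in parallel:

```python
import cmath
import math

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT (Gentleman-Sande).

    Takes natural-order input; the butterflies leave the result in
    bit-reversed order, which is undone at the end.
    """
    x = list(x)
    n = len(x)
    span = n
    while span >= 2:
        half = span // 2
        w_span = cmath.exp(-2j * math.pi / span)  # twiddle base of this stage
        for start in range(0, n, span):
            w = 1.0 + 0j
            for k in range(half):
                a, b = x[start + k], x[start + k + half]
                x[start + k] = a + b               # upper butterfly output
                x[start + k + half] = (a - b) * w  # twiddle applied after subtraction
                w *= w_span
        span = half
    # Undo the bit-reversed output ordering.
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]
```

Comparing the output against a direct O(N²) DFT for small N confirms the recursion.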

7.2. DRDA implementation

In Section 3 all components of the DRDA have been explained in detail except for the computational unit (CU), which is application-specific. In the following, the CU implementation of the FFT butterfly operation (with DIF) is described.

7.2.1. CU architecture

The CU implementation for the parallel FFT application is depicted in Fig. 17. As previously mentioned in Section 3.2.4, the summation in the butterfly operation (see Fig. 16) can be efficiently implemented by using the dedicated fast carry chains and is implicitly included in the W-to-B conversion. Fig. 18 shows the summation operator (SO), which comprises a W-bit adder/subtractor, a simple edge-triggered flip-flop, and the W-to-B converter. The SO provides two inputs of width W each and one output of width B + 1. The carry out (CO) bit is used to forward the exact result of the summation to the ALU. The combinational logic of the W-bit adder/subtractor is depicted in Fig. 19. Starting with the W least significant bits, in all, B/W partial summations are computed. It is important to note that the synthesis must be constrained correctly to obtain the desired mapping onto the dedicated FPGA resources. Alternatively, the corresponding Xilinx library macros can also be used, which already include the required relative location (RLOC) constraints. The W-bit partial sum flows asynchronously to the adjacent W-to-B converter, which is composed of W shift registers (SRs) as depicted in Fig. 20. More precisely, each bit of the result s is connected to the input of its associated SR, where each shift register consists of B/W concatenated flip-flops. The partial sums are shifted subsequently with each transfer cycle into the SRs. For this reason, the output bits must
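The B/W partial summations with carry forwarding can be modeled as follows (a behavioral sketch of the SO arithmetic, not the carry-chain hardware; the function and parameter names are illustrative):

```python
def serial_add(a, b, bit_width=16, chunk=4):
    """Add two `bit_width`-bit operands in bit_width // chunk partial
    summations of `chunk` bits each, forwarding the carry between
    chunks -- a behavioral model of the SO's serial summation.

    Returns the bit_width-bit sum and the final carry-out (CO) bit,
    which together form the exact (bit_width + 1)-bit result.
    """
    mask = (1 << chunk) - 1
    carry = total = 0
    for i in range(0, bit_width, chunk):
        partial = ((a >> i) & mask) + ((b >> i) & mask) + carry
        total |= (partial & mask) << i  # chunk of the running sum
        carry = partial >> chunk        # carry into the next chunk
    return total, carry

# 0xFFFF + 0x0001 wraps to 0 with CO = 1, i.e. the exact 17-bit sum.
assert serial_add(0xFFFF, 0x0001) == (0, 1)
```

The chunked result matches the full-width addition, which is why the W-to-B conversion can absorb the summation at no extra cost.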

Table 2
CU complex butterfly implementation: resource usage [LUTs].

No. CUs (M = 2)   Fixed-point                                Floating-point
                  8-bit    16-bit    24-bit    32-bit        32-bit
1                    77       149       437       629          5854
2                   154       298       874      1258             –
4                   308       596      1748      2516             –
8                   616      1192      3496      5032             –
16                 1232      2384      6992         –             –
32                 2464      4768         –         –             –
64                 4928         –         –         –             –
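The LUT counts in Table 2 scale exactly linearly with the number of CUs, i.e., there is no shared overhead between butterfly units; a quick check on the 8-bit fixed-point column:

```python
# Per-CU LUT cost of the 8-bit fixed-point complex butterfly (Table 2).
LUTS_PER_CU = 77

# Doubling the number of CUs doubles the LUT usage: the measured
# column equals 77 * 2**i for CU counts 1, 2, 4, ..., 64.
measured = [77, 154, 308, 616, 1232, 2464, 4928]
predicted = [LUTS_PER_CU * 2**i for i in range(7)]
assert predicted == measured
```

The same doubling holds for the 16-, 24-, and 32-bit columns up to the point where the device resources are exhausted.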

Table 3
Xilinx FFT LogiCore (Version 5.0): configuration settings.

Configuration settings     Chosen parameters
Transform size             8, 16, 32, 64, 128, 256, 512, 1024
Data sample precision      16
Phase factor precision     16
Arithmetic type            Scaled fixed-point
Rounding mode              Truncation


Fig. 21. Xilinx FFT LogiCore (Version 5.0): Radix-2, burst I/O architecture.

Fig. 22. Number of required transform cycles against N (Xilinx FFT core vs. DRDA FFT core, N = 8 to 1024).

Table 4
FFT core implementations: number of transform cycles.

N                       8      16      32      64     128     256     512    1024
Xilinx core (W = 16)   72     121     214     403     800   1,645   3,450   7,303
DRDA core (W = 16)     19      25      31      75     163     355     768   1,710
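From the cycle counts in Table 4, the speedup of the DRDA core over the Xilinx core peaks at N = 32, which reproduces the factor of 6.9 quoted in the text:

```python
# Transform cycle counts from Table 4 (W = 16).
xilinx = [72, 121, 214, 403, 800, 1645, 3450, 7303]
drda = [19, 25, 31, 75, 163, 355, 768, 1710]

# Per-size speedup of the DRDA core; the maximum, 214 / 31, is at N = 32.
speedups = [x / d for x, d in zip(xilinx, drda)]
assert round(max(speedups), 1) == 6.9
```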

be reordered accordingly. Because the additional net delay between the W-bit adder/subtractor outputs and the shift register inputs is negligible compared to the combinatorial delay of the W-bit adder/subtractor, the W-to-B conversion is achieved at no extra cost. After summation, in the CU ALU the lower W-to-B converter output is multiplied by the complex phase factor W_N^k as shown in Fig. 16. Optionally, the result is scaled and rounded. If no rounding is selected, the result is simply truncated to the width of B bits. Finally, the widths of both results are converted from B to W, so that the data can be subsequently transferred through the dynamically variable topology (DVT). The B-to-W conversion is easily achieved by a dual-ported BRAM that is configured with an asymmetric aspect ratio. At the same time, these BRAMs can be exploited as virtualization buffers (VBs) when the VBs are chosen to be located directly after the CU. This virtualization is required when the physical CUs must be logically reused V times for N > M and the intermediate results have to be buffered (see Section 3.2.5).

7.2.2. DRDA: user-defined data type and precision

Generally, in the DRDA the operands are neither restricted to a special data type nor to a particular precision due to the serial data-


Table 5
FFT core implementations: resource usage [FPGA slices].

N                        8           16          32          64          128         256         512         1024
Xilinx core (W = 16)    0.6K 4.6%   0.7K 4.8%   0.7K 5.0%   0.7K 5.1%   0.7K 5.2%   0.8K 5.6%   0.8K 5.9%   0.8K 6.1%
DRDA core (W = 1)       1.3K 9.2%   2.4K 11.7%  4.6K 33.6%  –           –           –           –           –
DRDA core (W = 16)      1.6K 11.7%  3.3K 24.0%  7.0K 50.6%  6.7K 49.2%  7.3K 53.2%  7.3K 53.2%  7.3K 53.5%  6.7K 49.1%

flow communication channels. Of course, the total resource usage highly depends on the chosen data type and its precision B as well as on the other DRDA parameters; for example, the total resource usage of the CU implementation for the FFT is summarized in Table 2.

7.3. Results

To evaluate the computational performance of the DRDA FFT implementation, the results are analyzed in the following. For this purpose, first the Xilinx fast Fourier transform (FFT) LogiCORE™ intellectual property (IP) [26] is described as a reference core. Secondly, the results measured after the "place & route" phase in the static timing analysis of both FFT cores are compared. Finally, further improvements of the DRDA FFT core are proposed that roughly double the speedup compared to the already reduced transform latency achieved by the standard, non-optimized DRDA FFT core implementation.

7.3.1. Xilinx fast Fourier transform LogiCore IP

The Xilinx FFT LogiCore™ IP was chosen because the core can be configured for different transform sizes, data sample and phase factor precisions, arithmetic types, and several complex butterfly implementations. In this way, the Xilinx FFT core can be easily generated for various parameters and accurately compared with the corresponding DRDA FFT core configurations. The results of the Xilinx FFT core are based on the toolchain of the Integrated Software Environment (ISE) Foundation™ version 9.2i (service pack 4). The chosen LogiCore™ configuration parameters are summarized in Table 3 and the corresponding radix-2, burst I/O architecture is shown in Fig. 21.

7.3.2. Benchmarking

In Fig. 22 the number of required transform cycles for both FFT core implementations is plotted against the number of sampled signal points. For N = 8, N = 16, and N = 32 points, the computa-

Table 6
FFT core implementations: maximum frequency (MHz).

N                        8         16        32        64        128       256       512       1024
Xilinx core (W = 16)    228.83    228.83    228.83    228.83    228.83    228.83    228.83    228.83
DRDA core (W = 1)       156.27    128.63    100.25    –         –         –         –         –
DRDA core (W = 16)      143.35    105.50    80.00     79.02     77.63     82.54     81.50     80.93

Fig. 23. Overall transform latency [µs] after place and route against the number of points N (Xilinx FFT core vs. DRDA FFT core, SG6).


Table 7
FFT core implementations: overall transform latency after place and route (µs).

N                        8        16       32       64       128      256      512      1024
Xilinx core (W = 16)    0.315    0.529    0.935    1.761    3.496    7.189    15.077   31.914
DRDA core (W = 16)      0.133    0.237    0.388    0.951    2.100    4.301    9.423    20.619
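The latencies in Table 7 follow directly from the cycle counts in Table 4 divided by the maximum frequencies in Table 6; for example, at N = 32 the quoted 141% improvement of the DRDA core can be reproduced:

```python
# N = 32 figures from Tables 4 and 6 (W = 16).
xilinx_cycles, xilinx_mhz = 214, 228.83
drda_cycles, drda_mhz = 31, 80.00

# Overall transform latency in microseconds: cycles / f_max[MHz].
xilinx_us = xilinx_cycles / xilinx_mhz   # ~0.935 us, as in Table 7
drda_us = drda_cycles / drda_mhz         # ~0.388 us, as in Table 7

# The DRDA core's latency is 141% better for N = 32.
improvement = (xilinx_us / drda_us - 1) * 100
assert round(improvement) == 141
```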

Fig. 24. Multi-clock domains: dual-ported BRAMs serve as pipelining buffers (PBs) and virtualization buffers (VBs) between the M×M dynamically variable topology (DVT) and the computational units (CUs) of a computational cluster (CC).

tional cluster (CC) of the DRDA FFT core integrates the maximum number of computational units (CUs). Hence, direct communication is achieved in all stages. For N > 32 and W = B = 16, the available FPGA resources in an XC2VP30 device are not sufficient to implement CCs with a dimension of M > 32. For this reason, virtualization must be used from N = 64-point to N = 1024-point FFTs, where the CUs are reused V times. However, the virtualization can also be exploited to improve the data throughput, since it allows the results of one butterfly computation to be transferred while the next logical iteration is processed. In this way, the CUs are kept busy at all times. Moreover, the achievable speedup can be further improved by implementing multiple clock domains (see Section 7.3.3). In conclusion, the resulting speedup in the number of transform cycles of the DRDA FFT core is up to 6.9 (for W = 16) in comparison to the Xilinx FFT core (see Table 4). However, this comes at the cost of clearly higher resource usage, as summarized in Table 5. As expected, the higher resource usage also has a direct impact on the maximum frequency measured after the "place & route" phase using the static timing analyzer (see Table 6). While the maximum frequency of the compact and highly optimized Xilinx FFT core implementation is only dependent on the multiplier blocks, the maximum frequency of the DRDA FFT core implementation is limited by the propagation delay through the dynamically variable topology (DVT). Because virtualization is used for N > 64, the resulting resource usage and maximum frequency (for speed grade "-6") are roughly the same from N = 32 to N = 1024. Nevertheless, the overall transform latency of the DRDA FFT core is up to 141% (for N = 32, W = 16) better than that of the related Xilinx FFT core (see Fig. 23 and Table 7). When virtualization is used, the improvement is between 86% (for N = 64, W = 16) and 55% (for N = 1024, W = 16).

7.3.3. Further improvements

Particularly when the W-to-B converter in the CU is implemented with the fast carry logic (see Section 7.2.1), the overall propagation delay through the DVT is additionally increased. Hence, as depicted in Fig. 24, another dual-ported BRAM was placed in front of the CUs as a pipelining buffer (PB) to decouple the logic and allow the implementation of multiple clock domains. This can also be exploited to clock the CU logic at higher frequencies, which yields transform cycle speedups of more than one order of magnitude.

8. Conclusion

In this paper we presented an FPGA-based dataflow architecture that is composed of communication channels which can be dynamically adapted to the dataflow of the algorithm and maps efficiently onto multi-FPGA hardware platforms. The topology can be reconfigured within a single clock cycle while DSP operations are in progress. Moreover, only the computational unit of the DRDA components is application-specific and must be implemented according to the functional blocks of the DSP algorithm. The remaining components are universally applicable because only the raw dataflow is routed. Hence, for the universal DRDA implementation only the configurations of the dynamically variable topology along with the distribution units must be generated


based on the dataflow graph of the DSP algorithm and have to be loaded into the corresponding configuration memory. In this way, the parallel topology controller and its associated components accomplish the dataflow without any knowledge of the DSP algorithm: after the operands have been transferred, the next topology configuration is loaded and the controller then waits for all computational units to signal the completion of the current DSP computation. Subsequently, the next data transfer is started, and so on. Thus, the DRDA is suitable for various DSP algorithms. Finally, in this paper the efficiency of the proposed architecture was demonstrated by a parallel FFT implementation. In conclusion, this novel dataflow architecture has proven to be a promising approach for high-performance DSP applications on multi-FPGA platforms. The DRDA is characterized by system-level programmability and high scalability. In particular, the transparent mapping of different DSP algorithms onto the proposed architecture, owing to its universal dataflow concept in association with application-specific computational units, is one of its most important benefits.

8.1. Future work

Meanwhile, a multi-FPGA platform with two hardware boards, i.e., four Xilinx Virtex-II Pro FPGAs, has been assembled. It would be desirable to test the DRDA on multi-FPGA platforms with considerably more FPGA devices to prove the scalability of the proposed architecture in hardware and to compare it to other high-performance parallel DSP engines. One of the most interesting follow-up projects will then be the mapping of further DSP algorithms onto the proposed dataflow architecture.

References

[1] Z. Guo, W. Najjar, F. Vahid, K. Vissers, A quantitative analysis of the speedup factors of FPGAs over processors, in: Proc. of the ACM/SIGDA 12th Int. Symp. on Field Programmable Gate Arrays (FPGA), 2004, pp. 162–170.
[2] J. Palmer, B. Nelson, A parallel FFT architecture for FPGAs, in: Proc. of the Int.
Conference on Field Programmable Logic and Applications (FPL 2004), 2004, pp. 948–953.
[3] W. Gentleman, G. Sande, Fast Fourier transforms – for fun and profit, in: Proc. of the AFIPS Joint Computer Conference, vol. 29, 1966, pp. 563–578.
[4] J. Cooley, J. Tukey, An algorithm for machine calculation of complex Fourier series, Mathematics of Computation 19 (1965) 297–301.
[5] B. Blodget, C. Bobda, M. Huebner, A. Niyonkuru, Partial and dynamic reconfiguration of Xilinx Virtex-II FPGAs, in: Proc. of the Int. Conference on Field Programmable Logic and Applications (FPL 2004), 2004, pp. 801–810.
[6] M. Silva, J. Ferreira, Support for partial run-time reconfiguration of platform FPGAs, JSA 52 (12) (2006) 709–726.
[7] J. McAllister, R. Woods, S. Fischaber, E. Malins, Rapid implementation and optimisation of DSP systems on FPGA-centric heterogeneous platforms, JSA 53 (8) (2007) 511–523.
[8] D. Heller, A survey of parallel algorithms in numerical linear algebra, SIAM Review 20 (4) (1978) 740–777.
[9] H. Richter, Multiprocessor with dynamically variable topology, Computer System Science and Engineering 5 (1) (1990) 29–35.
[10] V. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, 1965.
[11] K. Lee, On the rearrangeability of a (2 log N − 1)-stage permutation network, IEEE Transactions on Computers 34 (5) (1985) 412–425.
[12] S. Voigt, T. Teufel, Dynamically reconfigurable dataflow for high-performance digital signal processing on multi-FPGA platforms, in: Proc. of the Int. Conference on Field-Programmable Logic and Applications (FPL 2007), 2007, pp. 633–637.

[13] CoreConnect Bus Architecture – An Open 32-, 64-, 128-Bit Core On-Chip Bus Standard, IBM Microelectronics, 1999.
[14] MontaVista Linux Professional (Edition 3.1) – Optimized for High-Performance Embedded Applications, MontaVista Software Inc., 2004.
[15] DS083: Virtex-II Pro Data Sheet, Xilinx, 2007.
[16] S. Voigt, T. Teufel, Analysis of a dynamically reconfigurable dataflow architecture and its scalable parallel extension for multi-FPGA platforms, in: Proc. of the 16th IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM '08), Stanford University, CA, 2008.
[17] M. Baesler, Development of a PCI Printed Circuit Board for Multi-FPGA Platform Assembling, Master's Thesis, Hamburg University of Technology, 2007.
[18] UG024: RocketIO™ Transceiver User Guide, Xilinx, 2007.
[19] PCI Local Bus Specification (Revision 2.3), PCI Special Interest Group, 2002.
[20] PCI 9030 Data Book, PLX Technology, 2002.
[21] S. Lass, ESL tools make FPGAs nearly invisible to designers, Xcell Journal 58 (2006) 6–8.
[22] Simulink® – Simulation and Model-Based Design, The MathWorks, 2007.
[23] MATLAB® – The Language of Technical Computing, The MathWorks, 2002.
[24] K. Karnofsky, Simulink brings model-based design to embedded signal processing, Xcell Journal 51 (2004) 66–69.
[25] T. Hill, The benefits of FPGA coprocessing, Xcell Journal 58 (2006) 29–31.
[26] DS260: Fast Fourier Transform, Xilinx, 2007.

Sven-Ole Voigt received the MSc and Ph.D. degrees in computer engineering from the Hamburg University of Technology, Germany. He joined NEC Electronics, Singapore, in 2003 and was responsible for embedded multimedia architectures. Since 2004 he has been a researcher at the Institute for Reliable Computing, Hamburg University of Technology, Germany, and was recently promoted to assistant professor. His research interests include high-performance dataflow architectures, reconfigurable application-specific instruction-set processors, embedded systems, and rapid prototyping.

Malte Baesler received the MSc degree in electrical engineering from the Hamburg University of Technology, Germany. Since 2007 he has been a researcher at the Institute for Reliable Computing, Hamburg University of Technology, Germany. His research interests include computer arithmetic, embedded systems, and computer architecture.

Thomas Teufel received the MSc degree in electrical engineering from the University of Bremen, Germany, and the Ph.D. degree in computer science, under the direction of Dr. Ulrich Kulisch, from the Karlsruhe Institute of Technology, Germany. Since 1991 he has been an associate professor of computer engineering at the Institute for Reliable Computing, Hamburg University of Technology, Germany. His research interests include the implementation of algorithms in hardware, chip design, embedded systems for automation and control engineering, rapid prototyping, computer arithmetic, and real-time operating systems. He is a member of the IEEE.