Dynamically reconfigurable dataflow architecture for high-performance digital signal processing


Journal of Systems Architecture 56 (2010) 561–576


S. Voigt *, M. Baesler, T. Teufel
Hamburg University of Technology, Schwarzenbergstrasse 95, 21073 Hamburg, Germany

Article history: Received 15 October 2009; received in revised form 2 April 2010; accepted 26 July 2010; available online 5 August 2010.

Keywords: Dataflow architecture; Hardware reconfiguration; Digital signal processing; Multi-FPGA platform; Parallel FFT

Abstract

In this paper a dataflow architecture is introduced that maps efficiently onto multi-FPGA platforms and is composed of communication channels which can be dynamically adapted to the dataflow of the algorithm. The reconfiguration of the topology can be accomplished within a single clock cycle while DSP operations are in progress. Finally, the programmability and scalability of the proposed architecture is demonstrated by a high-performance parallel FFT implementation.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

In comparison to ASICs, FPGAs are characterized by a 10- to 100-fold logical overhead in chip area due to their ability to be reconfigured. Moreover, the regularly arranged configurable logic cells have to be interconnected via programmable routing switches, and the overall performance heavily depends on the routing results delivered by the design tools. Particularly in complex designs where routing resources become scarce, it is difficult to find good solutions, and the interconnection delay dominates the delay within configurable logic cells. This results in poor clock rates that are usually about 20 times lower than those of general-purpose processors [1]. To overcome this problem, the foremost issue for FPGAs is the need to extract massive amounts of parallelism. Additionally, today's FPGA vendors integrate highly optimized embedded multipliers, fast carry chains, large amounts of on-chip RAM, and dedicated arithmetic routing, all of which facilitate DSP operations. Coupling these features with the massive parallelism provided by FPGAs, the resulting systems can outperform the fastest DSP processors by one to two orders of magnitude. While this can be easily achieved, e.g., for matrix multiplication, it can be difficult in other cases, particularly when more data dependencies exist, e.g., in computing parallel fast Fourier transforms (FFTs) [2–4]. Furthermore, the dedicated DSP resources in FPGAs are strongly limited. Thus, the maximum achievable computational performance highly depends on how efficiently the system architecture scales to multi-FPGA platforms and is bounded by the total communication bandwidth between embedded DSP units. Therefore, modern FPGA devices are usually equipped with a large number of high-speed serial transceivers, which are characterized by high noise tolerance, clock data recovery, and error detection, all of which enable reliable transfer rates of several gigabits per second. Moreover, this allows the easy setup of arbitrary network topologies. Another advantage of FPGAs is their ability to be reconfigured. For this reason, dynamic reconfiguration of FPGA architectures has become increasingly attractive [5,6]. The idea is to map DSP algorithms efficiently onto hardware [7] and modify parts in real time to switch from one function to another, e.g., by loading different filters in multimedia applications or a coprocessor on demand. However, one major drawback is that it takes up to milliseconds to partially reconfigure FPGA architectures. In this paper a dataflow architecture is introduced that can be efficiently mapped onto modern FPGAs. In this architecture, the topology of the interconnection between computational units can be dynamically reconfigured. In contrast to the concept of partially reconfiguring FPGAs, our approach is to connect DSP resources via a dynamically variable topology, so that the reconfiguration can be achieved within a single clock cycle and is done while arithmetic operations are in progress. Hence, the proposed dataflow architecture combines the basic idea of reconfiguration with the performance of scalable parallel processing.

* Corresponding author. Tel.: +49 4042878462; fax: +49 40428784013. E-mail addresses: [email protected] (S. Voigt), [email protected] (M. Baesler), [email protected] (T. Teufel). URL: http://ti3.tu-harburg.de/english (S. Voigt). doi:10.1016/j.sysarc.2010.07.010


2. Background

In the following, the typical characteristics of parallel algorithms are pointed out. Moreover, a concept is explained of how the topology of the architecture can be adapted to the dataflow of the algorithm to provide direct inter-processor communication at all times and to maximize the computational throughput.

2.1. Dataflow in parallel algorithms

One major goal in parallel processing is to minimize the dataflow between computational clusters. Obviously, the difficulty of this optimization increases with the amount of data dependencies involved. For instance, Fig. 1 shows the signal-flow graph of an 8-point fast Fourier transform (FFT) with decimation in frequency (DIF) methodology [3]. The input signal x(n) contains the signal being decomposed, while the output signal X(n) contains the amplitudes of the component sine and cosine waves. The number of samples in the time domain is represented by N, i.e., n = 0, …, N − 1, and in this particular example N = 8. However, in typical applications the number of FFT points is chosen between 32 and 4096. The signal-flow graph shown in Fig. 1 can easily be converted into a graph of functional elements as depicted in Fig. 2, where each square block represents one basic butterfly operation. The lines indicate the dataflow between butterfly units in consecutive stages of the FFT algorithm. If the FFT algorithm is parallelized by assigning the butterfly operations in each row to one computational unit, it is evident that the connections of one particular unit to the others must be the union of the interconnections from all stages to achieve direct communication at all times (see Fig. 3). However, the union of all topologies is usually too complex, because the number of links per unit increases nearly proportionally to the total number of butterfly units. For this reason, it is more efficient to continuously modify the topology of the connections and adapt it to the dataflow of the algorithm. Of course, the time required to reconfigure the topology must be very short compared to the duration of each butterfly computation. In conclusion, this concept can be generalized in the sense that for every algorithm an optimal sequence of topologies exists, which minimizes the inter-processor communication [8].

2.2. Dynamically reconfigurable network topology

In 1990, the MULTITOP multiprocessor system was introduced, in which the inter-processor communication is performed via switchable point-to-point links instead of a shared memory [9]. The MULTITOP network provides M inputs and M outputs and enables the architecture to be matched optimally to the algorithm's dataflow. It is quite similar to the Benes [10] and Lee [11] networks and is composed of S switches given by Eq. (1), each of which can be configured individually in two different ways: parallel (0) or crossed (1).



S = (M/2) · (2·log2 M − 1)    (1)

The number of MULTITOP network switches is of the same order of magnitude as in the Benes and Lee networks, and the topology is free of blocking for all permutations [9]. In contrast to the Lee network, however, the relatively strong meshing is avoided. Moreover, because of its self-similarity the MULTITOP network can be modularly extended (see Section 3.2.2). Mathematically, the purpose of this network can be regarded as a machine that accepts a permutation vector of M numbers and outputs a sorted sequence specified by its switch configurations.
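The stage-dependent dataflow described in Section 2.1 can be made concrete with a short sketch (illustrative Python, not part of the paper): in a radix-2 DIF FFT, the pairing of data indices changes in every stage, so no single fixed point-to-point topology serves all stages.

```python
# Illustrative sketch (not from the paper): which data indices are paired
# by the butterflies of an N-point radix-2 DIF FFT in each stage.
def butterfly_pairs(n, stage):
    """Index pairs (i, i + half) processed together in the given stage."""
    half = n >> (stage + 1)              # distance between paired indices
    return [(i, i + half) for i in range(n) if not (i & half)]

n = 8
for s in range(3):                       # log2(8) = 3 stages, cf. Fig. 1
    print(s, butterfly_pairs(n, s))
# stage 0 pairs (0,4),(1,5),... while stage 2 pairs (0,1),(2,3),...
```

Because every stage produces a different set of pairs, the union of all stage topologies grows with the transform size, which is the motivation for reconfiguring the topology between stages instead.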

3. Dynamically reconfigurable dataflow architecture

The new contribution of this paper is the efficient mapping of a dynamically variable topology onto modern FPGA architectures and, in particular, the development of a scalable extension of this concept for multi-FPGA platforms [12].

3.1. Overview

In the following sections the components of the architecture are described in detail: starting bottom-up with the basic switching unit (SU), of which the dynamically variable topology (DVT) mainly consists, the reconfiguration of the topology and how it can be achieved within a single clock cycle is explained comprehensively. Subsequently, the computational cluster (CC) is defined, which is composed of the DVT and computational units (CUs). Finally, the


Fig. 1. Signal-flow graph of an 8-point FFT.



Fig. 2. 8-point FFT graph using functional blocks.

distribution unit (DU) is introduced that implements the scalable extension for multi-FPGA platforms. The dynamically reconfigurable dataflow architecture (DRDA) can be regarded as a coprocessor that is integrated in a system-on-a-chip (SOC) and controlled by the embedded PowerPC (PPC) core as shown in Fig. 4. All components in the SOC are connected by the IBM CoreConnect™ bus architecture [13]. While the embedded Linux kernel from MontaVista [14] is executed in the double data rate (DDR) SDRAM, the quad data rate (QDR) SRAM holds the data to be computed. On multi-FPGA platforms the dataflow between several DRDAs that reside in different FPGAs is transferred via optical transceivers. Initially, the data to be processed is transmitted from QDR memory to Xilinx Block SelectRAM (BRAM) [15] by the PPC core via the processor local bus (PLB) and the on-chip peripheral bus (OPB). The PLB/OPB controllers depicted in Fig. 5 are combined and mapped into the address space of the PPC core, so that the BRAMs can be easily accessed by memory-mapped I/O. Of course, the address mappings are individual for different sets of DRDA parameters. For example, the number S of topology switches depends on the DVT dimension M (see Section 3.2.2), i.e., the corresponding configuration vectors are stored in different locations. However, all memory mappings are generated automatically by the implemented design generation tools, so that the application programming is completely abstracted from these details (see Section 6). It is important to note that between computational steps, the intermediate results stay within the DRDA and are not written back to memory, avoiding the well-known memory bottleneck. Thereby, all outputs become direct inputs either to the same dynamically reconfigurable dataflow cluster (DRDC) or to others as shown in Fig. 5. The DRDC comprises a computational cluster and distribution units that route the computed data to itself or other clusters. If DRDCs are reused for different sets of data (we call this virtualization), the intermediate results are buffered using BRAMs (see Section 3.2.5). The data can also be received via the optical links from a host PC. The DRDCs are controlled by a dedicated parallel topology controller (PTC). The PTC reconfigures the S topology switches while computations are in progress via a simple handshake protocol (see Section 4.2). In turn, its operations are monitored by the embedded operating system on the PPC core.

Fig. 3. Dataflow between butterfly units in subsequent stages of a parallelized FFT.

3.2. Components

In the following, all components of the dynamically reconfigurable dataflow architecture (DRDA) are comprehensively described bottom-up. In this way the complete DRDA is subsequently built up.

3.2.1. Switching unit

The basic switching unit (SU) is depicted in Fig. 6 and simply consists of two 4-input look-up tables (LUTs) that operate in parallel. The signal S_sel asynchronously controls the output of the switching unit (see Fig. 6c) and is routed to the data bus of a dual-ported BRAM, which is controlled by the parallel topology controller (PTC) as shown at the top of Fig. 5. The routing of the SU is quite simple. Whenever the input S_sel equals '0', the upper input S_in(0) is routed to the upper output S_out(0) and at the same time the lower input S_in(1) is routed to the lower output S_out(1). Likewise, if S_sel equals '1', the upper input S_in(0) is routed to the lower output S_out(1) and, in turn, the lower input S_in(1) is routed to the upper output S_out(0). The complete mapping is summarized in Fig. 6b, where 'x' on input I3 denotes a "don't care" at each LUT. Optionally, this input can be used to generate a specific bit sequence whenever I3 is asserted. The implementation described in this paper is based on 4-input LUTs because our multi-FPGA platform is composed of Xilinx Virtex-II Pro FPGA devices (see Section 5). However, one basic SU can also be efficiently implemented by a single 6-input LUT with two outputs (see Fig. 6d), as used in next-generation FPGAs, e.g., Xilinx Virtex-5 devices. Hence, the approach is not restricted to FPGAs with 4-input LUTs only.

3.2.2. Dynamically variable topology

The dynamically variable topology (DVT) of dimension M (see Fig. 7) is based on the MULTITOP network (see Section 2.2) and consists of switching units (SUs). The corresponding symbol of


Fig. 4. Integration of the dynamically reconfigurable dataflow architecture.


Fig. 5. Multi-FPGA dynamically reconfigurable dataflow architecture.

the SU is depicted in Fig. 6a. Hence, the DVT provides M serial communication channels, which are free of blocking for all permutations and can be dynamically adapted to the dataflow of the algorithm. Depending on the number of computational units (CUs; see Section 3.2.4), the DVT can also be extended modularly. However, depending on the FPGA technology, each SU introduces a specific propagation delay, e.g., about 0.28 ns for a Xilinx Virtex-II Pro FPGA device with the highest speed grade (SG) of "7". In addition, the extra net delay between connected SUs must be considered. Furthermore, it is important to note that the DVT consists of M fully asynchronous paths because no flip-flops are integrated. Given that the depth of the topology increases logarithmically with M, the overall propagation delay increases correspondingly. The delay for different FPGA speed grades is summarized in Table 1 and plotted in Fig. 8. The results are estimated by the Xilinx synthesis tool (XST) of the Integrated Software Environment (ISE) Foundation™ version 9.2i (service pack 4).
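As a behavioral illustration (our own sketch, not the paper's HDL), the SU of Section 3.2.1 and the switch count of Eq. (1) can be modeled as follows:

```python
import math

# Behavioral sketch of the 2x2 switching unit (SU): S_sel = 0 routes the
# inputs straight through ("parallel"), S_sel = 1 swaps them ("crossed").
def switching_unit(s_in, s_sel):
    a, b = s_in
    return (a, b) if s_sel == 0 else (b, a)

# Eq. (1): S = M/2 * (2*log2(M) - 1) switches for an M-channel DVT.
def num_switches(m):
    return (m // 2) * (2 * int(math.log2(m)) - 1)

print(switching_unit(("x0", "x1"), 1))            # -> ('x1', 'x0')
print([num_switches(m) for m in (2, 4, 8, 16)])   # -> [1, 6, 20, 56]
```

The computed counts reproduce the SU column of Table 1.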


Fig. 6. Switching unit (SU).

Fig. 7. Dynamically variable topology (DVT).

Table 1
DVT: resource usage and propagation delay.

M   | SUs | LUTs | Delay SG "6" (ns) | Delay SG "7" (ns)
2   | 1   | 2    | 0.320             | 0.280
4   | 6   | 12   | 2.258             | 1.990
8   | 20  | 40   | 4.060             | 3.568
16  | 56  | 112  | 5.862             | 5.147
32  | 144 | 288  | 7.664             | 6.725
64  | 352 | 704  | 9.466             | 8.304
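The Table 1 figures can be cross-checked against the claimed logarithmic growth; this short sketch (our own, using the SG "7" column) shows a roughly constant delay increment per doubling of M, i.e., per additional network stage:

```python
# Delay values taken from Table 1 (speed grade "7"); the increment per
# doubling of M is roughly constant, i.e. delay grows ~linearly in log2(M).
delays_ns = {2: 0.280, 4: 1.990, 8: 3.568, 16: 5.147, 32: 6.725, 64: 8.304}
ms = sorted(delays_ns)
increments = [round(delays_ns[b] - delays_ns[a], 3) for a, b in zip(ms, ms[1:])]
print(increments)   # each doubling of M adds roughly 1.6 ns
```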

Since interconnection delays dominate logic delays in modern silicon chips, particularly in FPGA architectures, the extra routing delays introduced by the regular arrangement of logic cells cannot be neglected and have to be carefully investigated after "place & route". The meshing of the DVT, however, is quite local, which yields fairly acceptable routing efficiency. Moreover, all switching units (SUs) can be easily distributed over the entire FPGA chip. By contrast, routing wide parallel buses is far more difficult. Thus, the "place & route" tools deliver much better results for serial communication channels.


Fig. 8. DVT: maximum frequency (Xilinx Virtex-II Pro FPGA).

Another advantage of a serial dataflow implementation is that the architecture is not restricted to a specified operand precision. Only a slight modification of the computational units (CUs), along with a suitable adjustment of the corresponding virtualization buffer (VB; see Section 3.2.4), is necessary. Nevertheless, it can be advantageous to replicate the communication channels to be of width W. We come back to this in Section 4.

Fig. 9. Dynamically reconfigurable dataflow cluster (DRDC).
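Section 3.2.4 transfers each operand of precision B over width-W channels in B/W successive parts; the serialization can be sketched as follows (illustrative Python with arbitrary example values, not the paper's carry-chain implementation):

```python
# Sketch (not the paper's implementation): a B-bit operand is transferred
# in B//W successive W-bit slices (LSB first) and reassembled afterwards.
def to_slices(value, b, w):
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(b // w)]

def from_slices(slices, w):
    return sum(s << (i * w) for i, s in enumerate(slices))

x = 0xBEEF                            # arbitrary 16-bit example operand
slices = to_slices(x, b=16, w=4)      # four 4-bit transfers on a W = 4 channel
assert from_slices(slices, w=4) == x  # B-to-W conversion is the exact inverse
```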


Fig. 10. Universal architecture of the computational unit (CU).

3.2.3. Computational cluster

The computational cluster (CC) comprises one DVT of dimension M together with M/2 computational units (CUs; see Section 3.2.4). Initially, the dataflow channel width W is serial, i.e., W = 1. However, the parallel extension (W > 1) can be advantageous [16] and is introduced in Section 4. In Fig. 9 the CC and its components are highlighted in grey. Depending on the DSP algorithm to be implemented, the CCs can be composed of either identical or different CUs. For example, when mapping the parallel FFT algorithm onto the dataflow architecture (see Section 7), all CCs are assembled from identical CUs, each implementing a butterfly operator (with different complex phase factors to be multiplied). The CC combines the M unidirectional handshake signals from all contained CUs, which are controlled by the parallel topology controller (PTC) as shown in Figs. 5 and 9. By this signal, each CU indicates its readiness to transfer the results. After the PTC has acknowledged the request, the transfer is started and simultaneously new data to be computed is received. When finished, the CU must deassert its request line for at least one cycle. Immediately after the PTC has detected the reset of all CU request signals, the DVT is reconfigured (see Section 4.2).

3.2.4. Computational unit

The computational unit (CU) is application-specific and provides two inputs and two outputs. Both inputs and outputs of the CU have a data width of W. Because the operands' precision B is fully user-definable and usually an integer multiple of W, the operands must be transferred in B/W successive parts. The arithmetic logic unit (ALU) is typically implemented with the full data width B. Therefore, the CU is mainly composed of three different blocks as shown in Fig. 10. The W-to-B conversion can be efficiently implemented by using the dedicated fast carry chains of the slices. This is applicable when the first operation on both operands is a summation. For example, this is the case for the FFT in DIF methodology (see Section 2.1). Then the W-to-B conversion comes with the addition/subtraction at no extra cost. The implementation of the ALU is completely application-specific. For this reason, it is described in detail later in Section 7. Finally, after the computation is completed, a B-to-W conversion must be accomplished. This is efficiently implemented by a dual-ported BRAM memory block with an asymmetric aspect ratio.

3.2.5. Distribution unit

The distribution unit (DU) is used to connect a number L of physical CCs with each other as depicted in Fig. 11. The input of each DU can be routed either to the same CC (upper path) or, via a cascade of demultiplexers, to other CCs (lower path). In this way several CCs can be arbitrarily interconnected and mapped either on a single FPGA or transparently distributed over several FPGAs. Fig. 9 shows how the DU is integrated into the proposed dataflow architecture. Let N be the number of FFT points (e.g., N = 16 in a 16-point FFT); then the required number of physical CCs would be Lmax = N/M. These CCs have to be connected via Lmax demultiplexers in the DUs, and the additional delay introduced by the demultiplexers is directly proportional to the number of physically interconnected CCs. However, when only L < Lmax CCs are available, these have to be reused V = Lmax/L = N/(M·L) times. In this case V − 1 additional virtualization buffers (VBs) are required in the DU to buffer intermediate results. These are implemented by one dual-ported BRAM memory block per dataflow channel, which can also be exploited to implement a B-to-W conversion at no extra cost (see Section 7). In summary, the number L of CCs and their dimension M can be chosen before the dataflow architecture is mapped onto FPGA architectures. Depending on these parameters, the latency caused by the virtualization buffers and the interconnection delay through the cascaded demultiplexers change (the latter are implemented as cascaded LUTs at the inputs of each CC). Therefore, based on the available resources, an application-specific optimal choice must be found.

4. Parallel extension

In the previous sections the dataflow channels of the DRDA were defined to be serial only, i.e., W = 1. However, a parallel extension of the transfer width W is motivated by the logarithmic increase of the propagation delay mentioned in Section 3.2.2. While a serial dataflow can be compensated by high operating frequencies for small values of M, it is no longer acceptable for M ≥ 16, i.e., when the maximum operating frequency falls below 200 MHz (see Fig. 8). In particular, when operands with single (32-bit) or double precision (64-bit) have to be transferred between computational stages, a much higher bandwidth is essential for high-performance digital signal processing.

4.1. DVT replication

In order to increase the channel width W, we replicate the dynamically variable topology (DVT) W times, where W denotes the desired dataflow channel width (see Fig. 9). Since all data bits are routed to the same destination, the S control signals of the PTC can be used for all replicates. Hence, no extra control logic is necessary and the total LUT resource usage scales linearly with W. Since all SUs can be easily distributed over the entire FPGA chip, this still yields fairly acceptable routing efficiency. Moreover, each DVT can be operated at nearly the same maximum frequency estimated for serial transmissions, i.e., when W = 1. However, a maximum parallel transfer width W equal to the precision B of the operands is not always practically feasible because the total number of required LUTs is proportional to W. Moreover, some application-specific CU component implementations further limit the advantages of a maximum parallel transfer width. For example, when implementing the parallel FFT in DIF methodology, the propagation delay of the carry chains also increases proportionally with the number of bits to be summed up in parallel. Therefore, the transfer width W is user-definable.

Fig. 11. Distribution unit (DU).

4.2. Parallel topology controller

The reconfiguration of the DRDA is dynamically controlled by a simple state-based parallel topology controller (PTC). The PTC is usually assigned to one DRDC and its replicates. However, depending on the algorithm, one PTC can also control several DRDCs simultaneously (when instantiated on the same FPGA). This is possible, e.g., for parallel FFT computations. In Fig. 12 the corresponding state machine of the PTC is depicted. It consists of only four states that implement, on the one hand, the handshake protocol between the PTC and its assigned CUs and, on the other hand, the control of the DVT reconfiguration. In the INIT state all components of the DRDA are initialized. After the PPC core has started the DRDA operation, the PTC waits for all CUs to complete their individual initialization phase. After successful completion, i.e., when all CUs have asserted their REQ signal, the PTC state machine switches to the next state, namely DATA_DVT. On entering this state, the corresponding dataflow is always started immediately. Depending on the operand precision B and the data channel width W, the PTC controls the data transfer by counting the number of fixed transfer cycles as well as by asserting the required enable signals of the input/output memory blocks and the virtualization buffer (VB), respectively. After the data transfer has been completed, the DVT is reconfigured in one cycle and the state machine reaches its last state, CU_REQ, in which the PTC waits again for the CU request signals to be asserted. This time, however, the CUs indicate that the computations are completed and the next data transfer is ready to start.

4.3. DRDA implementation results

The propagation delay in the dataflow channels increases logarithmically with the number L of DRDCs and their corresponding dimension M. While the replication linearly increases the overall bandwidth, the increase of logical resources is also proportional to the dataflow channel width W. Hence, the bandwidth-to-resource ratio is independent of W and decreases logarithmically with L and M. In conclusion, an application-specific trade-off between the DRDA parameters L, M, and W must be found.


5. Multi-FPGA hardware design

To prove the scalability of the proposed dataflow architecture, we developed a hardware board [17] that comprises two Xilinx Virtex-II Pro FPGAs connected at board level via six high-speed RocketIO™ transceivers. Moreover, multi-board computing platforms can be easily composed via four optical links and have been successfully tested, each link operating at up to 3.125 Gb/s. Fig. 13 shows the block diagram of our printed circuit board (PCB) hardware design. We chose the Xilinx XC2VP30 FPGA [15], which integrates more than 30,000 logic cells, two IBM PowerPC® 405 RISC processor cores (operating at up to 400 MHz), and eight serial RocketIO™ embedded multi-gigabit transceivers (MGTs) [18]. The PCB design is based on the PCI revision 2.3 standard and fully complies with the corresponding electrical and mechanical specification [19]. The PCI target I/O accelerator chip PCI 9030 from PLX Technology [20] is used to bridge the PCI bus to a 32-bit local bus that, in turn, connects to both FPGAs. The PCI bus is employed


Fig. 12. PTC state machine.
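The four states of Fig. 12 can be summarized in a small transition sketch (hypothetical signal names and simplified single-cycle timing; not the paper's HDL):

```python
# Simplified model of the PTC state machine of Fig. 12. The signal names
# (ppc_start, all_cu_req, transfer_done) are illustrative assumptions.
def ptc_next(state, ppc_start=False, all_cu_req=False, transfer_done=False):
    if state == "INIT":           # wait for PPC start and all CU requests
        return "NOTIFY" if (ppc_start and all_cu_req) else "INIT"
    if state == "NOTIFY":         # CU notification takes one cycle
        return "DATA_DVT"
    if state == "DATA_DVT":       # transfer data; DVT reconfigured in the
        return "CU_REQ" if transfer_done else "DATA_DVT"  # cycle after it
    if state == "CU_REQ":         # wait for CUs to finish their computation
        return "DATA_DVT" if all_cu_req else "CU_REQ"
    raise ValueError(state)

s = ptc_next("INIT", ppc_start=True, all_cu_req=True)
print(s)   # -> NOTIFY
```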



Fig. 13. PCB hardware design: block diagram.

for FPGA configuration, the transfer of control and status information, as well as power supply. Additionally, the data to be computed by the multi-FPGA platform can be transferred via the PCI bus from the host to on-board memory. However, it is important to note that during DSP computations the dataflow between boards is accomplished solely via the optical links rather than the PCI bus. In the dynamically reconfigurable dataflow architecture (DRDA) the dataflow is controlled by a PowerPC (PPC) core and a dedicated parallel topology controller (PTC) in the background. In this way, the computational throughput is not affected by the control flow, because the dynamically variable topology (DVT) is reconfigured while DSP computations are in progress. Whereas this can be efficiently achieved at FPGA level, on multi-FPGA platforms the overall performance highly depends on the total inter-FPGA communication bandwidth. Therefore, our PCB prototype is equipped with four high-speed optical links that operate at up to 3.125 Gb/s each, as depicted in Fig. 13. Moreover, both FPGAs are connected at board level by six RocketIO™ embedded MGTs (which also operate at up to 3.125 Gb/s) as shown in Fig. 14, which yields an overall communication bandwidth of more than 30 Gb/s.

6. Programming model


In recent years, significant increases in the silicon and algorithmic complexity of today's highly integrated embedded hardware and software systems have triggered a rise in design and verification costs. For this reason, the need for powerful development approaches has emerged, and a new paradigm known as electronic system level (ESL) design promises to usher in a new era in FPGA design. The term ESL refers to tools and methodologies that raise design abstraction to levels above the current register transfer level


Fig. 14. Dataflow topology of the on-board RocketIO transceivers.



(RTL), because RTL development is still characterized by time-consuming and error-prone design cycles [21].

6.1. Electronic system level design

The early ESL tools on the market offer domain-specific synthesis to hardware from languages like C/C++ or MATLAB, aimed at everything from accelerating software algorithms to creating high-performance digital signal processing (DSP) engines, in particular for algorithms that demand more than what conventional von Neumann processor architectures can currently deliver. The MathWorks has demonstrated that model-based design with Simulink® [22] produces dramatic reductions in development time, cost, and risk. In addition, MATLAB® [23] is a high-level technical computing language and interactive environment for analyzing data and developing algorithms. In 2004, Xilinx released the System Generator for DSP development tool, which fully supports the MATLAB and Simulink software packages [24]. It automates the design, debugging, and deployment of FPGA-based DSP systems with push-button performance. Thereby, for the first time, system architects, DSP engineers, and hardware designers could model complete DSP systems and subsequently use a direct path to implementation by automatic HDL code and testbench generation, including test vectors. Hence, Xilinx extended MATLAB/Simulink, the state-of-the-art tool for accelerating engineering and science tasks, with an integrated system-level design environment to build sophisticated DSP systems and, at the same time, to drastically reduce time-to-market design cycles. A typical approach in System Generator is to import an HDL module and use it as a component. The System Generator black box block allows VHDL, Verilog, and electronic data interchange format (EDIF) designs to be imported and behaves exactly like other System Generator blocks. In this way, the DRDA top-level file, which is written in VHDL, has been imported into System Generator.
By means of VHDL generics, the DRDA architecture is fully customizable. Hence, based on the application-specific requirements, the appropriate DRDA parameters (see Section 3) can be

manually chosen or automatically optimized for certain design constraints. Because the computational unit (CU) is also included, DRDA blocks have to be chosen from a library for the different DSP algorithms. Fig. 15 shows an example of a 32-point FFT as described in the following Section 7. All parameters of the DRDA can be easily configured within a so-called masking dialog that is opened by a simple right-click on the DRDA box (see Fig. 15a).

6.2. Coprocessor model

As previously described, in the proposed architecture the dataflow is controlled by a PowerPC (PPC) core and a dedicated parallel topology controller (PTC) in the background. This means that the computational performance is not affected by the control flow, because the topology of the dataflow communication channels between computational units (CUs) is reconfigured while DSP computations are in progress. For this reason the switch configurations of the dynamically variable topology (DVT) must be continuously reloaded into the associated BRAM blocks. Of course, the data format depends on the number of switches. However, based on the DVT dimension M, the final DVT switch configuration vector is automatically written in the correct data format to a BRAM memory file. In this way, it is made transparent to the application programmer. In conclusion, the multi-FPGA platform can be regarded as a high-performance coprocessor board [25]. The idea is that the application engineer simply chooses a DSP algorithm in a graphical
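The paper does not specify the BRAM data format, but the idea of packing the DVT switch configuration vector into correctly sized memory words can be sketched as follows (a behavioral illustration only; the function and parameter names are hypothetical, not part of the DRDA tool flow):

```python
def pack_switch_vector(switches, word_width=32):
    """Pack a list of boolean switch settings (a DVT configuration
    vector) into fixed-width memory words, least significant bit first.

    `switches` and `word_width` are illustrative; the real data format
    depends on the number of switches, i.e., on the DVT dimension M.
    """
    words = []
    for base in range(0, len(switches), word_width):
        word = 0
        for bit, on in enumerate(switches[base:base + word_width]):
            if on:
                word |= 1 << bit
        words.append(word)
    return words

# 40 switch bits fit into two 32-bit memory words.
config = pack_switch_vector([True, False, True] + [False] * 37)
```

Writing such words to a memory initialization file is what makes the switch format transparent to the application programmer.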

Fig. 16. Radix-2 butterfly computation (DIF): the inputs a and b of width B yield a + b (width B + 1) and (a − b)·W_N^k (width B + 1 + B_W), with the twiddle factor W_N^k = e^(−j2πk/N).

Fig. 15. Interfacing Xilinx System Generator Blocks in Simulink.

Fig. 17. CU FFT implementation: overview. The computational unit (CU) comprises an adder and a subtractor as summation operators (SOs) with W-to-B converters, a complex multiplier fed with phase factors from a BRAM, optional scaling and rounding units, and B-to-W converters at the outputs.

7. Application

Generally, the dynamically reconfigurable dataflow architecture (DRDA) is suitable for any kind of distributed high-performance digital signal processing (DSP). To demonstrate its effectiveness, the operational principle is explained using a high-performance parallel fast Fourier transform (FFT).

7.1. Fast Fourier transform

Fig. 18. CU FFT implementation: summation operator (SO).

user interface (GUI) on the host; based on the available resources and optional criteria, the DRDA design tools then automatically generate the best-suited FPGA configuration files, which can subsequently be downloaded to the multi-FPGA coprocessor platform.

The Cooley-Tukey FFT algorithm [4] is a well-known algorithm to compute the DFT in O(N log N). The FFT algorithm decomposes a DFT with N points into N DFTs with a single point each. The second step is to calculate the N frequency spectra corresponding to these N time-domain signals. Lastly, the N spectra are synthesized into a single frequency spectrum. Because the decomposition of this FFT is done in the time domain rather than in the frequency domain, it is commonly referred to as the decimation in time (DIT) algorithm. There also exists a very popular variant of the Cooley-Tukey FFT algorithm, the so-called Gentleman-Sande FFT [3]. This algorithm carries out the decomposition in the frequency domain rather than in

Fig. 19. CU FFT implementation: W-bit adder/subtractor. LUT-based MUX/XOR carry-chain logic with inputs a[W−1:0] and b[W−1:0], ADD and carry-in (CI) controls, sum output s[W−1:0], and carry out (CO).


Fig. 20. CU FFT implementation: W-to-B converter (shift register stages #0 to #B/W−1).

time domain. The corresponding signal-flow graph of an 8-point decimation in frequency (DIF) FFT has already been shown in Fig. 1 in Section 2.1. After the conversion into a graph of functional elements (see Fig. 2), each block represents one butterfly operator as depicted in Fig. 16, where the multiplication by the twiddle factor W_N^k is performed after the complex point b has been subtracted from a. We chose the Gentleman-Sande FFT, since it is advantageous for the CU implementation, as explained in the following section.
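The Gentleman-Sande scheme can be captured in a few lines: every stage forms a + b and (a − b)·W, and the results fall out in bit-reversed order. The following is a minimal software reference model only, not the hardware implementation, in which the CUs compute the same butterflies in parallel:

```python
import cmath
import math

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT (Gentleman-Sande).

    Takes natural-order input; the butterflies leave the result in
    bit-reversed order, which is undone at the end.
    """
    x = list(x)
    n = len(x)
    span = n
    while span >= 2:
        half = span // 2
        w_span = cmath.exp(-2j * math.pi / span)  # twiddle base of this stage
        for start in range(0, n, span):
            w = 1.0 + 0j
            for k in range(half):
                a, b = x[start + k], x[start + k + half]
                x[start + k] = a + b               # upper butterfly output
                x[start + k + half] = (a - b) * w  # twiddle applied after subtraction
                w *= w_span
        span = half
    # Undo the bit-reversed output ordering.
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]
```

Comparing the output against a direct O(N²) DFT for small N confirms the recursion.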

7.2. DRDA implementation

In Section 3 all components of the DRDA have been explained in detail except for the computational unit (CU), which is application-specific. In the following, the CU implementation of the FFT butterfly operation (with DIF) is described.

7.2.1. CU architecture

The CU implementation for the parallel FFT application is depicted in Fig. 17. As previously mentioned in Section 3.2.4, the summation in the butterfly operation (see Fig. 16) can be efficiently implemented by using the dedicated fast carry chains and is implicitly included in the W-to-B conversion. Fig. 18 shows the summation operator (SO), which comprises a W-bit adder/subtractor, a simple edge-triggered flip-flop, and the W-to-B converter. The SO provides two inputs of width W each and one output of width B + 1. The carry out (CO) bit is used to forward the exact result of the summation to the ALU. The combinational logic of the W-bit adder/subtractor is depicted in Fig. 19. Starting with the W least significant bits, in all, B/W partial summations are computed. It is important to note that the synthesis must be constrained correctly to obtain the desired mapping onto the dedicated FPGA resources. Alternatively, the corresponding Xilinx library macros can also be used, which already include the required relative location (RLOC) constraints. The W-bit partial sum flows asynchronously to the adjacent W-to-B converter, which is composed of W shift registers (SRs) as depicted in Fig. 20. More precisely, each bit of the result s is connected to the input of its associated SR, where each shift register consists of B/W concatenated flip-flops. The partial sums are shifted subsequently with each transfer cycle into the SRs. For this reason, the output bits must
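The B/W partial summations with carry forwarding can be modeled as follows (a behavioral sketch of the SO arithmetic, not the carry-chain hardware; the function and parameter names are illustrative):

```python
def serial_add(a, b, bit_width=16, chunk=4):
    """Add two `bit_width`-bit operands in bit_width // chunk partial
    summations of `chunk` bits each, forwarding the carry between
    chunks -- a behavioral model of the SO's serial summation.

    Returns the bit_width-bit sum and the final carry-out (CO) bit,
    which together form the exact (bit_width + 1)-bit result.
    """
    mask = (1 << chunk) - 1
    carry = total = 0
    for i in range(0, bit_width, chunk):
        partial = ((a >> i) & mask) + ((b >> i) & mask) + carry
        total |= (partial & mask) << i  # chunk of the running sum
        carry = partial >> chunk        # carry into the next chunk
    return total, carry

# 0xFFFF + 0x0001 wraps to 0 with CO = 1, i.e. the exact 17-bit sum.
assert serial_add(0xFFFF, 0x0001) == (0, 1)
```

The chunked result matches the full-width addition, which is why the W-to-B conversion can absorb the summation at no extra cost.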

Table 2
CU complex butterfly implementation: resource usage [LUTs].

No. CUs (M = 2)   Fixed-point                                Floating-point
                  8-bit    16-bit    24-bit    32-bit        32-bit
1                    77       149       437       629          5854
2                   154       298       874      1258             –
4                   308       596      1748      2516             –
8                   616      1192      3496      5032             –
16                 1232      2384      6992         –             –
32                 2464      4768         –         –             –
64                 4928         –         –         –             –
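The LUT counts in Table 2 scale exactly linearly with the number of CUs, i.e., there is no shared overhead between butterfly units; a quick check on the 8-bit fixed-point column:

```python
# Per-CU LUT cost of the 8-bit fixed-point complex butterfly (Table 2).
LUTS_PER_CU = 77

# Doubling the number of CUs doubles the LUT usage: the measured
# column equals 77 * 2**i for CU counts 1, 2, 4, ..., 64.
measured = [77, 154, 308, 616, 1232, 2464, 4928]
predicted = [LUTS_PER_CU * 2**i for i in range(7)]
assert predicted == measured
```

The same doubling holds for the 16-, 24-, and 32-bit columns up to the point where the device resources are exhausted.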

Table 3
Xilinx FFT LogiCore (Version 5.0): configuration settings.

Configuration settings     Chosen parameters
Transform size             8, 16, 32, 64, 128, 256, 512, 1024
Data sample precision      16
Phase factor precision     16
Arithmetic type            Scaled fixed-point
Rounding mode              Truncation


Fig. 21. Xilinx FFT LogiCore (Version 5.0): Radix-2, burst I/O architecture.

Fig. 22. Number of required transform cycles against N (Xilinx FFT core vs. DRDA FFT core, N = 8 to 1024).

Table 4
FFT core implementations: number of transform cycles.

N                       8      16      32      64     128     256     512    1024
Xilinx core (W = 16)   72     121     214     403     800   1,645   3,450   7,303
DRDA core (W = 16)     19      25      31      75     163     355     768   1,710
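From the cycle counts in Table 4, the speedup of the DRDA core over the Xilinx core peaks at N = 32, which reproduces the factor of 6.9 quoted in the text:

```python
# Transform cycle counts from Table 4 (W = 16).
xilinx = [72, 121, 214, 403, 800, 1645, 3450, 7303]
drda = [19, 25, 31, 75, 163, 355, 768, 1710]

# Per-size speedup of the DRDA core; the maximum, 214 / 31, is at N = 32.
speedups = [x / d for x, d in zip(xilinx, drda)]
assert round(max(speedups), 1) == 6.9
```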

be reordered accordingly. Because the additional net delay between the W-bit adder/subtractor outputs and the shift register inputs is negligible compared to the combinatorial delay of the W-bit adder/subtractor, the W-to-B conversion is achieved at no extra cost. After summation, in the CU ALU the lower W-to-B converter output is multiplied by the complex phase factor W_N^k as shown in Fig. 16. Optionally, the result is scaled and rounded. If no rounding is selected, the result is simply truncated to the width of B bits. Finally, the widths of both results are converted from B to W, so that the data can be subsequently transferred through the dynamically variable topology (DVT). The B-to-W conversion is easily achieved by a dual-ported BRAM that is configured with an asymmetric aspect ratio. At the same time, these BRAMs can be exploited as virtualization buffers (VBs) when the VBs are chosen to be located directly after the CU. This virtualization is required when the physical CUs must be logically reused V times for N > M and the intermediate results have to be buffered (see Section 3.2.5).

7.2.2. DRDA: user-defined data type and precision

Generally, in the DRDA the operands are neither restricted to a special data type nor to a particular precision due to the serial data-


Table 5
FFT core implementations: resource usage [FPGA slices].

N                        8           16          32          64          128         256         512         1024
Xilinx core (W = 16)    0.6K 4.6%   0.7K 4.8%   0.7K 5.0%   0.7K 5.1%   0.7K 5.2%   0.8K 5.6%   0.8K 5.9%   0.8K 6.1%
DRDA core (W = 1)       1.3K 9.2%   2.4K 11.7%  4.6K 33.6%  –           –           –           –           –
DRDA core (W = 16)      1.6K 11.7%  3.3K 24.0%  7.0K 50.6%  6.7K 49.2%  7.3K 53.2%  7.3K 53.2%  7.3K 53.5%  6.7K 49.1%

flow communication channels. Of course, the total resource usage highly depends on the chosen data type and its precision B as well as on the other DRDA parameters; for example, the total resource usage of the CU implementation for the FFT is summarized in Table 2.

7.3. Results

To evaluate the computational performance of the DRDA FFT implementation, the results are analyzed in the following. For this purpose, first the Xilinx fast Fourier transform (FFT) LogiCORE™ intellectual property (IP) [26] is described as a reference core. Secondly, the results measured after the "place & route" phase in the static timing analysis of both FFT cores are compared. Finally, further improvements of the DRDA FFT core are proposed that roughly double the speedup compared to the already reduced transform latency achieved by the standard, non-optimized DRDA FFT core implementation.

7.3.1. Xilinx fast Fourier transform LogiCore IP

The Xilinx FFT LogiCore™ IP was chosen because the core can be configured for different transform sizes, data sample and phase factor precisions, arithmetic types, and several complex butterfly implementations. In this way, the Xilinx FFT core can be easily generated for various parameters and accurately compared with the corresponding DRDA FFT core configurations. The results of the Xilinx FFT core are based on the toolchain of the Integrated Software Environment (ISE) Foundation™ version 9.2i (service pack 4). The chosen LogiCore™ configuration parameters are summarized in Table 3 and the corresponding radix-2, burst I/O architecture is shown in Fig. 21.

7.3.2. Benchmarking

In Fig. 22 the number of required transform cycles for both FFT core implementations is plotted against the number of sampled signal points. For N = 8, N = 16, and N = 32 points, the computa-

Table 6
FFT core implementations: maximum frequency (MHz).

N                        8         16        32        64        128       256       512       1024
Xilinx core (W = 16)    228.83    228.83    228.83    228.83    228.83    228.83    228.83    228.83
DRDA core (W = 1)       156.27    128.63    100.25    –         –         –         –         –
DRDA core (W = 16)      143.35    105.50    80.00     79.02     77.63     82.54     81.50     80.93

Fig. 23. Overall transform latency [µs] after place and route against the number of points N (Xilinx FFT core vs. DRDA FFT core, SG6).


Table 7
FFT core implementations: overall transform latency after place and route (µs).

N                        8        16       32       64       128      256      512      1024
Xilinx core (W = 16)    0.315    0.529    0.935    1.761    3.496    7.189    15.077   31.914
DRDA core (W = 16)      0.133    0.237    0.388    0.951    2.100    4.301    9.423    20.619
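The latencies in Table 7 follow directly from the cycle counts in Table 4 divided by the maximum frequencies in Table 6; for example, at N = 32 the quoted 141% improvement of the DRDA core can be reproduced:

```python
# N = 32 figures from Tables 4 and 6 (W = 16).
xilinx_cycles, xilinx_mhz = 214, 228.83
drda_cycles, drda_mhz = 31, 80.00

# Overall transform latency in microseconds: cycles / f_max[MHz].
xilinx_us = xilinx_cycles / xilinx_mhz   # ~0.935 us, as in Table 7
drda_us = drda_cycles / drda_mhz         # ~0.388 us, as in Table 7

# The DRDA core's latency is 141% better for N = 32.
improvement = (xilinx_us / drda_us - 1) * 100
assert round(improvement) == 141
```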

Fig. 24. Multi-clock domains: dual-ported BRAMs serve as pipelining buffers (PBs) and virtualization buffers (VBs) between the M×M dynamically variable topology (DVT) and the computational units (CUs) of a computational cluster (CC).

tional cluster (CC) of the DRDA FFT core integrates the maximum number of computational units (CUs). Hence, direct communication is achieved in all stages. For N > 32 and W = B = 16, the available FPGA resources in an XC2VP30 device are not sufficient to implement CCs with a dimension of M > 32. For this reason, virtualization must be used from N = 64-point to N = 1024-point FFTs, where the CUs are reused V times. However, the virtualization can also be exploited to improve the data throughput, since it allows the results of one butterfly computation to be transferred while the next logical iteration is processed. In this way, the CUs are kept busy at all times. Moreover, the achievable speedup can be further improved by implementing multiple clock domains (see Section 7.3.3). In conclusion, the resulting speedup in the number of transform cycles of the DRDA FFT core is up to 6.9 (for W = 16) in comparison to the Xilinx FFT core (see Table 4). However, this comes at the cost of clearly higher resource usage, as summarized in Table 5. As expected, the higher resource usage also has a direct impact on the maximum frequency measured after the "place & route" phase using the static timing analyzer (see Table 6). While the maximum frequency of the compact and highly optimized Xilinx FFT core implementation is only dependent on the multiplier blocks, the maximum frequency of the DRDA FFT core implementation is limited by the propagation delay through the dynamically variable topology (DVT). Because virtualization is used for N > 64, the resulting resource usage and maximum frequency (for speed grade "-6") are roughly the same from N = 32 to N = 1024. Nevertheless, the overall transform latency of the DRDA FFT core is up to 141% (for N = 32, W = 16) better than that of the related Xilinx FFT core (see Fig. 23 and Table 7). When virtualization is used, the improvement is between 86% (for N = 64, W = 16) and 55% (for N = 1024, W = 16).

7.3.3. Further improvements

Particularly when the W-to-B converter in the CU is implemented with the fast carry logic (see Section 7.2.1), the overall propagation delay through the DVT is additionally increased. Hence, as depicted in Fig. 24, another dual-ported BRAM was placed in front of the CUs as a pipelining buffer (PB) to decouple the logic and allow the implementation of multiple clock domains. This can also be exploited to clock the CU logic at higher frequencies, which yields transform cycle speedups of more than one order of magnitude.

8. Conclusion

In this paper we presented an FPGA-based dataflow architecture that is composed of communication channels which can be dynamically adapted to the dataflow of the algorithm and maps efficiently onto multi-FPGA hardware platforms. The topology can be reconfigured within a single clock cycle while DSP operations are in progress. Moreover, only the computational unit of the DRDA components is application-specific and must be implemented according to the functional blocks of the DSP algorithm. The remaining components are universally applicable because only the raw dataflow is routed. Hence, for the universal DRDA implementation only the configurations of the dynamically variable topology along with the distribution units must be generated


based on the dataflow graph of the DSP algorithm and have to be loaded into the corresponding configuration memory. In this way, the parallel topology controller and its associated components accomplish the dataflow without any knowledge of the DSP algorithm: after the operands have been transferred, the next topology configuration is loaded and the controller then waits for all computational units to signal the completion of the current DSP computation. Subsequently, the next data transfer is started, and so on. Thus, the DRDA is suitable for various DSP algorithms. Finally, in this paper the efficiency of the proposed architecture was demonstrated by a parallel FFT implementation. In conclusion, this novel dataflow architecture has proven to be a promising approach for high-performance DSP applications on multi-FPGA platforms. The DRDA is characterized by system-level programmability and high scalability. In particular, the transparent mapping of different DSP algorithms onto the proposed architecture, owing to its universal dataflow concept in association with application-specific computational units, is one of its most important benefits.

8.1. Future work

Meanwhile, a multi-FPGA platform with two hardware boards, i.e., four Xilinx Virtex-II Pro FPGAs, has been assembled. It would be desirable to test the DRDA on multi-FPGA platforms with considerably more FPGA devices to prove the scalability of the proposed architecture in hardware and to compare it to other high-performance parallel DSP engines. One of the most interesting follow-up projects will then be the mapping of further DSP algorithms onto the proposed dataflow architecture.

References

[1] Z. Guo, W. Najjar, F. Vahid, K. Vissers, A quantitative analysis of the speedup factors of FPGAs over processors, in: Proc. of the ACM/SIGDA 12th Int. Symp. on Field Programmable Gate Arrays (FPGA), 2004, pp. 162–170.
[2] J. Palmer, B. Nelson, A parallel FFT architecture for FPGAs, in: Proc. of the Int.
Conference on Field Programmable Logic and Applications (FPL 2004), 2004, pp. 948–953.
[3] W. Gentleman, G. Sande, Fast Fourier transforms – for fun and profit, in: Proc. of the AFIPS Joint Computer Conference, vol. 29, 1966, pp. 563–578.
[4] J. Cooley, J. Tukey, An algorithm for machine calculation of complex Fourier series, Mathematics of Computation 19 (1965) 297–301.
[5] B. Blodget, C. Bobda, M. Huebner, A. Niyonkuru, Partial and dynamic reconfiguration of Xilinx Virtex-II FPGAs, in: Proc. of the Int. Conference on Field Programmable Logic and Applications (FPL 2004), 2004, pp. 801–810.
[6] M. Silva, J. Ferreira, Support for partial run-time reconfiguration of platform FPGAs, JSA 52 (12) (2006) 709–726.
[7] J. McAllister, R. Woods, S. Fischaber, E. Malins, Rapid implementation and optimisation of DSP systems on FPGA-centric heterogeneous platforms, JSA 53 (8) (2007) 511–523.
[8] D. Heller, A survey of parallel algorithms in numerical linear algebra, SIAM Review 20 (4) (1978) 740–777.
[9] H. Richter, Multiprocessor with dynamically variable topology, Computer System Science and Engineering 5 (1) (1990) 29–35.
[10] V. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, 1965.
[11] K. Lee, On the rearrangeability of a (2 log N − 1)-stage permutation network, IEEE Transactions on Computers 34 (5) (1985) 412–425.
[12] S. Voigt, T. Teufel, Dynamically reconfigurable dataflow for high-performance digital signal processing on multi-FPGA platforms, in: Proc. of the Int. Conference on Field-Programmable Logic and Applications (FPL 2007), 2007, pp. 633–637.

[13] CoreConnect Bus Architecture – An Open 32-, 64-, 128-Bit Core On-Chip Bus Standard, IBM Microelectronics, 1999.
[14] MontaVista Linux Professional (Edition 3.1) – Optimized for High-Performance Embedded Applications, MontaVista Software Inc., 2004.
[15] DS083: Virtex-II Pro Data Sheet, Xilinx, 2007.
[16] S. Voigt, T. Teufel, Analysis of a dynamically reconfigurable dataflow architecture and its scalable parallel extension for multi-FPGA platforms, in: Proc. of the 16th IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM '08), Stanford University, CA, 2008.
[17] M. Baesler, Development of a PCI Printed Circuit Board for Multi-FPGA Platform Assembling, Master's Thesis, Hamburg University of Technology, 2007.
[18] UG024: RocketIO™ Transceiver User Guide, Xilinx, 2007.
[19] PCI Local Bus Specification (Revision 2.3), PCI Special Interest Group, 2002.
[20] PCI 9030 Data Book, PLX Technology, 2002.
[21] S. Lass, ESL tools make FPGAs nearly invisible to designers, Xcell Journal 58 (2006) 6–8.
[22] Simulink® – Simulation and Model-Based Design, The MathWorks, 2007.
[23] MATLAB® – The Language of Technical Computing, The MathWorks, 2002.
[24] K. Karnofsky, Simulink brings model-based design to embedded signal processing, Xcell Journal 51 (2004) 66–69.
[25] T. Hill, The benefits of FPGA coprocessing, Xcell Journal 58 (2006) 29–31.
[26] DS260: Fast Fourier Transform, Xilinx, 2007.

Sven-Ole Voigt received the MSc and Ph.D. degrees in computer engineering from the Hamburg University of Technology, Germany. He joined NEC Electronics, Singapore, in 2003 and was responsible for embedded multimedia architectures. Since 2004 he has been a researcher at the Institute for Reliable Computing, Hamburg University of Technology, Germany, and was recently promoted to assistant professor. His research interests include high-performance dataflow architectures, reconfigurable application-specific instruction-set processors, embedded systems, and rapid prototyping.

Malte Baesler received the MSc degree in electrical engineering from the Hamburg University of Technology, Germany. Since 2007 he has been a researcher at the Institute for Reliable Computing, Hamburg University of Technology, Germany. His research interests include computer arithmetic, embedded systems, and computer architecture.

Thomas Teufel received the MSc degree in electrical engineering from the University of Bremen, Germany, and the Ph.D. degree in computer science, under the direction of Dr. Ulrich Kulisch, from the Karlsruhe Institute of Technology, Germany. Since 1991 he has been an associate professor of computer engineering at the Institute for Reliable Computing, Hamburg University of Technology, Germany. His research interests include the implementation of algorithms in hardware, chip design, embedded systems for automation and control engineering, rapid prototyping, computer arithmetic, and real-time operating systems. He is a member of the IEEE.