Microprocessors and Microsystems 29 (2005) 51–62 www.elsevier.com/locate/micpro

Automatic mapping of C to FPGAs with the DEFACTO compilation and synthesis system

Pedro Diniz*, Mary Hall, Joonseok Park, Byoungro So¹, Heidi Ziegler

Information Sciences Institute, University of Southern California, 4676 Admiralty Way Suite 1001, Marina del Rey, CA 90292, USA

Received 20 October 2003; revised 6 February 2004; accepted 11 June 2004. Available online 21 July 2004.

Abstract

The DEFACTO compilation and synthesis system is capable of automatically mapping computations expressed in high-level imperative programming languages such as C to FPGA-based systems. DEFACTO combines parallelizing compiler technology with behavioral VHDL synthesis tools to guide the application of high-level compiler transformations in the search for high-quality hardware designs. In this article we illustrate the effectiveness of this approach in automatically mapping several kernel codes to an FPGA quickly and correctly. We also present a detailed comparison of the performance of an automatically generated design against a manually generated implementation of the same computation. The design-space-exploration component of DEFACTO is able to explore a large number of designs for a particular computation that would otherwise be impractical for any designer.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Design automation; Parallelizing compiler technology and data dependence analysis; Behavioral synthesis and estimation; Reconfigurable computing; Field-programmable gate arrays (FPGAs)

1. Introduction The extreme flexibility of field programmable gate arrays (FPGAs), coupled with the widespread acceptance of hardware description languages (HDL) such as VHDL or Verilog, has made FPGAs the medium of choice for fast hardware prototyping and a popular vehicle for the realization of custom computing machines. Programming these reconfigurable systems, however, is an elaborate and lengthy process. The programmer must master all of the details of the hardware architecture, mapping the computation to each of the computing FPGAs, and the data to the appropriate memories. The standard approach to this problem requires that the programmer manually translate the program into an HDL

* Corresponding author. Tel.: +1 310-448-8457; fax: +1 310-823-6714. E-mail addresses: [email protected] (P. Diniz), [email protected] (M. Hall), [email protected] (J. Park), [email protected] (B. So), [email protected] (H. Ziegler).
1 Present address: IBM T.J. Watson Research Center, 1101 Kitchawan Road, Route 134/P.O. Box 218, Yorktown Heights, NY 10598, USA.
0141-9331/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2004.06.007

representation, applying his or her knowledge of the FPGA specifics, identifying available parallelism and performing the necessary code transformations at that time. Using commercially available synthesis tools, the programmer can typically specify the maximum number of components of a given kind, the maximum area or the clock period. The tool then determines the feasibility of the design by negotiating area and speed trade-offs, and generates a realizable design. Throughout the process, the burden is on the designer to ensure that the intermediate HDL representation of the program is kept consistent with the original version.

We believe the way to make programming of FPGA-based systems more accessible is to offer a high-level imperative programming paradigm, such as C, coupled with new compiler technology oriented towards FPGA designs. Programmers retain the advantages of a simple computational model via a high-level language but rely on powerful compiler analyses to automate most of the tedious and error-prone mapping tasks. In this article, we describe a high-level design approach using the DEFACTO system, a Design Environment for Adaptive Computing Technology

(DEFACTO). DEFACTO maps algorithms to systems that incorporate field programmable gate array (FPGA) configurable logic. Using DEFACTO, key computational components of the algorithm are executed in custom hardware designs on FPGAs, while the remaining computation and the coordination with the FPGAs are performed in software on an external host.

The DEFACTO system demonstrates the power of a basic principle from parallelizing compiler technology when applied to synthesizing designs in hardware. Data dependence analysis determines the relationship between two temporally distinct data accesses. A data dependence exists if two data references access the same memory location and there is a feasible run-time execution path between the references. Analyses and transformations based on data dependence relations provide the underlying foundation for a wide range of activities in parallelizing compilers. For example, if two computations have no data dependences, or if the dependences only involve reads, then it is safe to execute the computations in parallel. If there is a dependence between two accesses, we may be able to reuse the referenced data in a register, avoiding costly and unnecessary memory accesses. Dependences also indicate the safety of code reordering transformations, which may be beneficial for increasing parallelism or data reuse. In DEFACTO, dependence analysis provides the critical support for diverse activities such as exploiting parallelism in hardware, reusing data in on-chip storage structures, partitioning data and orchestrating data movement across the system. In collaboration with behavioral synthesis, DEFACTO derives a parallel hardware implementation from the high-level, sequential specification of portions of the algorithm.

We observe that parallelizing compiler technology greatly extends and complements the capabilities of high-level synthesis tools. Behavioral synthesis provides capabilities to derive low-level hardware designs from behavioral specifications. Further, estimates from behavioral synthesis tools of properties of the hardware design (e.g. area and performance) obviate the need for completely synthesizing the design. While fully synthesizing a design takes hours or days for even simple designs, behavioral synthesis estimates are obtained in seconds or minutes. It is this collaboration of the two technologies that uniquely enables the DEFACTO system to automatically derive good designs. In this paper, we demonstrate the effectiveness of this collaboration, focusing on three main capabilities:

† Code reordering transformations and other techniques to increase parallelism and reuse data in on-chip storage structures in lieu of costly external memory accesses.
† Optimization of the remaining memory accesses through specialized memory controllers and custom layout of data according to access ordering.

† An automated design space exploration algorithm that combines behavioral synthesis estimation with dependence analysis to exploit computation and memory access parallelism.

We illustrate the effectiveness of this approach in the automatic mapping of a set of five image processing and multimedia applications. Such applications typically manipulate large volumes of fine-grain data that, despite recent increases in FPGA capacity, must be mapped to external memories and then streamed through the FPGAs for processing. The small bit widths (typically, 8-bit and 16-bit) and inherent parallelism of the applications are good matches for FPGA strengths. Fortunately, such applications are also a good match for the capabilities of current parallelizing compiler analyses, which are most effective in the affine domain, where array subscript expressions are linear functions of the loop index variables and constants.

We focus on multimedia kernels, written as regular C programs. That is, there are no pragmas, annotations or language extensions to guide the hardware mapping process. The computations are limited to a single loop nest. The statements in the loops involve scalar variables and arrays using affine subscript expressions only; that is, computations with non-affine array accesses and pointer accesses are not mapped to hardware. For these five kernels, DEFACTO automatically generates a correct hardware implementation selected among hundreds of possible designs by the application of loop-based transformations. We further address the issue of the quality of the generated hardware implementation by examining a specific kernel, the Sobel edge detection computation, in detail. The results for Sobel show an overall 58× reduction in design time with only a 59% increase in execution time as compared to a manual design for the same algorithm.

The rest of this paper is organized as follows. In Section 2 we compare parallelizing compiler technology with behavioral synthesis technology used in commercially available tools. Section 3 presents an overview of the DEFACTO system. Section 4 describes analyses to uncover data reuse and exploit this knowledge in the context of FPGA-based data storage structures. Section 5 describes a transformation strategy for automating design space exploration in the presence of multiple loop transformations. Section 6 describes a set of program transformations used to adjust parallelism and data reuse in search of a balanced design. Section 7 describes the application of data partitioning and customization of memory controllers to substantially reduce the costs of accessing external memory from the FPGA. We present experimental results that illustrate the range of applicability of DEFACTO and its effectiveness by comparing the performance of an automatically generated design against a manual design for the same computations in Section 8. We survey related work in Section 9 and conclude in Section 10.

2. Comparing parallelizing compiler and behavioral synthesis

In this section, we discuss how parallelizing compiler technology and behavioral synthesis offer complementary capabilities, as shown in Table 1. Behavioral synthesis performs three core functions: (1) binding operators and registers in the specification to hardware implementations (e.g. selecting a ripple-carry adder to implement an addition); (2) resource allocation (e.g. deciding how many ripple-carry adders are needed); and (3) scheduling operations in particular clock cycles. In addition, behavioral synthesis supports some optimizations, but relies heavily on programmer pragmas to direct some of the mapping steps. For example, after loop unrolling, the tool will perform extensive optimizations on the resulting inner loop body, such as parallelizing and pipelining operations and minimizing registers and operators to save space. However, deciding the unroll amount is left up to the programmer.

The key advantage of parallelizing compiler technology is its ability to perform data dependence analysis on array variables. As will be discussed in the remainder of the paper, dependence analysis is used as a basis for parallelization, loop transformations and optimizing memory accesses. For array variables in a loop nest, dependence analysis is formulated as a pairwise comparison of temporally distinct accesses. Dependence analysis has been shown to be equivalent to integer linear programming, where the variables are the loop index variables, and the coefficients represent the strides of the accesses. When there is a solution, it suggests there is a dependence. We refer to the dependence distance as the number of iterations in the iteration space of a loop nest between the two accesses. For example, a distance of ⟨1, 2⟩ means that there are two loops in the nest and the data referenced in the current iteration will be accessed again one iteration later in the outer loop and two iterations later in the inner loop. The presence or absence of a dependence and the distances associated with dependences provide the requisite information for the optimizations described in this paper.

Table 1
Comparison of capabilities

Behavioral synthesis:
† Optimizations only on scalar variables
† Optimizations only inside loop body
† Supports user-controlled loop unrolling
† Manages registers and inter-operator communication
† Considers only a single FPGA
† Performs allocation, binding and scheduling of hardware resources

Parallelizing compiler:
† Optimizations on scalars and arrays
† Optimizations inside loop body and across loop iterations
† Analysis guides automatic loop transformations
† Optimizes memory accesses; evaluates trade-offs of different storage, on- and off-chip
† System-level view: multiple FPGAs and memories
† No knowledge of hardware implementation of computation

At a high level, dependence analysis augments behavioral synthesis to take on a global, system-level view. The compiler can reason about code transformations (such as loop unrolling) without actually applying them. The compiler can orchestrate data movement, internally and off-chip. In our ongoing research, it also provides the foundation for managing communication and pipelining across FPGAs [1,2]. For these reasons, we believe the unique combination of parallelizing compiler technology and behavioral synthesis gives DEFACTO an advantage in automatically deriving mixed hardware/software designs for FPGA-based systems.
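To make the distance notation concrete, consider the following illustrative C loop nest (our own example, not drawn from the paper's benchmark set): the value written to A[i][j] is read again one iteration later in the outer loop and two iterations later in the inner loop, so the dependence between the write and the read has distance vector ⟨1, 2⟩.

    #define N 64

    /* Illustrative only: the write to A[i][j] in iteration (i-1, j-2) produces
       the value read as A[i-1][j-2] in iteration (i, j), giving a dependence
       distance of <1, 2>. */
    void dependence_example(int A[N][N], int B[N][N])
    {
        for (int i = 1; i < N; i++) {        /* outer loop */
            for (int j = 2; j < N; j++) {    /* inner loop */
                A[i][j] = A[i - 1][j - 2] + B[i][j];
            }
        }
    }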

3. Overview of the DEFACTO compilation and synthesis system

We now describe the DEFACTO compilation and synthesis flow. We first describe the fundamental steps of the hardware/software partitioning and mapping to the reconfigurable elements. Next we outline an automated compiler strategy for evaluating a set of valid mappings using behavioral synthesis estimation. All components of the system are operational.

3.1. Design flow

DEFACTO is implemented in the Stanford University Intermediate Format (SUIF) system [3], and augments SUIF analyses and transformations, designed for conventional high-end systems, for FPGA-based systems. Fig. 1 shows the steps of automatic application mapping in the DEFACTO compiler. In the first step, the DEFACTO compiler takes an algorithm description written in C and performs general compiler optimizations such as constant propagation and common sub-expression elimination. In the second step, the optimized code is partitioned into what will execute in software on the host and what will execute in hardware on the FPGAs, using a simple partitioning heuristic that focuses on loop nests that perform memory accesses and exhibit fine-grain instruction-level parallelism.

Fig. 1. The DEFACTO design flow.

Once the compiler has decided which portions of the original application code to map to hardware, it engages in an iterative mapping process, applying a series of high-level loop transformations, described in Section 5. These transformations expose parallelism and data reuse to derive the best possible mapping to hardware. Next the compiler translates the SUIF intermediate code into behavioral VHDL using the SUIF2VHDL tool. This tool generates a behavioral VHDL description of the core datapath of the selected computation to execute in hardware. The resulting behavioral code, once synthesized, interacts with the software code running on the host processor via memory operations. The memory operations


are embedded in the behavioral code by the SUIF2VHDL tool and are realized via a handshaking protocol. The compiler links the structural code output by the Monet behavioral compiler against the WildStar memory interface structural VHDL code to bind the physical signals of the FPGA design with those of the compiler-generated design.

For a proposed behavioral VHDL design the compiler next invokes an estimation step [4]. These estimates indicate with high confidence whether the design will fit on a single FPGA device and, if so, what the expected performance is. If unsuccessful, the compiler will try another set of loop transformations over the selected segment of the original code using the strategy outlined in Section 5. When the compiler succeeds in generating a design, it performs a set of transformations to improve memory accesses by distributing data across multiple memories and customizing memory access protocols. Finally, the resulting structural hardware design is sent on to logic synthesis and place-and-route. Invocations to library functions are added to the portion of the code that will execute on the host to load FPGA configurations, lay out data in memories, synchronize execution and retrieve results. The portions of the original code that will execute on the host are compiled with the host's native compiler and linked against the communication and synchronization library.
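The host-side structure is conceptually simple. The sketch below is a hypothetical C rendering of what the compiler-inserted library calls might look like; the function names and signatures are illustrative only and are not the actual DEFACTO or board-vendor API.

    /* Hypothetical host-side wrapper around a hardware-mapped loop nest.
       All fpga_* functions below are illustrative placeholders. */
    #include <stddef.h>
    #include <stdint.h>

    extern void fpga_load_configuration(const char *bitstream);
    extern void fpga_write_memory(int bank, const void *src, size_t bytes);
    extern void fpga_start_and_wait(void);
    extern void fpga_read_memory(int bank, void *dst, size_t bytes);

    void run_kernel_on_fpga(const int16_t *input, int16_t *output, size_t n)
    {
        fpga_load_configuration("kernel.bit");            /* load FPGA configuration  */
        fpga_write_memory(0, input, n * sizeof *input);   /* lay out data in memories */
        fpga_start_and_wait();                            /* synchronize execution    */
        fpga_read_memory(1, output, n * sizeof *output);  /* retrieve results         */
    }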

3.2. Target hardware/software architectures

Fig. 2. Target architecture example: WildStar/PCI board.

The compilation techniques designed for DEFACTO focus on targeting board-level hardware/software systems similar to the WildStar/PCI board from Annapolis Micro Systems coupled with a host PC. Each WildStar board, as depicted in Fig. 2, consists of three Xilinx Virtex FPGA parts with up to a million gates each; each can access its own local memories. One FPGA serves as a controller; the other two FPGAs are connected by a 36-bit channel, and each has two local SRAMs with 32-bit dedicated channels. The three FPGAs share additional system memory. Although we have targeted FPGA-based hardware/software systems such as the specific WildStar/PCI board coupled with a host PC, we expect the compilation approaches to form a solid foundation for a multi-board system, and also to be applicable to system-on-a-chip devices that have multiple independent memory banks, each with configurable logic.

4. Analyses and transformations to exploit data reuse

Data reuse analysis, which is closely related to data dependence analysis, identifies opportunities for storing data in an on-chip structure so that subsequent accesses need not reference memory. For a given loop nest, data reuse analysis, as implemented in our compiler, builds a reuse graph G = (V, E) that includes all reuse-carrying array references (V) and the reuse edges (E) between them. Reuse edges are labeled with the reuse distance between two array references. The reuse distance is defined as the difference in iteration counts of the innermost loop between the dependent references.

Exploiting reuse on-chip is essential to good performance on FPGA-based systems. High latency and low bandwidth to external memory, relative to the computation rate, are often performance bottlenecks in FPGA-based systems, just as in conventional architectures. Further, FPGAs have no cache to reduce the latency of memory accesses.

There are several distinctions in the transformations that are performed in FPGA-based systems, as compared to known compilation techniques. First, the number of registers is not fixed as in a conventional architecture, so it is possible to tailor the number of registers to the requirements of the design. Second, customized hardware structures, not just registers, can be used to exploit the reuse in a way that maps to a time- and space-efficient design. The remainder of this section describes these differences.

4.1. Scalar replacement

The compiler must transform the code to guide behavioral synthesis to exploit the results of reuse analysis. Scalar replacement replaces certain array references with temporary scalar variables that will be mapped to on-chip registers by behavioral synthesis. It also eliminates redundant memory writes by keeping intermediate values in a register and writing the data to memory only for the last write. Reuse analysis determines when scalar replacement is applicable, and the reuse distance determines the number of registers required to exploit a possible data reuse for each array. Our scalar replacement [5] has more freedom than previously proposed work [6] in that it can utilize the flexibility of FPGAs, i.e. the number of registers does not have a rigid limit and is only constrained by the space limitation of the FPGA device and the complexity of designs involving very large numbers of registers. We can adjust the number of registers required by combining scalar replacement with other transformations described in Section 6, which reduce the reuse distance.

4.2. Tapped delay lines

An alternative to scalar replacement can be used if reuse is further restricted to input dependences with a consistent dependence carried by the innermost loop in a nest (or other loops in the nest if inner loops are fully unrolled). For example, image processing codes often use a sliding window, where a template is compared against a window of an image. On subsequent iterations of an inner loop, the window shifts by one element, but most of the data is reused. It is possible to store the values across iterations in a linearly connected set of registers, known as tapped delay lines. This important optimization opportunity has long been recognized by experienced designers who map computations by hand to hardware, by mapping elements in memory to tapped delay lines. We developed several compiler analyses that recognize opportunities for data reuse in the source program that can subsequently be mapped to tapped delay lines [7]. The application reuses the data in different registers by moving in tandem, across loop iterations, the data stored in all of the registers in the tapped delay line. By doing this shifting operation, the computation effectively reuses

the data in all but one of the registers in the tapped delay line. Although the same sort of reuse could be exploited by scalar replacement, the tapped delay line structure may result in a more efficient hardware mapping, as we can incorporate a library with hand-placed components.
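The following sketch, our own illustration rather than code from the DEFACTO benchmarks, shows the effect of the transformation on a one-dimensional sliding-window sum: after scalar replacement the reused elements are held in scalars that behavioral synthesis maps to registers, and the shifting of values from one scalar to the next is exactly the behavior of a three-tap delay line.

    #define N 64

    /* Before: three memory reads per iteration; x[i+1] and x[i+2] were
       already fetched in the two previous iterations (reuse distances 1, 2). */
    void window_sum(const int x[N], int y[N])
    {
        for (int i = 0; i < N - 2; i++)
            y[i] = x[i] + x[i + 1] + x[i + 2];
    }

    /* After scalar replacement: only one memory read per iteration; the
       scalars x0, x1, x2 behave as a tapped delay line that shifts by one
       element on every iteration. */
    void window_sum_replaced(const int x[N], int y[N])
    {
        int x0, x1 = x[0], x2 = x[1];
        for (int i = 0; i < N - 2; i++) {
            x0 = x1;               /* shift the window                */
            x1 = x2;
            x2 = x[i + 2];         /* the single new element fetched  */
            y[i] = x0 + x1 + x2;
        }
    }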

5. Automatic design space exploration

The DEFACTO compiler takes an algorithm description and a set of architectural constraints, particularly chip capacity, and performs code transformations to optimize the design in an application-specific way. We use a fixed target clock rate to guide the synthesis process, but it is not a hard constraint. Under the given constraints, the optimization criteria for mapping loop computations to FPGA-based systems are as follows:

1. The execution time should be minimized.
2. For a given level of performance, FPGA space usage should be minimized.

To address the first criterion, our compiler exploits fine-grain parallelism, data locality, and parallel memory accesses to multiple memories. The second criterion is needed for several reasons. If two designs have equivalent performance, the smaller design is more desirable, in that it frees up space for other uses of the FPGA logic, such as mapping other loop nests. A smaller design usually consumes less power and has less routing complexity, and as a result may achieve a faster target clock rate. Moreover, these optimization criteria suggest a strategy for selecting among a set of candidate designs.

The compiler's decision process has many degrees of freedom. It can leverage coarse-grain parallelism, fine-grain parallelism, and data locality. We introduce several guiding metrics to effectively evaluate a set of candidate designs. Balance is defined as the ratio of the data consumption rate of the computation to the data fetch rate from memory. Depending on the balance value, we can determine whether memory or computation is the performance bottleneck. Another metric is needed because of the inherent space-time trade-off: as the design exploits more parallelism, more copies of operators are necessary, and as the design exploits more data reuse on chip, more registers for on-chip storage are required. Thus, when an increase in space results in only a minor performance improvement, we should not increase the complexity of the design. We therefore define efficiency as the ratio of the performance increase to the space increase. This metric suggests when to stop the automatic design space exploration. An additional concept, the memory bandwidth saturation point, refers to the unroll factor at which data is being fetched/stored at a rate corresponding to the maximum bandwidth of the target architecture. Furthermore, the monotonic properties of

these metrics under our compiler transformations enable the compiler to explore a potentially large design space, which without automation would not be feasible. The experimental results for five multimedia kernels reveal that the algorithm quickly (in less than 5 min, searching less than 0.3% of the search space) derives a design that closely matches the best performance within the design space and is smaller than other designs with comparable performance [8].
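The following C sketch summarizes the shape of such an exploration loop. It is a schematic illustration under our own simplifying assumptions, not the actual DEFACTO implementation; estimate_design() stands in for a call to behavioral synthesis estimation that returns a predicted cycle count and area for a candidate unroll factor.

    /* Schematic design space exploration driven by balance/efficiency-style
       metrics; all names and the stopping rule are illustrative. */
    struct estimate { double cycles; double slices; };

    extern struct estimate estimate_design(int unroll_factor);  /* assumed */

    int explore_unroll_factors(int max_unroll, double capacity_slices,
                               double min_efficiency)
    {
        int best = 1;
        struct estimate prev = estimate_design(1);

        for (int u = 2; u <= max_unroll; u *= 2) {
            struct estimate cur = estimate_design(u);
            if (cur.slices > capacity_slices)
                break;                            /* design no longer fits  */

            /* efficiency: performance gained per unit of extra area */
            double speedup    = prev.cycles / cur.cycles;
            double area_ratio = cur.slices / prev.slices;
            double efficiency = (speedup - 1.0) / (area_ratio - 1.0 + 1e-9);

            if (efficiency < min_efficiency)
                break;                            /* diminishing returns    */
            best = u;
            prev = cur;
        }
        return best;                              /* selected unroll factor */
    }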

6. Code reordering transformations

A number of code reordering transformations can be used to adjust parallelism and data reuse in search of a balanced design. Data dependence (and reuse) distances are used to ensure both the safety and the profitability of such reorderings. In this section, we describe one particular code transformation, called unroll-and-jam. Additional code transformations include loop permutation, which moves the loop in a nest that carries the most reuse to the innermost position to reduce reuse distances, and tiling, which divides the iteration space and data of a loop nest into blocks, also to reduce reuse distances and to coordinate data movement between limited-capacity registers and external memories.

6.1. Unroll-and-jam

Because synthesis tools can schedule only the memory references in the loop body, memory references to different locations are executed serially even if they are parallelizable. Unroll-and-jam involves unrolling one or more loops in a nest and fusing copies of the inner loop together [6]. Thus, unroll-and-jam exposes operator parallelism to hardware synthesis by replicating the logical operations and their corresponding operands in the loop body. Our DEFACTO system improves instruction-level parallelism and data reuse by performing unroll-and-jam to fully exploit the flexibility of FPGAs. Unroll-and-jam can also decrease the dependence distances for reused data accesses, which, when combined with the scalar replacement described in Section 4.1, can be used to expose opportunities for parallel memory accesses.

We can apply the notion of balance to select the appropriate unroll factor for a loop nest. As the unroll factor increases, both the fetch rate and the consumption rate increase monotonically. A memory parallelism saturation point refers to a certain unroll factor at which the memory parallelism available in the target architecture reaches the full bandwidth. The data fetch rate monotonically increases as the unroll factor increases until it reaches the saturation point. Beyond the saturation point, the data fetch rate is constant because we cannot fetch data faster than the bandwidth provided by the architecture. Similarly, the data consumption rate also increases monotonically as the unroll factor increases, but less than linearly, for two reasons. First, the replicated computations are sometimes dependent on each other and cannot be scheduled at the same time. Second, even the exposed independent computations cannot be scheduled at the same time if not all of their data operands can be ready at the same time. From these monotonicity properties of the fetch rate and consumption rate, balance also increases monotonically as the unroll factor increases until it reaches the saturation point, because the data fetch rate increases faster than the data consumption rate. Balance decreases monotonically beyond the saturation point because the data fetch rate is constant while the data consumption rate is still increasing.
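As a simple illustration of the transformation (ours, not taken from the paper), unrolling the outer loop of a multiply-accumulate nest by two and fusing the resulting inner loops exposes two independent accumulation chains to behavioral synthesis, while also reusing the b[j] operand within the unrolled body:

    #define N 32
    #define M 16

    /* Original loop nest: one multiply-accumulate per inner iteration. */
    void mac(int c[N], int a[N][M], const int b[M])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                c[i] += a[i][j] * b[j];
    }

    /* After unroll-and-jam by 2 on the outer loop (N assumed even): the two
       statements in the fused body are independent and can execute in
       parallel, and b[j] is fetched once and reused by both. */
    void mac_unroll_and_jam(int c[N], int a[N][M], const int b[M])
    {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < M; j++) {
                c[i]     += a[i][j]     * b[j];
                c[i + 1] += a[i + 1][j] * b[j];
            }
    }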

7. Parallelization and customization of memory accesses

While the code transformations to exploit on-chip data reuse dramatically reduce memory bandwidth requirements, essential memory accesses still remain. In this section, we describe how DEFACTO optimizes these accesses, exploiting parallelism and pipelining of accesses. Data dependence analysis is used to guarantee the independence of memory accesses to be executed in parallel.

7.1. Custom data layout


Custom data layout distributes the memory accesses remaining after scalar replacement across multiple memories, so that data for independent computations can be fetched in parallel. A unique aspect of our approach is that it adjusts the data layout in the presence of code reordering transformations, particularly loop permutation and unroll-and-jam. As compared to approaches that perform code transformations to match a fixed data layout, the performance improvements are more robust when custom layout is used in conjunction with other transformations.

The DEFACTO compiler partitions arrays among multiple memories using a two-phase algorithm. In the first phase, array renaming, the compiler divides arrays into subarrays by performing a one-to-one mapping between array subscript expressions and virtual memories, such that memory access parallelism is maximized. The compiler uses array access patterns to perform the mapping and assumes an unlimited number of virtual memories are available. The compiler transforms each array reference so that the transformed subscript expressions take into account both the data element's virtual memory assignment and its position within the newly formed subarray. In the second phase, physical mapping, the compiler binds virtual memories to physical memories, taking into account data dependences and memory access conflicts, based on the knowledge of how synthesis will schedule the memory operations, to increase parallelism of memory accesses. In turn, operator parallelism is improved because data is available to perform computations.
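A small example, constructed for illustration rather than taken from the compiler's output, shows the idea behind array renaming. After an unroll by two, the references A[2*i] and A[2*i+1] never touch the same element, so the array can be split into two subarrays bound to different memory banks, letting the two fetches proceed in parallel:

    #define N 64

    /* Original: both references in the unrolled body compete for one memory. */
    void pair_sum(const int A[N], int y[N / 2])
    {
        for (int i = 0; i < N / 2; i++)
            y[i] = A[2 * i] + A[2 * i + 1];
    }

    /* After array renaming: A_even[i] holds A[2*i] (bank 0) and A_odd[i]
       holds A[2*i+1] (bank 1), so the two reads can be issued in parallel. */
    void pair_sum_renamed(const int A_even[N / 2], const int A_odd[N / 2],
                          int y[N / 2])
    {
        for (int i = 0; i < N / 2; i++)
            y[i] = A_even[i] + A_odd[i];
    }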

For a collection of five multimedia kernels, we found that custom data layout across four memories yields a 3–5× speedup over mapping data to a single memory [9].

7.2. Memory interface architecture

FPGAs offer a unique opportunity to exploit application-specific characteristics in the design and implementation of the datapath-to-external-memory interface. In order to build a modular, yet efficient interface to external memory, we defined two interfaces and a set of parameterizable abstractions that can be integrated with existing synthesis tools. These interfaces decouple target-architecture-dependent characteristics between the datapath and the external memories, as depicted in Fig. 3. One interface generates all the external memory control signals matching the vendor-specific interface signals—physical address, enable, write_enable, data_in/out, etc. In addition, this interface allows the designer to exploit application-specific data access patterns by providing support for pipelined memory accesses. The second interface defines the scheduling of the various memory operations, allowing the designer to define application-specific scheduling strategies. We have implemented these two interfaces over a simple, yet modular, architecture as described in detail in [10].

We introduce two abstractions, data ports and data channels. A data port defines several hardware attributes for the physical connection of wires from the memory controller (the entity responsible for moving data bits to and from the external memory) to the datapath ports. A data channel characterizes (for a given loop execution) the way the data is to be accessed from/to memory to/from a datapath port. Examples include sequential or strided accesses. A channel also includes a conversion FIFO. The FIFO queue allows for the prefetching of data from a memory, and also allows for deferred writebacks to memory such that incoming data transfers, necessary to keep the computation executing, take priority. The conversion FIFO queue also allows for the implementation of data packing and unpacking operations so that no modifications to

the datapath either producing or consuming the data are necessary. Over the lifetime of the execution of a given computation, it is possible to map multiple data channels to the same datapath port by redefining the parameters of the data stream associated with it.

Fig. 3. Memory interface architecture for the DEFACTO compiler.

7.3. Memory interface controller and scheduling optimizations

Part of the target hardware design consists of auxiliary circuitry to deal with the external memories. These entities provide the means to generate addresses, carry out memory signaling and temporarily store data to and from external memories, dealing with the pins and timing variations of the vendor-specific memory and FPGA interfaces. Because of its key role in exploiting application-specific features to reduce memory latency, we focus our attention on the memory channel controller. The role of a memory channel controller is to serialize multiple memory requests for a set of data channels associated with a single memory. The current implementation allows for application-specific scheduling by defining a distinct ordering of the memory accesses for the different channels. While predicting the exact memory access order is impossible, our compiler can reduce the number of clock cycles required to access the data corresponding to many memory channels by bypassing memory controller states. These 'test' states can be eliminated when the controller is assured that if a datum is required by a given channel, it is also required by a set of other channels. In this situation the controller need not explicitly check whether or not a memory access is required. The benefits of this optimization, when combined with a pipelined memory access mode, can lead to a performance improvement of up to 50% for a small set of image processing kernels [10].
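To summarize the parameters that characterize a data channel, the following C structure is a hypothetical rendering of the abstractions described above; the field names are ours and do not correspond to the actual DEFACTO interface definitions.

    /* Hypothetical description of a data channel; illustrative only. */
    enum access_pattern { ACCESS_SEQUENTIAL, ACCESS_STRIDED };

    struct data_channel {
        int  memory_bank;             /* external memory the channel is bound to        */
        int  to_datapath;             /* 1 = memory-to-datapath, 0 = datapath-to-memory */
        enum access_pattern pattern;  /* sequential or strided address generation       */
        int  stride;                  /* element stride when pattern is strided         */
        int  fifo_depth;              /* conversion FIFO for prefetch and writeback     */
        int  element_bits;            /* data width, used for packing/unpacking         */
    };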

8. Practical experience

We now describe the results of two separate experiments using the DEFACTO compiler to automatically map several kernel computations, expressed in the C programming language, onto a real FPGA-based reconfigurable system. The first experiment illustrates the ability of the design space exploration feature of this prototype system to quickly produce a hardware design that is balanced according to the metrics of Section 5, balance and efficiency. The second experiment aims at comparing the performance of an automatically generated hardware design against a hardware design developed manually. The first experiment uses the following kernel computations:

† Finite impulse response (FIR) filter. This kernel implements an integer multiply-accumulate operation over 32 consecutive elements of a 64 element array.

† Matrix multiply (MAT). This kernel implements an integer dense matrix multiplication of a 32-by-16 matrix by a 16-by-4 matrix using 32-bit one's complement arithmetic.
† String pattern matching (PAT). This kernel implements a character matching operator of a string of length 16 over an input string of length 64.
† Jacobi iteration (JAC). This kernel implements a 4-point stencil averaging computation over the elements of a two-dimensional array.
† Sobel edge detection (SOBEL) (see e.g. [11]). This kernel implements a simple 3-by-3 window Laplacian operator over an integer 256-by-256 gray scale image.

Each kernel is written as a regular C program where the computation is limited to a single loop nest. The statements in the loops involve scalar variables and arrays using affine index access functions only.

Table 2 summarizes the results of the DEFACTO compiler generated mapping for these kernels. For each kernel we present the total number of designs possible by using exclusively loop unrolling of either of the loops in each kernel code. We also present the number of designs the DSE algorithm actually examines and the selected design point chosen by the algorithm, indicated by the amount of unrolling in the two inner loops of each kernel. For each of the kernels DEFACTO selected a design in less than 3 min, which includes the interfacing with behavioral synthesis for estimation purposes. The table also presents the attained clock rate after place-and-route of the selected design. We indicate the area consumed by each design's core datapath (i.e. excluding any memory interface resources) in terms of slices, and the percentage of FPGA resources used.

For the SOBEL kernel, the most diverse in terms of the instructions in the loop, we then compare the performance of the automatically mapped base version of this kernel, i.e. without any unrolling, with that of a design mapped manually for the same computation and for the same target architecture, and discuss the relative mapping time and effort for the designs.
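The following C fragment is a hedged sketch of the structure of the SOBEL kernel: a single affine loop nest applying a 3-by-3 window operator to a 256-by-256 image. The coefficient values and the thresholding are illustrative and may differ from the exact benchmark source.

    #define ROWS 256
    #define COLS 256
    #define THRESHOLD 128

    /* Illustrative 3-by-3 window edge detector; all array subscripts are
       affine functions of the loop indices, as required by the compiler. */
    void sobel(const short img[ROWS][COLS], short edge[ROWS][COLS])
    {
        for (int i = 1; i < ROWS - 1; i++) {
            for (int j = 1; j < COLS - 1; j++) {
                int gx = -img[i-1][j-1] - 2*img[i][j-1] - img[i+1][j-1]
                         + img[i-1][j+1] + 2*img[i][j+1] + img[i+1][j+1];
                int gy = -img[i-1][j-1] - 2*img[i-1][j] - img[i-1][j+1]
                         + img[i+1][j-1] + 2*img[i+1][j] + img[i+1][j+1];
                int mag = (gx < 0 ? -gx : gx) + (gy < 0 ? -gy : gy);
                edge[i][j] = (mag > THRESHOLD) ? 255 : 0;  /* black-and-white output */
            }
        }
    }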

Table 2
Synopsis of compiler generated VHDL code performance for selected kernels

Kernel   Number designs possible   Number designs examined   Selected design   FPGA area (slices)   Clock rate (MHz)
FIR      2048                      3                         4x8               1538 (12.5%)         24.50
MAT      2048                      4                         2x1               777 (6.3%)           25.52
PAT      768                       4                         16x12             895 (7.2%)           30.40
JAC      512                       5                         2x4               1470 (11.9%)         25.98
SOBEL    2048                      2                         2x2               3481 (28.3%)         22.19

The manual mapping was achieved by designing the core datapath for the edge detection computation directly in structural VHDL. In both designs we used the same memory interface libraries as provided by the board vendor. To measure the execution time on the board we simulated both designs to extract the total number of execution clock cycles. We then used the clock period derived from the place-and-route results for the designs to compute the wall-clock execution time. Both designs ran successfully on the target FPGA, thereby validating the computed execution times.

Table 3
Experimental results for the Sobel edge detection code

                        Manual    Automated
Space (slices)          2238      2279
Clock cycles (K)        326       518
Clock rate (MHz)        42        40
Execution time (ms)     7.7       12.95
Development time        40 h      41 min

Table 3 reveals that both designs occupy a modest 18% of the total number of 12,288 FPGA slices, with a negligible increase for the automated design (about 1.5%). Regarding the number of executed clock cycles, we observe a noticeable increase in the automated design. This behavior is due to the way the automated design uses a handshaking protocol when deriving external memory operations. As a result, the execution time of the automated design is about 59% slower than the manual design, as its clock rate is also slightly lower. In many contexts, we believe this noticeable performance degradation to be a very small price to pay for an automated and correct design. The development time, however, shows a very impressive 58-fold improvement over the manual design. Given that, of the 41 min taken by the DEFACTO compiler, 40 min were dedicated to the synthesis of the actual design, this experience reveals that the actual decision and code generation of a complete and correct design is about four orders of magnitude faster than manual design selection.

Fig. 4 illustrates the operation of the automatically generated design on the FPGA board for a sample input image (left image). The output image (right image) contains the black-and-white pixels for which the algorithm detected a segment of an edge.

8.1. Discussion

These two examples illustrate the power of the DEFACTO compiler in leveraging a wide range of parallelizing compiler analyses and mapping transformations. While in this example the performance of the generated design was about 2x slower than a manual implementation, it was achieved in an extremely small fraction of the time, and with guaranteed correctness.

Fig. 4. Input and output example for sample Sobel edge detection.

This experiment, although limited, reveals that the DEFACTO system is effective in combining parallelizing compiler technology and behavioral synthesis techniques for mapping applications written in high-level programming languages such as C onto FPGA-based reconfigurable architectures.

9. Related work

9.1. Compilation systems for configurable architectures

Like our effort, other researchers have focused on using automatic data dependence and compiler analysis to aid the mapping of computations to FPGA-based machines. Weinhardt [12] describes a set of program transformations for the pipelined execution of loops with loop-carried dependences onto custom machines using a pipeline control unit, an approach similar to ours. He also recognizes the benefit of data reuse but does not present a compiler algorithm. The Napa-C compiler effort [13] explores the problem of automatically mapping applications written in C to a hybrid RISC/FPGA architecture. As expected, its RISC-based interface is very similar to our target design architecture, as a way to control the complexity of the interface between the FPGAs and the host processor.

9.2. Synthesizing high-level constructs

Languages such as VHDL and Verilog allow programmers to migrate to configurable architectures without having to learn a radically new programming paradigm. Efforts in the area of new languages include Handel-C [14] and SystemC [15]. Several researchers have developed tools that map computations to reconfigurable custom computing architectures [16], while others have developed approaches to mapping applications to their own reconfigurable architectures that are not FPGAs, e.g. RaPiD [17] and

PipeRench [18]. The two projects most closely related to ours, the Nimble compiler and the work by Babb et al. [19], map applications in C to FPGAs, but do not perform design space exploration.

9.3. Design space exploration

In this article, we focus only on related work that has attempted to use loop transformations to explore a wide design space. Other work has addressed more general issues such as finding a suitable architecture (either reconfigurable or not) for a particular set of applications (e.g. [20]). Derrien and Rajopadhye [21] describe a tiling strategy for doubly nested loops. They model performance analytically and select a tile size that minimizes the iteration's execution time. Cameron's estimation approach builds on its own internal data-flow representation using curve fitting techniques [22]. Qasem et al. [23] study the effects of array contraction and loop fusion.

9.4. Data layout

Large-scale multiprocessor systems are highly affected by computation and data partitioning across processors and memory and, to that end, much work [24–30] derives coarse-grain layouts that avoid communication among processors. Kim and Prasanna [27] propose a fine-grain data distribution scheme, perfect Latin squares, to map rows, columns, diagonals, and subarrays to multiple memory modules, without considering the data access pattern of the code. Thus, they support parallel array accesses only if the code is accessing consecutive locations within the square subarray. Petersen et al. [31] develop a type construct, and the Napa-C compiler effort [32,33] defines C language extensions to capture the data layout. Others [34–38] have developed scheduling algorithms, either in software or hardware, to deconflict memory hierarchy traffic. Custom memory architectures [39–42], derived from the application characteristics, also form part of the solution to the memory-computation unit performance

gap. For tiled architectures such as Raw, the Maps compiler [43] performs modulo unrolling. Other features of the compiler include equivalence class unification for pointer analysis and static versus dynamic promotion of memory accesses.

9.5. Discussion

The research presented in this article differs from the efforts mentioned above in several respects. First, the focus of this research is on developing an algorithm that can explore a wide number of design points, rather than selecting a single implementation. Second, the proposed algorithm takes as input a sequential application description and does not require the programmer to control the compiler's transformations. Third, the proposed algorithm uses high-level compiler analysis and estimation techniques to guide the application of the transformations as well as to evaluate the various design points. Our algorithm supports multi-dimensional array variables, absent in previous analyses for the mapping of loop computations to FPGAs. We use the application-specific data access patterns to directly calculate our data layout and automatically derive the variable mapping. We replace array accesses from memory with scalar register accesses in our design for increased on-chip storage of array elements that will be reused. We leave the details of the scheduling to the Monet scheduler in our system. Finally, we use a commercially available behavioral synthesis tool to complement the parallelizing compiler techniques rather than creating an architecture-specific synthesis flow that partially replicates the functionality of existing commercial tools. Behavioral synthesis allows the design space exploration to extract more accurate performance metrics (time and area used) rather than relying on a compiler-derived performance model. Our approach greatly expands the capability of behavioral synthesis tools through more precise program analysis.

10. Conclusion

We have described the DEFACTO compilation and synthesis system. DEFACTO combines data dependence analysis with behavioral synthesis to automatically map algorithms written in C to FPGA-based configurable systems. We illustrate the effectiveness of this approach in the automatic mapping of a set of digital image processing kernels. For these kernels DEFACTO automatically generates a correct hardware implementation selected among hundreds of possible designs by the application of loop-based transformations.

We have illustrated the effectiveness of the automation in the DEFACTO system by mapping an image processing kernel code, Sobel edge detection, to a combined software and hardware design on a general-purpose computer

and an FPGA. This automatic mapping delivers a 58x reduction in design time and only a 59% increase in execution time as compared to a manual design of the same algorithm. With the growing number of available transistors in future FPGA devices and the increasing time-to-market pressures, the need to quickly generate a correct design with good performance will make high-level design tools such as DEFACTO extremely attractive. Our current and future work focuses on extending the capabilities of the system to support communication, pipelining and partitioning of multi-loop applications on multi-FPGA designs.

Acknowledgements

The DEFACTO project has been funded by the Defense Advanced Research Projects Agency under contract #F30602-98-2-0113 and by the National Science Foundation (NSF) under Grant No. 0209228. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Heidi Ziegler has been funded by the Boeing Corp. and Intel Corp.

References

[1] H. Ziegler, B. So, M. Hall, P. Diniz, Coarse-grain pipelining on multiple FPGA architectures, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, 2002, pp. 77–86.
[2] H. Ziegler, M. Hall, P. Diniz, Compiler-generated communication for pipelined FPGA applications, Proceedings of the ACM/IEEE 40th Design Automation Conference, 2003, pp. 610–615.
[3] The Stanford SUIF compilation system, Public Domain Software and Documentation.
[4] B. So, P.C. Diniz, M.W. Hall, Using estimates from behavioral synthesis tools in compiler-directed design space exploration, Proceedings of the 40th Design Automation Conference, 2003, pp. 514–519.
[5] B. So, M. Hall, Increasing the applicability of scalar replacement, Proceedings of the International Conference on Compiler Construction (CC'04), Springer, Berlin, 2004.
[6] S. Carr, K. Kennedy, Improving the ratio of memory operations to floating-point operations in loops, ACM Transactions on Programming Languages and Systems 15 (3) (1994) 400–462.
[7] P. Diniz, J. Park, Automatic synthesis of data storage and control structures for FPGA-based computing machines, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'2000), IEEE Computer Society Press, Los Alamitos, Oct. 2000, pp. 91–100.
[8] B. So, M. Hall, P. Diniz, A compiler approach to fast hardware design space exploration in FPGA-based systems, Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI), ACM Press, New York, 2002, pp. 165–176.

[9] B. So, M. Hall, H. Ziegler, Custom data layout for memory parallelism, Proceedings of the ACM International Symposium on Code Generation and Optimization (CGO), ACM Press, New York, 2004, pp. 291–302.
[10] J. Park, P. Diniz, Synthesis of memory access controller for streamed data applications for FPGA-based computing engines, Proceedings of the 14th International Symposium on System Synthesis (ISSS), ACM Press, New York, 2001, pp. 221–226.
[11] J. Proakis, D. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications, Prentice-Hall, 1996.
[12] M. Weinhardt, W. Luk, Pipelined vectorization for reconfigurable systems, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, 1999, pp. 52–62.
[13] M. Gokhale, J. Stone, Napa C: compiling for a hybrid RISC/FPGA architecture, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, 1998, pp. 126–135.
[14] I. Page, W. Luk, Compiling OCCAM into FPGAs, Field Programmable Gate Arrays, Abingdon EE and CS Books, 1991, pp. 271–283.
[15] A. Pelkonen, K. Masselos, M. Cupak, System-level modeling of dynamically reconfigurable hardware with SystemC, Proceedings of the International Symposium on Parallel and Distributed Processing, 2003, pp. 174–181.
[16] M. Weinhardt, W. Luk, Pipelined vectorization for reconfigurable systems, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, 1999, pp. 52–62.
[17] D. Cronquist, P. Franklin, S. Berg, C. Ebeling, Specifying and compiling applications for RaPiD, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, 1998, pp. 116–125.
[18] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. Taylor, R. Laufer, PipeRench: a coprocessor for streaming multimedia acceleration, Proceedings of the 26th International Symposium on Computer Architecture (ISCA), ACM Press, New York, 1999, pp. 28–39.
[19] J. Babb, M. Rinard, C. Moritz, W. Lee, M. Rank, R. Barua, S. Amarasinghe, Parallelizing applications into silicon, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 1999, pp. 70–81.
[20] S. Abraham, B. Rau, R. Schreiber, G. Snider, M. Schlansker, Efficient design space exploration in PICO, Tech. Rep., HP Labs, 1999.
[21] S. Derrien, S. Rajopadhye, S. Sur-Kolay, Combined instruction and loop parallelism in array synthesis for FPGAs, Proceedings of the 14th International Symposium on System Synthesis, 2001, pp. 165–170.
[22] W. Najjar, D. Draper, A. Bohm, R. Beveridge, The Cameron project: high-level programming of image processing applications on reconfigurable computing machines, Proceedings of the IEEE International Conference on Parallel Architectures and Compilation Techniques (PACT), Workshop on Reconfigurable Computing, IEEE Computer Society Press, Los Alamitos, CA, 1999, pp. 216–225.
[23] A. Qasem, G. Jin, J. Mellor-Crummey, Improving performance with integrated program transformations, manuscript, October 2003.
[24] M. Gupta, P. Banerjee, Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers, IEEE Transactions on Parallel and Distributed Systems, no. 2, 1992, pp. 179–193.
[25] K. Kennedy, U. Kremer, Automatic data layout for high performance Fortran, Proceedings of the Ninth International Conference on Supercomputing, 1995, pp. 1–24.
[26] J. Anderson, Automatic computation and data decomposition for multiprocessors, PhD thesis, Stanford University, March 1997.
[27] K. Kim, V.K. Prasanna, Latin squares for parallel array access, IEEE Transactions on Parallel and Distributed Systems 4 (4) (1993) 361–370.
[28] S. Rubin, R. Bodik, T. Chilimbi, An efficient profile-analysis framework for data-layout optimizations, Proceedings of the 2002 ACM Conference on Principles of Programming Languages (POPL'02), 2002, pp. 140–153.
[29] E. Petrank, D. Rawitz, The hardness of cache conscious data placement, Proceedings of the 2002 ACM Conference on Principles of Programming Languages (POPL'02), 2002, pp. 101–112.
[30] M. Cierniak, W. Li, Unifying data and control transformations for distributed shared-memory machines, Proceedings of the 1995 ACM Conference on Programming Language Design and Implementation (PLDI'95), 1995, pp. 205–217.
[31] L. Petersen, R. Harper, K. Crary, F. Pfenning, A type theory for memory allocation and data layout, Proceedings of the 2003 ACM Conference on Principles of Programming Languages, 2003, pp. 172–184.
[32] M. Gokhale, J. Stone, Napa C: compiling for a hybrid RISC/FPGA architecture, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, 1998, pp. 126–135.
[33] M. Gokhale, J. Stone, Automatic allocation of arrays to memories in FPGA processors with multiple memory banks, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, 1999, pp. 63–69.
[34] P. Murthy, S. Bhattacharyya, A buffer merging technique for reducing memory requirements of synchronous dataflow specifications, Proceedings of the 12th International Symposium on System Synthesis (ISSS), ACM Press, New York, 1999, pp. 78–84.
[35] B. Mathew, S. McKee, J. Carter, A. Davis, Design of a parallel vector access unit for SDRAM memory systems, Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 1999, pp. 39–48.
[36] S. Rixner, W. Dally, U. Kapasi, P. Mattson, J. Owens, Memory access scheduling, Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 128–138.
[37] S. McKee, W. Wulf, Access ordering and memory-conscious cache utilization, First IEEE Symposium on High-Performance Computer Architecture, 1995, pp. 253–262.
[38] J.-F. Collard, D. Lavery, Optimizations to prevent cache penalties for the Intel Itanium 2 processor, Proceedings of the International Symposium on Code Generation and Optimization (CGO), ACM Press, New York, 2003, pp. 105–114.
[39] P. Grun, N. Dutt, A. Nicolau, Access pattern based local memory customization for low power embedded systems, Proceedings of Design, Automation and Test in Europe, 2001, pp. 778–784.
[40] M. Weinhardt, W. Luk, Memory access optimization and RAM inference for pipeline vectorization, Proceedings of the Ninth International Workshop on Field-Programmable Logic and Applications (FPL), Springer, 1999, pp. 61–70.
[41] P. Slock, S. Wuytack, F. Catthoor, G. de Jong, Fast and extensive system-level memory exploration for ATM applications, Proceedings of the 10th International Symposium on System Synthesis, 1997, pp. 74–81.
[42] H. Schmit, D. Thomas, Address generation for memories containing multiple arrays, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 17 (1998) 377–385.
[43] R. Barua, W. Lee, S. Amarasinghe, A. Agarwal, Maps: a compiler-managed memory system for Raw machines, Proceedings of the 26th International Symposium on Computer Architecture (ISCA), ACM Press, New York, 1999, pp. 4–15.

Pedro Diniz received his BSc and MSc in Electrical and Computer Engineering from the Instituto Superior Técnico, Technical University of Lisbon in 1988 and 1992, respectively. In 1997 he received his PhD in Computer Science from the University of California, Santa Barbara in the area of parallelizing compilers. He joined the University of Southern California's Information Sciences Institute in August 1997, where he has been working on compiler analysis techniques for reconfigurable computing, in particular for automatically mapping computations to FPGA-based architectures. His research interests include program analysis, parallel and distributed computing.

Mary Hall is a project leader at USC/Information Sciences Institute and jointly a research associate professor in the Computer Science Department at USC, where she has been since 1996. Dr Hall graduated magna cum laude with a BA in computer science and mathematical sciences in 1985 from Rice University. She received an MS and PhD in computer science in 1989 and 1991, respectively, also from Rice University. Prior to joining USC/ISI, Dr Hall held positions as Visiting Professor and Senior Research Fellow at Caltech, and as Research Scientist at Stanford University and at Rice University. Dr Hall's research interests include interprocedural compiler analysis, automatic parallelization, high-level synthesis to FPGAs and software for processing-in-memory systems.

Joonseok Park received his BSc degree in Mathematics from Sogang University, Seoul, Korea in 1997, and the MSc and PhD degrees in Computer Science from the University of Southern California in Los Angeles, California in 2000 and 2004, respectively. He is currently with Samsung Corp. in Seoul. His research interests include parallelizing compilers, reconfigurable computing and HW/SW co-synthesis.

Byoungro So received his BSc in Computer Science from Dongguk University in 1997, and MSc and PhD in Computer Science from the University of Southern California in 1998 and 2003, respectively. He joined the Exploratory System Architecture group at IBM T. J. Watson Research Center in 2003, where he has been working on data/computation partitioning and parallelization for the broadband processor architecture. His research interests include program analysis, high-performance and parallel computing.

Heidi Ziegler has been a doctoral student at the University of Southern California (USC), in the Department of Electrical Engineering - Systems, Computer Engineering, since 1998. She is a PhD candidate in Computer Science and her work focuses on the application of high-level program analysis and transformations for multi-FPGA configurable computing systems. She received a BS in electrical engineering from The University of Texas at Austin and an MS in computer engineering from the University of Southern California. Heidi has been employed at Boeing Satellite Systems for 9 years. From Fall 1998 until Spring 2003, Heidi was a Boeing Full-Study Doctoral Scholar.