An asynchronous approach to efficient execution of programs on adaptive architectures utilizing FPGAs

Journal of Network and Computer Applications (1997) 20, 223–252
Sumit Ghosh
Department of Computer Science & Engineering, Arizona State University, Tempe, AZ 85287, USA

Computers are broadly classified into two classes: general-purpose and special-purpose. General-purpose computers provide tolerable performance on a wide range of programs. In contrast, specialized computers, tailored to a narrow class of programs, usually yield significantly higher throughput. However, specialized computer architectures are limited in availability, inflexible, require special programming environments, and are, in general, expensive. Both classes are limited in that they utilize a 'fixed' hardware architecture, i.e. their designs, conceived at creation, are unchanged during their lifetime. PRISM-I introduced a new concept, wherein a custom architecture is dynamically created to execute a specific high-level program in a faster and more efficient manner. While one component of this architecture is a traditional general-purpose processor, the other is automatically synthesized from a collection of FPGAs by a configuration compiler. Speed-up is achieved by executing key sections of the high-level program on the synthesized hardware, while the remainder of the program executes on the core processor. While PRISM-I developed a proof-of-concept platform, it is limited to simple programs. This paper introduces a significant conceptual advancement, PRISM-II, which synthesizes asynchronous, adaptive architectures for complex programs, including those that contain iterative 'loop' structures with dynamic loop counts. PRISM-II introduces a novel execution model, wherein an operator-network and controller are synthesized for key sections of a high-level program. The operator-network, custom-synthesized from FPGAs, executes the key sections in a data-flow manner, i.e. any instruction is executed as soon as its input operands are available. The controller controls the computations in the operator-network, accurately determines when the execution is complete by utilizing key principles developed in PRISM-II, and generates an end-of-computation signal to asynchronously inform the core processor to fetch the results from the FPGA platform. While the realization of a general-purpose data-flow architecture has remained difficult in the architecture community, the PRISM-II approach promises asynchronous, data-flow execution of programs on custom-synthesized FPGA hardware. © 1997 Academic Press Limited

1. Introduction

Since their advent, the performance of computers has increased exponentially, roughly an order of magnitude speed-up every 8–10 years [1]. The performance increase can be attributed to both technology-dependent and technology-independent improvements. While a few of the key technological advances include VLSI technology, semiconductor memories [2], and advances in silicon technology, important technology-independent advances include advanced compiler techniques, caching, pipelining, RISC, and vectorization. Given that VLSI and semiconductor technologies are fast approaching the limits of physics, technology-independent techniques are increasingly assuming a prominent role. This paper presents a novel technology-independent mechanism.


Along with the ever-increasing demand for higher performance, the diversity of programs executed on computers has increased tremendously. While in the early days of ENIAC [3] a computer was principally used for number-crunching applications like computing artillery firing tables, today computers are used for scientific number-crunching applications, mission-critical business applications, such as airline reservation systems, banking and medical applications, engineering and design applications, e.g. CAD/CAM, complex simulation and animation tasks, e.g. virtual reality, and ubiquitous applications, such as word processing, spreadsheets and games. Each of these applications requires a unique organization of the computational, input–output, memory and communication subcomponents. That is, every application requires an architecture that has been tailored specifically to address its needs and execute it most efficiently. However, such an architecture, tailored to a specific application, may not yield even modest or acceptable performance for other applications.

Economics and the flexibility of architecture trade-offs have led to the design of general-purpose and special-purpose computers. While general-purpose computers cost relatively less and provide tolerable performance on a wide mix of application programs, special-purpose machines are expensive, yield excellent performance on a specific application or a class of applications, but execute poorly on other applications.

The literature records several efforts to integrate elements of special-purpose architectures into a general-purpose framework. These efforts include the attachment of special hardware substructures, proposed by Estrin [4], enhancing processor instruction sets with specialized complex instructions [5, 6], dynamic microprogramming [7–10], utilizing reconfigurable computing elements [11–14], and the use of co-processors [15]. Unfortunately, all of the above efforts suffer from one or both of the following fundamental limitations:

• the integration effort is determined at design time, is permanent throughout its life, and is therefore incapable of addressing new application programs;
• the integration effort is neither automatic nor transparent to the programmer, and therefore the programmer must possess knowledge of the processor architecture and hardware design.

PRISM-I [16] is perhaps the first attempt that effectively addresses the above limitations by demonstrating a proof-of-concept system. In it, the integration effort may be easily adapted to a large class of application programs and is transparent to the programmer. PRISM-I permits the realization of specialized architectures for maximum execution performance of every individual application program. The underlying philosophy of PRISM-I is to exploit the notion of 'locality of reference' [3], which reflects the empirical finding that most programs spend 90% of their execution time in 10% of the code [17]. PRISM-I aims to expend effort and resources to increase the performance of the small, frequently-executed sections of the program, termed 'critical sections', as opposed to the remainder of the less-frequently-executed sections. In PRISM-I, shown in Fig. 1, reconfigurable hardware is used to augment the functionality of a general-purpose core processor.
A set of Field Programmable Gate Arrays (FPGAs) constitutes the reconfigurable hardware, which may be dynamically configured to execute the critical sections of an application program quickly.

Figure 1. The PRISM-I approach: a general-purpose CPU coupled to a reconfigurable platform over the system bus.

The less-frequently-accessed sections are executed on the core processor. The overall impact of PRISM-I is improved execution performance.

To achieve its goals, the synthesized hardware in PRISM must execute the critical sections faster than a general-purpose processor. This requirement, in turn, translates into several low-level requirements for the synthesized hardware, namely, (i) simplicity, (ii) minimal communication overhead, and (iii) the exploitation of fine-grain, i.e. operator-level, parallelism. Requirements (i) and (ii) are particularly important for today's FPGAs, with their lower gate counts and slower speeds. The need to exploit fine-grain parallelism is important because of the frequently-encountered small-sized critical sections.

A substantial amount of work in hardware synthesis has been reported in the literature. This section reviews the research into hardware synthesis tools and data-flow computational models. Trickey [18] presents a 'hardware' compiler that translates high-level behavioral descriptions of digital systems, in Pascal, into hardware, subject to a user-specified cost function. Lanneer and colleagues [19] report the CATHEDRAL high-level synthesis environment for automated synthesis of IC architectures for real-time digital signal processing. IBM's HIS system [20] translates a behavioral description of a synchronous digital system specified in VHDL into a design consisting of a finite state machine and a datapath, both described in the output language BDL/CS. The Cyber system [21] aims to compile software programs in C and BDL into ASIC chips, called 'software chips'. Its first targets are synchronous, control-dominated applications and processors implemented in ASICs. In addition, Camposano [22] and Walker [23] survey different high-level synthesis systems.

Most of the approaches reported in the literature differ with regard to the high-level hardware description language, optimization and transformation techniques, and scheduling and allocation algorithms. However, they share a common underlying model of execution, namely, a synchronous digital machine that consists of a datapath and a controller governed by a finite state machine. The execution of the machine is organized through basic time units, termed control steps, that correspond to the states of the finite state machine. The datapath consists of a set of functional units, e.g. arithmetic logic units, multiplexors, registers, storage units, and buses. The controller is either microcoded or hardwired and executes the instructions sequentially. It also controls the computation of data in the functional units and the transfer of data and results to and from the storage units. Two key limitations of the synchronous approach, similar to those of von Neumann's 'stored program control', are:

• a centralized controller imposes strict sequential execution of all instructions. As a result, a preceding instruction whose operands are not yet available may prevent the execution of subsequent instructions even when the operands of the latter are available [24]. This clearly results in the failure to exploit potential parallelism that may be inherent in the program. An added problem is the reduced ability of the processor to tolerate latency in fetching operands from the storage units [25]. That is, the processor has to wait for each operand fetch to complete before initiating the computation. Techniques including the use of large register sets, caches, and pipelines aim to reduce the adverse impact of latency.

• intermediate results or data are passed between successive instructions through the use of registers and storage units. This indirect mode of transfer not only slows down the communication of data among instructions, but may also cause side effects. Because of the latter, for correctness, external synchronization must be imposed on operand fetches, which may impose additional overhead on the overall hardware execution.

In contrast, the data-flow computational model is based on two fundamental principles [26]:

• A.1: an operation can be executed as soon as all its required operands are available.
• A.2: all operations are functions, i.e. there are no side effects arising from intermediate results and data being stored in registers and storage units.

The principle A.1 permits one to take advantage of the fine-grain parallelism that may be inherent in a program, and enhances the processor's ability to tolerate operand fetch latency. In theory, A.2 enables faster sharing of intermediate results or data between successive instructions. Although the data-flow model apparently promises higher hardware execution performance, in reality there are several limitations. First, general-purpose data-flow architectures proposed in the literature [27, 28] are very complex. They require complex mechanisms for labeling tokens, storing data, and communicating between successive instructions. Their demand for silicon is so high that it is unrealistic to implement them on current FPGAs, which feature modest gate counts. Second, general-purpose data-flow architectures involve significant overheads in terms of execution time, implying a slow rate of providing operands to the individual processing elements. Consequently, to outperform conventional processors, general-purpose data-flow architectures require parallelism of the order of several hundred concurrently executable instructions in the programs [26].

The limitations of both hardware synthesis tools and general-purpose data-flow architectures make them poor candidates for flexible, high-performance FPGA-based architectures, which demand (i) inexpensive hardware platforms, (ii) maximum exploitation of parallelism inherent in critical sections, and (iii) minimal implementation overheads. While PRISM-I [16] aims to address a few of these limitations, it suffers from the following critical weaknesses:

• the reconfigured hardware is constrained to evaluate functions within a single bus cycle of the core processor. As a result, critical sections whose critical-path delays exceed the core processor's bus cycle cannot be synthesized on the reconfigurable platform.
• inefficient execution of critical sections that contain control constructs, e.g. 'if-then-else'. In general, control constructs may imply multiple possible execution paths that, in turn, require different execution times depending on the actual execution semantics and input data. In PRISM-I, the hardware design always chooses the longest of the execution times of the different possible paths which, while conservative, implies inefficiency.

• inability to synthesize loops with dynamic loop counts, which eliminates a large class of programs requiring iterative computations.

This paper presents PRISM-II, a new execution model and a novel architecture that addresses the key weaknesses of PRISM-I. The execution model facilitates the exploitation of maximal fine-grain parallelism in critical sections without imposing rigid sequentiality. The architecture addresses critical sections that may require arbitrary execution times, contain control constructs such as 'if-then-else' and 'switch-case', and contain loop constructs with static and dynamic loop counts. As with PRISM-I, PRISM-II accepts user programs written in C. The remainder of the paper is organized as follows: section 2 presents an overview of PRISM-II, highlighting the configuration compiler and the hardware platform; section 3 introduces a framework for the configuration compiler and presents a detailed discussion of the mechanism to translate a critical section into executable code; section 4 presents details of the algorithm for the translator and illustrates it with an example; section 5 presents the conclusions and suggestions for future work.

2. The PRISM-II approach: overview

The PRISM-II approach consists of two principal components—the hardware platform, which ultimately executes the application program, and the configuration compiler, which accepts an application program in C and translates it into executable code for the hardware platform.

2.1 Hardware platform

The hardware platform, shown in Fig. 2, consists of (i) a core processor, namely the AMD Am29050 RISC processor [29], and (ii) a reconfigurable platform that consists of a set of FPGAs interconnected to the core processor through the latter's co-processor interface. The hardware platform design addresses two key limitations of PRISM-I. First, unlike PRISM-I, which requires between 45 and 75 clock cycles, 100 ns in length, to access a component of the synthesized hardware, PRISM-II requires only 30 ns. In addition, while the length of the data transfer in PRISM-I is 32 bits, that for PRISM-II is 64 bits for inputs and 32 bits for outputs. Second, unlike PRISM-I, where an operation must fit in a single FPGA, PRISM-II permits an operation to utilize up to three FPGAs through partitioning of the data-flow graph.

The fast AMD Am29050 processor is selected to strike a reasonable balance between hardware and software performance. The Am29050 processor, clocked at 33 MHz, can provide roughly 28 MIPS performance. In addition, it has a built-in floating point unit, which is important, since the area expense necessary to synthesize one on the reconfigurable platform is high. Data transfer to and from the FPGAs is supported in the form of 64-, 32-, 16- and 8-bit quantities to facilitate hardware specifications in a high-level language.

Figure 2. PRISM-II hardware platform: the Am29050-33 core (with MMU, FPU, timer and cache) on 32-bit instruction, address and data buses, together with a boot ROM, a burst-mode memory controller (V3) driving interleaved DRAM banks A and B, a bus exchanger/latch, timer, communications and PIO devices, a status display, and three reconfigurable platforms.
The Xilinx 4010 FPGA [30] provides 160 general-purpose I/O pins, which allow for several 8-bit buses. It is expected that the application programs implemented on PRISM-II will have high data fan-in, i.e. a large number of inputs. The fan-in may be viewed as a manifestation of function calls that accept several arguments and return a single result.

2.2 The configuration compiler

Figure 3 shows an overview of the PRISM-II architecture. In it, the configuration compiler accepts a user program in C as input, and generates hardware and software images. The hardware image contains the information necessary for synthesizing hardware, corresponding to the critical sections, on the reconfigurable platform. The software image contains executable code that realizes the execution of the critical sections on the reconfigurable platform and the remainder of the program on the Am29050. Both hardware and software images are generated automatically, without intervention from the user. A current underlying assumption in PRISM-II is that the critical section(s) of an application program is identified a priori by the programmer.

Figure 3. Overview of the PRISM-II architecture: the C program is translated by a standard C compiler into the software image, and by the configuration compiler into the hardware image.

The configuration compiler consists of two principal components: (i) a C parser and optimizer, and (ii) a hardware synthesizer. The parser and optimizer constitute the front end, while the hardware synthesizer forms the back end. To reduce development time, the parser and optimizer of the GNU C compiler (gcc) are utilized as the starting point and are significantly modified to adapt to the needs of this research. The hardware synthesizer constitutes the core of the synthesis subtask, and is being designed and developed in this investigation. Figure 4 presents the structure of the configuration compiler, including its components and the flow of information, and is described in greater detail in the subsequent sections of the paper.

3. A framework for the configuration compiler

This section presents a framework for the PRISM-II configuration compiler. The critical section is first translated into an intermediate representation and then synthesized onto the FPGA platform. A novel execution model is proposed for developing the architecture of the synthesized hardware. Among its key advantages over the existing execution models used in traditional high-level synthesis architectures and data-flow machines, the proposed execution model exploits the fine-grain parallelism inherent in the critical section, requires minimum data and control communication overheads, and imposes low implementation cost. The designs of the intermediate representation and the execution model both utilize Control Flow Graphs (CFG) and Data Flow Graphs (DFG) of the critical section.

3.1 Control Flow Graphs

The translation begins with a block-level CFG of the critical section. A CFG is a directed graph in which the nodes represent 'basic blocks' [31] and the edges represent the flow of control. For example, an edge from node X to node Y indicates that execution of block X may be followed immediately by execution of block Y. A basic block is a sequence of consecutive statements of the section in which the flow of control enters at the beginning and leaves at the end, without halting or branching except at the end. A total of five basic types of nodes are conceivable:

Figure 4. The structure of the configuration compiler: the GCC front end (parsing and standard optimizations) emits RTL; flow graph generation feeds the hardware synthesizer (machine synthesis followed by X-BLOX netlist generation), whose XNF output is processed by the Xilinx tools (PPR etc.) into hardware images. (GCC = GNU C Compiler; RTL = Register Transfer Language; XNF = Xilinx Netlist Format.)

• Start: a unique node that has no incoming edges. It represents the start of the computation.
• Stop: a unique node that has no outgoing edges. It represents the end of the computation.
• Sequential: it has only one outgoing edge.
• Predicate: it has at least two outgoing edges.
• Merge: it has at least two incoming edges.

A complex node is a combination of two or more basic node types. For instance, a 'start' node may also be a 'predicate' node. A 'stop' node may also be a 'merge' node.
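Because a node's type follows directly from its edge counts, the classification is mechanical. The following C sketch illustrates it; the flag and function names are illustrative, not taken from the paper.

    enum {
        NODE_START      = 1 << 0,   /* no incoming edges          */
        NODE_STOP       = 1 << 1,   /* no outgoing edges          */
        NODE_SEQUENTIAL = 1 << 2,   /* exactly one outgoing edge  */
        NODE_PREDICATE  = 1 << 3,   /* two or more outgoing edges */
        NODE_MERGE      = 1 << 4    /* two or more incoming edges */
    };

    int classify_node(int num_in_edges, int num_out_edges)
    {
        int kind = 0;
        if (num_in_edges == 0)  kind |= NODE_START;
        if (num_out_edges == 0) kind |= NODE_STOP;
        if (num_out_edges == 1) kind |= NODE_SEQUENTIAL;
        if (num_out_edges >= 2) kind |= NODE_PREDICATE;
        if (num_in_edges >= 2)  kind |= NODE_MERGE;
        return kind;   /* a complex node sets more than one flag */
    }

A node that sets two or more flags is exactly the 'complex node' defined above.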

Figures 5(a) through 5(c) present an example function, its CFG, and its DFG. In Fig. 5(b), the node labeled 'predicate-block' is both the start node and a predicate node. The nodes labeled 'then-block' and 'else-block' are both sequential nodes. The node labeled 'join-block' is both a merge node and the stop node.

    if_example(int a, int b)
    {
        int c = 0;

        if (a < 0) {
            c = a + b;
            c = c | 4;
        } else {
            c = a * b;
            c = c & 4;
        }
        return c;
    }

Figure 5. (a) An example function; (b) the CFG, annotated with block execution times (predicate-block 25, then-block 75, else-block 525, join-block 0; the then-path totals 100 time units, the else-path 550); (c) the DFG, with operator delays of 50 for plus and ge, 500 for mult, and 25 for or, and, and mux.

The number at the bottom of each node represents the time units necessary to execute the corresponding block, while that on an edge represents the cumulative execution time up to the point where the edge emanates. The CFG of a function encapsulates the flow-of-control information in the function, and a particular instance of execution of a function is expressed through a path from the 'start' to the 'stop' node. A path in a CFG is an ordered set of statements executed in sequence. A delay, associated with each path, represents the total time required to execute all the statements in the path sequentially. The exact path followed during an instance of execution is a function of the input data.

Of the two paths in Fig. 5(b) between the 'start' and 'stop' nodes, the path through the 'then-block' requires 100 time units, while that through the 'else-block' requires 550 time units. The value of the parameter 'a' determines the actual path. In essence, the CFG provides information on the different possible execution paths of a function.

The CFG representation of a function is limited in that its underlying execution model is the sequential, von Neumann architecture. Consequently, it fails to expose any inherent parallelism—coarse-grain, i.e. block-level, and fine-grain, i.e. operator- or statement-level. For example, in Fig. 5(b), one may not obviously conclude whether the first and second statements in the 'then-block' may execute concurrently.

3.2 Data Flow Graphs (DFG)

A block-level DFG of the critical section is utilized to address the limitations of the CFG. The DFG is a directed graph wherein the nodes and edges represent the operators and the flow of values among them. The DFG utilized in this paper differs from the traditional DFG in that it contains, as explained subsequently, multiplexor and latch operators. A total of six different types of nodes are conceivable in a DFG:

• Constant: represented through a circle, as shown in Fig. 5(c).
• Unary operator: it accepts one input and generates a result, e.g. − and NOT.
• Binary operator: it accepts two inputs and generates a result, e.g. +, −, AND, and mult.
• Ternary operator: it accepts three inputs to generate a result, e.g. a multiplexor.
• Latch: it accepts two inputs—a data value and a clock—and generates the latched value at its output.
• Input/output registers: these respectively store the inputs and outputs of a critical section and are represented through shaded rectangles, as shown in Fig. 5(c).

It is noted that the DFG that is propagated to the 'machine synthesis' module from the 'flow graph generation' module in Fig. 4 is basic and does not contain multiplexor and latch operators. The 'machine synthesis' module adds these operators to derive a 'complete DFG'. Figure 5(c) presents the DFG representation of the critical section in Fig. 5(a).

The unary and binary operators represent the unary and binary arithmetic and/or logic operations, respectively. Two types of multiplexor operators are supported—'merge-mux' and 'loop-mux'. A multiplexor selects one of two inputs, as dictated by a third select input. The 'merge-mux' selects one of two definitions of a variable arriving at a merge node in a CFG. The 'loop-mux', present at the top of a loop, selects either the initial value of a variable or its value fed back from a subsequent loop iteration. A latch temporarily stores the fed-back value of a variable from a loop iteration.

Given that the DFG representation of a critical section is based on the data-flow model of execution [27], i.e. there is no central locus of control, a node of a DFG may execute as soon as all its input operands are available. Upon execution, a node places the result on its outgoing edges, which carry the value to other operators. Thus, an operator never stalls the execution of another operator unless its output serves as the latter's input operand. Unlike the CFG, the DFG does expose the fine-grain parallelism inherent in the critical section, and eliminates the performance-retarding side-effects [26] by converting all operations into functions.

Clearly, the DFG fails to capture the control information inherent in the critical section in scenarios, such as loops, that involve iterative computation. In the execution of a loop, additional information is required to control the iterations and correctly feed back the values to subsequent iterations. Techniques such as tagged tokens and waiting–matching [27], proposed in the literature on data-flow architectures, are very complex, expensive in terms of hardware, and require substantial implementation overheads. Such techniques are also inappropriate for an FPGA-based architecture, where hardware is limited and high implementation overheads are unacceptable.

3.3 Model of execution

PRISM-II's objectives include: (i) the exploitation of fine-grain parallelism; (ii) providing primitives for addressing iterative computations; (iii) efficient execution of control constructs; and (iv) implementation on FPGAs that is efficient yet inexpensive. These objectives are encapsulated in the proposed execution model, which provides the underlying basis for hardware synthesis on the FPGA platform.

The execution model consists of two principal components—(i) the 'operator-network', and (ii) the 'controller'. The operator-network is a specific organization of arithmetic and logic units intended to perform the actual computation, and is an instantiation of the DFG on the FPGAs. The controller, a finite state machine, controls the computation in the operator-network, determines when the computation is complete, and generates the 'end-of-computation' signal at the conclusion of the computation. The 'end-of-computation' signal provides an asynchronous means of informing the core processor to fetch the results from the FPGA platform. Figure 6 presents a pictorial view of the execution model.

Prior to initiating execution, the input values are loaded into the input registers and the controller is initialized. The operator-network is initiated as soon as inputs are available. That is, operators execute as soon as their input operands are available, and intermediate results and data are communicated from one operator to the subsequent operator directly through dedicated connections, thereby implying minimal communication overheads. The controller plays the role of a timer that tracks the execution time of the operations. The controller stores information on the execution delays along all possible execution paths in the operator-network, utilizes specific intermediate data values generated inside the operator-network, and tracks the actual execution path taken by the current instance of execution. The data values may include the predicates for conditionals and the predicates for loops.

The presence of control constructs such as 'if-then-else' generates the possibility of multiple execution paths during the execution of a critical section. Where each of the possible execution paths may require a different execution time, it is important, for efficiency, that the controller tracks the actual path to precisely account for the time taken by an execution. At the end of the duration of the 'tracked' path, the controller latches the results in the output registers and generates the 'end-of-computation' signal.

The execution of a loop with dynamic loop count involves three phases: 'initialization', 'execution of the body', and 'feed-back'.

Figure 6. The PRISM-II execution model: input data is latched into the FPGA platform, the operator-network computes under the supervision of the controller (FSM), the final result is latched at the output, and the End_of_Computation signal informs the main processor (AMD 29050).

In the 'initialization' phase, the controller signals all the 'loop-muxes', located at the top of the loop, to select the initial values of the input variables to the loop. This phase is executed only once during the entire execution of the loop. In the 'execution of the body' phase, the operator-network executes the code segment in the loop body in a data-flow manner. The controller tracks the time required for this phase. In the subsequent 'feed-back' phase, the controller first examines the boolean value generated by the loop predicate, i.e. the loop exit condition, to determine whether to iterate the loop further or exit. Where the controller decides to iterate the loop, it generates the appropriate signals to latch the values of all intermediate state variables that are fed back to the loop. The controller also generates signals to the 'loop-muxes', at the top of the loop, to select the feed-back values of the intermediate variables.

Following the 'feed-back' phase, the 'execution of the body' phase may be executed again, and the cycle continues until termination. To address the issue of nested loops, an inner loop is considered a part of the body of the outer loop, and section 4 presents further details on the synthesis and use of the corresponding operator-network and controller.

In essence, the controller permits the operator-network to execute asynchronously, independently, and in a data-flow manner, while asserting minimal control over the latter's operations. Every operator in the operator-network executes as fast as possible, limited only by the inherent data dependencies. Given that the data communication overheads among the operators are minimal, the PRISM-II approach has the potential to exploit the maximal fine-grain parallelism inherent in the critical section, and to yield high performance relative to sequential execution of the critical section. In addition, in the proposed execution model, hardware is required to implement only the operator-network and controller.

The execution model is novel in several respects. First, unlike PRISM-I [16], which limits the execution time of a critical section to the time period of the core-processor bus cycle, PRISM-II permits fast evaluation of critical sections requiring arbitrary execution times. Second, the PRISM-II execution model successfully addresses the issue of executing loops with dynamic loop counts. Third, the proposed execution model possesses several important advantages over the existing models proposed in traditional high-level synthesis architectures [18–21]. Traditional high-level synthesis architectures utilize centralized controllers that, in turn, impose strict sequential execution of instructions, resulting in the failure to exploit potential fine-grain parallelism. In contrast, such restrictions are absent in PRISM-II. Fourth, unlike the slow mechanism of communicating intermediate results and data between successive instructions or operators through registers and storage units, the novel execution model permits direct communication of data among the operators.

Throughout the literature, data-flow architectures [27, 28] have always promised high performance for general-purpose programs but have failed to meet the expectations. For general-purpose programs, they require complex mechanisms for token-labeling, storing, and communicating data which, in turn, pose a high demand for silicon area and significant execution overheads. As a result, to compete effectively with a conventional uniprocessor, a general-purpose data-flow architecture requires programs with inherent parallelism in the hundreds [26]. The proposed execution model fully exploits the data-flow concepts, namely A.1 and A.2 of section 1, and yet achieves high performance for critical sections of limited sizes.
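To make the controller's role concrete, the following is a minimal software model, in C, of the three-phase loop protocol described above. It is an illustrative sketch only: the actual controller is a synthesized finite state machine, and the helper names (set_loop_muxes, wait_time_units, loop_predicate, latch_feedback) are hypothetical, not part of PRISM-II.

    typedef enum { SELECT_INITIAL_VALUE, SELECT_FEEDBACK_VALUE } LoopMuxSelect;

    void execute_loop(int body_duration,
                      void (*set_loop_muxes)(LoopMuxSelect),
                      void (*wait_time_units)(int),
                      int  (*loop_predicate)(void),
                      void (*latch_feedback)(void))
    {
        /* 'initialization' phase: executed only once for the entire loop */
        set_loop_muxes(SELECT_INITIAL_VALUE);

        for (;;) {
            /* 'execution of the body' phase: the operator-network computes
               in a data-flow manner while the controller tracks the time */
            wait_time_units(body_duration);

            /* 'feed-back' phase: examine the boolean value generated by
               the loop predicate to decide whether to iterate or exit */
            if (!loop_predicate())
                break;

            latch_feedback();                       /* latch fed-back variables   */
            set_loop_muxes(SELECT_FEEDBACK_VALUE);  /* select the fed-back values */
        }
    }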

3.4 Intermediate representation

To facilitate its analysis and efficient execution, a critical section is expressed within the configuration compiler through an intermediate representation. The aim of the intermediate representation is a technology-independent expression of the critical section, based on the execution model described above. Once the intermediate representation is available, customized hardware can be synthesized from it by using the FPGA vendor tools.

The intermediate representation in PRISM-II is a 'machine graph' that consists of two principal components—a DFG and a Finite State Machine (FSM). The 'operator-network' and the 'controller' described in the previous section are instantiated on the FPGAs from the DFG and FSM, respectively. In the rest of the paper, operator-network and DFG, and controller and FSM, are used interchangeably; the meaning will be clear from the context. The DFG and FSM represent the computational and control aspects of the critical section, respectively, and maintain links to each other. While the DFG purely represents the computational aspects of the critical section, the FSM, derived from the CFG, represents the control flow information. In essence, the FSM behaves as a timer that provides timing signals corresponding to each of the possible paths through the CFG. The FSM is represented through a directed graph, wherein the nodes represent states and the arcs represent state transitions. A duration is associated with every state, and a transition from a state occurs at the end of this duration. The next state is determined by selecting the appropriate arc emanating from the state, based on the inputs. The input to a state is either a value obtained from the DFG or an internal signal generated at the end of the duration of the state.
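For concreteness, the machine graph may be pictured through C structures along the following lines. This is a sketch under the assumption of simple pointer-linked graphs; none of the type or field names are taken from the PRISM-II implementation.

    typedef enum { CONSTANT, UNARY_OP, BINARY_OP, MERGE_MUX, LOOP_MUX,
                   LATCH, INPUT_REG, OUTPUT_REG } NodeKind;

    typedef struct DfgNode {
        NodeKind kind;
        int delay;                  /* execution time of the operator, in time units */
        int time_stamp;             /* earliest possible completion time (section 4) */
        struct DfgNode *inputs[3];  /* up to three operands; a mux is ternary        */
    } DfgNode;

    typedef struct FsmState {
        int duration;               /* time spent in the state before a transition   */
        int num_arcs;               /* number of outgoing arcs                       */
        struct FsmState *next[2];   /* successor states                              */
        DfgNode *predicate;         /* DFG value selecting the arc, where applicable */
    } FsmState;

    typedef struct {
        DfgNode  *operator_network; /* instantiated on the FPGAs as the datapath     */
        FsmState *controller;       /* instantiated as the finite state machine      */
    } MachineGraph;

The 'complete DFG' of section 3.2 maps onto linked DfgNode values, with merge-muxes, loop-muxes and latches as ordinary nodes, while the FSM maps onto linked FsmState values.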

4. The underlying algorithm of the configuration compiler

An algorithm is proposed for the synthesis of the operator-network and controller corresponding to a critical section. The algorithm accepts as input the 'simple DFG' and CFG of the critical section, and ultimately generates a 'hardware netlist' that, in turn, is utilized by the FPGA vendor tools to place, route, and synthesize the target hardware. The algorithm is represented in Fig. 4 through the three modules—flow graph generation, machine synthesis, and X-BLOX netlist generation.

In the flow graph generation module, first the CFG and DFG are constructed. The 'complete DFG' is then constructed by inserting latches and muxes at appropriate places in the 'simple DFG'. Then, the operator nodes in the DFG are 'time-stamped', i.e. each operator node is assigned the earliest possible time when it may complete execution. In the CFG, utilizing the execution times of the operations, each basic block is assigned two time-stamps—a 'starting time-stamp' and an 'ending time-stamp'. The 'starting time-stamp' refers to the earliest time when all of the inputs to a basic block are available, while the 'ending time-stamp' is the latest time when all of the outputs from a basic block have been generated. Then, the CFG is restructured and optimized, utilizing the time-stamps of the basic blocks, to generate an FSM. As indicated earlier, the FSM represents the flow of control in the critical section, serves as the controller for the operator-network, and indicates the end of the computation. The number of states in the FSM is optimized, and the process of generating the FSM is detailed, later in this section. The algorithm is presented, in pseudo-code, in Fig. 7, followed by a detailed discussion and an example.

(1) createOperatorNetwork: the operator-network is derived from the 'simple DFG' passed on from the 'flow graph generation' module in Fig. 4. Merge-muxes, introduced earlier in this paper, are inserted at appropriate locations in the 'simple DFG' to resolve multiple definitions of a variable reaching an operator node. The select line for a multiplexor is the true/false output from the corresponding predicate operator. Latches are added to loops to help retain the values of variables from a previous iteration.

    build_machine(DFG, CFG, CDG, DT, PDT)
    {
        createOperatorNetwork();  /* create the initial version of the operator-network       */
        timeStampOperators();     /* time-stamp the operations in the operator-network        */
        timeStampBlocks();        /* using the operator time-stamps, create time-stamps
                                     for the basic blocks in the CFG                          */
        createController();       /* create the initial states of the controller from the CFG */
        determineDuration();      /* determine the "duration" of each of the states
                                     of the controller                                        */
        optimizeController();     /* optimize the number of states in the controller          */
        writeNetlist();           /* write out the hardware description of the machine graph  */
    }

Figure 7. The underlying algorithm of the configuration compiler, in pseudocode.

Loop-muxes are added at the beginning of loops to allow selecting either the initial value of a variable or its subsequent values. The select line of a loop-mux is derived from the state of the controller.

(2) timeStampOperators: each operation in the DFG is assigned a time-stamp that indicates the earliest time it may complete execution.

Through this assignment, a schedule is automatically created that also reveals the operations that may execute in parallel. The schedule also permits a view into the timing of the execution of the operator-network. A time-stamp is determined based on the premise that an operation may not execute until all of its input operands are present. Thus, it is computed as the maximum of the time-stamp values of all of the inputs plus the execution time of the operator node. The time-stamps of operations that have no inputs are set to zero, e.g. nodes for constants.

The time-stamps of operator nodes are computed utilizing a breadth-first search algorithm. That is, first all operator nodes with zero incoming edges are time-stamped. These nodes constitute the first level of the DFG. Then, all operator nodes at the second level of the DFG are time-stamped. This level includes all operator nodes that are either directly connected to the nodes of level 1 or bear a single incoming edge. Next, operator nodes at levels 3, 4, etc. are successively time-stamped. A time-stamp assigned to an operation does not necessarily reflect the actual completion time of a computation. For instance, in the case of a loop, a time-stamp value assumes that the loop is executed only once. However, time-stamps serve as a useful mechanism to order the data dependencies in a critical section. The computation is sketched below.
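The sketch, in C, reuses the illustrative DfgNode structure from section 3.4. It assumes the nodes array is already in breadth-first (level) order, so that every node follows its inputs, and it ignores feedback edges through latches, consistent with stamping a loop as if it executed once.

    void time_stamp_operators(DfgNode **nodes, int num_nodes)
    {
        for (int i = 0; i < num_nodes; i++) {
            int latest_input = 0;   /* constants and input registers, which have
                                       no inputs and zero delay, are stamped zero */
            for (int j = 0; j < 3; j++) {
                DfgNode *in = nodes[i]->inputs[j];
                if (in && in->time_stamp > latest_input)
                    latest_input = in->time_stamp;
            }
            /* earliest completion: all operands present, plus the operator delay */
            nodes[i]->time_stamp = latest_input + nodes[i]->delay;
        }
    }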

(3) timeStampBlocks: time-stamps are computed for every basic block in the CFG utilizing the time-stamps of the individual operations in the DFG. For each basic block, 'starting time-stamp' and 'ending time-stamp' values are computed, as indicated earlier in this paper. For efficiency, for every basic block, the 'ending time-stamps' of its direct predecessor basic blocks are examined. Where the 'starting time-stamp' of a basic block is less than the minimum of the 'ending time-stamps' of the predecessor basic blocks, the 'starting time-stamp' of the basic block is modified to the minimum value. As a result, the basic block may be initiated for execution earlier, thereby achieving concurrency.

A predicate basic block determines which of the two succeeding basic blocks will be executed subsequently. The definitions of variables from the two succeeding basic blocks will be merged through a merge-mux using the predicate value as its select signal. The predicate value selects which of the two definitions will be propagated. It may be observed that, for a merge-mux to execute, it is adequate if the select signal and the input actually selected are available; it is not necessary for the other input to be available. Thus, for efficient execution of the operator-network, it is important to identify the predicate basic blocks.

(4) createController: the initial version of the controller is simply the CFG, with the states of the controller corresponding directly to the basic blocks of the CFG. The final version of the controller results from efforts to restructure and optimize the number of states. A state transition is dictated by the predicate value of the corresponding basic block. In a given state, all operations of the corresponding basic block must be completed. Therefore, in the implementation, a counter is initially loaded with the cumulative time duration of all the operations, and its value is decremented as time progresses. The controller remains in the given state as long as the counter value is non-zero and then permits a state transition, as modelled in the sketch below.
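In software terms, the dwell-and-transition behaviour may be modelled as follows, reusing the illustrative FsmState structure from section 3.4; in the synthesized controller the counter is, of course, realized in hardware.

    FsmState *controller_advance(FsmState *s, int predicate_value)
    {
        int counter = s->duration;   /* loaded on entry to the state        */
        while (counter > 0)
            counter--;               /* one decrement per elapsed time unit */

        /* counter exhausted: a transition is now permitted; with two
           outgoing arcs, the predicate value of the corresponding basic
           block dictates which arc is taken */
        if (s->num_arcs <= 1)
            return s->next[0];
        return s->next[predicate_value ? 0 : 1];
    }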

    if (a > b)          : BB0
        c = a - b;      : BB1
    else
        c = a * b;      : BB2
    d = c + 5;          : BB3

Figure 8. Code fragment to illustrate optimization of FSM.

Figure 9. (a) CFG for the example code fragment in Figure 8, with block durations BB0 = 10, BB1 = 20, BB2 = 30 and BB3 = 40, giving a then-path duration of 70 and an else-path duration of 80; (b) the corresponding unoptimized FSM, with one state per basic block; (c) the optimized FSM, with only two states (state0 = 70, state1 = 10).

(5) determineDuration: for every state of the FSM, a duration is computed utilizing the time-stamps of the corresponding blocks in the CFG. The duration of a block is the difference between its 'ending time-stamp' and its 'starting time-stamp'.

(6) optimizeController: the initial FSM, derived directly from the CFG, may contain more than the minimal number of states. In this step, the initial FSM is optimized to yield the final FSM, which, in turn, implies an efficient controller hardware for the operator-network. In general, FSMs derived from CFGs corresponding to 'if-then-else' constructs lend themselves to optimization. For instance, the code fragment shown in Fig. 8 consists of four basic-blocks, each with a distinct duration. The corresponding CFG, shown in Fig. 9(a), contains the four basic-blocks, and Fig. 9(b) presents the unoptimized FSM with four states. The optimized FSM requires only two states, as shown in Fig. 9(c). The integers at the bottom of the basic-blocks in Fig. 9(a) indicate the corresponding durations. It may be observed that the CFG contains two paths, namely the 'then-path' of duration 70 and the 'else-path' of duration 80, and, as a result, the FSM needs only two states, 'state0' and 'state1'. The duration for 'state0' is obtained by adding the durations of the 'predicate-block' BB0, the 'join-block' BB3, and the minimum of the durations of the 'then' and 'else' blocks, i.e. 10 + 40 + min(20, 30) = 70. The duration for 'state1' is the difference between the duration of the longer of the two paths from the 'if-block' to the 'join-block' and that of 'state0', i.e. 80 − 70 = 10.

    if (a > b) {            : BB0
        if (b > 10)         : BB1
            c = a + b;      : BB2
        else
            c = a * b;      : BB3
        c++;                : BB4
    } else {
        if (a == 0)         : BB5
            c = b / 5;      : BB6
        else
            c = ++b - 5;    : BB7
        c /= 2;             : BB8
    }
    c = c * c;              : BB9

Figure 10. Code fragment to illustrate optimization of FSM for nested ‘if-then-else’ constructs.

Thus, optimization leads to an FSM with two fewer states than the initial FSM.

The rationale underlying the optimization is as follows. In Fig. 9(a), once the thread of execution reaches block BB0, it is certain that the 'predicate-block' BB0 and the 'join-block' BB3 will be executed. Also, depending on the boolean value of the predicate, either the 'then-block' or the 'else-block' will be executed. Therefore, the total execution time for the 'if-then-else' construct will at least equal the sum of the durations of the 'predicate-block', the 'join-block', and the minimum of the 'then' and 'else' blocks. This minimum required execution time serves as the duration for 'state0'. The second state, 'state1', accounts for the additional time that is required when the execution adopts the longer of the two alternate paths.

The basic idea is extensible to nested 'if-then-else' constructs. As a result, in general, for a critical section, the reduction in the number of states is a function of the number of 'if-then-else' constructs that it contains. For instance, consider the code segment shown in Fig. 10. The corresponding CFG is shown in Fig. 11(a), with 10 blocks and four possible execution paths from basic-block BB0 to basic-block BB9. The initial, unoptimized FSM, derived directly from the CFG, is shown in Fig. 11(b). The fully optimized FSM, shown in Fig. 12(a), contains only four states.

A limitation associated with this FSM is expressed as follows. Complex logic is required for a state transition from 'state0' to a subsequent state. Thus, a transition from 'state0' to 'state1' requires predicate P1 to be TRUE and predicate P2 to be FALSE. This is reflected in Fig. 12(a) through the symbol 'P1 AND ¬P2' on the arc from 'state0' to 'state1'. In general, the depth of nesting determines the complexity of the logic required to initiate the state transitions. While the additional logic results from the effort to reduce the number of states in the FSM, the evaluation of the logic requires time and implies an increase in the total execution time of the corresponding 'if-then-else' construct.
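As a check against Fig. 11(a), the four path durations follow directly from the block durations: path1 = BB0 + BB1 + BB2 + BB4 + BB9 = 10 + 10 + 30 + 50 + 100 = 200; path2 substitutes BB3 (40) for BB2 (30), giving 210; path3 = BB0 + BB5 + BB6 + BB8 + BB9 = 10 + 20 + 70 + 90 + 100 = 290; and path4 substitutes BB7 (50) for BB6 (70), giving 270.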

Figure 11. (a) CFG of the code fragment in Figure 10, illustrating optimization of nested 'if-then-else' constructs; the block durations are BB0 = 10, BB1 = 10, BB2 = 30, BB3 = 40, BB4 = 50, BB5 = 20, BB6 = 70, BB7 = 50, BB8 = 90 and BB9 = 100, giving path durations of 200 (path1), 210 (path2), 290 (path3) and 270 (path4); (b) the corresponding unoptimized FSM, with one state per basic block.

Figure 12. (a) Optimized FSM for the code fragment in Figure 10: four states, at the cost of combinational logic such as 'P1 AND ¬P2' on the transitions out of the merged predicate-state; (b) alternate optimization of the CFG and generation of an FSM without additional logic: five states, with each transition tested on a single predicate.

For a nested 'if-then-else' construct, the principal reason underlying the additional logic is that the 'inner predicate' states, such as 'state1' and 'state5' in Fig. 11(b), are merged with the 'outer predicate-state' 'state0' in Fig. 11(b) to create the single predicate state 'state0' in Fig. 12(a). As a result of this merging, it becomes necessary to examine combinations of the predicate values, as opposed to a single predicate, to determine state transitions.

Alternate state optimization may be achieved without introducing additional logic, as shown in Fig. 12(b). The predicate states 'state1' and 'state5' in Fig. 11(b) are not merged with their parent predicate state 'state0'. Instead, for the outer 'if-then-else' construct, only the join-state 'state9' in Fig. 11(b) is merged with 'state0' in Fig. 11(b) to create 'state0' in Fig. 12(b). Moreover, each of the innermost 'if-then-else' constructs—{BB1, BB2, BB3} and {BB5, BB6, BB7}—is fully optimized to two states—{'state1', 'state2'} and {'state3', 'state4'}—in Fig. 12(b). The resulting FSM consists of five states and no additional logic, in contrast to four states and additional logic in the FSM in Fig. 12(a).

The algorithm for achieving optimization without introducing additional logic is described as follows. First, every innermost 'if-then-else' construct is identified and then optimized to two states, as explained earlier. Second, starting from every innermost 'if-then-else' construct, move outwards, optimizing the unoptimized outer 'if-then-else' constructs until the outermost 'if-then-else' construct is reached. Finally, the 'join-state' of the outermost 'if-then-else' construct is merged into its 'predicate-state' to form the new 'predicate-state', with a duration equal to the sum of those of the two merged states.

Non-loop constructs in critical sections, such as 'switch-case', may first be reduced to 'if-then-else' form and then optimized. For loop constructs with fixed or dynamic loop counts, including 'for', 'while', and 'do-while', the body of the loop is optimized analogously to the 'if-then-else' construct. The body of a loop is the set of basic-blocks, except the predicate block that tests the exit condition of the loop, that is executed iteratively. The loop-body may contain any mix of constructs and is represented through a corresponding CFG.

Given a critical section containing a mix of loop, switch-case, if-then-else, etc. constructs, the final FSM is obtained as follows. First, all of the loops in the CFG and their constituent basic-blocks are uniquely identified. Then, every loop-body is optimized separately and, in the process, only states contained within a given loop-body are permitted to merge. A state outside a loop-body is not permitted to merge with a state within the loop-body, since this would increase the duration of a single iteration of the loop, thereby increasing the overall execution time. Code segments that contain 'if-then-else' and 'switch-case' constructs but lack loops are optimized as described above, and the corresponding FSM is generated; the recursion over the nesting structure is sketched below.
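In outline, the no-extra-logic optimization can be sketched as a recursion over the nesting structure. The IfConstruct type and the helper functions below are illustrative names, not taken from the PRISM-II sources, and the collapse and merge details are left abstract.

    typedef struct IfConstruct {
        struct IfConstruct *then_inner;  /* nested construct on the then-path, if any */
        struct IfConstruct *else_inner;  /* nested construct on the else-path, if any */
        int is_outermost;
    } IfConstruct;

    /* hypothetical helpers: collapse one construct to two states without
       touching the predicate-states of its inner constructs, and merge the
       outermost join-state into its predicate-state (the merged duration
       is the sum of the two) */
    void optimize_construct(IfConstruct *c);
    void merge_join_into_predicate(IfConstruct *c);

    void optimize_no_extra_logic(IfConstruct *c)
    {
        /* innermost constructs first, then move outwards */
        if (c->then_inner) optimize_no_extra_logic(c->then_inner);
        if (c->else_inner) optimize_no_extra_logic(c->else_inner);
        optimize_construct(c);
        if (c->is_outermost)
            merge_join_into_predicate(c);
    }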

(7) writeNetlist: the final FSM, representing the controller, and the operator-network obtained in step 1 of the algorithm are translated into the X-BLOX Xilinx Netlist Format (XNF) representation. X-BLOX [32] is a library of components, including adders, multiplexors, counters, etc., that have been previously optimized with respect to speed and area for the Xilinx FPGA. Following the translation, the 'xblox' and 'ppr' tools [30] are used to place and route the FPGA, thereby yielding custom hardware for fast and efficient execution of the critical section.

4.1 Limitations of PRISM-II

The limitations of PRISM-II, at the present time, include the following:

• The execution model does not address memory references, pointers, data structures including arrays and structs, or global and static variables that require accessing memory locations outside the FPGA. Global and static variables must be reimplemented in the form of explicit arguments of the critical section.
• Floating point types and operations are not supported because of their intense demand for silicon.
• The number of input operands to the current FPGA platform is restricted to two 32-bit numbers. The size of the return value is limited to 32 bits.

4.2 An example

To illustrate the principles of PRISM-II, consider the critical section, expressed in C, shown in Fig. 13. The critical section aims to perform integer division and computes the quotient 'q', given two integers 'd' and 'e'. The purpose of this program is to illustrate the PRISM-II approach, and the author does not necessarily recommend its usage for integer division. It is selected for the following reasons:

• it contains loops whose iteration counts are dynamic, i.e. unknown at compile time; and
• it contains if-then-else constructs.

First, the code segment in Fig. 13 is analysed and the executable statements are grouped into nine basic-blocks, BB0–BB3 and BB5–BB9.

    int div(int d, int e)
    {
     1.     int q = 0;                   bb0
     2.     char samesign = 1;           bb0
     3.     if (d < 0) {                 bb0
     4.         d = -d;                  bb1
     5.         samesign = -samesign;    bb1
     6.     }
     7.     if (e < 0) {                 bb2
     8.         e = -e;                  bb3
     9.         samesign = -samesign;    bb3
    10.     }
    11. loop:
    12.     d = d - e;                   bb5
    13.     if (d >= 0) {                bb5
    14.         q++;                     bb6
    15.         goto loop;               bb6
    16.     }
    17.     if (samesign)                bb7
    18.         return q;                bb9
    19.     else
    20.         return -q;               bb8
    21. }

Figure 13. An example function to illustrate the synthesis of operator-network and controller.

Second, the initial, traditional DFG and CFG are created for the code segment in Fig. 13 [31]. Then, multiplexors and latches are added to the traditional DFG to synthesize the operator-network. The operator nodes corresponding to the basic-blocks of the code segment in Fig. 13 are shown enclosed in dashed rectangular boxes in the operator-network in Fig. 14. This correspondence will subsequently assist in correlating the functions of the operator-network and the FSM controller.

In Fig. 14, the circular nodes represent constant values that are utilized during the computation, while the rectangular nodes represent unary or binary arithmetic or logical operators. For example, the nodes labeled 'minus' and 'plus' in Fig. 14 correspond to the '−' and '+' arithmetic operators, respectively. The diamond nodes labeled 'lt', 'ne', and 'ge' correspond to the boolean operators 'less than', 'not equal to', and 'greater than or equal to', respectively.

Figure 14. Operator-network for the C-function in Fig. 13, with input registers 'd' and 'e' at the top, the result register 'return' at the bottom, merge-muxes mux_1 through mux_5, loop-muxes mux_6 and mux_7, and latches for the loop variables; the execution delays of the operators, in time units, are: neg, minus, plus = 50; lt, ne, ge = 50; not, mux = 15; latch and constants = 0.

Multiplexors are expressed through rectangular-diamond nodes. The nodes labeled 'mux_1' through 'mux_5' represent 'merge-muxes', which were introduced in section 3.2. For instance, the merge-mux 'mux_2' selects one of two definitions—'d' and '−d'—of the variable 'd' reaching line 7 in Fig. 13. The nodes labeled 'mux_6' and 'mux_7' represent 'loop-muxes', which were also introduced in section 3.2. A 'loop-mux' is located at the top of a loop, and it selects either the initial value of a variable or its value fed back from a subsequent loop iteration. For instance, 'mux_7' selects either the initial value of the variable 'q', which is '0', or the values fed back from subsequent iterations. The node labeled 'latch' represents a latch that is utilized within a loop. It temporarily stores a value from an iteration and feeds it back to the subsequent iteration. The two initial input values to the operator-network are stored in registers represented through the shaded rectangular nodes labeled 'd' and 'e' at the top of Fig. 14. The final result of the critical section is stored in the register labeled 'return', which appears at the bottom of Fig. 14. The shaded operations, principally basic-blocks bb5 and bb6, correspond to the loop.

The CFG for the critical section is presented in Fig. 15. Except for blocks bb4 and bb10, each of the blocks bb0–bb10 in Fig. 15 corresponds to a basic-block in Fig. 13. The block bb4 in Fig. 15 contains the merge-muxes 'mux_3' and 'mux_4' and does not correspond directly to any code segment in Fig. 13. Similarly, block bb10 in Fig. 15 contains 'mux_7', and does not correspond directly to any code segment in Fig. 13. The shaded blocks, bb5 and bb6, correspond to the loop.

Next, the underlying algorithm of PRISM-II generates time-stamps for the operations in the operator-network and corresponding time-stamps for every block of the CFG. The time-stamps are shown adjacent to every operation in Fig. 14, and the starting and ending time-stamps are shown adjacent to every block in Fig. 15. For the operator-network, the time-stamp values are generated as explained in step 2 of the algorithm in section 4 of this paper. A time-stamp for a node in Fig. 14 is obtained by adding the time required to perform the operation to the maximum of the time-stamp values of the input operands. Thus, a 'plus' node with operator delay 50 and input time-stamps 75 and 100 will have a time-stamp equal to max{75, 100} + 50 = 150. As also explained earlier in section 4, the nodes in Fig. 14 are selected using a breadth-first search algorithm and then their time-stamp values are determined. For the CFG, the time-stamp values are generated as illustrated in step 3 of the algorithm in section 4 of the paper. The starting time-stamp for a block in Fig. 15 is derived as the maximum of the time-stamps of all inputs to all operations in the block. The ending time-stamp for a block is computed as the maximum time-stamp of all outputs of all operations in the block.

In the subsequent step, an initial FSM controller is obtained as an identical copy of the CFG in Fig. 15, and is shown in Fig. 16(a). As per steps (5) and (6) of the algorithm, the initial FSM is optimized by collapsing a few states onto other states and assigning them appropriate durations. In Fig. 16(a), states 'ss0' and 'ss1' have the same set of time-stamps.
This implies the lack of data dependencies between ‘ss0’ and ‘ss1’, and hence they may be executed concurrently. Thus, ‘ss1’ is merged into ‘ss0’, and the resulting FSM will show a direct link from the resulting state ‘ss0’ (timestamps 0 and 50) into ‘ss2’ (time-stamps 50 and 65). Since resulting state ‘ss0’ and state ‘ss2’ are now sequential, i.e. ‘ss0’ directly followed by ‘ss2’, they are merged into one
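To make the time-stamp generation of steps 2 and 3 concrete, the following minimal C sketch computes time-stamps over a tiny illustrative fragment of an operator-network. It assumes an array-based network stored in breadth-first (topological) order; the structure names, the example nodes, and the block representation are illustrative assumptions, not PRISM-II's actual data structures.

/* timestamp_sketch.c — illustrative only */
#include <stdio.h>

#define MAX_IN 4

typedef struct {
    const char *op;   /* operator name                        */
    int delay;        /* execution delay in time units        */
    int nin;          /* number of input operands             */
    int in[MAX_IN];   /* indices of producer nodes            */
    int ts;           /* time-stamp: completion time of node  */
} Node;

/* Step 2 (operator-network): nodes are assumed stored in breadth-first
   (topological) order, so every producer is stamped before its
   consumers. ts(node) = max over inputs of ts(input) + delay(node).   */
static void stamp_network(Node *n, int count)
{
    for (int i = 0; i < count; i++) {
        int max_in = 0;
        for (int j = 0; j < n[i].nin; j++)
            if (n[n[i].in[j]].ts > max_in)
                max_in = n[n[i].in[j]].ts;
        n[i].ts = max_in + n[i].delay;
    }
}

/* Step 3 (CFG): a block's starting time-stamp is the maximum time-stamp
   over all inputs to its operations; its ending time-stamp is the
   maximum time-stamp over all outputs of its operations.              */
static void stamp_block(const Node *n, const int *members, int count,
                        int *start, int *end)
{
    *start = 0;
    *end = 0;
    for (int k = 0; k < count; k++) {
        const Node *nd = &n[members[k]];
        for (int j = 0; j < nd->nin; j++)
            if (n[nd->in[j]].ts > *start)
                *start = n[nd->in[j]].ts;
        if (nd->ts > *end)
            *end = nd->ts;
    }
}

int main(void)
{
    /* Illustrative fragment: input registers d and e (delay 0), a
       'neg' and an 'lt' (delay 50 each), and a 'plus' fed by both.    */
    Node net[] = {
        { "d",    0,  0, {0},    0 },
        { "e",    0,  0, {0},    0 },
        { "neg",  50, 1, {0},    0 },
        { "lt",   50, 2, {0, 1}, 0 },
        { "plus", 50, 2, {2, 3}, 0 },
    };
    stamp_network(net, 5);

    int members[] = { 4 };      /* a block holding only the 'plus' node */
    int start, end;
    stamp_block(net, members, 1, &start, &end);
    printf("plus: ts = %d; block: start = %d, end = %d\n",
           net[4].ts, start, end);  /* plus: ts = 100; block: 50, 100   */
    return 0;
}

In this fragment ‘neg’ and ‘lt’ both complete at 50 time units, so the ‘plus’ node is stamped max{50, 50} + 50 = 100, and a block containing only the ‘plus’ node receives the interval [50, 100].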

[Graphical content not reproduced: blocks bb0–bb10 annotated with starting and ending time-stamps, e.g. bb0 (0–50), bb5 (65–180), and bb10 (230–245).]
Figure 15. CFG for the C-function in Fig. 13.

States ‘s′1’ and ‘s′2’ in Fig. 16(b) are derived directly from states ‘ss3’ and ‘ss4’ of Fig. 16(a), respectively. It may be observed that the statements constituting the loop, i.e. those in ‘ss5’ in Fig. 16(a), are executed at least once. Therefore, the FSM controller must account for at least one time duration of ‘ss5’, i.e. from 65 to 180 time units. In addition, the starting and ending time-stamps of state ‘ss7’ in Fig. 16(a) are completely overlapped by those of state ‘ss5’. Hence, state ‘ss7’ may be merged with state ‘ss5’ for efficiency and without any adverse effect. The resultant state is represented by ‘s′3’ in Fig. 16(b). State ‘ss6’ in Fig. 16(a) is re-labeled ‘s′4’ in Fig. 16(b), and it accounts for executions of the loop body in subsequent iterations. State ‘ss9’ has the same starting and ending time-stamp values and is therefore deleted. States ‘ss10’ and ‘ss8’ are merged into state ‘s′5’, as shown in Fig. 16(b). The intermediate FSM obtained in Fig. 16(b) is further optimized, and the final FSM is shown in Fig. 16(c).
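The state-collapsing decisions just described follow a small set of interval rules. The sketch below illustrates them in C; the State structure, rule names, and merge procedure are hypothetical stand-ins for step 5 (‘DetermineDuration’), not the compiler's actual implementation.

/* fsm_merge_sketch.c — illustrative only */
#include <stdio.h>

typedef struct {
    const char *name;
    int start, end;   /* starting and ending time-stamps        */
    int alive;        /* 0 once the state is merged or deleted  */
} State;

/* Rule 1: identical intervals imply no data dependency between the
   two states, so they may execute concurrently and are merged.      */
static int same_interval(const State *a, const State *b)
{
    return a->start == b->start && a->end == b->end;
}

/* Rule 2: strictly sequential states (a ends exactly where b starts)
   are collapsed into a single state spanning both intervals.        */
static int sequential(const State *a, const State *b)
{
    return a->end == b->start;
}

/* Rule 3: a state whose interval is completely overlapped by
   another's may be merged into it without lengthening the schedule. */
static int overlapped(const State *a, const State *b)
{
    return a->start <= b->start && b->end <= a->end;
}

static void merge(State *into, State *from)
{
    if (from->start < into->start) into->start = from->start;
    if (from->end   > into->end)   into->end   = from->end;
    from->alive = 0;
}

int main(void)
{
    State ss0 = { "ss0", 0, 50, 1 },  ss1 = { "ss1", 0, 50, 1 },
          ss2 = { "ss2", 50, 65, 1 }, ss5 = { "ss5", 65, 180, 1 },
          ss7 = { "ss7", 95, 145, 1 };

    if (same_interval(&ss0, &ss1)) merge(&ss0, &ss1); /* ss1 into ss0 */
    if (sequential(&ss0, &ss2))    merge(&ss0, &ss2); /* gives s'0    */
    if (overlapped(&ss5, &ss7))    merge(&ss5, &ss7); /* gives s'3    */

    /* Zero-duration states (start == end), such as ss9, are deleted. */
    printf("s'0: [%d, %d], duration %d\n", ss0.start, ss0.end,
           ss0.end - ss0.start);                    /* [0, 65], 65    */
    printf("s'3: [%d, %d], duration %d\n", ss5.start, ss5.end,
           ss5.end - ss5.start);                    /* [65, 180], 115 */
    return 0;
}

Applying the same rules to the worked example reproduces the intervals in Fig. 16: ‘s′0’ spans 0 to 65 and ‘s′3’ spans 65 to 180, with durations 65 and 115 time units, respectively.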

[Graphical content not reproduced. Recoverable annotations: s′0 = ss0 + ss1 + ss2; s′1 = ss3; s′2 = ss4; s′3 = ss5 + ss7; s′4 = ss6; s′5 = ss8 + ss9 + ss10; and, in the final FSM, s0 = s′0 (duration 65), s1 = s′1 + s′2 + s′3 + s′4 (duration 115), s2 = s′5 (duration 65).]
Figure 16. (a) The initial FSM for the code in Fig. 13; (b) intermediate FSM as a result of a few optimizations; (c) final FSM following further optimizations.

[Graphical content not reproduced: the operator-network (operators neg, not, lt, ne, ge, minus, plus, muxes 1–7, latches 1–2, and the ‘return’ register) is shown partitioned into the shaded state-blocks ‘s0’, ‘s1’, and ‘s2’ of the finite state machine, with the End_Of_Computation signal issued at 245 time units.]
Figure 17. Machine graph for the division function in Fig. 13.

The duration of state ‘s′3’ completely overlaps the starting and ending time-stamps of states ‘s′1’ and ‘s′2’; therefore, the latter are merged into state ‘s′3’. Furthermore, state ‘s′4’ is merged into state ‘s′3’. The resulting state is represented by state ‘s1’ in Fig. 16(c). Thus, the final FSM consists of only three states, ‘s0’, ‘s1’, and ‘s2’, and the duration of each state is determined simply by subtracting the starting time-stamp from the ending time-stamp, as shown within parentheses in Fig. 16(c). The FSM controller in Fig. 16(c) is in its simplest, irreducible form and therefore constitutes the final FSM.

Figure 17 presents the final ‘machine-graph’, consisting of the operator-network and the optimized FSM. Sets of operations in the operator-network that are executed within a particular state of the FSM are shown enclosed by shaded rectangular blocks. For instance, all operations within the shaded block ‘s0’ are executed during state ‘s0’ of the FSM. The operations within ‘s0’ complete execution at 65 time units, upon which state ‘s1’ is initiated.

State ‘s1’ corresponds to a loop-state, i.e. its operations may be iterated more than once. First, the FSM signals the multiplexors ‘mux 5’ and ‘mux 6’ in the operator-network to choose the initial values of the variables ‘d’ and ‘e’, respectively. The first iteration of state ‘s1’ completes at 180 time units. The FSM then receives from the operator-network the boolean signal generated by the operator ‘ge’ and determines whether to continue execution of the loop. The propagation of this signal is represented by a dashed line from node ‘ge’ of the operator-network to state ‘s1’ in the FSM. When the loop is to continue executing, the FSM first instructs the two ‘latch’ operators in the operator-network to latch the feedback values of the two variables ‘d’ and ‘q’. Upon returning to state ‘s1’, the FSM instructs the muxes ‘mux 5’ and ‘mux 6’ to choose the values from their corresponding latches rather than the initial values. Next, the operations within the loop are executed. Following execution of the loop, the FSM transfers control to state ‘s2’ and latches the final result from the output of ‘mux 7’ into the register labeled ‘return’. The FSM then generates the ‘end-of-computation’ signal and propagates it to the core-processor, informing it of the availability of the final result.

The final machine is then instantiated with XBLOX library modules, and a netlist file is generated to program the reconfigurable FPGA testbed. Presently, the author is developing an implementation of PRISM-II, i.e. integrating the design of the reconfigurable FPGA testbed with the configuration compiler augmented by the new execution model. The results will be reported in a future publication.
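The controller's protocol for a loop-state can be summarized, purely as a software analogy, by the following C sketch. The helper ge_signal() and the flag first_iteration are hypothetical stand-ins for the ‘ge’ operator's boolean output and the mux select lines; the sketch mirrors the handshake described above, not the synthesized XBLOX hardware, and the loop is simply pretended to run for three passes.

/* fsm_protocol_sketch.c — illustrative only */
#include <stdbool.h>
#include <stdio.h>

typedef enum { S0, S1, S2, DONE } FsmState;

static bool first_iteration;  /* stand-in for the mux_5/mux_6 selects */
static int  pass;

/* Stand-in for the boolean signal produced by the 'ge' operator.     */
static bool ge_signal(void)
{
    return ++pass < 3;        /* pretend the loop runs three passes   */
}

int main(void)
{
    FsmState state = S0;
    while (state != DONE) {
        switch (state) {
        case S0:  /* straight-line set-up; duration 65 time units     */
            first_iteration = true;
            state = S1;
            break;
        case S1:  /* loop-state; duration 115 time units per pass     */
            printf("s1: executing with %s values\n",
                   first_iteration ? "initial" : "latched");
            if (ge_signal()) {
                /* Another pass: latch the feedback values of d and q,
                   and re-enter s1 with the latches selected.          */
                first_iteration = false;
            } else {
                state = S2;
            }
            break;
        case S2:  /* latch mux_7's output into 'return'; duration 65  */
            printf("end-of-computation signalled to core processor\n");
            state = DONE;
            break;
        default:
            state = DONE;
            break;
        }
    }
    return 0;
}

The design point this analogy highlights is that the core-processor never polls the loop: the controller alone decides, from the ‘ge’ signal, when to re-enter ‘s1’, and the end-of-computation signal is the only notification the core-processor receives.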

5. Conclusions and future work
This paper has introduced a significant conceptual advancement, PRISM-II, a new general-purpose adaptive computing system, and has presented its architecture and compiler. In this architecture, specialized hardware is synthesized for an FPGA-based reconfigurable platform that executes the critical section(s) of a program written in C. Speeding up the execution of the critical section(s) greatly enhances overall program execution performance. The reconfigurability of the hardware platform permits it to be reused for different applications, thereby maintaining a general-purpose architecture that is nevertheless specialized for each application. The synthesis process is automatic and transparent, allowing the user to concentrate on the application rather than the architecture.

In the novel execution model underlying the architecture, an operator-network and a controller are synthesized for a given high-level program. The operator-network, custom-synthesized from FPGAs, executes the high-level program in a data-flow manner, i.e. any instruction is executed as soon as its input operands are available. The controller controls the computations in the operator-network, accurately determines when the execution is complete by utilizing key principles developed in PRISM-II, and generates an end-of-computation signal to asynchronously inform the core-processor to fetch the results from the FPGA platform. While the realization of a general-purpose data-flow architecture has continued to be difficult, the PRISM-II approach promises asynchronous, data-flow execution of programs on custom-synthesized FPGA hardware. Presently, an implementation of PRISM-II is under development.


Acknowledgments
The author gratefully acknowledges the support of the National Science Foundation through grant MIP-9021118.


Sumit Ghosh is currently an associate professor and the associate chair for research and graduate programs in the Computer Science and Engineering Department at Arizona State University. He received his BTech degree from IIT Kanpur, India, and his MS and PhD degrees from Stanford University, California. Prior to his current assignment, he was on the faculty at Brown University, Rhode Island, and before that he worked at Bell Labs Research in Holmdel, New Jersey. His research interests are in fundamental problems from the disciplines of asynchronous distributed algorithms, modeling and distributed simulation of complex systems, networking, network security, computer-aided design of digital systems, continuity of care in medicine, and metrics to evaluate advanced graduate courses. Presently, he serves on the editorial board of the IEEE Press Book Series on Microelectronic Systems Principles and Practice. Sumit is a US citizen.