Future Generation Computer Systems 8 (1992) 321-335 North-Holland
The Function Processor: A data-driven processor array for irregular computations

Jesper Vasell and Jonas Vasell

Department of Computer Engineering, Chalmers University of Technology, S-412 96 Göteborg, Sweden
Abstract

Vasell, J. and J. Vasell, The Function Processor: A data-driven processor array for irregular computations, Future Generation Computer Systems 8 (1992) 321-335.

The Function Processor is a data-driven processor array architecture, i.e. a regular structure of locally interconnected processing elements called Function Cells, which operate according to the data flow execution principle. By means of a compilation method developed for this architecture, data flow graphs for functional programs can be created and mapped onto the processor array, so that each Function Cell is assigned the execution of one graph node. The main result presented is a Function Cell architecture which has been designed to support the functionality required by these data flow graphs. We also give a brief description of a programming method for the architecture. Furthermore, some results from an implementation are presented.
Keywords. Reconfigurable processor array; dataflow computing; functional programming; irregular computation.
1. Introduction

Due to the rapid development of VLSI technology in recent years, processor array architectures have become an increasingly interesting alternative for very fast algorithm implementations. A processor array consists of a large number of simple processing elements. Each processing element is directly connected only to a few neighbour cells, and repeatedly performs a single operation on data arriving from these neighbours. Systolic arrays [15,19] are a well-known example of these architectures. Another example is data-driven arrays (also called wavefront arrays) [11,12,16,14]. These architectures differ in the way the processing elements are synchronized. In systolic arrays, the processing elements are synchronized by means of a global clock. In data-driven arrays, a processing element works in a
way similar to the data flow principle [3,7,24,18], i.e. it performs its operation as soon as all necessary operands have arrived from its neighbours. Traditionally, array architectures have been used for highly regular computations in signal processing and image analysis. A regular computation always performs exactly the same sequence of operations, independently of the actual input data, i.e. the control flow is not affected by the value or structure of the input data. An example of a regular computation is multiplication of fixed-size matrices, while multiplication of arbitrary size matrices is an irregular computation. The advantage of regular computations is that it is possible to exploit data parallelism to a large extent, but of course many interesting computations and algorithms are not regular. Our goal has been to try to extend the use of array architectures to include irregular computations as well. It is then more difficult to exploit data parallelism, but much is still to be gained from pipelining parallelism and a very low interpretation and communication overhead. We have
also wanted to find support for fast execution of symbolic computations expressed in functional programming languages [5,8,22]. We will here present an architecture for a data-driven array processing element suitable for these purposes. This processing element is called the Function Cell.

Fig. 1. Function Processor architecture.

The Function Processor (see Fig. 1) is an architecture consisting of an array of Function Cells, an Array Control Unit and a Structure Memory. It is not the intention that a Function Processor should be capable of executing arbitrarily complex computations by itself. The Function Processor should instead be seen as a component around which different types of systems can be built. For instance, a system can consist of several Function Processors and a conventional processor which performs administrative tasks. Another possibility is to use the Function Processor as an accelerator for a host processor running a conventional implementation of a functional programming language. In this case the Function Processor could execute one or several critical functions in the program, while the rest of the program is executed on the host processor. The Function Processor could also be used in a dedicated system, such as a signal processing system, where it performs a limited number of different computations, but still offers the possibility to make changes in the system functionality without having to redesign the hardware of the system.

The Function Cell is in some respects related to the iWarp architecture [20]. The iWarp is a
processor which has been developed as a highly flexible processing element for different kinds of parallel computers. It is, however, considerably more complex than the Function Cell, and is not suitable for a fine-grain architecture like the Function Processor.

It is important that a new architecture is developed together with efficient programming methods. We have therefore developed a method to create data flow graphs (DFGs) for functional programs, which can be mapped onto the Function Processor. By mapping a graph onto an array architecture, we mean assigning one processing element to each graph node in a way that allows intermediate results to be transferred between nodes via the processing element communication links. The use of array architectures for irregular functional computations has also been proposed by Koren et al. [14], who especially address the problems in mapping DFGs onto data-driven arrays, and by Karabernou et al. [13]. Sheeran presents a method for synthesis of algorithm-specific array processors from functional specifications in [23]. This method has a set of predefined higher-order operators as its starting point. A similar approach has been proposed by Lin (see [17]), based on the functional programming language FP, and implemented in a tool called STRAD. This tool can automatically synthesize regular arrays from a specification written in a dialect of FP.
2. Compiling programs for the Function Processor
There are several features of data-driven architectures that affect the way programs are compiled. One such feature is locality, both in control and in communication. As a consequence of this, a data-driven architecture is asynchronous, and no assumptions can be made about timing in the graph. For instance, the time it takes to communicate a data object along an arc must be regarded as indeterministic. The lack of global control also means that the DFGs must be static, since it is very difficult to perform frequent reconfigurations at run-time.
Furthermore, there is always a limitation on the size of the array of processing elements that imposes a restriction on the size of the DFG. Naturally, there exists an upper bound on program size for any type of architecture. It is, however, much lower for an array architecture, where there is a close correspondence between the size of a DFG and the required number of processors.
Also, the number and properties of communication links in an array restrict the type of data that can be efficiently communicated between processing elements. Thus, it is necessary to make sure that the DFG only handles data of types that are supported by the architecture. We have developed a method to compile functional programs to a DFG suitable for execution on the Function Processor. This method has been
implemented in a compiler which takes a program in the form of a set of recursive function definitions, along with an expression denoting the value of the program, and produces a DFG suitable for execution on the Function Processor. Currently, the programming method only allows a set of predefined types. This set is restricted by the data type support in the target architecture. It is, however, possible to extend the programming method to handle user-defined data types as well. Furthermore, we do not have a general method to handle functions as data objects, which restricts the use of certain types of higher-order functions.

In Fig. 2, an overview of the compilation is shown for the following simple program, which computes the sum of all elements in a list:

  let sum l =
        case l of
          nil: 0
          h.t: h + sum t
        end
  in sum l
(The (.) is the infix list constructor.)

Fig. 2. Sum: (a) DFG, (b) expanded DFG, (c) optimized DFG, (d) mapped DFG.

The compilation is performed in five steps, the first of which is the build step. In this step a DFG is built for each function definition in the program, as well as for the expression denoting the value of the program. The result of this step for the example program can be seen in Fig. 2(a). It consists of two graphs, one for the function definition and one for the program expression.

The second step is expand. The purpose of this step is to expand nodes representing function applications in order to make the DFG static (see Fig. 2(b)).

Typechecking and buffering is the third step. The purpose of this step is to produce a DFG which only uses data types supported by the architecture.

The fourth step is optimization, which attempts to improve the graph in several respects (see Fig. 2(c)).

Physical mapping is the last step of the compilation. It produces a mapping of the DFG onto the array of processing elements (see Fig. 2(d)). In the following sections we will describe these steps in more detail.
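Throughout these sections, it may help to have a concrete picture of a DFG as a data object. The following minimal sketch, written in Haskell purely for illustration, shows one possible representation; the names and types are our assumptions for the example and not the compiler's actual interface.

  -- Illustrative sketch only: a possible representation of DFGs and the
  -- five-step pipeline.  The node operations follow the catalogue given
  -- later in Section 3.
  data NodeOp
    = Cond | Switch | Match | Construct | Destruct
    | Operator String            -- arithmetic/logic operation, e.g. "+"
    | Enter | Exit | Wait | Lock | Shortcut | Route
    | Store | Fetch              -- executed by the ACU/SM, not by a cell
    deriving (Show, Eq)

  type NodeId = Int
  type Port   = Int

  -- An arc connects an output port of one node to an input port of another.
  data Arc   = Arc (NodeId, Port) (NodeId, Port) deriving (Show, Eq)
  data Graph = Graph [(NodeId, NodeOp)] [Arc]    deriving Show

  -- The five steps then compose into a single pipeline:
  --   physmap . optimize . typecheck . expand . build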
2.1. Building a DFG

Building a DFG for a functional program is relatively straightforward in most cases. The primitive operators are simply compiled into a DFG node performing the corresponding operation. The same is done for constructors operating on data of any type directly supported by the architecture. Construction of more complex data objects requires some more nodes, which are introduced in the typechecking and buffering step (see Section 2.3).

One major difficulty in compiling for array architectures is how to implement conditionals. The problem is caused by the lack of global control, which makes it very difficult to stop a subcomputation in the array. It is therefore important to avoid starting any subcomputation unless it is known to be needed or known to be terminating. In particular, this is a problem for the computations in the branches of a conditional expression. The conditionals in our source language are case-expressions with a simple form of pattern matching, where a pattern can be either a variable or a constructor:

  case e of
    nil: e1[x]
    h.t: e2[h,t]
  end
This expression is compiled into the DFG shown in Fig. 3. In this graph the switch-nodes perform the actual pattern matching, and depending on whether there is a match, they allow the free variables and constants of the corresponding branch to enter the graph. For the branches which are not selected, the switches produce a special value called a placeholder, which will force the result of the corresponding branch to be a placeholder. The results from the different branches are collected by a conditional-node (named cond in Fig. 3) which receives two values and outputs the one which is not a placeholder, unless both values are placeholders, in which case it outputs a placeholder. In a case-expression, complex constructor patterns are taken apart by a destruct-node (named destr in Fig. 3) so that the parts corresponding to the variables in the pattern can be used in the expression corresponding to the branch.

Fig. 3. Case-expression.
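The placeholder discipline can be summarized by the following minimal sketch (Haskell, illustrative only): it models the value-level behaviour of the switch- and conditional-nodes described above, ignoring word-level representation and timing.

  -- Sketch of the placeholder discipline; ports and timing are abstracted.
  data Value = Placeholder | V Int deriving (Show, Eq)

  -- A switch-node: if the second operand matches the pattern the node is
  -- configured for, the first operand goes to the first output and a
  -- placeholder to the second; otherwise the other way around.
  switchNode :: (Int -> Bool) -> Value -> Value -> (Value, Value)
  switchNode matches x (V s)
    | matches s = (x, Placeholder)
    | otherwise = (Placeholder, x)
  switchNode _ _ Placeholder = (Placeholder, Placeholder)

  -- A conditional-node: outputs whichever branch result is not a
  -- placeholder (or a placeholder if both are).
  condNode :: Value -> Value -> Value
  condNode Placeholder r = r
  condNode l           _ = l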
2.2. Expanding a DFG

The DFGs produced by the first step in the compilation may still contain nodes representing function applications. The natural way to handle this is to replace these nodes by the graphs of the applied functions. Before doing this it is, however, necessary to make sure that the application will behave like an ordinary node. This means that it should not start executing until it has received all arguments, and that it should produce a placeholder if all its arguments are placeholders (the DFGs are constructed so that it is not possible for only one argument to be a placeholder). Figure 4 shows how to accomplish this. The wait-node ensures that all arguments must be present before the application is performed, and the shortcut-nodes (named short in the figure) check the arguments to see if they are placeholders. If an argument is a placeholder, the shortcut-node puts a placeholder on its right output; otherwise it copies its input to the left output.

Fig. 4. Function application.

If the function application is non-recursive it is possible to substitute the graph of the applied function for the triangle in Fig. 4. Recursive applications are somewhat more difficult to handle, since substitution of the graph of the applied function would only create new application nodes. The solution is to make all recursive applications share the same graph, to avoid an infinite expansion. A graph for doing this is shown in Fig. 5. The enter-nodes in Fig. 5 keep track of which application the arguments come from. This information is placed on a stack (indicated by a black dot in the figure) by the exit-node, which collects the results from the function and, according to the information on the stack, sends them back to the right application.

Fig. 5. Expansion of recursive function applications.

Another solution to the problem with recursive function applications would be to add a tag to all data objects, and give data belonging to different function instances different tags. This method has been used in several data flow architectures, such as TTDA [3,2] and ETS [6]. However, though this is a flexible solution, it has the disadvantage of requiring more complex hardware to be implemented in the processing elements. It would also increase the amount of data communicated between processing elements. Thus, we have chosen the above approach, which we believe is better suited for a fine-grain array processor.
Having multiple applications sharing a graph also creates a need for saving the state of the graph between successive recursive applications. The state is saved on stacks placed in nodes which depend on recursive applications. It is important that a recursive application is not allowed to start until the state of the previous application has been saved. This is guaranteed by a wait-node, as shown in Fig. 6(a), which stops the arguments of a recursive application until the state has been saved on the stacks. Non-recursive applications have to be synchronized in a similar way, as shown in Fig. 6(b). The reason for this is that if the applied function is recursive, we must make sure that the previous application has finished before we allow a new set of arguments to enter the graph. Otherwise the order of the results may become reversed.

Fig. 6. Synchronization.
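As an illustration of this bookkeeping, the following Haskell sketch models the enter- and exit-nodes at the value level. The Either type stands for the two call sites; all names are assumptions made for the example, not the hardware protocol.

  -- Illustrative model of two call sites sharing one function graph.
  -- The enter-node forwards whichever argument arrives and emits its
  -- origin; the origins are stacked at the exit-node, which pops the
  -- most recent one to route each result back (LIFO, matching the
  -- nesting of recursive calls).
  type Origin = Bool        -- True: first call site, False: second
  type Stack  = [Origin]

  enter :: Either a a -> (a, Origin)
  enter (Left  x) = (x, True)
  enter (Right x) = (x, False)

  exit :: r -> Stack -> (Either r r, Stack)
  exit r (o : os) = (if o then Left r else Right r, os)
  exit _ []       = error "exit: no pending application"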
2.3. Handling complex data structures

The purpose of the typechecking and buffering step is to transform the DFG into a form which only uses data types supported by the architecture. This is done by a typechecker which attempts to assign a type to each arc in the graph. For some complex data structures, such as a list of lists, this is not possible without introducing fetch- and store-nodes which transform the representation of the data structure to a form which can be handled by the processing elements. In this step, FIFO buffers are also introduced in nodes where large data structures may have to wait for other operands to arrive. This is necessary in order to avoid deadlock.
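A minimal sketch of the rule involved, assuming a small type language, is given below; the names and the exact set of types are illustrative, not the typechecker's actual code.

  -- Illustrative typing rule: a list of scalars can stream between cells
  -- as a word sequence, but a list of lists cannot.  The inner lists are
  -- stored in the Structure Memory by a store-node on the producer side
  -- and recovered by a fetch-node on the consumer side, so the arc
  -- itself carries only pointers.
  data Ty = Scalar | List Ty | Ptr deriving (Show, Eq)

  lower :: Ty -> Ty
  lower (List (List _)) = List Ptr   -- insert store/fetch around this arc
  lower t               = t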
2.4. Optimization

The optimization step has two goals. One is of course to improve execution speed through transformations which increase the parallelism in the graph. The other is to reduce the size of the graph. This is important since the size of the array on which we execute the graphs limits the size of the programs that can be executed. A reduction in size can be achieved by merging nodes performing the same or complementary operations, thereby utilizing the available processing elements more efficiently.
2.5. Physical mapping

The purpose of the last step in the compiler is to produce a mapping of the graph onto a fixed array of processing elements. This means assigning one processing element to each node and one or several communication links to each arc in the graph. The result is a configuration for the entire array, specifying the function of each processing element. Some of the processing elements in such a configuration will be used only for routing of communication paths, and the required number of processing elements will therefore be slightly higher than the number of nodes in the graph.
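The following sketch shows, under assumed names and coordinates, the kind of data the mapper must deliver; it illustrates the result of the mapping step, not the mapping algorithm itself.

  -- Illustrative output of the physical mapping step: a cell for every
  -- node, and for every arc the chain of cells its data passes through.
  -- Cells appearing only in such chains are pure routing cells, which is
  -- why the cell count exceeds the node count.
  type NodeId  = Int
  type Cell    = (Int, Int)
  type Mapping = ([(NodeId, Cell)], [((NodeId, NodeId), [Cell])])

  -- Total number of distinct cells consumed by a mapping.
  cellsUsed :: Mapping -> Int
  cellsUsed (placed, routed) =
    length (nub (map snd placed ++ concatMap snd routed))
    where nub = foldr (\c cs -> if c `elem` cs then cs else c : cs) []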
3. Function Processor architecture
The general principles behind the Function Processor, e.g. that it is a data-driven array of processing elements which are statically assigned single operations of uniform complexity, have influenced the compilation technique described in the previous section. Given the details of this compilation technique, we will now discuss the specific requirements this will put on the Function Processor architecture.
The Function Processor (see Fig. 1) consists of an array of locally interconnected Function Cells, an Array Control Unit (ACU), and a Structure Memory (SM). The Function Cell is a reconfigurable processor capable of performing the function of one DFG node in each configuration. It would be possible to let each Function Cell execute several nodes according to a sequential program (cf. [14]), but this would add more complexity to the Function Cell. Instead, we have chosen to keep the Function Cells as simple as possible, thereby allowing for larger arrays. The ACU controls data I/O to and from the array via a number of programmable channels. It is also responsible for downloading configurations to the array. The SM is controlled by the ACU and is used to store input, output, and intermediate data structures. Here, we will mainly concentrate on the Function Cells.

The Function Cell architecture will be determined by the different node types that can be mapped onto a Function Cell, and by the data types that are sent between nodes. The compilation technique will also demand a certain buffering capacity. Finally, some support for routing of communication paths is necessary to make the mapping of DFGs onto the array possible and efficient.

In general, nodes have one or two inputs and one or two outputs. An input is usually connected to other nodes, but it can also have a constant value. If the node does not require data from a certain input, that input can be left unconnected. An output may also be left unconnected, which simply means that the values produced by it are not used in the computation. The nodes operate according to the data flow execution principle, i.e. they perform an operation as soon as all necessary operands have arrived and there is room for the result on the outgoing arcs. Most node operations are also defined to produce the special placeholder value on all outputs if a placeholder is received on any input.

Data can be sent between Function Cells as single words or as sequences of words. We define a word as the amount of data a Function Cell can process within one time unit called a clock cycle. This makes it possible to represent scalar (non-structured) data objects directly as single words, and structured data objects as word sequences.
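Assuming, for illustration, a 16-bit word, the two kinds of objects and their link cost can be summarized as follows (the names are ours, not the architecture's):

  import Data.Word (Word16)

  -- A scalar object is one word; a structured object streams as a word
  -- sequence, one word per clock cycle over a link.
  data Object = ScalarObj Word16 | StructObj [Word16] deriving Show

  -- Clock cycles needed to transfer an object over one link.
  cycles :: Object -> Int
  cycles (ScalarObj _)  = 1
  cycles (StructObj ws) = length ws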
The following are the node types which can occur in the DFGs.

Conditional. This node requires two operands, one of which is always expected to be a placeholder. It outputs the operand which is not a placeholder as its result.

Switch. The switch node copies its first operand to one of two outputs, depending on the structure or value of the second operand. If the second operand has the structure the node has been configured to recognize, the first operand is sent to the first output, otherwise to the second. At the same time a placeholder is produced on the other output.

Match. Takes two operands. If the structure of the second matches a specified structure, a boolean true is produced on the first output and the first operand is sent to the second output. Otherwise, the true is sent to the second output, and the first operand to the first.

Construct. Takes a scalar data object (or a pointer) as its first operand, and a (partial) data structure as its second operand. On its output, it first puts the first operand as the first element of a new structure, and then the second operand as the rest of the structure. In this way, a new data structure is built.

Destruct. This is the opposite of construct. It takes a sequence of words representing a data structure as its single operand. On its first output, it puts the first element (word), and on the second, it puts the rest of the structure.

Operator. An operator node performs a specified operation on its two operands, and produces the result on one output. The operations are standard arithmetic and logic operations on scalar data objects.

Enter. This node has two inputs. When a data object becomes available at any of the inputs, it is directly copied to the first output. At the second output, a boolean value is produced to indicate from where the object came; true for the first input, false for the second.

Exit. This is the opposite of enter. It copies its first operand to one of its outputs, depending on the boolean value it receives as its second operand; true means the first output, false the second.

Wait. Copies its first operand to the output as soon as its second (trigger) operand becomes available.
Lock. Equivalent to wait, except that in its initial state it works as if a trigger operand has already arrived.

Shortcut. Normally copies objects from the first input to the first output, and from the second input to the second output. The exception is placeholders, which are copied from the first input to the second output, and vice versa. The inputs are handled separately, i.e. the node only requires data to be present at one of its inputs to be able to execute.

Route. This node copies all objects from the first input to the first output, and from the second input to the second output.

Store. Stores its operand in the SM. Each element is stored as a pair of words: the first word is the element and the second is a pointer to the next element. The last element is stored with a null pointer. As soon as the first element has been stored, a pointer to it is produced on the output. This node is used to change the representation of structures before putting them on a stack or using them as elements of other structures. Note that this node type is mapped onto the ACU and SM, and not onto a Function Cell.

Fetch. Takes a pointer to the first element of a structure in the SM, and outputs all the elements of the structure. If an element has not yet been stored (indicated by a special pointer from the previous element), fetch waits until it is available. Like store, this node type is not mapped onto a Function Cell.

A node may operate on different data types. Therefore, the Function Cell must be able to recognize and distinguish the types of the data objects which are communicated in a DFG. More specifically, it must be able to recognize the structure of a data object received on an input, as well as whether the whole object has been received.

When data structures are represented by word sequences, it is necessary that the Function Cells are given FIFO buffering (queue) capacity on at least one of their inputs. This buffering capability is also required for LIFO buffering (stacking) of suspended recursive function applications.

An important property of the Function Processor is that the length of one clock cycle can vary between the Function Cells. Thus, the communication between Function Cells will be asynchronous. This makes the Function Processor
The Function Processor flexible, with no specific requirements regarding its size, implementation technology, or external interface. When creating a physical mapping of a D F G , it is not always possible to m a p connected nodes onto adjacent Function Cells so that the arc between the nodes is m a p p e d onto a direct link between ports on the Function Cells. Thus it is necessary to allow any node port (also called logical port) to be m a p p e d onto any Function Cell port (also called physical port). It must also be possible to directly connect two physical ports on one Function Cell to each other, so that links can be established between nodes which become separated in the mapping. These direct links should operate independently of the rest of the Function Cell. To further simplify the mapping, the Function Cell is given six ports so that it can have six neighbours in the array, i.e. the array is hexagonal. Since a node has at most four inputs and outputs, at least two physical ports will always be available for routing data between distant Function Cells. If this is not sufficient, extra routenodes can be inserted.
4. Function Cell architecture

An overview of a Function Cell architecture which fulfills the stated specifications is shown in Fig. 7. This architecture can be divided into five separate parts, each of which will be described below. It is assumed that the Function Cell has a synchronizing clock which is used by all parts of its architecture. During one cycle of this clock, the Function Cell can perform an operation on one word from each of its inputs, and transmit the result of this operation to its neighbours. The size of a word will not be specified here, and there is nothing in the architecture which prevents it from varying between different implementations, although a practical lower limit is probably 16 bits.

Fig. 7. Function Cell architecture.
4.1. Ports

The Function Cell can be divided into two major parts: a functional unit and an interconnection network. The functional unit has three inputs, called logical input ports A, B and C, and
three outputs, called logical output ports A, B and C. The logical input and output ports A and B are used to implement the inputs and outputs of the different DFG node types. The C ports are used to download configurations. The receivers (RA, RB, RC) and transmitters (TA, TB, TC) at the logical ports are responsible for the asynchronous communication between logical ports on different cells via interconnection buses. They can receive and transmit one data word each clock cycle.

There are also six physical ports, which can act both as inputs and outputs for interconnection buses to and from the Function Cell. In the Function Processor, every physical port on one cell is connected to exactly one physical port on another cell, or to one of the external communication ports (see Fig. 1). Despite this very rigid physical structure, the flexible interconnection network in the cells makes the task of mapping DFGs onto the array much easier. The interconnection network is reconfigurable and capable of connecting every logical or physical port to any other logical or physical port. This allows for flexible and fast connections between nodes on non-adjacent cells. The interconnection network configuration is programmed at the same time as the functionality of the cell.
4.2. Configuration

The Function Cell is programmable, or rather, reconfigurable. Each configuration specifies the DFG node operation performed by the cell, as well as through which physical ports the cell communicates with its neighbours. The configuration is stored as a set of control words in a number of configuration registers. Several configurations can be stored in the cell, but only one configuration can be active during the execution of one function, i.e. the configurations should not be seen as instructions in a sequential program. By storing several configurations in the cells, the Function Processor can quickly switch between different tasks by means of a global signal telling the cells which configuration to use. The exact number of configurations and control words that the cell can store is implementation dependent. One of the words in the configuration is always a constant value used by the input buffers, either to recognize data objects or as a constant input.

The Function Cell can be either in execution mode or in configuration mode. The mode is selected by means of an external signal. In configuration mode, the cell shifts configuration data from the logical C input into the currently selected configuration registers, whose earlier contents are simultaneously shifted out via the logical C output. This operation is independent of the current configuration, and the logical C ports are routed to predetermined physical ports. This means that, by putting all cells in an array in configuration mode, their C input and output ports form a chain through which configuration data can be shifted to all cells. After a new configuration has been stored or selected, the input buffers and the control unit are emptied and reset. The cell can then start its new operation in execution mode.

Fig. 8. Input buffer.
4.3. Input buffers

The input buffers are the parts that contribute the most to the special characteristics of the Function Cell. Their main purpose is to asynchronously receive operands for the node implemented by the cell, and to inform the control unit when operands are available. In accordance with the architectural requirements, one of the buffers can be configured as a multi-word FIFO or LIFO (stack) buffer. Simulations of benchmark functions indicate that the buffer size should be at least 256 words. An overview of the input buffer can be seen in Fig. 8.

We have, however, also chosen to let the input buffers provide support for a number of different data types. The idea is to let the input buffers be responsible for recognizing objects of any specified data type, thereby making the control algorithms for different node types independent of the operand types. The enter node, for instance, is implemented by an algorithm that only has to specify that as soon as one data object is available at any input, it should be copied to one of the outputs. It does not have to be concerned about whether the object is a data structure that consists of several words, or a simple scalar value represented by a single word. These functions are performed by a special part of the input buffer called the object detector. The object types to be expected at the input buffers are specified as a part of the cell configuration. This information is provided by the type checking phase of the compiler.

The object detectors can be configured to recognize any of four different data types which have turned out to be relatively simple to support. Objects of all types are represented by sequences of words representing the elements,
sometimes terminated by a word with a special value. The objects of a type can be built up in different ways, with different constructors. One of the constructors for each type is designated the primary constructor of that type. The object detector is capable of detecting when an object is built with the primary constructor of the data type it is configured to recognize. The least complex data type supported is the scalar type. A scalar object always consists of a single word. If this word is equal to the configuration constant, the object is recognized as the primary constructor. The next more complex data type is the pair type. A pair is built up either by two elements, in which case it is formed with the primary pair constructor, or a single word equal to the configuration constant. The third data type is the string type. A string object consists of zero or more elements followed by the configuration constant. Typically a string is a null-terminated character string, i.e. a sequence of character codes followed by character code zero. A string containing at least one element is built up with the primary string constructor. In a way, the string type is many different types, each with a different termination word. The last data type is the list type. A list is very similar to a string, except that the termination word for lists has a predefined value called nil token. A non-empty list is built with the primary list constructor (often called cons). The list has the advantage over other types that the object detector does not depend on the configuration constant which therefore can be used for other purposes. The object detectors output four status signals used by the control unit to determine its actions. The first signal, available, indicates if any word (any part of an object) is available in the buffer. The second signal, isplace, indicates if the word available at the buffer output (if any) is the placeholder value. The third signal, complete, indicates when all the words in a complete data object of the specified type have been received. The fourth signal, match, indicates if an object was constructed with the primary constructor of the specified type. The input buffers are so autonomous that the control algorithms only have to control a single operation on them. This operation is to remove
the word currently available at the buffer output, called the topmost element, and replace it with the next word in the buffer if there is one. This is controlled by the remove signal. When, for instance, a plus operator has added two scalar operands, these are removed from the input buffer, and the cell starts to wait for two new operands to become available. Usually, an input buffer receives data synchronously from a receiver with which it communicates via a handshaking protocol (valid, acknowledge). It can, however, also be configured to continuously receive a constant value to support nodes with constant inputs. The constant value can be zero, one or the configuration constant. Zero and one are frequently used constants, and are therefore made available even in configurations where the configuration constant is used for other purposes.
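The complete/match logic of the object detectors can be sketched as follows (Haskell, illustrative only; the word width, the nil token encoding, and the buffer interface are assumptions, not the hardware):

  import Data.Word (Word16)

  data ObjType = ScalarT | PairT | StringT | ListT

  nilToken :: Word16
  nilToken = 0xFFFF                 -- assumed encoding of the nil token

  -- Given the configured type, the configuration constant and the words
  -- received so far, report (complete, built-with-primary-constructor).
  detect :: ObjType -> Word16 -> [Word16] -> (Bool, Bool)
  detect ScalarT k [w]    = (True, w == k)     -- one word; match if = k
  detect PairT   k [w]    = (w == k, False)    -- single-word (non-primary) pair
  detect PairT   _ [_, _] = (True, True)       -- two elements: primary pair
  detect StringT k ws     = terminated k ws    -- terminated by the constant
  detect ListT   _ ws     = terminated nilToken ws
  detect _       _ _      = (False, False)     -- object still arriving

  -- Complete when the terminator arrives; primary constructor iff at
  -- least one element precedes it.
  terminated :: Word16 -> [Word16] -> (Bool, Bool)
  terminated t ws = case break (== t) ws of
    (els, _ : _) -> (True, not (null els))
    _            -> (False, False)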
0. If inputs A and B are available and transmitters A and B are ready:
     If inputs A and B are placeholders:
       Remove inputs A and B; output placeholders to transmitters A and B.
     Else if input B matches the specified constructor:
       Output placeholder to transmitter B; go to 1.
     Else:
       Output placeholder to transmitter A; go to 2.
1. If input B is available but not complete:
     Remove input B.
   If input A is available and transmitter A is ready:
     Remove input A and output it to transmitter A.
     If input A is complete:
       If input B is available and complete:
         Remove input B; go to 0.
       Else:
         Go to 3.
2. If input B is available but not complete:
     Remove input B.
   If input A is available and transmitter B is ready:
     Remove input A and output it to transmitter B.
     If input A is complete:
       If input B is available and complete:
         Remove input B; go to 0.
       Else:
         Go to 3.
3. If input B is available:
     Remove input B.
     If input B is complete:
       Go to 0.

Fig. 9. Control algorithm for the switch node.
4.4. ALU

The Arithmetic and Logic Unit (ALU) is responsible for all data processing in the Function Cell. It takes the topmost elements from input buffers A and B as input, and it produces two words on separate, independent outputs. Each word can be equal to either of the two operands, the placeholder value, or the result of an arithmetic or logic operation on the two operands. The results are selected by the control unit.

The operation performed by the ALU is determined directly by the contents of the configuration registers. In this way, the control algorithms become independent of the ALU operation, and the number of different node types supported by the Function Cell is kept down. The set of operations that the ALU can perform can vary between different implementations, but traditional arithmetic operators (add, subtract, multiply), comparison operators (which produce boolean results), and the boolean constants true and false should be supported. There is no need to support unary operators, since either operand can be set to a constant value in the input buffers.

The ALU outputs are connected to the transmitters at the logical outputs A and B. The transmitters are controlled directly by the control unit, which informs them when valid data are available at the ALU outputs. The transmitters also produce status signals informing the control unit
when the last valid data have been sent out. Data are not sent by a transmitter before all the receivers it is connected to have informed it via the interconnection buses that they have passed on the last transmitted word to their input buffers. This information is exchanged between the transmitters and the control unit by means of valid and ready signals.
4.5. Control unit

The control unit is a finite state machine that implements control algorithms for all node types supported by the Function Cell. After a cell has been configured, it is reset. This means that the control unit enters a special start state in which it inspects the configuration registers to determine which node type has been selected. It then enters the initial state in the control algorithm for this node type. A control algorithm usually consists of a few states, which typically correspond to how many of the necessary operands have arrived. In Fig. 9, the control algorithm for a switch node is given as an example.

The control unit chooses its actions according to the status signals it receives from the input
buffers and the transmitters. These status signals are sampled by the control unit at regular intervals. As mentioned above, the status signals from each of the input buffers are available, isplace, complete and match, and from each of the transmitters ready. In each state, the control unit also outputs a set of control signals, which have been described above. These signals are the input buffer remove signals, the ALU output select signals, and the valid signals to the output transmitters.
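As an illustration of how such an algorithm becomes a state machine, the following Haskell sketch encodes the switch-node algorithm of Fig. 9. The record names mirror the signals above; sampling, the ALU select encoding, and all other hardware detail are abstracted away, so this is an assumed model rather than the implementation.

  -- Illustrative encoding of the switch-node control algorithm (Fig. 9).
  data State = S0 | S1 | S2 | S3

  data Status = Status
    { availA, availB       :: Bool
    , isplaceA, isplaceB   :: Bool
    , completeA, completeB :: Bool
    , matchB               :: Bool   -- match signal from input buffer B
    , readyA, readyB       :: Bool } -- transmitter status

  data Action = Action
    { removeA, removeB :: Bool       -- remove signals to the buffers
    , outA, outB       :: Bool }     -- valid signals to the transmitters

  none :: Action
  none = Action False False False False

  step :: State -> Status -> (State, Action)
  step S0 s
    | not (availA s && availB s && readyA s && readyB s) = (S0, none)
    | isplaceA s && isplaceB s = (S0, Action True True True True)
    | matchB s                 = (S1, none { outB = True })
    | otherwise                = (S2, none { outA = True })
  step S1 s = copyVia S1 (readyA s) (\a -> a { outA = True }) s
  step S2 s = copyVia S2 (readyB s) (\a -> a { outB = True }) s
  step S3 s
    | availB s && completeB s = (S0, none { removeB = True })
    | availB s                = (S3, none { removeB = True })
    | otherwise               = (S3, none)

  -- States 1 and 2 are mirror images: input A is streamed to one
  -- transmitter while operand B is drained from its buffer.
  copyVia :: State -> Bool -> (Action -> Action) -> Status -> (State, Action)
  copyVia me ready out s
    | doA && completeA s && availB s && completeB s
                = (S0, out none { removeA = True, removeB = True })
    | doA && completeA s
                = (S3, out none { removeA = True, removeB = dropB })
    | doA       = (me, out none { removeA = True, removeB = dropB })
    | otherwise = (me, none { removeB = dropB })
    where doA   = availA s && ready
          dropB = availB s && not (completeB s)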
5. Implementation results

A single-chip VLSI implementation of the Function Cell architecture has been made using a standard cell silicon compiler. The technology is a double metal layer 1.5 μm CMOS process. The implementation, which required 36000 CMOS transistors not counting buffer memory, supports 16-bit words, a 512-word buffer, and a single-cycle multiplier. The clock cycle time is less than 100 ns, though this could probably be significantly reduced in a full custom design. The largest parts in terms of chip area are the buffer and the interconnection network, including the transmitters and receivers.

The interconnection buses are only 6 bits wide (4 bits data, 2 bits handshaking) in order to reduce the size of the interconnection network and the chip pin-count. This means that a 16-bit word is transmitted as four 4-bit groups. The configuration data for the cell consist of 61 bits, so the whole configuration is divided into four words. Thus, it takes at least 400 ns to configure each cell in an array with a 100 ns clock cycle. The number of cells in a practical implementation of the Function Processor should be at least 100-200, which would result in a total configuration time of at least 40-80 μs, assuming that the cells are configured sequentially.

The implementation allows only one configuration to be stored in a Function Cell, but it can easily be extended to allow multiple configurations to be stored in each cell. That would make it possible to quickly switch between different configurations by means of a global signal which tells the cells which configuration to use. The number of configurations stored in a cell could probably be in the range of 8-16 without making the hardware significantly more complex.
Table 1
Function Processor benchmarks

  Program         Nodes           Cells   Execution time (μs)        Storage
                  mem.   other            Func. Proc.   Sparc/LML    (words)
  Fib 15             0      26       48          1529        9400          0
  Sort [19...0]      6      42      126           443        1980       3540
  Primes 50          0      77      192           881        4280          0
  Evdist 4           7     114      771          2243        9540       1076
  Queens 7           7     156      630           306        1240        190
  Matmult 4         13     175     1254           110        1540        310
  Simple 25         21     302        -           181       14340       1414
A simulator for the DFGs produced by the compiler has been developed. It allows measurements of the behaviour of the Function Processor, and of how it depends on parameters such as delays caused by the physical mapping or clock skew, the Structure Memory access time, and the maximum number of concurrent memory accesses. The results of some measurements using this simulator are shown in Table 1. The example functions are:
Fib 15. Computes the 15th Fibonacci number.

Sort [19...0]. Sorts the list of numbers from 19 to 0 in ascending order using the insertion sort algorithm. The input and the result are stored in the Structure Memory.

Primes 50. Produces a list of all prime numbers less than 50 using the Sieve of Eratosthenes.

Evdist 4. An algorithm used in biochemistry to compute the evolutionary distance between DNA molecules. The input DNA strings are fetched from the Structure Memory.

Queens 7. Produces one solution to the problem of placing 7 queens in safe positions on a 7 by 7 chess board. The result is stored in the Structure Memory.

Matmult 4. A general function for multiplication of matrices of arbitrary size, here applied to 4 by 4 matrices. The matrices are represented as lists of integer lists. The input and result matrices are stored in the Structure Memory.

Simple 25. A hydrodynamics simulation of the flow velocity of a fluid in a cross-section of a sphere [1]. The velocity is computed in 25 points. The input and result data are stored in the Structure Memory.
For each program, the number of memory nodes, i.e. store- and fetch-nodes, and the number of other nodes in the DFG are presented separately. The programs have been mapped onto the Function Processor using a preliminary version of a mapper based on a genetic algorithm [21], which is being developed as a part of the Function Processor project. The number of cells required for the mapping is shown in the table. It has not been possible to map the Simple benchmark using this version of the mapper, due to the size of the graph. Therefore, the number of cells required for this mapping is left out of the table.

The execution time on the Function Processor has been measured assuming a Function Cell clock cycle time of 50 ns, which we estimate could be achieved with state-of-the-art implementation technology. For the Structure Memory, we have assumed that at most one memory access can be made at a time, and that the access delay is equal to one Function Cell clock cycle. Times for downloading the configurations are not included.

Each program has also been compiled and run on a Sun Sparcstation 1 using the LML (Lazy ML [4,9]) functional language compiler. This makes it possible to compare the execution speed of the Function Processor with the execution speed on a state-of-the-art workstation using (almost) equivalent programs. Finally, the required Structure Memory storage capacity is shown.
6. Conclusions

We have developed an automatic compilation method for mapping irregular computations onto a data-driven processor array. This method has the advantage of allowing the computations to be described at a high level of abstraction in a functional programming language. Compared to other methods for programming or synthesis of array architectures, this method imposes fewer restrictions on the way algorithms can be expressed, and we expect that it can be further improved to give more support for functional and data abstractions. On the other hand, it does not give as good performance as other methods for some algorithms.
The compilation method does not exploit all available parallelism. Therefore, we are working on methods for increasing the parallelism when certain properties, such as the recursion depth or the size or value of function arguments, are known at compile-time. One such method is partial evaluation [10], in which a program is executed with only partial knowledge about its inputs. The result is a specialized and, hopefully, more efficient program. For instance, if the matrix multiplication benchmark presented in Section 5 is specialized to a specific matrix size, it becomes approximately 20 times faster.

The Function Cell architecture presented here fulfills the requirements stated in Section 3. So far, the implementation results have shown that the architecture is realizable and in accordance with the assumptions made in earlier stages of the project. The complexity of the Function Cell is well below that of ordinary microprocessors, which makes it possible to build sufficiently large arrays. Other interesting properties of the Function Processor are its scalability and its potential for fault tolerance.

The Function Processor is intended as a building block to be used in other processor architectures. In order to increase its usefulness, and possibly to improve its performance, we have begun to investigate how to use one or more Function Processors to execute a large program by splitting it into smaller, relatively independent pieces (usually separate functions). The program is executed by reconfiguring the Function Processor(s) to execute the different pieces as they are needed. The possibility to quickly switch between previously downloaded configurations in the Function Processor will thus be very valuable. This approach has several advantages: it will keep more Function Cells working, it is easier to map smaller graph pieces efficiently onto the Function Processors, and it will allow much larger programs to be executed.
Acknowledgements

We would like to thank Tony Nordström for his contributions to this project. The project has been financially supported by the Swedish Board for Technical Development, STU.
References

[1] Arvind and K. Ekanadham, Future scientific programming on parallel machines, J. Parallel Distributed Comput. 5(5) (Oct. 1988) 460-493.
[2] Arvind and R.S. Nikhil, Executing a program on the MIT tagged-token dataflow architecture, in: PARLE Conf. (Lecture Notes in Computer Science 259, Springer, Berlin, June 1987).
[3] Arvind, A data flow architecture with tagged tokens, Technical report, MIT, Cambridge, MA, June 1980.
[4] L. Augustsson, Compiling lazy functional languages, Part II, PhD thesis, Dept. of Computer Science, Chalmers University of Technology, Göteborg, Sweden, November 1987.
[5] J. Backus, Can programming be liberated from the von Neumann style? A functional style and its algebra of programs, Commun. ACM 21 (Aug. 1978) 280-294.
[6] D.E. Culler and G.M. Papadopoulos, The explicit token store, J. Parallel Distributed Comput. 10(4) (Dec. 1990) 289-308.
[7] J.B. Dennis, Data flow supercomputers, IEEE Comput. 13(11) (Nov. 1980) 48-56.
[8] J. Hughes, Why functional programming matters, Comput. J. 32(2) (1989) 98-107.
[9] T. Johnsson, Compiling lazy functional languages, PhD thesis, Dept. of Computer Science, Chalmers University of Technology, Göteborg, Sweden, February 1987.
[10] N.D. Jones, Automatic program specialization: A re-examination from basic principles, in: D. Bjørner, A.P. Ershov and N.D. Jones, eds, Partial Evaluation and Mixed Computation (Elsevier, Amsterdam, 1988) 225-282.
[11] S-Y. Kung, K.S. Arun, R.J. Gal-Ezer and D.V. Bhaskar Rao, Wavefront array processor: Language, architecture and applications, IEEE Trans. Comput. C-31(11) (Nov. 1982) 1054-1066.
[12] S-Y. Kung, S.C. Lo, S.N. Jean and J.N. Hwang, Wavefront array processors--concept to implementation, IEEE Comput. 20(7) (July 1987) 18-33.
[13] S.M. Karabernou, G. Mazare, E. Payan and P. Rubini, A network with small general processing units for fine grain parallelism, in: Internat. Workshop on Algorithms and Parallel VLSI Architectures, Pont-à-Mousson, France (June 1990) 197-200.
[14] I. Koren, B. Mendelson, I. Peled and G.M. Silberman, A data-driven VLSI array for arbitrary algorithms, IEEE Comput. 21(10) (Oct. 1988) 30-43.
[15] H.T. Kung, Why systolic architectures?, IEEE Comput. 15(1) (Jan. 1982) 37-46.
[16] S-Y. Kung, VLSI Array Processors (Prentice Hall, Englewood Cliffs, NJ, 1988).
[17] Y-C. Lin, An FP-based tool for the synthesis of regular array algorithms, Parallel Comput. 17(4-5) (July 1991) 457-470.
[18] J.R. McGraw, Data flow computing: System concepts and design strategies, in: S.P. Kartashev and S.I. Kartashev, eds, Designing and Programming Modern Computer Systems, Vol. III, Ch. 2 (Prentice Hall, Englewood Cliffs, NJ, 1989) 73-189.
[19] W. Moore, A. McCabe and R. Urquhart, eds, Systolic Arrays (Adam Hilger, Bristol, 1987).
[20] C. Peterson, J. Sutton and P. Wiley, iWarp: A 100-MOPS, LIW microprocessor for multicomputers, IEEE Micro (June 1991) 26-29, 81-87.
[21] G. Rawlins, ed., Foundations of Genetic Algorithms (Morgan Kaufmann, Los Altos, CA, 1991).
[22] C. Reade, Elements of Functional Programming (Addison-Wesley, Reading, MA, 1989).
[23] M. Sheeran, Designing regular array architectures using higher order functions, in: Proc. Conf. on Functional Programming Languages and Computer Architecture (Lecture Notes in Computer Science 201, Springer, Berlin, 1985) 220-237.
[24] A.H. Veen, Dataflow machine architecture, ACM Comput. Surveys 18(4) (Dec. 1986).
Jesper Vasell has been a Ph.D. student since 1988 at the Department of Computer Engineering, Chalmers University of Technology, Göteborg, Sweden. He has been working on computer architectures for functional computations, in particular fine-grained array architectures. His research interests include dataflow computing, functional programming languages and parallel computer architectures. He received an MS degree in Electrical Engineering in 1988, and a Licentiate of Engineering degree in computer engineering in 1990, both from Chalmers University of Technology.

Jonas Vasell has since 1987 been working toward a Ph.D. in the Computer Engineering Department at Chalmers University of Technology, Göteborg, Sweden. He has been working on architectural support for computations expressed in functional programming languages. His research interests include VLSI array architectures, dataflow computing, and programming methods for new architectures. He received an MS degree in engineering physics in 1987, and a Licentiate of Engineering degree in computer engineering in 1989, both from Chalmers University of Technology.