Microprocessing and Microprogramming 31 (1991) 93-98 North-Holland
A DATAFLOW MODEL BASED ON A VECTOR QUEUEING SCHEME
Hallo AHMED
Dept. of Telecommunication and Computer Systems, The Royal Institute of Technology, 100 44 Stockholm, Sweden.
The average overhead required for producing a scalar result may be reduced by vectorizing dataflow computations. This reduction may lead to eliminating or reducing pipeline starvation during periods characterized by low degrees of parallelism. A dataflow model based on a vector queueing scheme is presented. The model supports parallelism between different activations of vector nodes, and requires less overhead than conventional dynamic models.
Keywords: pipelining, vector processor, dataflow.
1. Introduction

A program can be translated into a dataflow graph in which the nodes represent actors and the directed arcs data dependencies. In the past 20 years there have been many attempts to develop architectures capable of efficiently executing these graphs. (We refer the interested reader to an extensive survey by A. H. Veen [Veen 86].) The attraction has always been the asynchrony and functionality of such graphs, as a result of which machines executing them may, in principle, be highly parallel systems in which many actors can be processed at any time and in any order. However, it appears that dataflow architectures have not yet been able to achieve the performance levels expected from their underlying execution model. Results obtained from running applications on simulation models or constructed machines show modest performance in comparison with today's supercomputers.
Events associated with dataflow computations may be classified into two categories:
1. Scheduling events, which are necessary for enabling a node, and consist of:
   a. Memory access operations, such as:
      • Matching and updating operations.
      • Structure memory transactions.
   b. Message passing operations, such as communicating data and feedback messages.
2. Processing events, which represent the actual evaluation of the operations denoted by the nodes of a graph.
The average scheduling overhead required for producing a scalar result can be reduced by vectorizing dataflow computations.
This reduction comes primarily from the following sources:
• Structure memory transactions: In general, conventional dataflow machines permit only element-wise access to the structures which store arrays and other types of structured data. The effects of this restriction are an increase in the number of overhead instructions necessary to complete a transaction, and an increase in the number of transferred packets [Hiraki 88, Yamana 88]. When a computation is vectorized, however, access to structured data takes place on a vector basis. This method requires at most one instruction per vector access. More significantly, instead of transferring a number of scalar packets, each carrying its own overhead information (header), a single vector packet carrying the same number of scalar elements and only one header needs to be transferred.
• Message passing operations: In conventional schemes a computation result is formed into a packet and sent along with a header to a destination. In a vector dataflow environment a number of such scalar results are sent as one vector packet carrying a single header containing the destination information. Moreover, when a number of vector packets are destined for the same processing element it is sufficient to send only one copy of the vector along with the number of destination nodes and their addresses. This scheme is affordable only in the presence of vector data.
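To make the header-overhead argument concrete, the following is a minimal C sketch of the two packet forms. The field names and sizes are illustrative assumptions, not the formats used by the machine described in this paper; the point is only that the header cost is amortized over a whole vector.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical packet layouts (field names are assumptions):
 * the conventional scheme sends one header per scalar result,
 * the vector scheme shares one header among all elements.     */
typedef struct {
    uint32_t dest_node;    /* destination node address        */
    uint32_t dest_port;    /* operand port at the destination */
} header_t;

typedef struct {           /* conventional: header + 1 value  */
    header_t hdr;
    double   value;
} scalar_packet_t;

typedef struct {           /* vectorized: header + 64 values  */
    header_t hdr;
    uint32_t length;
    double   elements[64];
} vector_packet_t;

int main(void)
{
    /* Header bytes spent per delivered scalar result. */
    printf("scalar scheme: %zu header bytes per result\n", sizeof(header_t));
    printf("vector scheme (64 elements): %.2f header bytes per result\n",
           (double)sizeof(header_t) / 64.0);
    return 0;
}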
It is not always possible to overlap processing and scheduling events. The reason is that while the duration of processing events is almost constant, the durations of scheduling events may vary considerably. This is mainly due to variations in:
1. Communication latency as a result of queueing delays.
2. The response time of structure memory transactions as the number of requests increases.
3. The degree of parallelism in a program during execution [Arvind 88].
During periods in which scheduling events cannot be overlapped with processing events, pipeline starvation occurs. Reductions in scheduling overhead become significant during such periods, since they may either eliminate starvation or reduce it. The reason is that, on the one hand, vectorization increases the average duration of processing events; on the other hand, it reduces the average duration of scheduling events. Consequently, the probability of overlapping processing with scheduling events increases.
The ideas presented above reveal some aspects of our model. In addition, the model supports an efficient synchronization mechanism based on the ability of arcs to queue vectors. It also supports a good degree of parallelism between different activations of the same node, yet the overhead required to achieve this is less than that needed in dynamic schemes. The model is described in the next section. A target architecture is presented in § 3. In § 4 we discuss some implementation issues. Performance results obtained from a hardware-level simulator are given in § 5, while § 6 is devoted to conclusions.
2. The Model
In our model a dataflow graph consists of two types of nodes: vector nodes and scalar nodes. A vector node has at least one operand of type vector. An arc connecting the output of a vector node to the input port of another is called a vector arc. A vector arc carries a pointer to a vector, and any number of vector pointers may occupy a vector arc at any time. The only restriction is that the vectors are processed according to an FCFS (first-come, first-served) discipline. As soon as a vector node is forwarded for processing, its operand fields are updated with pointers to the next vector in the queue, if any. In this way it is possible to execute more than one activation of a vector node at the same time. Again, the only restriction is that results are handled (stored in memory, communicated or processed) in an FCFS manner. This may be achieved with the help of a suitable result buffering scheme which guarantees correct handling of results even if the processing times of different activations differ (as a result of different vector lengths, for example). Scalar nodes, on the other hand, are not allowed to queue operands on their arcs; the overhead of managing such a queue for scalar data would be unacceptable. One way of synchronizing scalar data in our model is to use explicit synchronization nodes.
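The FCFS queueing discipline on a vector arc can be illustrated with a small C sketch. The struct layout, function names and malloc-based storage below are our own illustrative assumptions, not the machine's data structures (the actual implementation uses the graph memory and vector link tables described in § 3).

#include <stdlib.h>

typedef struct vec_entry {             /* one queued vector pointer      */
    double           *vector;          /* address of the vector's data   */
    struct vec_entry *next;
} vec_entry_t;

typedef struct {                       /* a vector arc: an FCFS queue    */
    vec_entry_t *head;                 /* oldest queued vector pointer   */
    vec_entry_t *tail;                 /* newest queued vector pointer   */
} vector_arc_t;

/* A producing node appends its result pointer to the arc. */
void arc_enqueue(vector_arc_t *arc, double *vector)
{
    vec_entry_t *e = malloc(sizeof *e);
    e->vector = vector;
    e->next   = NULL;
    if (arc->tail) arc->tail->next = e; else arc->head = e;
    arc->tail = e;
}

/* When the consuming node fires, it takes the oldest pointer (FCFS);
 * any pointers left behind enable further activations of the node.   */
double *arc_dequeue(vector_arc_t *arc)
{
    vec_entry_t *e = arc->head;
    if (!e) return NULL;               /* arc empty: node not enabled   */
    arc->head = e->next;
    if (!arc->head) arc->tail = NULL;
    double *v = e->vector;
    free(e);
    return v;
}

In this sketch a producer calls arc_enqueue for each result vector, while the firing rule calls arc_dequeue once per activation; several activations can therefore be outstanding at once, as described above.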
Both scalar and vector nodes have a next-node field which points to a successor node. In vector operations, this information is used to process sequences of nodes without writing intermediate results to memory; the next-node pointer may also be used to perform vector chaining (a sketch of such chained evaluation is given at the end of this section). Scalar nodes also use this information for efficient processing of sequential nodes, and as a means of implicit synchronization.
Our reasons for adopting neither a static nor a dynamic firing scheme are the following:
1. The static model restricts the amount of run-time parallelism and suffers from a long instruction execution period [Dennis 89]. The latter becomes even worse in the case of vector nodes.
2. In the dynamic model, the run-time unfolding of loops results in an execution pattern which proceeds in both the depth and the breadth of a graph. Execution proceeds breadth-first whenever a new iteration is activated. This execution pattern eventually generates large numbers of partially enabled nodes instead of completely enabled ones, and consequently leads to pipeline starvation [Gurd 85, Böhm 89]. The problem may be solved, at the cost of additional overhead, by throttling techniques [Yamana 88, Arvind 90].
3. To implement the dynamic dataflow scheme a large number of special scalar instructions are required to guarantee correct program execution. The number of these overhead instructions may be as high as 150% of the original computation [Arvind 88]. This large scalar overhead increases the ratio of non-vectorizable nodes in the graph, subsequently degrading overall performance.
Several dynamic vector dataflow architectures have been proposed or constructed. The hierarchical data-driven model proposed by Najjar and Gaudiot [Najjar 88] has a low-level asynchronous vector operation mode and a high-level dynamic mode. The first operational vector dataflow computer is SIGMA-1; results from this machine reveal that vector programs can be executed three to four times faster than by unfolding scalar loops [Hiraki 88]. Another vector dataflow machine is the Harray system, which has a special activation scheme for reducing the negative effects of dynamic unfolding [Yamana 88].
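As a rough illustration of next-node chaining, the following C sketch keeps an intermediate vector result in a register-like buffer and hands it directly to the successor node instead of writing it back to memory. The node layout and the two example operations are our own assumptions, intended only to show the idea behind the 52% reduction reported in § 5.

#define VLEN 64

enum opcode { OP_VADD, OP_VMUL };

typedef struct node {
    enum opcode   op;
    const double *operand;      /* second (memory-resident) operand  */
    struct node  *next_node;    /* successor in the chain, or NULL   */
} node_t;

/* Evaluate a chain of vector nodes.  The running result stays in
 * 'acc' (standing in for the vector register) between nodes; only
 * the final result would be written back to vector memory.         */
void eval_chain(const node_t *n, const double *input, double *result)
{
    double acc[VLEN];
    for (int i = 0; i < VLEN; i++) acc[i] = input[i];

    for (; n != NULL; n = n->next_node) {
        for (int i = 0; i < VLEN; i++) {
            switch (n->op) {
            case OP_VADD: acc[i] = acc[i] + n->operand[i]; break;
            case OP_VMUL: acc[i] = acc[i] * n->operand[i]; break;
            }
        }
    }
    for (int i = 0; i < VLEN; i++) result[i] = acc[i];
}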
3. The Architecture
The vector dataflow processor architecture is depicted in Figure 1. The graph memory stores the program assigned to the processor. The ready queue holds the addresses of enabled nodes. The fetch and control unit reads an address from the ready queue, fetches a node structure pointed to by that address and forwards the operation code along with any scalar operands to the functional unit. Vector operands are fed directly to the functional unit from vector memory. Vector data are stored in vector memory which is multiported to
allow concurrent read-write operations. The update unit receives either data packets or address packets. Data packets are used for updating scalar nodes with values, while address packets update vector nodes with pointers to locations in vector memory. The queueing mechanism on vector arcs is implemented by two tables in the communication processor. The first, called the graph memory table, shows the status of vector operands (updated or not) and has pointer fields to the first and last vector pointers on an arc, if any. The vector pointers themselves are stored in another table, called the vector link table, where they are linked as a FIFO queue. When a vector node is fetched for processing, its corresponding operand fields in the graph memory table are updated, and a packet containing the value of the next pointer in the FIFO queue, if any, updates the graph memory. When a node is updated, its address is appended to the ready queue if both its operands are present. Note that this queue management overhead is completely overlapped with the processing of the first elements of the vector at the head of the queue.
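As a rough software model of this mechanism, the following C sketch mimics the two tables. Entry layouts, array sizes and function names are illustrative assumptions rather than the actual hardware format.

#include <stdbool.h>

#define MAX_LINKS 1024

typedef struct {                 /* entry in the vector link table        */
    int vec_addr;                /* location of the vector in vector mem  */
    int next;                    /* next link in the FIFO, or -1          */
} link_entry_t;

typedef struct {                 /* per-operand entry, graph memory table */
    bool updated;                /* is a vector pointer currently queued? */
    int  first, last;            /* head/tail links on this arc, or -1    */
} gm_entry_t;

link_entry_t link_table[MAX_LINKS];
int          free_link = 0;      /* trivial link allocator for the sketch */

/* A new vector pointer arrives for an operand: append it to the FIFO. */
void operand_arrived(gm_entry_t *op, int vec_addr)
{
    int l = free_link++;
    link_table[l].vec_addr = vec_addr;
    link_table[l].next     = -1;
    if (op->last >= 0) link_table[op->last].next = l;
    else               op->first = l;
    op->last    = l;
    op->updated = true;
}

/* The node is fetched: consume the head pointer and expose the next one. */
int operand_consumed(gm_entry_t *op)
{
    int l = op->first;
    if (l < 0) return -1;        /* nothing queued on this arc            */
    int vec_addr = link_table[l].vec_addr;
    op->first = link_table[l].next;
    if (op->first < 0) { op->last = -1; op->updated = false; }
    return vec_addr;             /* handed to the functional unit         */
}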
[Figure 1: The Vector Dataflow Processor — block diagram showing the communication processor, graph memory, ready queue, fetch & control unit, vector memory and functional unit.]

The communication processor also handles communication with other processing elements, and is responsible for storing vectors in vector memory and for updating the above-mentioned tables whenever a new vector is received. There is one vector register used for saving intermediate results; intermediate scalar results are buffered in the functional unit.
4. Implementation Issues

We are currently constructing one vector dataflow processor using both off-the-shelf components and PAL chips. Work on the construction of the processor is nearing completion. The processor consists of eight modules. Because of space limitations we describe only the most interesting ones.
• The communication handler module, the heart of which is one IMS801 transputer. It is used for communication and downloading.
• The vector and graph memory handler, which is based on another IMS801 transputer. This module is responsible for managing and updating both the graph memory and the vector link tables. It also controls write operations to vector memory.
• The result handler module. This module consists of a number of FIFO queues for the efficient management of results, and is constructed from PAL and FIFO chips. It is worth mentioning that in this implementation each destination processor receives only one copy of a result; a counter indicating the number of copies to be read, along with the destination node addresses, is also sent to the processor (a sketch of this scheme follows this list). This significant reduction in communication overhead is possible only in the presence of vector data.
• The functional unit. This module consists of an integer ALU (AM29C332), an integer multiplier (AM29C323) and a floating-point unit (AM29C325). Four scalar registers and one vector register are used for buffering and manipulating data. The functional unit cycle is 200 ns.
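The single-copy result scheme can be sketched as follows in C. The packet layout, array sizes and the update callback are illustrative assumptions, intended only to show how one copy of a vector is fanned out to several destination nodes on the same processor.

#define VLEN     64
#define MAX_DEST 16
#define VMEM_SZ  4096

/* Hypothetical result packet: one copy of the vector data, plus a
 * count and the addresses of all destination nodes on this processor. */
typedef struct {
    double data[VLEN];           /* the single copy of the vector        */
    int    n_dest;               /* how many nodes must read it          */
    int    dest_node[MAX_DEST];  /* their addresses in graph memory      */
} vector_result_t;

static double vector_memory[VMEM_SZ];   /* simulated vector memory       */
static int    vmem_top = 0;

/* Store the vector once in vector memory and return its address. */
static int vmem_store(const double *data, int len)
{
    int addr = vmem_top;
    for (int i = 0; i < len; i++) vector_memory[vmem_top++] = data[i];
    return addr;
}

/* Hand the address of the stored vector to every destination node;
 * on the real machine this corresponds to the address packets sent
 * to the update unit.                                                */
void deliver_result(const vector_result_t *r,
                    void (*update_node)(int node_addr, int vec_addr))
{
    int vec_addr = vmem_store(r->data, VLEN);    /* one write, one copy  */
    for (int i = 0; i < r->n_dest; i++)
        update_node(r->dest_node[i], vec_addr);  /* fan out only pointers */
}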
5. Performance

A detailed hardware-level simulator was written in the process-oriented, C-based language CSIM [Schwet 87]. Benchmarks include partial differential equation (PDE), vector summation and matrix multiplication programs, run on a vector dataflow multiprocessor. We also studied the effects of different job mixes on performance by executing several programs in a multiprogramming environment. Figure 2 depicts system throughput for the following programs:
1. A PDE (Laplace equation) program for a mesh of 16x64 points on a nearest-neighbour topology. The ratio of floating-point vector nodes to the total number of nodes in this program is 0.235 (out of 272 nodes only 64 are floating-point vector operations).
2. A vector summation program adding 5120 floating-point elements, in vectors of length 64. The above ratio in this case is only 0.166.
Figure 3 depicts functional unit utilization for the same programs. This graph shows that utilization is not very sensitive to variations in the degree of available parallelism per processor. The effect of reducing the degree of parallelism on utilization is more pronounced in conventional schemes [Gaudiot 85]. Another benchmark showed a 52% reduction in the processing time of the PDE program (on one processor) when intermediate results were saved in registers instead of being written back to memory. Short-term use of registers for holding intermediate results leads to higher utilization and lower scheduling overhead, because such results bypass the scheduling pipe.
[Figure 2: Throughput Measurements — throughput (MFLOPS) versus number of PEs (2 to 18) for the PDE and vector summation programs.]

[Figure 3: Utilization Measurements — functional unit utilization versus number of PEs (2 to 18).]

6. Discussion

In this paper we have argued that it is possible to reduce the durations of scheduling events through the vectorization of dataflow computations. As a result, the probability of overlapping processing events with scheduling events is increased. Our model is based on the ability of vector arcs to queue operands. This scheme supports a good degree of parallelism between different activations of the same node. We also believe that our scheme requires less overhead than existing dynamic schemes (cf. § 2). The large number of scalar nodes required to guarantee correct execution in dynamic schemes has a particularly harmful effect on the performance of vector computations. This is especially true if a graph is already difficult to vectorize. As an example, the addition of one overhead scalar node to the vector summation program reduces throughput by 3.75 MFLOPS when the program is executed on 16 processors.

Acknowledgements

I am grateful to my supervisor Prof. Lars-Erik Thorelli for his guidance and support. Many thanks to Johan Wennlund and Fredrik Lundevall for solving the implementation problems and supervising the construction. I am very grateful to the students of the EF 160 computer construction course for their efforts in designing, constructing and testing the various modules. This work was supported by The Swedish National Board for Technical Development (STU).

References
[Arvind 88]: Arvind, et al. Assessing the benefits of fine-grain parallelism in dataflow programs. Proc. of Supercomputing '88.
[Papado 89]: Papadopoulos, G. M. Program Development and Performance Monitoring on the Monsoon Dataflow Multiprocessor. Instrumentation for Future Parallel Computing Systems. ACM Press.
[Najjar 88]: Najjar, W., Gaudiot, J.-L. A Hierarchical Data-Driven Model for Multi-Grid Problem Solving. High Performance Computer Systems, E. Gelenbe (Ed.). Elsevier Science Publishers B.V.
[Hiraki 88]: Hiraki, K. et al. Efficient Vector Processing on a Dataflow Supercomputer SIGMA-1. Proc. of Supercomputing '88.
[Yamana 88]: Yamana, H. et al. System Architecture of Parallel Processing System -Harray-. Proc. of the Int. Conf. on Supercomputing.
[Dennis 89]: Dennis, J. B. Dataflow Computation for Artificial Intelligence. Parallel Processing for Supercomputers and Artificial Intelligence. McGraw-Hill.
[Gurd 85]: Gurd, J. R. et al. The Manchester Prototype Dataflow Computer. CACM, Vol. 28, No. 1.
[Böhm 89]: Böhm, A. P. W. et al. The effect of iterative instructions in dataflow computers. Proc. of the Int. Conf. on Parallel Processing.
[Arvind 90]: Arvind, Nikhil, R. S. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. on Comp., Vol. 39, No. 3.
[Schwet 87]: Schwetman, H. D. CSIM Reference Manual. MCC, Austin, TX.
[Gaudiot 85]: Gaudiot, J.-L. et al. A distributed VLSI architecture for efficient signal and data processing. IEEE Trans. on Comp., Vol. C-34, No. 12.
[Veen 86]: Veen, A. H. Dataflow Machine Architecture. ACM Computing Surveys, Vol. 18, No. 4.