JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 3, 305-327 (1986)
Processor Allocation in a Multi-ring Dataflow Machine

M. C. C. BARAHONA AND J. R. GURD

Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, England

Received April 30, 1985
The performance of a multiprocessor architecture is determined both by the way the program is partitioned into processes and by the way these processes are allocated to different processors. In the fine-grain dataflow model, where each process consists of a single instruction, decomposition of a program into processes is achieved automatically by compilation. This paper investigates the effectiveness of fine-grain decomposition in the context of the prototype dataflow machine now in operation at the University of Manchester. The current machine is a uniprocessor, known as the Single-Ring Dataflow Machine, comprising a single processing element which contains several units connected together in a pipelined ring. A Multi-ring Dataflow Machine (MDM), containing several such processing elements connected together via an interprocessor switching network, is currently under investigation. The paper describes a method of allocating dataflow instructions to processing elements in the MDM, and examines the influence of this method on selection of a switching network. Results obtained from simulation of the MDM are presented. They show that programs are executed efficiently when their parallelism is matched to the parallelism of the machine hardware. © 1986 Academic Press, Inc.
1. INTRODUCTION
A computation in a conventional (von Neumann) computer is achieved by execution of a sequence of instructions. Each such execution constitutes a "side effect" in that it changes the state of a memory. Conventional languages reflect this model of computation through the use of variables (corresponding to memory locations) whose values are changed by means of assignment statements (which thus change the state of the memory). The computing speed of a von Neumann uniprocessor is determined by the rate at which the instructions can be performed. This is in turn limited by technological constraints. Speed can be improved in MIMD architectures by partitioning a program into multiple sequential processes that execute
concurrently in different processors. MIMD architectures can be divided into two broad categories, namely closely coupled and loosely coupled systems. Closely coupled systems, such as the Denelcor HEP [34] and the NYU Ultracomputer [15], contain several processors connected to a global memory through a switching network. Data shared among different processes are stored in the global memory. Processes may be delayed both by the synchronization constraints of the program (a value should not be used by one process before it has been created by another) and by conflicts when accessing the common memory. Loosely coupled systems, such as the Cosmic Cube [32], do not suffer the same drawbacks since their processors have no common memory. All communication between processes is achieved by exchange of messages. Processes are triggered by the arrival of input messages, and they send output messages that in turn trigger other processes. If the switching network has sufficient throughput, the overall processing rate is limited solely by the logical constraints of the program and the availability of processors.

Both closely coupled and loosely coupled MIMD machines require that a program be decomposed into multiple cooperating processes. Ideally, the user should be able to write programs without being concerned about the details of the machine, and a tool for automatic decomposition of a program into the optimum configuration of processes would be essential for achieving this ideal. In practice, optimum decomposition is not currently feasible for state-based models of multiprocessing, and there are reasons for believing it will never be so. The side effects of state-based models impose dependencies on instructions that impede their concurrent execution. Compilation techniques which detect and break spurious dependencies are only able to detect trivial cases due to the complexity of the underlying model of computation [2]. As a consequence, the performance of these architectures relies on the ability of the user to decompose the program adequately into processes that can execute concurrently.

In order to overcome the above difficulties, MIMD machines based on the dataflow model of computation have been proposed for efficient exploitation of inherent parallelism in programs [10, 29, 30, 11, 8, 3, 19, 35]. Dataflow instructions are functional (free from side effects) and their execution is triggered as soon as their input data values are available. Data are carried from instruction to instruction by tokens (which correspond to the interprocess messages in a loosely coupled system) traveling along arcs (which correspond to interprocess communication channels). Compilation therefore constitutes an automatic decomposition of a program into fine-grain processes, each one comprising a single instruction. Great potential has been claimed for dataflow multiprocessors but, as yet, there is little practical evidence to support such claims. Operational dataflow hardware is barely past the prototype stage, with systems often containing just one processing element, as is the case, for example, for the Manchester
Dataflow Machine [19]. Most of the published performance data on multiprocessors have been obtained by means of simulation experiments [14, 35, 13]. This paper presents the important results of a simulated performance evaluation for a Multi-ring Manchester Dataflow Machine composed of up to 64 processing elements, each similar in structure to the prototype Single-Ring Machine [4].

The following section summarizes the basic concepts of the dataflow model of computation. Section 3 describes the Manchester Dataflow Machine, together with the simulator that was developed to study the multi-ring system. In Section 4, a method of allocating a processor to each dataflow instruction is introduced and its effectiveness in distributing the instructions is assessed for a set of benchmark programs. Section 5 analyzes the requirements for the interprocessor switching network. A selected network is studied in detail. Section 6 investigates the parallelism required of a program in order for it to run efficiently on a multi-ring machine. Finally, the results are summarized and their validity for an enhanced Manchester Dataflow Machine is assessed.
2. THE DATAFLOW APPROACH
In the dataflow model of computation, a program is described as a directed graph in which the nodes represent instructions and the arcs represent paths along which data values flow. The data values are carried in packets known as tokens. Execution of each instruction is triggered when input data tokens are present on all of its input arcs. Execution may take place immediately or subsequently, depending on the availability of processing resources. The input data tokens, which are said to match together, are extracted from the input arcs, the operation is performed, and the result data tokens are sent along the output arcs. As an example, the dataflow program shown in Fig. 1 evaluates the expression ((a * b) - (b * c)).

The dataflow model possesses several interesting properties. It does not impose spurious dependencies on the order of execution of instructions: Instructions are triggered whenever their input data tokens are available, and their execution cannot be delayed by conflicts of access to memory. Totally decentralized instruction control is thus possible, because scheduling of instruction execution does not rely on a global state. Finally, with the fine granularity used, each instruction constitutes a logical process, and so the number of instantaneously active processes can be large.

An abstract machine to implement the dataflow model consists of a pool of homogeneous processing elements (PEs) connected together by an interprocessor network. All operations take the same time to execute, and the delays involved in directing tokens through the network, matching them, fetching the code, and so on, are neglected. In such a machine, computation proceeds by a series of processing steps, as shown in Fig. 1.
FIG. 1. Dataflow program to compute a * b - b * c.
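To make the abstract machine concrete, the following Python sketch steps the graph of Fig. 1 under the firing rule described above and reports the step counts for a few pool sizes. The node names, the dictionary encoding of the graph, and the helper function are illustrative choices made here, not part of the Manchester design.

```python
INPUTS_NEEDED = {"DUP": 1, "MUL1": 2, "MUL2": 2, "SUB": 2}
SUCCESSORS = {"DUP": ["MUL1", "MUL2"], "MUL1": ["SUB"], "MUL2": ["SUB"], "SUB": []}

def steps(p=None):
    """Processing steps S_P needed with p PEs (p=None models an unbounded pool)."""
    # Initial tokens: a -> MUL1, b -> DUP, c -> MUL2.
    waiting = {"DUP": 1, "MUL1": 1, "MUL2": 1, "SUB": 0}
    not_yet_fired = list(INPUTS_NEEDED)
    n_steps = 0
    while not_yet_fired:
        enabled = [n for n in not_yet_fired if waiting[n] >= INPUTS_NEEDED[n]]
        fired = enabled if p is None else enabled[:p]     # at most p firings per step
        for node in fired:
            not_yet_fired.remove(node)
            for succ in SUCCESSORS[node]:                 # send result tokens onward
                waiting[succ] += 1
        n_steps += 1
    return n_steps

s1, s_inf = steps(1), steps(None)
print(f"S_1 = {s1}, S_inf = {s_inf}, pi_av = {s1 / s_inf:.3f}")
for p in (1, 2, 4):
    sp = steps(p)
    print(f"P = {p}: S_P = {sp}, SU_P = {s1 / sp:.2f}, efficiency = {s1 / (p * sp):.2f}")
```

For this graph the sketch reproduces the values quoted in the text: S_1 = 4, S_∞ = 3, and π_av = 1.333.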
The total number of processing steps required by a program (S_P) depends on the number of PEs in the pool (P). Two important characteristics of a program are S_1 and S_∞, the number of steps required to execute the program with one and an infinity of PEs, respectively. The parallelism of a program can be characterized approximately by π_av, the ratio between S_1 and S_∞, which represents the average number of instructions executed in each processing step when an infinite number of processors are available. For the program of Fig. 1, S_1 = 4, S_∞ = 3, and π_av = 1.333. The speedup, SU_P, achieved with an abstract machine containing P PEs, is given by S_1/S_P. Ideally, SU_P = P; in practice, SU_P ≤ P. The speedup efficiency is defined as ε_P = S_1/(P·S_P). For P ≥ 1, S_1 ≥ S_P ≥ S_∞, so that the speedup and the speedup efficiency for a program are limited by SU_P ≤ π_av and ε_P ≤ π_av/P. To be executed efficiently, a program must exhibit average parallelism π_av greater than the parallelism of the machine P. This general requirement is adapted for the MDM in Section 6.

A single dataflow instruction is often activated (instantiated) several times during execution of a program. This is known as multiple instantiation of the instruction. There are two situations in which it can occur [1]. In the first, the instruction belongs to the body of a loop, either being used iteratively to build up a value or being applied repetitively to successive elements of a data
structure; in the second, the instruction belongs to a function body, that is, a collection of instructions which can be "called" by the programmer just as though it were an instruction. Multiple instantiation complicates the matching process since there will be multiple token sets traversing the input arcs, and tokens belonging to different instantiations must not be allowed to match. Several solutions to this problem have been proposed. Static dataflow systems limit the number of instruction instantiations that may be used at any time; the limit may be one [11] or several [29] instantiations. However, this reduces program parallelism, by preventing recursion [18], and, in practical (as opposed to abstract) systems, also imposes communication overheads, since an executing instruction must indicate that its input arcs have been cleared using an "acknowledge" communication sent backward to the predecessor instruction [10]. Dynamic dataflow systems implement general multiple instantiation, either by copying the appropriate code for each instantiation [8], or by allowing a single copy of the code to be shared using tags [3, 19]. The latter technique can be thought of as assigning different "colors" to tokens belonging to different instantiations [9].
3. THE MANCHESTER DATAFLOW MACHINE

A tagged-token scheme has been chosen for study at the University of Manchester [19]. Each token carries, in addition to its data value, the address (AD) of the destination instruction, a color, and some control bits. The color is further subdivided into three fields: Multiple instantiations within a function body are distinguished by the activation name (AN), multiple instantiations within an iterative loop body are distinguished by the iteration level (IL); while position within a data structure is identified by the index (IX). The three color fields can be used independently so that the different types of multiple instantiation may be nested. Multiple layers of nesting can be achieved by wrapping each layer in a function body.

The prototype Manchester Dataflow Machine is a processing element composed of several specialized units pipelined together in a ring. The units include a Token Queue, a Matching Unit, an Instruction Store, and a Processing Unit, as shown in Fig. 2. Each unit has its own internal clock, and the units are connected together by asynchronous interfaces [16].

The Token Queue is used to smooth the flow of tokens around the ring. Tokens are stored here until the Matching Unit is available to process them. The heart of the Token Queue is a buffer store which releases the stored tokens according to a predetermined discipline (a first-in/first-out mechanism is being used at present). In common with other units, the Token Queue contains a number of buffer registers. Input and output buffers are used to communicate with neighboring units, and an internal "smoothing" buffer is used to regulate the period between successive tokens at the output.
FIG. 2. The prototype Manchester Single-Ring Dataflow Machine.
The Matching Unit matches together the tokens that trigger each dataflow operation. A pseudoassociative hashing mechanism is used to match pairs of tokens, and the unit is divided into two stages [33]. In the first stage, a hashing function is applied to the address and color fields of the incoming token. In the second stage, the resultant hash address is used to search several memory banks (which store unmatched tokens) in parallel. If no matching token is found, the incoming token is stored in one of the memory banks. Otherwise, the matching token is extracted from the store, and the resultant pair of tokens is formed into a packet which is sent to the Instruction Store. Tokens directed to a monadic operator do not need to search the memory banks for a partner, so they bypass the second stage and a "dummy" matching token is placed in the output packet for the sake of uniformity. Overflow of the parallel hash memory banks is handled by a separate Overflow Unit.

The Instruction Store contains the program code, which is preloaded using special tokens. A token pair arriving here from the Matching Unit fetches the instruction code (indicating the operation to be performed and the address to which the result should be sent) and the resulting "executable" packet is sent to the Processing Unit.

The Processing Unit is divided into two stages. The first stage executes certain global operations, such as the generation of unique activation names. The second stage contains an array of 20 microcoded Function Units, each capable of executing all the nonglobal operations. The number of Function Units is chosen so as to avoid bottlenecks due to the slow execution of microcoded operations. An executable packet arriving at the second stage is sent to a free Function Unit by a distributor. Results from the Function Units are collected by an arbitrator and sent towards the output buffer of the Processing Unit.
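As an illustration of the two-stage matching mechanism just described, the sketch below models a token's tag, a set of pseudoassociative stores, and the bypass path for monadic operators. The class names, field widths, hash function, and number of banks are assumptions made for this sketch; they are not the hardware's actual parameters.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Tag:
    ad: int   # destination instruction address (AD)
    an: int   # activation name (AN)
    il: int   # iteration level (IL)
    ix: int   # index within a data structure (IX)

@dataclass
class Token:
    tag: Tag
    value: float
    monadic: bool = False   # destined for a monadic operator: no partner needed

class MatchingUnit:
    def __init__(self, banks=4):
        # One store per bank; the pseudoassociative search is modeled here by a
        # dictionary lookup keyed on the full tag.
        self.store = [defaultdict(list) for _ in range(banks)]
        self.banks = banks

    def _hash(self, tag):
        # First stage: hash the address and color fields to pick a bank.
        return (tag.ad ^ tag.an ^ tag.il ^ tag.ix) % self.banks

    def process(self, token):
        """Return a (token, partner) packet for the Instruction Store, or None
        if the token has been stored to await its partner."""
        if token.monadic:
            return (token, None)              # bypass: "dummy" partner for uniformity
        bank = self.store[self._hash(token.tag)]
        if bank[token.tag]:                   # second stage: search for a partner
            return (token, bank[token.tag].pop())
        bank[token.tag].append(token)         # no partner yet: store and wait
        return None

mu = MatchingUnit()
t = Tag(ad=100, an=1, il=0, ix=0)
print(mu.process(Token(t, 2.0)))              # None: first token of the pair is stored
print(mu.process(Token(t, 3.0)))              # (token, partner): pair released
```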
The ring is completed by a Switch Unit, which connects the output of the Processing Unit to the input of the Token Queue. The Switch Unit also provides a communication link between the dataflow machine and a conventional Host computer. A Multi-ring Dataflow Machine comprises several such processing elements, as shown in Fig. 3. The Switch Unit is enhanced so as to connect together the Host and the multiple processing elements.

The objective of current work at Manchester is to establish the viability of a Multi-ring Dataflow Machine containing of the order of 100 processing elements. Now that the internal structure of a single processing element is known, this work can be pursued by simulation. A software simulator for the Manchester MDM has therefore been implemented to investigate system performance. The structure of each hardware unit is approximated within the simulator, so that internal buffers and the timing of token transfers between them are modeled realistically. For the sake of efficiency, the disparate clock rates of individual units are simulated by a common clock, and the asynchronous interface between units is modeled by a synchronous protocol. A summary of the differences between the prototype single-ring hardware and the simulated version is illustrated in Fig. 4.

Using a totally sequential test code and a highly parallel divide-and-conquer program, the simulated Single-Ring Dataflow Machine was found to perform similarly to the hardware, to within 5% accuracy [4]. Simulation data for several different programs running on various multi-ring configurations have since been analyzed. The main results are presented in Section 6.
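A minimal sketch of this modeling style, assuming a common clock and one bounded FIFO per unit, is given below; the unit names follow Fig. 2, while the latencies, capacities, and function names are invented for illustration.

```python
from collections import deque

class Unit:
    """One ring unit: a bounded FIFO with a fixed latency, ticked by a common clock."""
    def __init__(self, name, latency, capacity=8):
        self.name, self.latency, self.capacity = name, latency, capacity
        self.fifo = deque()                   # entries are (ready_time, token)

    def accept(self, token, now):
        if len(self.fifo) >= self.capacity:
            return False                      # input buffer full: sender retries later
        self.fifo.append((now + self.latency, token))
        return True

    def tick(self, now, downstream):
        # Forward at most one token per tick, and only if the downstream unit's
        # input buffer can take it (a synchronous stand-in for the asynchronous links).
        if self.fifo and self.fifo[0][0] <= now:
            if downstream.accept(self.fifo[0][1], now):
                self.fifo.popleft()

def run_ring(units, cycles, inject):
    """Advance a closed ring of units; inject(now) may offer a token to the first unit."""
    for now in range(cycles):
        token = inject(now)
        if token is not None:
            units[0].accept(token, now)       # dropped if the Token Queue is full
        for i, unit in enumerate(units):
            unit.tick(now, units[(i + 1) % len(units)])

# In this skeleton tokens simply circulate; real units would transform them.
ring = [Unit("Token Queue", 1), Unit("Matching Unit", 2), Unit("Instruction Store", 1),
        Unit("Processing Unit", 3), Unit("Switch", 1)]
run_ring(ring, cycles=100, inject=lambda now: f"tok{now}" if now < 10 else None)
print([(u.name, len(u.fifo)) for u in ring])
```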
FIG. 3. The Multi-ring Dataflow Machine architecture.
FIG. 4. Single-ring hardware and simulated version. (a) The hardware ring. (b) The simulated ring. All transfer times between buffers are shown in nanoseconds.
4. ALLOCATION OF INSTRUCTIONS TO PROCESSING ELEMENTS
4.1. Split Functions and Their Merit

The speedup efficiency of the Multi-ring Dataflow Machine depends on the way in which instructions are allocated to processing elements. Because of the close relationship between instructions and tokens (each instruction is triggered by the arrival of one or two tokens), distributing tokens has virtually the same effect as distributing instructions. The allocation process is described below in terms of the former. Two approaches may be taken to distributing tokens, depending on whether or not locality (i.e., clustering of tokens in one PE, or a group of PEs) is enforced by the compiler.

In the first approach, the fine dataflow granularity is ignored on the grounds that it is at too low a level to allow efficient management of the machine
resources [2]. In particular, there is thought to be too much difficulty involved in allocating individual instructions so as to achieve optimum performance and in controlling token traffic through the Switch. Instructions are therefore grouped together into higher-level processes (e.g., those forming a function or a loop body) and instruction allocation is based on the resulting process structure. Conversely, the second approach views the low level granularity and the concomitant large number of tokens in the machine as an advantage. It is assumed that the pattern of token tags (address and color) is more or less random over time so that it is possible to use stochastic techniques to scatter the tokens evenly across the PEs.

The second approach is rather more straightforward than the first, and, although previous work at Manchester has shown that locality can be useful in management of certain machine resources [31], it was decided to use this approach to study system performance. A randomizing split function is applied to the tag of each token arriving at the Switch, yielding the number of the PE to which the token should be directed. For this purpose, the MDM is considered to be formed of P processing elements which are numbered from 0 to P - 1. The (static) merit, ρ, of the split function is defined as the ratio between the average number and the maximum number of tokens that any PE receives over a complete program run. The highest merit (100%) is thus obtained when every PE receives exactly the same number of tokens. It is known that this allocation is not optimal in the sense of always leading to the earliest possible finish time for a program. However, for large numbers of tokens, it is expected that the random nature of allocation will lead to near-optimal behavior. Evidence supporting this expectation is presented in Section 6. It has been shown by Barahona [4] that, in a MDM which is free of starvation in the PE pipeline stages, the speedup efficiency has the same value as the static merit. In the following, no account is taken of the (dynamic) time-variance of merit.

If the split function is applied only to the address (AD) field of each token, all instantiations of a particular instruction will be executed in the same PE. Therefore the program code can be distributed among the PEs without replicating any instructions. Using the color fields (AN, IL, and IX) in addition to the address yields a better merit, but it also requires that the code be replicated to some extent. There is an obvious relationship between the extent of code replication and the extent to which the color fields contribute to the split function. This is characterized below by the parameter R, the total number of code replicas required. Each split function is thus a function of the following: the address (AD), the color (AN, IL, IX), the number of PEs (P), and the number of code replicas (R). The following split function sf is used as an example in the rest of the paper:
    sf := (AD ⊕ ((IL ⊕ AN ⊕ IX) mod R)) mod P,
where ⊕ denotes the exclusive-or operator. Assuming P = 2^p (and R = 2^r), the mod operators yield the least significant p (respectively r) bits of the left-hand operand. Several other split functions have been analyzed. Some of them use arithmetic sum operators in place of the exclusive-or operators in sf. Others use a "folding" method to assess how different bits of the address and color influence splitting. In the "folding" method, the complete argument to the split function is sliced into groups of p (or r) bits which are then exclusive-ored together. All of the tested split functions showed similar merit in a large number of simulations [4].
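A direct transcription of sf into Python is shown below, assuming P = 2^p and R = 2^r so that each mod simply keeps the low-order bits; the example tag values are invented.

```python
def sf(ad, an, il, ix, p_pes, r_replicas):
    """PE number (0 .. p_pes-1) for a token tag, with P and R powers of two."""
    color = (il ^ an ^ ix) % r_replicas       # color contribution: r low-order bits
    return (ad ^ color) % p_pes               # folded into the address: p low-order bits

# With R = 1 the color term vanishes, so every instantiation of an instruction
# goes to the same PE; larger R lets the color fields spread instantiations.
print(sf(ad=0x2A3, an=5, il=2, ix=9, p_pes=64, r_replicas=1))    # address only
print(sf(ad=0x2A3, an=5, il=2, ix=9, p_pes=64, r_replicas=64))   # address and color
```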
4.2. Benchmark Programs
Three different benchmark programs were executed on the simulator. Their characteristics were chosen so as to evaluate the behavior of the MDM when processing three different high-level programming language features, namely recursion, iteration, and data structure processing [7].

The binary integration program computes the integral of a function over an interval using a recursive parallel divide-and-conquer algorithm. The interval is halved, the subintervals are halved again, and the process is repeated a fixed number N of times. The subintegrals of the function over the 2^N residual intervals are calculated using a trapezoidal approximation, and the results are summed in pairs recursively until the integral is formed. The matrix multiplication program computes the product of two N × N matrices by the usual method of nested iterations. The Laplace relaxation program uses the method of relaxation to solve Laplace's equation over an N × N matrix. The relaxation is applied iteratively K times. The S_1 and S_∞ parameters of the benchmark programs are listed in Table I.

The benchmark programs were coded in two different high-level dataflow languages: MAD [7] and SISAL [24]. The simulation results were similar for both languages; those obtained with SISAL are presented below.
TABLE I
PARAMETERS OF THE BENCHMARK PROGRAMS

Program                  Data size          S_1      S_∞     π_av
Binary integration       N = 8           36,599      172    212.8
                         N = 9           73,335      192    382.0
                         N = 10         146,807      212    692.5
Matrix multiplication    N = 7           48,501      301    161.2
                         N = 9           97,095      371    261.8
                         N = 11         170,961      457    374.2
Laplace relaxation       N = 8;  K = 6    82,866      620    133.7
                         N = 8;  K = 10  134,790      872    154.6
                         N = 10; K = 6   134,130      692    193.8
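For concreteness, the binary integration benchmark of Section 4.2 can be rendered as the following recursive sketch; the integrand (math.sin) and the function name are arbitrary choices, and the benchmark itself was written in the dataflow languages MAD and SISAL rather than Python.

```python
import math

def integrate(f, lo, hi, n):
    """Integral of f over [lo, hi]: halve the interval n times, then apply the
    trapezoidal approximation to each of the 2**n residual intervals."""
    if n == 0:
        return 0.5 * (f(lo) + f(hi)) * (hi - lo)
    mid = 0.5 * (lo + hi)
    # The two recursive calls are independent, which is what gives the
    # benchmark its parallelism on a dataflow machine.
    return integrate(f, lo, mid, n - 1) + integrate(f, mid, hi, n - 1)

print(integrate(math.sin, 0.0, math.pi, 10))   # close to 2.0
```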
4.3. Merit versus Data Size
As mentioned in Section 2, some instructions are instantiated only once during a program run, while others are instantiated several times. For the given benchmark programs, the number of instantiations of each multiple-instantiated instruction depends on the input data size. In general, the set of multiple-instantiated instructions determines the merit of a split function. If the split function is computed exclusively using the instruction address field (R = 1), its merit indicates how evenly these instructions are scattered throughout the different PEs. Greater data sizes will maintain the proportion of instantiations of each instruction, and the merit can be expected to remain approximately constant. If color is also used to compute the split function (R > 1), the instruction instantiations will be further scattered, but there should still be no substantial change in merit as the data size is increased.

To check these assertions, the benchmark programs were each run with three different data sizes. The number of PEs, P, was set at 32 and 64 and the number of code replicas, R, was varied from 1 to 64. The ratio between the resultant merits was computed for different pairs of data sizes and different programs. The results are plotted in Fig. 5. The curves confirm that there are only slight variations in merit when data size is changed.
4.4. Merit versus Number of PEs and Number of Code Replicas
As the number of PEs increases, so the average number of tokens per PE decreases. If the split function is computed on address alone (R = 1), an instruction (or set of localized instructions) that is instantiated more times than the others will attract a larger number of input tokens into the PE in which it is installed, regardless of the number of PEs. Unless these critical instructions are assigned to PEs in a suitably optimized way, the imbalance in the number of tokens received by each PE is likely to increase with the number of PEs, thus degrading merit. If address and color are used to compute the split function (R > 1), lower degradation of merit (versus number of PEs) is expected because instantiations of the critical instructions will be spread across different PEs.

The curves of Fig. 6 show the variation of merit against number of PEs (P) and number of code replicas (R), and confirm the expected trends. The curves show a noticeable uniformity of merit for all the benchmark programs. This suggests that the high-level conceptual differences between the programs are masked by the low level at which the split function is applied. Recursive, iterative, and data structure handling programs all produce similar merit figures. In addition, the curves illustrate the dependence of merit on color. With 64 PEs, a merit higher than 90% is obtained when color only is used to compute the split function (R = 64). This compares very favorably with the merit of less than 30% that is obtained when color is not used at all (R = 1).
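The sketch below estimates the static merit of the example split function for a synthetic token trace in which a handful of heavily instantiated instructions receive most of the tokens. The trace, the field ranges, and the 90% skew are invented purely to reproduce the qualitative trend of Fig. 6: with R = 1 the hot instructions pile up on a few PEs, while larger R spreads their instantiations by color.

```python
import random
from collections import Counter

def sf(ad, an, il, ix, p, r):
    # Example split function from Section 4.1 (P and R assumed powers of two).
    return (ad ^ ((il ^ an ^ ix) % r)) % p

def static_merit(tags, p, r):
    counts = Counter(sf(ad, an, il, ix, p, r) for (ad, an, il, ix) in tags)
    per_pe = [counts.get(i, 0) for i in range(p)]
    return (sum(per_pe) / p) / max(per_pe)     # average / maximum tokens per PE

rng = random.Random(1)
hot = [rng.randrange(512) for _ in range(8)]   # a few heavily instantiated instructions

def random_tag():
    ad = rng.choice(hot) if rng.random() < 0.9 else rng.randrange(512)
    return (ad, rng.randrange(64), rng.randrange(32), rng.randrange(16))

trace = [random_tag() for _ in range(100_000)]
for r in (1, 8, 64):
    print(f"R = {r:2d}: estimated merit = {static_merit(trace, 64, r):.2f}")
```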
FIG. 5. Relative split function merit versus data size.

4.5. Summary
High speedup efficiency is expected in a MDM when the split function is computed on both color and address of the tokens. The program code must be replicated in some or all of the PEs, and the allocation of a particular instantiation of an instruction to a PE is made at run time by the split function. Such high speedup efficiency is only possible if both the Switch performance and the program parallelism are sufficient to prevent pipeline starvation in the PEs.
5. THE INTERPROCESSOR SWITCH
5.1. Throughput and Latency

The choice of switching network to implement the interprocessor Switch for the MDM must be made according to its performance characteristics, in particular its throughput and latency.
FIG. 6. Split function merit versus number of PEs (one curve for each value of R, from R = 1 to R = 64).
The throughput of the Switch imposes an upper limit on the processing rate for a MDM. Let the throughput for a particular environment, θ, be defined as the maximum number of tokens that the Switch can deliver in unit time. If N_t is the number of tokens traversing the Switch during a computation, a lower bound on the total time T required by that computation is T_min = N_t/θ. The latency, δ, is defined as the average interval between a token being input to the Switch and the same token being delivered to its destination PE. If τ is the average pipeline beat of the PEs, the Switch may be viewed as adding δ/τ pipeline stages to each PE. Latency is therefore less critical than throughput in that it does not increase processing time as long as programs are sufficiently parallel to keep the extra pipeline stages busy.
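A small worked example of these two definitions, with purely illustrative values for N_t, θ, δ, and τ, is given below.

```python
n_tokens = 1_000_000      # N_t: tokens traversing the Switch during a run (assumed)
throughput = 10_000_000   # theta: tokens the Switch can deliver per second (assumed)
latency = 2_000e-9        # delta: average Switch transit time in seconds (assumed)
beat = 200e-9             # tau: average PE pipeline beat in seconds (assumed)

t_min = n_tokens / throughput        # lower bound on the total computation time
extra_stages = latency / beat        # pipeline stages the Switch effectively adds per PE
print(f"T_min = {t_min * 1e3:.0f} ms; the Switch adds about {extra_stages:.0f} pipeline stages")
```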
5.2. Topology
Based on the above performance parameters, several topologies for implementing the Switch in the MDM simulator were studied [4]. Common bus and ring topologies were discarded because of their poor throughput. The crossbar was eliminated, in spite of its good performance, because its structural complexity renders it infeasible for even moderately large numbers of PEs. Hierarchical topologies were discarded since no locality is being exploited by the allocation strategy, and also because the traffic of tokens is expected to be random. Finally, multistage interconnection networks, comprising multiple arrays of switching elements, were studied because they represent a sensible compromise between cost and performance. Banyan networks, such as the omega network [23] (shown in Fig. 7), the n-cube [28], and the delta network [27], were chosen, rather than networks with better performance but greater complexity [25, 26].

Due to the asynchronous operation of the PEs, buffered banyan networks, in which tokens are temporarily buffered in the switching elements, were selected for detailed study. Tokens advance through the stages of the Switch by a sequence of "hops," without requiring global synchronization. In general, each switching element has N inputs and N outputs. Tokens arriving at each switching element input are stored temporarily in a buffer queue (of length K) until they can be forwarded to the next stage. The routing across each N × N switching element is set according to the destination PEs required by the N tokens at the heads of the N input buffer queues. For simplicity, attention was concentrated on networks with N = 2 and K = 1 (2 × 2 switching elements with single input buffer registers). As will be seen, their performance is acceptable for the proposed system, and so more complex networks with N > 2 [22] and/or K > 1 [12] were not investigated.
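The decentralized routing through such a network can be sketched as follows, using the usual textbook conventions for an omega network (perfect-shuffle wiring and one destination bit consumed per stage); the line-labelling conventions are assumptions made here and may differ from the MDM's exact wiring.

```python
def route(src, dst, n_stages):
    """Line positions a token occupies from input src to output dst of a
    2**n_stages-port omega network (perfect shuffle before each 2x2 stage)."""
    mask = (1 << n_stages) - 1
    pos, path = src, [src]
    for stage in range(n_stages):
        pos = ((pos << 1) | (pos >> (n_stages - 1))) & mask   # perfect shuffle
        bit = (dst >> (n_stages - 1 - stage)) & 1             # routing bit for this stage
        pos = (pos & ~1) | bit          # the 2x2 element selects its upper/lower output
        path.append(pos)
    return path                          # path[-1] == dst

def clash_stages(path_a, path_b):
    """Stages at which two routes request the same switching-element output."""
    return [s for s in range(1, len(path_a)) if path_a[s] == path_b[s]]

print(clash_stages(route(1, 3, 3), route(7, 2, 3)))   # [2]: clash in the second stage
print(clash_stages(route(4, 6, 3), route(6, 5, 3)))   # []: no clash
```

With these conventions the routes for 1 → 3 and 7 → 2 compete for the same element output in the second stage, while 4 → 6 and 6 → 5 pass through a common element there without competing, which is consistent with the caption of Fig. 7.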
5.3. Effect of Clashes
When two tokens buffered at the input to a switching element are directed to the same output, a clash is said to occur, and one of the tokens must be delayed. A clash within a buffered omega network is illustrated in Fig. 7.
FIG. 7. "Normal" routing in a buffered omega network. 1 → 3 and 7 → 2 clash in the second stage. 4 → 6 and 6 → 5 do not clash.
Clashes degrade the performance of the switch, both by decreasing its throughput and by increasing its latency. The extent of this degradation was studied using a model with the following properties:

-Tokens lying at a stage are evenly distributed over both space and time.
-The probability α that two tokens clash in a switching element is constant over space and time.
-At any instant, the set of buffers in one stage that tokens from the previous stage attempt to occupy is independent of the set of buffers in the same stage that remain occupied because of clashes.

These properties were chosen because they were consistently observed in numerous simulations of different program runs [4]. The results obtained with this model are shown in Fig. 8. The normalized throughput θ_n and latency δ_n are shown as functions of the number of PEs and of the parameter α. Normalized throughput is the ratio between actual throughput and optimal throughput (i.e., that obtained when there are no clashes). Normalized latency is defined similarly.
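To complement the analytic model, a Monte Carlo sketch of a single-buffered multistage network of 2 × 2 elements under saturated, uniformly random traffic is given below. The inter-stage wiring is simplified to the identity permutation (with uniformly random destinations this does not change the statistics), so its output is only indicative of the clash behavior and is not a re-derivation of the curves in Fig. 8.

```python
import random

def normalized_throughput(p=64, cycles=2000, seed=0):
    rng = random.Random(seed)
    n = p.bit_length() - 1                    # log2(p) stages; p must be a power of two
    buf = [[None] * p for _ in range(n)]      # buf[s][i]: single buffer at stage s, port i
    delivered = 0
    for _ in range(cycles):
        # Process the last stage first, so a buffer emptied in this cycle can be
        # refilled from the previous stage within the same cycle.
        for s in range(n - 1, -1, -1):
            for e in range(p // 2):                        # 2x2 element e of stage s
                requests = {}                              # output port -> competing inputs
                for i in (2 * e, 2 * e + 1):
                    tok = buf[s][i]
                    if tok is not None:
                        out = 2 * e + ((tok >> (n - 1 - s)) & 1)
                        requests.setdefault(out, []).append(i)
                for out, inputs in requests.items():
                    winner = rng.choice(inputs)            # a clash delays the loser
                    if s == n - 1:
                        delivered += 1                     # the winner leaves the network
                        buf[s][winner] = None
                    elif buf[s + 1][out] is None:          # forward only if next buffer free
                        buf[s + 1][out] = buf[s][winner]
                        buf[s][winner] = None
        for i in range(p):                                 # keep the inputs saturated
            if buf[0][i] is None:
                buf[0][i] = rng.randrange(p)               # uniformly random destination
    return delivered / (cycles * p)                        # fraction of the clash-free maximum

for p in (8, 64, 256):
    print(f"P = {p:3d}: normalized throughput ~ {normalized_throughput(p):.2f}")
```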
FIG. 8. Throughput and latency of buffered banyan networks.
Note that the performance of this buffered banyan switch degrades asymptotically with increasing numbers of PEs. Under random traffic and with "normal" routing of tokens (see below), α = 0.5, which gives asymptotes for θ_n at approximately 45% and δ_n at about 150%. These match the values obtained by Dias and Jump [12] for large numbers of PEs with t-select = 1.0. To avoid overall performance degradation in the MDM, the implication is that the Switch beat period (the time for a token to be transferred between two stages of the Switch in the absence of clashes) should be less than 45% of the average interval between arrival of tokens at a Switch input. For example, with the normally observed ratio of 1.6 between N_t and S_1, a PE working at
1 MIP will deliver a token to the switch every 625 ns, on average, requiring a switch beat period of 280 ns.

5.4. Routing Strategy

The above limit may be relaxed when "special" routing is used for bypass tokens (i.e., those directed to monadic operators). As long as code is replicated in all PEs (R = P), such tokens may be directed to any PE since they do not need to match with other tokens. Some clashes can therefore be avoided by modifying the routing algorithm so that if one of two tokens arriving at a switching element is a bypass token, it will be routed to the opposite output to that requested by the other token. A clash now occurs only when two non-bypass tokens require the same switching element output. From observation, about 40% of N_t are bypass tokens. Hence, α can be decreased to about 0.5 × (1 - 0.4)^2 = 0.18, in which case the asymptote for normalized throughput increases to about 65%. The corresponding Switch beat period for 1 MIP PEs is increased to 400 ns.

The special routing algorithm was expected to further improve the performance of the MDM by directing bypass tokens to less heavily used PEs. However, simulation shows that this does not happen. Instead it shows significant variation in the numbers of tokens sent to each PE. This occasionally decreases, but mostly increases the execution time. The short time scale and the localized position over which special routing is applied do not take into account the long-term variations in numbers of tokens sent to each PE which largely determine merit.

An alternative approach to that of sending bypass tokens through the Switch is to feed each one directly back into the PE that generated it. This will reduce Switch traffic to about 60% of its previous value. With this scheme, 1 MIP PEs will deliver a token every 1250 ns, on average, thus requiring a Switch beat period of 560 ns. On the other hand, the hardware complexity of a PE will increase, requiring an arbitrator at its input and a distributor at its output to implement the feedback path for bypass tokens.

Whichever routing scheme is used, buffering is required at the interface between each PE and the Switch in order to even out irregularities in the token traffic. The Switch model assumes that PEs are able to accept tokens delivered by the Switch at any time. Simulations for which this was not true (i.e., those for which the rate of writing tokens into the Token Queue was too slow) showed that contention in the Switch causes a significant drop in performance.

5.5. Summary

In summary, within the constraints mentioned, a single-buffered 2 × 2 banyan Switch is adequate for the expected token traffic, and will not significantly degrade overall MDM performance.
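The beat-period arithmetic of Sections 5.3 and 5.4 can be recapped as in the sketch below, taking the 45% and 65% throughput asymptotes, the ratio N_t/S_1 of about 1.6, and the 40% bypass fraction from the text; the variable names are illustrative.

```python
instruction_time = 1e-6              # one instruction per microsecond (1 MIP PE)
tokens_per_instruction = 1.6         # observed N_t / S_1
bypass_fraction = 0.4                # observed fraction of bypass tokens

arrival = instruction_time / tokens_per_instruction            # about 625 ns per token
print(f"token arrival interval: {arrival * 1e9:.0f} ns")
print(f"normal routing  (alpha = 0.50): beat period < {0.45 * arrival * 1e9:.0f} ns")

alpha_special = 0.5 * (1 - bypass_fraction) ** 2               # about 0.18
print(f"special routing (alpha = {alpha_special:.2f}): beat period < {0.65 * arrival * 1e9:.0f} ns")
```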
6. PROGRAM AND MACHINE PARALLELISM
It is interesting to study the relationship between program and machine parallelism. First we need appropriate measures for these quantities. For the purposes of the following, machine parallelism is defined as the total number of buffers in the PE pipelines and Switch stages plus the total number of PE Function Units. It has already been shown (in Section 2) that a program can be approximately characterized by its average parallelism π_av. However, in general, the parallelism of a program varies over its run time, so this concept needs further development.

Consider, for example, a recursive divide-and-conquer algorithm in which parallelism increases exponentially near the beginning of the run and decreases exponentially toward the end [17]. Right at the start and right at the finish, there are too few tokens to keep all the hardware modules busy. Consequently, speedup efficiency is degraded, even though the average program parallelism may be greater than the machine parallelism. Barahona [4] has shown that, where tokens are evenly distributed among the PEs during the whole of the computation, the speedup efficiency of the MDM for this kind of program is bounded by

    ε_P = (1 + a·ln(b)) / (1 + a·P·ln(b·P)),

where a and b are parameters that depend on both program features (e.g., S_1 and S_∞) and machine characteristics (e.g., number of buffers and number of Function Units).

The speedup efficiencies predicted by this model for the MDM running the binary integration program (see Section 4.2) for several data sizes are plotted in Fig. 9. These curves are in accord with those obtained by simulation, as also shown in Fig. 9. This agreement indicates that the split function does indeed spread tokens evenly across the PEs and over time, and that the allocation is near-optimal in the sense described in Section 4.1. The low efficiencies obtained for large P, both from the model and from simulation, are caused by insufficient program parallelism due to the small data sizes used. Similar curves are obtained with the other benchmark programs, as shown in Fig. 10.

Using the above formula, the number of PEs required to run the binary integration program with 90% speedup efficiency, P_90, can be computed for each input data size. The machine parallelism for such a system is P_90 multiplied by K (the number of buffers and Function Units per PE plus the number of stages in the Switch). The ratio of average program parallelism, π_av, to machine parallelism, K·P_90, is plotted against P_90 in Fig. 11. The superimposed dots show the corresponding results obtained when the benchmark programs are run on the simulator.

It is apparent that the speedup efficiency of the MDM is closely related to
FIG. 9. MDM Speedup efficiency (binary integration program).
the parallelism of programs run on it. Roughly speaking, the average program parallelism, π_av, should be three times greater than the machine parallelism, K·P, for 90% speedup efficiency.
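The model quoted above can be explored numerically as in the sketch below, which evaluates ε_P and locates P_90, the largest machine size still achieving 90% efficiency. The values of a and b used here are purely illustrative; in the paper they are derived from program parameters such as S_1 and S_∞ and from the machine's buffer and Function Unit counts.

```python
import math

def efficiency(p, a, b):
    """Speedup efficiency e_P predicted by the model of Section 6."""
    return (1 + a * math.log(b)) / (1 + a * p * math.log(b * p))

def p90(a, b):
    """Largest P still achieving 90% efficiency (assumes efficiency(1) >= 0.9)."""
    p = 1
    while efficiency(p + 1, a, b) >= 0.9:   # efficiency decreases as p grows
        p += 1
    return p

a, b = 0.002, 4.0                           # assumed, purely illustrative parameters
for p in (1, 2, 4, 8, 16, 32, 64):
    print(f"P = {p:2d}: e_P = {efficiency(p, a, b):.2f}")
print("P_90 =", p90(a, b))
```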
FIG. 10. MDM Speedup efficiency (benchmark programs).
FIG. 11. Program versus machine parallelism for 90% speedup efficiency.
7. CONCLUSIONS
7.1. Summary of Results

An assessment of the performance of a Multi-ring Dataflow Machine (MDM) composed of several processing elements (PEs) similar to the one that has been implemented at the University of Manchester has been presented. Attention has focused on two areas, namely the allocation of instructions to PEs and the performance of the interprocessor switching network.

Simulation of the MDM has shown that it is possible to exploit fine-grain dataflow parallelism using a randomizing split function to distribute work among the PEs. The merit of this distribution appears to be independent of the different possible sources of parallelism in a program (i.e., concurrent loop or function activations, or concurrent processing of the elements in large data structures). The crude measure of program parallelism, π_av, has proved useful in establishing a rule of thumb concerning the speedup efficiency of a Multi-ring Dataflow Machine: Where the average program parallelism π_av is three times greater than the machine parallelism K·P, a program will run with at least 90% efficiency on a P processor MDM.

The Switch performance required by a multi-ring dataflow architecture can be provided by a buffered banyan network. Each token is directed through such a network by a series of asynchronous "hops" which require no global synchronization of the PEs. The routing of tokens between the switching elements is completely decentralized, each routing decision being made by a switching element according only to the tokens in its input buffers. The throughput provided by a buffered banyan network for the traffic pattern of a MDM is roughly proportional to the number of PEs used. If an acceptable
beat period is selected, there will be no significant performance degradation (for example, a switch beat period of several hundred nanoseconds is acceptable when 1 MIP PEs are used).
7.2. Future Research
The above results will be modified as improvements are made to the hardware and software of the Manchester Dataflow Machine. This section reviews the influence of some recent and projected changes in the system architecture [5].

Replication of large data structures reduces the performance of a dataflow machine and has so far been avoided in the Manchester system by storing data structures in the Matching Store [7]. A garbage collection mechanism disposes of structures when they are no longer required. Since the Matching Store is the most critical resource in the machine and is becoming overly congested, a separate Structure Store Unit has been implemented, connected to the Switch as a set of specialized hardware modules [21, 20]. This will make future multiprocessor configurations heterogeneous, instead of the homogeneous arrangement described above.

Another change has been prompted by the observation that program parallelism is often much higher than machine parallelism. In the current system, this overloads hardware units with tokens whose processing can only be delayed. Further hardware is therefore required to exercise run-time control over program parallelism.

Compiler-generated code will naturally change, with or without these improvements to the architecture. Preliminary optimization of code generated from SISAL programs has shown that it is possible to reduce certain overheads without significantly reducing the program parallelism [6]. Further developments can be expected in this area.

It is too early to predict the changes that these hardware and software improvements will produce. The research presented above will simply have to be repeated in the new environment. Nevertheless, the random pattern of token traffic assumed here is likely to remain, and this makes it probable that the above conclusions about instruction allocation and required Switch performance will continue to be valid.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the assistance of their colleagues in the Dataflow Research Group at the University of Manchester. Pedro Barahona was supported by the Calouste Gulbenkian Foundation and the University of Lisbon during his master's research. The Manchester Dataflow Project has been supported by the U.K. Science and Engineering Research Council and Digital Equipment Corporation.

REFERENCES

1. Arvind, and Gostelow, K. P. Some relationships between asynchronous interpreters of a dataflow language. In Neuhold, E. J. (Ed.), Formal Description of Programming Concepts. North-Holland, Amsterdam, 1978, pp. 95-119.
2. Arvind, and Iannucci, R. A. A critique of multiprocessing, von Neumann style. Proc. 10th Annual Symposium on Computer Architecture, June 1983, pp. 426-436.
3. Arvind, Culler, D. E., Iannucci, R. A., Kathail, V., Pingali, K., and Thomas, R. E. The tagged token dataflow architecture. Internal Report, Laboratory for Computer Science, MIT, Aug. 1983.
4. Barahona, P. M. C. C. Performance evaluation of a Multi-Ring Dataflow Machine. M.Sc. thesis, Department of Computer Science, University of Manchester, Oct. 1984.
5. Böhm, A. P. W., Gurd, J. R., and Sargeant, J. Hardware and software enhancement of the Manchester Dataflow Machine. Proc. IEEE Spring Computer Conference, Feb. 1985, pp. 420-423.
6. Böhm, A. P. W., and Sargeant, J. Efficient dataflow code generation for SISAL. Proc. International Conference on Parallel Computing, Sept. 1985, pp. 339-344.
7. Bowen, D. L. Implementation of data structures in a dataflow computer. Ph.D. thesis, Department of Computer Science, University of Manchester, May 1981.
8. Caluwaerts, L. J., Debacker, J., and Peperstraete, J. A. A data flow architecture with a paged memory system. Proc. 9th Annual Symposium on Computer Architecture, Apr. 1982, pp. 120-127.
9. Dennis, J. B. First version of a data flow procedure language. In Lecture Notes in Computer Science, Vol. 19, Springer-Verlag, New York/Berlin, 1974, pp. 362-376.
10. Dennis, J. B., and Misunas, D. P. A preliminary architecture for a basic data flow processor. Proc. 2nd Annual Symposium on Computer Architecture, Jan. 1975, pp. 126-132.
11. Dennis, J. B., Boughton, G. A., and Leung, C. K. C. Building blocks for data flow prototypes. Proc. 7th Annual Symposium on Computer Architecture, May 1980, pp. 1-8.
12. Dias, D. M., and Jump, J. R. Analysis and simulation of buffered delta networks. IEEE Trans. Comput. C-30, 4 (Apr. 1981), 273-282.
13. Gaudiot, J. L., Vedder, R. W., Tucker, G. K., Finn, D. J., and Campbell, M. L. A distributed VLSI architecture for efficient signal and data processing. IEEE Trans. Comput. C-34, 12 (Dec. 1985), 1072-1087.
14. Gostelow, K. P., and Thomas, R. E. Performance of a simulated dataflow computer. IEEE Trans. Comput. C-29, 10 (Oct. 1980), 905-919.
15. Gottlieb, A., Grishman, R., Kruskal, C. P., McAuliffe, K. P., Rudolph, L., and Snir, M. The NYU Ultracomputer-Designing an MIMD, shared-memory parallel machine. IEEE Trans. Comput. C-32, 2 (Feb. 1983), 175-189.
16. Gurd, J. R., and Watson, I. Data driven system for high speed parallel computing, Pt. 2. Comput. Design 9, 7 (July 1980), 97-106.
17. Gurd, J. R., and Watson, I. Preliminary evaluation of a prototype dataflow computer. Proc. 9th World Computer Congress, IFIP '83. North-Holland, Amsterdam, 1983, pp. 545-551.
18. Gurd, J. R. Fundamentals of dataflow. In Chambers, F. B., Duce, D. A., and Jones, G. P. (Eds.), Distributed Computing. Academic Press, 1984, pp. 1-19.
19. Gurd, J. R., Kirkham, C. C., and Watson, I. The Manchester Prototype Dataflow Computer. Comm. ACM 28, 1 (Jan. 1985), 34-52.
20. Kawakami, K., and Gurd, J. R. A scalable dataflow structure store. Proc. 13th Annual Symposium on Computer Architecture, June 1986, pp. 243-250.
21. Kirkham, C. C., and Sargeant, J. Stored data structures on the Manchester Dataflow Machine. Proc. 13th Annual Symposium on Computer Architecture, June 1986, pp. 235-242.
22. Kruskal, C. P., and Snir, M. The performance of multistage interconnection networks for multiprocessors. IEEE Trans. Comput. C-32, 12 (Dec. 1983), 1091-1098.
23. Lawrie, D. Access and alignment of data in an array processor. IEEE Trans. Comput. C-24, 12 (Dec. 1975), 1145-1155.
24. McGraw, J. R., Skedzielewski, S. K., Allan, S., Grit, D., Oldehoeft, R., Glauert, J. R. W., Dobes, I., and Hohensee, P. SISAL: Streams and Iteration in a Single Assignment Language. Language Reference Manual, Version 1.0, Lawrence Livermore National Laboratory, July 1983.
25. Nassimi, D., and Sahni, S. A self routing Benes network. Proc. 7th Annual Symposium on Computer Architecture, May 1980, pp. 190-195.
26. Parker, D. S., and Raghavendra, S. C. The gamma network: A multiprocessor interconnection network with redundant paths. Proc. 9th Annual Symposium on Computer Architecture, Apr. 1982, pp. 73-80.
27. Patel, J. H. Processor memory interconnections for multiprocessors. Proc. 6th Annual Symposium on Computer Architecture, Apr. 1979, pp. 168-177.
28. Pease, M. C. The indirect binary n-cube microprocessor array. IEEE Trans. Comput. C-26, 5 (May 1977), 458-473.
29. Plas, A., Comte, D., Gelly, O., and Syre, J. C. LAU system architecture: A parallel data driven processor based on single assignment. Proc. International Conference on Parallel Processing, Aug. 1976, pp. 293-302.
30. Rumbaugh, J. A data flow multiprocessor. IEEE Trans. Comput. C-26, 2 (Feb. 1977), 138-146.
31. Sargeant, J. Is an intelligent token queue the throttle we need? Internal Report, Department of Computer Science, University of Manchester, Oct. 1983.
32. Seitz, C. L. The cosmic cube. Comm. ACM 28, 1 (Jan. 1985), 22-33.
33. da Silva, J. G. D., and Watson, I. A pseudo-associative matching store with hardware hashing. Proc. IEE Pt. E 130, 1 (Jan. 1983), 19-24.
34. Smith, B. J. A pipelined, shared resource MIMD computer. Proc. International Conference on Parallel Processing, Aug. 1978.
35. Yuba, T., Shimada, T., Hiraki, K., and Kashiwagi, H. SIGMA-1: A dataflow computer for scientific computations. Comput. Phys. Commun. 37, 1 (July 1985), 141-148.