Parallel Computing 28 (2002) 3–33 www.elsevier.com/locate/parco
Applications
Applications on a multithreaded architecture: A case study with EARTH–MANNA

Angela C. Sodan 1

GMD FIRST, Berlin, Germany

Received 20 December 1999; received in revised form 21 March 2001; accepted 6 August 2001
Abstract

Multithreading offers benefits with respect to the formulation of irregular dynamic programs and their dynamic scheduling, load balancing and interaction. Furthermore, some systems provide low-cost communication on distributed-memory machines by remote-memory access. EARTH is one of the few systems that combines both, while most other systems either focus on communication or provide multithreading in shared-memory environments. Dynamic irregular applications are often awkward to parallelize on distributed memory when using SPMD-style programming via MPI, and they show different requirements for formulation. In addition, dynamic irregular applications may also show fairly tight data coupling. Systems like EARTH are beneficial then, because they specifically support large numbers of small data exchanges by providing short startup times and tolerance of even small latencies (offering very fine-grain threads). However, static regular applications with tight data coupling are supported too. Using EARTH as an example, this paper investigates the benefits of low-cost communication and multithreading by parallelizing three AI applications with medium to high communication intensity. We present experimental results obtained on the MANNA machine. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Multithreading; Latency reduction; Latency hiding; Programming model; Irregular dynamic applications
1 E-mail addresses: [email protected], [email protected] (A.C. Sodan). Present address: University of Windsor, Windsor, Canada. http://www.cs.uwindsor.ca/users/a/acsodan
0167-8191/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-8191(01)00127-2
1. Introduction

Low-cost communication systems implementing remote-memory access on distributed memory are currently attracting growing interest on different platforms, either as the sole communication alternative to message passing, as offered on the Cray T3E [3], or in combination with fine-grained multithreading. Higher-level message-passing systems like MPI can be efficiently implemented in a specialized way on top of remote-memory access (as done by Cray). We consider a representative of the multithreading class, the Efficient Architecture for Running THreads (EARTH) system [22], and investigate its direct use in compilers. EARTH was developed by Guang R. Gao's group at McGill University in Montreal, Canada and is now being enhanced at the University of Delaware in Newark, USA. In our tests, EARTH is run on the general-purpose distributed-memory parallel machine Massively Parallel Architecture for Numeric and Non-numeric Applications (MANNA) [17] developed at GMD, consisting of 20 dual-CPU machine nodes connected via a crossbar. In this configuration, EARTH operations are implemented completely in software in the runtime system; compared with other existing solutions [25], EARTH–MANNA should thus be regarded as software multithreading, though MANNA offers support for latency hiding by using one processor as a special EARTH runtime-system and communication processor.

In addition to communication by remote-memory access, EARTH provides fine-grained multithreading at a level below the function body, offering very fast thread switches. This makes it possible both to reduce latencies and to hide even small latencies, which is considered the most difficult goal to achieve [14]. In particular, the startup time of communications is very short, making the number of communications less significant. Remote data accesses are basically load/store operations, but EARTH–MANNA additionally provides fast block transfers for medium to large amounts of data. Thus, applications with very large (and then often rarely occurring) data transfers can be run efficiently as well. EARTH's real benefits, however, are demonstrated in applications with a more critical communication/computation ratio.

Systems like EARTH are then less sensitive to imperfect data distributions [30,38], and they provide benefits for applications with complex irregular data structures, relaxing the requirements for mapping or, if structures evolve during runtime, as in adaptive grid computations [19,32], for dynamic remapping. In addition, communication optimizations like aggregation become unnecessary in many cases, thus also reducing the demands on the compiler. In fact, splitting may even be advantageous, to increase the potential of overlapping computation and communication [38]. Similarly, computational granularities may be fairly small because context switches are cheap, and thus less effort is required for optimizations that combine code into cost-efficient granularities. Furthermore, prefetching of data or knowledge about the subtle ordering of incoming communications is less critical when using threads, because the scheduling is dynamic. Thus, programs can still be run efficiently if programmer or compiler optimizations are neglected (e.g. for test runs or for quickly getting a first version of the program to run) or if they are hard or impossible to perform because behavior is dynamic and irregular, i.e. weakly predictable or unpredictable.
Nevertheless, optimizations do not become superfluous, and they can further increase performance: good distributions reduce the number of interactions; block transfers are much cheaper than many individual data accesses; thread granularities should be large enough to hide latencies (for EARTH, thread optimization is performed by the McCAT compiler [20]) while limiting thread overhead; and proper scheduling of communication and synchronization can increase the potential for latency hiding. In addition, load balancing can easily be supported on the basis of threads. Likewise, dynamic asynchronous interaction is more easily expressed.

To summarize, fine-grained communication and multithreading are expected to be of particular benefit for applications with dynamically created computational tasks, asynchronous mutual communication between tasks, irregular and unpredictable data accesses, and frequent small- or medium-sized communications, such as occur, especially in the AI field but in some numeric applications too, in the following forms:

· irregular tree-like dynamic task structures, occurring in many search problems, that often have fairly small parallel computation tasks and require dynamic task creation, asynchronous communication and efficient dynamic load balancing;

· computations with shared data structures, occurring in many reasoning- or transformation-based applications, with
1. data structures potentially linked and created dynamically (with possibly only some of their elements being accessed by each machine node), making distribution difficult;
2. accesses unpredictable, irregular and highly asynchronous;
3. a complex task-internal control structure, with computation time potentially varying significantly for each parallel task, thus requiring dynamic load balancing; and
4. mutual interaction;

· fine-grained data-parallel and tightly coupled computations, as in artificial neural networks, very-low-cost communication/synchronization and/or subtle latency hiding being the only chance for parallelization.

To explore the benefits of low-cost communication, and to some extent of multithreading, applications of the classes mentioned above are investigated (extending an earlier application study [34,36]). More specifically,

· we examine the communication intensity and applicability of multithreading, checking the influence of overhead; and
· we show that some applications become amenable to parallelization only if overhead is kept extremely low.

Two of the applications looked at (Eigenvalue and Gröbner Basis) are taken from Multipol [39], which is a library supporting some main shared-data abstractions related to scheduling structures or partial/final result structures. Multipol is implemented using the TAM multithreaded system and was first run on a CM-5, also a distributed-memory machine. However, providing abstract data structures at library level naturally involves an overhead that may be too high for some critical
types of applications. Here, direct use is made of EARTH's thread and communication operations. The third application is a feed-forward artificial neural network with a very critical computation/communication ratio, being parallelizable only if communication overhead is kept extremely low. We show that these applications run well with low-cost communication, and that the range of applications parallelizable on distributed-memory machines can thus be extended.

In Section 2, the EARTH–MANNA system is described in detail. Section 3 presents its basic performance features. Section 4 presents performance results for the applications Eigenvalue, Gröbner Basis and feed-forward neural networks. Section 5 discusses the relation to other work, and Section 6 gives a summary.

2. EARTH–MANNA and its performance

2.1. The EARTH–MANNA system

EARTH (Efficient Architecture for Running Threads) is a runtime system supporting a global address space, remote-memory accesses, and a fine-grained multithreaded program-execution model. The code within a function body can be subdivided into threads (threads are thus local to the function and the machine node), which are code fragments scheduled using dataflow-like synchronization operations and sharing the processor per machine node. Communications are performed asynchronously, and the order of thread execution is determined dynamically according to the availability of data or the fulfillment of other conditions (like finishing a remote call or remote arrival of data sent). Threads are mostly stateless (sharing, however, the local variables of the function) and non-preemptive, i.e. once started, they run to completion; context switches are thus very fast. Functions can be invoked remotely to initiate parallel activities and establish a kind of higher-level thread (similar to the threads in shared-memory thread packages like Pthreads), exploiting multiple processor resources (still, of course, partially sharing them if there are more invocations than processors). Thus, there is a two-level execution model, but scheduling is flat for threads of all active function calls per machine node. Overhead for thread scheduling and communication startup is in the range of a few tens of instructions; this is discussed below in more detail. Thus, both effective latency reduction and latency hiding are supported. EARTH also provides efficient dynamic load-balancing facilities (using a work-stealing mechanism). For details, see [22].

EARTH programs can be written in either EARTH-C or EARTH Threaded-C, both extensions of C (see Fig. 1(a)). EARTH Threaded-C is the basic language, dealing explicitly with threads and remote-memory accesses (cf. Fig. 1(b)). EARTH-C offers parallelism at a more abstract level and is compiled by automatic generation of remote-memory accesses and thread handling. It is, however, restricted to a tree-like programming model, while Threaded-C offers considerable flexibility with respect to its basic mechanisms, allowing various types of computation–communication patterns to be implemented efficiently.
Fig. 1. The EARTH–MANNA language/machine hierarchy (a) and an example of EARTH Threaded-C code (b).
Fig. 1(b) shows a piece of EARTH Threaded-C code. Functions can be called remotely using either INVOKE (explicit machine-node assignment) or TOKEN (subject to automatic dynamic load balancing), and several of them may be simultaneously active on a machine node. THREADED indicates that the function contains multiple threads. Threads are labeled by THREAD_n; the first thread is started at function entry, and END_THREAD tells the runtime system to schedule the next ready thread. Remote data accesses are split-phase transactions. GET_SYNC_x and DATA_SYNC_x are remote read and write accesses, respectively, and are defined for different types of data. The third argument of these operations specifies a synchronization counter that is initialized by INIT_SYNC and decremented on completion of the operation. The thread associated with that counter is ready to run when the counter is at zero.

The example in Fig. 1(b) demonstrates the syntax of EARTH Threaded-C and shows how split-phase transactions are used. The function Func gets an input vector of doubles, performs some computation on it, and returns a double as its result. The vector input argument is fetched in two portions via block moves, the computation being startable as soon as the first portion of data is available. THREAD_1 is then scheduled. The function synchronizes a second time, on THREAD_2, when it needs the second portion of data to continue the computation. At the end of the function, the result of the computation, assumed to be stored in local_result, is transferred to the address result passed in the call. The remote data-store operation also synchronizes using the remote synchronization counter done, given as an argument, in the standard use case notifying the caller in this way that the result is available. In this example, at least fetching the second part of the data and performing the first part of the computation can be overlapped. Note that there are often threads of other functions available that can be run while this function call is waiting for the first part of the data, i.e. the communication latency of one function can be overlapped with computation of another.
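Since Fig. 1(b) itself is not reproduced here, the following is a hedged reconstruction of the pattern just described, using only the operations named above. The spelling BLKMOV_SYNC for the block move, the INIT_SYNC argument order, and the helper functions compute_first_half/compute_second_half are our assumptions, not verified EARTH Threaded-C syntax.

    /* Sketch of a split-phase function in the style of Fig. 1(b).
     * Assumed signatures: INIT_SYNC(counter, count, reset, thread_no),
     * BLKMOV_SYNC(src, dst, bytes, counter),
     * DATA_SYNC_D(value, addr, counter). */
    #define N 1024

    THREADED Func(double *input, double *result, int done)
    {
        double local[N];
        double local_result;

        INIT_SYNC(1, 1, 1, 1);              /* counter 1 releases THREAD_1 */
        INIT_SYNC(2, 1, 1, 2);              /* counter 2 releases THREAD_2 */
        BLKMOV_SYNC(input, local, (N / 2) * sizeof(double), 1);
        BLKMOV_SYNC(input + N / 2, local + N / 2,
                    (N / 2) * sizeof(double), 2);
        END_THREAD();                       /* yield; other ready threads run */

    THREAD_1:                               /* first half has arrived */
        compute_first_half(local, &local_result);  /* overlaps second fetch */
        END_THREAD();

    THREAD_2:                               /* second half has arrived */
        compute_second_half(local, &local_result);
        DATA_SYNC_D(local_result, result, done);   /* store result; 'done'
                                                      notifies the caller  */
        END_THREAD();
    }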
EARTH can be implemented on different platforms. It is shown here running on MANNA, but it has been ported to other machines like the IBM SP-2 and a cluster of Sun workstations connected via a Myrinet switch. Furthermore, it was recently redesigned to exploit SMP machine nodes with local parallelism and communication via TCP/IP.

MANNA is a distributed-memory machine. Each node contains 32 MB of local memory and two 50-MHz Intel i860 XP RISC CPUs [9]. The nodes are linked by a multilevel crossbar interconnect with 100 MB/s bandwidth. Sustained performance equals or surpasses that of comparable commercial machines like the SP-2 (the node-processing power is approximately half that of the SP-2). For details, see [17]. A new version, upgraded to the PowerPC and with about 10 times the node performance, has become available in the meantime [5].

MANNA was especially suitable for the implementation of EARTH because it features two CPUs per node and provides low-level network access. The dual CPUs support EARTH's design approach of dividing application execution and runtime system. Thus, the first EARTH implementation on MANNA was a two-processor configuration per node, with one processor, the Synchronization Unit (SU), performing the specific EARTH operations, and the other, the Execution Unit (EU), executing the basic application code (dual mode). The SU and EU communicate via an event queue and a ready queue in shared memory. The EU places messages into the event queue. The SU processes these plus incoming messages from the network. Thus, the SU initiates communications and work requests and responds to incoming remote synchronization commands and requests for data or work. Furthermore, it determines which threads are ready to run and puts them into the ready queue. When the next thread is to be scheduled for execution, the EU takes it from the ready queue [21]. Thus, whereas one processor performs application processing, all specific EARTH functionality and all communications are handled by the second processor. The idea was to use this configuration, supported by the MANNA dual nodes (rather uncommon at the time of its development), as a testbed for later, potentially more specific hardware support of the SU functionality. Later, another implementation was developed as a single-processor version, using only one of the processors and executing the code of both the EU and the SU on it (spn mode), though keeping most of the internal structure. Both versions were then provided as two different modes. EARTH has been shown to provide much the same efficiency with the existing single-processor implementation [29]. However, the basic overheads are lower for the single-processor version (because it is not necessary to set up the second processor), and the two-processor version hides latencies better in terms of the execution of the EARTH operations themselves (see Section 3). Latencies resulting from waiting times for another node's activity can be hidden equally well by both.

For all the applications shown here, results are given for the single-processor EARTH version, because the dual-processor version did not run for some of the applications. Synthetic patterns (taken from [35]), however, show dual-processor performance too, and the additional gains obtainable from it. Note, however, that in a symmetric node, the second processor may, in principle, also be used for performing useful computation.
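The dual-mode division of labor just described can be pictured as two loops over the two shared queues. The following is purely a conceptual sketch with invented names; it is not EARTH source code.

    #include <stddef.h>

    /* Conceptual sketch of EARTH-MANNA dual mode: EU and SU share an
     * event queue and a ready queue in node-local memory. All
     * identifiers here are illustrative inventions. */
    typedef struct { int op; void *args; } event_t;   /* EARTH op request */
    typedef void (*thread_fn)(void);                  /* a ready thread   */

    #define QCAP 1024
    typedef struct { void *slot[QCAP]; unsigned head, tail; } queue_t;

    static queue_t event_q, ready_q;

    static int   q_empty(queue_t *q)        { return q->head == q->tail; }
    static void  q_put(queue_t *q, void *x) { q->slot[q->tail++ % QCAP] = x; }
    static void *q_get(queue_t *q)          { return q->slot[q->head++ % QCAP]; }

    static void handle_event(event_t *e) { (void)e; /* start comm, counters */ }
    static void poll_network(void) { /* may make threads ready via q_put() */ }

    /* Execution Unit: runs application threads to completion, one at a
     * time; EARTH operations issued by a thread end up in event_q. */
    static void eu_loop(void) {
        for (;;)
            if (!q_empty(&ready_q))
                ((thread_fn)q_get(&ready_q))();       /* non-preemptive */
    }

    /* Synchronization Unit: initiates communications and work requests,
     * reacts to network traffic, and moves threads whose synchronization
     * counters reached zero into ready_q. */
    static void su_loop(void) {
        for (;;) {
            if (!q_empty(&event_q))
                handle_event((event_t *)q_get(&event_q));
            poll_network();
        }
    }

In spn mode, by contrast, a single processor interleaves both loops, which is why its operation overheads interfere with computation (see Section 3).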
3. Performance issues

Table 1 lists basic performance parameters of the EARTH–MANNA system. To put these values in perspective: as already mentioned above, MANNA's node-processing power is approximately half that of the SP-2, while EARTH communication on MANNA is approximately twice as fast. This indicates good performance of the communication hardware, but is also due to better possibilities for tuning the EARTH system to the specific MANNA machine (obtaining a well-integrated system is especially important for a low communication layer). Full latency assumes that communications are performed between two nodes only and that the remote node is otherwise idle. In dual mode, non-overlappable overhead means the basic operation-startup or -execution time taken by the EU.

Table 1
Basic performance features of EARTH on MANNA

Operation (mode)                                        Full latency (μs)   Non-overlappable overhead (μs)
Remote load, long int (spn)                             5.44                2.45
Remote load, long int (dual)                            9.64                2.56
blkmov (get), long int (spn)                            7.21                7.79, remote 6.9
blkmov (get), long int (dual)                           11.24               2.88
blkmov (get), 100 long ints = 400 bytes (spn)           23.8                21, remote 14.2
blkmov (get), 100 long ints = 400 bytes (dual)          25.3                3.5
blkmov (get), 1000 long ints = 4000 bytes (spn)         158                 140, remote 146
blkmov (get), 1000 long ints = 4000 bytes (dual)        136                 3
Invoke int, return int (spn)                            6.84                7.4
Invoke int, return int (dual)                           12.48               2.72
2 tokens, 1 local, 1 remote (spn)                       7.5                 14.16
2 tokens, 1 local, 1 remote (dual)                      12.8                7.4
Broadcast loop, call, 2 long ints, to 20 nodes (spn)    30                  –
Broadcast loop, call, 2 long ints, to 20 nodes (dual)   39                  –
Same plus blkmov 100 long ints = 400 bytes (spn)        270.3               –
Same plus blkmov 100 long ints = 400 bytes (dual)       267                 –
Broadcast tree, call, 2 long ints, to 20 nodes (spn)    22.4                –
Broadcast tree, call, 2 long ints, to 20 nodes (dual)   36                  –
Same plus blkmov 100 long ints = 400 bytes (spn)        110.8               –
Same plus blkmov 100 long ints = 400 bytes (dual)       116.7               –

In dual mode, a non-overlappable thread overhead of 1.03 μs (0.48 μs for synchronization initialization and 0.55 μs for thread switch) and a transfer time of 0.124 μs per long were measured. Stores take approximately the same time.
All communication times include thread and synchronization overhead for a single communication (for synchronization, a new thread always has to be started) and thus all the costs involved in a typical instance of communication (initiating or synchronizing several communications in a single thread would further reduce the per-communication cost). This is essential to give a realistic idea of the cost at the level we are dealing with. Similar arguments apply to calls with static and dynamic assignment of functions. Detailed performance information on individual operations is given in [21].

Note that in spn mode the processor must be shared between communication and computation at both the communication-initiating and the communication-receiving processor. The resultant potential interference between computation and communication is the reason for the phenomenon that, in the two cases of a small block move and a simple invoke or token, times are higher if communication is overlapped with computation than if it is not.

As can be seen, the difference between an iterative and an optimal tree-organized broadcast (for the algorithm, see [13]) is significant (the larger the data sizes, the greater the significance). However, compared with message-passing systems, EARTH–MANNA can be considered quite tolerant of the absence of such optimization, i.e. the simple iterative solution still performs sufficiently well in many applications (those that are not very communication-critical), this again being an example of low-cost communication relaxing the demands on the compiler and higher-level runtime-system support. Nevertheless, the optimization improves communication cost significantly in some cases, and the very communication-critical neural-network example presented in Section 4.4 largely depends on it. Note that the latencies shown are minimal and can increase further in the execution context if, for example, there are multiple incoming or outgoing communications.

One of the motivations for developing fine-grained multithreading was to support irregularly structured problems with individual accesses to remote data. As Table 1 shows, EARTH supports many small communications by keeping communication cost low, in the range of a few microseconds. On the other hand, block-transfer cost is lower in EARTH as soon as more than a single long int is to be transferred. If dual mode is used, thus enabling transfer cost to be overlapped by computation, arbitrarily long block transfers impose only the initial cost (startup cost of the operations involved) of 2.88 μs. If spn mode is used, and thus transfer cost cannot be completely overlapped, aggregating still pays off: for 10 long ints, aggregation delivers 8.32 μs vs. 54.5 μs (plus the corresponding remote times), both values being, of course, quite low in absolute terms. This again demonstrates the lower sensitivity to communication optimization. Nevertheless, frequently executed loop code or very communication-critical applications may make optimizations worthwhile.

In practical applications, we are often faced with the converse problem: there is a large data structure used by a subcomputation, but only part of it is actually accessed. Then, we have to decide whether to transfer the data structure as a whole or to access individual data items on demand.
If no coding/decoding overhead is involved, transferring the structure as a whole usually pays off: transferring 100 long ints (spn mode, non-overlapped at the initiating node) costs 17.85 μs, which is already less costly if more than two items are accessed; 1000 long ints cost 129.45 μs and pay off if more than 17 items are accessed. If transfer cost can be overlapped, as when using dual mode, the choice is simple. If the data structures involved are linked, the situation might change, because the data structures then need to be copied and coded/decoded in order to be transferred as a single block of consecutive data, which creates potentially significant additional overhead.
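The break-even points quoted above can be reproduced with simple arithmetic. The sketch below assumes a per-item cost of 7.21 μs (the single-long blkmov latency from Table 1); the figure to use in practice depends on mode and overlap, so treat the constant as an assumption.

    #include <stdio.h>

    /* Block transfer vs. on-demand item access: smallest number of
     * accessed items for which fetching the whole block is cheaper.
     * The 7.21 us per-item cost is an assumption (single-long blkmov,
     * spn mode, Table 1). */
    static const double PER_ITEM_US = 7.21;

    static int break_even_items(double block_us)
    {
        /* smallest k with k * PER_ITEM_US > block_us */
        return (int)(block_us / PER_ITEM_US) + 1;
    }

    int main(void)
    {
        printf("100 longs (17.85 us): block wins from %d items\n",
               break_even_items(17.85));   /* 3, i.e. 'more than two' */
        printf("1000 longs (129.45 us): block wins from %d items\n",
               break_even_items(129.45));  /* 18, i.e. 'more than 17' */
        return 0;
    }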
4. Experimental results

4.1. Communication characteristics and intensity

The communication characteristics of an application are considered to be any qualitative or quantitative features that affect final runtime, like:

· the number and size of communications (the former being supported by short startups, the latter by multiple threads if the application offers hideability) in relation to problem size;
· the specific task topology (such as tree, SPMD, central server) and communication topology (such as point-to-point or multicast), because they impose different costs (e.g. multicasts allow optimized implementations) and may create contention effects;
· the algorithmic hideability of delays, given by data- and control-dependency constraints (multiple threads can help only if the application offers hideability); or
· irregularities to be dealt with dynamically by the load balancer (which requires further system-internal communication and can be supported by low-cost load balancing and, to some degree, by multiple threads).

The investigations in the rest of Section 4 list general qualitative characteristics such as task and communication topologies and present quantitative characteristics on the basis of specific inputs. However, the inputs are chosen to be representative, thus being highly expressive and indicative of general behavior.

Communication intensity can, in the simplest case, be defined as the ratio between the communication cost (Oc) and the computation cost (T), i.e. Oc/T. We may then, e.g., consider communication intensity to be high or medium if Oc/T > Rthreshold, a reasonable value for Rthreshold being, e.g., 1/7 (corresponding to a speedup of 17.5 on the 20 nodes that we have available on the MANNA machine; see the calculation below). This metric does not, however, allow an absolute rating of an application independent of the environment in which it is run (or would require a specific reference system for both computation and communication). We therefore list the numbers and sizes of communications for the applications and discuss the latency hideability. Nc,n then denotes the mean number of invokes, data fetches/stores and synchronizations in EARTH per node when run on n nodes, and Sc,n the mean sum of data sizes per node transferred by them. Ncblk,n specifies the mean number of block transfers used (for larger amounts of data), and Scblk,n the mean sum of sizes in block transfers per node. Sc,n,1 and Scblk,n,1 are the corresponding mean sizes per transfer. Furthermore, we calculate the Oc that can be afforded while still achieving close-to-optimal speedup, i.e. Oc/T ≤ Rthreshold (in our case, Oc/T ≤ 1/7). More specifically, we give the overhead per communication Oc,1 = Oc/Nc (Oc,10,1 measured for 10 and Oc,20,1 for 20 machine nodes). Similarly, T is the runtime per node and Tc,1 the runtime between communications. We then roughly classify the communication intensity on the basis of the obtained values.
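The 17.5 figure quoted for Rthreshold = 1/7 follows from the simple model, assumed here, that communication overhead merely stretches otherwise perfectly parallelized computation, so that the per-node time is (T/n)(1 + Oc/T):

\[
S \;=\; \frac{T}{(T/n)\,(1 + O_c/T)} \;=\; \frac{n}{1 + R_{\mathrm{threshold}}}
  \;=\; \frac{20}{1 + 1/7} \;=\; 17.5 .
\]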
4.2. Eigenvalue

We present here the bisection-based eigenvalue-calculation algorithm originally taken from ScaLAPACK and modified by the Multipol group. ScaLAPACK [7] is a linear-algebra library containing routines for solving linear equations, least-squares and eigenvalue problems. The routines are designed to run in parallel on distributed-memory message-passing MIMD computers and networks of workstations that support PVM or MPI. Almost all code in ScaLAPACK is written in Fortran 77.

ScaLAPACK offers several different algorithms for eigenvalue calculation. The bisection-based algorithm is one of the two eigenvalue algorithms offered by ScaLAPACK for symmetric input matrices (elements being either reals or doubles), which allow a simpler solution. Matrices are first transformed to tridiagonal form. Eigenvalues are defined as follows: let A be an N × N matrix; all z ∈ V^N, z ≠ 0, such that Az = λz for some λ are the eigenvectors, and the corresponding λ's are the eigenvalues. A symmetric (A^T = A) tridiagonal matrix is known to have N eigenvalues, which are real. The bisection algorithm can compute an initial range on the real line containing all eigenvalues. Furthermore, for a given real number, it is possible to determine how many eigenvalues are lower than it (a sketch of this counting kernel is given below). The eigenvalues are then found by approximation, the real line being successively subdivided (in a binary way, hence bisection) until the intervals containing eigenvalues, which ultimately constitute the solutions, are determined to the desired degree of accuracy. This application is thus a search for solutions.

All parallel codes in ScaLAPACK are formulated in SPMD style, the bisection routine being based on a static partitioning of the data space. However, eigenvalues are not equally spread but clustered, which means that the search tree is irregular. The authors of ScaLAPACK specifically point out that the static partitioning can impair the performance of their algorithm if the eigenvalues are clustered, and that for matrices with N > 1000 there tends to be at least one very large cluster. The Multipol version is a rewriting (using C) of (the double-precision version of) the ScaLAPACK routine to create a dynamic work distribution. This offers the prospect of greater stability of performance for all kinds of inputs and the potential for greater efficiency in general (provided that the additional overhead is kept low).

The eigenvalue bisection implementation of Multipol is thus an example of the important class of search (or discrete-optimization) applications. Other examples of such applications (Protein Folding, finding all possible polymers of a specific cube; Paraffins, enumerating all distinct isomers of paraffins up to a certain size; or TSP, computing the optimal route for a traveling salesman through a certain number of cities) have already been shown to parallelize very well on EARTH–MANNA [22].
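The following is our illustration of the standard counting kernel (cf. LAPACK's dstebz), not the ScaLAPACK/Multipol source: the number of negative pivots in the LDLᵀ factorization of A − xI equals the number of eigenvalues below x.

    /* Sturm-sequence count for a symmetric tridiagonal matrix: number
     * of eigenvalues strictly below x.  d[0..n-1] is the diagonal,
     * e[0..n-2] the off-diagonal.  Illustrative code only. */
    static int count_below(const double *d, const double *e, int n, double x)
    {
        int count = 0;
        double q = 1.0;                      /* previous pivot */
        for (int i = 0; i < n; i++) {
            double off2 = (i == 0) ? 0.0 : e[i - 1] * e[i - 1];
            q = d[i] - x - off2 / q;         /* pivot of (A - xI)  */
            if (q == 0.0) q = -1e-300;       /* perturb a zero pivot */
            if (q < 0.0) count++;            /* one negative pivot per
                                                eigenvalue below x  */
        }
        return count;
    }

    /* Bisection keeps subdividing an interval while it contains
     * eigenvalues and is wider than the accuracy threshold; each half
     * is an independent subtask, which is what the dynamic work
     * distribution parallelizes. */
    static void bisect(const double *d, const double *e, int n,
                       double lo, double hi, double eps)
    {
        int nlo = count_below(d, e, n, lo), nhi = count_below(d, e, n, hi);
        if (nhi == nlo) return;              /* no eigenvalue inside */
        if (hi - lo < eps) return;           /* converged: nhi - nlo
                                                eigenvalues near here */
        double mid = 0.5 * (lo + hi);
        bisect(d, e, n, lo, mid, eps);
        bisect(d, e, n, mid, hi, eps);
    }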
Search applications are commercially relevant especially for VLSI design, in the still emerging area of flow-shop scheduling in industrial environments, or for planning problems like flight-crew assignment. While for pure trees parallelization is algorithmically easy, the overhead involved is often critical, because in the general case the individual search nodes represent only a small amount of work, and the dynamic unfolding of the usually irregular tree as well as potentially varying computation times in search nodes make dynamic load balancing desirable.

In the implementation of the bisection algorithm considered here, the matrix is represented as two vectors, one containing the diagonal and one the off-diagonal elements. The required space is thus relatively small, and the tridiagonal matrix can be statically replicated on each node, only interval boundaries being communicated to start a subtask. Search nodes in this example consume sufficient time that threads can be created for all search nodes in the tree, i.e. no aggregation of search nodes need be applied.

EARTH-C was used for parallelization, with explicit specification of the parallel tasks (i.e. we did not attempt automatic detection of parallelism). While Multipol used random task distribution, we applied EARTH's automatic dynamic load balancing. Function tokens subject to load balancing suit this application because creating a task/thread per search node makes the parallelization model and the scheduling easy (see the sketch at the end of this section). Multithreading at the intra-function level can be exploited for latency hiding by setting up the next function while the previous one is waiting for data arrival. Because of data dependencies (the input arguments must be available before the computation can start), there is, however, no further potential for latency hiding.

The characteristics of the application are shown in Tables 2 and 3.

Table 2
Qualitative characteristics of the ScaLAPACK Eigenvalue algorithm

Task topology | Binary tree, irregular shape, dynamically evolving
Communication topology | Corresponding to tree communications on entry and exit of a task, thus locality parent ↔ children
Predictability | Upper bound on parallel tasks known (2N − 1)

Table 3
Specific quantitative characteristics of the ScaLAPACK Eigenvalue algorithm

Matrix size | 1000 × 1000 | 3000 × 3000 | 5000 × 5000
Problem size (Tseq) | 20 958 ms | 86 100 ms | 127 600 ms
Neigenv found | 528 (< N because of clustering and threshold on accuracy) | 1091 | 1103
Ntasks (= 2 Neigenv − 1) | 1055 | 2181 | 2205
Task input/output size (invariant) | 28/12 bytes | Same | Same
Mean Ttask | 19.9 ms | 39.5 ms | 57.9 ms
Depth of leaves | Min 3, max 36, mean 26.7, variation coeff. 43.1 | Min 3, max 55, mean 34.7, var. coeff. 44.4 | Min 3, max 55, mean 34.8, var. coeff. 44.1
T10/Tc,10,1 | 2090/7.9 ms | 8609/15.8 ms | 13 720/24.9 ms
Nc,10 | 264 | 545.5 | 551.5
Sc,10 | 4853 | 10 032.6 | 10 143
Sc,10/20,1 | 18.4 | Same | Same
T20/Tc,20,1 | 1156/8.7 ms | 4404/16.7 ms | 7331/26.6 ms
Nc,20 | 132 | 272.3 | 275.8
Sc,20 | 2426.5 | 5015.3 | 5071.5

The matrices are synthetically created so that they have geometrically distributed eigenvalues. Nc is computed from invoke/token, block-move input, and 0.5 block-move output as the mean value per task; thus, Nc = 2.5 · Ntasks/Nnodes. Sc includes arguments used in invoke/token; Sc = 46 · Ntasks/Nnodes. Tc,1 = Ttask/2.5.

As the performance curves show (see Fig. 2), speedups are close to the optimum, i.e. communication, thread-creation and load-balancing overhead do not significantly affect execution time. Parallel speedups are 18.13, 19.5, and 17.4 on 20 nodes for the three input matrices. The maximum depth of the tree is quite large, and task runtimes for, e.g., the latter inputs are between about 20 and 58 ms; thus, the critical path is significant here (unfolding sufficient parallelism already takes 5 · Ttask on 20 nodes) and is the main reason for the slight decrease in speedup relative to the optimum (emphasized by the better speedup for the 3000 × 3000 matrix with the relatively smallest task runtime). The speedup is significantly better than that of the Multipol version on a CM-5, where it is only about 8 on 20 nodes (for a 1000 × 1000 matrix). Since communication cost is not very critical here, the reason for Multipol's low speedup apparently is that the random work distribution does not achieve a sufficiently good load balance. On the other hand, more advanced load-based distribution adds extra work and extra communications, and the results show that EARTH–MANNA keeps this overhead low.

For the communication intensity, we use Otask and assume that it captures all load-balancing overheads (system-internal communications and calculations) in addition to the 2.5 application-level communications. Then, for the 3000 × 3000 matrix, Otask/Ttask ≤ 1/7 holds for Otask up to 5.64 ms. For the other two inputs, Otask may be 2.8 ms (1000 × 1000) or 8.3 ms (5000 × 5000). Thus, depending strongly on whether dynamic load balancing is supported (which it is not in MPI, for example) and on how high the additional overhead involved is, the application may still perform well on more costly communication systems, and the communication intensity may be rated as low or medium.

There are, however, search problems with much smaller basic steps, potentially in the range of microseconds, as for the N-queens benchmark. To demonstrate the influence of the overhead and the benefits of EARTH, we again use the simple cost ratio, assuming that perfect balancing is achieved, that half of the tasks are executed remotely, and that argument sizes are set to 28/12 bytes as above.
Fig. 2. Speedups for eigenvalue calculation by bisection for a 1000 × 1000, 3000 × 3000, and 5000 × 5000 matrix (from left to right). All speedups are relative to the original sequential version. The grain size being sufficiently large, passing the small input data structure (see Table 3) via individual remote accesses to its elements (by pointer dereferencing) or via a block move (fetching the whole structure at once) makes hardly any difference as regards runtime.
Then, from Table 1, we obtain for EARTH in spn mode Otask = 19.7 μs.² To achieve a ratio Otask/Ttask ≤ 1/7, Ttask may thus be as low as 138 μs. We verified this using a synthetic benchmark, letting every second node create tasks and using the standard ring-access strategy of the EARTH load balancer.³ Here, the real Otask was 17 μs, and the ratio threshold of 1/7 was achieved on 20 nodes for Ttask = 119 μs. The low communication cost and the low-overhead task management and distribution thus offer the chance of avoiding, or at least reducing, the impact of task aggregation. Aggregation is necessary in systems with higher overhead but may be hard or even impossible to perform, because some prediction of the evolving tree is required, which may not always be possible (cf. [37]), and variances may be shifted from structure to computation time (increasing them there). Thus, fine-grain multithreading not only simplifies programming and compiling search programs but also promises higher performance quality through lower variances in runtime.
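To make the task-per-search-node scheme concrete, here is a hedged Threaded-C-style sketch; TOKEN is the load-balanced remote call from Section 2, but its argument convention here, as well as the helpers count_below and record_eigenvalues, are our assumptions.

    /* Token-per-search-node spawning for bisection (illustrative only).
     * The tridiagonal matrix is replicated, so counting is node-local;
     * only interval boundaries and counts travel with the token. */
    THREADED search_interval(double lo, double hi, double eps,
                             int nlo, int nhi)
    {
        if (nhi > nlo && hi - lo >= eps) {
            double mid = 0.5 * (lo + hi);
            int nmid = count_below(mid);     /* local Sturm count */
            /* each half becomes a task; EARTH's work stealer
               decides where it runs */
            TOKEN(search_interval, lo, mid, eps, nlo, nmid);
            TOKEN(search_interval, mid, hi, eps, nmid, nhi);
        } else if (nhi > nlo) {
            record_eigenvalues(0.5 * (lo + hi), nhi - nlo); /* converged */
        }
        /* completion is reported to the parent via a synchronization
           counter (not shown) */
        END_THREAD();
    }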
² Assuming that the remote node performing the block moves is otherwise idle, that the communication cost for a locally executed token is that of a load operation, and that the node creating the tokens is busy.
³ Performance strongly depends on the task distribution, the load-balancing strategy, and their proper matching; this is one of the more positive constellations.
4.3. Gröbner Basis

Gröbner Basis is another application taken from the Multipol library and thus already parallelized for distributed memory. Gröbner Basis (described in more detail by Davenport et al. [15] and Chakrabarti and Yelick [10]) is a computer-algebra problem, and its computation has applications in solving systems of nonlinear equations such as those used in mechanical multicomponent-system modeling or constraint-solving systems (if the constraints are of corresponding complexity). Such multivariable systems are described by a basis of polynomials (generators), which, or combinations of which, require consideration in all calculations relating to the system. A Gröbner (or Standard) Basis is a special kind of basis giving more information about the solution set. A system of polynomial equations is a Gröbner Basis if, and only if, every linear combination of its polynomials reduces to zero with respect to the basis and a specific variable order < in the representation of the polynomials (influencing the operations, and thus to be chosen consistently).⁴ A reduced Gröbner Basis, whose polynomials are completely reduced with respect to each other, is unique to a specific system. For example, the input basis of polynomials I has the standard basis Is with respect to the lexicographic order z < y < x:

I = {x² − 1, (x − 1)y, (x + 1)z},    Is = {x² − 1, (x − 1)y, (x + 1)z, yz}.

The application in hand implements the most common algorithm for determining a Gröbner Basis, due to Buchberger. Roughly speaking, Buchberger's algorithm is a symbolic form of Gaussian elimination, allowing a Gröbner Basis to be constructed from any basis of polynomials by a completion procedure. The algorithm creates so-called S-polynomials⁵ from the existing polynomials and checks for all S-polynomials whether they reduce to zero with respect to the existing polynomials. More specifically, the polynomial pairs are created first, only the critical ones being considered (some are known to lead to S-polynomials that reduce to zero and need not be checked). Then, the S-polynomials are calculated from them and simplified (reduced) by subtracting multiples of other polynomials. Polynomials that do not reduce to zero are added to the set. Thus, the original set is extended by further polynomials until no more irreducible ones can appear. The procedure always comes to an end, and the result constitutes a Gröbner Basis. During this procedure, interreduction may be performed, i.e. the polynomials in the basis may be reduced when a new polynomial is entered, which may simplify or eliminate polynomials in the basis. The implementation in hand does not, however, apply interreduction (this is formally described in [11]).

⁴ In other words, a division of the principal monomials (leading terms according to the given order) of it and of at least one of the polynomials in the system delivers a residual of zero. Then, the leading term is canceled.
⁵ S-polynomials S(f, g) are defined by S(f, g) = (h/fp)·f − (h/gp)·g, f and g being two non-zero polynomials and fp and gp their principal monomials; h is the l.c.m. of fp and gp.
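A minimal sequential sketch of this completion loop, with the polynomial type and the helpers (s_poly, reduce, pair handling) as opaque placeholders of ours rather than Multipol's API:

    /* Buchberger completion, sequential sketch. Types and helpers are
     * illustrative placeholders, not the Multipol implementation. */
    typedef struct poly poly_t;
    typedef struct { const poly_t *f, *g; } pair_t;

    extern int     pairs_empty(void);
    extern pair_t  pairs_pick_best(void);      /* selection heuristic    */
    extern void    pairs_add_critical(const poly_t *newest);
                                               /* only critical pairs
                                                  are queued             */
    extern poly_t *s_poly(const poly_t *f, const poly_t *g);
                                               /* see footnote 5         */
    extern poly_t *reduce(const poly_t *s);    /* subtract multiples of
                                                  basis polynomials      */
    extern int     is_zero(const poly_t *p);
    extern void    basis_insert(poly_t *p);

    void buchberger(void)
    {
        while (!pairs_empty()) {
            pair_t  pr = pairs_pick_best();
            poly_t *r  = reduce(s_poly(pr.f, pr.g));
            if (!is_zero(r)) {            /* irreducible: extend basis  */
                basis_insert(r);
                pairs_add_critical(r);    /* new pairs vs. all others   */
            }
        }
        /* on exit, the (non-interreduced) basis is a Groebner Basis */
    }

In the parallel version described next, each node runs this loop concurrently; an insertion additionally re-checks irreducibility under a central lock, because the basis may have changed in the meantime.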
Thus, the computed basis is not minimal and unique; we found slightly increasing numbers of polynomials with an increasing degree of parallelism, and longer runtimes correlated with larger numbers of polynomials.

During calculation of the Gröbner Basis, the order of pair creation and processing has a significant impact on the overall amount of work to be done. The result (if a reduced Gröbner Basis is calculated) is the same, but performance differs. Selection heuristics are therefore used, good ones being essential.

Performing the reductions is the algorithm's main job, and it can be done in parallel if additional measures are taken to ensure global irreducibility. If any of the computations yields irreducible polynomials that are candidates for addition to the solution set, exclusive access to the basis is requested. If it is obtained, irreducibility is checked again, because in the meantime other computations may have changed the status of the basis. As most S-polynomials reduce to zero and only a few candidates for insertion emerge, the repeated reduction does not constitute a major drawback for parallelism. Nodes failing to gain access to the solution set perform further reductions and try again after completing the next reduction. Thus, a certain overlapping of access latency with computation occurs at the algorithmic level.

The solution set is a shared data structure with only a few entries that potentially need to be accessed by all nodes (without interreduction, these are read accesses only). The set is implemented by some maintenance information about its state and by full replication (read caching) of the polynomials. Additionally, a central lock data structure is maintained (see Fig. 3). The pairs, which are maintained as a queue, also form an explicitly or implicitly (by the system) shared data structure. The pairs queue is the basis for priority ordering and dynamic load distribution.
Fig. 3. Main data structures and operations of the Gröbner Basis algorithm. The dotted block may be run on a dedicated node, or its components may be spread across nodes and run together with reduction processing. The pair queue is either central or distributed, depending on the load-balancing strategy applied.
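Written out as hypothetical C declarations (all type and field names are ours, invented for illustration), the shared structures of Fig. 3 have roughly these shapes:

    /* Illustrative declarations mirroring Fig. 3; names are invented. */
    typedef struct {
        double *terms;            /* polynomial represented as a vector */
        int     nterms;
    } poly_vec_t;

    typedef struct {
        int         npolies;      /* maintenance info about the state   */
        int         version;      /* bumped on every insertion          */
        poly_vec_t *replica;      /* full replication (read cache) of   */
                                  /* the polynomials on every node      */
    } solution_basis_t;

    typedef struct {
        int held;                 /* central lock guarding insertions   */
        int owner_node;
    } basis_lock_t;

    typedef struct {              /* one critical pair, queued with a   */
        int    f_idx, g_idx;      /* priority from the selection        */
        double goodness;          /* heuristic                          */
    } pair_entry_t;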
The basic implementation model is SPMD-like, though combined with dynamic task distribution. At each node, one main thread is running, asynchronously interacting with the main threads on the other nodes. The main thread picks up its work from the pairs queue, work being distributed on a task basis. Pairs are constructed at the beginning of the computation from the input polynomials (by the initiating node), and later on dynamically during runtime, if a new polynomial is added to the set, from the input polynomials and the new polynomial (by the node adding the polynomial). New polynomials are thus created asynchronously and in varying, unpredictable numbers per node, and their runtimes also vary significantly. Hence, they are subject to dynamic load balancing. Computations on different nodes operate and communicate asynchronously. Communication is mutual between nodes, mostly serving to access the shared data structures.

A summary of the application characteristics is given in Tables 4 and 5. The algorithm as described here was largely taken from Multipol and ported from TAM on the CM-5 to EARTH–MANNA. In view of the mutual communication, we used EARTH Threaded-C for the implementation.

Table 4
General qualitative characteristics of the Gröbner Basis algorithm used

Programming scheme | SPMD-like (main thread per node), but with asynchronous operation and asynchronous mutual information exchange at arbitrary points in time; work dynamically distributed (potentially via threads) and picked up from a work queue
Communication topology | Several different communication patterns: n:1 communications for access to central data structures and transfer of polynomials; potentially node-to-node for a distributed work queue; 1:n broadcasts for status changes in the basis; no return arguments of tasks, the potential result being stored in the basis
Task topology | Creation at arbitrary points in time, k-ary splitting of subtasks (logical tree), but global ordering and execution partly dependent on global status
Data structures | Shared distributed or central work queue, shared central solution basis; replicated entries of the solution basis, polynomials represented as vectors
Predictability | Varying computation times per task; task sizes likely to increase toward the end of the computation; sizes of polynomials varying, but potentially large; numbers of polynomials added and of pairs to be reduced unpredictable; priorities assigned to tasks, their consideration being beneficial for limiting work; indeterminacy in application runtime, but most results clustered around the mean
Latency hideability | At the algorithmic level, in addition to low hideability at the communication level
Table 5
Specific quantitative characteristics of the Gröbner Basis application (sequential run), the internal representation being total degree, then lexicographic order

Measure | Lazard | Katsura-4 | Katsura-5
Problem size | 3761 ms | 6373 ms | 362 750 ms
Number of tasks created (= number of critical pairs) | 141 | 75 | 168
Number of polynomials in solution set | 3 as input, 26 added | 5 as input, 15 added | 6 as input, 26 added
Mean time per full reduction (variation significant) | 26.7 ms | 85 ms | 111.86 ms
Mean size of polynomial | 454 bytes | 439 bytes | 3243 bytes
T10, Tc,10,1 | 340 ms, 1.9 ms | 438.7 ms, 3.5 ms | 26 458.3 ms, 135.7 ms
Tthread,10 | 90.4 μs | 150.5 μs | 791.1 μs
Nc,10 (Ncblk,10) | 177 (27) | 124 (18) | 195 (30)
Sc,10 (Scblk,10), Sc,10,1 (Scblk,10,1) | 13 934 (11 210), 78.7 (415.2) | 8458 (6602), 68.2 (366.8) | 76 768 (73 616), 393.7 (2453.9)
T20, Tc,20,1 | 249.7 ms, 1.3 ms | 289 ms, 3.4 ms | 11 533.7 ms, 87.4 ms
Tthread,20 | 90.1 μs | 162.1 μs | 675.9 μs
Nc,20 (Ncblk,20) | 197 (28) | 86 (22) | 132 (33)
Sc,20 (Scblk,20), Sc,20,1 (Scblk,20,1) | 13 666 (11 010), 69.4 (393.2) | 9168 (7776), 106.6 (353.5) | 58 190 (56 038), 440.8 (1698.1)

Nc and Sc values are computed on the basis of typical runs in the mean range of execution times. Nc is calculated for central balancing and here includes all communications related to the dynamic assignment of the tasks: Nc = 1 + 2·Ntrygetlock + Npolies/Nnodes + 3·Npolies + 4·Ncritpairs/Nnodes. Tthread is the mean computation time of an intra-function thread.
However, some basic changes were made to the original code, such as: employing (function-level) threads for task distribution; using (intra-function-level) threads to enable mutual interaction without the requirement of explicit polling or preemptive scheduling; maintaining the solution set centrally instead of in distributed form (no bottleneck occurs because of the very fast EARTH–MANNA communication); exploiting EARTH's synchronization signals to indicate that processing of a pair is finished (exploiting the logical tree-like pair structure) instead of using Multipol's termination-checking protocol; using other distribution strategies; and avoiding a dedicated node for termination checking. We then either keep the original implementation, with tasks being passive data entities and a load-distribution strategy implemented at the Multipol level, or make tasks active entities (using TOKENs) and exploit the load-balancing strategy provided by EARTH. Note that if all the relevant distribution strategies were, as is desirable, provided by the multithreaded system on a thread basis, tasks would always be made threads; passive task distribution originated as an auxiliary solution in SPMD environments, which
do not support automatic dynamic load balancing. Intra-function-level threads are used to partition the most time-consuming reduction computation. We thus ensure reasonable reaction times to external requests by keeping the non-preemptive computation entities small (e.g. 90 μs in Lazard, see Table 5). Externally created request threads can automatically be scheduled between them. Thus, no explicit polling is necessary, and code within function-level threads remains atomic, which avoids having to synchronize data accesses. We send termination signals on the basis that the processing of a pair can be considered finished if either the related S-polynomial reduces to zero or, in the case of irreducibility, all pairs created from the polynomial are finished. Using termination signals offers the basis to experiment with (1) dynamic load-balancing strategies other than the simple round-robin distribution provided by Multipol, and with (2) either dedicating a specific node to maintenance of the central data structures or performing maintenance together with normal processing (whereas Multipol always dedicates an extra node to lock maintenance and termination checking). Neither would work with Multipol's termination-checking protocol.

In Multipol, simple load distribution is performed on the basis of distributed work queues (each node maintaining its own partial queue, disjoint from the others). Each node can add entries to other nodes' queues by round-robin distribution with sender-initiated assignment (not, however, considering the load or performing reassignments, and thus not guaranteeing a real load balance, especially if the number of tasks is quite low and statistical equalization over time does not apply). The decentralized maintenance prevents the queue from becoming a bottleneck, thus offering scalability. However, priorities (based on ``goodness'' as rated by the heuristic) are only maintained locally per queue, which means that nodes do not necessarily work on the globally best pairs. We optionally maintain the pairs queue centrally and perform central dynamic load balancing, or we use, based on TOKENs, EARTH's built-in distributed dynamic load balancing and keep the queue distributed and implicit. Both allow distributions based on the current system load. The central version allows global priority handling and load-based work distribution but may create a bottleneck. Using EARTH's distributed strategy again observes local priorities (at the creator side) only, but offers better scalability.

Communication in this application is performed for: distributing pairs; caching polynomials, i.e. broadcasting them to all nodes; centrally making known changes to the set of polynomials; acquiring the lock; and signaling to the creator of a pair that its processing is finished. The specific EARTH operations used for these communication purposes are: block moves for transferring (fetching) the polynomials; individual, synchronizing, data loads and stores for centrally making known any changes to the solution set, for obtaining status information about the solution set, and for releasing the lock; synchronization signals to indicate that the processing of a pair is finished; and remote function calls (invokes plus, potentially, synchronizations or data moves for the response) for broadcasting information about changes to the solution set, for distributing pairs, and for acquiring the lock.
Invokes are used not only for request handling, but in some cases also to enable several arguments to be passed together and atomic computations to be performed on them.
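The lock acquisition with algorithmic latency hiding described above might look as follows in Threaded-C style; the operation spellings beyond INVOKE/INIT_SYNC/END_THREAD and all helper functions are assumptions of ours, not the actual implementation.

    /* Candidate insertion with overlapped lock acquisition (sketch).
     * The node requests the central lock via a remote INVOKE and,
     * instead of blocking, keeps reducing pairs; THREAD_1 runs when
     * the lock-grant reply arrives. */
    THREADED try_insert(poly_t *cand)
    {
        INIT_SYNC(1, 1, 1, 1);             /* reply releases THREAD_1 */
        INVOKE(LOCK_NODE, request_lock, MY_NODE, 1);
        reduce_next_pair();                /* useful work while the
                                              request is in flight   */
        END_THREAD();

    THREAD_1:                              /* lock granted */
        if (still_irreducible(cand)) {     /* basis may have changed
                                              in the meantime        */
            insert_polynomial(cand);       /* update central status  */
            broadcast_polynomial(cand);    /* block moves to all nodes */
        }
        release_lock();
        END_THREAD();
    }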
The main benefits of EARTH for this application are the low-cost communication, even supporting central data-structure access; the possibility of mutual asynchronous interaction via threads without preemption or explicit polling; and the support for synchronization on finished pair processing. Furthermore, there is some potential for latency hiding. Thus, the broadcasting of polynomials is performed asynchronously, i.e. the broadcaster does not need to wait until the data are transferred (but the callee does). Furthermore, if distributed load balancing is applied, distribution of new pairs is asynchronous. However, central pair distribution, central lock access and central status access have to wait for a response. The main potential for overlapping communication and computation here is exploited at the higher algorithmic level (doing other work if the lock is held by another node), where the necessary semantic knowledge is available.

Figs. 4 and 5 show performance results, running the program with different inputs and different implementation versions. Fig. 4 gives mean speedups for all the different implementation versions we experimented with. The round-robin distribution delivers good performance for small numbers of nodes, but becomes increasingly sublinear for larger numbers of nodes on the inputs Katsura-4 and Katsura-5. Using all nodes for computation performs better for small numbers of nodes and gives similar results for larger numbers. This means that the reaction to incoming requests is fast, despite continuous polynomial processing on the node. Applying EARTH's automatic distributed work-stealing balancer provides results that are
Fig. 4. Mean speedups (calculated on the basis of 20 test runs) for Gröbner Basis with input basis Lazard, Katsura-4 and Katsura-5. For each, the following are shown: use of round-robin distribution and use of central dynamic load balancing (both with and without a designated node for maintaining the central data structures), and use of distributed dynamic load balancing via EARTH's TOKEN mechanism.
Fig. 5. Mean, minimum, and maximum speedups of Gröbner Basis with input basis Lazard (top row), Katsura-4 (middle row) and Katsura-5 (bottom row) for round-robin distribution using all nodes for computation (left column), central balancing using all nodes for computation (middle column) and central balancing with a dedicated maintenance node and increased (300 μs) communication overhead (right column).
reasonable but worse than those of the simple round-robin distribution. Closer investigation of the number of pairs processed showed large differences, indicating that the currently used strategies for initiating load distribution and selecting tasks for distribution are not optimal for this application, where communication often causes small temporary idle times. Furthermore, (local) priority handling on the creator side may be less effective. By far the best performance is achieved by central load distribution, with the variant that uses all nodes for computation again performing slightly better. Further tests (using a simple FIFO queue) showed that both global priority handling and true work balancing contribute to the good performance.

The performance results show superlinearity in many cases. Parallelization changes the order in which the pairs are processed and thus, potentially in a significant way, the amount of work performed, which depends on it. Here, reductions apparently benefit from several polynomials being reduced in parallel and the first irreducible polynomial obtained being inserted first. However, subtle time differences in the execution may influence the order in which the global solution set is accessed. Although most results are close to the mean, our experiments yielded significant differences in processing time, as can be seen from Fig. 5, which also shows maximum and minimum values for two of the implementation variants. Chakrabarti and Yelick [10] and Amrhein et al. [2] discuss the superlinearity and runtime-variation
effects for the Gröbner Basis computation, too. The runtime dependence on the order of pair processing makes the application intrinsically runtime-indeterministic.

Speedup effects are partially hidden by the shortcuts in the computation; thus, communication intensity is hard to rate. Using the parallel EARTH runtimes as the basis (including idle times due to imbalances and all calculations for the dynamic assignment of the tasks), a ratio Otask/Ttask ≤ 1/7 could still be obtained when adding per communication an Occ of 274 μs (10 nodes) or 196 μs (20 nodes) for Lazard, 505 μs (10 nodes) or 480 μs (20 nodes) for Katsura-4, and 19.4 ms (10 nodes) or 12.5 ms (20 nodes) for Katsura-5. Considering the sizes of communications (see the values in Table 5), this suggests that communication intensity is between medium and low for the different inputs. The granularities with respect to the central accesses and mutual communications are nevertheless still very coarse in this application: in Katsura-4, intervals between broadcasts are 145 ms, between basis-status accesses 17 ms, and between lock accesses 22 ms (on average).

To check how fine granularities can be, considering the EARTH communication and synchronization overhead, we have run synthetic central-invocation and mutual-invocation patterns. The results are shown in Fig. 6. As can be seen, EARTH supports really fine granularities, e.g. for central communication in the range of 300 μs for 500-byte data size on 20 machine nodes. This also shows that central data structures or central dynamic load balancing can work very well on such a number of machine nodes if overhead and service times are low. However, this calculation does not reflect contention and subtle effects on the execution order. We therefore simulated higher communication cost by artificially increasing communication times, which, of course, provides only a very rough approximation. Communication times are set to 300 μs on both the sender and receiver sides for blocking communication, and to only 150 μs on the sender side
Fig. 6. Blocking central invocation (top row) and mutual invocation (bottom row), all run on 20 nodes. Time is in µs, data size in bytes. Top row, left to right: dedicated node for requests/spn mode; all nodes used for processing/spn mode/multiple threads; all nodes used for processing/dual mode/multiple threads. Bottom row, left to right: spn mode/single thread; spn mode/multiple threads; dual mode/multiple threads. Multiple threads means that subthreads are introduced mainly to provide short reaction times to external requests.
This is in the lower range of the overheads mentioned above. We use the central-balancer version with a dedicated maintenance node because this appears more realistic with higher communication and reaction times (which cannot be completely reflected in the simulation). Results are shown in Fig. 5. As can be seen, performance is much worse for the two smaller inputs, and moderately to much worse for Katsura-5 (depending on the number of nodes). Thus, the results differ more significantly than the above speedup calculation suggests. We obtained similar results for the round-robin version. On the basis of these results, communication intensity may be rated high for Lazard and Katsura-4 and medium for Katsura-5. In addition to the tests shown, we tested further inputs for Gröbner Basis. Larger examples currently do not run on our system, and further medium-sized examples provided less impressive results due to limited parallelism.

To summarize: although the parallelizability of Gröbner Basis as such needs further exploration, and completely different but very promising alternatives to Buchberger's algorithm have recently evolved, the tests show that the use of distributed memory for irregularly interacting symbolic computation with central data structures is feasible when using low-cost communication. Completion is typical of many other AI applications (like the Knuth-Bendix algorithm [39] applied in theorem provers). Low-cost communication and fine-grained multithreading can be expected to bring even greater benefits for instances of this application class that have a higher communication intensity (as shown by the Lazard and Katsura-4 test cases). Furthermore, this example demonstrates that threads may in some cases be specifiable only at very high algorithmic levels, and that not all the potential for multithreading can be obtained from the low-level context.

4.4. Neural networks

Artificial neural networks are an application area of increasing practical use. They come in two major forms: feed-forward and recurrent networks. The former are more widespread, and they are the ones considered here. The training phase in particular is very time-consuming, and thus interesting as regards speedup by parallelization. There are two main levels for parallelization [33]: sample parallelism and unit parallelism. In the training phase, the net is fed thousands of test cases (samples). Sample parallelism involves exploiting the inherent data parallelism, running several neural networks in parallel: each processes a different subset of the samples in batch mode (without any communication), and only at the end of the training phase is information exchanged. This kind of parallelism can be implemented efficiently and scales well. Unit parallelism parallelizes the network itself. As networks are not very large and their computations are very closely coupled (see below), the degree of parallelism obtainable is limited.
But updates of the net can be performed after each sample (or after a few samples), and there are investigations in the literature showing that such an approach usually converges faster than batch processing (i.e., the overall amount of work to be performed is smaller). Thus, unit parallelism is interesting, too, and could, for example, be exploited in a hybrid approach mixing unit and sample parallelism. This involves dividing the overall batch and repeatedly presenting small batches, performing an update after each batch – an approach already used frequently in the sequential case.

Artificial neural networks (see Fig. 7(a)) consist of a set of units (representing neurons) organized in several layers (typically three: an input layer, an output layer, and one hidden layer); in the basic case of full linkage, all units of one layer are linked to all units of the next one. Each unit calculates a scalar vector product of the vector of input or previous-layer data and a vector of local weights (with a data flow from the input to the output layer). Thus, although neural networks can be applied to the solution of non-numeric problems (like classification or pattern matching), internally they are a numeric problem (performing floating-point operations). Afterward, the error made by the network is calculated by a data flow in the opposite direction (backpropagation). Weights are updated correspondingly, realizing the network's learning. In the simplest case, this is performed per sample in combination with backpropagation. 7

Parallelization at the unit level means performing the unit calculations per layer in parallel. We used typical network sizes of currently practical systems [28], namely 80 and 200 units per layer. Larger networks are not currently used because they involve high computational cost in the sequential case and are of questionable benefit. We did, however, add a test with 720 units. The network is mapped to the nodes by aggregating several units per node (``slicing'' the layer); this not only increases the amount of work per node but also reduces the overall number of communications (see Fig. 7(b)). However, for the current sizes of networks, the aggregation potential is low if relevant parallelism is to be retained. Unit parallelism is at the very end of the spectrum of parallelizable programs, with a very critical ratio of computation to communication. Parallelizability is critical even for shared-memory environments, which still involve process/thread management and synchronization overhead. We used three layers (resulting in two computation phases), the same number of units per layer, and a full linkage. Tables 6 and 7 show the characteristics of the application. Communication and invocation are centralized and performed in a synchronous phase: all nodes receive the computation invocation and the input data for the next layer from one central node and send the result back to it. This reduces the overall number of communications on the network and at the same time synchronizes the layer computations. To further improve performance, however, a tree organization is used (as described by Cordsen et al. [13] and by Besch and Pohl [6] – though there the costs with message passing were too high to make unit parallelism work).
7 However, the accumulation of error values over several samples before the next update may help avoid overreaction to individual bad outputs. The work for update and accumulation is approximately the same.
Fig. 7. Basic principle of feed-forward neural-network computation with potential grouping of units (a), and the structure resulting from partitioning the units according to the indicated grouping (b). (c) shows the forward-pass and backward/update-pass calculation per unit. Note that the local error values o_err are vectors and that o_err_{k,i} is to be communicated to unit_i (this being more than a simple broadcast, as used for the y values).
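Fig. 7(c)'s backward/update pass can be sketched in the same style. For a hidden unit i, the error contributions o_err_{k,i} arriving from the next layer's units k are summed, scaled by the derivative of the activation (y·(1−y) for the sigmoid assumed above), and used for the purely local weight update; the learning-rate parameter eta and all names are again ours.

#include <stdio.h>

#define NUNITS 80

/* Backward/update pass for one hidden unit i of a slice, given the
 * error contributions o_err_{k,i} already communicated from the next
 * layer's units k (gathered here in o_err_in). */
static void unit_backward(int i, const float o_err_in[NUNITS],
                          const float y_in[NUNITS], float y_i,
                          float w[NUNITS][NUNITS], float eta)
{
    float sum = 0.0f;                        /* sum of incoming o_err_{k,i} */
    for (int k = 0; k < NUNITS; k++)
        sum += o_err_in[k];
    float delta = y_i * (1.0f - y_i) * sum;  /* H'(s_i) times gathered error */
    for (int j = 0; j < NUNITS; j++)         /* purely local weight update   */
        w[i][j] += eta * delta * y_in[j];
}

int main(void)
{
    static float o_err[NUNITS], y_in[NUNITS], w[NUNITS][NUNITS];
    for (int k = 0; k < NUNITS; k++) { o_err[k] = 0.001f; y_in[k] = 0.5f; }
    unit_backward(0, o_err, y_in, 0.6f, w, 0.1f);
    printf("w[0][0] after update = %f\n", w[0][0]);
    return 0;
}

Because each unit's weights stay local, only the o_err values cross node boundaries – which is why, as discussed below, backpropagation needs the more expensive per-unit communication rather than a simple broadcast.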
Each unit processes its own weights. 8 Thus, the weights are local and are assigned to the nodes together with the corresponding units. The basic sample set (consisting of a relatively small number of input/goal sample pairs) is replicated on each node. In the learning phase, samples from this set are taken in random order, with a potentially very large number of repetitions per sample, 9 and presented to the network. To process the next sample, only its index needs to be communicated. In the forward pass (for the algorithm, see Fig. 7(c)), communication takes place for: a call of the computation phase per layer (1:n), a data transfer from the hidden to the output layer (n:n, transformed into an n:1 and a 1:n communication), and a data transfer from the output layer to the central node (n:1). Then, a global error value for the current network status is calculated. The forward and backward computation phases at the output units can be combined, so the backward pass additionally requires only one more initiation of a computation phase (1:n) and one additional exchange of error values from the output to the hidden layer (n:n, again modified to 1:n and n:1). Backpropagation communication is more costly because, instead of simply broadcasting the same value, different values have to be communicated to different units (complicating the communication tree).

8 And potentially accumulated errors.
9 The order may, however, be partly controlled, e.g. by giving higher probability to samples that failed to be properly recognized.
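To make the tree transformation of the 1:n and n:1 phases concrete, the following sketch computes a balanced distribution schedule over node ranks. The fan-out constant TRSPLIT mirrors the trsplit parameter appearing in the footnote of Table 7, but the value chosen and all names are purely illustrative.

#include <stdio.h>

#define TRSPLIT 3   /* children per inner tree node; illustrative value */

/* Print the 1:n distribution schedule for node ranks [lo, hi): the
 * first rank acts as the subtree root and forwards the data to up to
 * TRSPLIT children, each heading a balanced subtree of its own. */
static void tree_bcast(int lo, int hi)
{
    int rest = hi - lo - 1;                  /* ranks below the root */
    int child = lo + 1;
    for (int c = 0; c < TRSPLIT && rest > 0; c++) {
        int take = (rest + TRSPLIT - 1 - c) / (TRSPLIT - c);  /* even split */
        printf("node %d -> node %d (heads %d node(s))\n", lo, child, take);
        tree_bcast(child, child + take);
        child += take;
        rest -= take;
    }
}

int main(void)
{
    tree_bcast(0, 20);   /* schedule for the 20 MANNA nodes */
    return 0;
}

Every node except the root receives the data exactly once, in a number of steps logarithmic in the node count; the n:1 gather phases use the same tree in reverse, which is where the contention counted in the footnote of Table 7 arises.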
Table 6
Qualitative application characteristics of neural-network computation

Programming scheme        Lock-step data parallel; systolic (data-flow-like organization of steps)
Task topology             Graph with regular structure; parallelism per level and tasks uniform
Communication topology    n:m for full linkage (with n or m units per layer, resp.); closely related to task topology; transformation to an optimized tree advisable; regular and static
Data structures           Replicated samples, local weights
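The lock-step, centrally invoked phase structure summarized in Table 6 reduces to a short control loop on the central node. In the following sketch (plain C), the two helper functions merely print what would, on EARTH, be remote invocations and block transfers; the node and phase counts are taken from our test setup, everything else is illustrative.

#include <stdio.h>

#define NNODES  20
#define NPHASES 2   /* three layers give two computation phases */

static void invoke_slice(int node, int phase)   /* 1:n part, one target */
{
    printf("phase %d: invoke slice on node %d\n", phase, node);
}

static void collect_slice(int node, int phase)  /* n:1 part, one source */
{
    printf("phase %d: collect results from node %d\n", phase, node);
}

int main(void)
{
    for (int phase = 0; phase < NPHASES; phase++) {
        for (int n = 0; n < NNODES; n++) invoke_slice(n, phase);
        for (int n = 0; n < NNODES; n++) collect_slice(n, phase);
        /* the gathered outputs become the next phase's input vector */
    }
    return 0;
}

In the optimized version, both inner loops run over the communication tree sketched above instead of iterating over all nodes directly.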
Table 7
Specific quantitative application characteristics of neural-network computation, forward pass (all computations using floats for the operands)

Units                                       80                           200                           720
Sequential runtime, feed-forward            5.047 ms                     26.96 ms                      319.1 ms
Runtime per unit and layer, feed-forward    32 µs                        67 µs                         222 µs
Data size in communication,                 320 bytes in,                800 bytes in,                 2880 bytes in,
  feed-forward (tree)                       4·80/Nnodes bytes out (a)    4·200/Nnodes bytes out (a)    4·720/Nnodes bytes out (a)
Sequential runtime, backpropagation         4.6 ms                       31.5 ms                       433.6 ms
Data size in communication,                 320 bytes out,               800 bytes out,                2880 bytes out,
  backpropagation (tree)                    320 bytes in (a)             800 bytes in (a)              2880 bytes in (a)

Tree implementation; each cell gives all passes/feed-forward only:
T10                  1.21 ms/0.81 ms             6.64 ms/2.99 ms              76.8 ms/32.56 ms
Tc,10,1              26.9 µs/32.4 µs             147.6 µs/110.6 µs            1706.7 µs/1302.4 µs
Nc10 (Ncblk,10)      45 (25)/25 (15)             same                         same
Sc10 (Scblk,10)      4389 (4064)/2416 (2176)     10485 (10160)/6580 (5440)    36901 (36576)/19824 (19584)
Sc10,1               97.5/96.6                   233/263.2                    820/793
T20                  0.91 ms/0.417 ms            4.03 ms/1.59 ms              41.8 ms/17.73 ms
Tc,20,1              14.4 µs/11.9 µs             63.97 µs/44 µs               663.5 µs/510.8 µs
Nc20 (Ncblk,20)      63 (35)/35 (21)             same                         same
Sc20 (Scblk,20)      5847 (5392)/3184 (2848)     13935 (13480)/7456 (7120)    48983 (48528)/25968 (25632)
Sc20,1               92.8/91                     221.2/213                    777.5/741.9

Data sizes are 4 bytes when performing individual communications per unit. Nc and Sc are computed assuming a tree for the communication (the number of children being trsplit) and counting (except for the call) incoming n:1 data transfers, to take significant contention effects into account: Nc = trsplit·9 (all) or Nc = trsplit·5 (feed-forward), and Sc = trsplit·65 + 8·Nunits + 12·Nunits + 12·Nunits/Nnodes (all) or Sc = trsplit·48 + 4·Nunits + 8·Nunits + 8·Nunits/Nnodes (feed-forward).
(a) Sizes are larger at higher levels in the tree, where data is aggregated.
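As a plausibility check of our reading of the table, the per-communication averages in the last row of each block are the size totals divided by the communication counts; e.g., for 80 units on 20 nodes:

\[
S_{c20,1} = \frac{S_{c20}}{N_{c20}} = \frac{5847}{63} \approx 92.8 \ \text{bytes (all)}, \qquad \frac{3184}{35} \approx 91 \ \text{bytes (feed-forward)} .
\]

The same division reproduces the 213 bytes quoted below for 200 units (7456/35 ≈ 213).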
The values of the numbers and sizes of the communications (see Table 7) illustrate the extremely high data coupling and communication intensity in this application, especially for 80 and 200 units. For example, with 200 units on 20 nodes, 25 communications are performed per node, transferring 7456 bytes of data altogether (56% of the communications transferring 97% of the data, the rest being short communications). EARTH Threaded-C was used for the implementation because we wanted to perform subtle tuning and experiments. We first tested the forward pass separately, and then both forward and backward passes together, including updating the weights. Performance results are shown in Fig. 8. EARTH performs surprisingly well for this communication-critical application. For networks of the size currently in practical use (i.e., 80 and 200 units), speedups of 10 on 16 nodes for 80 units and of about 14.5 on 20 nodes for 200 units were achieved. 10 When considering the forward pass only, speedups of 11 on 16 nodes (runtime 458 µs) are obtainable for 80 units, and of 17 on 20 nodes (runtime 1.59 ms) for 200 units per layer. Note that the latter runtimes include four communication phases (two 1:n and two n:1 communications), with communication consuming only 143 and 242 µs, respectively. With 720 units, close-to-optimum speedup is obtained (in the feed-forward pass, 18 on 20 nodes with a runtime of 17.73 ms, communication accounting for 1.77 ms). Considering the communication intensity (for the feed-forward pass and 20 nodes), to achieve O_c/T ≤ 1/7, O_cc must be less than 5.5 µs in the case of 200 units and less than 65 µs in the case of 720 units, with S_cc = 213 bytes for 200 units and S_cc = 741.9 bytes for 720 units (a rough check of these thresholds is given below). EARTH comes very close to achieving the former and meets the latter easily. Achieving such low communication overheads demonstrates EARTH's efficiency and shows that only systems with this low a communication overhead can parallelize the small-sized neural networks often in practical use. Scalability of unit-parallel networks is, of course, limited because of the limited problem sizes (and if the problem sizes were larger, the high interconnectivity would become the limiting factor). Multithreading to overlap communication and computation failed to yield any further benefits. Fig. 9 shows the results of separate tests with a microbenchmark comparing multithreaded loop distribution with single-threaded tree distribution. In dual mode, both are almost the same for non-extreme communication/computation ratios. In spn mode, however, global communication optimization via a tree is more beneficial than the exploitation of latency hiding via individual threads. Note that the network as described here is merely one basic instance of feed-forward networks: the number of units per layer may differ, the linkage may be only partial, the H function (see Fig. 7(c)) may be more complex, or the error calculation may be performed differently. Compilers for neural-network description languages [26] could generate the appropriate EARTH code and then exploit EARTH's potential for parallelizing even small networks. The handwritten version, however, demonstrates that EARTH–MANNA makes unit parallelism on distributed memory a practical solution.
10 This corresponds to 1228 CUPS (connection updates per second – a commonly used measure of neural-network speed) and 19850 CUPS, respectively.
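The two O_cc thresholds quoted above are consistent with the Table 7 values if the 1/7 criterion is applied to the computation share of the runtime and to the N_c = 35 feed-forward communications listed there – a back-of-the-envelope check under that assumption:

\[
O_{cc} \le \frac{T - T_{comm}}{7 \, N_c} = \frac{1.59\ \text{ms} - 0.242\ \text{ms}}{7 \cdot 35} \approx 5.5\ \mu\text{s} \quad (200\ \text{units}),
\qquad
\frac{17.73\ \text{ms} - 1.77\ \text{ms}}{7 \cdot 35} \approx 65\ \mu\text{s} \quad (720\ \text{units}).
\]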
Fig. 8. Speedup for (a) the forward pass only in the neural-network computation and (b) the combination of forward and backward passes (using different network sizes and different numbers of machine nodes).
Fig. 9. Differences in speedup between single-thread tree and multiple-thread loop distribution in spn mode (left) and dual mode (right). Granularity is given in ms and data size in bytes.
5. Related work

As far as applications are concerned, instances of search applications – e.g., chess [24] – have already been successfully parallelized on other systems, and there are several publications on problems of this sort reporting close-to-optimal performance (e.g. [27]). However, the lower the overhead in communication, dynamic load balancing and thread management, the more problems of this sort we can expect to parallelize effectively. The aggregation of search steps to increase granularities is discussed in several of the papers on parallel search (e.g. [27,37]). Similar forms of aggregation are addressed in the implementation of functional languages, with the remaining fine-grained tree-like parallelism shown to be supportable by multithreading [18]. Object-oriented languages are also candidates for support by low-overhead systems because object boundaries may impose limits on optimizability [12]. For the Gröbner Basis computation, there are other approaches to parallelizing Buchberger's algorithm (such as [2]) besides the one used by Yelick et al., and recently algorithms other than Buchberger's have evolved that are more promising for parallelization in the general case [16]. It is not, however, our intention here to compare different parallelizations with respect to their efficiency, but rather to show the benefits of multithreaded formulation and the influence of communication cost. For neural networks, most approaches choose sample parallelism, but some have experimented with unit parallelism, as reported in the summary given by Serbedzija [33].
As already mentioned, direct remote-memory access is not necessarily combined with multithreading but may exist on its own, as on the T3E [3]. There, a remote-memory read blocks until the data are available, and a store blocks until the data are sent – which is reasonable because the T3E does not provide a separate processor for performing the communication and hiding latency at this communication level (corresponding mostly to the EARTH spn version as regards loads and stores). Multithreading offers the chance of not only reducing but also hiding the remaining latencies – which, of course, can be fully exploited only if special hardware for handling the communication is provided. Compared to prefetching, it does this flexibly, supporting dynamic scheduling in situations with an unpredictable execution order. Furthermore, multithreading supports load distribution and asynchronous interaction, which are awkward to formulate in SPMD-style programming. Examples of multithreading systems at the same fine-grained level as EARTH are TAM [14] and Cilk [8], which have slightly different features from EARTH. They run on different machines, so it was impossible to make a quantitative comparison of their performance with that of EARTH for the examples shown (as regards context switches, EARTH is approximately 10 times faster); but our concern here was to compare with coarser-level systems anyway. Hardware multithreading (i.e., hardware support via special CPU and register designs) is offered, for example, by *T [31] and Tera [1], and is envisioned in the overall EARTH design, too, through a specialized SU [22]. Here, we have shown only the pure software variant of this architecture. Hardware multithreading is helpful for even higher communication intensity and for hiding even lower latencies (e.g., memory-access latencies). Cf. also the multithreading overviews in [23,25].

6. Summary, discussion, and future work

The multithreaded architecture EARTH, offering low-cost communication and fine-grained multithreading, has been shown to run efficiently on the MANNA machine. EARTH effectively supports applications that have a fairly critical communication intensity and that may additionally require asynchronous interaction, dynamic scheduling, dynamic thread creation, and dynamic load balancing. Thus, systems like EARTH allow us to parallelize applications on distributed memory that are otherwise not parallelizable, or parallelizable only with much less speedup. Experimental results were presented for an eigenvalue massive-search problem, for the computer-algebra problem of the Gröbner Basis computation, and for unit parallelism in feed-forward artificial neural networks. Gröbner Basis is an application belonging to the category of complex reasoning or transformations, involving shared data structures and strong internal control dependencies. We obtained superlinear speedups for the inputs shown, but other inputs showed lower speedup. However, even speedups in the range of 10 would be of great significance because of the potentially large problem sizes. Furthermore, there is a trend toward building more complex hybrid symbolic/numeric applications, and such computations may appear as components of large programs.
While isolated problems – because of their limited degree of parallelism – can also be run on shared-memory machines, for hybrid applications it is advantageous not to depend on static organizations with, perhaps, shared-memory subclusters, but to have the option of flexible partitioning and assignment of processing power to match the concrete application structure (although cluster organizations of machine nodes [4] impose differences in transfer time between inter- and intra-cluster communication, especially when the amount of data is large). Unit parallelism in neural networks involves very intensive data linking, thus coming close to the limits of parallelizability even on shared-memory architectures. Still, significant speedup – making parallelization worth considering – was obtained on EARTH–MANNA. Furthermore, latency may be hidden if the application dependencies permit it – but in our applications this potential was limited. The results show that, while fine-grained communication and multithreading are often associated with irregular and dynamic behavior or with tree-like task parallelism, not only applications with this kind of behavior but also some of those with completely static, regular and data-parallel behavior – like neural networks – can draw significant benefits from this kind of communication. Direct memory-access systems like EARTH are finely tuned for maximum efficiency, but they are more awkward to program and require more compiler support than message passing, e.g. to ensure some degree of correctness of data accesses. However, when overhead is critical, it may be worth having highly optimized code. Both user convenience and efficiency may then be achieved with fine-grained systems if compilers support typical computation/communication patterns other than tree-like structures, so that more specialized library routines are used safely or specifically optimized code is generated directly.
Acknowledgements

My thanks go to Professor G.R. Gao and his group for allowing me access to their EARTH system and their research results, and to G.R. Gao, X. Tian, and O. Maquelin for discussions on the paper. My thanks also go to G. Kock, K.-R. Müller, and N. Serbedzija of GMD FIRST for information about neural networks, and to W. Neun of the Konrad-Zuse-Zentrum Berlin for hints on the Gröbner Basis. J.-U. Schultz helped me with an initial EARTH parallelization of Gröbner Basis and neural networks.

References

[1] G. Alverson, R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, B. Smith, Exploiting heterogeneous parallelism on a multithreaded architecture, Technical Report, Tera Computer Company, Seattle, USA, and Workshop on Multithreaded Computers, Albuquerque, USA, 1991.
[2] B. Amrhein, O. Gloor, W. Küchlin, A case study of multi-threaded Gröbner Basis completion, in: Proc. ISSAC'96 (International Symposium on Symbolic and Algebraic Computation), Zürich, Switzerland, July 1996.
[3] Application Programmer's Library Reference Manual, Cray Research, publication SR216.
[4] D. Basak, D.K. Panda, M. Banikazemi, Benefits of processor clustering in designing large parallel systems: when and how?, in: Proc. International Parallel Processing Symposium, Hawaii, April 1996.
[5] P.M. Behr, S. Pletner, A.C. Sodan, PowerMANNA: a parallel architecture based on the PowerPC MPC620, in: Proceedings of the IEEE Conference on High Performance Computer Architecture (HPCA 2000), Toulouse, France, January 2000.
[6] M. Besch, H.W. Pohl, Flexible data parallel training of neural networks using MIMD computers, in: Proceedings of the Third Euromicro Workshop on Parallel and Distributed Processing, Sanremo, Italy, January 1995.
[7] L.S. Blackford, J. Choi, A. Cleary, et al., ScaLAPACK Users' Guide, http://www.netlib.org/scalapack/slug/scalapack_slug.html, May 1997.
[8] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, Y. Zhou, Cilk: an efficient multithreaded runtime system, in: Proceedings of the 5th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP '95), August 1995.
[9] U. Brüning, W.K. Giloi, W. Schröder-Preikschat, Latency hiding in message-passing architectures, in: Proceedings of the 8th International Parallel Processing Symposium, Cancun, Mexico, IEEE Computer Society, Silver Spring, 1994, pp. 704–709.
[10] S. Chakrabarti, K. Yelick, Implementing an irregular application on a distributed memory multiprocessor, in: Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP '93), San Diego, USA, 1993.
[11] S. Chakrabarti, K. Yelick, On the correctness of a distributed memory Gröbner Basis computation, in: Proceedings on Rewriting Techniques and Applications, Montreal, Canada, June 1993.
[12] A.A. Chien, U.S. Reddy, J. Plevyak, J. Dolby, ICC++ – a C++ dialect for high performance parallel computing, in: Proceedings of the 2nd International Symposium on Object Technologies for Advanced Software, LNCS 742, Springer, Berlin, 1996.
[13] J. Cordsen, H.-W. Pohl, W. Schröder-Preikschat, Performance considerations in software multicast, in: Proceedings of the 11th ACM International Conference on Supercomputing (ICS '97), 1997.
[14] D.E. Culler, Multithreading: fundamental limits, potential gains, and alternatives, in: R.A. Iannucci, G.R. Gao, R.H. Halstead, B. Smith (Eds.), Multithreaded Computer Architecture – A Summary of the State of the Art, Kluwer Academic Publishers, Dordrecht, 1994, pp. 97–138.
[15] J.H. Davenport, Y. Siret, E. Tournier, Computer Algebra – Systems and Algorithms for Algebraic Computation, Academic Press, London, 1988.
[16] J.-C. Faugère, A new efficient algorithm for computing Gröbner bases (F4), LIP6/CNRS, Université Paris VI, January 1998, MEGA 98.
[17] W.K. Giloi, U. Brüning, W. Schröder-Preikschat, MANNA: prototype of a distributed memory architecture with maximized sustained performance, in: Proceedings of the Euromicro PDP96 Workshop, IEEE-CS Press, 1996.
[18] M. Haines, W. Boehm, On the design of distributed memory Sisal, Journal of Programming Languages (fall issue) (1993) 209–240.
[19] S.E. Hambrusch, A.A. Khokhar, Maintaining spatial data sets in distributed memory machines, in: Proceedings of the 11th International Parallel Processing Symposium (IPPS '97), Geneva, Switzerland, March 1997.
[20] L.J. Hendren, G.R. Gao, X. Tang, Y. Zhu, X. Xue, H. Cai, P. Ouellet, Compiling C for the EARTH multithreaded architecture, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT '96), Boston, USA, October 1996, pp. 12–23.
[21] H.H.J. Hum, O. Maquelin, K.B. Theobald, X. Tian, G.R. Gao, L.J. Hendren, A study of the EARTH–MANNA multithreaded system, International Journal of Parallel Programming 24 (4) (1996).
[22] H.H.J. Hum, O. Maquelin, K.B. Theobald, X.-M. Tian, et al., A design study of the EARTH multiprocessor, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT '95), Limassol, Cyprus, 1995.
[23] R.A. Iannucci, G.R. Gao, R.H. Halstead, B. Smith (Eds.), Multithreaded Computer Architecture – A Summary of the State of the Art, Kluwer Academic Publishers, Dordrecht, 1994.
[24] C.F. Joerg, B.C. Kuszmaul, Massively parallel chess, in: Third DIMACS Parallel Implementation Challenge at Rutgers University, October 1994.
[25] JPDC, Special issue on multithreading for multiprocessors, Journal of Parallel and Distributed Computing 37 (1–4) (1996).
[26] G. Kock, M. Endler, M.D. Gubitosi, S.W. Song, Towards transparent parallelization of connectionist systems, in: Proceedings of the 9th International Conference on Parallel and Distributed Computing Systems (PDCS '96), Dijon, France, 1996.
[27] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing – Design and Analysis of Algorithms, Benjamin/Cummings, Menlo Park, CA, 1994.
[28] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (1989) 541–551.
[29] O. Maquelin, G.R. Gao, H.H.J. Hum, K.B. Theobald, X.-M. Tian, Polling watchdog: combining polling and interrupts for efficient message handling, in: Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA '96), PA, USA, May 1996.
[30] S. Nemawarkar, G.R. Gao, Latency tolerance: a metric for performance analysis of multithreaded architectures, in: Proc. IPPS '97, Geneva, Switzerland, March 1997.
[31] R. Nikhil, G.M. Papadopoulos, Arvind, *T: a multithreaded massively parallel architecture, in: Proceedings of the 19th International Symposium on Computer Architecture, 1992, pp. 156–167.
[32] L. Oliker, R. Biswas, Efficient load balancing and data remapping for adaptive grid calculations, in: Proc. SPAA '97, Newport, RI, USA, 1997.
[33] N.B. Serbedzija, Simulating artificial neural networks on parallel machines, IEEE Computer, March 1996.
[34] A. Sodan, G.R. Gao, Benefits of efficient multithreading on distributed memory for the parallelization of communication-intensive applications, Technical Report No. 1091, GMD, September 1997.
[35] A. Sodan, W. Pfannenstiel, Performance enhancement using multiple threads in distributed-memory environments, in: IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), November 1999.
[36] A. Sodan, G.R. Gao, O. Maquelin, J.-U. Schultz, X.-M. Tian, Experiences with non-numeric applications on multithreaded architectures, in: Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Las Vegas, USA, June 1997.
[37] A.C. Sodan, Mapping Symbolic Programs with Dynamic Tree Structure to Parallel Machines, Oldenbourg, München, 1996.
[38] X.-M. Tian, S. Nemawarkar, G.R. Gao, H. Hum, O. Maquelin, A. Sodan, K. Theobald, Quantitative studies of data locality sensitivity on the EARTH multithreaded architecture: preliminary results, in: Proc. HiPC '96 (3rd International Conference on High Performance Computing), Trivandrum, India, December 1996.
[39] K. Yelick, S. Chakrabarti, E. Deprit, J. Jones, A. Krishnamurthy, C.-P. Wen, Parallel data structures for symbolic computation, in: Workshop on Parallel Symbolic Languages and Systems, October 1995.