Parallel Computing 19 (1993) 1221-1234, North-Holland
Efficient parallel computing in distributed workstation environments

Clemens H. Cap and Volker Strumpen *
Institut für Informatik, Universität Zürich, Winterthurerstrasse 190, CH-8057 Zürich, Switzerland

Received 29 June 1992; revised 25 November 1992, 25 January 1993, 20 May 1993

Abstract

The typical workstation in a LAN is idle for long periods of time. Within the concept of a hypercomputer, this free, distributed computing power can be placed at the disposal of the user. The main problem with this approach is the permanently changing load situation in the network. We show that heterogeneous partitioning with respect to the load situation at startup, and dynamic load balancing throughout the entire computation, are essential techniques for obtaining high efficiency with the hypercomputer approach. We describe a parallel programming platform called THE PARFORM, which supports these two features and therefore proves faster than related approaches. Performance measurements and a scalability model for an explicit finite difference solver of a partial differential equation conclude the paper.

Keywords. Distributed parallel computing; idle workstations; dynamic load balancing; scalability

1. Introduction

Present trends in supercomputer development emphasize expensive technologies and specialized architectural concepts. On the other hand, we are observing a significant increase in workstation performance and communication bandwidth, together with a shift of market interest from mainframes to workstations. Networks of high performance workstations are becoming increasingly available in companies and research institutions. For a large percentage of their lifetime these machines are merely used for reading email, editing files and similar small tasks. However, a workstation network may also be considered as a parallel computer or hypercomputer. In research institutions with state of the art RISC workstations, often some 300 Mflops are wasted by idling machines. Statistics of the LAN of our department coincide with others [5] in demonstrating an average idle percentage of at least 90%. A number of recent research activities have tried to exploit the computing power of such environments, like PVM [11], Linda [1], P4 [2], PARMACS [3], or Express [6]. It is predicted that such systems will gain further importance in the near future.

* Corresponding author. Supported by Siemens AG, ZFE, and the Schweizer Bundesamt für Konjunkturfragen, Grant No. 2255.1. Email: [email protected]


The goal of our present work is the utilization of idling workstations for parallel computing. Parallel computing aims at increasing both speedup and efficiency. While dedicated parallel machines already pose serious problems for programmers, heterogeneity and the permanently changing load situation of a workstation network dramatically aggravate the situation. In this paper we identify the two essential problems of hypercomputer performance and describe our solutions.

In Section 2 we show how the dynamically varying load situation of a network places tight limitations on the performance we can expect from a hypercomputer when running practical applications in non-dedicated networks. We explain two load distribution mechanisms, heterogeneous partitioning and dynamic load balancing, which guarantee high performance and cope with dynamically varying loads. Only this approach follows the idea of utilizing idle resources to the end. Section 3 introduces THE PARFORM, a hypercomputer platform designed in our group for studying these mechanisms. With a new and optimized design, aiming primarily at high performance, THE PARFORM achieves excellent speedups when running in a dedicated network. Efficiency is even very close to that of a tightly coupled multiprocessor system. In non-dedicated networks with permanently varying loads, THE PARFORM maintains high performance, which is generally superior to similar approaches. In Section 4 we describe the parallelization of an explicit heat equation solver with THE PARFORM. We present our practical experiences and compare them with similar platforms. Section 5 concludes with an analysis of scalability with respect to the number of processors and the ratio of communication to calculation.

2. Load management

The common approach for solving a given task in parallel is to partition the task into subtasks which can be executed concurrently. In a hypercomputer, these subtasks are placed on the individual machines in the network.

Homogeneous partitioning is the subdivision of a task into equally sized subtasks. In a network of workstations, it does not take into account the capabilities and different load situations of the machines. Thus, this technique is only suited for dedicated networks of equally powerful machines. In networks of workstations with different computing capabilities, however, homogeneous partitioning unnecessarily limits the performance. Cooperating parallel processes usually communicate during the computation. A process running on a faster host than its communication partner will reach the communication statement earlier. Assuming synchronous communication, this process cannot use its host's CPU power because it has to wait until the slower machine arrives at the corresponding communication statement. This is especially a problem with tightly synchronizing computations, where the CPU power the faster machines could contribute is reduced to the power of the slowest participating machine. As a consequence, adding another but slower workstation to the hypercomputer may not increase but decrease the resulting speedup, because all other machines keep waiting for synchronization with the slowest machine. The performance of the hypercomputer is then proportional to that of the slowest machine.

We have solved this problem with heterogeneous partitioning, where a task is divided according to the performance of the individual workstations. As a result, workstations able to contribute more to a parallel computation are automatically assigned larger subtasks, leading to better hypercomputer performance. We partition a task neither with respect to the peak performance of the workstations nor to the lengths of their job queues, but with respect to load values measured at the startup time of the computation. This guarantees optimal task partitioning with respect to the actual load situation at that time. However, the measured load values only reflect the load situation of the workstations at the time of the measurement.


Chances are that sooner or later this load situation will change dramatically and that we will be constrained by an inefficient partitioning of subtasks.

We have solved this problem with dynamic load balancing. At certain times during the computation, the processes exchange the load values of their hosts and adapt the size of their subtasks to the actual load situation. During daily operation, users will frequently start processes which may radically alter the previous load situation to which the parallel computation was well adapted. On a workstation with increased load, the process of the parallel computation shrinks the size of its subtask. Parts of it are transmitted to processes running on less loaded machines, regaining efficiency of the parallel computation. Before actually changing the sizes of the subtasks, an inexpensive protocol must ensure that the communication partner is indeed ready to receive additional chunks of work with respect to its own resource and load situation.

Especially in a hypercomputer, but also in multiprocessor systems, there are several other aspects which make dynamic load balancing an indispensable technique for improving performance and scalability. It is possible that the initial assignment of work to the individual machines does not fully utilize their processors, since we usually do not have an exact model telling us how much load a subtask of a certain size will actually place on a workstation. Therefore, the initial load distribution will not be optimal. This can be corrected automatically with a mechanism for dynamic load balancing. Due to CPU sharing, the hypercomputer processes may slow down the response times of interactive processes on a workstation. A sensible scheduling mechanism [5], in connection with dynamic load balancing strategies, has to ensure that the hypercomputer does not affect the interactive user but utilizes idling resources instead.

The approach described above requires mechanisms to change the size of a subtask during the computation, depending on the load situation of the individual machines. Such techniques are straightforward for many classes of applications: problems with static and regular data flow graphs, like linear algebra problems, partial differential equation solvers, signal transforms, image processing algorithms and the like. Program development for such algorithms can be supported by libraries which hide the load balancing completely from the programmer. For other problems, like recursive algorithms and geometrical and combinatorial problems, it becomes increasingly difficult to obtain a well balanced load distribution.

To implement dynamic load balancing, basic functionalities like load sensors and load distribution protocols are necessary. To identify and characterize the load of a machine, we distinguish two different situations: workstations which are idle except for very short or interactive jobs, and workstations running time consuming programs. Workstations running at most short or interactive jobs can be identified by large idle time percentages within sampling frames of several seconds to several minutes. For these machines the second type of data, the available performance, measured during a period of several seconds, informs us about their possible contribution to the hypercomputer. By scheduling appropriate parts of our job to those machines, we exploit time slices during which the workstation would otherwise be idle.
Workstations with very small idle time percentages run computationally intensive jobs. Here, the available performance value tells us how much computing power the parallel computation could obtain from such a workstation if it participated in the CPU sharing. However, scheduling parts of our job to such machines would also prolong the response times of all the other jobs on the machine. Using both kinds of data, we are able to construct a reasonable scheduling heuristic; a sketch of the resulting partitioning computation is given below.
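To make the heterogeneous partitioning step concrete, the following sketch divides a fixed number of work units (for instance grid columns) among hosts in proportion to their measured available performance. It is an illustration only, not code from THE PARFORM; the names (partition, perf, size) and the largest-remainder rounding scheme are ours, and the example performance figures are merely plausible relative values.

    #include <stdio.h>

    /* Divide `total` work units among `p` hosts proportionally to the
     * measured available performance perf[i] of each host. Rounding
     * remainders are handed out one by one so the sizes sum to `total`.
     * Illustrative sketch only; names and scheme are ours. */
    static void partition(int total, int p, const double perf[], int size[])
    {
        double sum = 0.0;
        int assigned = 0;

        for (int i = 0; i < p; i++)
            sum += perf[i];

        for (int i = 0; i < p; i++) {
            size[i] = (int)(total * perf[i] / sum);  /* proportional share */
            assigned += size[i];
        }
        for (int i = 0; assigned < total; i = (i + 1) % p) {
            size[i]++;                               /* distribute remainder */
            assigned++;
        }
    }

    int main(void)
    {
        /* Example: 3800 grid columns over four hosts of differing power
         * (relative performance values are illustrative). */
        double perf[4] = { 1.0, 1.0, 2.2, 2.2 };
        int size[4];

        partition(3800, 4, perf, size);
        for (int i = 0; i < 4; i++)
            printf("host %d: %d columns\n", i, size[i]);
        return 0;
    }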


Dynamic load balancing mechanisms can be extended to, or based on, process migration. Migrating processes to stable storage would allow taking snapshots of processes. This feature can be extended to provide fault tolerance in unstable environments.

3. Our hypercomputer: THE PARFORM

THE PARFORM is a hypercomputer platform for parallel computing in distributed workstation environments, developed at the University of Zurich. It involves three kinds of processes: the administrative process, several executive processes and load sensor processes, which together provide all the functionality necessary for hypercomputer programming, like location transparency or remote process startup. The programmer writes an application dependent administrative procedure and executive procedure which, combined with the libraries and macros of THE PARFORM, represent the administrative and executive processes. The load sensors provide the data necessary for heterogeneous partitioning and dynamic load balancing, as described above.

At system startup the platform internal part of the administrative process determines the number of workstations running in the network and their load situation. The detection of idle workstations and the collection of load data is based on a prearranged file of Internet addresses and a protocol using connectionless communication and asynchronous interrupt techniques. This design yields total startup times of THE PARFORM of less than one second for some forty machines and is significantly faster than similar systems. From the information of the load sensors and optional constraints given by the programmer, the administrative process determines those workstations which will actually be employed for a parallel computation. Based on the definition of the logical communication topography in the administrative procedure written by the programmer, and the previously determined total number of workstations, THE PARFORM places one executive process on every employed workstation and maps the logical communication topography onto the physical bus architecture of the Ethernet. The entire communication expense is lower with an explicit definition of the topography than with the concept of anonymous communication. The administrative process then executes its application part, which typically involves dividing the computation among the executive processes and collecting the results for storage or further processing.

An executive process consists of application independent code for setting up the communication links, passing messages and supporting dynamic load balancing. The application dependent part of the code performs arbitrarily sized subtasks of the total computation. The executive processes are started by the administrative process via the UNIX Internet daemon, guaranteeing short startup times. THE PARFORM works without any daemons except the load sensors.

Idle time percentages are obtained by periodically reading UNIX kernel variables, and available performance information is determined by running a small test load at certain periods of time. The overhead produced by the sensors is negligible.

The current implementation of THE PARFORM is written in C and runs on Sun SPARCstations under SunOS, HP 9000 machines under HP-UX and IBM RS 6000 machines under AIX, connected by Ethernet. The system is implemented on top of the Berkeley UNIX system calls, using the transport layer of the Internet protocol suite, TCP, UDP and the Berkeley socket interface [10]. Application programs are written in message passing style using the C language and the parallelization extensions offered by THE PARFORM.
Communication preserves the ordering of messages by using the reliable, bidirectional, connection-oriented TCP protocol. Broadcast and multicast techniques are under development. Data representation, marshalling, buffer sizes and similar architectural aspects are hidden from the programmer.


Porting THE PARFORM to operating systems like MACH is planned for a later phase of the project, since this will allow THE PARFORM to use the efficient scheduling and communication primitives of modern microkernel architectures.
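The text states that idle time percentages come from periodically sampling UNIX kernel variables. As a rough illustration of such a sensor, the sketch below samples the BSD getloadavg() interface; the actual sensor of THE PARFORM reads other kernel counters, so this is an assumption-laden stand-in, not the original code, and the idleness threshold is arbitrary.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Crude load sensor sketch. THE PARFORM samples UNIX kernel variables
     * to derive idle time percentages; here we approximate the idea with
     * getloadavg(), which differs from the original sensor. */
    int main(void)
    {
        for (;;) {
            double avg[1];

            if (getloadavg(avg, 1) == -1) {
                perror("getloadavg");
                return EXIT_FAILURE;
            }
            /* Treat a 1-minute run-queue length below 0.1 as "idle":
             * only short or interactive jobs are running. */
            printf("load %.2f -> %s\n", avg[0],
                   avg[0] < 0.1 ? "idle" : "busy");
            sleep(10);    /* sampling frame of several seconds */
        }
    }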

4. An application: Heat conduction

During the various phases of the development of THE PARFORM several examples have been studied, among them the calculation of fractals, cubic convolution algorithms for image transforms, neural networks and partial differential equations (PDEs). In this section we present an explicit solver for a parabolic PDE from the theory of heat conduction. This problem is not a representative example from the numerical and physical point of view. However, the amount of computation and communication can easily be estimated and scales linearly with the number of processors. The problem is ideally parallelizable, and the requirements on precise synchronization of the individual subtasks are very strict: a delay of a single subtask immediately slows down the entire computation. Therefore, this problem is an ideal benchmark for a hypercomputer platform and its load distribution mechanisms.

We applied an explicit forward difference approximation to a two-dimensional heat equation on a rectangular domain with given initial and boundary conditions. For details see [9]. Let $u_{i,j}^{(k)}$ denote the temperature at grid point $P_{i,j}$ at time $t_k = t_0 + k \, \Delta t$. The temperature $u_{i,j}^{(k+1)}$ depends on the temperature $u_{i,j}^{(k)}$ of the same grid point and of its four neighbors $u_{i-1,j}^{(k)}$, $u_{i+1,j}^{(k)}$, $u_{i,j-1}^{(k)}$, $u_{i,j+1}^{(k)}$, all at the preceding time step $k$. This numerical aspect, and the equivalent locality of the physical law, suggest the parallelization strategy: the grid of the rectangular domain is partitioned. Rectangular strips consisting of a set of neighboring columns of grid points are chosen, and the calculation of a single set of columns is scheduled to one workstation. Artificial boundary conditions between neighboring subtasks are thereby introduced, which have to be communicated between them at each time step and impose tight synchronization requirements on the subtasks. Since the amount of work to be done within a single subtask is approximately proportional to the number of subtask columns, this number directly determines its size. Homogeneous partitioning is implemented by splitting the grid into equally sized subtasks. With the heterogeneous partitioning strategy, the number of grid columns is proportional to the available performance of the hosts running the subtasks. For dynamic load balancing, the size of the individual subtasks is adjusted at runtime by moving the borders of the subtasks.

In the case of the heat equation, the communication topography is a pipeline, connecting each executive process with two neighbors, except the two processes at the ends, which have one neighbor only. In addition to the communication of the artificial boundaries at each time step, the executive processes also exchange the idle time percentages of their hosts. If the ratio of these values exceeds a certain threshold, the executive process running on the more heavily loaded host communicates a column to the less loaded neighbor. This protocol does not require additional messages, since the load information and the migrated columns are appended to the messages of the artificial boundary conditions. The local redistribution of the workstations' loads yields a globally optimal load distribution. The design of protocols which achieve this goal with minimal overhead is currently under investigation.

An illustrative sketch of a single executive's time step follows; in the remainder of this section we present our experiences with this implementation.
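The following C fragment illustrates one time step of an executive process under the scheme just described: update the strip with the five-point forward difference formula, then exchange artificial boundary columns with both neighbors. The communication calls (send_column, recv_column) and the coefficient r stand in for THE PARFORM's actual primitives and the PDE parameters, so this is a structural sketch, not the original code.

    /* Placeholder prototypes for the platform's messaging primitives. */
    extern void send_column(int neighbor, const double *col);
    extern void recv_column(int neighbor, double *col);

    /* One explicit time step on a strip of `cols` columns, each `n` points
     * high. u holds the current values, v receives the new ones (the caller
     * swaps u and v afterwards). Columns 0 and cols-1 are the artificial
     * boundaries owned by the neighbors; rows 0 and n-1 carry the fixed
     * physical boundary conditions. r = alpha * dt / h^2 is the usual
     * stability-bounded coefficient (our assumption for illustration). */
    void time_step(double **u, double **v, int cols, int n, double r,
                   int left, int right)   /* neighbor ids, -1 if none */
    {
        for (int i = 1; i < cols - 1; i++)
            for (int j = 1; j < n - 1; j++)
                v[i][j] = u[i][j] + r * (u[i-1][j] + u[i+1][j]
                                       + u[i][j-1] + u[i][j+1]
                                       - 4.0 * u[i][j]);

        /* Exchange artificial boundary columns with the pipeline neighbors. */
        if (left != -1) {
            send_column(left, v[1]);
            recv_column(left, v[0]);
        }
        if (right != -1) {
            send_column(right, v[cols - 2]);
            recv_column(right, v[cols - 1]);
        }
    }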
All network distributed computations were performed in the dedicated LAN of our department, consisting of 22 SPARCstation1, 8 SPARCstation1+, 7 SPARCstation2, 2 SPARCserver490 and 1 SPARCserver390 for the executive processes, and a separate SPARCserver490 for the administrative process. For comparison with a tightly coupled multiprocessor we carried out computations on a Transputer Multicluster MC-2/32-2, programmed in the programming language OCCAM.


Table 1
Runtimes (in seconds) with homogeneous partitioning, cf. Fig. 1

Executive   Linda      Linda                          THE PARFORM
processes   (POSYBL)   (SCA)      PVM       MC-2
 1          1264.3     1264.3     1264.3      a       1264.3  (1.0) b
 2           737.2      662.2      648.0      a        654.8  (1.9)
 4           442.6      342.6      328.0    921.4      332.3  (3.8)
 6           339.3      235.5      219.0    618.5      221.7  (5.7)
 8           284.6      175.8      168.4    466.7      170.2  (7.4)
10           260.2      144.3      143.6    376.6      137.4  (9.2)
12           244.7      122.1      116.6    318.2      116.0 (10.9)
14           242.7      104.5      100.1    276.2      103.5 (12.2)
16           239.5       92.8       90.0    240.0       89.0 (14.2)
18           242.6       84.5       87.5    215.9       80.9 (15.6)
20           241.6       76.0       75.8    196.6       73.5 (17.2)
22             -          71.5       68.5    182.8       67.5 (18.7)
24             -          66.5       63.6    170.9       62.5 (20.2)
26             -          63.1       60.5    160.9       58.6 (21.6)
28             -          58.5       56.7    151.9       55.8 (22.7)
30             -          55.1       53.5    144.7       53.0 (23.9)
32             -          54.0       54.0    138.5       51.0 (24.8)
34             -          52.4       54.0      -         50.8 (24.9)
36             -          51.4       52.0      -         48.5 (26.1)
38             -          51.3       54.0      -         48.4 (26.1)
40             -           -         52.9      -         47.2 (26.8)

a Primary memory of T800 too small.
b Speedup in parentheses.

The programs used in the individual experiments were identical, except for language and system specific aspects. All programs were compiled with the same standard C compiler and maximum optimization.

4.1. Homogeneous partitioning

To obtain results comparable with platforms which do not support load distributing strategies, the first experiments were performed with homogeneous partitioning and without dynamic load balancing. We used THE PARFORM, Linda (the public domain implementation POSYBL, version 1.102, and the commercial implementation SCA-Linda, version 2.4.7) and PVM (version 2.4.1, using the new, fast communication primitives) on a dedicated workstation network, and the OCCAM program on the Transputer Multicluster. Each run on the workstation network employed at least one SPARCstation1. Due to the strict synchronization conditions of the application, the more powerful workstations could not contribute more than a SPARCstation1. Hence, all measurements can be viewed as measurements in a network consisting only of SPARCstation1.

We solved the heat equation on a 3800 × 100 grid with 500 time steps and double precision numbers for the temperature values. The time reported in Table 1 is the wall clock time of the loop over the time slices, measured in seconds. The time needed for setting up the different platforms is not included. Figure 1 shows the corresponding speedups. For the single processor reference we used one optimized program without any platform code, measured on a SPARCstation1.

The poor performance of the public domain Linda implementation POSYBL [7] can be explained entirely by the overhead of tuple space management. With 10 to 20 processors the speedup remains essentially constant.

[Fig. 1. Speedup with homogeneous partitioning, cf. Table 1. Plotted: THE PARFORM, PVM 2.4.0, SCA Linda 2.4.7 and POSYBL 1.102 against the diagonal S = p; MC-2 speedup with respect to the 4-processor runtime.]

Tuple space management causes the main difficulties for an efficient Linda implementation, which appears better suited to shared memory architectures than to distributed systems with bus architecture networks. Optimized versions of Linda like SCA-Linda version 2.4.7 mitigate this problem by preprocessing and runtime optimization [4]. For the Linda experiments we used a message passing programming style mimicking the program of the other experiments as closely as possible. These codes were significantly faster than all Linda-flavored task bag codes we measured.

The Parallel Virtual Machine PVM [11] follows an approach very similar to THE PARFORM. Measurements with PVM show slightly lower performance than THE PARFORM when 16 or more machines are used. Several concepts in PVM, for example the platform startup or the remote daemons, produce more overhead than THE PARFORM, which does not use daemons. Heterogeneous partitioning and dynamic load balancing are not supported in PVM; they have to be implemented entirely by the programmer, and therefore high efficiency is difficult to achieve if the network runs under normal daily load.

The Transputer Multicluster is slower than THE PARFORM by approximately a factor of three, which is due to the lower performance of the T800 processors. The most interesting observation in Fig. 1 is that, with this application, the workstation network scales almost exactly like a tightly coupled multiprocessor.

4.2. Heterogeneous partitioning

Table 2 lists the times for the identical problem with heterogeneous partitioning supported by THE PARFORM, and Fig. 2 shows the corresponding speedups. THE PARFORM automatically chooses the fastest machines of the network and provides load information about these hosts to enable optimal partitioning of the task at runtime.


Table 2
Runtimes (in seconds) with heterogeneous partitioning, cf. Fig. 2

Executive   Workstations used                 THE PARFORM      THE PARFORM
processes   2 a  490 b  390 c  1+ d  1 e      plain f          optimized f
 1          1     -      -      -     -       564.0  (2.2)     564.0  (2.2)
 2          2     -      -      -     -       289.8  (4.4)     257.3  (4.9)
 4          4     -      -      -     -       146.6  (8.6)     130.3  (9.7)
 6          6     -      -      -     -        98.7 (12.8)      87.6 (14.4)
 8          7     1      -      -     -        86.9 (14.5)      70.6 (17.9)
10          7     2      1      -     -        77.0 (16.4)      61.1 (20.7)
12          7     2      1      2     -        70.8 (17.9)      56.9 (22.2)
14          7     2      1      4     -        66.2 (19.1)      55.7 (22.7)
16          7     2      1      6     -        63.1 (20.0)      51.4 (24.6)
18          7     2      1      8     -        60.6 (20.9)      51.5 (24.5)
20          7     2      1      8     2        60.9 (20.8)      48.8 (25.9)
22          7     2      1      8     4        58.1 (21.8)      49.1 (25.7)
24          7     2      1      8     6        59.5 (21.2)      47.8 (26.4)
26          7     2      1      8     8        58.4 (21.6)      46.9 (27.0)
28          7     2      1      8    10        56.4 (22.4)      48.0 (26.3)
30          7     2      1      8    12        56.2 (22.5)      48.5 (26.1)
32          7     2      1      8    14        54.8 (23.1)      48.5 (26.1)
34          7     2      1      8    16        55.6 (22.7)      48.3 (26.2)
36          7     2      1      8    18        57.1 (22.1)      47.0 (26.9)
38          7     2      1      8    20        58.2 (21.7)      49.2 (25.7)
40          7     2      1      8    22        55.4 (22.8)      48.7 (26.0)

a SPARCstation2. b SPARCserver490. c SPARCserver390. d SPARCstation1+. e SPARCstation1.
f Speedup in parentheses, vs. the sequential SPARCstation1 run (see Table 1).

The results of two implementations are presented: a plain version, which was also used in the homogeneous partitioning experiments, and an optimized version. The plain version executes the following sequence of statements for every time step:
(1) Calculate temperature values.
(2) Send artificial boundaries to neighbors.
(3) Receive artificial boundaries from neighbors.
Since the receive statement used in Step 3 is blocking, the processes have to wait for the arrival of the messages. In the optimized version a slightly different sequence was used:
(1) Calculate temperature values of artificial boundaries.
(2) Send artificial boundaries to neighbors.
(3) Calculate all remaining temperature values.
(4) Receive artificial boundaries from neighbors.
In this version, using nonblocking send statements, the communication can partially take place concurrently with the calculation of the remaining temperature values, so the time the processes have to wait for messages is considerably reduced. The contrast between the two orderings is sketched below.
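As an illustration of the reordering, the following C fragment contrasts the two time step structures. All function names are placeholders standing in for whatever primitives the platform provides; only the ordering is the point.

    /* Placeholder prototypes; the real platform primitives differ. */
    extern void compute_all_columns(void);
    extern void compute_boundary_columns(void);
    extern void compute_interior_columns(void);
    extern void send_boundary_blocking(void);
    extern void send_boundary_nonblocking(void);
    extern void recv_boundary_blocking(void);

    /* Plain version: compute everything, then communicate. The blocking
     * receive serializes computation and communication. */
    void plain_step(void)
    {
        compute_all_columns();
        send_boundary_blocking();     /* boundaries of our strip */
        recv_boundary_blocking();     /* wait for both neighbors */
    }

    /* Optimized version: send the boundary columns as early as possible
     * with a nonblocking send, so the transfer overlaps the computation
     * of the interior columns. */
    void optimized_step(void)
    {
        compute_boundary_columns();   /* only the columns neighbors need */
        send_boundary_nonblocking();  /* communication starts now ...    */
        compute_interior_columns();   /* ... and overlaps this work      */
        recv_boundary_blocking();     /* messages have usually arrived   */
    }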

[Fig. 2. Speedup with heterogeneous partitioning, cf. Table 2. Plotted: speedups of the plain and optimized versions vs. the sequential SPARCstation1 and SPARCstation2 runs, against the diagonal S = p and the dotted lines S2/1 and S1/2.]

In the case of homogeneous partitioning, the slowest machine in the network determines the speedup. All faster machines idle while waiting at the synchronization points, which are the receive statements. Thus, for homogeneous partitioning it is reasonable to calculate speedup values with respect to the runtime on the slowest machine. With heterogeneous partitioning we cannot use the conventional definition of speedup because of the different workstation performances. We could, for instance, introduce a scaled speedup with respect to an average machine performance, but the dotted lines in Fig. 2 offer all the information necessary for assessing the quality of the speedup. The diagonal represents the curve S = p, the upper dotted line shows the linear speedup S2/1 of p SPARCstation2 with respect to the sequential SPARCstation1 run, and the lower dotted line represents the linear speedup S1/2 of p SPARCstation1 with respect to the sequential SPARCstation2 run.

If the necessary field length of an application exceeds the amount of physical memory, this leads to excessive swapping and paging activity. Many scientific computing problems require too much memory to be solved on a single workstation. In our measurements we experienced superlinear speedups for a grid size of 7800 × 200, because the sequential version had been paging for more than 10 hours before we aborted it. The same problem with 38 machines resulted in a runtime of 154.0 seconds; here the physical memory of the workstations was large enough for the computation. Furthermore, we observed slightly superlinear speedups resulting from programs which require so much memory that, in the sequential version, other processes have to be swapped out before the program can be loaded. The parallel versions have the additional advantage of storing a larger percentage of their data in cache.

We also observed the already mentioned anomaly which can be found with homogeneous partitioning: calculating with p SPARCstation2 and adding one SPARCstation1 reduces the speedup instead of increasing it, as one could have expected from the additional machine power. For example, using six SPARCstation2 yields a runtime of 146.6 s; adding a SPARCstation1 with homogeneous partitioning results in 188.6 s.

Finally, in all types of experiments we observed a limitation of speedup which, in the heterogeneous experiments, occurs earlier than in the homogeneous ones. The main factor responsible for this limitation is the communication bandwidth of the network, which is discussed in detail in Section 5.

4.3. Dynamic load balancing

In order to analyze dynamic load balancing, we solved the heat equation on a 3800 × 100 grid for 2500 time steps on 21 machines (SPARCstation1 and SPARCstation2).


Due to the presence of the faster SPARCstation2 machines, heterogeneous partitioning is more efficient than homogeneous partitioning. However, neither homogeneous nor heterogeneous partitioning can react to load changes after system startup. We compared the different strategies on a dedicated network with all machines idling except one SPARCstation1, on which an infinite loop was started immediately after THE PARFORM startup. After starting this loop, the local UNIX scheduler shares the CPU equally between the loop and the task of the parallel computation. Therefore, if no dynamic load balancing is used, we can expect a slowdown of the entire computation by a factor of at least 2.

With the dynamic load balancing strategy used in this experiment, the work of the hypercomputer process was almost completely distributed to its neighboring processes. Only a minimal amount of work remained at the host running the loop. Thus, we regained full utilization of all other machines and made the machine running the loop completely available to the infinite loop process. THE PARFORM produced the following execution times:

Homogeneous partitioning without dynamic load balancing      742.0 s
Heterogeneous partitioning without dynamic load balancing    536.0 s
Heterogeneous partitioning with dynamic load balancing       354.0 s

These results for dynamic load balancing were obtained with a very simple protocol; a sketch of the underlying decision rule follows below. They demonstrate the importance and applicability of dynamic load balancing techniques. For general load balancing protocols, further investigations will be necessary.
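The protocol described in Section 4 appends load information and migrated columns to the boundary messages; the decision rule itself can be as small as the following sketch. The threshold value and all names are ours, chosen for illustration, and the real protocol may differ in detail.

    /* Decide whether to hand one grid column to a neighbor. Called once
     * per time step, with the idle percentages piggybacked on the last
     * boundary exchange. THRESHOLD and all names are illustrative. */
    #define THRESHOLD 1.5

    /* Returns the number of columns to migrate to the neighbor (0 or 1). */
    int columns_to_migrate(double my_idle, double neighbor_idle,
                           int my_columns)
    {
        /* Our host is busier (lower idle percentage) than the neighbor's
         * host by more than the threshold ratio: shed one column, but
         * never give away the last interior columns. */
        double ratio = (my_idle > 0.0)
                     ? neighbor_idle / my_idle
                     : (neighbor_idle > 0.0 ? THRESHOLD + 1.0 : 1.0);

        if (ratio > THRESHOLD && my_columns > 3)
            return 1;   /* shed one column to the less loaded neighbor */
        return 0;
    }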

5. Scalability

In Fig. 1 we observed an excellent, nearly linear speedup when using up to 20 workstations. However, the speedup breaks down for more than 30 workstations. There are four possible reasons for this: saturation, desynchronization, protocol overhead and network congestion.

Saturation means that a problem is split into so many subtasks that the parallelization overhead, mainly communication, outweighs the performance gained by the parallel computation. This phenomenon can only be influenced by choosing different algorithms or parallelization strategies.

Desynchronization occurs if workstations have to wait for data overdue from other machines. This may happen if sudden activities on the network cause a specific machine to reduce the CPU share of the process which is part of the parallel computation. This is a general problem for applications with strict synchronization constraints, like the heat equation solver: all machines must wait before calculating the next time slice if only a single machine is late. Performance improves by separating the commands for sending a message as far as possible from the receiving commands, as implemented in the optimized version of the heterogeneous experiments.

Even the best network technologies cannot avoid protocol overhead, that part of the communication needed for setting up drivers and adjusting flow control parameters. Network congestion occurs if the communication load is close to the throughput of the network. This phenomenon is a physical limitation of the communication medium. As is well known [8], this is more likely to be a problem with CSMA/CD type networks like the Ethernet than with token-passing techniques. With the random-access technique of the Ethernet, collisions may occur when many stations try to transmit a frame at approximately the same time. This is the case during the communication periods of the heat equation solver and leads to limited speedup. Future communication technologies, offering up to 10 Gbit per second of bandwidth, might partially solve this problem, whereas increases in workstation performance will make it more prominent.


In order to get a better understanding of these effects, we developed an analytic model for the heat equation solver. We suspected that the sublinear speedup behavior (Fig. 1) up to about 34 processors was mostly caused by saturation, and the breakdown with more than 34 processors by Ethernet congestion. To show this, we modelled the speedup of the heat solver including the collision behavior of the Ethernet.

Solving the heat equation on an $m \times n$ rectangular grid for $T$ time steps yields a sequential computation time of $t_s = CmnT$. The constant $C$ denotes the average time for evaluating a single grid point; on a SPARCstation1 we found $C = 7.2\,\mu s$. To calculate the elapsed time of the parallel computation, we integrated the total communication time $T \cdot t_{com}$ into our model, where $t_{com}$ is the average communication time per time step. Let the axis of the rectangular domain, which is discretized into $m$ grid points, be partitioned into $p$ parts for a parallel computation on $p$ processors. From our experiments with two processors (Table 1), we could reasonably assume that the calculation phase is completely parallelizable. Thus, the elapsed time of the parallel execution can be calculated as

$$t_p = \frac{t_s}{p} + T t_{com} = \frac{CmnT}{p} + T t_{com}.$$

During every time step each processor transmits $2n$ double numbers (8 bytes each), $n$ to the left and $n$ to the right neighbor, except the processors computing the boundary tasks. Hence, during one communication phase $2p - 2$ frames are transferred. A frame comprises $n$ double numbers plus header and trailer information, which consists of 18 bytes in the Ethernet standard. In our case with $n = 100$, the frame contains 818 bytes. The frame length $f$ is the quotient of the number of bits of the frame and the cable transmission rate, which is 10 Mbit per second in a standard Ethernet. Thus, $f = 0.65\,ms$ for our 818 byte frames.

The Ethernet model in [8] offers the following formula for the normalized transfer time $\gamma$ of a single frame, assuming constant frame length $f$:

$$\gamma = \frac{\rho \left( 1 + (4e + 2)a + 5a^2 + 4e(2e - 1)a^2 \right)}{2 \left[ 1 - \rho \left( 1 + (2e + 1)a \right) \right]} - \frac{\left( 1 - e^{-2\rho a} \right) \left( \frac{2}{\rho} + 2ae^{-1} - 6a \right)}{2 \left( e^{-\rho(a+1)} - 1 + e^{-2\rho a} \right)} + 1 + 2ea,$$

where $e$ is the base of the natural logarithm, and $a = \tau/f$ depends on the architecture of the Ethernet, $\tau$ denoting the end-to-end propagation delay, which we estimated to be $7.8\,\mu s$ in our LAN. $\rho = \lambda f$ denotes the traffic intensity, where $\lambda$ is the total average traffic in frames per second. Our load monitor suggested the approximation that packet submission is equidistributed during the average communication time $t_{com}$, because mutually exclusive access to the Ethernet causes desynchronization of the tasks. This is valid for communication phases with a moderate number of collisions; for traffic with heavy collisions, the assumption merely produces an optimistic approximation. During one communication phase, $2p - 2$ frames are transferred. Therefore, we obtain $\lambda = 2(p - 1)/t_{com}$, and for the traffic intensity

$$\rho(t_{com}) = \frac{2(p - 1)f}{t_{com}}. \qquad (1)$$

[Fig. 3. Scalability model of our heat equation solver. Plotted: measured THE PARFORM speedups (cf. Fig. 1) and model curves for f = 0.1 ms, f = 0.2 ms and f = 1.0 ms; broken lines: model without collisions.]

Given frames of constant length $f$ (in units of time), $f\gamma$ is the average transfer time of one frame. We therefore get an equation for $t_{com}$ expressing the highly nonlinear behavior of the network:

$$t_{com} = 2(p - 1) f \gamma(\rho). \qquad (2)$$

Substituting (1) in (2), we obtain a nonlinear equation in $\rho$, which is solved iteratively. The resulting speedup values are shown as the solid lines in Fig. 3. We chose the frame length $f$ as the only fitting parameter of our model. With $f = 0.2\,ms$ our model fits the measured curve, while the other curves show the speedup for varying frame length $f$. Considering the highly nonlinear behavior of the Ethernet, this fit matches the previously calculated value of $f = 0.65\,ms$ quite well.

We can now explain the sublinear speedups with the help of this model. Setting $\gamma = 1$ gives the speedup without considering collision effects, leading to the broken lines in Fig. 3. We can obtain a qualitative notion of the saturation effect from Fig. 3 if we interpret the frame length $f$ as the ratio of the communication and calculation parts of the computation. Thus, the curves with larger $f$ correspond to computations on smaller grid sizes and to networks with faster machines. This statement is also confirmed by comparing Figs. 1 and 2, since in the heterogeneous experiments the faster machines could contribute their full power. Including Ethernet specific effects further flattens the speedup curves (solid lines). Our computation with THE PARFORM just enters the region where collisions become noticeable when calculating with 34 to 40 processors. Identical experiments produced varying runtimes with differences on the order of several seconds. This effect is caused by the nondeterministic behavior of the Ethernet when sequentializing the communication. During our experiments with 40 machines, the average transfer rate was 5.1 Mbit per second.

In principle, Ethernet performance suffers from collisions when the traffic intensity $\rho$ increases. The average transfer delay increases in an unbounded way as $\rho$ approaches its maximum value $\rho_{max} = 1/(1 + (2e + 1)a)$ (see the first term of the formula for $\gamma$). This leads to increasing values of $t_{com}$ and a decreasing speedup curve. Thus, the analytic model gives a fairly good prediction of the experienced speedup and its breakdown, which are due to saturation and network congestion.

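To make the model concrete, the following sketch evaluates the speedup predicted by equations (1) and (2): for each p it solves the fixed point for t_com by damped iteration and then forms S = t_s/t_p. The gamma_of() function implements the transfer time formula as reconstructed above, and the constants follow the values quoted in the text (C = 7.2 μs, τ = 7.8 μs, f = 0.65 ms); the iteration scheme, damping and clamping are our assumptions, not the authors' solver.

    #include <math.h>
    #include <stdio.h>

    /* Model constants from the text (SI units). */
    #define C_POINT 7.2e-6     /* time per grid point evaluation [s] */
    #define TAU     7.8e-6     /* end-to-end propagation delay [s]   */
    #define F_FRAME 0.65e-3    /* frame length in units of time [s]  */
    #define M_GRID  3800
    #define N_GRID  100
    #define T_STEPS 500

    /* Normalized transfer time of one frame, as reconstructed above. */
    static double gamma_of(double rho, double a)
    {
        double e = M_E;
        double t1 = rho * (1 + (4*e + 2)*a + 5*a*a + 4*e*(2*e - 1)*a*a)
                    / (2 * (1 - rho * (1 + (2*e + 1)*a)));
        double t2 = (1 - exp(-2*rho*a)) * (2/rho + 2*a/e - 6*a)
                    / (2 * (exp(-rho*(a + 1)) - 1 + exp(-2*rho*a)));
        return t1 - t2 + 1 + 2*e*a;
    }

    int main(void)
    {
        double ts = C_POINT * M_GRID * N_GRID * T_STEPS;
        double a = TAU / F_FRAME;
        double rho_max = 1 / (1 + (2*M_E + 1)*a);

        for (int p = 2; p <= 64; p += 2) {
            /* Fixed point iteration for t_com = 2(p-1) f gamma(rho),
             * with rho = 2(p-1) f / t_com from (1); damped and clamped
             * below the pole for numerical stability. */
            double tcom = 2 * (p - 1) * F_FRAME;    /* gamma = 1 start */
            for (int it = 0; it < 200; it++) {
                double rho = 2 * (p - 1) * F_FRAME / tcom;
                if (rho > 0.99 * rho_max)
                    rho = 0.99 * rho_max;           /* stay below pole */
                double next = 2 * (p - 1) * F_FRAME * gamma_of(rho, a);
                tcom = 0.5 * tcom + 0.5 * next;     /* relaxation */
            }
            double tp = ts / p + T_STEPS * tcom;
            printf("p = %2d  speedup = %5.1f\n", p, ts / tp);
        }
        return 0;
    }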

6. Conclusions and further work

The presented results and observations show that we can regard a dedicated workstation network as a tightly coupled multiprocessor when exploiting its resources with a system like THE PARFORM. The conventional approach of homogeneous partitioning is not able to cope with the dynamically changing load situation of a workstation network. Only with our load distribution strategies can we obtain an almost complete utilization of the otherwise idling workstations in non-dedicated networks. In our experiments, the increase in efficiency by far outweighs the overhead of the dynamic load balancing.

THE PARFORM currently works well for explicit finite difference solvers, but there are still many effects on overhead and performance which have to be investigated further. Additional measurements and system fine tuning under varying and well controlled artificial load situations are necessary to determine in detail all parameters for optimal load balancing protocols. We also want to investigate how the system scales to configurations with several LANs connected by gateways.

Presently, the hypercomputer approach aims at users familiar with parallel programming. It offers language extensions for message passing and location transparent parallel programming in a heterogeneous, distributed environment. Furthermore, a number of different applications must be studied to get additional ideas of how the application interface can be improved. For many applications with a static data flow graph, essentially known at compile time, programmer support for dynamic load balancing is fairly easy. In situations where the data flow graph evolves dynamically with the computation, it is not so obvious how automatic load balancing can be done without having the programmer explicitly program those strategies.

Our parallel computations sometimes slowed down other processes. Nevertheless, feedback from the department staff during our experiments was encouraging, since our activities often went unnoticed by the users. From time to time we were suspected of causing effects which turned out to come from a totally different source. We plan to design a priority system which assigns lower priorities to production runs and programs like screenlocks, but higher priorities to interactive programs, in order to get fair CPU sharing between the hypercomputer and interactive jobs.

Finally, we want to emphasize that this approach is not only useful in distributed workstation networks. Massively parallel computers and networks of supercomputers will also benefit from our concepts.

Acknowledgements

Without the support of Lutz Richter and many invaluable discussions with Edgar Lederer this work would not have been possible. We further thank our system specialists Beat Rageth and Rico Solca for their support. The comments of Friedel Hoßfeld, Siegfried Knecht, Michael Weber and Steven Ericsson Zenith remarkably improved the quality of this paper. David Kaminsky performed several Linda measurements and provided some insight into the Linda programming model. This work is financed by Siemens AG, ZFE, Germany, and the Schweizer Bundesamt für Konjunkturfragen under Grant No. 2255.1.


References

[1] S. Ahuja, N. Carriero and D. Gelernter, Linda and friends, IEEE Comput. 19(8) (1986) 26-34.
[2] R. Butler and E. Lusk, User's Guide to the p4 Parallel Programming System, University of North Florida, Argonne National Laboratory, August 1992.
[3] R. Hempel, The ANL/GMD Macros (PARMACS) in FORTRAN for Portable Parallel Programming using the Message Passing Programming Model, User's Guide and Reference Manual, Gesellschaft für Mathematik und Datenverarbeitung mbH, November 1991.
[4] D. Kaminsky, Yale University, private communication.
[5] P. Krueger and R. Chawla, The Stealth distributed scheduler, Proc. 11th Internat. Conf. on Distributed Computing Systems, Arlington (1991) 336-343.
[6] Parasoft Corporation, Express Version 1.0: A Communication Environment for Parallel Computers (1988).
[7] G. Schoinas, Issues on the implementation of a programming system for distributed applications, draft paper, University of Crete, 1991.
[8] M. Schwartz, Telecommunication Networks: Protocols, Modeling and Analysis (Addison-Wesley, Reading, MA, 1987).
[9] G.D. Smith, Numerical Solution of Partial Differential Equations: Finite Difference Methods (Oxford University Press, New York, 1985).
[10] W.R. Stevens, UNIX Network Programming (Prentice-Hall, Englewood Cliffs, NJ, 1990).
[11] V.S. Sunderam, PVM: A framework for parallel distributed computing, Concurrency: Practice and Experience 2(4) (1990) 315-339.