Parallel Computing 23 (1997) 23-34

Analyzing scheduling policies using Dimemas ¹

Jesus Labarta ², Sergi Girona, Toni Cortes

Departament d'Arquitectura de Computadors, CEPBA, Universitat Politècnica de Catalunya, Barcelona, Spain

¹ This work has been supported by the Spanish Ministry of Education (CICYT) under the TIC-94-537 and TIC-95-0429 contracts.
² E-mail: {jesus, sergi, toni}@ac.upc.es (URL: http://www.ac.upc.es/hpc).
Abstract

Dimemas is a simulator that allows the study of message passing applications on distributed memory machines. Currently, we are using Dimemas to analyze the effects of different processor scheduling policies when several parallel applications share machine resources such as processors and the interconnection network. The effect of independent sequential processes on the parallel applications is also being studied with the simulator. This situation is fairly frequent on clusters of workstations running a mixture of parallel and sequential workloads. We also study the influence of communication parameters (network bandwidth and conflicts) on the system performance. The paper presents the structure of the simulator and the workload used. This workload is a mixture of jobs from the NAS parallel benchmarks. Finally, we compare the effects of the above mentioned factors on both the global system throughput and the individual response time of each of the different parallel applications.

Keywords: Processor scheduling; Distributed memory; NAS benchmarks
1. Introduction

In this paper, we will evaluate the performance obtained by several applications using the message passing programming model in a multiprocessor system with distributed memory. Usually, the execution of these applications might require the assignment of processor subsets [7], or the assignment of the full system. If system sharing is performed in space, it is necessary that each application owns an exclusive processor subset, and applications have to fit into the assigned processors. If processor sharing is allowed in time, there are
two possibilities: batch execution, in which each application has exclusive access to the whole machine resources (producing low system utilization); and processor sharing, using a very coarse granularity (with quanta of several seconds or hours) [10,2], and including some kind of gang scheduling [8]. We will use Dimemas to study the influence of low level scheduling in such environments.

A network of workstations (NOW) with message passing libraries is another environment in which to execute such applications. PVM [5] and MPI [4] are the most widely used communication libraries. Parallel applications share resources with other parallel applications and with sequential ones. Dimemas will allow us to evaluate the influence of sharing processors between sequential and parallel applications.

Initially, Dimemas was developed to analyze the influence of sharing a multiprocessor system among several parallel applications. In that sense, the first feature included in the simulator was the possibility to study the influence of different scheduling policies on the global system performance. There is also interest in evaluating NOW environments, where parallel applications share resources with sequential ones. Dimemas has been extended to support sequential applications with the inclusion of a file system server. To cover the wide range of target architectures, Dimemas includes the possibility to model several communication networks and to analyze the different communication parameters. A more elaborate file system module [3] is currently being added to Dimemas in order to study caching policies in parallel environments and their interaction with the processor scheduling policies. A communication module has also been included for modeling communication using ATM networks.

We have selected the NAS parallel benchmarks [1] to study the scheduling policies. Some of these benchmarks have been grouped into different workloads. The global system parameters (throughput and slowdown) are analyzed with each workload.

This paper is organized as follows: Section 2 explains the structure of the simulator. Section 3 depicts the workload selection and the characteristics of each benchmark used. Section 4 describes which parameters we use for the global system analysis, and Section 5 analyzes the results obtained. Finally, Section 6 summarizes the paper goals and describes some future work.
2. Dimemas

Dimemas is a trace driven simulator that reconstructs the behavior of parallel programs from a set of traces that characterize each application. The next subsections describe the main components of Dimemas (Sections 2.1 and 2.2).

2.1. Application tracing

The objective of the application tracing is to capture, in a trace file, a set of records that characterize the application. The idea is to extract information inherent to the application and to avoid the effects of the platform and of the communication library used for this purpose.
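As an illustration only, the two record types described in the next paragraph (state and event) could be represented as in the following C sketch. The actual Dimemas trace format is not given in this paper, so every field name and layout here is a hypothetical assumption of ours.

    /* Hypothetical sketch of Dimemas-style trace records; the real
     * trace format is not specified in this paper. */

    /* State record: an actual resource demand and its duration,
     * e.g. a CPU burst measured on the instrumentation workstation. */
    struct state_record {
        int    task_id;     /* task that produced the record             */
        int    state;       /* kind of demand, e.g. CPU_BURST            */
        double duration_us; /* relative duration in microseconds; no     */
                            /* absolute timestamps are kept in the trace */
    };

    /* Event record: a punctual occurrence between state records; in
     * this work, a communication endpoint (send or receive). */
    struct event_record {
        int task_id;        /* task at this endpoint                     */
        int type;           /* SEND or RECEIVE                           */
        int partner_id;     /* task at the other end of the message      */
        int size_bytes;     /* message size in bytes                     */
        int tag;            /* message identification                    */
    };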
The key of the application characterization lies in two types of records: state and event. State records represent actual resource demands and their duration, for example a CPU burst. Event records represent punctual occurrences between state records and, in this work, represent communication endpoints. Although the communication records capture the relationship between different tasks, no absolute times are included in the trace. The trace for a parallel application can therefore be obtained on a single workstation, without using a parallel system.

For a precise application rebuild, it is important to achieve precision and accuracy in the measurement of the duration of the CPU bursts. For the results presented in this paper, the precision used is in µs. The probe effect has been reduced to the cache pollution caused by other processes sharing the system used for the instrumentation. In Ref. [6] the effect of this problem and the quality of the prediction are shown. The common probe effect problem of I/O-generated interaction between the instrumentation and the application behavior is not present, because the measurements are based on the duration of CPU bursts and absolute time is not used at all in the instrumentation.

2.2. Architecture model

The target architecture model is simple and flexible. It is composed of a set of SMPs (shared memory multiprocessors) and a communication network. Fig. 1 represents three different nodes with two processors each. Each node has a local memory, accessible only from the processors located in the node. This memory is used for communications between tasks running in the same node. Each node is attached to the communication network with l different links, and the network is able to maintain n communications concurrently. This communication network can model a full connectivity network (to model MPP clusters) as well as a single bus network (to model networks of workstations, NOW).

Fig. 1. Dimemas architecture model. It is composed of a set of SMPs, each of them attached to the communication network.

Dimemas can model several communication models based on two orthogonal properties: buffered and synchronous. By unbuffered we mean rendezvous communication, where both tasks involved in the communication must reach the communication point before the real communication is started. By synchronous communication we mean that the sending task must wait until the communication finishes before doing any other useful work, i.e. the processor can not be assigned to any other task. In the current work, we are using buffered and asynchronous communication, because it is the most widely used model (PVM, MPI).

Dimemas uses a linear function of the message size to model the communication time, $T_{comm} = L + S/B$, where $L$ is the latency, $S$ the message size and $B$ the bandwidth. The influence of the distance between the nodes is considered irrelevant, in accordance with the network state of the art. Two different aspects of communication conflicts are modeled: a limit on total network connectivity (n) and a limit on individual node access to the network (l).

Dimemas is able to model several processor scheduling policies. The two policies analyzed in this paper are:

2.2.1. First in first out (FIFO)
A single ready queue is maintained for each node. A running task only frees the processor if it uses the receive communication primitive for a message that is not located in the node (the task must block until the message reaches the node).

2.2.2. Round Robin (RR)
A single ready queue is maintained for each node. A running task may leave the processor if the time slice assigned to this task has finished, or if the task blocks itself receiving a message. In the latter case, the remaining time slice is not reassigned when the message reaches the node; in fact, the task must wait for a new and complete time slice. The time slice used is 10 ms. This time slice is short enough to allow a good share of the processor and long enough to reduce the impact of the context switch time.
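The following fragment illustrates the Round Robin bookkeeping just described: a task that blocks on a receive leaves the ready queue, and when the awaited message arrives it re-enters at the tail, so it always waits for a new, complete quantum. This is our own illustrative sketch, not Dimemas code; names such as node_t, QUANTUM_US and MAX_TASKS are assumptions, and no driver loop is shown.

    #include <stddef.h>

    #define QUANTUM_US 10000   /* the 10 ms time slice used in the paper */
    #define MAX_TASKS  64      /* assumed bound on tasks per node        */

    /* Per-node circular ready queue for the Round Robin policy. */
    typedef struct {
        int    ready[MAX_TASKS];
        size_t head, count;
    } node_t;

    static void enqueue(node_t *n, int task)
    {
        n->ready[(n->head + n->count++) % MAX_TASKS] = task;
    }

    static int dequeue(node_t *n)   /* returns -1 if the node is idle */
    {
        if (n->count == 0) return -1;
        int task = n->ready[n->head];
        n->head = (n->head + 1) % MAX_TASKS;
        n->count--;
        return task;
    }

    /* Quantum exhausted: the task goes back to the tail of the queue.
     * A task that blocks on a receive is instead parked on the pending
     * message and does NOT stay in the ready queue. */
    static void quantum_expired(node_t *n, int task)
    {
        enqueue(n, task);
    }

    /* The awaited message arrived: the task re-enters at the tail, i.e.
     * the unused part of its old slice is not restored and it waits for
     * a new, complete quantum. */
    static void message_arrived(node_t *n, int task)
    {
        enqueue(n, task);
    }

Under FIFO the same queue structure applies, but a running task only leaves the processor when it blocks on a receive, never on quantum expiration.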
3. NAS benchmarks and workload

The workload used in our work comes from mixtures of different NAS parallel benchmarks (NPB). We have used a PVM version of the NAS codes. We have selected 4 of the 8 NAS benchmarks (FFT, IS, LU, and MG) and we have mixed them in groups of three, obtaining four different workloads (named B1, B2, B3, and B4). Table 1 contains the workload names and the applications included in each workload. We have used the sample class problem size because we want to analyze the effect of time sharing on low granularity applications. We are also interested in mixing applications that achieve high and low processor utilization.
Table 1
Workload names and the applications included in each workload. Four different workloads using four applications from the NPB

Workload   Application 1   Application 2   Application 3
B1         FFT             IS              LU
B2         FFT             IS              MG
B3         FFT             LU              MG
B4         IS              LU              MG
Table 2
Application characteristics for the NAS codes: FFT, IS, LU, and MG. The table shows the results obtained when each application is simulated in a dedicated system with the properties for NOW and HPC computers. Time is measured in seconds

                                       NOW                              HPC
App.   Bytes/mess.  Num. mess.   time        %cpu   mess/s   time       %cpu   mess/s
FFT4   229378.6     48           42.387596   12.57  1.13     5.794899   91.17  8.28
FFT8   61168.4      105          9.161998    33.35  11.46    3.269270   90.56  32.12
IS4    272730.9     121          139.025563  13.81  0.87     25.586458  74.62  4.74
IS8    79399.5      243          20.815442   51.02  11.68    17.806472  58.42  13.65
LU5    152.4        5290         15.730039   82.10  336.30   8.791915   93.19  601.69
LU9    151.4        5330         13.873308   69.50  384.17   6.393461   74.97  833.63
MG4    9240.8       316          6.522070    31.81  48.49    1.915991   99.75  165.06
MG8    10175.2      164          2.102434    62.06  77.89    1.233535   93.83  132.75
The application traces for this work have been obtained on a Power Challenge with 12 R10000 processors, using the PVM 3.10 socket based version. Table 2 contains some relevant application characteristics: average number of bytes per message, number of messages, application time, average processor utilization and number of messages per second. The three last ones have been obtained for two different architecture environments: NOW (Network Of Workstations) and HPC (High Performance Computers). The communication parameters for both environments are presented in Table 3. Bus contention means that only one message can use the network concurrently. 1 link means that there is only one input and one output attachment from each node to the communication network, and the communication network is supposed to allow full connectivity among nodes. Each node has a unique processor.

Application static characteristics. These values characterize the application granularity, which is expected to have a significant effect on its behavior under the multiprogrammed environment. They are: (i) Number of communications. Applications FFT, IS and MG use very few communication primitives, but this number is very high for the LU benchmarks. LU uses the same order of magnitude of communication primitives when using 5 or 9 tasks. On the other hand, MG uses fewer communication primitives when running with more tasks.
Table 3
Communication parameters for the NOW and HPC environments, including latency, bandwidth and network restrictions

Environment   Latency (µs)   Bandwidth (MB/s)   Network restrictions
NOW           500.0          1.0                bus contention: l = 1, n = 1
HPC           50.0           30.0               1 link: l = 1, n = ∞
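Combining the linear model of Section 2.2 with the parameters of Table 3 gives concrete transfer times. The helper below is our own illustration (the function name is an assumption), and it treats 1 MB/s as 1 byte/µs, i.e. it assumes MB means 10^6 bytes.

    #include <stdio.h>

    /* T_comm = L + S/B: latency plus size over bandwidth (Section 2.2). */
    static double comm_time_us(double latency_us, double bandwidth_mb_s,
                               double bytes)
    {
        return latency_us + bytes / bandwidth_mb_s; /* 1 MB/s == 1 byte/us */
    }

    int main(void)
    {
        /* Average FFT4 message of Table 2 (229378.6 bytes) on each network. */
        printf("NOW: %.1f us\n", comm_time_us(500.0,  1.0, 229378.6));
        printf("HPC: %.1f us\n", comm_time_us( 50.0, 30.0, 229378.6));
        return 0;
    }

For that message the model gives roughly 0.23 s on the NOW network but under 8 ms on the HPC one, which anticipates why the NOW network becomes the bottleneck in Section 5.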
(ii) Average bytes per message. Again, this is of the same order of magnitude for FFT and IS, and both use a smaller message size when using more tasks. LU uses very short messages, and the size is maintained constant between the five and nine task versions. MG uses medium size messages, and the size is bigger for the eight-task version than for four tasks.

Behavior in the NOW environment. All the benchmarks do not fully use the processor because of the low network bandwidth. In the cases of FFT4 and IS4 this is more critical, suggesting a waste of processor resources. The application gain for IS and FFT when using eight tasks is due to the reduction of the message size, and to the reduction of the bus contention and its influence in the critical path of the application.

Behavior in the HPC environment. The HPC environment has been included to analyze the possibility of sharing processor and network resources. In this situation, the processor utilization is high on most of the codes, higher than 90%. The simulations also show the poor scalability of the applications due to the problem size.
4. Processor scheduling evaluation

To ensure the correctness of the results, we have evaluated the different scheduling policies under the following condition: the measurements used have been taken from simulations with a closed queue mechanism, where each application is simulated repeatedly until all applications of the workload have been executed at least a certain number of times. In our experiments, the minimum number of executions for each application is 10, and the results presented correspond to the time interval where all applications are running.

To compare the different scheduling policies, we have used the following global system parameters: application slowdown and system throughput. Both of them can be computed using the application time $Ts_i$ when application $i$ is executed in a shared environment, and the application time $Td_i$ when it is executed using a dedicated computer; $n$ represents the number of applications included in the workload. The equations to compute the system parameters, for the two different environments, are the following:

- Batch environment. In dedicated batch environments, the execution of several applications is performed in sequential order. The global system throughput can be computed as the number of applications divided by the summation of the application times:

$$Th_b = \frac{n}{\sum_{i=1}^{n} Td_i} \qquad (1)$$
- Time sharing environment. In time sharing environments, the execution of several applications is performed in parallel. The global system throughput can be computed as the summation of the individual throughputs obtained by each application:

$$Th_s = \sum_{i=1}^{n} \frac{1}{Ts_i} \qquad (2)$$

Fig. 2. Throughput (in jobs per second) with the four task workloads. For each workload, the batch, FIFO and Round Robin throughput is presented.
The slowdown obtained because of the non dedicated system can be computed, for each application, as:

$$S_i = \frac{Td_i}{Ts_i} \qquad (3)$$
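As a sketch, Eqs. (1)-(3) and the 1/n fairness test discussed below translate directly into code. The function below is our own illustration; the Td and Ts arrays in main are hypothetical values, not measurements from the paper.

    #include <stdio.h>

    /* Batch throughput (1), time-sharing throughput (2) and slowdown (3).
     * Td[i]: dedicated time of application i; Ts[i]: its shared time. */
    static void system_metrics(const double *Td, const double *Ts, int n)
    {
        double sum_Td = 0.0, Th_s = 0.0;
        for (int i = 0; i < n; i++) {
            sum_Td += Td[i];
            Th_s   += 1.0 / Ts[i];
        }
        printf("Th_b = %f jobs/s\n", n / sum_Td);   /* Eq. (1) */
        printf("Th_s = %f jobs/s\n", Th_s);         /* Eq. (2) */
        for (int i = 0; i < n; i++) {
            double S = Td[i] / Ts[i];               /* Eq. (3) */
            printf("S_%d = %f (%s)\n", i, S,
                   S >= 1.0 / n ? "fair share" : "below 1/n");
        }
    }

    int main(void)
    {
        double Td[] = { 9.2, 20.8, 13.9 };  /* hypothetical dedicated times */
        double Ts[] = { 13.0, 41.0, 30.0 }; /* hypothetical shared times    */
        system_metrics(Td, Ts, 3);
        return 0;
    }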
Fig. 3. Throughput (in jobs per second) with the eight task workloads. For each workload, the batch, FIFO and Round Robin throughput is presented.
Fig. 4. Slowdown with the four task workloads. For each workload, the slowdown of each application is presented using the FIFO and the Round Robin policies.
If the slowdown $S_i$ is near to 1, it means that the execution of this application in the non dedicated system is not delayed with respect to a dedicated one. On the other hand, if the value is near to 0, this application suffers a high penalty when run in a time shared environment.

The slowdown analysis allows us to evaluate the fairness of the scheduling policies. A fair scheduling policy, in a distributed environment, is one in which all individual slowdowns are greater than or equal to $1/n$, where $n$ is the number of applications sharing the resources; for the three-application workloads used here, this threshold is $1/3 \approx 0.33$. If some application gets more than $1/n$ and the other applications stay near $1/n$, then the processor scheduling is fair and the first application takes advantage of the shared resources, while the other ones are not penalized.
Fig. 5. Slowdown with the eight task workloads. For each workload, the slowdown of each application is presented using the FIFO and the Round Robin policies.
The left graph in Fig. 2 shows the throughput using the NOW environment, and the graph on the right shows it in the HPC environment. Both graphs, as those in Fig. 3, can be analyzed as follows. Each of the different workloads (from B1 to B4) has one column in the graph. For a given workload, three different bars are presented. The first one represents the throughput of this workload when running in a batch (dedicated) system. The second bar presents the throughput of each individual application included in the workload and the global system throughput obtained when FIFO is used in a shared environment. The last bar has the same information as the previous one, but it refers to the throughput when Round Robin is used.

The other parameter, slowdown, can be analyzed in the graphs of Figs. 4 and 5. In these figures, the information is presented as follows: each of the different workloads (from B1 to B4) has one column in the graph. For a given workload, six bars are presented, two for each application. The three bars on the left are the individual application slowdowns when using FIFO scheduling, and the three bars on the right are due to the Round Robin. The maximum value for the slowdown is 1.0, and a fair policy (in our case) is one in which all bars are higher than 0.33.
5. Results analysis

First of all, resource sharing is also a good solution for distributed memory machines: the system throughput in all workloads and environments is close to, or higher than, the batch value. Why is the throughput for NOW computers, in Figs. 2 and 3, lower than for HPC computers? Because our proposal is to share the whole computer, including processors and communication network. In NOW computers, the bottleneck for the applications is the network. Thus, resource sharing achieves a better processor utilization, but the network is still the bottleneck.
Table 4
Slowdown with the four and eight task workloads in NOW environments but without bus conflict network (n = ∞)

                        4 tasks                8 tasks
Workload  Appl.     FIFO       RR          FIFO       RR
B1        FFT       0.325561   0.527780    0.265440   0.704431
          IS        0.738702   0.862476    0.892657   0.508459
          LU        0.014471   0.150331    0.131095   0.459695
B2        FFT       0.317311   0.523178    0.268141   0.703447
          IS        0.724687   0.860089    0.913100   0.501098
          MG        0.029323   0.170733    0.120737   0.508772
B3        FFT       0.901380   0.730530    0.886985   0.732816
          LU        0.040687   0.459825    0.382024   0.420011
          MG        0.098576   0.360075    0.273607   0.520501
B4        IS        0.895860   0.911841    0.912384   0.457280
          LU        0.049671   0.239374    0.146022   0.435127
          MG        0.157377   0.160707    0.113206   0.486428
Table 5
Throughput for the eight task workloads in NOW environments but without bus conflict network (n = ∞)

Workload   Batch      FIFO       Round Robin
B1         0.013614   0.081306   0.134448
B2         0.013900   0.130561   0.342844
B3         0.037799   0.284145   0.357830
B4         0.016304   0.108203   0.284697
The slowdown analysis from Figs. 4 and 5 shows that for NOW environments there is no fairness at all, neither with FIFO nor with Round Robin. The reason is that the network is the bottleneck. The scheduling policies try to share the processor resources among the applications, but the bus conflict modeled in NOW environments uses a First In First Out policy. In this situation, the applications with less communication take advantage over the other ones (FFT and IS in Figs. 4 and 5). The possible solutions are to use a network without bus conflicts but with the same communication parameters, or to use HPC environments.

The results in Table 4 contain the slowdown when a network without bus conflicts is modeled in a NOW environment. With 4 task applications and the FIFO policy, there is still unfairness due to the difference between applications (each application has a different messages per second ratio). With the Round Robin policy, better results are obtained, but the best ones are with 8 task applications and Round Robin. In this last case, a very impressive system sharing is obtained. Table 5 contains the individual and global system throughput in this situation.

Why do some applications, FFT and IS, take advantage in the execution in shared environments? The reason is located in the application behavior in dedicated systems. Referring to Table 2, FFT and IS are the applications that use fewer communication primitives. A process releases the processor when blocking for a non ready message, and it will resume after the arrival of the message, when the scheduling algorithm assigns the processor to it. Processes with long CPU bursts and infrequent communications are favored in a shared environment. Round Robin is fairer than FIFO, as there is a system imposed limit (quantum × (number of processes − 1)) on the time from the actual arrival of the message until the process is rescheduled. This fairness problem, which originates at the node level, propagates and accumulates through the dependence chain of the application. That is the reason why a policy such as Round Robin, which is introduced on single processors for fairness purposes, may lead to unfairness in a parallel environment.
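To make the Round Robin limit mentioned above concrete (our own worked instance, assuming one task of each of the three applications of a workload per node): with quantum $q = 10$ ms and $p = 3$ processes sharing a node, a task whose message has already arrived waits at most

$$t_{wait} \le q \times (p - 1) = 10\ \mathrm{ms} \times (3 - 1) = 20\ \mathrm{ms}$$

before being rescheduled, whereas FIFO imposes no comparable bound on this waiting time.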
6. Conclusion

We have shown the use of the Dimemas tool to better understand not only the behavior of individual applications (Section 3) but also their behavior in multiprogrammed environments (Section 4).

Dimemas is a very useful tool. It runs on several platforms, and the processor time used for each simulation is proportional to the number of communications of the
workload and to the number of context switches. As the simulator does not have to rebuild the computation of the applications being simulated, the simulation time is commonly lower than the real execution time. Another advantage of the simulator is that it is executed on sequential machines, freeing the parallel machines and the network for parallel application development. In fact, a real multiprocessor is only needed if we require very fine trace files; otherwise, message passing libraries and workstations may be used to analyze parallel programs.

Because it is a simulation tool, it is very easy to implement different modules to analyze other issues: processor scheduling, file system caching, ATM communications, etc. It is also very useful to compare centralized control algorithms versus non centralized ones, precisely because centralized algorithms have been studied for shared memory machines. Centralized algorithms are not feasible for distributed memory machines and low level scheduling but, in some sense, they may offer an optimum reference value.

We have shown that the effect of sharing resources can lead to high system throughput, but very important fairness problems arise depending on the application characteristics (granularity) and the scheduling policies.

The interaction of the simulator with Paraver [6] (a visualization and analysis tool for message passing applications sharing multiprocessor systems) will help us in understanding the effect that different scheduling algorithms have on the applications. Some effort has to be directed to study cache pollution, context switch duration, etc. Some priority scheduling policies should also be analyzed, as well as the interaction of sequential programs in different nodes.
Acknowledgements
We thank Luis Gregoris for his help in the final tuning of the simulator and for the implementation of the Dimemas GUI. Dimemas has been developed at CEPBA-UPC (Spain) and is commercially available from PALLAS GmbH (Germany).
References

[1] D. Bailey, J. Barton, T. Lasinski and H. Simon, The NAS parallel benchmarks, Technical Report, NASA Ames Research Center, Moffett Field, CA, July 1993.
[2] R. Berrendorf, H.C. Burg et al., Intel Paragon XP/S - architecture, software environment, and performance, Technical Report KFA-ZAM-IB-9409, Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH, 1994.
[3] T. Cortes, S. Girona and J. Labarta, PACA: A cooperative file system cache for parallel machines, Euro-Par'96, Lyon, August 1996.
[4] J. Dongarra, R. Hempel, A. Hey and D. Walker, A proposal for a user-level message passing interface in a distributed memory environment, Technical Report ORNL/TM-12231, Oak Ridge National Laboratory, Feb. 1993.
[5] A. Geist, A. Beguelin, J. Dongarra and W. Jiang, PVM3 user's guide and reference manual, Technical Report ORNL/TM-12187, Oak Ridge National Laboratory, May 1993.
[6] J. Labarta, S. Girona, V. Pillet, T. Cortes and L. Gregoris, DiP: A parallel program development environment, Euro-Par'96, Lyon, August 1996.
[7] V.K. Naik, S.K. Setia and M.S. Squillante, Performance analysis of job scheduling policies in parallel supercomputing environments, Supercomputing 93 (1993) 824-833.
[8] J. Ousterhout, Scheduling techniques for concurrent systems, Proc. of Distributed Computing Systems Conf. (1992) 687-690.
[9] PARKBENCH Committee, Public international benchmarks for parallel computers, Technical Report CS-93-213, Computer Science Department, University of Tennessee, Knoxville, Tennessee, November 1993.
[10] Thinking Machines Corporation, CM5: Technical Summary, Cambridge, MA (1992).