Performance Evaluation 60 (2005) 165–187
Modeling message-passing programs with a Performance Evaluating Virtual Parallel Machine D.A. Grove∗ , P.D. Coddington School of Computer Science, University of Adelaide, Adelaide, SA 5005, Australia Available online 8 December 2004
Abstract

We present a new performance modeling system for message-passing parallel programs that is based around a Performance Evaluating Virtual Parallel Machine (PEVPM). We explain how to develop PEVPM models for message-passing programs using a performance directive language that describes a program’s serial segments of computation and message-passing events. This is a novel bottom-up approach to performance modeling, which aims to accurately model when processing and message-passing occur during program execution. The times at which these events occur are dynamic, because they are affected by network contention and data dependencies, so we use a virtual machine to simulate program execution. This simulation is done by executing models of the PEVPM performance directives rather than executing the code itself, so it is very fast. The simulation is still very accurate because enough information is stored by the PEVPM to dynamically create detailed models of processing and communication events. Another novel feature of our approach is that the communication times are sampled from probability distributions that describe the performance variability exhibited by communication subject to contention. These performance distributions can be empirically measured using a highly accurate message-passing benchmark that we have developed. This approach provides a Monte Carlo analysis that can give very accurate results for the average and the variance (or even the probability distribution) of program execution time. In this paper, we introduce the ideas underpinning the PEVPM technique, describe the syntax of the performance modeling language and the virtual machine that supports it, and present some results for example parallel programs to show the power and accuracy of the methodology.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Performance modeling; Message-passing; Parallel
∗ Corresponding author. E-mail addresses: [email protected] (D.A. Grove), [email protected] (P.D. Coddington).
0166-5316/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.peva.2004.10.019
D.A. Grove, P.D. Coddington / Performance Evaluation 60 (2005) 165–187
1. Introduction

Creating high-performance parallel codes is a difficult endeavor, and optimizing performance relies on exploiting the specific characteristics of the machine on which a code will run. Because of the difficulties in foreseeing the effect on performance of factors such as input data, the number of available processors and the characteristics of the communication network that connects them, most performance tuning of parallel programs still relies on a “measure-modify approach” [6]. This approach is dependent upon making detailed performance measurements of programs during execution and can be extremely time consuming, especially if it must be repeated for many target machines.

Performance modeling provides hope for an escape from the measure-modify development process through the use of its predictive powers, which can advance the focus on performance optimization to earlier in the design process [19]. Ideally, such a modeling technique should provide ordinary programmers with insight into the performance implications of load imbalance, insufficient parallelism, synchronization loss, communication loss and resource contention on their code. Unfortunately, performance modeling techniques typically have a high development cost and/or they are not accurate enough to be useful.

This paper introduces a new performance modeling system for message-passing parallel systems that is both simple to apply and very accurate. We have focused on message-passing parallel programs rather than other parallel programming methodologies for several reasons. The first is that message-passing requires all processing and communication details to be explicitly defined, which is important when developing a performance model.
Secondly, because message-passing primitives are sufficiently powerful to describe all parallel programming methodologies [22], a performance modeling technique for message-passing codes can be used as the basis for modeling the performance of codes written using other parallel programming paradigms. Finally, large-scale parallel programs for which performance is a high enough priority to warrant the use of performance modeling techniques are usually written using message-passing programming, and typically run on large distributed memory machines. Since MPI is by far the most widely used message-passing system, it is used as the basis for describing our modeling system.

Regardless of the physical complexities that underlie any particular parallel machine, conceptually a generalized machine that is capable of running message-passing programs needs to support only two tasks: local processing and communication between processing elements. In developing a performance model, it is therefore possible to abstract over the architectural details of a parallel machine, instead focusing only on the speed of local processing and communication operations. This suggests a conceptually simple “bottom-up” performance modeling system that closely follows the structure of a message-passing code. This is the approach we have adopted in developing the Performance Evaluating Virtual Parallel Machine (PEVPM) technique described in this paper. The PEVPM maintains a representation of the state of a program and the parallel machine that it is running on. Then, when either a segment of local computation or a communication operation is encountered, the current state is used to create a submodel of that computation or communication. This entire process can be considered as a form of partial execution that only performs sufficient work to evaluate the dynamic execution structure of the program, and hence program performance.
The time taken for inter-processor communication may be highly variable due to the effects of network contention. This also introduces the potential for non-deterministic program execution to occur. Accurately modeling this communication is the key to accurately modeling the overall performance of a parallel program. Unfortunately, a good prediction of average performance cannot be obtained by merely
modeling the delivery time of each message by its average value, since this does not take into account the variability of communication times and non-determinism in the execution of parallel programs. Therefore, if truly accurate models are sought, the execution time of each part of a code should be modeled probabilistically [7]. This approach has not previously been applied to parallel programs because of the difficulties in modeling non-determinism. However, using the PEVPM, we can simulate both the causes and the effects of this non-determinism. We simulate the cause by dynamically determining the level of network contention a message will face and then taking a sample time from an appropriate probability distribution that describes message-passing performance. This semi-static approach should be more accurate than previous work on the performance modeling of parallel programs which, at best [23], developed static approximations to contention and program run-time. The probabilistic approach also allows us to obtain a measure of performance variation by using Monte Carlo [16] techniques.

In Section 2 of the paper, we describe our basic approach to the performance modeling of message-passing programs on distributed memory parallel computers. Section 3 presents the details of our performance modeling language. The Performance Evaluating Virtual Parallel Machine that can automatically generate performance predictions from models constructed using our language is described in Section 4. A selection of results from testing the PEVPM on example parallel programs is presented in Section 5. We conclude with a summary of our ideas and a discussion of the current status and future directions of the implementation of the PEVPM system.
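To illustrate why modeling each message by its average delivery time is not sufficient, consider a hypothetical process that can only proceed once two independent messages have both arrived. The following sketch is not part of the PEVPM implementation; Python is used purely for illustration, and the exponential delay distribution and all names are assumptions. It compares the estimate obtained from average delivery times with a Monte Carlo estimate of the mean waiting time.

```python
import random


def mean_wait_for_all(num_messages, mean_delay, trials, seed=0):
    """Monte Carlo estimate of the mean time until all messages have arrived."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Sample an independent delivery time for each message
        # (an exponential distribution is assumed here for illustration).
        delays = [rng.expovariate(1.0 / mean_delay) for _ in range(num_messages)]
        # The receiver can only proceed when the slowest message has arrived.
        total += max(delays)
    return total / trials


# Modeling every message by its mean delivery time predicts a wait of 1.0.
naive_estimate = 1.0
monte_carlo_estimate = mean_wait_for_all(num_messages=2, mean_delay=1.0, trials=100_000)
print(naive_estimate, monte_carlo_estimate)
```

For two independent exponentially distributed delays, the expected maximum is 1.5 times the mean, so the average-based estimate is systematically too low even though each individual message is modeled correctly on average.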
2. Performance modeling of message-passing codes

There are several significant ideas that form the foundations for this work. The first is the idea of Communicating Sequential Processes (CSP) [14], where the execution of message-passing programs can be viewed abstractly as a collection of processes that independently run segments of local computation until they cooperate by exchanging messages. Other related work [8] has shown that the rich communication capabilities of most message-passing systems, such as collective communication and reduction operations, are reducible to a few basic point-to-point operations: synchronous sends, asynchronous sends, synchronous receives, asynchronous receives and operations to test for the completion of asynchronous messages. Therefore, modeling only these point-to-point operations provides a sufficient basis for modeling any communication that can occur. The final foundational concept of our work is the notion that such systems are congruent [22], which means that the performance of such parallel systems can be determined from the performance of their constituent parts.

Our PEVPM methodology models the performance of message-passing programs by decomposing their execution structure into a series of segments of local computation and communication events, each of which is individually subjected to independent performance modeling. Communication events form the boundaries for these submodels. The details of how the structural relationships between submodels are determined are discussed in Section 4. For now, we turn our attention to the task of modeling the time required to execute individual serial segments and the speed of individual communication events.

When a processor is performing local computation, it is acting independently from the rest of the system.
Since there is no real difference between these segments of serial computation and standard serial programs, the performance of these serial segments can be determined by standard performance modeling techniques for sequential programs [1,12,17]. Pragmatically, the execution time of these serial segments can be even more simply determined by empirical measurement. This is an especially reasonable
approach when serial segments are themselves subdivided into blocks of processing whose execution is not data dependent, because this essentially results in deterministic local processing requirements that are very stable and therefore well-described by constant quantities. This is the approach taken by the current implementation of PEVPM, essentially passing over this well-understood problem in order to focus on the more complex problem of determining the effect of communication performance on the overall performance of a message-passing program. Unlike the execution time of serial segments, communication time can be highly variable, because of contention in the communication network. This variability is problematic because it creates uncertainty in the time that message-passing operations take to complete and thus also introduces the potential for nondeterministic program execution to occur [18]. Therefore, in order to have some chance of predicting the effect of variable communication times on program structure and overall performance, accurate modeling of the communication times is essential. The key to solving this is to understand the causes and effects of contention at both microscopic and macroscopic levels. At a microscopic level, contention is caused by two or more competing processes trying to gain exclusive access to a shared resource. In the case of the communication network, this is affected by the exact number and timing of all the messages in the network. Although any particular situation will ultimately result in a specific sequence of events, it is a finely balanced chaotic system, so this sequence of events cannot be predicted reliably. However, the possible outcomes of contention delays are well-described by probability distributions at a macroscopic level, and it is now possible to accurately measure the probability distributions of individual message-passing calls using tools such as MPIBench [9]. 
This enables bottom-up modeling of the macroscopic performance of high-performance computing applications because they typically involve loops that generate a very large number of the same type of messages. Essentially, using a probability distribution to describe message-passing time macroscopically simulates the dynamic and microscopically unpredictable performance implications of contention on message-passing codes. Therefore, each individual communication event can be separately modeled based on properties such as where it came from and where it was sent, when it was sent, its length, and the number of other messages simultaneously in transit.

Since the time at which messages are sent, and hence the number of messages in transit and the level of contention in the communication system, are dynamic properties, they are not apparent in the source code of a program. In order to obtain this information, we use a semi-static approach that simulates the dynamic execution structure of a program based upon the initial conditions that are present in the source code. The dynamic program state that is used as a basis for creating static submodels of local processing and communication is provided by a virtual machine. The virtual machine maintains a representation of the state of a program and the parallel machine that it is running on. When either a segment of local computation or a communication operation is encountered, the current state of the virtual machine is used to dynamically determine a specific submodel of that computation or communication. For computation, this is simply modeled by a number that represents the execution time of the serial segment, which has been obtained by traditional methods or simply measured on a single processor. For communication, this is done by applying current specific state information to a general model of a communication event.
By chaining sequences of these submodels together, execution structure and hence performance are determined by partial evaluation that relies only on the determination of what local computation will occur until the next communication event is processed and the details of that particular communication event. Using this conceptual model of the execution of a parallel program, the PEVPM approach focuses on the complex problems caused by communication variability in parallel programs, while hiding the details of serial computation.
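As an illustration of how a communication submodel might be instantiated from dynamic state, the sketch below samples a message time from empirical measurements grouped by message size and contention level. It is a minimal sketch under stated assumptions: the class name, the nearest-neighbor lookup rule and the timing values are all hypothetical and are not taken from MPIBench or the PEVPM implementation.

```python
import random


class PerformanceProfile:
    """Empirical message times grouped by (message size, contention level)."""

    def __init__(self, samples):
        # samples maps (size_bytes, contention_level) -> list of measured times
        self.samples = {key: list(times) for key, times in samples.items()}

    def sample_time(self, size_bytes, contention_level, rng):
        # Assumed rule: use the measurements whose size is closest, breaking
        # ties by the closest contention level. Interpolating between
        # distributions would be another option.
        key = min(
            self.samples,
            key=lambda k: (abs(k[0] - size_bytes), abs(k[1] - contention_level)),
        )
        # Drawing one measured value at random samples the empirical distribution.
        return rng.choice(self.samples[key])


# Made-up timings in seconds, for illustration only.
profile = PerformanceProfile({
    (1024, 1): [100e-6, 105e-6, 110e-6],
    (1024, 8): [150e-6, 300e-6, 900e-6],
})
rng = random.Random(1)
contended_time = profile.sample_time(1024, 8, rng)
print(contended_time)
```

The contention level passed to `sample_time` is exactly the kind of dynamic information that cannot be read from the source code and must come from the simulated machine state.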
3. The modeling language

In order to apply the abstraction described in the last section to real (or hypothetical) message-passing programs, a means of translating common programming structures into a description of serial segments of code and message-passing calls is required. Fortunately, there are only a handful of programming constructs that provide the basic utility of any procedural programming language, or more to the point, the sort of language that message-passing programs are generally written in. Although the following discussion would hold equally well for any procedural language, the C programming language will be used here to facilitate the discussion.

The basic programming constructs that C employs are sequences of simple statements, loops, conditional statements and subroutines. In addition to these constructs, message-passing operations must also be considered. This does more than simply create the need for a means to describe these message-passing calls individually. It also affects the way that the other programming constructs must be considered, because message-passing calls can occur in the body of those constructs. While those constructs could otherwise be modeled as single entities, if they contain message-passing calls, then they must be divided into multiple entities that are separated by message-passing calls.

The remainder of this section formalizes the mapping between the basic constructs listed above and a language that can be used to describe their performance. This language consists of a set of performance directives (for example, like those in HPF or OpenMP) that must be supplied by the programmer alongside the program code that they model. In order to prevent the directives from being interpreted by the normal C compiler, they must be contained in comment structures. They are designed to be simple enough so that they could be easily extracted by a basic compiler, but complete enough that they specify all relevant information.
As a result of this, directives will sometimes contain information that could be easily obtained from the program source itself. The advantage of this approach, compared to a fully automatic compiler that extracts performance models from program code, is that it allows programmers to use their insight to specify many details about program execution structure that an automatic compiler could not hope to determine.

3.1. Machine dependencies

In addition to the directives that represent programming constructs, which will be discussed shortly, some machine-dependent directives are necessary to characterize the processing and message-passing capabilities of the parallel machine that a code will run on. These directives must be incorporated at the beginning of the source code file that contains the main procedure of the MPI program being modeled. The syntax of the directives used to define processing and message-passing capabilities is:
The structure of these directives typifies all of the other directives that will follow. A performance directive is identified by the tag PEVPM at the beginning of a comment line, which alerts the PEVPM to the presence of a performance model construct. Following the PEVPM tag are the attributes of the directive, which may be continued on subsequent lines using an & symbol.
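Because directives are ordinary comment lines introduced by the PEVPM tag, they can be extracted with very little machinery. The sketch below is illustrative only and is not the extraction tool used by the PEVPM. It assumes that a continuation line is itself a comment of the form `// PEVPM & ...`; the example source, the attribute name `type` and all numeric values are hypothetical.

```python
def extract_directives(source):
    """Return the PEVPM directives in C source text, joining continuation lines."""
    directives = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped.startswith("// PEVPM"):
            continue  # ordinary code or an unrelated comment
        body = stripped[len("// PEVPM"):].strip()
        if body.startswith("&") and directives:
            # Assumed continuation form: append the attributes to the
            # directive that was started on an earlier line.
            directives[-1] += " " + body[1:].strip()
        else:
            directives.append(body)
    return directives


# Hypothetical annotated fragment of an MPI program.
example = """
// PEVPM Loop iterations = 10
for (i = 0; i < 10; i++) {
    // PEVPM Serial time = 0.5
    work();
    // PEVPM Message type = MPI_Send
    // PEVPM &       size = 1024
    MPI_Send(buf, 1024, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
}
"""
print(extract_directives(example))
```

Since the C compiler ignores comments, the annotated program still compiles unchanged, while a tool such as this one sees only the directive stream.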
The Processor directive has two attributes. The identifier attribute should be set to a string which describes the serial processing capability of the machine, for example, a CPU/compiler pair. The timing basis defines the speed of the CPU in MHz for which the time attribute of any subsequent Serial directives (which will be discussed shortly) is valid. This flexible approach to describing serial processing capability allows for many performance models using different processor architectures and processor speeds to coexist in the program source files. This is achieved by using several Processor directives, each with their own unique identifier and timing basis attributes. The Network directive describes the communication capabilities of a machine as a directed acyclic graph, where each edge represents an individual link in the communication network, and each edge is annotated by a performance profile. This performance profile denotes a set of probability distributions that describe message-passing performance on that link based on message sizes and given levels of contention. For a specific evaluation based on a particular machine model, the processor architecture, processor speed, number of processors and network characteristics should be given as input parameters to the PEVPM. Alternatively, a Default directive can be used to specify the default machine model parameters that should be used if this information is not supplied at evaluation time:
Notice that there is a default numprocs attribute here that does not correspond to any attribute of the Processor or Network directives. The numprocs attribute represents the number of processes that will be run by the PEVPM, in the same way that the number of real processes to be run is specified on the command line when an MPI program is launched. In addition, the PEVPM assigns a value procnum, chosen from 0..numprocs-1, to each virtual processor that it instantiates to represent each of the numprocs processes.

3.2. Sequences of simple statements

As was stated in Section 2, the run-time of serial segments is assumed to be modeled by existing techniques, or simply by empirical measurement. The syntax of the directive used to model the performance of serial segments of code is:

// PEVPM Serial [on ] time =

This directive contains the Serial keyword followed by an optional on attribute and a time attribute. The optional on attribute can be used to specify the machine architecture that this Serial directive applies to; otherwise, it refers to the default architecture. Many Serial directives can be attached to any serial segment of code provided that they are differentiated by their on attribute. This is important because different compilers/architectures may produce vastly different object code for a serial segment due to differing Instruction Set Architectures, compiler optimizations, available libraries, etc. For a given processor architecture, the time attribute specifies the amount of time that the serial segment would take to run on a CPU running at timing basis speed. Under the crude assumption that there is a linear relationship between serial processing speed and clock speed for any given architecture (in reality, the relationship is slightly non-linear due to bus and memory speeds), approximate performance models for all speeds of CPU of
each modeled architecture can be trivially generated. For example, if a Processor directive has a timing basis of 800 MHz but a performance model for a 1 GHz processor is required, the time attribute of all applicable Serial directives is simply divided by 1.25 during an evaluation. Instead of using a simple numerical value, the parameter syntax can be used to introduce a symbolic parameter into the performance model. This is useful for deferring modeling of the performance of serial segments of code until model evaluation in the (rare) cases where the run-time of these segments is not statically determinable.

3.3. Message-passing calls

Any message-passing directive that is encountered implies the end of a segment of serial computation. The time that the specific message-passing operation described by such a directive will take to complete is dependent on the type of the call, when it was initiated, the size of the communication, where it was sent from, its destination and the level of contention it will encounter in the communication network. With the exception of contention, which is a dynamic property that must be determined from the state of the PEVPM, this information can be determined from the program source code. The syntax for the performance directive that represents a message-passing operation is:
This Message directive consists of a series of attributes that identify the type of MPI operation (for example, an MPI_Send or MPI_Bcast), the size of that message in bytes, from where the message will originate, to where the message is being sent, as well as a req/stat identifier that is used by the PEVPM when it is testing for the completion of asynchronous communication operations using MPI_Test or MPI_Wait, etc. The size, from and to attributes can all be modeled similarly to the time attribute for serial segments, i.e. using a constant number or using parameter syntax. In addition, the source and destination attributes may be constituted from a simple equation using the semi-static procnum and numprocs values to allow automatic evaluation during model solution. Further, in the case of group operations, a range of processor numbers may be specified. For example, an MPI_Bcast operation would consist of a single from processor and to processors 0..numprocs-1. As noted in Section 2, prior to evaluation by the PEVPM, group operations are deconstructed to individual sends and receives using the techniques described in [8].

3.4. Loop constructs

Loop constructs provide programmers with a concise way of repeating a series of statements, either across a range of values using a for loop or as long as some condition remains true using a while loop. From a performance modeling perspective, these constructs can be reduced to a common description that is simply concerned with the number of loop iterations that will occur. This information should be provided to the PEVPM by:

// PEVPM Loop iterations =
The loop can then be unrolled [2]. The resultant code will be modeled accordingly as a separate body of code with its own interspersed communication, further loops, conditional statements and subroutine calls. It is usually a relatively simple matter for a programmer to statically identify the loop conditions present in the source code, particularly in high-performance codes, since they often loop over large static data structures. In situations with fixed per-iteration processing requirements, the computational expense of simulating a highly repetitive loop can be reduced by simulating only its first few iterations and extrapolating from them.

If it is not possible to statically determine the number of loop repetitions, the loop bounds should be modeled by a parameter that signifies the number of iterations that will occur. This approach is particularly appropriate for while type loops, where the number of iterations is usually data-dependent and therefore statically indeterminate. What this means, essentially, is that in any case where a symbolic performance model results, the problem of loop repetition is being deferred until model evaluation. Rather than being problematic, this is advantageous because it uncovers important data-dependent model parameters and allows the effect on performance of those data dependencies to be expressed by retaining those dependencies as parameters in the model.

3.5. Conditional constructs

Probabilistic methods have commonly been used to model the performance of conditional execution. These methods assign fractional values c_i to each possible outcome of a conditional. These values are weighted to correspond to the probability that the outcome will be executed. The PEVPM directive that expresses this is:
Usually, a performance model is constructed by summing the run-time of each outcome multiplied by its probability of being executed. However, the PEVPM approach needs to explicitly determine the local instructions that will be executed at each process until the next communication event occurs. Therefore, when a message-passing operation is contained within the body of a conditional construct, the weightings c_i cannot simply be used to obtain an average execution time. Instead, the weightings must be used to probabilistically select a specific execution path every time a conditional statement is encountered. This is essentially a form of Monte Carlo sampling. In a short execution sequence, the actual probabilities for the chosen paths will not have had much chance to converge to their desired values, c_i. In a long sequence, however, the probabilities will converge to the desired c_i values. This probabilistic approximation to execution sequence is well-suited to accurately modeling the performance of high-performance applications, since by their nature, such applications are extremely long running and therefore, long execution sequences are guaranteed.

Of course, the difficult part of this problem is determining the probability estimates c_i for each condition. These could be estimated by the programmer based on knowledge of the problem, or by running the serial code on a single processor and measuring the probabilities. Alternatively, the conditions could be dealt with in the same way as statically indeterminate loop conditions, i.e. by using a parametric value in order to delay the problem of condition determination until model evaluation, which retains important data-dependent parameters in the overall performance model.
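The probabilistic path selection described above can be sketched as follows. This is an illustrative example rather than PEVPM code: the function names and the weights 0.7, 0.2 and 0.1 are hypothetical. It selects one outcome per visit to a conditional and reports how often each outcome was chosen, so that short and long execution sequences can be compared.

```python
import random


def choose_branch(weights, rng):
    """Select one outcome index with probability proportional to its weight c_i."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def observed_frequencies(weights, num_visits, seed=0):
    """Fraction of visits on which each outcome of the conditional was selected."""
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(num_visits):
        counts[choose_branch(weights, rng)] += 1
    return [count / num_visits for count in counts]


weights = [0.7, 0.2, 0.1]  # hypothetical c_i values for a three-way conditional
short_run = observed_frequencies(weights, 10)
long_run = observed_frequencies(weights, 100_000)
print(short_run, long_run)
```

With only ten visits the observed frequencies can differ noticeably from the c_i values, whereas after many visits they settle close to them, which is why the approach suits long-running applications.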
3.6. Subroutines

Support for subroutine constructs is largely intended to support user-level code only. It is not aimed at library functions, which can usually be better modeled as quasi-simple statements, i.e. their completion time can be conceptually “looked up”. However, unlike simple statements, the performance of library calls may depend on a number of parameters, and so no simple mapping between source code and completion time exists. Instead, rather than actually looking up the completion time of library calls in a table, the completion time of these quasi-simple statements must be computed by some special purpose model, similar to using point-to-point communications to model collective communications.

The only real difficulty in modeling user-level subroutines is if recursion is used. Fortunately, this is rarely the case in high-performance message-passing programs, since recursion tends to add significant overhead to program execution time and alternative solutions can always be achieved without using recursion. Therefore, rather than provide a complicated procedure for dealing with such an uncommon problem, the possibility of recursion will be excluded from this modeling technique. Without recursion, in-lining is used to insert the directive sequence of a procedure whenever it is encountered in the model description, and modeling subroutines becomes a trivial problem.
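In-lining without recursion can be sketched in a few lines. The example below is illustrative only: directives are represented as plain strings, and the marker `Subroutine <name>` used to denote a call is a hypothetical notation, since the directive syntax for subroutines is not given here. Recursive calls are rejected rather than modeled, in line with the restriction stated above.

```python
def inline_subroutines(name, subroutines, active=()):
    """Expand the directive list of `name`, in-lining every nested call."""
    if name in active:
        # The same subroutine is already being expanded further up the call chain.
        chain = " -> ".join(active + (name,))
        raise ValueError("recursion is not supported: " + chain)
    expanded = []
    for directive in subroutines[name]:
        if directive.startswith("Subroutine "):
            callee = directive[len("Subroutine "):]
            expanded.extend(inline_subroutines(callee, subroutines, active + (name,)))
        else:
            expanded.append(directive)
    return expanded


# Hypothetical model with one user-level subroutine.
model = {
    "main": ["Serial time = 1", "Subroutine exchange", "Serial time = 2"],
    "exchange": ["Message type = MPI_Send", "Message type = MPI_Recv"],
}
print(inline_subroutines("main", model))
```

After expansion the model is a flat directive sequence, so the rest of the evaluation does not need a call stack.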
4. Automatic performance evaluation

Once a performance model of a message-passing code has been created using PEVPM directives, the implications of that model can be evaluated on the Performance Evaluating Virtual Parallel Machine. The PEVPM consists of an abstract parallel machine and an associated set of algorithms that govern the execution of PEVPM directives on that machine. This combination provides a platform that is capable of reproducing the time behavior (and hence performance) of programs abstracted by the generalized model of message-passing codes detailed in Section 2, i.e. where processors independently run segments of local computation until they cooperate by exchanging messages.

The computational basis of the abstract parallel machine is illustrated in Fig. 1, which shows numprocs virtual processors that are capable of serial computation, each with its own send and receive message queues for communication. However, instead of executing real code for serial processing or message-passing, the instruction set of this abstract machine is the collection of PEVPM directives that are detailed in Section 3. Execution progress at each virtual processor is recorded by a local program counter, which is incremented every time a PEVPM directive is processed. For example, when a Condition directive is processed, the program counter will move forwards in the directive stream to point at the directive associated with the outcome of the condition. When a Loop directive is encountered, the PEVPM directives in the body of the loop are unrolled to meet the loop bounds and the program counter jumps to the first directive in the first iteration of the loop. For simplicity, it is assumed that neither of these control flow actions incurs any time penalty, which is usually a reasonable approximation for most large-scale message-passing codes.
Each processor also maintains a local clock that is incremented when a PEVPM directive calls for local processing or a message-passing operation. For local processing, the clock is simply incremented according to the time attribute of the associated Serial directive. For message-passing operations, the clock is incremented by an amount that is computed using the performance model for message-passing operations that will be described shortly.
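The per-processor state described above can be sketched as a small data structure. The example below is a simplified illustration, not the PEVPM implementation: it handles only Serial directives and blocking sends, directives are encoded as hypothetical `(kind, value)` tuples, and the time a send takes to complete is deliberately left unresolved.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VirtualProcessor:
    procnum: int
    program_counter: int = 0
    local_clock: float = 0.0
    runnable: bool = True
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)


def run_until_blocked(proc, directives):
    """Advance one processor until it blocks on a send or runs out of directives."""
    while proc.runnable and proc.program_counter < len(directives):
        kind, value = directives[proc.program_counter]
        if kind == "Serial":
            proc.local_clock += value  # value is the Serial time attribute
        elif kind == "Send":
            # Record the message size together with the time it was posted.
            proc.send_queue.append((value, proc.local_clock))
            proc.runnable = False  # completion time is not yet known
        proc.program_counter += 1
    return proc


proc = run_until_blocked(
    VirtualProcessor(procnum=0),
    [("Serial", 2.0), ("Serial", 1.5), ("Send", 1024), ("Serial", 9.0)],
)
print(proc.local_clock, proc.program_counter, proc.runnable)
```

The final Serial directive is not executed, because the processor stops being runnable as soon as it posts the blocking send.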
Fig. 1. The computational basis of the Performance Evaluating Virtual Parallel Machine is an abstract machine with n virtual processors capable of executing PEVPM directives, a program counter (PC), send queue (Sq ), receive queue (Rq ), local clock and any parametric variables (vari ), as well as a contention scoreboard that keeps track of the number of messages in transit over the communication network.
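The per-processor state described above can be illustrated with a minimal sketch. It assumes a hypothetical tuple encoding of directives, ("serial", seconds) and ("loop", count, body), which is not the PEVPM directive syntax of Section 3; the class and function names are likewise illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualProcessor:
    """State of one virtual processor (field names are illustrative)."""
    procnum: int
    program_counter: int = 0
    local_clock: float = 0.0
    send_queue: list = field(default_factory=list)
    recv_queue: list = field(default_factory=list)


def unroll(directives):
    """Expand ("loop", count, body) entries into a flat directive stream."""
    flat = []
    for directive in directives:
        if directive[0] == "loop":
            _, count, body = directive
            flat.extend(unroll(body) * count)
        else:
            flat.append(directive)
    return flat


def run_serial(proc, directives):
    """Process Serial directives; only computation advances the clock."""
    stream = unroll(directives)
    while proc.program_counter < len(stream):
        kind, seconds = stream[proc.program_counter]
        if kind != "serial":
            raise ValueError(f"unsupported directive: {kind}")
        proc.local_clock += seconds   # Serial: add its time attribute
        proc.program_counter += 1     # stepping itself costs no virtual time
    return proc


p = run_serial(VirtualProcessor(procnum=0),
               [("serial", 0.5), ("loop", 3, [("serial", 0.25)])])
print(p.program_counter, p.local_clock)  # 4 1.25
```

Unrolling the loop before stepping mirrors the assumption that control flow incurs no time penalty: the loop contributes only the time of the Serial directives in its body.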
The PEVPM employs an iterative two-phase process to simulate the performance behavior of a program based on its PEVPM directives. The first phase, called a process sweep and described by Algorithm 1, simulates the processing that will be performed by all processes until they reach their next decision point, which is a message-passing call that signifies the end of a segment of local computation by that process. The second phase, called a match sweep and described by Algorithm 2, searches for the path that further execution will take by identifying the earliest decision point from all processes (the pivotal decision point), and then determines how execution will proceed from there. Chaining the decisions made by successive process/match sweeps simulates the entire execution structure of the program.
Algorithm 1. PEVPM process sweep.
    for p in 0..numprocs − 1 do
        while p.runnable is true do
            get p.directive
            if p.directive is Condition then
                evaluate condition; set p.program_counter to branch location
            else if p.directive is Loop then
                unroll loop; set p.program_counter to first loop location
            else if p.directive is Subroutine then
                in-line subroutine; set p.program_counter to start of subroutine
            else if p.directive is Serial then
                get time
                increment p.local_clock by time
                increment p.program_counter
            else if p.directive is Message then
                get type
                if type is MPI_Isend or MPI_Irecv then
                    get size, from, to
                    increment p.local_clock by “local completion time”
                    push (type, size, from, to, p.local_clock) onto p.send/recv queue
                    increment p.program_counter
                else if type is MPI_Send or MPI_Recv then
                    get size, from, to
                    push (type, size, from, to, p.local_clock) onto p.send/recv queue
                    increment p.program_counter; set p.runnable to false
                else if type is MPI_Test or MPI_Wait then
                    get from, request/status
                    if request/status.completed is true then
                        if type is MPI_Wait then
                            set p.local_clock to request/status.match_time
                        end if
                        set “flag” to true; increment p.program_counter
                    else {currently unknown}
                        push (type, from, request/status, p.local_clock) onto p.recv queue
                        increment p.program_counter; set p.runnable to false
                    end if
                end if
            else {end of input}
                set p.runnable to terminated
            end if
        end while
    end for
Algorithm 2. PEVPM match sweep.
    set match_time to ∞
    for p in 0..numprocs − 1 do
        sort all unmatched p.send queues into g.send_queue by local_clock
        sort all unmatched p.recv queues into g.recv_queue by local_clock
    end for
    set s to head of g.send_queue
    set r to head of g.recv_queue
    while r is not NULL and then r.from.runnable is false do
        get r.type
        if r.type is MPI_Test then
            get r.from
            set r.completed to false
            set r.from.runnable to true
        else if r.type is MPI_Wait then
            set r to successor in g.recv_queue
        else {r.type is MPI_Recv}
            get r.from, r.to, r.size, r.local_clock
            while s is not NULL and then s.local_clock < match_time do
                get s.from, s.to, s.size, s.local_clock
                set s.arrival_time using s.∗ and “contention scoreboard”
                if s.to is r.from and s.arrival_time < match_time then
                    set match_from to s; set match_to to r
                    set match_time to s.arrival_time
                end if
                set s to successor in g.send_queue
            end while
            if match_time is not ∞ then
                if match_from/to.type is “blocking” then
                    set local_clock of match_from/to to match_time
                end if
                set match_from/to.completed to true
                set match_from/to.runnable to true
            end if
            if r.from.runnable is false then
                set r to successor in g.recv_queue
            end if
        end if
    end while
    if r.from.runnable is false then
        set all r.from.runnable to deadlocked
    end if
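To make the process sweep concrete, the following is a simplified sketch of Algorithm 1. It handles only Serial directives and the four basic message calls, omitting Condition, Loop, Subroutine, MPI_Test and MPI_Wait. The dictionary layout, the (kind, argument) directive tuples and the value of LOCAL_COMPLETION_TIME are assumptions made for illustration.

```python
LOCAL_COMPLETION_TIME = 1e-6  # assumed cost of queuing a non-blocking call


def make_proc(directives):
    """Create the state of one virtual process (layout is illustrative)."""
    return {"directives": directives, "pc": 0, "clock": 0.0,
            "state": "runnable", "sendq": [], "recvq": []}


def process_sweep(procs):
    """Advance each runnable process to its next decision point."""
    for p in procs:
        while p["state"] == "runnable":
            if p["pc"] == len(p["directives"]):
                p["state"] = "terminated"           # end of input
                break
            kind, arg = p["directives"][p["pc"]]
            p["pc"] += 1
            if kind == "serial":
                p["clock"] += arg                   # arg is the time attribute
            elif kind in ("isend", "irecv", "send", "recv"):
                if kind in ("isend", "irecv"):
                    p["clock"] += LOCAL_COMPLETION_TIME
                else:
                    p["state"] = "blocked"          # blocking call: decision point
                queue = "sendq" if kind.endswith("send") else "recvq"
                p[queue].append((kind, arg, p["clock"]))  # arg is the peer rank
            else:
                raise ValueError(f"unsupported directive: {kind}")
    return procs


procs = process_sweep([
    make_proc([("serial", 2.0), ("isend", 1), ("serial", 1.0), ("recv", 1)]),
    make_proc([("serial", 0.5), ("recv", 0), ("serial", 3.0)]),
])
for p in procs:
    print(p["state"], p["pc"], p["sendq"], p["recvq"])
```

In this example the first process continues past its non-blocking send and stops only at its blocking receive, while the second stops at its first receive. Both are then left for a match sweep to resolve.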
An evaluation begins by following the thread of control on the first processor. The program counter steps through PEVPM directives until a decision point in execution structure is encountered. During this process, the execution structure of the program is evolved according to Serial, Loop, Condition and non-blocking Message directives, the local clock is incremented in concert with Serial directives or by a submodel for local completion time (the local queuing time for an asynchronous message-passing call) in the case of non-blocking communication operations, and any communication events are placed on the local send or receive queue as appropriate. At any point where a parameter attribute is encountered, execution is suspended while a higher authority is consulted for the value of the parameter. In most cases, this symbolizes user intervention, although in some cases, it may be possible to infer the value of the parameter from the special variables (such as procnum and numprocs) stored in a process description. When a decision point in control structure is encountered, e.g. a directive
Fig. 2. Top: The disposition of the modeling system to deadlock-or-not for blocking or non-blocking messages must be the same as the actual programming model. This example shows a simple interaction between three processes in which deadlock would occur if blocking send operations were used. Bottom: The structure of program execution is recorded by a match-sweep line that separates events that have happened from the set of future events that may happen.
symbolizing an MPI_Send or MPI_Recv operation is reached, further evaluation of that process is postponed by setting the flag p.runnable to false and evaluation skips to the next process. Directives symbolizing MPI_Test or MPI_Wait operations also imply a decision point in control structure: in correct programs, they are required to test and flag the arrival (or non-arrival) of messages from non-blocking message-passing operations. This boolean information will eventually be used to determine the progress of the process involved. Although this must occur in a conditional statement and could hence be dealt with by its associated Condition directive, it is sensible to suspend the process sweep at this point, since the arrival of that message cannot be determined until a match sweep has been completed.
A minor addendum is necessary to explain why non-blocking operations need to be treated differently from blocking operations. Consider the example shown in the bottom half of Fig. 2. This example shows a simple interaction between three processes in which deadlock would occur if blocking send operations were used. However, this program would complete if non-blocking send operations were used. Since these are both legitimate (and in fact, common) ways in which message-passing is used, the modeling system must behave in a semantically identical way. Essentially, this requires that all possible send and receive operations are identified during a process sweep so that the match stage can be guaranteed to complete correctly. This is achieved by allowing a process sweep to continue when non-blocking operations are encountered.
Once all processes have completed a process sweep, the second phase match sweep is initiated to determine the result of the earliest decision point that occurs for any process. This is the only decision
point that can be resolved during this match sweep because its outcome can affect any of the future decision points, as shown in the top half of Fig. 2. The match-sweep line divides the execution progress into known events and unknown events. Clearly, the earliest unmatched receive, i.e. the one posted by process P3, can only be matched by the message labeled “1”. However, the match at process P2 cannot be made until all possible sends that could match it have been discovered during a prior process sweep, i.e. the messages labeled “?”. Therefore, once a match has been made, the match phase is terminated and a new process sweep is begun.
How a match is made depends upon the type of operation that is at the head of the global receive queue. If it is an MPI_Test operation, then the fact that processing did not continue during the process sweep means that the result of the test was unknown at that point, i.e. the previous send or receive to which it was referring had not been matched yet. However, since this is now the pivotal decision point, there can no longer be any pending matches between send and receive pairs that could occur before the test operation. Therefore, the completion parameter for the test operation can be set to false and a new process sweep can be initiated, thus providing a forward execution path. Dealing with an MPI_Wait operation at the head of the receive queue is similar, except that its blocking semantics mean that the process must remain blocked (rather than being made runnable) until a match has been made. Therefore, having yet to determine a forward execution path, the match sweep must continue by trying the next earliest operation on the receive queue.
The most interesting case is when a receive event must be matched with a complementary send event.
A send provides a potential match partner if its destination matches with the process associated with the receive call that is currently being evaluated (and provided that other meta-data such as the MPI tags and communicators match). However, timing information must also be considered in order to find the correctly ordered match: a receive will match with the first appropriate message to arrive. Determining the unique message that fits this criterion can be achieved using the knowledge of the times that messages were sent (which they were tagged with during the process phase) and their transit times through the communication network. Obtaining the transit times of messages through the network is a difficult task because they are affected by contention. As discussed in Section 2, the technique that is used to model the performance of messages subject to contention relies on knowing the number of messages traveling throughout the communication system with which the message being modeled must compete. This information is maintained in the PEVPM by a contention scoreboard. The structure of the contention scoreboard is constructed to match the network topology of the virtual machine that is described by the appropriate Network directive. In the general case, every possible link in the network is described by a contention number which indicates the number of messages that are currently in transit on that particular link. For commodity crossbar switches serviced by a high-bandwidth backplane, the entire contention scoreboard is well-described by just one link which represents that backplane. While the events on the send queue are being scanned during the match sweep, the contention scoreboard is continually being updated. 
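The matching rule and the role of the contention scoreboard can be sketched as follows, for the single-link (backplane) case mentioned above. The transit_time function is a hypothetical deterministic stand-in for a measured performance distribution, and the class, function and field names are assumptions made for this illustration.

```python
from collections import Counter


class ContentionScoreboard:
    """Counts the messages currently in transit on each network link."""

    def __init__(self):
        self.in_transit = Counter()

    def enter(self, links):
        for link in links:
            self.in_transit[link] += 1

    def leave(self, links):
        for link in links:
            self.in_transit[link] -= 1

    def level(self, links):
        """Highest contention number along a route (an assumed summary)."""
        return max(self.in_transit[link] for link in links)


def transit_time(size, contention):
    """Hypothetical cost model; the PEVPM samples a measured distribution."""
    return 1e-4 + size * 1e-8 * contention


def match_receive(receiver, sends, board, route=("backplane",)):
    """Return the send addressed to `receiver` that arrives first."""
    best, best_arrival = None, float("inf")
    in_flight = []                              # (arrival time, route) pairs
    for send in sorted(sends, key=lambda s: s["sent_at"]):
        if send["sent_at"] >= best_arrival:
            break                               # later sends cannot arrive earlier
        for done in [f for f in in_flight if f[0] <= send["sent_at"]]:
            board.leave(done[1])                # a previous message has completed
            in_flight.remove(done)
        board.enter(route)
        arrival = send["sent_at"] + transit_time(send["size"], board.level(route))
        in_flight.append((arrival, route))
        if send["to"] == receiver and arrival < best_arrival:
            best, best_arrival = send, arrival
    return best, best_arrival


sends = [
    {"from": 0, "to": 2, "size": 1_000_000, "sent_at": 0.0},
    {"from": 1, "to": 2, "size": 1_000, "sent_at": 0.001},
    {"from": 3, "to": 1, "size": 1_000, "sent_at": 0.002},
]
board = ContentionScoreboard()
best, arrival = match_receive(2, sends, board)
print(best["from"], arrival)
```

Here the small message from process 1 is sent after the large message from process 0 but arrives before it, so it is the one matched by the receive at process 2. This is the reason matching must use arrival times rather than send times.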
Every time a new message is encountered, the contention numbers of any links that it must traverse (which can be determined using source, destination, topology information and routing protocol) are increased by one, and the contention numbers of any links where a previous message has completed are reduced by one. This is not an exact means of determining the duration of contention on individual network links because it estimates the contention a message faces on each link only once, when the message is scanned during the match process. It does not take into account any increase or decrease in contention that the message will face during its lifetime. This tends to average out over simulations
with large numbers of messages, however, which are just the kind of codes that are of concern for this modeling technique.
Once the contention numbers that a message must transit under are known, an appropriate performance profile can be chosen to simulate message-passing time. A specific value will be sampled from this probability distribution, thereby simulating the effect of contention on the completion time of that message. By evaluating a model many times, a sensitivity analysis can be performed, providing valuable insight into the performance stability of the system. This effectively performs a Monte Carlo analysis to give an average execution time as well as a probability distribution or histogram of completion times.
Much of the power of the PEVPM methodology lies in its quasi-execution of program code. Because computation and communication evolve in virtual time, the model automatically accounts for the effects of overlapping communication with computation, load imbalance and insufficient parallelism. Coupled with the facility to explicitly model communication losses, synchronization losses and the associated resource contention issues of each of these, the PEVPM methodology is capable of accounting for all the sources of both performance and performance loss in message-passing parallel programs.
The PEVPM methodology has the potential to produce highly accurate performance estimations for only a low-to-moderate evaluation cost. High quality results will be obtained when highly accurate times are used to model the run-time of serial segments of code and the completion time of message-passing events. A low-to-moderate evaluation cost is ensured because only the structure of the code is simulated. Therefore, all the time-consuming segments of serial computation that occur in a real code are eliminated and each segment’s performance is represented by a simple number.
Similarly, the completion times of message-passing operations are determined using simple models within a fraction of the time of the operations that they abstract.
There are many situations in which this performance evaluation methodology could be very valuable. Obviously, it could be useful to application programmers who are trying to optimize their code because it can draw attention to performance losses caused by load imbalance or insufficient parallelism, as well as detect some program correctness issues such as deadlock. More ambitiously, it could find application in automated compilers for the same purpose. In addition to merely providing a probabilistic performance estimate for an entire code, with a little additional effort, individual execution traces could be made amenable to examination using existing performance visualization tools [3]. These tools graphically display computation and the communication relationships between processes with respect to a global time-line. This can help programmers to spot the cost of message-passing and performance bottlenecks such as load imbalance and insufficient parallelism. Furthermore, these tools can also assist in the process of debugging complex message-passing programs by helping programmers to search for unmatched send/receive pairs or race conditions. Typically, these tools use trace data gathered at run-time from real message-passing code executing on real parallel machines, which is an intrusive process that can alter the system that it is measuring. Unfortunately, this causes very real problems in the case of message-passing programs because of the volatility in program execution structure caused by non-determinism [21]; quite often, the behavior of message-passing codes will vary drastically between production runs and runs that are subject to the intrusion of a performance debugging tool. Using the PEVPM technique, performance traces are produced by simulation, which is not susceptible to this problem.
Producing execution traces using the PEVPM technique is especially useful for cases where access to target parallel machines for test purposes may be difficult or even impossible (for example, for hypothetical or yet-to-be released machines) and hence traditional techniques cannot be used.
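The Monte Carlo evaluation described in this section can be illustrated with a small sketch. The PROFILES table holds invented sample values standing in for empirically measured distributions, and the program being modeled (a fixed serial segment followed by one message, repeated) is a deliberately trivial assumption.

```python
import random
import statistics

# Hypothetical measured transfer times (seconds), keyed by contention number.
PROFILES = {
    1: [1.0e-4, 1.1e-4, 1.2e-4],
    2: [1.5e-4, 2.0e-4, 4.0e-4],
}


def sample_message_time(contention, rng):
    """Draw one completion time from the profile for this contention level."""
    level = min(max(contention, min(PROFILES)), max(PROFILES))  # clamp to measured range
    return rng.choice(PROFILES[level])


def simulate_once(rng, iterations=100, serial_time=1.0e-3, contention=2):
    """One virtual run: alternate a serial segment with one message."""
    clock = 0.0
    for _ in range(iterations):
        clock += serial_time + sample_message_time(contention, rng)
    return clock


def monte_carlo(runs=1000, seed=0):
    """Repeat the virtual run and summarize the completion times."""
    rng = random.Random(seed)
    times = [simulate_once(rng) for _ in range(runs)]
    return statistics.mean(times), statistics.stdev(times)


mean, spread = monte_carlo()
print(f"mean {mean:.4f} s, standard deviation {spread:.5f} s")
```

Because each run draws fresh samples, repeated evaluation yields a distribution of completion times rather than a single figure. Replacing the sampled value with a fixed minimum or average would collapse that distribution to a point.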
5. Case studies
This section presents a number of case studies that showcase the generality, flexibility, cost effectiveness and accuracy of the PEVPM modeling system. These examples are drawn from a matrix of test cases that spans three different parallel codes run on three different parallel machines. More complete results are also available [8]. The first example code performs a Jacobi Iteration (JI) using a regular–local communication pattern constructed from a number of MPI_Send and MPI_Recv operations. The second example code uses a task-farming approach to process a randomly generated Bag of Tasks (BOTS) and has both irregular and non-deterministic communication and computation requirements. The third example code calculates a two-dimensional fast Fourier transform (FFT) using the MPI_Alltoall operation, which has a regular–global communication pattern. The three parallel machines used (Perseus, Orion and the APAC NF) span a wide performance range and exemplify low-end, middle-of-the-range and high-end parallel supercomputers. Perseus [13] is built from Pentium III processors and Fast Ethernet; Orion [20] is built from UltraSparc II processors and Myrinet; and the APAC NF [5] is built with Alpha EV68 processors and QsNet.
While we describe the algorithms used for the example codes only briefly below, complete sources with PEVPM annotations and expanded explanations are available [8]. The PEVPM annotations were added according to the rules detailed in Section 3. This trivial translation process took us only a few minutes by hand, but could easily be carried out by an automated compiler with little or no human intervention. We translated those PEVPM directives into C language driver programs that mimicked the computation and communication structures of the actual example codes.
Once again, because these driver programs merely reflected the control structures defined by the PEVPM directives, they only took a few minutes to generate by hand, but could also have been automatically generated by using appropriate compiler techniques. Finally, we linked our driver programs with implementations of the PEVPM match/sweep algorithms and ran those resulting simulations a number of times to predict the performance of the example application codes for many configurations of the three parallel machines listed above.
5.1. Jacobi Iteration
Jacobi Iteration is a common parallel computing example, because it is simple to explain yet has the same basic computation–communication pattern as all parallel algorithms with regular–local communication. In Jacobi Iteration, every point in a grid is iteratively updated to be the average of its four neighbours (excepting boundary values, which do not change). We introduced parallelism into our Jacobi Iteration code by a one-dimensional data decomposition of its grid into n subgrids, one for each of the processes involved in the computation. During each iteration, every process transfers the edges of its subgrid to its immediate neighbours in a regular–local communication phase and performs the computation on its subgrid in conjunction with any edge data received.
The PEVPM predictions for our Jacobi Iteration code on Perseus, Orion and the APAC NF are plotted as speedup curves in Fig. 3(a–c). The speedups observed by actually running the Jacobi Iteration are also plotted in the same figures. There are two classes of predictions in each figure. Firstly, shown by dashed lines, there are PEVPM predictions made using complete performance distributions of constituent message-passing operations, which were measured with MPIBench. These are very accurate, always predicting completion time to within 5% and usually to within 1% (except for two anomalous cases
Fig. 3. (a–f) Predicted vs. measured speedups for a number of parallel code/parallel machine pairs.
at very large configurations of the APAC NF). Importantly, these predictions are consistently accurate, regardless of the number of processors used.
The second class of predictions, shown by dotted lines, highlights the inaccurate predictions that result from using simple benchmark data in the performance modeling process. Such predictions are typical of previous techniques for performance modeling of parallel programs, and were made by using minimum or average benchmark results in the PEVPM evaluations instead of complete performance distributions. This leads to performance overestimates, roughly in proportion to the total number of processors used, because it does not account for the flow-on effects of contention. Hence, while simple techniques may be useful for modeling the performance of parallel programs running on a small number of processors, they are inadequate for modeling large-scale parallelism, where contention effects become very important.
The number of processors per SMP node used in a machine obviously plays an important part in determining overall performance. On Perseus, it does not seem to matter greatly whether single or dual processor configurations are used for our Jacobi Iteration implementation. On Orion, however, SMP nodes slightly outperform single-processor nodes; on the APAC NF, single-processor nodes significantly outperform SMP nodes (given, in both cases, the same total number of processors). While many empirical rules have been established in regard to the potential benefits of using multiprocessor nodes in clusters [4,11,15], the PEVPM’s first-principles approach to understanding the reasons for these outcomes is far more valuable.
The exact performance implications of using multiprocessor nodes depend on many factors, including the speed of the external and internal communication networks in a cluster of SMPs, the contention effects that are experienced when multiple processors within a node try to communicate with other processors (either local or remote), and the communication pattern of the program being executed.
With regard to model evaluation cost, the 11 h and 15 min of processor time consumed by actually running the Jacobi Iteration program on Perseus were simulated in just under 10 min by our prototype (i.e. unoptimized) PEVPM implementation running on just one processor of Perseus. This comparison shows that the PEVPM simulated the Jacobi program on Perseus at about 67.5 times its actual execution speed. Similar “speedups” were achieved in the simulation of the Jacobi Iteration code on Orion and the APAC NF, although strict comparisons are impossible to make, because those simulations were also run on Perseus rather than on Orion or the APAC NF themselves. Likewise, excellent speedups were achieved for the simulations of the Bag of Tasks and fast Fourier transform codes discussed next.
5.2. Bag of Tasks
Our Bag of Tasks code uses a pseudorandom input stream to generate a large number of independent subproblems for solution using a task-farming approach. A good feature of task-farming is that it is inherently load-balanced, as slaves are simply given new tasks whenever they become idle. If computation and communication times for tasks were exact (i.e. not stochastic), this would imply a unique program execution sequence and the Bag of Tasks code would execute deterministically. However, due to the variations in processing and communication times that occur in reality (caused mainly by contention), the Bag of Tasks code executes non-deterministically.
Therefore, accurate bottom-up performance prediction of the Bag of Tasks code can only be achieved by accounting for the effects of all types of contention on a microscopic scale. The main type of contention that is encountered when processing the Bag of Tasks code is gather contention. This occurs when multiple slaves all try to communicate with the master simultaneously, thus flooding its communication link(s) and hence limiting program scalability.
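The scalability limit imposed by gather contention can be illustrated with a toy model of a task farm. This is not the Bag of Tasks code or a PEVPM model: it assumes constant compute and communication times, identical tasks, and a master that handles one exchange at a time, and all names and parameter values are illustrative.

```python
import heapq


def farm_time(num_slaves, num_tasks, compute=1.0, comm=0.05, serialize=True):
    """Completion time of a task farm with an optionally serialized master."""
    # Each heap entry is the time at which a slave next asks for work.
    ready = [0.0] * num_slaves
    heapq.heapify(ready)
    master_free = 0.0
    finish = 0.0
    for _ in range(num_tasks):
        asked = heapq.heappop(ready)
        # With serialization, a request waits until the master link is idle.
        start = max(asked, master_free) if serialize else asked
        handed_out = start + comm
        master_free = handed_out
        done = handed_out + compute
        finish = max(finish, done)
        heapq.heappush(ready, done)
    return finish


def speedup(num_slaves, num_tasks=1000, **kwargs):
    """Speedup over a single slave under the same assumptions."""
    return farm_time(1, num_tasks, **kwargs) / farm_time(num_slaves, num_tasks, **kwargs)


for n in (4, 16, 64):
    print(n, round(speedup(n), 1), round(speedup(n, serialize=False), 1))
```

With these parameters the serialized farm cannot exceed a speedup of (compute + comm) / comm = 21, however many slaves are added, whereas the variant that ignores contention at the master keeps scaling. The same qualitative difference separates a model that accounts for gather contention from one that does not.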
As shown by the results in Fig. 3(d), the PEVPM makes very good predictions about the performance of the Bag of Tasks code when complete distributions of message-passing performance are used as input. This is in stark contrast to the predictions that were made with a restricted PEVPM evaluation engine that did not account for gather contention. While these restricted-PEVPM predictions closely match the normal PEVPM predictions for a small number of processors, they completely fail to foresee the performance saturation that ensues when larger numbers of processors are used. This is because, in the restricted-PEVPM model, there is no penalty associated with many processes trying to communicate simultaneously with the master.
5.3. Fast Fourier transform
We used the typical parallelisation strategy for computing a multidimensional fast Fourier transform [10], which involves performing a succession of one-dimensional fast Fourier transforms on each line of data that is parallel to each axis in the grid of samples. The data grid must be redistributed amongst participating processes for the set of calculations along each dimension. For a two-dimensional fast Fourier transform, this requires a complete transposition of the data grid using the MPI_Alltoall operation followed by some local data redistribution. Because this process is (relatively) simple to describe and program and because of the highly structured and completely connected communication pattern of the MPI_Alltoall operation, it is often used as the canonical example of a parallel program with regular and global communications. Regular–global codes are extremely demanding on communication networks, which makes it difficult to accurately model their performance characteristics. Previous performance modeling techniques do not adequately account for the microscopic (but very significant when summed up) precedence relationships (due to contention) that pervade regular–global codes.
Typically, those techniques only produce predictions like those shown by the dotted black lines in Fig. 3(e and f). It can be seen that the proper PEVPM modeling technique, however, which uses complete performance distributions for communication submodels, produces very good performance estimates.
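The decomposition of a two-dimensional transform into one-dimensional transforms and a transposition, as used in Section 5.3, can be checked with a short serial sketch. It uses a naive discrete Fourier transform rather than a fast algorithm, and the in-memory transpose stands in for the redistribution that the parallel code performs with MPI_Alltoall; all function names are illustrative.

```python
import cmath


def dft(line):
    """Naive O(n^2) one-dimensional discrete Fourier transform."""
    n = len(line)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(line))
            for k in range(n)]


def transpose(grid):
    """Swap rows and columns (the step that needs global communication)."""
    return [list(col) for col in zip(*grid)]


def dft2_by_rows(grid):
    """2D transform as row transforms, a transpose, and row transforms again."""
    rows_done = [dft(row) for row in grid]
    cols_done = [dft(row) for row in transpose(rows_done)]
    return transpose(cols_done)


def dft2_direct(grid):
    """Direct evaluation of the 2D definition, used only as a check."""
    n, m = len(grid), len(grid[0])
    return [[sum(grid[a][b] * cmath.exp(-2j * cmath.pi * (u * a / n + v * b / m))
                 for a in range(n) for b in range(m))
             for v in range(m)]
            for u in range(n)]


grid = [[1, 2, 3], [4, 5, 6]]
fast, slow = dft2_by_rows(grid), dft2_direct(grid)
error = max(abs(fast[u][v] - slow[u][v]) for u in range(2) for v in range(3))
print(error < 1e-9)
```

Every process needs data from every other process during the transposition, which is why this step produces the regular–global communication pattern discussed above.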
6. Conclusions and further work
We have presented a novel performance modeling system for message-passing parallel programs. Our modeling technique was built upon an abstract view of the general structure of message-passing programs, in which execution proceeds as segments of independent serial computation, connected by individual communication events. We presented generalized low-level models for these serial segments and communication operations, explained how to obtain specific versions of these submodels for a particular code, and showed how to combine those specific submodels into an overall performance model. Particular attention was paid to the effects that non-determinism caused by network contention or data dependencies can have on program performance. Monte Carlo techniques were used to sample from probabilistic performance models in order to accurately model the performance implications of these effects. We have defined a language consisting of a set of performance directives that allows an overall performance model for a particular code to be easily constructed. Once a performance model has been built, we showed that it can be automatically evaluated using our Performance Evaluating Virtual Parallel Machine.
For a low-to-moderate evaluation cost, this virtual machine can obtain many highly accurate performance traces of a simulated message-passing code, running on either a real or hypothetical machine. Because the PEVPM maintains a log of the execution structure of the code that it is modeling, it can automatically highlight important properties of concern to programmers, such as load imbalance and message-passing overhead. It can also detect some program correctness problems such as deadlock. The case studies we presented highlighted how inaccurate performance predictions will be if contention effects are not properly accounted for, as is typical of previous performance estimation techniques. In contrast to this, the case studies clearly demonstrated the superior accuracy of PEVPM predictions. In addition, the case studies also demonstrated the PEVPM methodology to be general, flexible and cost-effective, making it a useful technique for modeling the performance of message-passing programs.
Our current PEVPM implementation is just a prototype. Several of the steps in the modeling process must be done by hand. While these are all fairly straightforward, they could be quite time-consuming for large programs. We are therefore aiming to automate most or all of these processes in future versions of the PEVPM, which we plan to demonstrate in operation on more complex real-world parallel programs.
Acknowledgements
This work was supported by the Research Data Networks and the Advanced Computational Systems Cooperative Research Centres, the University of Adelaide, and the University of Wales at Bangor. We thank Francis Vaughan and Ken Hawick for useful discussions.
References
[1] A. Aho, J. Hopcroft, J. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
[2] J.R. Allen, K. Kennedy, Automatic loop interchange, in: Proceedings of the SIGPLAN Symposium on Compiler Construction, 1984.
[3] S. Browne, J. Dongarra, K. London, Review of Performance Analysis Tools for MPI Parallel Programs, Technical Report, University of Tennessee, Department of Computer Science, December 1997.
[4] F. Cappello, O. Richard, D. Etiemble, Understanding performance of SMP clusters running MPI programs, Future Gen. Comput. Syst. 17 (6) (2001) 711–720.
[5] Compaq Computer Corporation, The Compaq AlphaServer SC supercomputer, available from http://www.compaq.com/hpc/systems/sys_sc.html.
[6] M.E. Crovella, Performance prediction and tuning of parallel programs, Ph.D. Thesis, University of Rochester, 1994.
[7] H. Gautama, A probabilistic approach to the analysis of program execution time, Master’s Thesis, Delft University of Technology, Information Technology and Systems, 1998.
[8] D.A. Grove, Performance modelling of message-passing parallel programs, Ph.D. Thesis, University of Adelaide, Department of Computer Science, January 2003.
[9] D.A. Grove, P.D. Coddington, Precise MPI performance measurement using MPIBench, in: Proceedings of HPC Asia, September 2001.
[10] G. Gusciora, SP parallel programming workshop: 2D FFT example, available from http://www.mhpcc.edu/training/workshop/mpi/exercise.html.
[11] J.L. Gustafson, D. Heller, R. Todi, J. Hsieh, Cluster performance: SMP versus uniprocessor nodes, in: Proceedings of Supercomputing, November 1999.
[12] J. Hartmanis, R.E. Stearns, On the computational complexity of algorithms, Trans. AMS 117 (1965) 285–306.
[13] K.A. Hawick, D.A. Grove, P.D. Coddington, M.A. Buntine, Commodity cluster computing for computational chemistry, Internet J. Chem. 3 (2000) (article 4).
[14] C.A.R. Hoare, Communicating sequential processes, Commun. ACM 21 (8) (1978) 666–677.
[15] J. Hsieh, Design choices for a cost-effective, high-performance Beowulf-cluster, Dell Power Solutions 3 (2000).
[16] M.H. Kalos, P.A. Whitlock, Monte Carlo Methods, vol. I. Basics, J. Wiley Publishers, New York, 1986.
[17] D.E. Knuth, Big omicron and big omega and big theta, ACM SIGACT News 8 (2) (1976) 18–23.
[18] L. Lamport, Time, clocks, and the ordering of events in a distributed system, Commun. ACM 21 (7) (1978) 558–565.
[19] P. Mehra, M. Gower, M. Bass, Automated modeling of message-passing systems, in: Proceedings of International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, January 1994, pp. 187–192.
[20] Sun Microsystems, Sun Technical Compute Farm, http://www.sun.com/desktop/suntcf/.
[21] C. Schaubschläger, Automatic testing of nondeterministic programs in message passing systems, Master’s Thesis, Johannes Kepler University Linz, Department for Computer Graphics and Parallel Processing, 2000.
[22] D. Skillicorn, Foundations of Parallel Programming, Cambridge University Press, Cambridge, 1994, pp. 123–169 (Chapter 8).
[23] A.J.C. van Gemund, Compiling performance models from parallel programs, in: Proceedings of the Eighth ACM International Conference on Supercomputing, July 1994, pp. 304–312.
Duncan Grove holds a Bachelor of Computer Systems Engineering and a PhD in Computer Science from the University of Adelaide in South Australia. He currently works in the Information Networks Division of the Australian government’s Defence Science and Technology Organisation. His research interests include protocol and performance analyses of high-speed networks for parallel computing, computer security, wireless networks and mobile networks.
Paul Coddington is a Senior Lecturer in Computer Science at the University of Adelaide. He leads the Distributed and High-Performance Computing research group, and is deputy director of the South Australian Partnership for Advanced Computing (SAPAC), which supports high-performance computing and grid computing in the state. Dr. Coddington received a degree in physics from the University of Western Australia and PhD in computational physics from the University of Southampton, where he started using the ICL Distributed Array Processor in 1984. He has worked as a research scientist with the Caltech Concurrent Computation Project and at the Northeast Parallel Architectures Center at Syracuse University. His research interests are in high-performance computing, parallel algorithms, grid computing, data grids and computational science.