Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
Deriving analytical models from a limited number of runs*

R.M. Badia a, G. Rodriguez a, and J. Labarta a

a CEPBA-IBM Research Institute, Technical University of Catalonia, c/Jordi Girona 1-3, 08034 Barcelona, SPAIN

*This work has been partially funded by the European Commission under contract number DAMIEN IST-2000-25406 and by the Ministry of Science and Technology of Spain under CICYT TIC2001-0995-CO2-01.

We describe a methodology to derive a simple characterization of a parallel program and models of its performance on a target architecture. Our approach starts from an instrumented run of the program to obtain a trace. A simple linear model of the performance of the application as a function of architectural parameters is then derived by fitting the results of a batch of simulations based on that trace. The approach, while very simple, is able to derive analytic models of execution time as a function of parameters such as processor speed, network latency or bandwidth, without even looking at the application source. It shows how it is possible to extract from one trace detailed information about the intrinsic characteristics of a program. A relevant feature of this approach is that a natural interpretation can be given to the different factors in the model. To derive models of other factors, such as the number of processors, several traces are obtained and the values obtained with them extrapolated. In this way, very few actual runs of the application are needed to get a broad characterization of its behavior and fair estimates of the achievable performance on other hypothetical machine configurations.

1. MOTIVATION AND GOAL

Obtaining models of the performance of a parallel application is extremely useful for a broad range of purposes. Models can be used to determine whether a given platform achieves the performance reasonably expected for a given program, or to identify why it is not achieved. Predictive models can be used in purchasing processes or scheduling algorithms. Models based on simulations can be used to explore the parameter space of a design, but they can be time consuming and lack the abstraction and possibility of interpretation that analytic models provide. Deriving analytic models for real parallel programs is nevertheless a significant effort [6, 7, 8]. It requires a deep understanding of the program, a lot of intuition about how and where to introduce approximations, and a detailed understanding of the behavior of the parallel platform and of how it interacts with the application's demand for resources.

In this paper we are interested in the possibility of deriving analytical characterizations of an application and models of its performance without requiring an understanding of the application, while minimizing the effort needed to derive the model. The study aims at extending the analysis capabilities of the performance analysis tools DIMEMAS [1] and PARAVER [2], developed at CEPBA. DIMEMAS is an event-driven simulator that predicts the behavior of MPI
applications. PARAVER is a performance analysis and visualization tool that supports the detailed analysis not only of the results of the DIMEMAS simulations but also of the behavior of real executions of MPI, OpenMP, MPI+OpenMP and other kinds of applications. Both DIMEMAS and PARAVER are based on post-mortem tracefiles. Trace-based systems do support very detailed analysis, at the expense of having to handle the large amounts of data stored in the trace files. This poses some problems in the analysis process and raises some philosophical questions that we briefly describe in the next paragraphs.

Compared to analytic models, simulations are slow, as a large program characterization (the tracefile) has to be convolved with the architecture model implemented by the simulator. Furthermore, the point-wise evaluation is slow, and it is difficult to interpret its sensitivity to a given parameter. Analytic models are based on a concise characterization of the program. They also have the advantage that they are more amenable to a high-level interpretation of the parameters in the model, and the sensitivity to parameters of the architecture can be derived analytically.

A first question that arises when looking at all the data in a tracefile is: how much of it is real information? In this work we used traces from three applications whose respective sizes were 1.7, 4 and 10.2 MB. The question is: can such a large characterization of the program behavior be condensed to a few numbers and still capture the inherent behavior of the program?

A second issue we want to address refers to the way the characteristics of the basic components in the architectural model propagate to the application performance. Some components of the architecture model in DIMEMAS are linear, for example the communication time (T = L + S/BW), which is proportional to the latency and to the message size, and proportional to the inverse of the bandwidth. A linear coefficient of relative CPU speed is also used to model target processors different from the one where the trace was obtained. Other components of the DIMEMAS model are highly non-linear, for example the delays caused by blocking receives or by resource contention in the network. The question is how these two types of components get reflected in the final application execution time. Does it vary linearly with the inverse of the bandwidth? If so, with which proportionality factor?

Thus, our objective in this work is to identify a method to evolve the results of a batch of simulations (from a single trace of a single real execution) into an analytic model such as the one shown in equation (1), where the factors (BW, L, CPUspeed) characterize the target machine and should be independently measured or estimated in order to perform a prediction. BW is the bandwidth of the interconnection network, L is the latency, and CPUspeed is the relative speed of the processors in the target machine with respect to those in the machine where the trace was obtained.
T = f(BW) + g(L) + h(CPUspeed)    (1)
Our target is to identify whether a linear expression for each of the terms in equation (1) is adequate. If so, the coefficients of each term would characterize the application and could be given abstract interpretations. As the trace characterizes one instantiation of problem size and number of processors, the above model does not include these factors. In order to include them, different traces would be needed and the model modified accordingly. In section 2 we describe the methodology. In section 3 we describe how the methodology is applied and the results we obtained. Finally, section 4 concludes the paper.
2. METHODOLOGY

For each application that we want to analyze, a real execution is performed to extract the tracefile that feeds the simulator (DIMEMAS). The traces are obtained with the instrumentation package MPIDtrace [9], which relies on DPCL [10] to inject probes into a binary at load time. By using this dynamic instrumentation technology, the tracing tool can instrument production binaries without requiring any access to source code. The tracefile obtained contains information on the communication requests issued by each process as well as on the CPU demands between those requests. With MPIDtrace it is possible to obtain a trace of a run with more processes than available processors and still get very accurate predictions, as shown in [11, 12, 13]. Other previously proposed methods needed the source code and knowledge of the program structure to be able to instrument the application [14, 15].

DIMEMAS implements a simple abstract model of a parallel platform. The network is modeled as a set of buses and a set of links connecting each node to them. These parameters of the simulator are used to model in an abstract way the network bisection (number of buses) and the injection mechanism (number of links and whether they are half or full duplex). The communication time is modeled with the simple linear expression:

T = L + S/BW    (2)
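As a worked example of equation (2), the following minimal sketch evaluates the model for a concrete message. The function and variable names are our own illustration, not part of DIMEMAS.

```python
# Minimal sketch of the point-to-point communication model of equation (2);
# names and the example values are our own, purely illustrative choices.

def comm_time(size_bytes: float, latency_s: float, bw_bytes_per_s: float) -> float:
    """Transfer time: constant latency plus message size over bandwidth."""
    return latency_s + size_bytes / bw_bytes_per_s

# Example: a 1 MB message with 25 us latency on a 100 MB/s network takes
# roughly 25e-6 + 1e6 / 100e6 = 0.010025 s.
print(comm_time(1e6, 25e-6, 100e6))
```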
The latency (L) term is in this model constant, independent of the message size (S), and uses the CPU. The transfer time is inversely proportional to the bandwidth (BW) and does use resources of the interconnection network during the whole transfer (one output link, one bus and one input link). Blocking as well as non-blocking MPI semantics are implemented by the simulator. An extremely important consideration about the model implemented by DIMEMAS is that it is an abstraction not only of the hardware of the target machine, but also of all the software layers in the MPI implementation.

In this work, a batch of simulations of the same tracefile is launched, randomly varying for each simulation the architectural parameters for which we want to characterize the application (i.e., latency and network bandwidth). From the results of the simulations a linear regression is performed that allows us to extract a linear model for the elapsed time of the application against the architectural parameters. The coefficients in the model become the summarized characterization of the application.

2.1. Applications

For this work the following applications have been used: NAS BT [3], Sweep3D [5] and RNAfold [4]. All traces were extracted on IBM SP2 machines, with different processor counts ranging from 8 to 49 processors. For the NAS BT only ten iterations were traced, since the benchmark is quite large.

3. RESULTS OF THE EXPERIMENTS

Three kinds of experiments were performed, changing the parameters that are taken into account in the model: experiments varying latency and bandwidth; experiments varying latency, bandwidth and CPU speed; and experiments varying latency, bandwidth and number of processors. In the simulator we considered a simple network model with unlimited buses but a single half-duplex link between the nodes and the network. For each experiment we report not only the results but also the experience.
3.1. Latency and bandwidth

The objective in this set of experiments was to obtain, for each application, a linear model that follows an equation like:

T = α + β·L + γ/BW,    (3)

where L is the latency and BW is the bandwidth.
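Fitting equation (3) to the batch of simulation results is an ordinary least-squares problem. The sketch below is our own illustration of this step, not the authors' tooling: `run_simulation` is a hypothetical stand-in for one DIMEMAS run on the tracefile, implemented here as a synthetic linear response only so that the code runs end to end.

```python
# Sketch of the fitting step of Section 2: sample random (latency, bandwidth)
# points, obtain a simulated elapsed time for each, and fit equation (3) by
# ordinary least squares. All names are our own.
import numpy as np

rng = np.random.default_rng(0)
n_runs = 50
latency   = rng.uniform(2e-6, 100e-3, n_runs)   # seconds, 2 us to 100 ms (Sec. 3.2)
bandwidth = rng.uniform(80e6, 300e6, n_runs)    # bytes/s, reduced range (Sec. 3.3)

def run_simulation(lat: float, bw: float) -> float:
    # Placeholder for one simulator invocation on the tracefile; here a
    # synthetic linear response is returned purely so the sketch executes.
    return 3.3 + 125.0 * lat + 3.0e7 / bw

T = np.array([run_simulation(l, b) for l, b in zip(latency, bandwidth)])

# Design matrix with columns [1, L, 1/BW]; the fitted coefficients are the
# (alpha, beta, gamma) characterization of the application.
X = np.column_stack([np.ones(n_runs), latency, 1.0 / bandwidth])
coef, *_ = np.linalg.lstsq(X, T, rcond=None)
alpha, beta, gamma = coef
max_rel_err = np.max(np.abs(X @ coef - T) / T)  # fit quality, as reported later
print(alpha, beta, gamma, max_rel_err)
```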
This simple expression is quite similar to equation (2), and an interpretation based on that similarity can be given to the coefficients. α can be interpreted as the execution time of the application under ideal, instantaneous communications. This factor represents the execution time of the application under such ideal communications and accounts for load imbalances and dependence chains (i.e., pipelines). Coefficient β can be interpreted as the number of message sends and receptions in the critical path of the application with infinite bandwidth. Parameter γ can be interpreted as the total number of bytes whose transfer is in the critical path. The interesting thing about the above interpretations is that the coefficients represent a global measure for the whole run and across all processors. The values need not match the actual number of messages or bytes sent. The ratios between these coefficients and the total numbers can be good indicators of general characteristics of the application, as described by the following indices:

Parallelization efficiency = Total CPU time / (P · α)    (4)
Here, Total CPU time is the sum over all processes of all useful computation bursts in the tracefile. Divided by P, the number of processors, it gives the average value per processor; divided further by α, it should result in a value close to 1. Values below 1 indicate a poor parallelization due to load imbalance or to dependence chains.

Message overhead = P · β / (2 · Total Messages Sent)    (5)
To derive this index we considered that, under infinite bandwidth, messages in the critical path pay the latency overhead twice (at the send and at the receive), and we also assumed that all processors send the same number of messages. Even with those assumptions, this index is a useful estimator of the fraction of messages sent that contribute to the critical path. Values higher than 1 indicate the existence of dependence chains, as for example in pipelined executions. Values close to 1 essentially indicate that each processor pays the overhead cost of all the messages it sends and receives. Values below 1 indicate that the overhead of some messages is not paid by the application. This may happen when some of the startup overheads take place in periods preceding the reception of a message that will block the process. Thus, this is not as positive as it might look, because it actually indicates inefficiencies due to load imbalance. We should also be aware that this is a global number; local behaviors in some parts of the program may be compensated by other parts.

Transfers overlap = P · γ / Total Bytes Sent    (6)
The result of this calculation represents the fraction of bytes sent that contribute to the critical path (assuming a uniform distribution of the messages sent by the different processors). Values less than 1 indicate overlap between communication and computation. Values above 1 indicate dependence chains or resource contention. Values close to 1 indicate concurrency in communications but no overlap with computation. Again, this is a global number that may hide local behaviors.

3.2. Initial results

An initial set of simulations was obtained by considering the tracefiles of the three applications with 16 processors. The range used in the simulations was between 2 µs and 100 ms for the latency, and between 2 MB/s and 800 MB/s for the bandwidth. From those results, we performed a linear regression for each application, where the output variable was the elapsed time predicted by DIMEMAS and the input variables were the latency and the bandwidth. The results from those regressions were not very good. The confidence interval of the regression for parameter β in the NAS BT and Sweep3D models included 0, so those two values should be considered null. For RNAfold the confidence interval was also very large, although it did not include 0. A more important fact, however, was that the relative error of the model for some observations was very large (39% for the NAS BT and 54.3% for the Sweep3D).

To identify the problem, we analyzed the model values for the NAS BT against the inverse of the bandwidth. Although on visual inspection the behavior of the application seemed to fit a linear regression with the inverse of the bandwidth, a detailed observation showed that different trend lines could fit the application behavior. For example, for large values of the bandwidth (above 5 MB/s), a very different trend line appears. This was also observed in the plot of the relative errors of the model against the bandwidth: for low values of bandwidth the model had low relative errors, but for values of bandwidth above 25 MB/s the relative error was above 20% and growing. The actual results of the simulation over such a wide dynamic range of the architectural parameters match a piecewise linear model. We therefore selected the range of values between 80 and 300 MB/s, which is also more representative of current and future platforms, to repeat the experiments.

3.3. Results with reduced bandwidth range

The results of this second batch of simulations (see the results of table 2 for 16 processors) are accurately fitted by the linear model and the error is very small. With these values, we calculated the metrics defined before and obtained the results shown in table 1. The parallelization efficiency parameter gives good values for the NAS BT and RNAfold applications, which are well parallelized. The lower value for Sweep3D reflects the fact that this application has a dependency chain between the different processes which reduces its parallelism. Regarding the message overhead, a value slightly larger than 1 is obtained for Sweep3D.
Table 1
Metrics evaluation

Application      Parallelization efficiency   Message overhead   Transfers overlap
NAS BT Class A   0.994                        0.26               2.82
Sweep3D          0.792                        1.24               0.33
RNAfold          0.926                        1.03               0.02
This is again due to the dependency chain present in this application, which is in fact composed of a series of point-to-point communications, each depending on the one of a neighbor process. The transfers overlap, on the other hand, has a large value for the NAS BT application. Analyzing the PARAVER tracefile of a simulation of that application, we observed that at the end of each iteration there is a burst of overlapped point-to-point communications between all processes. As each process is sending several messages at the same time (six in the 16-processor case), network resource contention arises.
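As an illustration, the three indices of equations (4)-(6) can be computed directly from the fitted coefficients and two aggregate counts taken from the trace. In the sketch below the function names are our own; the coefficients are the NAS BT 16-processor values from table 2, while the aggregate trace counts are invented purely for illustration.

```python
# Indices of equations (4)-(6) from the fitted coefficients and aggregate
# counts from the tracefile. All names are our own; the aggregate values in
# the usage example are invented, not taken from the real traces.

def parallelization_efficiency(total_cpu_time: float, P: int, alpha: float) -> float:
    # Equation (4): close to 1 for a well-parallelized run.
    return total_cpu_time / (P * alpha)

def message_overhead(P: int, beta: float, total_messages_sent: float) -> float:
    # Equation (5): each critical-path message pays the latency twice.
    return (P * beta) / (2.0 * total_messages_sent)

def transfers_overlap(P: int, gamma: float, total_bytes_sent: float) -> float:
    # Equation (6): fraction of the bytes sent that lie in the critical path
    # (gamma and total_bytes_sent must use the same units).
    return (P * gamma) / total_bytes_sent

# Illustrative usage with the NAS BT (16 processors) coefficients of table 2
# and made-up trace aggregates.
print(parallelization_efficiency(total_cpu_time=52.0, P=16, alpha=3.27))
print(message_overhead(P=16, beta=124.54, total_messages_sent=3800.0))
print(transfers_overlap(P=16, gamma=29.55, total_bytes_sent=170.0))
```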
3.4. Latency, bandwidth and CPU speed

For the latency, bandwidth and CPU speed experiments, the inverse process was applied. The CPU speed parameter is the ratio of the CPU speed of the architecture for which we want to predict to the CPU speed of the architecture where the trace was generated. A value of 1 for this parameter means that we are predicting for an architecture with the same CPU speed; a value of 2 means that we are predicting for a machine with a CPU speed twice that of the instrumented architecture. Instead of regressing the results of the simulations, we made the hypothesis that the equation obtained before could be reformulated in the following way:

T = α/CPUspeed + β·L + γ/BW    (7)
Thus, in this case the data obtained from the simulations was used to validate the above model. The range of values for the CPU speed parameter was between 0.5 and 10. The results of these experiments were very satisfactory, since all applications fitted the extended equation model with a maximum relative error between 7% and 13%.
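A minimal sketch of this validation step, under the assumption that the (α, β, γ) coefficients from the fit of equation (3) are already available; the function names are our own.

```python
# Validation sketch for equation (7): predict with the extended model and
# compare against simulated elapsed times for the same parameter points.
import numpy as np

def predict(alpha: float, beta: float, gamma: float,
            latency: float, bandwidth: float, cpu_speed: float) -> float:
    # Only the computation term alpha scales with the relative CPU speed.
    return alpha / cpu_speed + beta * latency + gamma / bandwidth

def max_relative_error(predicted, simulated) -> float:
    # The paper reports this figure between 7% and 13% for all applications.
    p, s = np.asarray(predicted, float), np.asarray(simulated, float)
    return float(np.max(np.abs(p - s) / s))
```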
3.5. Latency, bandwidth and number of processors

The DIMEMAS simulations do require a trace obtained by running the application with the target number of processes (although the instrumented run can be obtained on fewer processors by loading each of them with several processes). Our final objective is to extend the above models to take the number of processors into account as a parameter. The approach is to generate several traces with different processor counts, apply the previously described model to each of them, and extrapolate the values of α, β and γ for the desired number of processors. Table 2 shows the model parameters obtained for each of the three applications and different numbers of processors.

Table 2
Parameters obtained for different processor counts

Application      # Procs   α       β         γ       R²       Max err. (%)
NAS BT Class A   9         6.53    289.74    21.8    0.9704   0.32
NAS BT Class A   16        3.27    124.54    29.55   0.9975   0.36
NAS BT Class A   25        1.57    289.02    37.63   0.9982   0.70
NAS BT Class A   36        0.94    323.38    40.37   0.9973   1.67
Sweep3D          8         15.46   2089.92   4.13    0.9989   0.04
Sweep3D          16        8.84    2425.34   9.45    0.9994   0.05
Sweep3D          32        4.91    2930.30   19.10   0.9996   0.13
RNAfold          8         11.88   3671.77   0.18    0.9999   0.02
RNAfold          16        8.26    3770.68   0.26    0.9996   0.09
RNAfold          28        4.07    3311.30   0.17    0.9999   0.03
Table 3
Execution time prediction based on model extrapolation

Application      Bandwidth (MB/s)   Simulation time   Model prediction   % Error
NAS BT Class A   80                 1.22              1.31               -8.17
NAS BT Class A   100                1.10              1.18               -7.31
NAS BT Class A   300                0.83              0.83               0.12
Sweep3D          80                 4.22              4.43               -5.04
Sweep3D          100                4.17              4.35               -5.26
Sweep3D          300                3.92              4.16               -6.09
From the results obtained we can see that, for each application, parameter α is fairly proportional to the inverse of the number of processors. A first approximation for γ would be to consider it linear in the number of processors. The slope of this variation depends on the application and thus a few traces are required to estimate it. Parameter β also seems to have a somewhat linear behavior, except for the point corresponding to 16 processors in the NAS BT application. In this trace, many small imbalances in the CPU consumption of each thread result in processes blocking for a short time at the receives. This imbalance is small compared to the duration of the CPU runs, but sufficient to allow several latency overheads to be absorbed by the blocking time. Because of the small relative significance of the imbalance, the model is actually not very sensitive to the β value. Based on the above considerations, we extrapolated the values of α, β and γ for a number of processors larger than those included in table 2 and predicted the execution time. The results are compared in table 3 with the prediction obtained with DIMEMAS.
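A sketch of this extrapolation under our reading of the text: α is fitted as proportional to 1/P, while β and γ are treated as linear in P. The data below are the Sweep3D rows of table 2; the target processor count of 64 is our own illustrative choice, not necessarily the one used in the paper.

```python
# Extrapolation sketch for Section 3.5: fit alpha ~ a/P, and beta, gamma as
# linear in P, from the traced processor counts, then predict the coefficients
# for a larger, untraced P.
import numpy as np

P     = np.array([8.0, 16.0, 32.0])          # Sweep3D processor counts (table 2)
alpha = np.array([15.46, 8.84, 4.91])
beta  = np.array([2089.92, 2425.34, 2930.30])
gamma = np.array([4.13, 9.45, 19.10])

# alpha ~ a / P: one-coefficient least-squares fit against 1/P.
a = np.linalg.lstsq((1.0 / P)[:, None], alpha, rcond=None)[0][0]
# beta and gamma ~ linear in P.
beta_line  = np.polyfit(P, beta, 1)
gamma_line = np.polyfit(P, gamma, 1)

def coefficients_for(p: float):
    """Extrapolated (alpha, beta, gamma) for an untraced processor count."""
    return a / p, np.polyval(beta_line, p), np.polyval(gamma_line, p)

alpha64, beta64, gamma64 = coefficients_for(64.0)
# The predicted elapsed time for a target (L, BW) then follows equation (3):
# T = alpha64 + beta64 * L + gamma64 / BW.
print(alpha64, beta64, gamma64)
```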
4. CONCLUSIONS

In this work we have introduced an approach to a methodology that allows obtaining abstract, global and interpretable information about the behavior of a parallel application from a few (or even a single) real runs. Furthermore, this analysis can be performed without looking at the source code. We consider the initial results very interesting: they show the importance of, and the need for, combining simple modeling methodologies with more statistical analysis techniques in the performance modeling of parallel applications. As future work, we foresee the extension of this work to wider ranges of the input parameters by means of piecewise linear models. A critical point there would be how to detect and interpret the knots, since they may allow detection of the actual cause of a problem (due to bottleneck changes). We also want to automate the process.
REFERENCES

[1] Dimemas: performance prediction for message passing applications, http://www.cepba.upc.es/dimemas/
[2] Paraver: performance visualization and analysis, http://www.cepba.upc.es/paraver/
[3] David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, Maurice Yarrow, "The NAS Parallel Benchmarks 2.0", The International Journal of Supercomputer Applications, 1995.
[4] I.L. Hofacker, W. Fontana, L.S. Bonhoeffer, M. Tacker and P. Schuster, "Vienna RNA Package", http://www.tbi.univie.ac.at/ivo/RNA, October 2002.
[5] "The ASCI sweep3d Benchmark Code", http://www.llnl.gov/asci_benchmarks/asci/limited/sweep3d/asci_sweep3d.html
[6] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation", in Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.
[7] Mark M. Mathis, Darren J. Kerbyson, Adolfy Hoisie, "A Performance Model of non-Deterministic Particle Transport on Large-Scale Systems", in Proc. Int. Conf. on Computational Science (ICCS), Melbourne, Australia, June 2003.
[8] Adeline Jacquet, Vincent Janot, Clement Leung, Guang R. Gao, R. Govindarajan, Thomas L. Sterling, "An Executable Analytical Performance Evaluation Approach for Early Performance Prediction", in Proc. of IPDPS 2003.
[9] MPIDtrace manual, http://www.cepba.upc.es/dimemas/manual_i.htm
[10] L. DeRose, "The dynamic probe class library: an infrastructure for developing instrumentation for performance tools", in International Parallel and Distributed Processing Symposium, April 2001.
[11] Sergi Girona and Jesús Labarta, "Sensitivity of Performance Prediction of Message Passing Programs", 1999 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Monte Carlo Resort, Las Vegas, Nevada, USA, July 1999.
[12] Sergi Girona, Jesús Labarta and Rosa M. Badia, "Validation of Dimemas communication model for MPI collective operations", EuroPVM/MPI'2000, Balatonfüred, Lake Balaton, Hungary, September 2000.
[13] Allan Snavely, Laura Carrington, Nicole Wolter, Jesús Labarta, Rosa M. Badia, Avi Purkayastha, "A framework for performance modeling and prediction", SC 2002.
[14] P. Mehra, M. Gower, M.A. Bass, "Automated modeling of message-passing programs", MASCOTS'94, pp. 187-192, January 1994.
[15] P. Mehra, C. Schulbach, J.C. Yan, "A comparison of two model-based performance-prediction techniques for message-passing parallel programs", Sigmetrics'94, pp. 181-190, May 1994.