Graph based characterization of distributed applications


Future Generation Computer Systems 16 (2000) 597–607

Gabriele Kotsis∗, Markus Braun

Institute of Applied Computer Science and Information Systems, University of Vienna, Lenaugasse 2/8, A-1080 Wien, Austria

∗ Corresponding author. Tel.: +43-1-408-63-66-14; fax: +43-1-408-04-50. E-mail addresses: [email protected] (G. Kotsis), [email protected] (M. Braun).

Abstract

A critical task in the development and execution of distributed applications is to identify the potential degree of parallelism contained in the application. This information is needed in the design of applications, in order to pursue only promising algorithmic ideas for implementation, but also during the execution of existing applications, for resource allocation and scheduling decisions. In this paper, we present analytical techniques to derive the potential degree of parallelism of distributed applications described by means of timed structural parallelism graphs (TSPGs). A TSPG allows the specification of a distributed application in terms of its components, the activation and dependence relations among the components, and histogram/interval-based estimates of the execution times of the components. Based on an analysis of paths through the TSPG (corresponding to paths in the execution) and by applying interval arithmetic, we are able to derive from the TSPG model a set of potential parallelism profiles. From these profiles, further performance indices such as the average degree of parallelism as well as the hypothetical speedup can be derived. In this paper we focus on an evaluation of the analysis technique with respect to its computational complexity, and we validate the proposed approach by a comparison with results obtained from simulation. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Graph based characterization; Distributed applications; Computational complexity

1. Introduction

Performance evaluation studies should be an integral part of the design of (parallel and distributed) applications in order to reduce development and performance debugging costs [1]. While measurement techniques can be the method of choice for evaluating existing systems, modeling approaches are necessary when evaluating systems in earlier design stages. Various performance prediction frameworks for evaluating parallel and distributed systems have been developed, in which aspects of workload (program), architecture, and mapping can be specified and evaluated in a structured and flexible way. Examples are PSEE [2], the PAPS tool set [3], the PRM approach [4], the approach by Mitschele-Thiel [5,6], and the GRADE environment [7]. In most of these approaches, graph models have been used to characterize the application. In previous work, we have introduced a new graph model, the structural parallelism graph (SPG) [8,9], which provides two major advantages over the traditional task graph: (1) it is more flexible with respect to the level of granularity at which the distributed application is represented, and (2) communication restrictions are relaxed, because a task can also communicate while it is still processing.



Fig. 1. Example of a timed structural parallelism graph (TSPG).

In [10], the SPG concept has been extended to include quantitative information on the computation and communication demands of the application; this extended model is called a timed structural parallelism graph (TSPG), and an analysis technique for deriving parallelism profiles under this timing information has been proposed. In this paper, this analysis technique is evaluated with regard to the factors influencing its computational complexity and the accuracy of the information it produces.

2. The TSPG model

Fig. 1 shows an example of a TSPG, which is defined as follows:

Definition 1. A timed structural parallelism graph is an acyclic directed graph $TSPG = (V, D, W, T)$, where $V$ is a set of vertices corresponding to the components (parts of the application), $D = D^P \cup D^A$ is a set of directed arcs defining two different types of relations between the components, and $T = T^V \cup T^A$ is a set of timing parameters. A timing parameter $t_i^V \in T^V$ is associated with a node and represents the duration of the execution of the corresponding component. A timing parameter $t_{i.j}^A \in T^A$ gives the instant in time, relative to the execution time of node $i$, at which node $j$ is activated.

A component is a part of the application; depending on the granularity chosen for the modeling study, it can be, for example, a procedure or a task. In the following, we will simply use the term component.

Single-valued parameters could be used as timing information for the components and arcs in the TSPG, producing a single point measure for each performance index of interest. However, the exact value of every component execution time may not be known to the performance analyst, leading to uncertainties in the parametrization.


Furthermore, the performance of distributed systems may vary due to
1. non-deterministic processing requirements (if the CPU requirements of the program vary significantly across different executions on a particular input), and
2. random delays due to interprocess communication events and contention for shared hardware and software resources.

Existing approaches cope with these problems by assuming a certain distribution for the timing parameters, at the cost of either simplifying model assumptions (e.g., exponentially distributed parameters) or complex solution techniques. In our approach, we propose the use of intervals. To be concise, a timing parameter $t_i^V$ is given by

$$t_i^V : p_{i,k} [\,\underline{t}_{i,k}^V, \overline{t}_{i,k}^V\,], \quad k = 1, \ldots, K,$$

where $\underline{t}_{i,k}^V$ is the lower and $\overline{t}_{i,k}^V$ the upper bound of the execution time interval, and $p_{i,k}$ represents the probability that the execution time of component $i$ lies within interval $k$. The timing parameter $t_{i.j}^A$ is defined as the interval

$$t_{i.j}^A : p_{i.j,l}^A [\,\underline{x}_{i.j,l}^A, \overline{x}_{i.j,l}^A\,], \quad l = 1, \ldots, L,$$

where $\underline{x}_{i.j,l}^A$ denotes a percentage and is the lower bound of the relative part of the execution time of component $i$ at which $i$ can activate component $j$, $\overline{x}_{i.j,l}^A$ represents the corresponding upper bound, and $p_{i.j,l}^A$ represents the probability that component $j$ is activated by component $i$ in interval $l$.

The example graph of a distributed application shown in Fig. 1 therefore has the following semantics: after performing some initial computation, component 1 activates components 2–4. During its execution, component 2 may activate component 5 or 6 (e.g., two subprocedures or subfunctions of component 2) at the specified activation intervals: it activates either component 5 (with probability 0.4) or component 6 (with probability 0.6). Component 3 activates both of its successor components during its execution. Component 4 activates either component 9 (with probability 0.2), component 10 (with probability 0.7), or component 11 (with probability 0.1). Component 200 is just a dummy node to reunite the OR branches. The execution of component 20 starts after the execution of components 19 and 18, or 19 and 11, has finished.
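To make the definition concrete, the following is a minimal sketch of how a TSPG could be represented in code. The data structure and all numeric values are our own illustration (the paper does not prescribe a representation); it simply mirrors $t_i^V$ and $t_{i.j}^A$ as lists of weighted intervals.

```python
# A minimal sketch of a TSPG data structure following Definition 1.
# Names (Interval, Node, ActivationArc) and values are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Interval:
    lower: float   # lower bound of the interval
    upper: float   # upper bound of the interval
    prob: float    # probability of falling into this interval

@dataclass
class Node:
    ident: int
    # t_i^V: execution time as K weighted intervals, probabilities sum to 1
    exec_time: list = field(default_factory=list)

@dataclass
class ActivationArc:
    src: int
    dst: int
    # t_{i.j}^A: relative activation instants; bounds are fractions
    # (0.0 .. 1.0) of the source node's execution time
    activation: list = field(default_factory=list)
    prob: float = 1.0   # branching probability for OR semantics

# Component 2 of Fig. 1 might be parametrized like this
# (the timing values here are invented for illustration):
n2 = Node(2, exec_time=[Interval(4.0, 6.0, 0.7), Interval(6.0, 9.0, 0.3)])
arc_2_5 = ActivationArc(2, 5, activation=[Interval(0.2, 0.4, 1.0)], prob=0.4)
arc_2_6 = ActivationArc(2, 6, activation=[Interval(0.5, 0.8, 1.0)], prob=0.6)
```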


Finding appropriate estimates to characterize the timing behavior is crucial for any further analysis. Parameters can be obtained, e.g., from a static analysis of the program code. In [11,12], a performance estimator is introduced which computes a set of parallel program parameters such as network contention, transfer time, and computation time. These parameters can be determined selectively for statements, loops, procedures, or the entire program, and could be used to obtain appropriate lower and upper bounds for the execution time of different program parts. Another possibility to obtain interval parameters is to measure the execution time of different program parts during execution; from repeated measurements, lower and upper bounds for the execution time of each program part can be derived. Finally, the analyst might be able to estimate the execution time of different program parts in terms of best- and worst-case assumptions.
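As an illustration of the measurement-based parametrization, the sketch below bins measured execution times of one program part into K weighted intervals. The use of equal-width histogram bins is our assumption, and the function name is hypothetical; any other binning scheme would fit the model equally well.

```python
# Turn measured execution times of one program part into the
# histogram/interval parameters t_i^V of the TSPG (illustrative sketch).
import numpy as np

def intervals_from_measurements(samples, k=3):
    """Return a list of (lower, upper, probability) triples."""
    counts, edges = np.histogram(samples, bins=k)   # k equal-width bins
    total = counts.sum()
    return [(edges[i], edges[i + 1], counts[i] / total)
            for i in range(k) if counts[i] > 0]

# e.g. 100 measured runs of one component
rng = np.random.default_rng(0)
print(intervals_from_measurements(rng.normal(5.0, 0.5, size=100)))
```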

3. Deriving parallelism profiles

A parallelism profile represents the degree of parallelism of an application over time. In this paper we will use the following definition:

Definition 2. A parallelism profile (PP) is a sequence of $n$ time intervals $I_i$, $i = 1, \ldots, n$, where $\underline{i}_i$ and $\overline{i}_i$ are the lower and upper bound of the interval $I_i$. To each time interval a numeric value $n_{I_i}$ is assigned, indicating the number of components active in the SPG during the respective time interval.

Further performance indices can be derived from a parallelism profile, e.g., the average, minimum, and maximum degree of parallelism (DOP), which are important parameters in mapping and scheduling decisions [13]. From the parallelism profiles, hypothetical execution times $T(n)$ and speedups $S(n)$ can also be derived, assuming $n$ available processing elements.

A straightforward approach to derive parallelism profiles from TSPGs is to enumerate all possible states of the program (numbers of active nodes) and to derive state transition probabilities from the timing information. This approach is not applicable from a practical point of view, because of state space explosion and because the transition probabilities depend on previous states (they are of non-Markovian type).
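For illustration, the sketch below computes the average DOP and a hypothetical speedup from a profile given as (start, end, DOP) triples with crisp interval boundaries. The folding rule $T(n) = \sum_i \text{width}_i \cdot \lceil n_{I_i} / n \rceil$ is a common approximation that we assume here; the paper does not spell out its exact formula for $T(n)$.

```python
# Derived indices from a parallelism profile (Definition 2).
# Profile entries are (start, end, dop); bounds are crisp for simplicity.
import math

def average_dop(profile):
    work = sum((end - start) * dop for start, end, dop in profile)
    span = sum(end - start for start, end, dop in profile)
    return work / span

def speedup(profile, n):
    t1 = sum((end - start) * dop for start, end, dop in profile)  # T(1)
    # assumed folding rule: each slice needs ceil(dop / n) rounds on n PEs
    tn = sum((end - start) * math.ceil(dop / n)
             for start, end, dop in profile)                      # T(n)
    return t1 / tn

prof = [(0, 2, 1), (2, 5, 3), (5, 6, 2)]   # invented example profile
print(average_dop(prof), speedup(prof, 2))  # -> 2.1667, 1.444
```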


To overcome these problems, alternative solution techniques have been proposed in [10]. A simplified approach is to consider either the minimum or the maximum execution time for each node. Each node then has only a single-valued timing parameter, and the computation of the potential parallelism profile is comparatively fast and simple, but the resulting profiles will not be representative of the actual program behavior. A more detailed analysis is provided by the activation-interval approach. First, all alternative paths through the TSPG are determined; the number of paths depends on the number of outgoing arcs with OR semantics. Then, for all nodes, the possible starting and terminating time intervals and the corresponding probabilities are determined. The input to this step is the specification of the TSPG (its structure and the timing parameters); the output is a set of interval pairs for each node. By comparing the start and end times of the components along all alternative paths, parallelism profiles can be derived. The algorithm produces a set of parallelism profiles and, for each profile, an associated probability indicating the likelihood that the actual execution exhibits this particular parallelism behavior. Obviously, both the accuracy and the computational complexity of this approach depend on the specific TSPG under study. Factors of influence include the structure of the TSPG (regular versus irregular structures), the number of nodes, the number and length of timing intervals, etc. In the following, we analyze the effect of these factors on computational complexity and accuracy.
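The path-enumeration step can be pictured with the small sketch below, which expands every combination of OR choices into a path variant with its probability. The graph encoding is our own illustration; applied to the two OR nodes of Fig. 1 it yields the 2 × 3 = 6 alternative paths that underlie the six min/max profiles reported in Table 1 below.

```python
# Enumerate all alternative paths induced by OR branches, each with its
# probability (first step of the activation-interval approach, sketched).
from itertools import product

def enumerate_alternatives(or_branches):
    """or_branches: {node: [(successor, prob), ...]} for OR nodes only.
    Yields (choice_dict, probability) for every combination of OR choices."""
    nodes = list(or_branches)
    for combo in product(*(or_branches[n] for n in nodes)):
        prob = 1.0
        choice = {}
        for node, (succ, p) in zip(nodes, combo):
            choice[node] = succ
            prob *= p
        yield choice, prob

# OR branches of Fig. 1: node 2 -> 5 or 6, node 4 -> 9, 10 or 11
branches = {2: [(5, 0.4), (6, 0.6)],
            4: [(9, 0.2), (10, 0.7), (11, 0.1)]}
for choice, prob in enumerate_alternatives(branches):
    print(choice, round(prob, 3))   # six path variants in total
```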

4. Evaluating the computational complexity of the activation-interval approach

In order to evaluate the computational complexity, series of experiments have been devised applying the
• minimum approach,
• maximum approach, and
• activation-interval approach
to the TSPG in Fig. 1. The minimum approach always takes the lower bound of the execution times and the earliest possible activation times for arcs, thus resulting in a deterministic timing model. The maximum approach

analogously takes the upper bound of the execution time and the latest possible activation time. Table 1 shows the averaged results for each approach; the number of profiles and the portion of time spent in the most important procedures are also given. (All measurement experiments discussed in the following were made on a Silicon Graphics Indigo 2 using Speedshop, an integrated package of performance tools; each experiment series was repeated 100 times and averaged values are reported.)

As can be seen in Table 1, processing the TSPG with the activation-interval technique requires approximately two and a half times more time than processing it with the minimum or maximum technique. Most time is spent in the procedure “interval splitting”, which computes starting and ending times for nodes with multiple predecessors and AND incoming semantics (approximately 17% for the min/max and 23% for the activation-interval approach), and in the procedure “generate profile”, which assembles parallelism profiles and writes them into a file (approximately 33% for the min/max and 46% for the activation-interval approach).

To analyze which procedures cause the time differences between the three approaches, Fig. 2 compares the absolute time spent in the most important procedures. It shows that for the procedures responsible for determining all paths through the TSPG (procedures “identify or semantics” and “compute path”), the execution time is approximately the same: the same number of paths has to be computed in each of the solution approaches. Differences can be observed for the procedures which compute starting and ending times for the nodes (procedures “compute execution times” and “interval splitting”) and for the procedure “generate profile”. On average, they need approximately three times longer for processing the TSPG with the activation-interval approach than with the min/max approach. This time increase is caused by the greater number of parallelism profiles that have to be generated for the activation-interval approach: exactly three times more profiles are produced, resulting in the denoted time increase. In principle, the number of profiles is driven by the number of paths and the number of timing intervals associated with the nodes and arcs.



Table 1
Comparison of the min/max approach and the activation-interval approach

                                  Percentage of time spent in procedures
Approach              Time (s)^a  Input^b  Identification   Computation  Calculating  Interval   Finding next  Generation   Profiles
                                           of OR semantics  of paths     end times    splitting  profiles      of profiles  generated
Minimum               0.00753     24.38    0.15             12.70         8.70        17.04      4.33          32.70         6
Maximum               0.00752     24.39    0.15             12.72         8.76        16.95      4.33          32.69         6
Activation-interval   0.01780      9.76    0.06              5.35        10.10        22.86      5.62          46.20        18

^a Time gives the execution time in seconds.
^b The “input” procedure reads the parameters of a TSPG from an input file.

As the min/max approach reduces the timing information to a single minimum or maximum value, the only factor influencing the number of profiles is the number of paths through the TSPG. Differences in computational complexity between the activation-interval approach and the min/max approach will therefore usually arise when multiple timing intervals are associated with nodes or arcs; for TSPGs with only one timing interval per node and arc, the computational complexity of the min/max approach corresponds to that of the activation-interval approach. The analysis of the factors influencing the computational complexity of the activation-interval approach (the number of nodes, the number of arcs with AND/OR semantics, the number of timing intervals associated with nodes and arcs), which is part of the next section, therefore also covers the factors influencing the performance of the min/max approach (the number of nodes and the number of arcs with AND/OR semantics).
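To illustrate how these factors combine (our reading of the numbers in Table 1): the TSPG in Fig. 1 contains two OR branch nodes with two and three alternatives, respectively, giving $2 \times 3 = 6$ alternative paths and thus the 6 profiles of the min/max runs; the additional timing intervals in the activation-interval run triple this number to $3 \times 6 = 18$ profiles.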

Fig. 3. Regularly structured TSPGs.

4.1. 2^k factorial design

The second “benchmark” application for the activation-interval approach is characterized by the well-known “divide and conquer” structure (see Fig. 3); in this example, all arcs are assumed to be activation arcs. To analyze the main factors influencing the performance of the activation-interval approach, appropriate concepts had to be found to obtain the maximum information with a reasonable number of experiments. As the number of factors is large (following [14], a factor is a variable that affects the computational complexity and has several alternatives; in the context of the activation-interval approach, e.g., the number of nodes or the number of timing intervals associated with a node) and the number of levels (the values a factor can assume) is theoretically infinite, a full factorial design, in which all combinations of the levels the different factors can assume are experimented with, is impossible. Therefore a $2^k$ factorial design was chosen to determine the effect of $k$ factors. Four factors have been chosen for the analysis (see Table 2).

Fig. 2. Relative execution times of the procedures.



Table 2
Factors and levels used in the activation-interval study

Factor  Description                                                      Level-1                                Level-2
A       Number of nodes of the TSPG                                      9                                      16
B       Number of timing intervals associated to a node                  n^a                                    n + 7^b
C       Number of activation intervals associated to an activation arc   n^a                                    n + 7^b
D       Type of outgoing arcs (AND or OR)                                All nodes connected via AND branches   All nodes connected via OR branches

^a n corresponds to the number of nodes.
^b To seven nodes, two timing intervals are associated.

Four binary variables $x_A$, $x_B$, $x_C$, and $x_D$ are introduced, representing the level each factor assumes ($x_i = 1$ denotes level-1 and $x_i = 2$ denotes level-2). Experiments have been made for all combinations of these variables; the results are given in Table 3.

Table 3
Observed execution times

x_A x_B x_C x_D   Exec. time y (s)     x_A x_B x_C x_D   Exec. time y (s)
1 1 1 1           0.00167              2 1 1 1           0.00336
1 1 1 2           0.00298              2 1 1 2           0.04425
1 1 2 1           0.06555              2 1 2 1           0.15692
1 1 2 2           0.00927              2 1 2 2           0.06635
1 2 1 1           0.05678              2 2 1 1           0.14552
1 2 1 2           0.02877              2 2 1 2           0.11723
1 2 2 1           7.11214              2 2 2 1           18.46458
1 2 2 2           0.13913              2 2 2 2           0.28005

To determine which factors have the largest influence on the execution time, a regression model was chosen. The execution time $y$ can be regressed on $x_A$, $x_B$, $x_C$, and $x_D$ using a nonlinear regression model of the form

$$\begin{aligned} y = {} & q_0 + q_A x_A + q_B x_B + q_C x_C + q_D x_D + q_{AB} x_A x_B + q_{AC} x_A x_C + q_{AD} x_A x_D \\ & + q_{BC} x_B x_C + q_{BD} x_B x_D + q_{CD} x_C x_D + q_{ABC} x_A x_B x_C + q_{ABD} x_A x_B x_D \\ & + q_{BCD} x_B x_C x_D + q_{ABCD} x_A x_B x_C x_D. \end{aligned}$$

Solving this regression model for the observed execution times results in the following equation:

$$\begin{aligned} y = {} & 1.668 + 0.741 x_A + 1.625 x_B + 1.618 x_C - 1.582 x_D + 0.717 x_A x_B + 0.714 x_A x_C \\ & - 0.7 x_A x_D + 1.588 x_B x_C - 1.569 x_B x_D - 1.581 x_C x_D + 0.701 x_A x_B x_C \\ & - 0.701 x_A x_B x_D - 1.557 x_B x_C x_D - 0.696 x_A x_B x_C x_D. \end{aligned}$$

The result can be interpreted as follows: the mean execution time is 1.668 s, the effect of the number of nodes is 0.741 s, the effect of the number of timing intervals per node is 1.625 s, and so on. To measure the importance of a factor, or of a combination of factors, the total variation in the execution time explained by this factor (or factor combination) can be determined by computing the “sum of squares total” (SST) as follows [14]:

$$\begin{aligned} \mathrm{SST} = {} & 2^4 \, (q_A^2 + q_B^2 + q_C^2 + q_D^2 + q_{AB}^2 + q_{AC}^2 + q_{AD}^2 + q_{BC}^2 + q_{BD}^2 + q_{CD}^2 + q_{ABC}^2 + q_{ABD}^2 + q_{BCD}^2 + q_{ABCD}^2) \\ = {} & 8.79 + 42.23 + 41.91 + 40.06 + 8.23 + 8.15 + 7.85 + 40.33 + 39.40 + 39.98 \\ & + 7.855 + 7.86 + 38.79 + 7.75 = 339.21. \end{aligned}$$

The portion of variation explained by each of these effects is listed in Table 4. The analysis of these results shows that the portions of variation explained by the factors B (number of timing intervals per node), C (number of activation intervals per arc), and D (AND or OR relation between the arcs) and their combinations all lie between 11.4% and 12.5%.
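The computation can be reproduced along the following lines. The sketch fits the 15 effect terms listed above (the ACD interaction does not appear in the text) to the 16 observations of Table 3 by least squares and reports each effect's share of the SST. Since the levels are coded 1 and 2 rather than the orthogonal ±1 coding usual for 2^k designs, this is our reconstruction, and small deviations from the printed coefficients are possible.

```python
# Least-squares reconstruction of the regression above (sketch, not the
# authors' original code). Factors A..D are coded with levels 1 and 2.
import numpy as np
from itertools import product

levels = list(product([1, 2], repeat=4))              # rows of Table 3
y = np.array([0.00167, 0.00298, 0.06555, 0.00927,     # x_A = 1
              0.05678, 0.02877, 7.11214, 0.13913,
              0.00336, 0.04425, 0.15692, 0.06635,     # x_A = 2
              0.14552, 0.11723, 18.46458, 0.28005])

# the 15 terms of the model ("" is the intercept q0)
terms = ["", "A", "B", "C", "D", "AB", "AC", "AD", "BC", "BD",
         "CD", "ABC", "ABD", "BCD", "ABCD"]

def column(term, x):
    v = 1.0
    for f in term:                  # multiply the factors named in the term
        v *= x["ABCD".index(f)]
    return v

X = np.array([[column(t, x) for t in terms] for x in levels])
q, *_ = np.linalg.lstsq(X, y, rcond=None)   # coefficients q0 .. qABCD

sst = 2**4 * np.sum(q[1:] ** 2)             # SST as defined in the text
for t, qi in zip(terms[1:], q[1:]):
    share = 100 * 16 * qi**2 / sst          # portion of variation explained
    print(f"q_{t}: {qi: .3f}  explains {share:.1f}%")
```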


Table 4
Portion of variation explained

Parameter                 q_A    q_B    q_C    q_D    q_AB   q_AC   q_AD
Variation explained (%)   2.6    12.5   12.4   11.8   2.4    2.4    2.3

Parameter                 q_BC   q_BD   q_CD   q_ABC  q_ABD  q_BCD  q_ABCD
Variation explained (%)   11.9   11.6   11.8   2.3    2.3    11.4   2.2

The portions of variation explained by factor A (number of nodes) and all its combinations with other factors lie only between 2.2% and 2.6%. Consequently, under the chosen experimental conditions, the factors B, C, and D have approximately the same influence on the computational complexity, whereas the number of nodes appears to be of lesser importance.

5. Validation by simulation

In addition to the analysis of the computational complexity, the accuracy of the information obtained from the activation-interval approach is evaluated by comparing its results to results produced by simulation. In the simulation, we have to make an assumption on the distribution of the execution times of nodes and of the times at which a component activates another component. In the first series of experiments, a uniform distribution was chosen; here we expected a close match between the simulation results and the prediction of the interval approach, as the boundaries of the uniform distribution correspond directly to the interval boundaries. In the second series of simulations, a normal distribution was chosen. The corresponding parameters in the interval approach are obtained by truncating the tails of the normal distribution as follows: to simulate the execution behavior of a real application, the timing information for the nodes and arcs of the TSPG in Fig. 1 is obtained by truncating the normal distribution to the interval $[\,\underline{t}_{i,n}^V, \overline{t}_{i,n}^V\,]$ of width $4\sigma^2$, with $\mu$ being in the “center” of the interval, i.e., the following equations hold:

$$\mu = \underline{t}_{i,n}^V + \frac{\overline{t}_{i,n}^V - \underline{t}_{i,n}^V}{2}, \qquad \sigma^2 = \frac{\overline{t}_{i,n}^V - \underline{t}_{i,n}^V}{4}.$$
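In code, the second simulation series can be reproduced roughly as follows, sampling from a normal distribution truncated to the interval bounds with µ and σ² chosen as above. The function is our sketch (it requires SciPy), and reading σ² = (upper − lower)/4 literally is our interpretation of the equations.

```python
# Sample node execution times from a normal distribution truncated to
# the specified interval (sketch of the second simulation series).
from scipy.stats import truncnorm
import math

def sample_exec_time(lower, upper, size=1, seed=None):
    mu = lower + (upper - lower) / 2.0        # centre of the interval
    sigma = math.sqrt((upper - lower) / 4.0)  # sigma^2 = (upper - lower)/4
    # scipy's truncnorm takes bounds in units of standard deviations
    a = (lower - mu) / sigma
    b = (upper - mu) / sigma
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=size,
                         random_state=seed)

print(sample_exec_time(4.0, 6.0, size=5, seed=42))
```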

The average degree of parallelism and the derived hypothetical speedup are used as first indicators for comparing the results obtained from simulation with those obtained from the analytical approach. In Table 5, Simulation 1 gives the results for the simulation series based on a uniform distribution, Simulation 2 gives the results for a simulation based on a normal distribution of execution and activation times, and AI gives the results of the activation-interval approach. The last two columns give the ratio of the results of the AI approach to the simulation results; a value greater than one indicates that the AI approach overestimates the ØDOP and the speedup compared to simulation. From these experiments we observe that the AI approach tends to overestimate the potential degree of parallelism if the execution and activation times follow a uniform distribution, while it underestimates the DOP in the case of a normal distribution. The results produced by the activation-interval approach deviate only slightly from the results of the simulation.

Table 5
Avg. deviation to activation-interval profiles

              ØDOP    Speedup   ØDOP AI/simulation   Speedup AI/simulation
Simulation 1  2.378   1.3785    1.0105               1.0181
Simulation 2  2.443   1.4426    0.9836               0.9729
AI            2.403   1.4035    –                    –


Fig. 4. Parallelism profiles — uniform distribution vs. AI approach.


Fig. 5. Parallelism profiles — normal distribution vs. AI approach.


This can also be seen by comparing the shapes of the parallelism profiles obtained by the activation-interval approach against those obtained from various simulation runs. Figs. 4 and 5 show the parallelism and speedup profiles for 10 arbitrarily chosen simulation runs (assuming a uniform and a normal distribution, respectively), compared to the AI profiles generated for the corresponding paths through the TSPG.


6. Conclusions and future work

We have presented a graph model for characterizing distributed applications, to be used in the early stages of the program development life cycle. The model supports the specification of an application’s structure at a high level of abstraction, in terms of components that communicate with or activate each other during execution. Timing parameters representing the estimated execution times of components and the occurrence of activations/communications can be specified as intervals. A technique has been presented which allows a fast evaluation of these graph models with respect to the exploitable parallelism given by parallelism profiles. As the quality of such an approach depends on both the accuracy of the results and the complexity of providing these results, a study of the factors influencing computational complexity (using a 2^k factorial design) and a study of the accuracy of the estimates (validation against simulation results) have been presented. The studies have shown that the proposed approach performs well with respect to both criteria, but they also revealed potential for further improvements. Two possible directions can be identified:
• “Similar” timing intervals could be aggregated. When generating starting or terminating intervals, some intervals are rather close to each other or even overlapping, while others may have only a very small probability of occurrence. In these cases, it might be helpful to combine similar intervals into a single interval, or to omit intervals whose probability is very small (see the sketch after this list). Ideally, the analyst should have the possibility to specify the degree up to which aggregation is allowed.
• “Similar” profiles could be aggregated. When generating parallelism profiles, many profiles have a similar shape. By defining some kind of profile pattern, similar profiles may be combined into one profile with a larger probability. It may also be helpful to omit profiles with a very small probability.
Studying the trade-off between the gain in performance and the loss in accuracy of these modifications is subject to future research.
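A possible realization of the interval-aggregation idea is sketched below: intervals closer than a user-chosen ε are merged (their probabilities added) and intervals below a probability threshold are dropped. This is a sketch of the proposed future work, not part of the presented tool; note that dropping intervals leaves the remaining probabilities unnormalized, which is a design choice left to the analyst.

```python
# Merge generated timing intervals that overlap or lie within eps of each
# other, and drop intervals with probability below p_min (sketch).
def aggregate(intervals, eps=0.0, p_min=0.0):
    """intervals: list of (lower, upper, prob); returns aggregated list."""
    kept = sorted(i for i in intervals if i[2] >= p_min)
    merged = []
    for lo, up, p in kept:
        if merged and lo - merged[-1][1] <= eps:       # close or overlapping
            mlo, mup, mp = merged[-1]
            merged[-1] = (mlo, max(mup, up), mp + p)   # widen, add probability
        else:
            merged.append((lo, up, p))
    return merged

print(aggregate([(0.0, 1.0, 0.3), (0.9, 1.5, 0.4), (3.0, 4.0, 0.02)],
                eps=0.1, p_min=0.05))   # -> [(0.0, 1.5, 0.7)]
```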

References

[1] C.U. Smith, Performance Engineering of Software Systems, Addison-Wesley, Reading, MA, 1990.
[2] E. Luque, R. Suppi, J. Sorribes, Designing parallel systems: a performance prediction problem, Information and Software Technology 34 (12) (1992) 813–823.
[3] H. Wabnig, G. Kotsis, G. Haring, Performance prediction of parallel programs, in: B. Walke, O. Spaniol (Eds.), Messung, Modellierung und Bewertung von Rechen- und Kommunikationssystemen, Springer, Berlin, 1993, pp. 64–76.
[4] A. Ferscha, A Petri net approach for performance oriented parallel program design, J. Parallel and Distributed Computing 15 (1992) 188–206.
[5] U. Herzog, Performance evaluation as an integral part of system design, in: M. Becker et al. (Eds.), Proceedings of the Transputers’92 Conference, IOS Press, 1992.
[6] A. Mitschele-Thiel, Automatic configuration and optimization of parallel transputer applications, in: Proceedings of the World Transputer Congress, Aachen, Germany, IOS Press, 1993.
[7] P. Kacsuk, J. Cunha, G. Dozsa, J. Lourenco, T. Fadgyas, T. Antao, A graphical development and debugging environment for parallel programs, Parallel Computing 22 (13) (1997) 1747–1770.
[8] M. Calzarossa, G. Haring, G. Kotsis, A. Merlo, D. Tessera, A hierarchical approach to workload characterization for parallel systems, in: B. Hertzberger, G. Serazzi (Eds.), High Performance Computing and Networking, LNCS, vol. 919, Springer, Berlin, 1995, pp. 102–109.
[9] M. Braun, G. Haring, G. Kotsis, Deriving parallelism profiles from structural parallelism graphs, in: Proceedings of TDP’96, 1996, pp. 455–468.
[10] M. Braun, G. Kotsis, Interval based workload characterization for distributed systems, in: Proceedings of the International Conference on Modelling Techniques and Tools, Lecture Notes in Computer Science, Springer, Berlin, 1997.
[11] T. Fahringer, Automatic Performance Prediction of Parallel Programs, Kluwer Academic Publishers, Boston, MA, 1996, ISBN 0-7923-9708-8.
[12] T. Fahringer, Compile-time estimation of communication costs for data parallel programs, J. Parallel and Distributed Computing 39 (1) (1996) 46–65.
[13] K.C. Sevcik, Characterization of parallelism in applications and their use in scheduling, Performance Evaluation Review, Special Issue, 1989 ACM SIGMETRICS 17 (1) (1989) 171–180.

[14] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, New York, 1991.

Gabriele Kotsis received her masters degree (1991, honored with the Award of the Austrian Computer Society) and her PhD (1995, honored with the Heinz Zemanek Preis) from the University of Vienna. Since December 1991 she has been working at the Institute of Applied Computer Science and Information Systems, University of Vienna, as a researcher and teacher.


Her research interests include performance modeling of parallel and distributed systems, visualization of parallel program behavior and performance, workflow systems and simulation, and distributed cooperative work environments. Dr. Kotsis is the author of several publications in international conferences and journals and is co-editor of four books.

Markus Braun received his masters degree in 1995 from the University of Vienna. From 1995 to 1998 he was working as a research assistant at the Department of Applied Computer Science and Information Systems, University of Vienna, and is expected to finish his PhD in 1999. Currently, he is working for an international business consulting company in Munich, Germany. His research interests focus on performance prediction and modeling of parallel and distributed applications, business information systems, and knowledge management and data mining.