Performance Evaluation of a Multithreaded Runtime System Using a Synchronous Reactive Model


Copyright © IFAC Real Time Programming, Shantou, Guangdong Province, P.R. China, 1998

A. VALDERRUTEN, V.M. GULIAS, J.S. JORGE and J. MOSQUERA

Laboratorio de Fundamentos de la Computación e Inteligencia Artificial, Departamento de Computación, Universidad de La Coruña, Campus de Elviña, 15071 La Coruña, Spain.

{valderruten,gulias,sjorge,mosky}@dc.fi.udc.es

Work partially supported by XUGA10504B96 and XUGA10505B96, Xunta de Galicia.

Abstract. Synchronous Reactive Modeling provides an optimal framework for the modular decomposition of programs that engage in complex patterns of deterministic interaction, such as many real-time and communication entities. This paper presents an approach which includes performance modeling techniques in the synchronous reactive modeling method supported by ESTEREL. It defines a methodology based on timing and probabilistic quantitative constructs which complete the synchronous reactive models. A monitoring mechanism allows the computation of performance results during the simulation. This methodology is applied to study a multithreaded runtime system for a distributed functional programming language. Performance metrics are computed and validated with experimental results. Copyright © 1998 IFAC

Key Words. Performance Engineering; Synchronous Reactive Models; Multithreaded Runtime Systems; Modeling and Instrumentation; Simulation and Monitoring; Functional Programming.

1. INTRODUCTION

Developing systems with the use of declarative programming is motivated by the high degree of abstraction: the designer may describe what is being computed rather than how it should be computed. Declarative languages are characterized as having no implicit state and thus the emphasis is placed entirely on programming with expressions.

In order to improve the execution time of declarative programs, a great deal of effort has to be made to implement languages that automatically exploit (in part, at least) the implicit parallelism in the side-effect-free expressions of declarative languages, as exposed by Backus (1978) or Hudak (1989). There are plenty of problems in implementing an actual distributed system using a functional language, that is, a declarative language based on the notion of (mathematical) function.

In the explicit functional programming approach introduced by Jones and Hudak (1993), the programmer is often not concerned with operational details, but the appropriate constructs for expressing parallelism are used to improve execution with no (or little) change of meaning. In order to help the compiler and runtime system properly exploit the system resources, some hints or annotations must be introduced. This set of performance annotations (Valderruten and Gulias, 1996; Valderruten et al., 1997) supplies the quantitative information needed to complete the design in order to carry out the performance modeling tasks.

This paper presents a performance modeling approach that will support and guide the strategy used by the scheduler to balance the workload in the multithreaded functional framework. The use of a formal modeling method allows an unambiguous and precise description of the system and leads to easier formal verification and validation. Thus, a synchronous reactive model, well suited to describe computer-based systems which must react instantaneously to external events (Berry and Gonthier, 1992), is used to represent the behaviour of a task scheduler in a cluster of computers performing a functional computation.

However, the synchronous reactive model was not intended to support quantitative analysis and the prediction of the system behaviour with respect to non-functional requirements. In order to fill this gap, the idea is to consider the performance requirements of a design as dedicated constructs for checking performance constraints by simulation. The semantics of these constructs must differ from those of the synchronous reactive system, and ambiguities must be avoided.

Hence, this work pays attention to the integration of performance evaluation and synchronous reactive models. That will allow the specification, implementation and analysis of systems taking into account both functional and non-functional requirements (Valderruten, 1993). When formal models include inputs with regard to performance evaluation, models can be obtained in order to address the quality of service (speed, reliability...) very early in the system development life-cycle. The main goal of this approach is to define and validate rules allowing an easier definition of performance models and then to compute performance metrics. In this case study, performance evaluation techniques may advise the programmer about the estimated performance of a given scheduling strategy.

In order to support the proposed modeling methodology, the ESTEREL language (Berry and Gonthier, 1992; Berry, 1993; Berry, 1997) for synchronous reactive modeling has been used. By obtaining a performance model from a synchronous reactive one and solving it by simulation, results similar to those measured on the actual system, an earlier prototype described by Gulias (1998), have been found.

2. THE MULTITHREADED RUNTIME SYSTEM FOR DFL

Functional languages are declarative languages whose underlying model of computation is the function (Hudak, 1989). As declarative languages, functional programming languages have no implicit state, and thus the emphasis is placed entirely on programming with expressions.

DFL (Gulias, 1998) is a distributed implementation of the functional language OBJECTIVE CAML (Leroy, 1997) designed to speed up the computation by spawning tasks to different nodes in a cluster of computers. At first glance, OBJECTIVE CAML is extended with primitives to perform higher-order explicit communication among distributed threads in a distributed memory framework, typically a cluster of computers, using distributed channels. Higher-order communication means that closures (functions as well as their environments) can be delivered to a remote agent in the same way as any other value: integers, tuples, lists, or even other distributed channels. With these primitives, the core of DFL may be proved to be equivalent to the π-calculus of Milner et al. (1992) with asynchronous channels.

On top of this explicit framework, MULTILISP's futures (Halstead, 1985) have been implemented in order to perform side-effect-free lenient evaluation (Schauser and Goldstein, 1995), that is, the evaluation of a function body and its arguments simultaneously.

A future is a suggestion for the runtime system to spawn a new task. In this approach, the programmer uses sequential OBJECTIVE CAML enriched with future annotations for identifying potential tasks. Some of these expressions will be evaluated as remote tasks, while some will be carried out locally depending on system workload. The key problem is to distribute the workload efficiently among all the sites. In order to do so, a work-stealing scheduler, similar to the multithreaded runtime system of CILK (Blumofe et al., 1996), has been adopted. With this approach, each site owns a pool of pending tasks (futures) to be executed by an evaluator thread. The pool of local tasks is managed by a scheduler thread, which is asked for work by the evaluator when it becomes idle and delivers results to their destination. When the workload at a given site falls under a threshold (for instance, no pending tasks are available), a work-stealing cycle begins: the scheduler thread (the "thief") polls other locations for new tasks and, if a site with enough workload is found (the "victim"), some tasks are "stolen" from it and migrated to the idle site. In other similar systems, like GUM (Trinder et al., 1996), the stealing of tasks is also referred to as "fishing" a task from the pool. This strategy seems to behave quite well in dynamic and highly asynchronous concurrent programs.

3. THE MODELING PROCESS

In order to obtain performance estimations of the multithreaded runtime system, and hence to analyze different design options, a synchronous reactive model has been built using ESTEREL and its C interface. As summarized in figure 2, the modeling process involves two main models (Valderruten et al., 1995):

1. A synchronous reactive model corresponding to the functional point of view is built using the deterministic mechanisms provided by ESTEREL; this first model is called the Functional Model. From now on, functional should be understood as a description of what a component does.

2. The Functional Model is completed using performance constructs, allowing the computation of performance measures during the simulation; the result is a Performance Model.

The proposed functional model is quite detailed because the modeling features are close to the actual system. The goal is to gather low-level performance results in order to tune the system scheduler under some concrete workload configurations. Then, an appropriate starting point to support the performance analysis is needed.


Fig. 1. An Information view of the Modeling Process

On the other hand, taking into account performance issues during the system life-cycle is done by means of a performance modeling process, involving a set of activities of system designers and modeling experts (Conquet et al., 1992). The aims of modeling include performance requirements, which may consider throughput, timing and utilization rate constraints. In order to add this kind of information to the ESTEREL functional specification, the following quantitative information must be introduced:

• The estimated processing time relevant to each timed action. A timed action is an action which consumes time, e.g. the time for processing a frame, the transmission time... Timed actions are relevant to performance modeling, but make no sense in a synchronous reactive model.

• The probability associated with each possible alternative behaviour. When a probability defines the system behaviour, a degree of non-determinism must be introduced in the synchronous reactive model, which is deterministic by nature.

The modeling choices must avoid any semantic incongruence with the perfect synchrony hypothesis. It must be pointed out that these performance constructs do not violate the synchronous reactive model hypothesis of determinism and instantaneous reaction. The timed actions can be viewed as requests to external modules that must reply after a certain amount of time. Both requests and replies are signals, emitted from and received by the reactive module, respectively. The model allows this type of behaviour to be managed. The determinism is maintained too, because this principle only refers to the way the synchronous reactive module reacts whenever an input event occurs. As timed actions are considered external to the synchronous reactive module, they can have variable or even random duration without violating the determinism principle of the synchronous reactive model.

The constructs that allow alternative behaviours maintain the instantaneous reaction principle. Nevertheless, the determinism seems to be violated, since different behaviours are wanted for the same inputs. Again, the problem can be solved by moving the source of the alternative behaviours out of the synchronous reactive model. As for timed actions, the alternative behaviours can be viewed as replies from external sources that determine the behaviour in the reactive module. This module is deterministic because it always reacts in the same way when the reply of the external decision is the same. The non-deterministic behaviour is delegated by the reactive kernel and managed separately, so the determinism is maintained inside it.

Taking into account these considerations, the two types of performance parameters mentioned above can be introduced using the following constructs:

• Timed actions can be implemented as a twofold procedure: (i) request a finalization signal from an external module and (ii) wait for the finalization signal. In ESTEREL, they can be implemented as follows:

    emit TIMED_ACTION_STARTS;
    await TIMED_ACTION_FINISHES;

However, by doing this, the time control is left entirely to the external module, and the time control feature supported by the synchronous reactive model is not used. In order to unify the time management, the following construct can be implemented:

    emit GET_TIMED_ACTION_DURATION;
    await immediate TIMED_ACTION_DURATION;    % it would be instantaneous
    await (?TIMED_ACTION_DURATION) TIME_SIGNAL;

where the signal TIME_SIGNAL is the one that counts the time passing across the entire synchronous reactive model. It could be the predefined ESTEREL signal tick. The external module only has to respond to a GET_TIMED_ACTION_DURATION request by sending the duration of the timed action, which is carried within the TIMED_ACTION_DURATION signal.
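The text does not detail the external module itself. As a minimal sketch of the role it plays, the following Python code assumes a seeded random source that answers duration requests and, in the same way, resolves probabilistic alternative behaviours. The class name, the method names and the exponential distribution are assumptions made for illustration only; they are not part of the ESTEREL model.

```python
import random


class ExternalModule:
    """Hypothetical stand-in for the environment outside the reactive kernel."""

    def __init__(self, seed=0):
        # A seeded generator makes every simulation run reproducible.
        self.rng = random.Random(seed)

    def timed_action_duration(self, mean_ticks):
        """Reply to GET_TIMED_ACTION_DURATION with a duration in ticks.

        The exponential distribution is an assumption of this sketch.
        """
        return max(1, round(self.rng.expovariate(1.0 / mean_ticks)))

    def alternative_behaviour(self, probabilities):
        """Pick one alternative behaviour; returns the index of the branch."""
        branches = range(len(probabilities))
        return self.rng.choices(branches, weights=probabilities)[0]


def run_timed_action(env, clock, mean_ticks):
    """Mimic the unified construct: ask for a duration, then let time pass."""
    duration = env.timed_action_duration(mean_ticks)  # instantaneous exchange
    return clock + duration  # i.e. 'duration' occurrences of TIME_SIGNAL later


if __name__ == "__main__":
    env = ExternalModule(seed=42)
    clock = 0
    for _ in range(3):
        clock = run_timed_action(env, clock, mean_ticks=5)
    print("ticks elapsed:", clock)
    print("chosen branch:", env.alternative_behaviour([0.2, 0.5, 0.3]))
```

Because all randomness lives in `ExternalModule`, the caller reacts identically whenever it receives the same reply, which is the sense in which determinism is kept inside the reactive kernel.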

Fig. 2. The performance information model

• A general way to implement the alternative behaviours is the following:

    emit GET_ALTERNATIVE_BEHAVIOUR;
    await
      case ALTERNATIVE_BEHAVIOUR_1 do ...
      case ALTERNATIVE_BEHAVIOUR_2 do ...
      ...
      case ALTERNATIVE_BEHAVIOUR_n do ...
    end await;

The external module that receives the signal GET_ALTERNATIVE_BEHAVIOUR takes a decision based on some rules, which are external to the synchronous reactive model.

The complementary work that must be done in order to obtain performance measures from the models is to instrument them with a suitable monitoring mechanism. A trace file must be generated with the results of the observation of a set of signals defined by the designer. The designer has to decide which events are necessary in order to compute the desired performance metrics. The result of this model instrumentation step is an operative performance model.

The proposed monitoring mechanism only offers the basic observation functionalities that are needed for performance measuring. A complete statistical tool must include at least the computation of the confidence intervals that are needed to control the simulation time.

4. PERFORMANCE ANALYSIS OF THE CASE STUDY

The verification and validation process of the performance model has been carried out with the analysis of a set of workload scenarios. Results measured in the actual system allow the tuning of the quantitative parameters of the model, which can then be used for forecasting future behaviours with workloads not yet reached and with different configurations. Valderruten and Gulias (1996) present the Performance Information Model that guides the performance evaluation process, whose main view is shown in figure 2.

The data provided by the monitor consists of the trace of a set of events, which must be processed in order to obtain the appropriate measures for result analysis. A result-oriented analysis interface has been implemented to visualize this information.

In this case study, three performance metrics were considered: the system speedup, the number of steal requests and the number of tasks on the system. In particular, the interest is centered on the number of work-stealing requests for the whole simulation and the number of tasks at each interval of time.
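As an illustration of how an event trace produced by a monitor can be turned into measures, the sketch below processes a list of `(time_ms, site, signal)` records. The record layout and the signal names `STEAL_REQUEST`, `TASK_CREATED` and `TASK_FINISHED` are hypothetical; the actual trace format of the monitor is not given in the text.

```python
def steal_count(trace):
    """Total number of steal requests observed in the trace."""
    return sum(1 for _, _, signal in trace if signal == "STEAL_REQUEST")


def accumulated_steal_share(trace):
    """(time, % of all steal requests issued so far), one pair per request."""
    times = sorted(t for t, _, signal in trace if signal == "STEAL_REQUEST")
    total = len(times)
    return [(t, 100.0 * (i + 1) / total) for i, t in enumerate(times)]


def tasks_per_interval(trace, interval_ms):
    """Number of live tasks (created minus finished) at the end of each interval."""
    if not trace:
        return []
    last = max(t for t, _, _ in trace)
    delta = [0] * (int(last // interval_ms) + 1)
    for t, _, signal in trace:
        k = int(t // interval_ms)
        if signal == "TASK_CREATED":
            delta[k] += 1
        elif signal == "TASK_FINISHED":
            delta[k] -= 1
    live, result = 0, []
    for d in delta:
        live += d
        result.append(live)
    return result


if __name__ == "__main__":
    trace = [
        (0, 0, "TASK_CREATED"), (1, 0, "TASK_CREATED"),
        (2, 1, "STEAL_REQUEST"), (3, 1, "STEAL_REQUEST"),
        (12, 0, "TASK_FINISHED"), (25, 1, "TASK_FINISHED"),
    ]
    print(steal_count(trace))
    print(accumulated_steal_share(trace))
    print(tasks_per_interval(trace, interval_ms=10))
```

Keeping the monitor limited to raw events and deriving every metric afterwards means that a new measure only requires a new function over the same trace, not a new simulation run.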

A task can be ready to be evaluated, in evaluation, or waiting for the results of its children (i.e., the evaluation of some of its subexpressions) to continue. Figure 3 shows the distribution of all non-blocked tasks in a simulation with 4 identical processors. The non-blocked tasks are the ready tasks waiting in the queues in addition to the tasks in evaluation. This distribution follows an ideal divide-and-conquer scenario with three different stages:

1. an initialization stage, in which there are only a few tasks on the system;

2. then, the system gets loaded, and new tasks are being generated while the system evaluates other tasks;

3. and the finalization stage, in which the system becomes empty again as tasks are finished and new tasks are not generated.
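A toy model such as the following can be used to explore these stages. It is not the ESTEREL model of the paper: it is a hedged Python sketch that assumes unit-cost tasks forming a binary divide-and-conquer tree, zero-cost communication, one random victim per idle site and per tick, and no blocking on the results of children. All names are hypothetical.

```python
import random
from collections import deque


def simulate(depth=8, n_procs=4, seed=1):
    """Discrete-time work stealing over a binary task tree of the given depth."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(n_procs)]
    queues[0].append(depth)  # the root task starts on site 0
    executed = steal_requests = failed_steals = 0
    profile = []  # ready tasks in the whole system at the start of each tick

    while any(queues):
        profile.append(sum(len(q) for q in queues))
        # Idle sites act as thieves and poll one random victim each.
        for thief in range(n_procs):
            if queues[thief]:
                continue
            steal_requests += 1
            victim = rng.choice([s for s in range(n_procs) if s != thief])
            if len(queues[victim]) > 1:  # the victim keeps at least one task
                # Take the oldest task, from the end opposite to local use.
                queues[thief].append(queues[victim].popleft())
            else:
                failed_steals += 1
        # Every site holding work evaluates one task during this tick.
        for q in queues:
            if q:
                d = q.pop()
                executed += 1
                if d > 0:
                    q.extend((d - 1, d - 1))  # spawn two child tasks

    ticks = len(profile)
    return {
        "executed": executed,
        "ticks": ticks,
        "steal_requests": steal_requests,
        "failed_steals": failed_steals,
        "speedup": executed / ticks,  # sequential time is one tick per task
        "profile": profile,
    }


if __name__ == "__main__":
    result = simulate()
    print({k: v for k, v in result.items() if k != "profile"})
```

In this sketch the first tick always produces failed steals, because only the root task exists and a victim never gives away its last task. How closely the rest of the profile follows the three stages depends on the assumptions listed above.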

Fig. 3. Tasks during the 4-processor simulation (ready tasks against t in ms)

When initializing, while tasks are distributed among the processors, a high volume of steal procedures is engaged because many processors have an empty ready queue. In the termination phase, the degradation of the system is important because the number of disturbing steals is higher. In addition, the monitoring of the queues shows that the ready tasks belong to the lowest levels of the spawning tree of the ideal divide-and-conquer layout, i.e. they will generate fewer additional tasks. The sites are continuously processing steal requests and the tasks currently evaluated are interrupted. If any request were satisfied, the thief processor would be busy for a short time and it would engage a new steal procedure quite soon. This situation increases the communication time, in particular to transmit the results of the last requests of the given scenario, because the network is loaded with all these steal requests. This behaviour can be observed in the simulation results of the model (figure 4).

Fig. 4. Accumulated steals in the simulation (% of total steals against t in ms)

Finally, figure 5 shows the resulting speedup of the system. In the actual system, the same transient problems can be observed when processing a low rate of tasks, especially during the initialization and termination phases. Note that the speedup reached by the model is based on an ideal workload. The actual speedup depends on many factors, such as the granularity of tasks, channel rates, the inherent parallelism of the workload... Further tuning of the workload model should be performed if more accurate results are needed.

Fig. 5. Accumulated speedup with 4 processors (speedup against t in ms)

5. CONCLUSION

This study shows the possibility of obtaining performance results from instrumented synchronous reactive models. An instrumentation methodology that involves two steps has been proposed:

1. The use of performance constructs in order to take into account the quantitative information needed for performance evaluation. Two kinds of construct were used: timed action constructs based on a global time referential, and an alternative behaviour construct implementing probabilities.

2. The monitoring of the model using a monitor module that tests at any time the presence of predefined signals. By choosing convenient signals, the generated simulation trace allows the computation of performance measures.

This methodology has been applied in a case study, a multithreaded runtime system for a distributed functional programming language that uses a work-stealing algorithm for balancing the workload. The behaviour observed in the model was validated with experimental results. The gathered data has shown the same weaknesses exposed by the actual system at the initialization and the finalization phases, when the number of tasks is low and most of the work-stealing attempts fail. To address the identified performance problems, the designer should use the performance model in order to adjust the behaviour of the scheduling policy. For instance, the scheduling algorithm can be changed for the intervals when the number of tasks falls under a given threshold.

There are some open issues that should be considered as future work:

1. Further measuring in the actual system must be performed in order to tune the different workload models.

2. Determine the proper mechanism to adjust the parameters of the actual scheduler using the model results to improve the efficiency of the distributed functional programs.

3. Consider the use of other simulation tools, in particular paying attention to the features related to the simulation time control, modeling capabilities, scalability...

6. REFERENCES

Backus, J. (1978). 'Can programming be liberated from the von Neumann style? A functional style and its algebra of programs'. Communications of the ACM 21(8), 613-641. Reproduced in 'Selected Reprints on Dataflow and Reduction Architectures', ed. S. S. Thakkar, IEEE, 1987, pp. 215-243.
Berry, G. (1993). The semantics of pure Esterel. In M. Broy (Ed.). 'Program Design Calculi'. Computer and System Sciences 118, NATO ASI Series. pp. 361-409.
Berry, G. (1997). The Esterel v5 Language Primer. Version 5.10, release 1.0.
Berry, G. and G. Gonthier (1992). 'The ESTEREL synchronous programming language: design, semantics, implementation'. Science of Computer Programming 19(2), 87-152.
Blumofe, R. D., C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall and Y. Zhou (1996). 'Cilk: An efficient multithreaded runtime system'. Journal of Parallel and Distributed Computing 37(1), 55-69.
Conquet, E., A. Valderruten, R. Tremoulet, Y. Raynaud and S. Ayache (1992). 'Un modèle du processus de l'activité d'évaluation des performances'. Génie Logiciel et Systèmes Experts (27), 27-31.
Gulias, V. (1998). DFL: Computación Funcional Distribuida. PhD thesis. University of La Coruña, Spain (to appear).
Halstead, R. H. (1985). 'Multilisp: A language for concurrent symbolic computation'. ACM Transactions on Programming Languages and Systems 7(4), 501-538.
Hudak, P. (1989). 'Conception, evolution, and application of functional programming languages'. ACM Computing Surveys 21(3), 359-411.
Jones, M. and P. Hudak (1993). Implicit and explicit parallel programming in Haskell. Technical Report YALEU/DCS/RR-982. Department of Computer Science, Yale University.
Leroy, X. (1997). The Objective Caml System, release 1.05. INRIA.
Milner, R., J. Parrow and D. Walker (1992). 'A calculus of mobile processes'. Information and Computation 100(1), 1-77.
Schauser, K. E. and S. C. Goldstein (1995). How much non-strictness do lenient programs require? In 'Functional Programming and Computer Architecture'. San Diego, CA.
Trinder, P. W., K. Hammond, S. J. Mattson, Jr., A. S. Partridge and S. L. P. Jones (1996). GUM: A portable parallel implementation of Haskell. In 'Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation'. ACM Press. New York. pp. 79-88.
Valderruten, A. (1993). Modélisation des Performances et Développement de Systèmes Informatiques: une étude d'intégration. PhD thesis. Université Paul Sabatier.
Valderruten, A. and V. Gulias (1996). An information model for performance engineering. In 'Proceedings of the International Conference on Information Systems Analysis, ISAS'96'. Orlando.
Valderruten, A., M. Vilares and J. Graña (1995). 'Instrumentation of synchronous reactive models for performance engineering'. Lecture Notes in Computer Science 989, 76-89.
Valderruten, A., V. Gulias and J. Freire (1997). Instrumentation strategies in distributed functional computing. In 'Proceedings of the International Conference on Information Systems Analysis, ISAS'97'. Caracas.
