A distributed event processing method for general purpose computation

Nasser Kalantery
E-mail: kalantn@wmin.ac.uk

Journal of Systems Architecture 44 (1998) 547-558



Abstract

Previous proposals for the application of discrete event oriented methods to automatic parallelization have been based on the optimistic execution strategy. In this paper we present a new method which avoids optimistic execution. This is motivated by the observation that the control structure of a conventional program constitutes a temporal coordinate system which is exogenous to the program execution. The method employs a logical time mechanism and provides adaptive synchronisation for the distributed execution. Hence data-dependent and/or conditional parallelism is released without the risk of coherency violation. The paper begins with a brief introduction to the Parallel Discrete Event Simulation (PDES) paradigm. Efficient coarse grain mapping of conventional programs onto this paradigm is then discussed. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Automatic parallelization; Parallel discrete event simulation; Logical time; Structured shared memory; Data caching policy

1. Introduction

Application of the Parallel Discrete Event Simulation (PDES) paradigm beyond its original domain is motivated by the localised memory-level synchronisation it offers [1]. Parallel execution of sequential applications that exhibit run-time dependencies can be dynamically co-ordinated by a PDES mode of operation. Use of the PDES paradigm as a general purpose parallel execution method has been proposed by a number of researchers [2-5]. These proposals have been based on the possible use of an optimistic strategy. In this paper we present a new method which is specifically developed for non-DES applications. This work is motivated by the observation that the control structure of a conventional program constitutes a temporal co-ordinate system which is exogenous to the program execution [6]. This is in marked contrast to programs in the DES domain, where the temporal co-ordinate evolves as the result of the simulation and is not available independently. The exogenous nature of temporal relations in an application can greatly simplify the ordering algorithm. One only needs to determine the distribution pattern of events at run time to arrive at a highly localised and efficient strategy. The PDES paradigm applied beyond its original domain thus finds the opportunity to avoid costly non-local operations, such as rollbacks and/or null messages, and compete successfully against conventional techniques such as barrier synchronisation [7].

2. PDES paradigm

PDES assumes a process-oriented execution. Each process constitutes a spatial locality and encapsulates one or more state variables. At each locality progress of the simulation is marked by a private variable called a clock. Clocks are updated asynchronously. Global progress of the simulation is achieved through inter-process communication of timestamped messages. Timestamps are used to determine the logical order in which incoming messages are processed. Incoming messages must be processed in increasing timestamp order. Thus a process must always select, as its next input, the message which has the smallest possible timestamp. However, what the next timestamp is, and when it will arrive, depends on the execution of other processes. Restricted to local information, a process cannot rule out the possibility that at any moment a new timestamp, smaller than the currently available one, may arrive. To enable processes to identify their next input, two basic strategies, conservative and optimistic, are available [8].

In the conservative approach, possible communication channels are identified and null messages are sent over the channels to provide the necessary timing updates at the destinations [9]. Once a process obtains messages on every input channel it can safely select the smallest one. In the optimistic approach, static communication channels and null messages are avoided. Instead, the necessary timing information is acquired through speculative forward execution and the provision of error detection and recovery procedures [10]. A process may move back and forth along the simulated time until it finds its correct next input.
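The conservative rule just described, consume inputs strictly in increasing timestamp order and only when every input channel has reported at least that far into logical time, can be illustrated with a small sketch. The code below is not taken from the paper; the types and names (Message, ConservativeProcess) are illustrative assumptions, written in C++ for concreteness.

#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// A timestamped message travelling between logical processes.
struct Message {
    std::uint64_t timestamp;   // position in logical (simulated) time
    int           channel;     // input channel it arrived on
    double        payload;     // application data
};

// Minimal conservative process: it consumes pending messages strictly in
// increasing timestamp order, and only up to the smallest time any input
// channel has reported so far (null messages would advance that bound).
class ConservativeProcess {
public:
    explicit ConservativeProcess(int channels) : horizon_(channels, 0) {}

    void receive(const Message& m) {
        pending_.push(m);
        horizon_[m.channel] = m.timestamp;   // this channel's clock advances
    }

    // Fills 'next' and returns true only when the selection is provably safe.
    bool next_safe_event(Message& next) {
        if (pending_.empty()) return false;
        std::uint64_t safe = *std::min_element(horizon_.begin(), horizon_.end());
        if (pending_.top().timestamp > safe) return false;   // must wait
        next = pending_.top();
        pending_.pop();
        return true;
    }

private:
    struct Later {
        bool operator()(const Message& a, const Message& b) const {
            return a.timestamp > b.timestamp;   // min-heap on timestamp
        }
    };
    std::priority_queue<Message, std::vector<Message>, Later> pending_;
    std::vector<std::uint64_t> horizon_;   // latest timestamp seen per channel
};

In this sketch a null message is simply a Message whose payload is ignored: it advances the channel's horizon and thereby unblocks the selection, which is the role null messages play in the conservative protocol.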

3. Mapping conventional applications onto the PDES paradigm

In the application of PDES to conventional programs, three preliminary issues relating to data decomposition, program decomposition and timestamp generation should first be clarified. Data and program decomposition leads to the creation of two distinct sets of processes, one performing memory functions and the other performing program functions. To distinguish between the two we will use the terms memory-node and task-node, respectively (Fig. 1).

3.1. Data decomposition

Most conventional applications assume a global common memory. To allow mapping onto a PDES model, the global memory is simulated as a set of distributed processes (memory-nodes). This type of abstraction is known as "structured shared memory" [11]. The memory is called structured because each of its directly addressable locations may be assigned a data item of any user-defined size and format. This means control over data granularity is given to the user (or a high level compiler). By integration of memory distribution and message routing in a single mapping function, the router is enabled to identify the appropriate destination from a given memory reference [12,13]. A memory-node receives Read/Write request messages and carries out the request over the specified location.

Fig. 1. Timestamped message passing between memory- and task-nodes.

3.2. Program decomposition

Loops contain the highest amount of parallelism and constitute the most time-consuming parts of conventional applications. Therefore they are the natural candidates for parallelization. Coarse grain processes can be obtained from a hierarchy of nested loops if the outermost loop is parallelized and the inner loops are kept sequential. Consecutive iterations of a parallelized loop are distributed across a given number of task-nodes such that each process executes a subset of the iterations of the same loop. Each task may have its own private variables, but variables subject to contention between two or more tasks are transferred onto the shared memory. A reference to a shared variable is represented by a corresponding event-message call.

3.3. Timestamp generation

In a DES program the notion of simulated time is explicitly available. The simulation clock records the occurrence time (or timestamp) of the last input (caused) event and is used to calculate a timestamp for the occurrence of the next output (scheduled) event. In a conventional program the notion of simulated time does not exist. Instead, a logical time system is derived from the specified sequential order of the programmed operations [14,15]. A program may contain a series of loops. Each loop would contain a series of iterations and each iteration may consist of a series of statements. Loops may be nested in a hierarchy. A logical time system can be derived such that the sequential position of a message arising from anywhere in the running program is clearly identified by its timestamp. A method of deriving unique timestamps from a sequential program structure has been presented in [14]. At the parallelization stage, each process is allocated a logical clock. A logical clock is a data structure coupled with a function. The function is used to update the clock when an event is executed. Logical clocks are maintained such that the global sequential position of a Read or a Write operation is identified from the current value of the local clock.
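To make the logical time scheme concrete, the sketch below models a timestamp as the hierarchical tuple used later in the paper (loop number, iteration number, statement number, e.g. <0,i,2>), ordered lexicographically, together with a task-local logical clock that is advanced as events are issued. The class and function names are illustrative assumptions, not taken from the paper.

#include <cstdint>
#include <tuple>

// Hierarchical logical timestamp: <loop, iteration, statement>.
// Lexicographic order reproduces the original sequential execution order.
struct Timestamp {
    std::int64_t loop = 0;
    std::int64_t iteration = 0;
    std::int64_t statement = 0;

    bool operator<(const Timestamp& o) const {
        return std::tie(loop, iteration, statement)
             < std::tie(o.loop, o.iteration, o.statement);
    }
    bool operator==(const Timestamp& o) const {
        return std::tie(loop, iteration, statement)
            == std::tie(o.loop, o.iteration, o.statement);
    }
};

// Task-local logical clock: updated whenever a Read or Write event is
// issued, so the clock value always names the global sequential position
// of the operation being performed.
struct LogicalClock {
    Timestamp now;

    // Entering iteration i of loop l resets the statement counter.
    void begin_iteration(std::int64_t l, std::int64_t i) {
        now = {l, i, 0};
    }
    // Each shared-memory operation inside the iteration gets the next
    // statement number; the returned value timestamps the event message.
    Timestamp tick() {
        ++now.statement;
        return now;
    }
};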

4. Coherency condition: Causality constraint in general purpose computation

The causality constraint in general purpose computation stems from data dependencies between task-nodes. A Read event must return the value which is given (caused) by the sequentially latest Write event for that memory location. This is known as the coherency condition [16]. When a Read request is processed, the value with the highest timestamp preceding that request must be identified and returned to the reader. Returning any other value would constitute a coherency error and invalidate the execution. At any given memory location, values originating from different task-nodes arrive in an entirely non-deterministic order. Although timestamps provide a facility to determine the precedence relations between already received messages, the likelihood of new arrivals keeps this relation subject to constant alteration. Thus the latest value with respect to a Read event cannot be safely determined.

To illustrate the problem, consider a scenario where values arriving for a variable are maintained in a list in increasing timestamp order and a Read request with timestamp TR is being processed (Fig. 2). Clearly, if the values up to time TR were all present in the list, the requested value could be safely identified. However, unless a global check is made, this cannot be ascertained. At any moment a new arrival may supersede the locally latest value. Committing a global check for each Read request would be unacceptable because it would kill off all the available parallelism and in addition introduce huge overheads. On the other hand, local processing of a Read request, without making a global check, may lead to the violation of the coherency condition.

Optimistic execution is one candidate solution for the above dilemma. At each Read request the state of the requesting task could be saved and, whenever a coherency error is detected, a roll-back mechanism could be invoked to chase after and eliminate the spreading errors and eventually put the execution back on a correct path. However, in general purpose computation we are dealing with programming constructs where the global order of events is exogenous to the main computation, and hence resorting to optimistic execution often turns out to be unnecessary. In this class of programs, notably in FORTRAN based scientific and engineering applications [17], loop control in conjunction with statement ordering constitutes a regular and predictable temporal co-ordinate, and the event distribution pattern is given by a set of index expressions which are typically independent of the main loop body. There are numerous cases where index expressions include parameters which are not available at compile time, but which become available at run time prior to loop entry. Therefore, when the required parameters are produced, the timestamps and distribution pattern of future events pertaining to the given loop can be generated independently. This constitutes global knowledge which, once distributed, can be used locally, without imposing global lock-step operation or costly communications. Our proposed method is developed to exploit this characteristic.
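The coherency condition itself can be phrased as a small lookup rule: among the writes already scheduled at a location, a Read with timestamp TR must be answered by the write with the greatest timestamp strictly smaller than TR. A minimal sketch of that rule follows; the container and function names are assumptions made for illustration only.

#include <cstdint>
#include <map>
#include <optional>

// Writes already known at one memory location, keyed by timestamp.
// (An integer timestamp stands in for the <loop,iteration,statement>
// tuple; only its ordering matters here.)
using WriteHistory = std::map<std::uint64_t, double>;

// Coherency condition: a Read at time t_r must see the value of the
// sequentially latest Write preceding it, i.e. the write with the
// greatest timestamp strictly smaller than t_r. Returns nothing if no
// such write is (yet) present -- the caller must not guess.
std::optional<double> latest_write_before(const WriteHistory& h,
                                          std::uint64_t t_r) {
    auto it = h.lower_bound(t_r);   // first write with timestamp >= t_r
    if (it == h.begin()) return std::nullopt;
    --it;                           // greatest timestamp < t_r
    return it->second;
}

The difficulty described in this section is precisely that, without further knowledge, the history may still gain a new entry between the currently known latest write and t_r.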

Fig. 2. Processing a Read request with timestamp TR: the written values for a virtual location are held in a list in increasing timestamp order; the current Latest-Write at TR may yet be superseded by a new arrival.

5. Global event skeleton of a loop

A sequential loop consists of a control construct and a loop body. Statements comprising the loop body can be divided into two distinct types:
1. statements specifying a Read or a Write on a shared variable;
2. statements specifying local processing in private memory.

for (i = LB; i < UB; i += step)
{
    var s;
    Read: A[x[i]], s;      (* timestamp: <0,i,1> *)
    -- local work --
    Write: A[y[i]], s;     (* timestamp: <0,i,2> *)
}

Fig. 3. An example of a loop with dynamic dependencies across its iterations.

Consider the example loop given in Fig. 3. In this example, at each iteration an element of array A is read and the result of local work is written back to another element of the same array.² Within the execution span of this loop, elements of array A are referenced through the index expressions x[i] and y[i] and present a run-time dependency relation amongst loop iterations. Hence at the parallelization stage elements of array A are treated as shared variables and are encapsulated in memory-nodes. Distributed execution of the loop generates a set of Read/Write event messages. To enable memory-nodes to predict future messages, a skeleton of the loop is extracted at compile time such that only event statements remain in the loop body. By execution of this remaining skeleton, the complete set of possible event messages pertaining to the original loop can be produced. For example, events pertaining to the loop of Fig. 3 are represented by the skeleton loop of Fig. 4. Such a skeleton consists of
1. a loop control construct;
2. event statements within the loop body.

² A timestamp expression such as <0,i,2> indicates the loop no. '0', the iteration no. 'i' and the statement no. '2' (see [1]).

for (i = LB; i < UB; i += step)
{
    A[x[i]], <0,i,1>, PR;
    A[y[i]], <0,i,2>, PW;
}

Fig. 4. Global event skeleton of the loop of Fig. 3.

An event statement contains three fields:
(i) an address expression (A[x[i]] and A[y[i]] in Fig. 4);
(ii) a timestamp expression (<0,i,1> and <0,i,2> in Fig. 4);
(iii) an event type, e.g. Read or Write (PR and PW in Fig. 4).
Thus all three (spatial, temporal and procedural) co-ordinates of the global events pertaining to the parallel execution of a loop are abbreviated in the loop skeleton.
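A compiled skeleton can therefore be carried around as a small table of event statements plus the loop control bounds. The representation below is a sketch under that reading; the struct and field names are assumptions, and the address and timestamp expressions are shown as callable functions of the iteration variable.

#include <cstdint>
#include <functional>
#include <vector>

enum class EventType { ProxyRead, ProxyWrite };   // PR / PW in Fig. 4

// One event statement of a skeleton: address expression, timestamp
// expression and event type, all parameterised by the iteration i.
struct EventStatement {
    std::function<std::uint64_t(std::int64_t)> address;    // e.g. address of A[x[i]]
    std::function<std::uint64_t(std::int64_t)> timestamp;  // e.g. encoding of <0,i,1>
    EventType type;
};

// Global event skeleton of one parallelized loop: the loop control
// construct (LB, UB, step) plus its event statements.
struct LoopSkeleton {
    std::int64_t lb, ub, step;
    std::vector<EventStatement> statements;
};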

6. Proxy events

By making a copy of the skeleton available to every memory-node, each node is enabled to iterate through the skeleton and evaluate each address expression. If a resulting address matches that of a locally encapsulated location, then a Proxy event is entered into that location's event list. Execution of the skeleton up to logical time T schedules the complete set of all Write events with timestamp smaller than T at the given virtual location. Now, for any Read request with timestamp smaller than T, the "Latest-Write" can be safely determined. A Proxy-Write reveals the timestamp of an expected Write operation but does not hold a value.

Its presence in an event list means that a corresponding Read request (i.e. the Read request for which the Proxy-Write happens to be the "Latest-Write") must be blocked. The corresponding Read remains suspended until the actual Write event is received from a task-node. To satisfy the coherency condition, production of Proxy-Writes is entirely sufficient. It can easily be arranged that an arriving Read request, with timestamp T, is not processed until the skeleton executes past T. This will ensure that, when a Read request is processed, its respective Latest-Write is correctly identified. However, Proxy-Reads are generated to serve a different purpose: they offer a significant reduction in communication latency and overheads. A Proxy-Read event contains all the information which is expected from an actual Read request. This eliminates the need for transmission of actual Read request messages from task-nodes to memory-nodes. Unsolicited Value messages are sent from memory-nodes to tasks, "automatically", in response to locally generated Proxy-Reads. It is worth remembering that the timestamp of an event is a unique value that indicates the exact point in a loop iteration from which the event is derived. Thus the timestamp of a Proxy-Read identifies the program step to which the event belongs and offers the necessary information for the router to establish the physical destination to which the returned value must be delivered. Unsolicited instigation of Value messages by the memory demands appropriate provisions at the task management site, which are described in Section 10.
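The proxy generation step itself is a local loop over the skeleton: each memory-node evaluates the address expression for every iteration and keeps only the events that map onto locations it owns. The sketch below follows that description; the ownership test, the event list type and all names are illustrative assumptions rather than the paper's implementation.

#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Minimal skeleton record: each event statement is a function of i.
struct SkeletonEvent {
    std::function<std::uint64_t(std::int64_t)> address;
    std::function<std::uint64_t(std::int64_t)> timestamp;
    bool is_write;                 // true = Proxy-Write, false = Proxy-Read
};

struct ProxyEvent {
    std::uint64_t timestamp;
    bool is_write;
};

// Per-location event list, ordered by timestamp.
using EventList = std::multimap<std::uint64_t, ProxyEvent>;

// Executing the skeleton at one memory-node: every address that this
// node encapsulates receives the proxy events derived from the loop.
void generate_proxies(const std::vector<SkeletonEvent>& skeleton,
                      std::int64_t lb, std::int64_t ub, std::int64_t step,
                      const std::function<bool(std::uint64_t)>& owned_here,
                      std::map<std::uint64_t, EventList>& locations) {
    for (std::int64_t i = lb; i < ub; i += step) {
        for (const auto& ev : skeleton) {
            std::uint64_t addr = ev.address(i);
            if (!owned_here(addr)) continue;           // not this node's location
            std::uint64_t ts = ev.timestamp(i);
            locations[addr].emplace(ts, ProxyEvent{ts, ev.is_write});
        }
    }
}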

7. Flow control

Each proxy event occupies a certain amount of storage space and therefore excessive proxy generation is not desirable. Indeed, a memory-node only needs to be one step ahead of the task-nodes. Proxy event production should be interleaved with receiving and responding to actual events from tasks in "real time". This is necessary to provide maximum parallelism between memory and processor actions. The execution of the skeleton is therefore controlled by a simple flow control mechanism. Each memory-node keeps a count of generated Proxy-Writes against the number of actual Write arrivals. Once the number of yet-to-be-substantiated proxies exceeds a set upper threshold, execution of the skeleton is stopped. Subsequent arrival of actual Write events reduces the number of unsubstantiated proxies. When this falls below a set lower threshold, execution of the skeleton is resumed. This is analogous to the well-known Xon/Xoff flow control protocol. Hence the physical space requirement is minimised and kept under close control, and production of Proxy events is interleaved with the arrival of actual events from task-nodes.
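The Xon/Xoff analogy maps onto a pair of thresholds around a single counter of unsubstantiated Proxy-Writes. The following sketch shows one way such a throttle could look; the threshold handling and the names are assumptions, not taken from the paper.

#include <cstddef>

// Flow control for skeleton execution at a memory-node, in the spirit of
// Xon/Xoff: stop generating proxies when too many Proxy-Writes are still
// waiting for their actual Write, resume when enough of them arrive.
class ProxyThrottle {
public:
    ProxyThrottle(std::size_t high, std::size_t low)
        : high_(high), low_(low) {}

    // Called whenever the skeleton emits a Proxy-Write.
    void on_proxy_write() {
        ++unsubstantiated_;
        if (unsubstantiated_ >= high_) running_ = false;   // Xoff
    }
    // Called whenever an actual Write from a task-node substantiates one.
    void on_actual_write() {
        if (unsubstantiated_ > 0) --unsubstantiated_;
        if (unsubstantiated_ <= low_) running_ = true;     // Xon
    }
    // The skeleton interpreter polls this before producing more proxies.
    bool may_generate() const { return running_; }

private:
    std::size_t high_, low_;
    std::size_t unsubstantiated_ = 0;
    bool running_ = true;
};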

8. Conditional events and deadlock avoidance

A conditional event is specified as follows:

If (condition-expression) Then Event;

If the condition-expression can be evaluated independently of the main computation, then it can be included within the skeleton. However, there may be cases where evaluation of a condition-expression is dependent on values which are not available at the loop entry point. Such conditional events can still be adequately handled within the proposed strategy. To ensure that the skeleton of a loop containing non-deterministic conditions can safely predict future events, such conditions are excluded from the skeleton. As a result, a conditional event is represented, within the skeleton, as an unconditional event. This keeps the skeleton simple and self-contained. But omission of a condition is equivalent to assuming that the condition evaluates to true. This leads to event management issues which require careful consideration.

Prediction of a Read event which eventually does not materialise means that the subsequent Value message returned to a task will remain unused in the input queue of the task. Although this does not affect the semantics of the program, it must be adequately managed to avoid possible space congestion. This is achieved by destroying input messages as and when their timestamps fall below (i.e. into the past of) the local time threshold. Adequate handling of a predicted Write event which does not actually materialise is more critical. When a Proxy-Write is scheduled, its presence may block one or more Read requests. Read requests are blocked until the corresponding Write arrives. If this arrival never takes place, then the Read request would remain forever blocked. A task awaiting the requested value would then also remain blocked, and this cycle may soon escalate until all processes within the system enter a blocked state, that is, a deadlock. To avoid deadlocks and to allow the execution to proceed to its intended termination, anti-Writes are used. When a Write meets its anti-Write the two are annihilated. Annihilation of a Proxy-Write event results in the release of the Read requests which were suspended on that proxy. The released Read requests now apply to the immediately preceding Write in the event list (Fig. 5). Thus a conditional Write event such as W in "if (condition) then W" is implemented to the effect of "if (condition) then W else Anti-W". Subsequently, in compiling the skeleton of the same loop, that same conditional event W is represented as an unconditional event "W".
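On the task side this transformation is mechanical: the write statement keeps its timestamp whether or not the condition holds, and the false branch sends an Anti-Write carrying that same timestamp so the memory-node can cancel the corresponding Proxy-Write. A minimal sketch of the emitted code follows; the messaging helpers (send_write, send_anti_write) are assumed names standing in for the run-time layer, not the paper's API.

#include <cstdint>
#include <cstdio>

// Stand-ins for the run-time messaging layer (assumed, for illustration).
void send_write(std::uint64_t address, std::uint64_t timestamp, double value) {
    std::printf("W  addr=%llu ts=%llu val=%f\n",
                (unsigned long long)address, (unsigned long long)timestamp, value);
}
void send_anti_write(std::uint64_t address, std::uint64_t timestamp) {
    std::printf("AW addr=%llu ts=%llu\n",
                (unsigned long long)address, (unsigned long long)timestamp);
}

// Task-node code emitted for the conditional write "if (condition) then A[y[i]] = v".
// The skeleton predicted this Write unconditionally, so the false branch must
// substantiate the prediction with an Anti-Write carrying the same timestamp.
void conditional_write(bool condition, std::uint64_t addr,
                       std::uint64_t ts, double v) {
    if (condition) {
        send_write(addr, ts, v);       // actual Write, replaces the Proxy-Write
    } else {
        send_anti_write(addr, ts);     // annihilates the surplus Proxy-Write
    }
}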

This allows skeleton derivation to ignore non-deterministic conditions and keep event prediction a simple and local operation. Therefore, when a predicted conditional Write does not happen, a corresponding anti-Write does happen. Anti-Writes eventually find and annihilate surplus Write proxies and prevent deadlocks.

Fig. 5. Annihilation of a Proxy-Write results in the release of the respective Read requests: (a) an Anti-Write arrives and finds its matching Proxy; (b) the released Reads of the annihilated proxy now apply to the immediately preceding Write in the list.

In the light of the above discussion concerning Proxy events, Anti-events and actual Read/Write events, the principal features of event management at memory-nodes and task management at processor nodes are outlined in Sections 9 and 10.

9. Event management procedure

The course of action in processing an event is determined by the type of the event (see the pseudocode in Fig. 6). The four possible types of event reaching the scheduler at a memory-node are W, PW, R and AW. W signifies an actual Write operation by a remote task process. PW is a Proxy-Write operation, which is generated by the locally executing skeleton. R may be either an actual Read or a Proxy-Read operation;³ here the second option, i.e. Proxy-Read, is assumed. AW is an Anti-Write event, which is sent by a processor node when the condition governing a Write operation evaluates to false. W and AW events arrive in a non-deterministic order and are scheduled into event lists at their appropriate addresses. Scheduling of W and AW events is interleaved with the scheduling of Proxy events, which are generated by the locally executing skeletons.

³ Regarding a Read operation, a compiler can select one of two options: (1) specify an actual Read request message to be sent from the task to the memory; or (2) include the Read event in the skeleton and hence let the memory generate a corresponding Proxy-Read. For a given Read operation, only one of the two options may be selected.

schedule (event)
    case type-of (event)
        W:  if (PW found)
                { update PW to W; commit suspended Reads }
            else insert event.
        R:  find "latest-write";
            if ("latest-write" is a W) dispatch value
            else suspend under "latest-write".
        PW: if (W found) discard event
            else if (AW event found) annihilate
            else insert event.
        AW: if (PW found) annihilate
            else insert event.

Fig. 6. Pseudocode describing the memory event manager.

Locally generated Proxy events are produced in increasing logical time order. Hence, when a Proxy-Read event is scheduled into an event structure, it is guaranteed that a respective "Latest-Write" event is already represented in that event list.⁴ Write and Proxy-Write events are inserted in the event list under the given address in increasing timestamp order. For a Read event the Latest-Write is identified by in-order traversal of the event list. If the Latest-Write is a Proxy-Write, then the Read request is suspended, else the value provided by the Latest-Write is dispatched immediately. Suspended Read requests remain attached to their respective Latest-Write proxies (Fig. 7). A PW event (Proxy-Write) may be annihilated after it is scheduled (by arrival of an AW event). In this case the R events suspended under the cancelled PW are transferred to the preceding W event in the list. If an AW event arrives before its PW counterpart, then it is inserted in the appropriate position in the list. Later scheduling of the corresponding PW results in the annihilation of both events, as if they never existed. When a W event arrives and its PW is already in the list, then the PW is replaced by the W. Note that a PW is a partial copy of its respective W, lacking only the value content. The replacement of PW with W also leads to the committing of the Read requests suspended on that PW.

⁴ An actual Read request message, with timestamp T, is not processed until the skeleton executes past T. Hence it is ensured that when it is processed its respective Proxy-Write is in the list.
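Read together with Fig. 6, the event manager is essentially an ordered per-location list with four insertion rules. The C++ sketch below mirrors that pseudocode for a single virtual location; it is an illustrative rendering, not the Spider implementation, and the type and function names are assumptions.

#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <utility>
#include <vector>

enum class Kind { W, PW, AW };             // actual Write, Proxy-Write, Anti-Write

struct Entry {
    Kind kind;
    std::optional<double> value;           // present only for actual Writes
    std::vector<std::uint64_t> suspended;  // timestamps of Reads waiting here
};

// Event-ordering structure for a single virtual location, keyed by timestamp.
class Location {
public:
    void schedule_w(std::uint64_t ts, double v) {            // W: actual Write
        auto it = list_.find(ts);
        if (it != list_.end() && it->second.kind == Kind::PW) {
            it->second.kind = Kind::W;                        // update PW to W
            it->second.value = v;
            commit(it->second);                               // commit suspended Reads
        } else {
            list_[ts] = Entry{Kind::W, v, {}};
        }
    }
    void schedule_pw(std::uint64_t ts) {                      // PW: Proxy-Write
        auto it = list_.find(ts);
        if (it != list_.end() && it->second.kind == Kind::W) return;   // discard
        if (it != list_.end() && it->second.kind == Kind::AW) { annihilate(it); return; }
        list_[ts] = Entry{Kind::PW, std::nullopt, {}};
    }
    void schedule_aw(std::uint64_t ts) {                      // AW: Anti-Write
        auto it = list_.find(ts);
        if (it != list_.end() && it->second.kind == Kind::PW) { annihilate(it); return; }
        list_[ts] = Entry{Kind::AW, std::nullopt, {}};
    }
    void schedule_r(std::uint64_t ts) {                       // R: (Proxy-)Read
        auto it = latest_write_before(ts);
        if (it == list_.end()) return;                        // nothing written yet
        if (it->second.kind == Kind::W) dispatch(ts, *it->second.value);
        else it->second.suspended.push_back(ts);              // wait on the Proxy-Write
    }

private:
    std::map<std::uint64_t, Entry> list_;
    using Iter = std::map<std::uint64_t, Entry>::iterator;

    Iter latest_write_before(std::uint64_t ts) {
        auto it = list_.lower_bound(ts);
        while (it != list_.begin()) {
            --it;
            if (it->second.kind != Kind::AW) return it;       // W or PW
        }
        return list_.end();
    }
    void annihilate(Iter it) {
        // Fig. 5: released Reads re-apply to the immediately preceding Write.
        std::vector<std::uint64_t> waiting = std::move(it->second.suspended);
        list_.erase(it);
        for (std::uint64_t r : waiting) schedule_r(r);
    }
    void commit(Entry& e) {
        for (std::uint64_t r : e.suspended) dispatch(r, *e.value);
        e.suspended.clear();
    }
    static void dispatch(std::uint64_t read_ts, double v) {
        std::cout << "value " << v << " sent for Read at ts " << read_ts << "\n";
    }
};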

Fig. 7. Example snapshot of the state of the event-ordering structure at some virtual location in a memory-node: outstanding Write, Proxy-Write, Anti-Write and suspended Read Request events (key: R = Read, W = Write, PW = Proxy-Write, AW = Anti-Write).

10. Task management procedure

A task-node is in one of three states: (1) Ready-To-Run, (2) Running and (3) Suspended (Fig. 8). When a call is made to Read or Write a global location, first the local clock of the calling task is updated. Then, if the request requires actual message transmission, a timestamped message is composed and presented to the router. The router looks after the delivery of the message to the addressed location. When a task makes a Read request, two possible scenarios may occur: (1) if the required value has already arrived (because of the predictive action of memory-nodes) then the value is passed to the task and it can continue execution; (2) but if the requested value (identified by its address and timestamp) has not yet arrived, then the task is suspended until the required value is received. Suspended tasks are queued in a suspended task list. Each time a task is suspended, the next task at the front of the Ready-To-Run queue may execute. When the expected value for a task in the suspended queue arrives, that task is moved to the Ready-To-Run queue.

Fig. 8. State transition diagram of a task-node.
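The task scheduler described above amounts to a small state machine over two queues, keyed by the (address, timestamp) pair that identifies an expected value. The sketch below is one plausible reading of that description; the names and the single-threaded round-robin structure are assumptions for illustration.

#include <cstdint>
#include <deque>
#include <map>
#include <optional>
#include <utility>

enum class TaskState { ReadyToRun, Running, Suspended };

struct Task {
    int id;
    TaskState state;
};

// A pending value is identified by the (address, timestamp) of the Read.
using ValueKey = std::pair<std::uint64_t, std::uint64_t>;

class TaskManager {
public:
    void add_task(int id) { ready_.push_back(Task{id, TaskState::ReadyToRun}); }

    // Value messages pushed, unsolicited, by memory-nodes.
    void on_value(const ValueKey& key, double v) {
        values_[key] = v;
        auto it = suspended_.find(key);
        if (it != suspended_.end()) {                  // wake the waiting task
            Task t = it->second;
            t.state = TaskState::ReadyToRun;
            ready_.push_back(t);
            suspended_.erase(it);
        }
    }

    // A Running task issues a Read: scenario (1) the value is already here,
    // scenario (2) the task must be suspended until it arrives.
    std::optional<double> read(const Task& t, const ValueKey& key) {
        auto it = values_.find(key);
        if (it != values_.end()) return it->second;    // continue running
        Task s = t;
        s.state = TaskState::Suspended;
        suspended_.emplace(key, s);                    // tasks stored by value for brevity
        return std::nullopt;                           // caller must yield the processor
    }

    // Dispatch the task at the front of the Ready-To-Run queue.
    std::optional<Task> next_runnable() {
        if (ready_.empty()) return std::nullopt;
        Task t = ready_.front();
        ready_.pop_front();
        t.state = TaskState::Running;
        return t;
    }

private:
    std::deque<Task> ready_;                 // Ready-To-Run queue
    std::map<ValueKey, Task> suspended_;     // suspended task list
    std::map<ValueKey, double> values_;      // prefetched values not yet consumed
};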

11. Conclusion and future work

Salient features of a distributed event processing method for a class of non-DES applications were described. This method complies with the PDES mode of execution. In PDES a logical or virtual time domain is used to determine precedence relations between non-local operations. Timestamps are used to piggyback synchronisation information onto data communication messages, and an on-line knowledge acquisition strategy is employed to ensure that local event processing conforms with the global ordering requirements. In this way maximum parallelism is achieved and yet the logical integrity of the execution is preserved. Our proposed method diverges from the established optimistic or conservative strategies. This is due to the observation that, unlike the DES domain, temporal relations in conventional programs are exogenous to the main computation and a great deal of the knowledge acquisition can be performed at compile time. We presented a mechanism whereby this knowledge, in the form of event skeletons, can be used at distributed memory-nodes to process the relevant requests in a globally coherent manner. We also described how such skeletons can be used to provide automatic prefetching of data values to the relevant task locations. We discussed how conditional dependencies can be handled by the provision of appropriate anti-Write messages and how the execution of skeletons can be controlled to optimise parallelism and minimise space usage. The proposed method combines compiler techniques with a dynamic memory co-ordination mechanism and therefore can be used to parallelize programs which contain run-time dependencies. The example loops which were discussed in this paper cannot be parallelized by conventional compiler techniques [18]. Future run-time compilation systems may be able to handle some dynamic dependencies [19], but ultimately they would require an efficient synchronisation and data caching policy. The beauty of the PDES paradigm is that it offers a unified solution for all these problems. The algorithms and methods described in this paper are implemented in the Spider programming system [7], which is developed at the Centre for Parallel Computing, University of Westminster.⁵ Preliminary experiments have produced good performance on a network of workstations.

Acknowledgements

The author is grateful to Steve Winter and Derek Wilson for comments on earlier drafts of this paper.

⁵ Public distribution of the system is scheduled for mid 1997. Interested readers may contact the author for further details.

References

[1] N. Kalantery, S.C. Winter, D. Wilson, From BSP to a virtual von Neumann computer, first presented at the BCS open meeting on General Purpose Parallel Computing, December 1993; also appeared in IEE Computer and Control Engineering Journal 6 (3) (1995) 131-136.
[2] R.M. Fujimoto, The virtual time machine, in: Proceedings of the International Symposium on Parallel Algorithms and Architectures, June 1989, pp. 199-208.
[3] P. Tinker, Task scheduling for general rollback computing, in: Proceedings of the 1989 International Conference on Parallel Processing, August 1989.
[4] A. Back, S.J. Turner, Using optimistic execution techniques as a parallelization tool for general purpose computing, in: Proceedings of HPCN Europe'95, Italy, May 1995, pp. 21-26.
[5] J.G. Cleary, M.W. Pearson, H. Kinawi, The architecture of an optimistic CPU: The warp engine, in: Proceedings of HICSS'95, vol. I, 1995, pp. 163-172.
[6] N. Kalantery, Parallel discrete event simulation, Ph.D. Thesis, University of Westminster, 1994. Electronic version available on request from the author.
[7] N. Kalantery, Spider: A virtual von Neumann machine on a network of workstations, at http://www.cpc.wmin.ac.uk/~spider.
[8] R.M. Fujimoto, Parallel discrete event simulation, Comm. ACM 33 (10) (1990) 20-53.
[9] K.M. Chandy, J. Misra, Asynchronous distributed simulation via a sequence of parallel computations, Comm. ACM 24 (11) (1981) 198-205.
[10] D.R. Jefferson, Virtual time, ACM Trans. Programming Languages and Systems 7 (3) (1985) 404-425.
[11] D.J. Scales, M.S. Lam, The design and evaluation of a shared object system for distributed memory machines, in: Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994.
[12] A.G. Ranade, How to emulate shared memory, in: Proceedings of the 28th IEEE Symposium on Foundations of Computer Science, 1987, pp. 185-194.
[13] L.G. Valiant, A bridging model for parallel computation, Comm. ACM 33 (8) (1990) 103-111.
[14] N. Kalantery, S.C. Winter, D. Wilson, Deterministic parallel execution of sequential code, in: Proceedings of the Second Euromicro Workshop on Parallel and Distributed Processing, Spain, January 1994, pp. 45-55.
[15] A. Back, S.J. Turner, Timestamp generation for optimistic parallel computing, in: Proceedings of the 28th Annual Simulation Symposium, USA, April 1995, pp. 144-153.
[16] L.M. Censier, P. Feautrier, A new solution to the coherence problem in multicache systems, IEEE Trans. Comput. C-27 (12) (1978) 1112-1118.
[17] Z. Shen, Z. Li, P.C. Yew, An empirical study of Fortran programs for parallelizing compilers, IEEE Trans. Parallel & Distributed Systems 1 (3) (1990) 356-364.
[18] W. Blume et al., Automatic detection of parallelism: A grand challenge for high performance computing, IEEE Parallel & Distributed Technology, Fall 1994, pp. 31-47.
[19] J. Saltz, H. Berryman, J. Wu, Multiprocessors and runtime compilation, Concurrency: Practice & Experience 3 (6) (1991) 573-592.

Nasser Kalantery is the Quintin Hogg fellow in parallel computing at the University of Westminster, London, UK. His research interests include models of parallel computation, distributed simulation and automatic parallelization. Kalantery graduated in control and computer engineering and received a Ph.D. in Parallel Discrete Event Simulation from the University of Westminster.