A multi-thread approach reducing program execution time in a heterogeneous reconfigurable multi-processor architecture


Journal of Systems Architecture 43(1997) 143-153

A multi-thread approach reducing program execution time in a heterogeneous reconfigurable multi-processor architecture

Miroslaw Thor
Institute of Computer Science, Polish Academy of Sciences, Ordona 21, 01-237 Warsaw, Poland

Abstract

A new approach leading to reduced user program execution time in a multi-processor environment with dynamic reconfiguration capabilities is proposed in this paper. The general idea is that a user program partitioned into specific parts called slices (based on a program graph representation) is executed on different heterogeneous, dynamically reconfigurable multi-processor units (MPUs). Reduction of the program execution time is achieved through finding the MPU with the shortest execution time for each program slice and appropriate control flow between executing units. Two different methods are proposed:
• Run-time reduction of a program execution time using multiple heterogeneous MPUs. All MPUs simultaneously start execution of each program slice. This method is used for real-time applications when analytical methods are impossible to apply, program graph activations are non-deterministic, and/or external control signals influence the program execution paradigm.
• Finding the optimal hardware topologies of the MPUs and allocating program slices to them; then a reduced-time program execution through look-ahead reconfiguration and switching of control between the MPUs.
The proposed strategy and algorithms are described in detail. The methods are compared and the scope of their applicability is discussed. Some indications for future investigations are given in the conclusions.

Keywords: Multi-processor architecture; Reconfiguration; Optimization

Email: [email protected]
1383-7621/0165-6074/97/$17.00 © 1997 Published by Elsevier Science B.V. All rights reserved.
PII S1383-7621/0165-6074(96)00113-0


1. Introduction

1.1. Different approaches concerning multi-processor structures and algorithms for optimal or semi-optimal execution of user programs

In the last few years two general trends can be observed in the area of multi-processor architectures and algorithms aimed at increasing the efficiency of user program execution:
• development of reconfigurable, distributed memory multi-processor systems,
• development of different (up to now mainly static) allocation strategies.
Following the first trend, a number of reconfigurable, distributed memory multi-processors, such as transputer networks, have been proposed [1-4]. Simultaneously, experiments show that task allocation strategies are influenced by many factors, such as the size and topology of the network, and strongly depend on the application programs (frequency of communication between processes, whether work is generated locally or sent from other processors, etc.).

In the investigations of the second trend a lot of attention has been paid to static allocation in mapping problems. Some of the proposed mapping strategies were based on mathematical programming, graph theory or queuing theory [5-7]. Although these strategies give optimal solutions, they are time consuming. Therefore, some heuristic methods (iterative or constructive) have been used to speed up the search [8-10]. Among the iterative heuristic methods, hill-climbing, simulated annealing and genetic algorithms have mainly been considered [8]. On the other hand, in the dynamic allocation area the majority of investigations concerned load balancing [11,12] and dynamic parallel computations, but on specific architectures such as hypercube multicomputers [13] or reconfigurable tree structures [14].

A common feature of all the above-mentioned approaches was that either already existing architectures were used, or new ones were constructed using different assumptions, concerning for example the calculation of communication and execution costs or the estimated load of particular inter-processor connections (intensity of communication), before user programs could be run. Only in some of them was semi-dynamic reconfiguration possible, where application programs are executed section after section with static reconfiguration of link connections between section executions. The HP-FLEXAR architecture [15] was an attempt to remove the above-mentioned obstacles, especially to eliminate link connection reconfiguration overheads in the communication time between processors. For this purpose, two strategies are proposed in this paper.

1.2. The use of the HP-FLEXAR as a test bed for multi-processor investigations

The HP-FLEXAR, a recently proposed high-performance reconfigurable multi-transputer heterogeneous topology architecture [15], is especially suitable for dynamic allocation and execution of parallel algorithms. One of the aims of the HP-FLEXAR was to support the development of architectural concepts and communication strategies aimed at increasing the efficiency of reconfigurable multi-transputer systems. But equally important was the use of the HP-FLEXAR as a test bed to investigate the influence of different multi-transputer structures on program execution time. In the HP-FLEXAR architecture, the possibility of dynamic reconfiguration of the executional system hardware and dynamic loading of particular program parts, combined with appropriate allocation algorithms, makes it possible to determine the minimal program execution time.

In the HP-FLEXAR architecture reconfiguration is easy and fast, with minimal hardware and communication control overheads. System expandability at the same communication level, as well as hierarchical expandability (scaling), is also satisfied. Besides that, the possibility of directly connecting processors (avoiding passing through intermediate nodes) is extremely important. The first evaluation of the HP-FLEXAR (a partial hardware implementation and simulation) shows that it is especially suitable (easy dynamic mapping of tasks and increased performance) for real-time execution of parallel allocation algorithms. The use of the HP-FLEXAR as a test bed to investigate the influence of different multi-transputer topologies on program execution time enabled the development of a new strategy reducing the execution time of a user program, which is presented in the following sections of this paper.

1.3. A proposal of a multi-thread approach for optimal program execution in a multi-processor environment with dynamic reconfiguration capabilities

The proposed strategy and algorithms are a completely new approach to the program execution paradigm, leading to a reduced execution time of a user program in a multi-processor environment with dynamic reconfiguration capabilities. We call this strategy a multi-thread approach because a program is executed in parallel threads. The general idea is that a user program is partitioned into specific parts called slices (based on its graph representation). In the first method the slices are simultaneously executed on different heterogeneous, dynamically reconfigurable multi-processor units (MPUs), being parts of the HP-FLEXAR execution units, in such a way that the program execution time is reduced. This is achieved by finding, in real time, the MPU with the shortest execution time, running each program slice in different MPU topologies with appropriate control flow between these structures. In the second method the optimal hardware topologies of the MPUs are found for particular program slices in advance, followed by the execution of these slices through look-ahead reconfiguration and switching of control between the MPUs.

It should be stressed that the two methods are intended for different kinds of applications. The first method is used for real-time applications where analytical methods are impossible to apply because of non-deterministic activations of program graph nodes (loops inside slices) and/or external control signals influencing the program execution paradigm. The second one is for all other cases: non-time-critical applications and deterministic program graphs. In the first method the price is hardware overhead, but the gain is that a program is executed in real time faster than with any other method in a multi-processor environment. In the second method a pre-processing, optimisation phase is needed, but much less hardware is required.

In the following sections of this paper the proposed strategy and algorithms are described in detail. The methods are compared and the scope of their applicability is discussed. Some indications for future investigations are given in the conclusions.

2. A multi-thread strategy

2.1. General description

The multi-thread approach strategy proposed in this paper, as mentioned in the Introduction, depends on partitioning a user program into parts called slices and executing them in a specific way in a heterogeneous reconfigurable multi-processor architecture. Two points are fundamental for the proposed strategy: finding for each program slice a hardware structure which can lead to a reduced execution time, and eliminating the communication time overhead due to reconfiguration when topologies had to be changed. In practical implementations such a search is limited by the accessible hardware resources and the possibilities of dynamic reconfiguration of the multi-processor system. The HP-FLEXAR transputer architecture was found to be especially suitable for implementation of the proposed strategy.

Two general methods can be distinguished in our approach:
• a one-phase method with a run-time optimisation algorithm called RTO, in which a program execution time is reduced through dynamic allocation and use of multiple heterogeneous processor units (MPUs); all MPUs simultaneously execute a program slice and, after one of them has produced the results, the execution of the next program slice starts immediately in all of them;
• a two-phase method with a preprocessed optimization algorithm called PO, where the first phase is used for pre-processing (one or several runs of a program) and the second one for a time-reduced execution of the program with look-ahead reconfiguration of the execution units and switching of control between the MPUs after each program slice execution.
Both methods depend on partitioning of a user program into parts called slices (the rules for such partitions are given in Section 2.2.1).
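The partitioning constraints of Section 2.2.1 (one entry and one exit point per slice, and a minimum slice execution time, below which a semi-autonomous part is merged with a neighbour) can be sketched roughly as follows. This is an illustrative reading of the rules, not the paper's actual partitioning algorithm; `merge_short_parts` and its left-to-right merging policy are assumptions.

```python
def merge_short_parts(part_times, t_min):
    """Given the estimated execution times of consecutive semi-autonomous
    program parts, merge any part shorter than t_min (the partition
    parameter) with its preceding neighbour, left to right."""
    slices = []
    for t in part_times:
        # merge when either the new part or the slice built so far is too short
        if slices and (t < t_min or slices[-1] < t_min):
            slices[-1] += t
        else:
            slices.append(t)
    return slices

# hypothetical part times; parts of length 1 and 2 get absorbed
result = merge_short_parts([5, 1, 7, 2], 3)
```

A real implementation would merge on the program graph itself (preserving the single-entry/single-exit property), not on a flat list of times; the list form only illustrates the thresholding rule.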

2.2. Preliminaries

There are two starting points for the development of both methods. The first one is a user program graph representation, shown in Fig. 1a. The second is the HP-FLEXAR architecture, shown in Fig. 2. The nodes in the graph are program modules (sequential threads or communicating instructions) and the edges between them represent the activation sequence of the modules' execution. In joint points of a graph all predecessor modules' execution must be ended before the next part of a program execution can be started. These are program synchronization points, which we call graph decomposition points. It can be seen in Fig. 1a that a program between the decomposition points is not sequential. It can be a complicated subgraph with parallel branches, containing many communicating program modules.

Fig. 1. A user program graph with possible decomposition-points. (In Fig. 1a a circle denotes a program module and an outline denotes a semi-autonomous part; in Fig. 1b a circle denotes a program slice.)

2.2.1. The rules for partitioning of a user program graph into slices

Our multi-thread strategy is mainly intended for programs in which some parts can be executed in parallel in a multi-processor system environment. The first step of this strategy is the partitioning of a program graph into slices. The rules for such partitioning are specified below.

The construction of a program graph is done in such a way that all fork and joint points not lying in any of the parallel branches of a control program stream must be distinguished. So, for simplicity, we can say that these are points in the main control program stream at which the semi-autonomous parts of a program end (branch points) or begin (joint points). Semi-autonomous means that, except at branch or joint points, there is no communication with any other program parts. These semi-autonomous parts of a program can be treated as slices, but it is assumed that a slice execution time should not be shorter than a predefined value, which is a parameter for the partition into slices. Therefore, if a semi-autonomous program part execution time does not meet this requirement, it should be merged with a neighbouring one.

The partitioning of a program graph into slices is different for the two methods. In the first method (we call it the on-line method) the decomposition-points are inserted only once, during the execution of a user program. In the second method (we call it the off-line method) the selection of the decomposition-points is done during several program runs. In Fig. 1a a part of a program graph is shown with six possible decomposition-points indicated. The program slices between the decomposition-points (denoted a, b, ..., f) are shown symbolically in Fig. 1b. For program slice number 3 the scheme of internal connections between program modules is shown in more detail in the program graph (see Fig. 1a). The interconnection topology in other slices, represented as semi-autonomous program parts, can be more complex (with a regular or irregular structure of interconnections between modules).

Taking into account the above considerations, there are two formal rules for partitioning of a program into slices:
• a program slice can have only one input point and one output point (the decomposition points); no other communications between program modules from different slices are admissible (see Fig. 1b);
• the execution time of any slice must be substantial compared to the whole program execution time; its value is a parameter for the partition into slices and it is decided in the HP-FLEXAR's DDU unit (described in Section 2.2.2), where a program execution time is estimated.
The second rule is important because of the time needed for reconfiguration in the second method. For the first method it is less significant since, first, we cannot properly estimate the slice execution time because of the non-determinism mentioned in the previous section, and, second, we do not need time for reconfiguration; only control flow between


Fig. 2. General structure of the HP-FLEXAR architecture.

the execution units is needed. A dynamic reconfiguration in the first method is needed inside each slice during its execution.

We can see from Fig. 1 that there are different possibilities of inserting decomposition points. For example, all decomposition points (a, b, ..., f) can be chosen, or only a, c, d and f. This is especially important for the second method, in which optimal hardware structures from the execution time point of view must be found in the first phase.

2.2.2. The HP-FLEXAR architecture for a multi-thread strategy implementation

The general structure of the HP-FLEXAR architecture is shown in Fig. 2. Let us briefly recall the main features and functioning of the HP-FLEXAR architecture. A more detailed description can be found in [15]. The general architectural features of the HP-FLEXAR system are as follows:
• a four-unit structure: a decision & distribution unit and 3 heterogeneous execution units; each execution unit is a multi-transputer dynamically reconfigurable architecture comprising different multi-layer structures of worker transputers,
• use of transputer-compatible crossbar switches within the multi-processor units (MPUs),
• use of control ring buses and some additional hardware (TRANSBUS controllers),
• scalability and expandability of the MPUs,
• fully parallel, independent processing in the MPUs,
• dynamic distribution and mapping of application program processes onto the MPUs,
• "on demand", either system or user program driven, MPU reconfiguration,
• use of OCCAM as the user programming language.

The GCT is used for communication with a Host performing the allocation algorithm and supplying the DCx transputers with the information necessary to perform their tasks. The DC1, DC2, DC3 transputers can dynamically supply their corresponding execution units with the information necessary to configure the multi-processor units (MPU1, MPU2, MPU3, respectively) and perform allocation and execution functions. The DC1, DC2 and DC3 perform two additional tasks: they serve for communication between MPU1-MPU3 processes and receive partial results from the execution units, reporting them to the global control transputer GCT. We call this feedback control. The results are analyzed in the GCT transputer and decisions are sent back to the DCx transputers. Accordingly, one or more of the MPUs can be partially reconfigured to be better fit for further allocation and execution of the application tasks. Final results are sent from the GCT transputer to the Host computer. The DC1, DC2, DC3 transputers have direct, very fast communication means. A special control and data bus, using additional hardware elements such as ring control and data buses, crossbar switches and TRANSBUS controllers, is used.

The transputers used in the HP-FLEXAR architecture are of the T800 and T9000 series, and the crossbar switches are of both types, C004 and C104 (their detailed descriptions can be found elsewhere [16,17]). The control and data bus shown in Fig. 2 and some other internal control buses in the execution units are organized on the basis of the TRANSBUS controllers developed and produced by IRESTE in France [18]. They provide the control and data interfaces between transputers and the bus. The bus itself consists of the data line, the acknowledge line and the token line. Write access to the bus is arbitrated by a token circulating every 100 ns between TRANSBUS controllers.

2.3. A run-time optimization algorithm (RTO)

A basic assumption in the run-time optimization algorithm (the first method, for real-time applications) is that a user program, partitioned into slices according to the partitioning rules (Section 2.2.1), is dynamically allocated and executed on different heterogeneous multi-processor units (MPUs). Therefore, a mechanism of control switching between the MPUs must be provided on the one hand, and appropriate allocation algorithms used during a program execution on the other. The method described in this section (the RTO algorithm) is especially useful for time-critical applications for which analytical methods are impossible to apply, there is no time for preprocessing, and user programs must be run at once. Using the RTO algorithm a close-to-optimal dynamic execution of a program can be achieved, but at the expense of some hardware redundancy compared to the preprocessed optimization algorithm described in the next section of this paper.

The rules for partitioning of a user program graph into specific slices (inserting the decomposition-points) are described in Section 2.2.1. Here, we will use these partitions. It was mentioned before that the estimated slice execution time is important for the second, preprocessing method because of the time

Fig. 3. A simplified diagram of the RTO algorithm functioning. (The boxes of the diagram read: 1, the DDU unit analyses a user program graph and inserts decomposition-points; 2, the DDU unit supplies all 3 EXUs with a program slice up to the 1st decomposition-point; 3, all 3 EXUs simultaneously execute the program slice supplied by the DDU; 4, if the program has not ended, 5, the DDU unit supplies all 3 EXUs with the partial results of the last executed slice and loads the next slice of the program, returning to 3. RTO: Run-time Optimization algorithm; DDU: Decision & Distribution Unit; EXU: Execution Unit.)

needed for look-ahead reconfiguration. In the RTO algorithm such an estimation cannot be done and, besides, it is not crucial. In this algorithm the decomposition-points are specified only once, in the DDU unit. The reason is that optimization must be performed at run-time, unlike in the preprocessed optimization algorithm, in which the final insertion of decomposition-points is specified after several runs of a program in the preprocessing phase (details in the next section of this paper). In the RTO method, much more important is the use of an optimal allocation algorithm for a given MPU. Some indications concerning the choice of such an algorithm can be found in [12].

A simplified diagram showing the idea of the RTO algorithm functioning is shown in Fig. 3. The execution of the RTO algorithm shown in Fig. 3 can be illustrated using the execution time diagram shown in Fig. 4.
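The control loop of Fig. 3 can be sketched very roughly in Python, with threads standing in for the three execution units. Everything here is hypothetical (the slice functions, the per-EXU configurations, and the use of an in-process thread pool instead of real transputer hardware); in the HP-FLEXAR the DDU would also redistribute the winner's partial results, which the sketch only hints at.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_slice_on_all_exus(slice_fn, exu_configs):
    """Fig. 3, box 3: every EXU starts the same slice; the first result wins."""
    with ThreadPoolExecutor(max_workers=len(exu_configs)) as pool:
        futures = [pool.submit(slice_fn, cfg) for cfg in exu_configs]
        first = next(as_completed(futures))  # winner's partial results
        return first.result()
    # note: the `with` block still waits for the slower threads to finish;
    # the real architecture would instead redirect them to the next slice

def rto_run(slices, exu_configs):
    """Fig. 3 loop: after each slice, the winner's results are taken by the
    DDU and all EXUs immediately start the next slice."""
    return [run_slice_on_all_exus(s, exu_configs) for s in slices]

def fake_slice(cfg):
    time.sleep(cfg)  # pretend this EXU needs `cfg` seconds for the slice
    return cfg

# hypothetical per-EXU latencies; the 0.01 s EXU wins every slice
out = rto_run([fake_slice, fake_slice], [0.2, 0.01, 0.1])
```

The design point the sketch captures is that no estimation of slice times is needed: the minimum is obtained operationally, by racing the units.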


In general, any number of EXUs can be used and a program can be partitioned into any number of slices. However, choosing too many EXUs can further increase the hardware overhead, which will not be compensated by execution time gains in the first method. In this paper 3 execution units and 6 program slices are taken as an example. We can see in Fig. 4 that EXU 3 is the first one to finish the 1st slice, after time t1; therefore, all EXUs start to execute the 2nd slice from t1. EXU 2 is the first one to finish the 2nd slice, at time t2. Starting from t2, the 3rd slice is executed in all EXUs, and so on. The execution of all six slices is finished at time t6. The overall program execution time is shorter than if the same program were executed in any one of these three EXUs.

2.4. A preprocessed optimization algorithm with run-time reconfiguration (PO)

In a preprocessed optimization algorithm the basic principles of the multi-thread strategy are the same as in the case of the run-time optimization algorithm; however, there are some significant differences in their implementations. Now, a program does not have to be executed at once. Instead, some preprocessing runs are required to find the proper hardware topologies for optimal execution of particular program slices and hence minimize the total execution time. So, this time we have a two-phase algorithm.


Fig. 4. An execution time diagram of the RTO algorithm.
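The timing of Fig. 4 can be reproduced with a small numeric model. Since all EXUs restart at each decomposition point, each slice costs the minimum of its per-EXU times, so the overall time can never exceed that of the best single EXU. The per-slice times below are hypothetical, chosen only to mimic the figure's shape (EXU 3 wins slice 1, EXU 2 wins slice 2, and so on).

```python
def rto_times(slice_times):
    """slice_times[i][j] = execution time of slice i on EXU j.
    Returns the decomposition-point times t1, t2, ... (as in Fig. 4)
    and the total program execution time."""
    t, points = 0, []
    for per_exu in slice_times:
        t += min(per_exu)  # the first EXU to finish wins the slice
        points.append(t)
    return points, t

# hypothetical times for 6 slices on 3 EXUs
times = [
    [4, 5, 3],  # slice 1: EXU 3 finishes first (t1 = 3)
    [6, 2, 4],  # slice 2: EXU 2 finishes first (t2 = 5)
    [3, 3, 5],
    [7, 6, 6],
    [2, 4, 3],
    [5, 5, 4],
]
points, total = rto_times(times)

# the race can never lose to the best single EXU
best_single = min(sum(col) for col in zip(*times))
```

Here `total` is 20 against a best single-EXU time of 25, illustrating the paper's claim that the overall execution time is shorter than on any one of the three EXUs alone.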


In general a two-phase algorithm such as the PO is needed in two cases:
• when more execution units exist in a given architecture than are expected to give a time-optimal solution (for example, there are 3 EXUs in the HP-FLEXAR architecture), or
• when hardware resources are limited.
Solutions with 2, 3 or more execution units are possible for the PO algorithm implementation. In all of them a reconfiguration during a user program execution is needed. The solution with 2 structures saves some hardware but is more time constrained. In this paper we present a solution with 2 execution units. Such an approach is for simplification and for possible comparisons with the RTO algorithm, because in general a distinction of 2 execution units is not needed. We can say instead that certain hardware resources are available and only during a program execution are they partitioned into 2 structures according to the algorithm's demands. Another possibility is that the same hardware resources (worker processors) are used for executing consecutive application program slices. In this last case, more than one connection switch for dynamic reconfiguration is needed. While one slice is executed, the connections for the next one are set in advance in another link connection switch. Such a possible implementation of the PO method is described in detail in [20].

The PO algorithm consists of 2 phases:
• preprocessing (an optimization phase),
• a program execution.

2.4.1. First phase

In the first phase the rules for partitioning of a user program graph into specific slices (inserting the decomposition-points) described in Section 2.2.1 are used. In contrast to the RTO algorithm, several possible partitions are used in consecutive program runs to find an optimal one. For each such partition and for every program slice (a part of a program between two consecutive decomposition-points) different execution MPUs and different task migration algorithms are considered to find the shortest possible execution time. The final results of the preprocessing phase can be summarized as follows:
• optimal slicing of a user program graph through inserting the decomposition-points,
• finding for each slice the optimal execution multi-processor structures,
• finding the optimal task migration methods for each structure specified above (a load balancing aspect).
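The preprocessing search of the first phase can be sketched as an exhaustive loop over candidate partitions and candidate MPU topologies. All names and timing data below are hypothetical, and `measure` stands in for an actual timed preprocessing run; the paper's method also varies task migration algorithms, which the sketch folds into the measured time.

```python
def preprocess(partitions, topologies, measure):
    """For each candidate partition (a list of slice names), try every
    candidate MPU topology per slice; keep, per slice, the topology with the
    shortest measured time, and return the partition whose total over the
    best topologies is minimal."""
    best = None
    for part in partitions:
        # per slice: (best time, best topology)
        plan = [min((measure(s, top), top) for top in topologies) for s in part]
        total = sum(t for t, _ in plan)
        if best is None or total < best[0]:
            best = (total, part, [top for _, top in plan])
    return best  # (total_time, chosen_partition, per-slice topologies)

# hypothetical measured slice times from preprocessing runs
cost = {("s1", "ring"): 3, ("s1", "mesh"): 5,
        ("s2", "ring"): 4, ("s2", "mesh"): 2,
        ("s12", "ring"): 6, ("s12", "mesh"): 8}
total, part, plan = preprocess(
    [["s1", "s2"], ["s12"]],          # fine vs. coarse decomposition
    ["ring", "mesh"],
    lambda s, top: cost[(s, top)])
```

With these numbers, the fine partition wins (3 + 2 = 5 against 6), and the saved plan records a different best topology for each slice, which is exactly the information the DDU needs for the second phase.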

2.4.2. Second phase

In the second phase a user program is executed in two execution units (each one a dynamically reconfigurable multi-processor structure) in the following way. The first execution unit structure is configured to be optimal for the first program slice to be executed. At the same time the second execution unit structure is configured to be optimal for the second program slice, so as to be ready when the execution of the first slice is finished. When the execution of the first program slice is finished, the second execution unit takes over control and the execution of the second slice begins. At this time the first execution unit is reconfigured to be ready for optimal performance of the third slice, and so on. The sum of all minimal slice execution times gives the overall minimal execution time of a program.

A simplified diagram of the PO algorithm functioning is shown in Fig. 5 and Fig. 6: the first phase and the second phase, respectively. Obviously, there are different possibilities of implementation of the PO algorithm. In describing the second phase of the PO algorithm we have written about two execution units for simplicity of presentation only. In particular implementations the same hardware elements (mainly worker processors) can be used in the way described in the first part of this section, using multiple connection switches for reconfiguration. The advantages and disadvantages of this algorithm implementation and comparisons with the RTO algorithm are given in the next section of this paper.
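The interplay of the two execution units in the second phase can be modelled as a small schedule computation (all exec/reconfiguration times hypothetical): while one EXU runs slice i, the other reconfigures for slice i + 1, so reconfiguration cost only surfaces when it is longer than the slice currently executing.

```python
def po_makespan(exec_times, reconf_times):
    """exec_times[i]: optimal execution time of slice i on its chosen topology.
    reconf_times[i]: time to configure an EXU for slice i.
    Two EXUs alternate (slice i runs on EXU i % 2); the EXU for slice 0 is
    configured before the run starts, as in the paper's description."""
    finish, reconf_done = [], []
    for i, (e, r) in enumerate(zip(exec_times, reconf_times)):
        # this EXU last went idle after finishing slice i - 2
        free_at = finish[i - 2] if i >= 2 else 0.0
        reconf_done.append(0.0 if i == 0 else free_at + r)
        start = max(finish[i - 1] if i else 0.0, reconf_done[i])
        finish.append(start + e)
    return finish[-1]

# every reconfiguration hides behind the preceding slice: no overhead at all,
# makespan == sum of the per-slice optima (5 + 4 + 6 = 15)
hidden = po_makespan([5, 4, 6], [1, 2, 3])

# very long reconfigurations: the look-ahead can no longer hide them
exposed = po_makespan([1, 1, 1], [0, 5, 5])
```

The first case reproduces the claim that "no time is lost for reconfiguration" when each reconfiguration fits inside the preceding slice's execution; the second shows the limit of the look-ahead when slices are too short, which is why the second partitioning rule demands substantial slice times.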

Fig. 5. A simplified diagram of the first phase of PO algorithm functioning. (The boxes of the diagram read: 1, the DDU unit analyses a user program graph and inserts decomposition-points; 2, the DDU unit supplies the EXU with a program slice up to the 1st decomposition-point; 3, a new configuration structure is established in the EXU and the inserted program slice is executed; 4, the execution time of the program slice is sent to the DDU; 5, the configuration parameters of the EXU structure with the shortest execution time for a given decomposition-point structure are saved in the DDU; the loop repeats over slices and candidate partitions; 7, finally, the configuration parameters of the EXU structures with the shortest execution times for all program slices, for the decomposition-point structure with the shortest total execution time, are saved in the DDU, and the program is ready for execution. PO: Preprocessed Optimization algorithm; DDU: Decision & Distribution Unit; EXU: Execution Unit.)

Fig. 6. A simplified diagram of the second phase of PO algorithm functioning. (The boxes of the diagram read: 1, the DDU unit supplies the 1st EXU with a program slice up to the 1st decomposition-point; 2, the DDU unit supplies the other EXU with the program slice following the most recently executed one; 3, the DDU unit supplies the EXU with the partial results of the last executed slice and loads the next program slice; the loop repeats until the program ends.)

2.5. Implementation of the RTO and PO algorithms

Both algorithms can be implemented in different ways. However, the implementation of the RTO algorithm, being more specific, is straightforward in the HP-FLEXAR architecture. More possibilities exist for the PO algorithm implementation. It can, for example, be implemented using only one of the HP-FLEXAR execution units, EXU1 or EXU2. In this way far fewer hardware resources are used, but we have an off-line, two-phase algorithm.


3. Conclusions

Although the proposed algorithms are simple, the complexity of the problems to be considered in multi-thread methods is rather high. This concerns especially the partitioning of a program graph into slices. The problems to be considered are: granularity of partitioning, the execution time of different slices, reconfiguration time, regularity of structures, frequency of communication between tasks, allocation strategies and so on. Nevertheless, having specified programs and defined the rules for such partitioning, we can easily decide which method should be used: the first one for time-critical applications (with hardware overhead) or the second one, optimised in advance with look-ahead reconfiguration. In both cases the subsequent implementation is straightforward.

In the solution considered in [15] we rather predicted that one of the three execution units would perform a certain part of a program faster than the other two. In the new approach presented in this paper such a prediction is not needed. In the graph representation of a program several decomposition-points are distinguished according to a given algorithm and slices are selected. Then the program runs from the beginning to the first decomposition-point in all 3 different structures of the execution units. The best (minimal) time is obtained automatically and signalled to the DDU unit. After that, the next part of the program is executed in all 3 structures from the first to the second decomposition-point, and again the minimal time is observed in one structure, and so on up to the end of the program. The sum of all minimal slice execution times gives the overall minimal execution time of a program. Both algorithms give optimal solutions due to the specific combination of software and hardware mechanisms.

The advantage of the second algorithm is that it gives optimal execution time due to the finding of appropriate hardware structures for a program execution in the first phase, and the dynamic loading, allocation and dynamic reconfiguration of the hardware used in the second one. The most important factor of this method is that no time is lost for reconfiguration of hardware, because it is done simultaneously with the execution of the preceding program slice. So this method preserves all the advantages of the previously proposed solutions for running programs in multi-processor reconfigurable architectures and adds another one: the elimination of reconfiguration overhead.

In future investigations, instead of a program activation graph, a data flow graph could be considered, because data structures may have a big influence on the optimization methods. In the first phase of the PO algorithm some optimization runs of a program are done and the best structures for particular slices are found in a heuristic way. In future research we could look for analytical methods to evaluate the execution of the program slices. Besides the implementation presented in this paper using the HP-FLEXAR architecture, some other implementations are also foreseen. They will be realized in a mixed T800 and T9000 transputer environment. The use of new-generation transputers can stimulate the development of new or modified methods. In other experiments, systems with multiple Alpha processors can be used.

References
[1] P.C. Capon et al., ParSiFal - A Parallel Simulation Facility, IEEE Colloq. Digest 91 (1986).
[2] T. Muntean, SUPERNODE: Architecture Parallèle Dynamiquement Reconfigurable de Transputers, 8èmes Journées sur l'Informatique, Nancy (Jan. 1989).
[3] P. Jones and A. Muna, The Implementation of a Run-Time Link-Switching Environment for Multi-Transputer Machines, Proc. NATUG 2 Meeting, Durham (Oct. 1989).
[4] M. Tudruj and T. Kalinowski, Multi-Transputer Systems with Dynamic Link Connection Switching Controlled through a Serial Bus, Proc. World Transputer Congress - Aachen '93, Vol. 2 (1993) 803-818.

[5] P.R. Ma et al., A task allocation model for distributed computing systems, IEEE Trans. Comp. 31 (1) (1982) 41-47.
[6] C.-C. Shen and W.-H. Tsai, A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion, IEEE Trans. Comp. 34 (3) (1985) 197-203.
[7] R.M. Bryant and J.R. Agre, A queueing network approach to the module allocation problem in distributed systems, Performance Evaluat. Rev. 10 (3) (1981) 191-204.
[8] T. Muntean and E.G. Talbi, General Heuristics for the Mapping Problem, Transp. Applic. and Systems '93 (IOS Press, Amsterdam, 1993).
[9] D.-H. Du and G. Vidal-Naquet, Mapping Communicating Task Graphs onto Reconfigurable Multiprocessor Architectures, Comp. and Inform. Science VI (Elsevier Science, Amsterdam, 1991).
[10] J.E. Boillat et al., An analysis and reconfiguration tool for mapping parallel programs onto transputer networks, in: T. Muntean, Ed., Proc. 7th OUG (Sept. 1987).
[11] L. Schrettner and I.E. Jelly, A Test Environment for Investigation of Dynamic Load Balancing in Transputer Networks, Transp. Applic. and Systems '93 (IOS Press, Amsterdam, 1993).
[12] I. Phillips and P. Capon, Dynamic Distributed Load Balancing, Proc. World Transputer Congress - Aachen '93, Vol. 2 (1993) 757-771.
[13] J. Ahmad et al., Hierarchical Scheduling of Dynamic Parallel Computations on Hypercube Multicomputers, J. Parallel and Distr. Comp. 20 (1994) 317-329.
[14] S. Srinivas and N.N. Biswas, A fast algorithm for data exchange in reconfigurable tree structures, MPCS '94 Conf., Ischia, Italy (May 1994).
[15] M. Thor, HP-FLEXAR: A reconfigurable multi-unit heterogeneous topology architecture for time critical applications, Microprocessing and Microprogramming 40 (1994) 777-782.


[16] INMOS Ltd., Transputer Reference Manual (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[17] T9000 Transputer Hardware Reference Manual (INMOS Ltd., 1993).
[18] J.P. Calvez and O. Pasquier, A Transputer Interconnection Bus for Hard Real-Time Systems, Transputer '92 Conf. Proc., Arc-et-Senans, France (May 1992).
[19] T. Kalinowski, M. Thor and M. Tudruj, Multi-Transputer Systems with Dynamic Reconfiguration Control Based on the Serial Bus, Computing Systems in Engineering 6 (4/5) (1995) 391-400.
[20] M. Tudruj, Towards Time-Transparent Dynamic Link Connection Reconfiguration in Multi-Processor Architectures, ICS PAS Reports, No. 769 (1995).

Miroslaw Thor received the M.Sc. degree in Electronics from the Warsaw Technical University in 1967, and the Ph.D. degree in Computer Science from the Polish Academy of Sciences in 1978. Dr. Thor is with the Institute of Computer Science of the Polish Academy of Sciences. His main research interests focus on multiprocessor architecture, communication and control, modelling and simulation, and applied research in economy. He directed or participated in numerous research projects and designs within bilateral or multilateral co-operation nation-wide and abroad. The results have been presented at many European or US conferences and published in proceedings or international journals. In 1981-1982 he was a visiting professor of Computer Science at Baghdad University of Technology and College of Engineering, Systems and Control Department. In 1988-1989 he served as a visiting professor of Computer Science at Molde State College in Norway. In 1985, in Brussels, he was elected a Euromicro director. His current research interests include multiprocessor architecture, task scheduling and allocation, and modelling and simulation. For the last few years he has been involved in the design and implementation of a dynamically rearrangeable multi-transputer system.