PASE: A performance analysis simulation environment


Simulation Practice and Theory 2 (1994) 43-59

A. Pombortsis a, E. Papaefstathiou *,b, A. Veglis a, G.R. Nudd b

a Digital Systems and Computer Laboratory, Physics Department, University of Thessaloniki, Greece
b Parallel Systems Group, Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom

Received 1 June 1993; revised 4 March 1994

Abstract

Performance simulation is one of the approaches used to estimate the quantitative behaviour of interconnection networks of parallel and distributed systems. This paper presents both a discrete-event network simulation environment for multiprocessor systems and a methodology for simulation. The simulator provides A Network Initiated Simulation Oriented Language (ANISOL), which allows the user to define environmental factors, network topologies and load sharing strategies in both homogeneous and heterogeneous systems. The simulator is flexible enough to simulate various interconnection network topologies and processing environments. Features required to simulate massively parallel processors, such as fault tolerance, are also included. This simulation environment has been used to study a range of different architectures, and preliminary results have shown that the accuracy of the outputs is comparable to other techniques.

Key words: Simulation; Performance evaluation; Interconnection networks; Parallel processing

1. Introduction

The ever increasing demand for high speed processing, in conjunction with falling hardware costs, has made parallel and distributed multiprocessor computer systems both desirable and feasible. One way to ensure that an architecture under development will meet its performance specifications is to use analytic and simulation models. Simulation has an advantage over analytic models in that any required detail that may affect the behaviour of the architecture can be added to the simulation model.

This paper presents a Performance Analysis Simulation Environment (PASE), which incorporates A Network Initiated Simulation Oriented Language (ANISOL) that can define the environment factors, the network topology and the load sharing strategy in both homogeneous and heterogeneous processor systems. The simulator is effective and flexible enough to simulate various multiprocessor and computer network topologies. It also produces results that describe the network operation quantitatively.

* Corresponding author. Email: [email protected]. 0928-4869/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0928-4869(94)00008-4

It has been noted that an effective simulation environment should have the following characteristics: wide availability, high portability, acceptable performance, features that help the modelling of the environment under investigation, high accuracy of predictions, expandability, reusability and maintainability [5, 12]. Existing simulation environments for interconnection networks of parallel/distributed systems typically provide satisfactory characteristics in some areas while being weak in others. For example, special purpose simulation languages such as SimScript [20] provide special model description features that speed up development, but they are slow, especially when run on personal computers [5].

Typically there are three types of simulation environments for networks of parallel/distributed systems:

• Third generation or object oriented languages such as C, C++, and Smalltalk. The advantages of traditional programming languages are high performance, portability, and flexibility. The main disadvantage is the long time required to develop the model.
• General purpose simulation environments such as SimScript [20], ComNet [4], GPSS [9], and ModSim [7]. These languages provide a range of simulation features for discrete-event, process interaction, and object-oriented simulation. Although these features minimise the development time of the model, they introduce a number of drawbacks such as low performance and high cost.
• Special purpose environments for networks [2, 21]. These tools provide the features that are required for fast model development, and offer high performance because they are usually written in general purpose languages. However, they are generally over-specialised and not publicly available.

PASE is a special purpose simulation environment that has been developed using the C programming language and object oriented development techniques. The user interface to the simulator is a special purpose language (ANISOL). This combination provides the following characteristics:

Easy model development: ANISOL is a special purpose language developed to describe networks of parallel/distributed systems. Because of this initial design strategy it is far easier to develop a network model using ANISOL than with general purpose simulation languages. For example, in [3] ComNet II.5 has been used to study a packet switching network. A special pre-processing stage was required to translate the workload and network characteristics into a format readable by ComNet, which substantially delayed the development of the model. In contrast, the examples in the case study section have been developed in less than an hour each.

Portability: PASE has been developed using ANSI C because of the wide availability of compilers on virtually all hardware platforms. PASE has been ported by simply


re-compiling its source code under PC (MSDOS), Sun 3 and 4 (UNIX), Parsytec SuperCluster single node (transputer based), and DEC VAX 11 (VMS). Other general purpose simulation environments are available only for a limited number of platforms [7].

Availability: The majority of general purpose simulation environments are proprietary products. The cost of purchasing and upgrading them is a limiting factor for academic institutions. A number of these environments (e.g. SimScript) are available on a range of platforms, but the cost of upgrading them from a workstation environment to a mini or mainframe computer is high. In other cases, general or special purpose simulation environments are not publicly available. PASE is public domain software and can be obtained by anyone who has access to the Internet.

Performance: C has been compared with other third generation, object oriented, and simulation languages in terms of performance in [5]. It has been shown to be up to 15 times faster than general purpose simulation languages (e.g. SimScript II). PASE is portable across a wide range of computers, from personal computers to mainframes. From our experience, the performance of the simulator when running on a personal computer is reasonable for simulating models containing hundreds of processors and resource nodes. For example, all of the examples in the case study section have been simulated on a PC 486 in between 30 seconds and 7 minutes per simulated cycle. The PASE environment is now being ported to a parallel computer, a Parsytec SuperCluster with 128 transputers. This port will allow the simulation of even larger networks.

Accuracy: PASE has been used on a wide range of modelling case studies of interconnection networks, including cluster networks, hierarchical multiple-bus networks and Ethernet networks connecting workstations. In Section 5 these networks are analysed and the results obtained from the simulator are compared to analytic results and real measurements. Although it is clear from the results presented that the accuracy of the simulator is satisfactory, a number of extensions, described in the last section, are under development to increase the accuracy of the simulation (e.g. more representative workload characterisation).

The paper is organised as follows. Section 2 discusses the general characteristics of PASE. Section 3 describes the ANISOL language and gives some examples of its use. The simulation algorithm that has been used for PASE is presented in Section 4. In Section 5 three examples of use of the simulator are described: a clustered homogeneous parallel system, a hierarchical homogeneous parallel system and a heterogeneous distributed system. Finally, in Section 6, future extensions of PASE are discussed.

2. General characteristics

PASE is a simulation environment for the evaluation of parallel and distributed architectures under various computational environments that does not require extensive computational resources [14, 15]. The simulator uses a language called ANISOL. With ANISOL, the network topology, the processing environment and other parameters that affect the simulation


can be described. ANISOL is a simple language that translates the system description into coded data. The second part of PASE is a kernel that executes the simulation algorithms. The kernel takes as input a number of parameters along with the model of the network described by the ANISOL program.

To simulate a network of a parallel or distributed computer system, the first step is to write a program, using the ANISOL syntax, that describes the network's topology, processing environment and desired form of results. Next, ANISOL is called to analyse the program lexically and syntactically and to translate the source code into a database usable by the kernel. The simulation is then performed by the PASE kernel, after which ANISOL takes over to print the results (Fig. 1).

A series of network elements can be defined in PASE. More specifically, the network's topology, the number of processors, the resources and the connection lines are described. Additionally, the processing environment can be defined. This includes the request rate, inter-arrival time and processing time, which might differ from one resource to another.

The switching method employed plays an important role in the operation of the network. PASE supports two switching methods:

(1) Circuit switching: where every processor is continuously connected to a resource during the processing.
(2) Fast circuit switching: where a job, after being created, is serviced as follows:
    (a) It waits in the processor until the processor establishes a connection with the resource.
    (b) At the end of the transfer the connection is released (disconnection) and the job is processed by the resource until it is finished.
    (c) The result is routed to the processor that created the job.

Fast circuit switching is efficient in compute-bound applications (for example radar, sonar, vision processing), where the number of arithmetic operations is much greater than the number of input and output elements.
In such applications the circuit switching method leads to under-utilisation of the network. Segmentation of a job into packets, and the re-routing of packets when paths are blocked, can impose large overheads [6, 17].

Fig. 1. Simulation process.

When a request is issued in the network there is the possibility that it will be blocked, if a path to the destination is not available or the destination is busy servicing another request. PASE provides three techniques to handle blocked requests:

(1) Ignore technique: The requests that have not been accepted are abandoned by the network. This method is used by the majority of analytical models seen in the literature [1, 10, 11]. However, this assumption is not realistic, since in practice requests rejected in a cycle will be submitted again in the next cycle.
(2) Repetition technique: A request that has failed is repeated in the next cycle, addressed with equal probability to any of the network's resources. An interesting extension of this technique is the modified random splitting model [16, 17]. In this case, a processor whose request has been abandoned creates the next request not randomly but according to a pre-defined rule. In a network, resources of the same type can be grouped so that if a processor cannot be serviced by a resource, it may make a request to another resource of the same type in the next cycle.
(3) Repetition to the same resource technique: A failed request is repeated in the next cycle to the same resource. This is the most common situation when the network is used for the interconnection of processors and memory units.

PASE and ANISOL are written in C, and the code is about 10,000 lines long. C has been selected for its generality, speed, and portability. Currently PASE runs on personal computers (MSDOS) and Sun workstations (UNIX). A single transputer version of PASE also runs on a Parsytec SuperCluster parallel computer. A parallel version of the simulator is under way. PASE can simulate systems that contain up to 2^15 request units, service units and connection lines.
In practice, however, the upper bound on the size of the simulated systems is restricted by the memory of the computer that runs the simulator and the time required for each simulated cycle. From our experience, PASE running on PCs can simulate up to 512 request and service units and 1024 connection lines, connected in complex topologies, in reasonable time. The capacity of the simulator increases substantially when PASE runs on a Sun workstation. PASE is public domain software and can be obtained through the Internet by contacting the authors.
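The blocked-request techniques described in this section can be illustrated with a minimal Python sketch. It is not taken from the PASE source: the function name `next_target`, the technique labels and the `groups` mapping are hypothetical, and the "pre-defined rule" of the modified random splitting model is assumed here to be "try the next resource of the same group".

```python
import random


def next_target(technique, failed, resources, groups=None, rng=random):
    """Return the resource a blocked request addresses in the next cycle.

    Hypothetical helper, not part of PASE. `failed` is the resource that
    could not be reached, `resources` lists all resources, and `groups`
    maps each resource to the list of same-type resources (itself included).
    Returns None when the request is dropped.
    """
    if technique == "ignore":
        # Ignore technique: the rejected request is abandoned.
        return None
    if technique == "repeat":
        # Repetition technique: retry on any resource with equal probability.
        return rng.choice(resources)
    if technique == "repeat_grouped":
        # Assumed rule for modified random splitting: take the next
        # resource of the same type, wrapping around inside the group.
        group = groups[failed]
        return group[(group.index(failed) + 1) % len(group)]
    if technique == "repeat_same":
        # Repetition to the same resource (typical for memory units).
        return failed
    raise ValueError(f"unknown technique: {technique!r}")


if __name__ == "__main__":
    units = ["v1", "v2", "v3", "v4"]
    same_type = {"v1": ["v1", "v2"], "v2": ["v1", "v2"],
                 "v3": ["v3", "v4"], "v4": ["v3", "v4"]}
    for name in ("ignore", "repeat", "repeat_grouped", "repeat_same"):
        print(name, "->", next_target(name, "v1", units, same_type))
```

Keeping the policy in one function mirrors the idea that the failure recovery method is a single configuration choice for the whole simulation run.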

3. Description and use of ANISOL

ANISOL is a special purpose language used to describe systems under investigation and simulator configurations. The simulator assumes that the network contains four types of units:

(1) Request units (RU): The units that create requests (e.g. processors) for the resources in the system.
(2) Service units (SU): These units represent the resources of the system (e.g. memory, special purpose VLSI devices).
(3) Request-service units (RS): These units combine the features of both RUs and SUs. They can create requests but can also service requests generated by other units (e.g. transputers, workstations on a LAN).
(4) Connection lines (CL): The links that connect the above units with each other, forming the system's network.

Each entity in a parallel/distributed system can be represented by one of the above types of units. The number and type of units that the system under investigation contains are defined in the ANISOL source code. Each unit in the system has a number of parameters that influence its behaviour during the simulation. The user can set the value of behaviour parameters such as the probability of request, inter-arrival time, number of requests generated etc. In addition to the behaviour parameters, each device contains result parameters. These parameters are modified by the simulator and can be printed by the user as results of the simulation. Units of different types have different result parameters. For example, request units include parameters such as requests generated and requests served, while service units include the number of requests served, total number of cycles worked etc.

A program written in ANISOL is organised into three parts called body modules. The simulator configuration and global characteristics of the system are defined in the module :config. An example of the use of this body module follows.

    # This is a comment line
    # Begin of configuration body module
    :config
    simname multiple-bus;    # Name of network
    failrec repeat;          # Failure recover method
    cycles 2000;             # Simulation cycles to be run
    rndset on;               # Set random generator
    :end

Some of the commands that can be used within :config are seen in the above example, which defines the name of the network to be simulated, the failure recovery method, the number of simulation cycles and the fact that the random generator will be used during the simulation.

The second body module is :network. In this part of the program the network topology and the behaviour of the system devices are described. Well known topologies can be employed, namely crossbar and multiple-bus, and also hierarchical topologies that include clusters (processors and memory modules) and global memory modules. As an example, a multiple bus connecting eight processors to eight special purpose VLSI resources of different types (i.e. with different service times) is shown in Fig. 2.

Fig. 2. Multibus system under investigation.

     1: # Network body module
     2: :network
     3: cl hrb[4];                  # Multiple buses
     4: cl vrb[16];                 # Vertical buses connecting devices to the network
     5: su v[8];                    # VLSI devices
     6: ru p[8];                    # Processors
     7: # Define device behaviour parameters
     8: set p[1:8] = {              # For all processors
     9:     iat = 2,                # Set inter-arrival time
    10:     pro = 80                # Set request issue probability
    11: };
    12: # Set VLSI service times
    13: set v[1:2] = {srv = 10};
    14: set v[3:4] = {srv = 20};
    15: set v[5:6] = {srv = 30};
    16: set v[7:8] = {srv = 40};
    17: # Group different resource types
    18: group {v[1:2]};
    19: group {v[3:4]};
    20: group {v[5:6]};
    21: group {v[7:8]};
    22: # Define network topology
    23: con {
    24:     (p[1:8] > vrb[1:8]),
    25:     (v[1:8] > vrb[9:16]),
    26:     (hrb[1:4], vrb[1:16])
    27: };
    28: :end                        # End of this module

In lines 3 through 6 the units that are included in the network are declared. In lines 7 through 16 the behaviour of the units during the simulation is defined, by setting the behaviour parameters of the units (iat, pro, srv) with the set command. In lines 17 through 21 VLSI devices of the same type are grouped together, and in lines 24 through 27 the connections of the units that form the network are declared.

In the last body module, :result, the form of the results that will be printed after the simulation is declared. Each unit of the network, according to its type (RU, SU, RS, CL), holds information about its behaviour during the simulation: the result parameters mentioned above. The user can print the result parameters of selected devices or of all the devices of the system.

An important innovation of PASE is its ability to simulate hierarchical architectures, in which not all the resources that generate requests can communicate with all the resources that serve the requests. ANISOL includes a special command, local, that enables grouping of the resources into clusters. An example of such an architecture is given in Section 5.
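To make the range notation used by the set, group and con commands more concrete, the following Python sketch expands a reference such as `v[3:4]` into individual units and applies behaviour parameters to them. It is an illustration only: the helpers `expand` and `apply_set` are hypothetical, and the interpretation of `a:b` as a 1-based inclusive index range is an assumption inferred from the listing (eight processors addressed as `p[1:8]`), not a statement about the ANISOL parser.

```python
import re

# Matches references such as "p[1:8]", "v[ 3:4]" or "v[5]".
_REF = re.compile(r"^\s*([A-Za-z_]\w*)\s*\[\s*(\d+)\s*(?::\s*(\d+)\s*)?\]\s*$")


def expand(ref):
    """Expand a unit reference into a list of (name, index) pairs.

    Assumption: "name[a:b]" is an inclusive 1-based range and "name[a]"
    addresses a single unit.
    """
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"not a unit reference: {ref!r}")
    name = match.group(1)
    low = int(match.group(2))
    high = int(match.group(3)) if match.group(3) is not None else low
    if high < low:
        raise ValueError(f"empty range in {ref!r}")
    return [(name, index) for index in range(low, high + 1)]


def apply_set(params, ref, values):
    """Mimic a set command: store behaviour parameters for each unit in ref."""
    for unit in expand(ref):
        params.setdefault(unit, {}).update(values)
    return params


if __name__ == "__main__":
    behaviour = {}
    apply_set(behaviour, "p[1:8]", {"iat": 2, "pro": 80})
    apply_set(behaviour, "v[1:2]", {"srv": 10})
    apply_set(behaviour, "v[3:4]", {"srv": 20})
    for unit, values in sorted(behaviour.items()):
        print(unit, values)
```

A dictionary keyed by (name, index) is only one possible representation of the device database; it was chosen here because it keeps behaviour parameters of different unit types in one place.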

4. Simulation algorithm

PASE uses an object oriented approach to the simulation of networks. Each device is an object of one of the types that are supported by ANISOL (RU, SU, RS, CL). Each object has a communication protocol with the basic simulation procedures. This approach has the following advantages:

(1) All the internal operations performed by the devices are hidden from the simulation procedures. A disadvantage of a simulator compared with an analytical model is the difficulty of managing the development and maintenance of a large piece of software [8]. In particular, modifying and enhancing a simulator is a task that might cause side-effects. With the object oriented approach a major change can be confined to a limited section of the program. For example, an enhancement such as the support of multitasking processors requires a request queue mechanism for the service and request units. Since the communication protocol between the modules of the simulator does not require modification, the changes will be made only inside the devices. Other complicated parts of the program, such as the simulator procedures, will not be affected.
(2) The object oriented paradigm used by the simulator is suitable for coarse grained parallelisation. This will allow the implementation of PASE on an MIMD computer.

The main body of the simulation executes a set of procedures, see Fig. 3. Special care has been taken to allow the pipelining of the simulation procedures in order to allow the future parallelisation of the simulation kernel.

Fig. 3. Simulation procedure.

These procedures are:

(1) The processing of the waiting list: Initially the simulator collects all the units (RUs or RSs) that create a new request in a waiting list. This list also includes the units that have created requests in previous cycles without being satisfied. A request may not be satisfied either because the required resource has been occupied by another processor, or because the selected connection path was busy. The waiting list creation procedure depends on the workload characterisation. The units that create requests include behaviour parameters, which can be modified by the user in the ANISOL code and which define the request issue pattern (request issue probability, inter-arrival time, etc.). This simulation phase mainly examines the workload description of each request unit in order to evaluate whether the unit is ready to issue requests. The waiting list is used in the following stages to actually issue the requests, see Fig. 3.
(2) Resource matching: The matching procedure assigns a resource to each member of the waiting list. The request units in the waiting list will request specific service units based on the matching procedure of this simulation stage. The selection is made either randomly or by using the modified random splitting method. This procedure is complex because it depends both on the workload characterisation and on the hardware configuration. For example, a processor (RU) might request a special device unit (e.g. a sorting VLSI) during the simulation. The resource matching procedure will identify that the processor is in the appropriate stage of its workload cycle to request a sorting VLSI and then search through the hardware configuration to identify the location of these units.
(3) Connection: Each unit in the waiting list is connected to the resource that has been selected in the previous step. This phase contains a number of steps. The first is the selection stage, where the request units compete for the available service units. For example, two or more processors might request the same memory unit of the system. To resolve this race condition the selection procedure uses a predefined rule to pick one of the competing processors. Currently a random mechanism that ensures uniform distribution of resources to the processors exists, but other methods are under development, including First Come First Served (FCFS) and First Come Last Served (FCLS). The second step of the connection stage is the establishment of a connection path from the selected request unit to the matched service unit. The final step is the update of the result parameters of the request and service units. For example, the number of requests satisfied for a request unit will increase by one, while for the service unit the total cycles worked will increase accordingly. These result parameters are the output that can be obtained by the user to assess the performance of the system.
(4) Disconnect: This is the final stage of each simulated cycle. The units that are ready to be disconnected are selected. A path is established in order to return results from the service unit to the request unit. For example, after a sorting VLSI finishes, the sorted array is returned to the processor that requested this service.
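The four procedures above can be sketched as a small cycle-based loop in Python. This is a deliberately reduced illustration, not the PASE kernel: the function `simulate` and its parameters are hypothetical, connection lines are assumed to be always available (only resource contention is modelled), every service unit has the same service time, a request unit stays connected until its service ends, blocked requests are repeated to the same resource, and arbitration is uniformly random as in the current mechanism described above.

```python
import random


def simulate(n_ru, n_su, prob, srv, cycles, seed=0):
    """Run a simplified cycle-based simulation and return result parameters.

    Returns (served, worked): requests served per request unit and cycles
    of work assigned per service unit. All names are illustrative.
    """
    rng = random.Random(seed)
    waiting = {}             # request unit -> matched service unit (None = unmatched)
    connected = {}           # request unit -> cycle at which its service ends
    free_at = [0] * n_su     # cycle at which each service unit becomes free
    served = [0] * n_ru
    worked = [0] * n_su

    for t in range(cycles):
        # (1) Waiting list: idle request units issue a request with
        #     probability `prob`; blocked units are already in the list.
        for ru in range(n_ru):
            if ru not in waiting and ru not in connected and rng.random() < prob:
                waiting[ru] = None

        # (2) Resource matching: new requests pick a service unit at random;
        #     blocked requests keep their previous target.
        for ru in list(waiting):
            if waiting[ru] is None:
                waiting[ru] = rng.randrange(n_su)

        # (3) Connection: request units compete for each free service unit
        #     and one winner per unit is selected uniformly at random.
        contenders = {}
        for ru, su in waiting.items():
            if free_at[su] <= t:
                contenders.setdefault(su, []).append(ru)
        for su, rus in contenders.items():
            winner = rng.choice(rus)
            del waiting[winner]
            free_at[su] = t + srv
            connected[winner] = t + srv
            served[winner] += 1      # update result parameters
            worked[su] += srv

        # (4) Disconnect: release units whose service ends with this cycle.
        for ru in [r for r, end in connected.items() if end <= t + 1]:
            del connected[ru]

    return served, worked


if __name__ == "__main__":
    served, worked = simulate(n_ru=8, n_su=8, prob=0.8, srv=10, cycles=2000, seed=1)
    print("requests served per RU:", served)
    print("cycles worked per SU:  ", worked)
```

Because each stage only reads and updates the shared lists, the stages could in principle be pipelined, which is the property the text highlights for the future parallelisation of the kernel.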

5. Using the PASE environment

A number of simulation examples that show the use of the PASE environment and ANISOL are presented. The case studies include the network performance prediction of a clustered multiprocessor system, a hierarchical multiprocessor system, and a cluster of workstations. The systems are first described, and then the results obtained by the simulator are presented along with results obtained from analytical methods and real measurements. The purpose of these examples is not to present an in-depth analysis of the systems examined, but to demonstrate the range of applications of the simulator, the accuracy of the results compared to other methods, and the use of PASE. In realistic case studies PASE handles more complex systems in terms of the system's configuration and workload representation.

The first system is a cluster based multiprocessor system [22] that consists of L clusters and a set of M global memory modules. Each cluster contains N processors and N VLSI modules. These resources are connected with a multiple-bus or a crossbar interconnection system. The communication between the clusters and the global memory modules is made via a global interconnection system, which can be viewed as a multiple-bus or a crossbar. In our study, multiple-bus interconnection networks are used for both global and intra-cluster communications, see Fig. 4. The ANISOL program that describes the architecture is included in the Appendix.

The system consists of two clusters containing multiple-bus networks. Each cluster contains five processors and five special purpose VLSI devices (e.g. sorting, FFT). The two clusters are connected to a global multiple-bus network. Five memory modules, accessible by the processors of both clusters, are also connected to the global network. 90% of the requests are issued to the local VLSI devices and the remaining 10% are issued to the global memory modules. The effective bandwidth of the system is obtained for a probability of request varying from 10% to 100%. The results are compared with results obtained from the analytical performance study presented in [22] (Table 1, Fig. 5).
From this comparison it is clear that the results obtained by the analytical methods have the same trends as the results obtained from simulation. The simulation results are lower than the analytic results by approximately 7.5%.

Fig. 4. Cluster based multiprocessor system.

Table 1
Cluster based multiprocessor system results

r (%)    Simulation results    Analytic results    Error (%)
 10           1254                  1365               8.8
 20           1811                  2058              13.6
 30           2092                  2265               8.2
 40           2330                  2488               6.7
 50           2450                  2643               7.8
 60           2587                  2756               6.5
 70           2658                  2841               6.8
 80           2714                  2906               7
 90           2773                  2956               6.5
100           2805                  2992               6.6

Fig. 5. Performance prediction for cluster multiprocessor system (horizontal axis: request probability (%)).

The second system under investigation is an L-level hierarchical cluster based multiprocessor system presented in [23, 24]. The system consists of L levels of hierarchy as shown in Fig. 6. The first level (root) includes $n_1$ memory modules and $n_1$ clusters. The same configuration, with the addition of one output that leads to the upper level, can be found in each cluster of the l level (1 < l

Fig. 6. L-level hierarchical cluster based multiprocessor system.

includes $n_1 \cdot n_2 \cdots n_{L-1}$ base clusters, which contain $n_L$ processors and $n_L$ memory modules. Intra-cluster communication is established via a multiple-bus interconnection network. The communication between processors situated in different base clusters (inter-cluster) is serviced by shared memory modules in the suitable level. The interconnection networks of the other levels are multiple-buses. For L = 1 the architecture is a multiple-bus multiprocessor system and for L = 2 the previous


clustered system. In this study a three-level system is configured with four processors and memory modules in each base cluster and four memories in each cluster in the upper levels. The multiple-bus interconnection networks in all levels include two buses. 80% of requests are issued towards the local memory modules and the remaining 20% are equally divided between the memory modules situated in the upper levels. The performance of the system has been investigated for a probability of request varying from 10% to 100%. The ANISOL script used to simulate the architecture is an extension of the script developed for the clustered system. The results obtained from the simulation, along with results obtained from the analytical performance study presented in [22, 23], are shown in Table 2 and Fig. 7. Again, the results obtained from the two methods are similar.

Table 2
Hierarchical based multiprocessor system results

r (%)    Simulation results    Analytic results    Error (%)
 10           3968                  4188               5.5
 20           5856                  7287              24.4
 30           7085                  8481              19.7
 40           7939                  8909              12.2
 50           8535                  9317               9.1
 60           8969                  9845               9.7
 70           9319                 10081               8.1
 80           9607                 10140               5.5
 90           9726                 10254               5.4
100          10113                 10287               1.5

Fig. 7. Performance prediction for hierarchical cluster multiprocessor system (horizontal axis: request probability (%)).

PASE has also been used to predict the performance of workstation clusters connected through Ethernet. In [25] PASE has been used to identify bottlenecks of parallel applications that run on more than one workstation. The case study involved the analysis of the performance of applications that have been parallelised with the PVM library on a network of Sun workstations. The master/slave paradigm has been evaluated as a parallelisation strategy for highly asynchronous applications, such as image synthesis and molecular modelling. PASE has been used to simulate the behaviour of the cluster of workstations. The results obtained from the simulation have been compared with real measurements for an image synthesis application running on a cluster of 2 to 9 workstations. In Fig. 8, the difference between the measured and simulated speed-up is shown for two system configurations. In the first configuration each processing request packet, sent by the master processor to the slave processors, contains 256 pixels, while in the second configuration each packet contains 64 pixels.

Fig. 8. Real measurements vs simulation predictions for cluster of workstations (horizontal axis: number of workstations).

In the previous examples the use of the PASE environment has been demonstrated. The simulator is being used for homogeneous and heterogeneous parallel/distributed systems to predict their performance under different traffic loads. The results obtained from the simulation have been compared to real measurements and predictions from analytical techniques. From this comparison it can be seen that the accuracy of the predictions is satisfactory.
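The paper does not state how the Error (%) columns of Tables 1 and 2 were computed. As a hedged check, the Python sketch below assumes the error is the difference between the analytic and simulated values relative to the simulated value; the function name `relative_error_percent` is hypothetical. Under that assumption the first rows of Table 1 are reproduced to within about 0.1 percentage points, which is consistent with the published figures having been rounded or truncated.

```python
def relative_error_percent(simulated, analytic):
    """Difference between analytic and simulated values, as a percentage
    of the simulated value (assumed definition, not stated in the paper)."""
    return abs(analytic - simulated) / simulated * 100.0


if __name__ == "__main__":
    # (request probability %, simulation result, analytic result) from Table 1
    table1_rows = [(10, 1254, 1365), (20, 1811, 2058), (30, 2092, 2265)]
    for r, simulated, analytic in table1_rows:
        error = relative_error_percent(simulated, analytic)
        print(f"r = {r:3d}%  error = {error:.2f}%")
```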

6. Conclusions and future extensions

A novel simulation environment, PASE, has been presented and studied. Its capabilities make it a useful tool for the analysis of computer architectures and distributed systems. The ability to simulate Request-Service units makes it suitable for the study of computer networks and architectures that include transputers and workstations. The capability to simulate a system in which devices may become defective during operation also allows a more realistic study of a system. Nevertheless, extensions that will provide new features are under development.

One extension to the simulator is the ability to create more than one request per RU or RS at a given time. Each RU and RS will be able to create multiple requests per device. This approach will allow the simulation of multitasking computer systems.

Another extension, for the more representative characterisation of workload, is the integration of the simulator as part of a parallel software characterisation tool. This tool is based on the layered approach for characterising parallel systems and is under development [13, 15]. The methodology separates the characterisation of the system into four independent layers, namely: application, sub-task, parallel template, and hardware layer. The simulator will be part of the hardware layer for the evaluation of the network contention of the system. The layered approach uses advanced techniques for workload characterisation, by analysing the software in terms of execution flow, resource usage, and data dependency parameters.

Finally, the simulator will be modified to use the message passing system PVM [19]. The aim of this effort is to allow PASE to run on a network of workstations or a parallel computer system. The improvement in execution time can be used to simulate larger systems for a wider range of architectural configurations.

Appendix. Anisol source code of a hierarchical multiprocessor system

    # File        : hiersys.net
    # Type        : ANISOL Source
    # Purpose     : Simulate a multiprocessor system with 2 clusters
    # Description : Each cluster of the system has a multibus network
    #               The clusters are connected with a global multibus network

    :config                     # Simulation and network configuration module
        simname clusters;       # Name of the network
        dtrans 1;               # Time required for data transfer
        path 0;                 # Use shortest path to connect units
        failrec retry;          # Use retry method when a request fails
    :end                        # End of config module

    :network                    # Network description module
        # Define devices of cluster A
        ru pa[5];               # Processors
        su va[5];               # Special purpose VLSI
        cl bva[10];             # Buses connecting devices to the network
        cl bha[2];              # Buses that form the local multiple-bus network
        cl bia;                 # Interface to the global network

        # Define devices of cluster B
        ru pb[5];               # Processors
        su vb[5];               # Special purpose VLSI
        cl bvb[10];             # Buses connecting devices to the network
        cl bhb[2];              # Buses that form the local multiple-bus network
        cl bib;                 # Interface to the global network

        # Global memory system
        su mem[5];              # Global memories
        cl bvg[5];              # Buses connecting memories with network
        cl bhg[2];              # Buses that form global multiple-bus network

        # Set configuration of devices of cluster A
        set pa[1:5] {           # For processors pa[1] to pa[5]
            pro = 50,           # Set request generation probability
            iat = 2,            # Set time between generation of requests
            lim = 10,           # Set processing time
            prl = 50            # Set probability for local request
        };
        set va[1:5] {           # For VLSI va[1] to va[5]
            srv = 5             # Set service time
        };
        local {                 # Group devices of cluster A
            pa[1:5],
            va[1:5]
        };

        # Set configuration of devices of cluster B
        set pb[1:5] {           # For processors pb[1] to pb[5]
            pro = 50,           # Set request generation probability
            iat = 2,            # Set time between generation of requests
            lim = 10,           # Set processing time
            prl = 50            # Set probability for local request
        };
        set vb[1:5] {           # For VLSI vb[1] to vb[5]
            srv = 5             # Set service time
        };
        local {                 # Group devices of cluster B
            pb[1:5],
            vb[1:5]
        };

        # Define the network topology for cluster A
        con {                           # Connect command
            (pa[1:5] > bva[1:5]),       # Connect each processor with one bus
            (va[1:5] > bva[6:10]),      # Connect each VLSI with one bus
            (bva[1:10],bha[1:2]),       # Connect device buses with multiple bus network
            (bia,bha[1:2])              # Global network interface bus
        };

        # Define the network topology for cluster B
        con {                           # Connect command
            (pb[1:5] > bvb[1:5]),       # Connect each processor with one bus
            (vb[1:5] > bvb[6:10]),      # Connect each VLSI with one bus
            (bvb[1:10],bhb[1:2]),       # Connect device buses with multiple bus network
            (bib,bhb[1:2])              # Global network interface bus
        };

        # Define global interconnection network
        con {                           # Connection command
            (mem[1:5] > bvg[1:5]),      # Connect each global memory with one bus
            (bvg[1:5],bhg[1:2]),        # Connect memory buses with global network
            (bia,bhg[1:2]),             # Connect cl. A with global network
            (bib,bhg[1:2])              # Connect cl. B with global network
        };
    :end                        # End of network module

    :result                     # Result definition module
        # Print result tables for VLSIs in cluster A
        @{ "\n\nVLSI CLUSTER A","\n","\n% I"," I ",
            # Print result parameters for cluster A VLSI
            va[1:5],
            # Print requests satisfied and total cycles worked
            strq,tlcr
        };

        # Print result tables for VLSIs in cluster B
        @{ "\n\nVLSI CLUSTER B","\n","\n% I"," I ",
            # Print result parameters for cluster B VLSI
            vb[1:5],
            # Print requests satisfied and total cycles worked
            strq,tlcr
        };

        # Print result table for global memories
        @{ "\n\nGlobal memories","\n","\n% I"," I ",
            # Print result parameters for global memories
            mem[1:5],
            # Print requests satisfied and total cycles worked
            strq,tlcr
        };
    :end                        # End of result module
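To make the topology defined by the con blocks of the listing easier to check, the following Python sketch expands the ranged connection entries into individual links. It is an illustration only, not part of PASE or ANISOL. It assumes, from the comments in the listing, that `a > b` pairs devices one-to-one ("each processor with one bus") and that `a,b` connects every device of `a` to every device of `b` (the multiple-bus case). All function names are hypothetical.

```python
from itertools import product


def expand(name, lo=None, hi=None):
    """Expand a ranged device reference such as bva[1:10] into indexed names."""
    if lo is None:
        return [name]  # scalar device such as bia
    return [f"{name}[{i}]" for i in range(lo, hi + 1)]


def one_to_one(left, right):
    """Assumed meaning of 'a > b': pair the i-th device of a with the i-th of b."""
    if len(left) != len(right):
        raise ValueError("one-to-one connection needs ranges of equal length")
    return list(zip(left, right))


def all_to_all(left, right):
    """Assumed meaning of 'a,b': connect every device of a to every device of b."""
    return list(product(left, right))


def cluster_links(tag):
    """Links of one cluster, mirroring the con block for cluster A or B."""
    p, v = f"p{tag}", f"v{tag}"
    bv, bh, bi = f"bv{tag}", f"bh{tag}", f"bi{tag}"
    return (
        one_to_one(expand(p, 1, 5), expand(bv, 1, 5))
        + one_to_one(expand(v, 1, 5), expand(bv, 6, 10))
        + all_to_all(expand(bv, 1, 10), expand(bh, 1, 2))
        + all_to_all(expand(bi), expand(bh, 1, 2))
    )


def global_links():
    """Links of the global network, mirroring the last con block."""
    return (
        one_to_one(expand("mem", 1, 5), expand("bvg", 1, 5))
        + all_to_all(expand("bvg", 1, 5), expand("bhg", 1, 2))
        + all_to_all(expand("bia"), expand("bhg", 1, 2))
        + all_to_all(expand("bib"), expand("bhg", 1, 2))
    )


links = cluster_links("a") + cluster_links("b") + global_links()
print(len(links), "links")
```

Under these assumptions each cluster contributes 32 links and the global network 19. Counting links this way is a quick sanity check that every declared bus is actually used by some connection entry.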


Acknowledgement

The authors wish to thank the reviewers for the helpful comments and suggestions during the revision of this paper.

References

[1] W.T. Chen and J.P. Sheu, Performance analysis of multiple bus interconnection networks with hierarchical requesting model, IEEE Trans. Comput. 40 (1991) 834-842.
[2] I. Chlamtac and R. Jain, A methodology for building a simulation model for efficient design and performance analysis of local area networks, Simulation 42 (2) 57-66.
[3] R.A. Colton, A case study in packet-switching network simulation on a personal computer, in: Modelling and Simulation on Microcomputers (1990) 99-103.
[4] COMNET II.5 User's Manual with Release 2.9 Supplement (CACI Products Company, La Jolla, CA, 1989).
[5] R.J. Doyle, Object-oriented simulation programming, in: Object Oriented Simulation (1990) 2-6.
[6] E. Harrington, Voice/data integration using circuit switched networks, IEEE Trans. Comm. 28 (1980) 781-793.
[7] C. Herring, ModSim: A new object-oriented simulation language, in: Object Oriented Simulation (1990) 55-60.
[8] R. Jain, The Art of Computer Systems Performance Analysis (John Wiley & Sons, New York, 1991).
[9] Z.A. Karian, GPSS for microcomputers: A software review of GPSS/PC, J. Math. Management Sci. 5, 93-101.
[10] C. Lin and T.Y. Feng, Tutorial on Networks for Parallel & Distributed Processing (IEEE Computer Society Press, Silver Spring, MD, 1984).
[11] I.O. Mahgoub and A.K. Elmagarmid, Performance Analysis of a Generalized Class of M-Level Hierarchical Multiprocessor System, IEEE Trans. Parallel Distributed Systems 3 (1992) 129-138.
[12] F. Neelamkavil, Computer Simulation and Modelling (John Wiley & Sons, Inc., 1987).
[13] G.R. Nudd, E. Papaefstathiou, Y. Papay et al., A layered approach to the characterisation of parallel systems for performance prediction, in: Proceedings of the Performance Evaluation of Parallel Systems Workshop (1993) 26-34.
[14] E. Papaefstathiou, A simulation environment for distributed/parallel systems, Senior Dissertation, Digital Systems & Computers Lab., Physics Department, University of Thessaloniki, 1988.
[15] E. Papaefstathiou, D.J. Kerbyson and G.R. Nudd, A layered approach to parallel software performance prediction: A case study, in: Proceedings of the 1994 EUROSIM Conference on Massively Parallel Processing (1994).
[16] A. Pombortsis, Sharing special purpose resources in a multiprocessor environment, Inform. Process. Lett. 34 (1990) 255-260.
[17] A. Pombortsis, P. Linardis and C. Halatsis, Study of resource arbitration interconnection network for multiprocessor systems, in: Proceedings of IEEE CompEuro 87 (1986) 797-800.
[18] A. Pombortsis, E. Papaefstathiou and A. Veglis, ANISOL V. 2.10 User Manual, Digital Systems & Computers Lab., Physics Department, University of Thessaloniki, 1992.
[19] PVM 3.0 User's Guide and Reference Manual Draft (Oak Ridge National Laboratory, Oak Ridge, TN 37831, 1993).
[20] E.C. Russell, Building Simulation Models with SIMSCRIPT II.5 (CACI, Inc., 1983).
[21] S.A. Sorensen, Performance Studies of Computer Networks, in: Proceedings of the UKSC Conference on Computer Simulation (1984) 477-487.
[22] A. Veglis and A. Pombortsis, A hierarchical multiprocessor system with clusters and global memory modules, in: Proceedings of the First General Conference of the Balkan Physical Union (1991) 628-630.
[23] A. Veglis and A. Pombortsis, Performance related analysis of L-level hierarchical shared-memory multiprocessors, in: Proceedings of ParCo93.
[24] A. Veglis and A. Pombortsis, A comparative performance analysis of hierarchical shared-memory multiprocessors, Technical Report DI-3, Aristotelian University of Thessaloniki, 1994.
[25] P.H. Von, Parallel processors workload simulation, MSc Thesis, University of Warwick, 1993.