Parallel Computing 19 (1993) 633-649 North-Holland
Modeling and evaluation of a new message-passing system for parallel multiprocessor systems

Helnye Azaria and Yuval Elovici
Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel

Received 6 January 1992
Revised 1 September 1992
Abstract

Azaria, H. and Y. Elovici, Modeling and evaluation of a new message-passing system for parallel multiprocessor systems, Parallel Computing 19 (1993) 633-649.

As parallel implementation of complex applications is becoming popular, the need for a high-performance interprocessor communication system becomes imminent, especially in loosely coupled distributed-memory multiprocessor networks. An important factor in the efficiency of these networks is the effectiveness of the message-passing system which manages the data exchanges among the processors of the network. This paper presents the modeling and performance evaluation of a new Message-Passing System (MPS) for distributed multiprocessor networks without shared memory, in which the processors or Processing Elements (PEs) are connected to each other by point-to-point communication links. For maximum performance, the MPS manages the communication and the synchronization between the different tasks of an application by means of three approaches. The first is an asynchronous send/receive approach which efficiently handles server-like tasks, the second is a synchronous send/receive approach which efficiently handles the streaming communication mode, and the third is a virtual channel approach which minimizes the overhead of the synchronization mechanism, efficiently handling the burst mode of heavy communication between tasks. The developed models of the MPS approaches enable the determination of analytical expressions for different performance measures, and a comparison between analytical and experimental performances reveals that the models predict the MPS performance with high accuracy. The MPS, written in Parallel ANSI C, is studied on a mesh topology network of 16 T800 transputers. The MPS performances for each approach are studied and presented in terms of communication latency, throughput, computation efficiency and memory consumption.
Keywords. Message-passing system; distributed-memory multiprocessor; parallel implementation; transputer network; modeling; performance evaluation.
1. Introduction

The performance of a complex application implemented on a parallel multiprocessor system is determined by the effectiveness of the message-passing system between the different tasks of the application running in distinct PEs of the multiprocessor system [1,9,12-16]. For a given application, the communication demands on the message-passing mechanism can differ from one task to another, depending on the function of the tasks, and therefore the use of
a message-passing system based on different approaches is preferable for complex applications [3,6]. The application designer can choose the appropriate approach to handle the communication depending on the application task functions, enabling an efficient implementation, especially when the application is composed of different types of tasks.

Parallel processing applications using message passing involve the overheads of communication and synchronization arising from the message-passing system. The performance of a message-passing system is determined by the amount of time spent in the process, the communication and the synchronization of the message-passing tasks [5,11]. These times, which significantly affect the overall execution time of an application, are functions of both the hardware and the message-passing system used.

A new Message-Passing System (MPS) for parallel multiprocessor systems, based on different approaches, has been proposed by the authors [3,6]. In order to propose a general model for evaluating the performance of any implementation of the new MPS, an analytical study of the different approaches of the MPS was performed. The study consists of the development of general models, independent of the technical specifications of the PEs of the network, which have been validated by measurements on the MPS implementation using a network of 16 transputers (T800) [7]. The experiments were carried out to assess the accuracy of the assumptions and approximations involved in the modeling. The results reveal that the models can predict the MPS performance with high accuracy.

In the next sections, a general description of the MPS and its different approaches is given, followed by the proposed modeling of the MPS approaches. An analysis of the analytical and experimental performance is presented in Section 4, followed by the conclusions.
2. General description of the MPS [3]

The MPS is built of similar Message-Passing Cores (MPCs) running in parallel with the Application Tasks (ATs), where only the MPC is connected to the physical links, so the communication over the links cannot be blocked by the ATs. One MPC is placed in each PE of the system (Fig. 1).
2.1. Application task connections

In order to be implemented on a distributed system, an application is divided into concurrent tasks. The tasks are mapped on the different PEs of the distributed system. These tasks are called Application Tasks (ATs). The ATs can communicate only through the MPCs of the MPS. The ATs are connected to the MPC through pairs of software channels. Each pair of channels enables the AT to access another AT in the distributed system. One pair of channels is created for each unidirectional (half-duplex) communication between two ATs and can serve only for this connection. The ATs use the output channel to send three different commands: Request To Send a message (RTS), Request To Receive a message (RTR) or Request For Acknowledge (RFA), while the input channel serves to receive the Acknowledge (A). Each output channel has a corresponding input channel as an acknowledge channel. In order to achieve high-speed communication between the PEs, the receiving and the sending are implemented by transferring messages directly between the AT's memory workspace and the physical links. Messages are not duplicated by the MPC inside the same PE since the requests to send or receive messages contain only pointers.
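The AT-to-MPC interface just described can be pictured with a small C sketch. The command codes and the descriptor layout below are assumptions made for illustration only; the paper specifies the commands (RTS, RTR, RFA, A) and the pointer-based exchange, but not the actual data structures of the MPS.

#include <stddef.h>

typedef enum {
    CMD_RTS,   /* Request To Send a message    */
    CMD_RTR,   /* Request To Receive a message */
    CMD_RFA    /* Request For Acknowledge      */
} mps_command_t;

/* A command carries only a pointer into the AT's workspace plus a length,
 * so the MPC never copies message data inside the same PE.                */
typedef struct {
    mps_command_t cmd;
    int           dest_at;    /* logical id of the peer AT          */
    void         *workspace;  /* message buffer in the AT's memory  */
    size_t        length;     /* message size in bytes              */
} mps_request_t;

/* Each AT<->AT connection is served by one output channel (requests) and
 * one input channel (acknowledges), modelled here as opaque handles.      */
typedef struct {
    int out_channel;  /* carries RTS / RTR / RFA towards the MPC */
    int in_channel;   /* carries the acknowledge (A) back        */
} mps_connection_t;

/* Build an RTS request for a message already placed in the AT workspace. */
mps_request_t make_rts(int dest_at, void *buf, size_t len)
{
    mps_request_t r = { CMD_RTS, dest_at, buf, len };
    return r;
}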
Fig. 1. MPC structure and its connection to ATs. (PE: one transputer T800 + 1 Mbyte memory.)
2.2. MPS core and tasks

The MPC consists of several tasks which are executed at the same high priority, while the ATs connected to it are executed at the same low priority; thereby a fast implementation of the MPC routing functions is obtained. The PEs of the network considered have to support at least two task priorities in a multitasking environment, as does the transputer T800 used in the network [7]. The MPC consists of one Message Passing Manager (MPM) task and several tasks of three different types: I, O and PO (Fig. 2). All the MPC tasks are connected to each other through software channels. To each of the input links an Inputer task (I) is connected, to each of the output links an Outputer task (O) is connected, and to each O task a Pre-Outputer task (PO) is attached. Therefore the number of tasks of the MPC is 3 * 4 + 1 = 13 tasks (4 being the number of physical links of the PE).
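As an illustration of this task structure, the following C fragment collects the MPC task handles of one PE; the type and field names are invented for the sketch, and only the 3 * 4 + 1 arithmetic comes from the text.

#include <stdio.h>

#define PHYSICAL_LINKS 4   /* the T800 has four bidirectional links */

typedef struct {
    int mpm;                           /* Message Passing Manager task      */
    int inputer[PHYSICAL_LINKS];       /* one I task per input link         */
    int outputer[PHYSICAL_LINKS];      /* one O task per output link        */
    int pre_outputer[PHYSICAL_LINKS];  /* one PO task paired with each O    */
} mpc_tasks_t;

int main(void)
{
    /* 3 tasks per link + 1 MPM = 13 MPC tasks on a 4-link PE */
    printf("MPC tasks per PE: %d\n", 3 * PHYSICAL_LINKS + 1);
    return 0;
}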
2.3. MPS approaches

The MPCs manage the communication and synchronization between the different ATs of an application, running in different PEs, by means of three approaches.
Fig. 2. Communication between two application tasks - Approach 1:
(1) T1 sends a Request To Send (RTS) to MPC (PE1).
(2) MPC (PE1) sends the message (M) to PE2 through the buffers (B).
(3) MPC (PE2) routes the message (M) to PE3.
(4) MPC (PE3) receives the message (M) and loads it in the memory workspace of T2.
Fig. 3. Communication between two application tasks - Approach 2:
(1) T1 requests to send a message to T2 (RTS).
(2) T1 requests for an acknowledge (RFA).
(3) T2 requests to receive a message from T1 (RTR).
(4) T2 requests for an acknowledge (RFA).
(5) MPC (PE3) sends to PE2 the control message (CMRTR) through the buffers and the physical link when step 3 occurs.
(6) MPC (PE2) routes the control message CMRTR to PE1.
(7) MPC (PE1) sends the message (M) through the buffers and the physical link.
(8) MPC (PE2) routes the message (M) to PE3.
(9) MPC (PE3) sends an acknowledge (A) to T2; the RTR is fulfilled.
(10) MPC (PE1) sends an acknowledge (A) to T1; the RTS is fulfilled.
* Step 10 can occur at any step after step 7.
The application designer can choose the most appropriate approach to handle the communication between the ATs running in different PEs, depending on the function of each task.
2.3.1. MPS approach 1

The first MPS approach is a straightforward asynchronous send/receive approach: when an AT sends a message, the MPC assumes that the receiver AT is ready to receive the message. The first MPS approach is based on the concept of two buffers per physical link. This straightforward approach, which was designed to allow asynchronous AT communications, is efficient for communication between ATs that are always ready to receive data but do not know the exact memory location of the received data. The operation of this MPS approach, handling the communication between two tasks T1 and T2 in PE1 and PE3 respectively, with one intermediate PE (PE2), has four steps which are illustrated in Fig. 2. The disadvantage of this approach is the lack of synchronization between the sender and the receiver. To overcome this problem when necessary, the second MPS approach is proposed.

2.3.2. MPS approach 2

The MPS approach 2 performs a synchronized communication between two ATs. In this approach, the MPS sends messages only if the messages were requested by the receiver. This mechanism is implemented using a control message: the Control Message Request To Receive (CMRTR). The main disadvantage of this approach is that for every request, two messages are sent in sequence per communication (control message and data message). The MPS approach 2 is based on the MPS approach 1, but in order to avoid congestion due to the overflow of messages that are not ready to be received, a 'request to receive' scheme is employed. Its operation is illustrated in Fig. 3.
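A minimal C sketch may help contrast the two disciplines. The names and the pending-send record below are hypothetical; only the rule that Approach 2 forwards a data message after the matching CMRTR has arrived (steps (6)-(8) of Fig. 3), whereas Approach 1 forwards it unconditionally, is taken from the text.

/* Sketch (assumed names) of how the sending MPC defers a message under
 * Approach 2: the RTS is held until the matching CMRTR control message
 * arrives from the receiver's side, then the data message is sent.       */
typedef struct {
    int      dest_at;
    void    *msg;
    unsigned len;
    int      cmrtr_seen;   /* set when the CMRTR for this request arrives */
} pending_send_t;

/* Called by the MPM when a CMRTR control message reaches the sending PE. */
void on_cmrtr(pending_send_t *p)
{
    p->cmrtr_seen = 1;     /* step (6) of Fig. 3 has completed            */
}

/* Called from the MPM loop: forward the data message only once the
 * receiver has asked for it (steps (7)-(8) of Fig. 3).                    */
int try_forward(pending_send_t *p, void (*send_on_link)(void *, unsigned))
{
    if (!p->cmrtr_seen)
        return 0;          /* Approach 1 would send unconditionally here  */
    send_on_link(p->msg, p->len);
    return 1;
}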
2.3.3. MPS approach 3

The MPS approach 3 is proposed to minimize the overhead of the synchronization mechanism of approach 2 when the mode of communication is a burst mode. This approach is based on the concept of virtual channels. A virtual channel is opened by the MPCs when an AT (T1) sends a message to another AT (T2) for the first time.
Fig. 4. Communication between two application tasks - Approach 3. (R: request to open a virtual channel; A: acknowledge; M: message; three buffers per virtual channel in each PE.)
The virtual channel is implemented by allocating three buffers in all the PEs that are on the path between the two ATs which need to communicate.

Opening a virtual channel
When an AT requests to send a message to another AT, each MPC on the path between the two tasks checks whether a virtual channel is already open. If a virtual channel is open and the next PE on the path has a free buffer on this virtual channel, the message is sent to the next PE. If the virtual channel is not open, the MPC requests the next MPC on the path to open a virtual channel (i.e. to allocate three buffers for the new virtual channel). After allocating the three buffers, the MPC sends an acknowledge to the previous MPC indicating that three buffers are free, enabling the sending of the first message to the next PE on the path on this virtual channel. The opening procedure of a virtual channel repeats itself in the next PEs on the path until the first message reaches the PE where the receiver task is placed. Once a virtual channel is open, other messages can pass between the two ATs without the need to open another virtual channel. The MPS assumes that there is always enough memory to open the virtual channels of a specific application, if no failure occurs.
Communicating through virtual channels
Each MPC in the network keeps track of the number of empty buffers in the next PEs of all the virtual channels passing through it. In this way messages are not sent between PEs unless there is a free buffer on the virtual channel in the next PE of the path. The operation of this MPS approach, handling the communication between two tasks T1 and T2 in the case of one intermediate PE, is illustrated in Fig. 4. It is important to notice that in Approach 3 a fully pipelined communication between T1 and T2 is obtained (see step (h) in Fig. 4).
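The following C sketch, under assumed names and layout, illustrates the per-channel bookkeeping described above: a request to open (R), the acknowledge (A) that grants three buffer credits, and forwarding gated on the remaining credit, as in Fig. 4.

#define VC_BUFFERS 3

typedef struct {
    int open;     /* has the next PE acknowledged the channel?    */
    int credits;  /* free buffers in the next PE (0..VC_BUFFERS)  */
} vc_state_t;

/* First message on a connection: ask the next MPC to allocate its buffers. */
void vc_request_open(vc_state_t *vc, void (*send_open_request)(void))
{
    if (!vc->open) {
        send_open_request();   /* R in Fig. 4                              */
        vc->credits = 0;       /* wait for the acknowledge before sending  */
    }
}

/* The next MPC answered with an acknowledge: all its buffers are free.     */
void vc_on_open_ack(vc_state_t *vc)
{
    vc->open = 1;
    vc->credits = VC_BUFFERS;  /* A in Fig. 4 */
}

/* Forward a data message only if the next PE can store it.                 */
int vc_try_forward(vc_state_t *vc, void (*send_msg_on_link)(void))
{
    if (!vc->open || vc->credits == 0)
        return 0;
    send_msg_on_link();        /* M in Fig. 4 */
    vc->credits--;
    return 1;
}

/* A buffer was freed downstream (acknowledge flowing back): regain credit. */
void vc_on_buffer_freed(vc_state_t *vc)
{
    if (vc->credits < VC_BUFFERS)
        vc->credits++;
}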
3. Modeling of the MPS approaches
Parallel processing applications using an MPS involve the overheads of communication and synchronization. The performance of a message-passing system is determined by the amount of time spent in the process, the communication and the synchronization of the MPS tasks [5,11,12,16].
The MPS approaches are modelled as repetitive cycles, where a cycle consists of a combination of a start-up time (context switch time, denoted by Ts), followed by a computation/process time (Tp) and by a communication time between tasks (Tc). Overlapping may occur between these times and the communication times of the messages, inside a PE and over the physical links of the PEs, and also between these times in different PEs. The development of the models, which consist of repetitive and overlapped cycles, is based on the simulation of the different modes of operation of the PEs in the network for each approach, according to the descriptions in the previous section. Three modes of communication can occur in a PE [3]:
(1) Receive mode. When two application tasks located in different PEs need to communicate, the MPC of the PE where the destination AT (the task which requests to receive a message) is located handles the receiving of the message from the input link i. Two MPC tasks are involved: the Inputer task (Ii) and the MPM.
(2) Send mode. When two application tasks located in different PEs need to communicate, the MPC of the PE where the source AT (the task T which requests to send a message) is located handles the sending of the message through one of the physical output links. Three MPC tasks are involved: the Pre-Outputer task (POi), the Outputer task (Oi) and the MPM.
(3) Route mode. When a message is routed through a PE, three MPC tasks are involved: Ii, Oj and POj.
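For illustration, the choice between the three modes can be expressed as a small C function; the header fields and names are assumptions, but the decision rule and the tasks involved follow the description above.

typedef enum { MODE_RECEIVE, MODE_SEND, MODE_ROUTE } mpc_mode_t;

typedef struct {
    int src_pe;    /* PE holding the sending AT   */
    int dest_pe;   /* PE holding the receiving AT */
} msg_header_t;

mpc_mode_t classify(const msg_header_t *h, int local_pe)
{
    if (h->dest_pe == local_pe)
        return MODE_RECEIVE;   /* handled by the Inputer (I) and the MPM     */
    if (h->src_pe == local_pe)
        return MODE_SEND;      /* handled by PO, O and the MPM               */
    return MODE_ROUTE;         /* handled by I, O and PO on the transit PE   */
}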
Assumptions
(1) It was convenient during the modeling to consider each mode of operation separately and to assume that, for an in-processor (In-PE) communication between two tasks (tasks of the same MPC), when the tasks are ready the CPU first handles the communication and then processes the task which was first ready to communicate. This assumption, which simulates the scheduler behavior of the PE, was tested and verified on the transputer network used.
(2) According to the observed fact that, for the same approach, the process times of the MPC tasks are almost equal, it is assumed that for each approach one value of Tp may be used for all the process times of the MPC tasks. This value is determined by the average value of the process times of all the MPC tasks, according to the approach. The three average values (one per approach) are obtained by measurement on an MPS implementation and used for checking the accuracy of the results of the modeling.

The modeling of the MPS consists of nine separate models: for each one of the three approaches, three models for the send, route and receive modes. Most of the models are too large and too complex to be presented in this paper. However, as an example, the modeling of Approach 1 is presented. The study of all the models enabled the determination of analytical expressions for the different performance functions presented in the following. The resulting expressions of several MPS performance measures calculated from the models, such as the communication time, the latency, the throughput rate and the computation efficiency, are presented.

Let us use the following notation for the different parameters of the models:
- Ts: context switch time;
- Tp: average process time of the MPC tasks;
- Tc: MPC task communication time;
- M: message size in Kbyte (data message);
- Mh: message header size in Kbyte (command message);
- Ls: link speed in Kbyte/sec;
- Iccs: internal channel communication speed in Kbyte/sec;
- P: number of intermediate PEs between the sender AT and the destination AT.
Fig. 5. Send - Approach 1: communication modeling (timing diagram; the send latency of the first message and the send latency per message in stream mode are marked).

The models describe the timing of the scheduler (S) of the PE and the interaction between the application task (T) and the different MPC tasks (MPM, I, O, PO) involved in each mode of operation. The models are composed of segments which simulate the different times Ts, Tp, Tc, TIccs (internal message communication time) and TLs (message communication time over the physical links). In the models, Ts defines the context switch time from one task to another, Tp defines the process time of the appropriate task and Tc defines the communication time between the two involved tasks, as indicated in the modeling (Fig. 5). The times TIccs and TLs depend on the message size and on the internal channel communication speed (Iccs) and the link speed (Ls), respectively.
3.1. Approach 1

The send, receive and route models of the MPS Approach 1 are developed according to the above, and the resulting expressions are summarized in Table 1. As an example, the send model of Approach 1, based on the send mode, is detailed in Fig. 5.
Table 1. Latency and communication time modeling results for Approach 1.

First message latency:
  Send:    15Ts + 16Tp + 8Tc + M/Ls
  Router:  5Ts + 7Tp + 3Tc + 2M/Ls
  Receive: 3Ts + 3Tp + M/Ls + M/Iccs

Stream latency (per message):
  Send:    5Ts + 6Tp + 3Tc + M/Ls
  Router:  5Ts + 7Tp + 3Tc + 2M/Ls
  Receive: 3Ts + 3Tp + M/Ls + M/Iccs

Stream communication time per message:
  Send:    5Ts + 6Tp + 3Tc + M/Ls
  Router:  4Ts + 6Tp + 3Tc + M/Ls
  Receive: 3Ts + 3Tp + M/Ls + M/Iccs
The send model simulates, in the same way, the scheduler state, the functioning of the MPC tasks involved (MPM, PO, O) and the requests of the application task T. The time TLs represents the communication time of a message over the physical link and is equal to M/Ls. Two cases are considered and indicated in Fig. 5: the first message communication and the stream communication. The resulting expressions for the latency per message in the two cases and for the communication time are given in Table 1, where M/Iccs and M/Ls are respectively the internal and external communication times per message. The total latency, when P intermediate PEs are involved, is obtained from Table 1 by:

$$L_t = (\text{send latency}) + P \cdot (\text{route latency}) + (\text{receive latency}) - (P+1)\frac{M}{L_s} \qquad (1)$$
The term $(P+1)M/L_s$ has to be subtracted because $(P+1)$ is the number of links when there are P intermediate PEs, and for each communication between two PEs (i.e. for each mode) $M/L_s$, the duration of the communication of the data message over the physical link, is counted twice. The total latency per first message communication is given in seconds by:

$$L_t = 18T_s + 19T_p + 7T_c + \frac{M}{L_s} + \frac{M}{I_{ccs}} + P\left(5T_s + 7T_p + 3T_c + \frac{M}{L_s}\right) \qquad (2)$$
The total latency per message in stream communication is given in seconds by:

$$L_t = 8T_s + 9T_p + 3T_c + \frac{M}{L_s} + \frac{M}{I_{ccs}} + P\left(4T_s + 3T_c + 7T_p + \frac{M}{L_s}\right) \qquad (3)$$
The communication times in Table 1 (last row) are related to the throughput calculated in the case of stream communication. For the send and receive modes they are identical to the corresponding latencies, while for the route mode the communication time is different by definition. The highest communication time among the three modes determines the maximum throughput rate. Therefore, in Approach 1 the throughput rate is given in Kbyte/sec by:

$$C_s = \min\left\{\frac{M}{5T_s + 6T_p + 3T_c + M/L_s};\ \frac{M}{4T_s + 6T_p + 3T_c + M/L_s};\ \frac{M}{3T_s + 3T_p + M/I_{ccs} + M/L_s}\right\} \qquad (4)$$
When a stream of messages is routed through a PE where an AT is processed, the computation overhead is determined by the amount of process time needed by the MPC performing the routing. In order to evaluate the computation overhead of the router PE, we choose the computation efficiency function, which is calculated from the developed models according to the following equation:

$$CE = \frac{\text{max. communication time} - \text{process time of router}}{\text{max. communication time}} \qquad (5)$$

The obtained expression for the computation efficiency, in percent, is:

$$C_{eff} = 100\,\frac{\dfrac{M}{L_s} + \dfrac{M}{I_{ccs}} - 6T_p - 3T_c - 3T_s}{3T_s + 3T_p + \dfrac{M}{I_{ccs}} + \dfrac{M}{L_s}} \qquad (6)$$
3.2. Approach 2

An identical analytical study was made for Approach 2, assuming that the control messages CMRTR and the data messages pass through the same PEs in the network. In Approach 2, the throughput rate is determined only by the stream communication times of the sending and routing PEs and is given in Kbyte/sec by:

$$C_s = \min\left\{\frac{M}{6T_s + 6T_p + 4T_c + M/L_s};\ \frac{M}{5T_s + 3T_c + 7T_p + M/L_s}\right\} \qquad (7)$$

The computation efficiency, calculated as for Approach 1, is given as a percentage by:

$$C_{eff} = 100\,\frac{\dfrac{M}{L_s} - \dfrac{M_h}{L_s} - 9T_s - 9T_p - 3T_c}{5T_s + 3T_c + 7T_p + \dfrac{M}{L_s}} \qquad (8)$$
3.3. Approach 3

The obtained throughput rate in Approach 3 is given in Kbyte/sec by:

$$C_s = \min\left\{\frac{M}{7T_s + 7T_p + 3T_c + M/L_s};\ \frac{M}{3T_s + 3T_p + T_c + M/L_s};\ \frac{M}{6T_s + 6T_p + 2T_c + M/L_s}\right\} \qquad (9)$$

The expression obtained for the computation efficiency as a percentage is:

$$C_{eff} = 100\,\frac{\dfrac{M}{L_s} + \dfrac{M}{I_{ccs}} - 6T_p - 3T_c - 3T_s}{3T_s + 3T_p + \dfrac{M}{I_{ccs}} + \dfrac{M}{L_s}} \qquad (10)$$
4. Analytical and experimental performances
The MPS is implemented and studied on a mesh topology network of 16 PEs [4]. Each PE is built of a transputer T800, 20 MHz, with 1 Mbyte of local memory (two wait states) and four communication links of 10 Mbit/sec. The MPS is written in the Inmos Parallel ANSI C language [7]. The three MPS approaches are implemented in each MPC so that the application designer can use any of the three approaches. The size of the MPC code is 18 Kbyte, and the MPC needs additional memory space according to the routing strategy and the MPS approach used, as detailed later (memory consumption). From the user's point of view the message size is unlimited: the MPS has a parameter for the maximum message size, and bigger messages are reduced to shorter messages by the MPS. The connection of the ATs to the MPS is declared in the configuration language of the Inmos Parallel C software package.
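Message splitting is only mentioned, not specified, in the paper; the following C sketch shows one straightforward way such fragmentation could look, with mps_send() standing in as a hypothetical per-chunk primitive.

#include <stddef.h>

/* Stub standing in for the (unspecified) per-chunk send primitive. */
static void mps_send(int dest_at, const char *chunk, size_t len)
{
    (void)dest_at; (void)chunk; (void)len;  /* a real MPS would issue an RTS here */
}

/* Split a user message that exceeds the configured maximum message size. */
void mps_send_large(int dest_at, const char *msg, size_t len, size_t max_size)
{
    size_t offset = 0;
    while (offset < len) {
        size_t chunk = len - offset;
        if (chunk > max_size)
            chunk = max_size;              /* reduce to the configured maximum */
        mps_send(dest_at, msg + offset, chunk);
        offset += chunk;
    }
}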
4.1. Analytical and experimental performance comparison

Experimental measurements were carried out to study the MPS and also to assess the accuracy of the assumptions involved in the modeling of the MPS approaches. The calculations for one intermediate PE (P = 1) are based on the following values of the model parameters for the transputer (T800) network:
- Ts = 1 µsec, from the Inmos transputer databook;
- Ls = 880 Kbyte/sec, measured on the transputer network;
- Iccs = 6140 Kbyte/sec, measured on the transputer network;
- Tc = 2 µsec, equal to 12 bytes (internal message size) divided by Iccs;
- Tp (Approach 1) = 100 µsec, average measured value;
- Tp (Approach 2) = 100 µsec, average measured value;
- Tp (Approach 3) = 200 µsec, average measured value.
The results show the high accuracy of the models presented, in particular for message sizes larger than 8 Kbyte (Figs. 6 and 7). The differences in the throughput for small message sizes are due to the assumptions made (less overlapping between the repetitive cycles). The differences in the computation efficiency (about 5%) can be explained by the fact that there is no full parallelism between the communication over the physical link and the CPU process, as assumed.
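The analytical curves of Figs. 6 and 7 for Approach 1 can be reproduced directly from Eqs. (3), (4) and (6) with the parameter values above; the short C program below is a sketch of such an evaluation, not part of the MPS itself. Units: times in seconds, sizes in Kbyte, speeds in Kbyte/sec.

#include <stdio.h>

static double min3(double a, double b, double c)
{
    double m = a < b ? a : b;
    return m < c ? m : c;
}

int main(void)
{
    const double Ts = 1e-6, Tp = 100e-6, Tc = 2e-6;  /* measured values     */
    const double Ls = 880.0, Iccs = 6140.0;          /* Kbyte/sec           */
    const int    P  = 1;                             /* intermediate PEs    */

    for (double M = 8.0; M <= 48.0; M += 8.0) {      /* message size, Kbyte */
        /* Eq. (3): stream latency per message */
        double Lt = 8*Ts + 9*Tp + 3*Tc + M/Ls + M/Iccs
                  + P * (4*Ts + 3*Tc + 7*Tp + M/Ls);

        /* Eq. (4): throughput limited by the slowest of the three modes */
        double Cs = min3(M / (5*Ts + 6*Tp + 3*Tc + M/Ls),
                         M / (4*Ts + 6*Tp + 3*Tc + M/Ls),
                         M / (3*Ts + 3*Tp + M/Iccs + M/Ls));

        /* Eq. (6): computation efficiency of the routing PE, in percent */
        double Ceff = 100.0 * (M/Ls + M/Iccs - 6*Tp - 3*Tc - 3*Ts)
                            / (3*Ts + 3*Tp + M/Iccs + M/Ls);

        printf("M=%4.0f Kbyte  Lt=%6.2f ms  Cs=%6.1f Kbyte/s  Ceff=%5.1f %%\n",
               M, Lt * 1e3, Cs, Ceff);
    }
    return 0;
}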
4.2. Experimental performances

More features are measured and discussed with regard to the three MPS approaches:
- deadlock prevention;
- congestion behavior;
- memory consumption;
- communication speed / throughput rate;
- computation overhead due to communication, and communication latency.
4.2.1. Deadlock prevention

Deadlock prevention avoids deadlocks by constraining how requests can be made. This is the main role of the MPS Approaches 2 and 3. Approaches 2 and 3 implement synchronized broadcasting. The synchronized communication between ATs gives the application designer a tool to design the application and prevent deadlocks [2,4,8,14,15].
Fig. 6. Throughput - analytical and experimental results (Approaches 1, 2 and 3; calculated vs. measured curves as a function of the message size, 0-50 Kbyte).
Fig. 7. Computation efficiency - analytical and experimental (Approaches 1, 2 and 3, P = 1; calculated vs. measured curves as a function of the message size, 0-50 Kbyte).
Katseff said about the 'Incomplete Hypercube': "It is easy to see that some buffering is necessary to prevent deadlocks. It is sufficient that each link is able to independently buffer a single message in each direction to show that deadlock does not occur" [10]. The MPS is implemented on a mesh topology, for which the same remark can be made regarding the synchronized MPS Approaches 2 and 3, which use buffers in each direction of the links. Each MPC has a deadlock-free routing table which serves the three MPS approaches. To study the MPS, a routing table protocol based on a deterministic technique of deadlock-free routing is placed in every PE of the network. The fixed routing table in each PE stores the intermediate destination for each final destination; therefore a priori knowledge of the exact route to be followed is needed. This technique is suitable for regular topologies such as the mesh and the hypercube, for which different routing tables providing deadlock-free behavior exist [4,10].

4.2.2. Congestion behavior

In the MPS Approach 1, messages are sent without verifying whether the receiving AT is ready to receive. With improper usage of this approach, when the receiving AT is incapable of receiving all the messages sent to it, a flood of messages will cause the MPS to be blocked. Approaches 2 and 3 perform synchronized communication between the different ATs. In these two approaches, messages are never sent by an AT unless the receiving AT is ready to receive the message. Hence, for congested communication, the flooding of messages in the MPS is prevented. Moreover, in cases of congested communication, the synchronization mechanism is not influenced, due to the PO tasks of the MPC, which give high priority to the broadcast of control messages between the different PEs.
4.2.3. Memory consumption

The memory consumption of the MPS consists of a fixed memory space for code and workspace, and a variable memory space for the data buffers allocated by the different MPS approaches. Each MPC uses 18 Kbyte of code, and an additional 10 Kbyte of workspace is needed by each MPC for variables and for storage of the routing tables. A total of 28 Kbyte of fixed memory space is used by each MPC. The maximum amount of memory space needed for the data buffers depends on the approach used. The MPS Approaches 1 and 2 use two buffers for each of the physical links of the PE. The buffers are allocated and reallocated dynamically from the heap of the MPC when messages are passing through the PE. The memory consumption of Approach 3 depends on how many virtual channels are passing through a PE. To summarize, when M is the message size in Kbyte, L is the number of physical links of the PE (L = 4 for the transputer) and V is the number of virtual channels passing through the PE, the memory consumption in each MPC is (2ML + 28) Kbyte for Approaches 1 and 2 and (3MV + 28) Kbyte for Approach 3.
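These expressions can be restated as a small helper; the function below is merely a worked form of (2ML + 28) and (3MV + 28), not code from the MPS. For example, M = 8 Kbyte and L = 4 links give 92 Kbyte per MPC for Approaches 1 and 2.

/* Memory consumption of one MPC in Kbyte, per the expressions above. */
double mpc_memory_kbyte(int approach, double M, int L, int V)
{
    const double fixed = 28.0;           /* 18 Kbyte code + 10 Kbyte workspace */
    if (approach == 1 || approach == 2)
        return 2.0 * M * L + fixed;      /* two buffers per physical link      */
    return 3.0 * M * V + fixed;          /* three buffers per virtual channel  */
}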
4.2.4. Throughput

The communication speed of the MPS is influenced by the following major factors:
- communication speed of the physical links;
- communication speed of the internal PE channels;
- computation overhead;
- message size.
Fig. 8. Experimental throughput of the MPS approaches (Approaches 1, 2 and 3; throughput as a function of the message size, 0-50 Kbyte, for P = 0 to 5 intermediate PEs).
The communication speed (throughput) was studied for each approach, for different message sizes and for different numbers P of intermediate PEs between the sending and the receiving ATs (Fig. 8). For the three approaches, the communication speed is lower for small message sizes. This result points to a computation overhead proportional to the number of messages passing through the PE. It derives from the architecture of the PE used (the transputer T800), where the computation overhead caused by the communication itself is trivial and consists of a fixed overhead to construct a message and set up the appropriate communication mechanism, plus the DMA transfer [1].

In Approach 1 (Fig. 8) it is possible to notice that the communication speed becomes relatively constant for messages bigger than 10 Kbyte. When the number P of intermediate PEs grows, the communication speed decreases. The results point to a slight additive computation overhead of the intermediate PEs for large messages.

In Approach 2 (Fig. 8) the communication speed measured for stream communication decreases slightly for each additional intermediate PE. The communication speed becomes relatively constant for messages bigger than 20 Kbyte. This result is explained by the fact that, although the synchronization control message has to pass through the intermediate PEs from the receiving to the sending application task before a message is sent, there is a good exploitation of the pipeline, and the additive computation overhead, which is proportional to the number P of intermediate PEs, is low and relatively lower for large messages. In the worst case of burst communication (all steps of Fig. 3 in series), measurements show that the communication speed decreases dramatically for each additional PE, and therefore Approach 2 is not useful in this case.

In Approach 3 (Fig. 8), the number P of intermediate PEs has no influence on the communication speed. This result points to an efficient exploitation of the pipeline in the communication (see Fig. 4) and to the fact that the synchronization control message (A) flow and the data message (M) flow are performed in parallel in opposite directions, as illustrated in step (h) of the example in Fig. 4. The communication speed is not affected by the type of communication, stream or burst, once the channel is open and there is a full pipeline with parallel flows.
4.2.5. Computation overhead due to communication and communication latency - computation efficiency

The computation overhead due to communication, which causes the communication latency, is a feature which indicates the amount of time spent by the PE to handle the communication when messages are passing through the PE. The latency is a parameter which is difficult to measure reliably on the system.
5. Conclusions
To achieve high performance in job execution, the proposed MPS facilitates efficient routing and broadcasting in a parallel multiprocessor network. The MPS, written in Parallel ANSI C, is studied on a mesh topology network of 16 T800 transputers with 1 Mbyte of memory each. Each of the MPS approaches enables an efficient implementation for different types of application tasks.

As parallel implementation of complex applications is becoming popular, the need for a high performance interprocessor communication system becomes imminent, especially in loosely coupled distributed-memory multiprocessor networks. An important factor in the efficiency of these networks is the effectiveness of the message-passing system which manages the data exchanges among the processors of the network. The modeling and performance evaluation of a new Message-Passing System (MPS) for distributed multiprocessor networks
without shared memory, where the processors or Processing Elements (PEs) are connected to each other by point-to-point communication links, has been presented. The performance of a message-passing system is determined by the amount of time spent in the process, the communication and the synchronization of the MPS tasks. For maximum performance, the MPS manages the communication and the synchronization between the different tasks of an application by means of three approaches. One is an asynchronous send/receive approach which efficiently handles server-like tasks, the second is a synchronous send/receive approach which efficiently handles the streaming communication mode, and the third is a virtual channel approach which minimizes the overhead of the synchronization mechanism, efficiently handling the burst mode of heavy communication between tasks. The developed models of the MPS approaches enable the determination of analytical expressions for different performance measures, and a comparison between analytical and experimental performances reveals that the models predict the MPS performance with high accuracy. The MPS analytical and experimental performances for each approach are studied and presented in terms of communication latency, throughput, computation efficiency and memory consumption. The proposed MPS was developed as part of a larger project, the implementation of a Multiple Target Tracking system application on a parallel transputer network; some of the modeling results were used to predict a few performance figures of that implementation.
Acknowledgement

We are grateful to Prof. R.D. Hersch from EPFL-LSP, Lausanne, for his helpful comments and for lending us their transputer network, which enabled this research work to be carried out.
References

[1] M. Annaratone, C. Pommerell and R. Rühl, Interprocessor communication speed and performance in distributed-memory parallel processors, 1989 ACM (1989) 315-323.
[2] J.K. Annot, A deadlock free and starvation free network of packet switching communication processors, Parallel Comput. 9 (2) (1989) 147-162.
[3] H. Azaria and Y. Elovici, Multiple interfaces message passing system for transputer network, Microprocessing and Microprogramming 34 (1-5) (Feb. 1992) 237-242.
[4] H.G. Badr and S. Podar, An optimal shortest-path routing policy for network computers with regular mesh-connected topologies, IEEE Trans. Comput. (10) (Oct. 1989) 1362-1371.
[5] A. Basu et al., A model for performance prediction of message passing multiprocessors achieving concurrency by domain decomposition, in: Proc. Internat. Conf. on Vector and Parallel Processing, Zurich (Sep. 1990) (Springer, Berlin, 1990) 75-85.
[6] Y. Elovici and H. Azaria, Message passing system for transputer network MTT implementation, in: T.S. Durrani et al., eds., Applications of Transputers 3 (IOS Press, Amsterdam, 1991) 650-655.
[7] M. Homewood et al., The IMS T800 transputer, IEEE Micro 7 (5) (Oct. 1987) 10-26.
[8] S.S. Isloor and T.A. Marsland, The deadlock problem: An overview, IEEE Comput. 13 (9) (Sep. 1980) 58-78.
[9] C.R. Jesshope, P.R. Miller and J.T. Yantchev, High performance communications in processor networks, in: Proc. 16th Annual Internat. Symp. on Computer Architecture, Jerusalem (May 28-June 1, 1989) 150-157.
[10] H.P. Katseff, Incomplete hypercubes, IEEE Trans. Comput. 37 (5) (May 1988) 604-607.
[11] C.-T. King, W.-H. Chan and L.M. Ni, Pipelined data-parallel algorithms: Part I - Concept and modeling, IEEE Trans. Parallel Distributed Syst. 1 (4) (Oct. 1990) 470-485.
[12] D.A. Reed and R.M. Fujimoto, Multiprocessor Networks: Message-Based Parallel Processing (MIT Press, Cambridge, MA, 1987).
[13] K.G. Shin, HARTS: A distributed real-time architecture, IEEE Comput. 24 (5) (May 1991) 25-35.
[14] S. Srinivas, A. Basu, G.K. Gopinath and A. Paulraj, Shared memory vs. message passing in parallel computers: Emerging perspectives, in: Proc. Indo-US Workshop on Spectral Analysis in One and Two Dimensions, New Delhi (Nov. 1989).
[15] M.-Y. Wu and D.D. Gajski, Hypertool: A programming aid for message passing systems, IEEE Trans. Parallel Distributed Syst. 1 (3) (July 1990) 330-343.
[16] X. Zhang, System effects of interprocessor communication latency in multicomputers, IEEE Micro 11 (2) (Apr. 1991).