Distributed shared-memory implementation for multitransputer systems

P Tsanakas, G Papakonstantinou and G Efthivoulidis

National Technical University of Athens, Department of Electrical and Computer Engineering, Zographou Campus, GR-15773 Zographou, Greece
A new method for the implementation of distributed shared memory on multitransputer systems is presented. The method is based on an extension of the common parallel programming language occam2, allowing the use of constructs such as virtual channels, global semaphores, and shared variables (with strong coherence). The allocation of the shared variables is done at compile-time on any of the available transputer nodes. Semaphores allow explicit process synchronization, while virtual channels facilitate the programmer's task by providing an abstract view of process communication, regardless of the particular network topology.

parallel programming, distributed shared-memory systems, transputers, multitransputer architectures, parallel process synchronization
There is a general consensus that real scalability in computer systems can only be achieved by using distributed memory systems [1]. This is because of the bottleneck that arises in shared-memory systems when the number of processing nodes is increased beyond a certain point. The programming paradigm of shared-memory systems, however, is attractive, as it facilitates the programming task because of the inherent uniform address space of those systems. Distributed memory systems, on the other hand, require more programming effort, because of the individual address spaces of each node and the need for explicit programming of each data exchange (message passing). Distributed shared memory (DSM) is a new approach, which tries to combine the advantages of distributed memory architectures (scalability) with the advantages of the shared-memory paradigm (easy and secure programming). DSM is considered a practical way to make a distributed system look to the application software as if a shared global memory existed. Portability is another advantage offered by DSM systems, as opposed to distributed memory systems that are programmed with reference to their particular network topology: any change of the topology normally requires certain source program modifications, which can be avoided in a DSM-based system. Moreover, the source code of parallel applications written for DSM is
normally shorter and easier to understand than the equivalent programs written for ordinary message-passing systems.

The communication overhead incurred by the DSM mechanism is the main drawback of these systems. If the proper DSM algorithm is carefully selected for each kind of application, however, this overhead may be kept low. Moreover, cases have been reported in which DSM systems outperform equivalent implementations based on message-passing mechanisms [2]. The DSM abstraction is normally achieved by a software layer in the form of an operating system kernel, run-time library routines, or source program transformations. The last approach is chosen in the present work; it is the simplest to implement and can be easily adapted to virtually any distributed memory system topology, without any significant overhead.

The original idea of DSM came from Li [3,4], who presented a system in which each block of shared memory has a server node (owner). The owner performs read/write requests and maintains a list of servers (nodes) that have a read-only copy of the block. This system permits run-time changes in block ownership when a write fault occurs on a node that has a read-only copy of the block. If write faults are rare, the communication overhead is not excessive and the DSM system performance is satisfactory.

Other DSM systems allow full replication of the shared blocks, therefore implementing the multiple-readers/multiple-writers protocol [5]. The data-consistency problem is tackled by controlling (synchronizing) the data access sequences: read requests are performed locally, but all write requests need to be broadcast (in sequence) over the network.

Another interesting category of DSM systems is based on the migration principle [1]. In this case a single copy of the shared block is kept, which always migrates to the node where it is requested. This method avoids the need for broadcasting after a write operation, but the frequent movements of the shared block may cause considerable overhead. This overhead is kept low only when the accesses to a shared block are localized; if there is no locality of reference, the DSM system will force the shared blocks to thrash among the nodes of the system. The migration-based DSM systems resemble virtual-memory systems and are mostly implemented using the existing virtual-memory support of the host operating system.
Some systems, such as Munin [6,7] and Galaxy [8], provide the means for the user to select the particular DSM protocol to be employed during program execution. This approach aims to minimize the communication overhead by examining the behaviour of each parallel application and matching it to the appropriate DSM protocol. The EEC-sponsored EDS project incorporates a significant effort towards the implementation of efficient DSM mechanisms on top of large loosely coupled architectures [9]. Other groups are focusing their effort on the development of special architectural support to facilitate the implementation of low-overhead DSM systems [10,11]. Research work has also been reported on the implementation of DSM mechanisms on heterogeneous distributed systems [2]. The basic concepts remain the same, but greater technical difficulties arise from non-uniform data representation formats and different compilers.

The method chosen in the authors' system, OPP (Occam PreProcessor), is based on the concept of source program transformations. The source language that is used is a variant of occam2 [12] (called occam++), and the target architecture may be any multitransputer network [13]. The authors' approach defines occam++, an extension of occam2 that allows the use of virtual channels, global semaphores, and shared variables. OPP transforms the occam++ programs into equivalent occam2 programs, which can be compiled and executed on a transputer network. The transformation introduces some extra code and extra processes to support the additional features of occam++. The overhead (extra code and communication/execution time) can be kept low, provided that the programmer assigns the shared variables and semaphores to the appropriate nodes. Test runs may be used for the fast and easy estimation of these assignments. Other DSM systems may cause heavier message traffic, especially when the shared data references are not localized. Furthermore, the OPP system ensures a high degree of portability with regard to the operating system and the network topology, while its simplicity allows easy and fast debugging.

In the following, an outline of the OPP approach is first given and then an illustrative example is provided, explaining how program transformations are performed. Then the essential implementation details are described, showing that the system can maintain strong data coherency and that the support of virtual channels and semaphores is secure and efficient.
SYSTEM OVERVIEW

The OPP translator consists of two parts (OPP1 and OPP2), as shown in Figure 1. OPP2 accepts occam++ programs, which allow the use of virtual channels, shared memory, and semaphores, and produces the equivalent occam+ programs, which support only virtual channels. OPP1 transforms the occam+ programs into ordinary occam2 programs [12], which can be compiled and executed on any multitransputer system.
Figure 1. The OPP translator. (occam++, with virtual channels, shared memory, and semaphores, is translated by OPP2 into occam+, which supports only virtual channels; OPP1 then translates occam+ into occam2.)
The virtual channels of occam+ are the communication means between processes that execute on separate transputers; they are implemented using an appropriate set of physical inter-transputer links. In an occam+ program, any number of virtual channels can be established between any pair of transputers, regardless of the particular network topology. The messages are properly multiplexed and demultiplexed, and they are transmitted using message switching, so that they can be transferred to any transputer without excessive communication delay. An occam+ program has the following format:

  protocol definitions
  procedure and function definitions
  virtual channel definitions
  PLACED PAR
    PROCESSOR 0 type links
      Host Transputer process
    PROCESSOR 1 type links
      process
where type is the processor type (e.g., T8) and links are the IDs of the interconnected transputers. The syntax of the occam+ language is described in the Appendix, in a modified Backus-Naur Form.

The (virtual) shared memory of an occam++ program consists of variables that are accessible (shared) by all of the processes of the system. For each such variable, a special process is executed on a transputer to service the read/write requests (messages) that arrive from processes running on other transputers when they want to access that shared variable. The allocation of shared variables to the transputer nodes is performed a priori, using special program annotations.

The semaphore implementation in occam++ is supported by a special type called SEMA. Semaphores are general (counting) semaphores with initial value 0. The basic operations P and V are represented by the processes:

  semaphore ! wait
and:

  semaphore ! signal

respectively. The assignment of an initial value to a semaphore can be performed by the process:

  SEQ i = 1 FOR initial.value
    semaphore ! signal

For each semaphore, a special process runs on a transputer. The allocation of semaphores over the network nodes is performed a priori, as with shared variables. The semaphore process accepts the wait and signal messages from the various network processes and modifies the semaphore value accordingly.

An occam++ program has the following format:

  protocol definitions
  procedure and function definitions
  shared variable definitions
  virtual channel definitions
  PLACED PAR
    PROCESSOR 0 type links
      PLACE var.1 :
      PLACE sema.1 :
      Host Transputer process
    PROCESSOR 1 type links
      PLACE var.2 :
      process
It is clear that shared variable var.1 and semaphore sema.1 are allocated to processor 0, while shared variable var.2 is allocated to processor 1. The syntax of occam++ is described in the Appendix.
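As an illustration of how these constructs fit together, the following is a minimal occam++ sketch (not taken from the paper; the names data and mutex, the processor types, and the two-node topology are assumed for illustration). A shared array and a semaphore are placed on processor 0; the semaphore is given an initial value of 1 by a single signal and is then used for mutual exclusion around updates of the shared element, which processor 0 renames as a scalar via abbreviation:

  [1]INT data :                     -- one shared memory unit, declared as an array
  SEMA mutex :                      -- counting semaphore, initial value 0
  PLACED PAR
    PROCESSOR 0 T8 1,-1,-1,-1
      PLACE data :                  -- the shared array is served by a process on this node
      PLACE mutex :                 -- so is the semaphore
      count IS data[0] :            -- rename an array element as a scalar (abbreviation)
      SEQ
        mutex ! signal              -- give the semaphore an initial value of 1
        WHILE TRUE
          SEQ
            mutex ! wait            -- P: enter the critical section
            count := count + 1      -- update the shared variable
            mutex ! signal          -- V: leave the critical section
    PROCESSOR 1 T8 0,-1,-1,-1
      WHILE TRUE
        SEQ
          mutex ! wait
          data[0] := data[0] - 1
          mutex ! signal

Because mutex starts at 0, the first wait issued by processor 1 blocks until processor 0 has issued the initializing signal.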
ILLUSTRATIVE EXAMPLE

A simple occam++ program is shown in Figure 2. The shared variables are of []INT type, with initial value 0, as is the case with semaphores. Note that all the shared variables must be declared as arrays. This option has been selected because it both simplifies the implementation of the preprocessor and provides more flexibility to users, by allowing them to use shared array variables. Therefore, many shared variables can be declared as a single memory unit (in the form of an array). The programmer can, thereafter, rename the array elements as scalar variables (via abbreviation). This solution also helps to reduce the incurred overhead, as only a single extra process is created for each memory unit (array). Semaphores, however, have not been implemented as arrays, because of their usually limited number.

The processes of this example use the shared variable a and are synchronized by the semaphores sema1 and sema2. Transputer 0 (the host transputer) is of type T4 and is connected with transputer 1 via link 2, while the three other links remain unused. Transputer 1 is of type T8 and is connected with transputer 0 via link 0. The shared variable a, together with the two semaphores, is allocated to transputer 0.
  #USE userio
  [1]INT a :
  SEMA sema1, sema2 :
  PLACED PAR
    PROCESSOR 0 T4 -1,-1,1,-1
      PLACE a :
      PLACE sema1, sema2 :
      SEQ
        sema1 ! signal
        WHILE TRUE
          SEQ
            sema2 ! wait
            a[0] := a[0] + 2
            write.int (screen, a[0], 4)
            sema1 ! signal
    PROCESSOR 1 T8 0,-1,-1,-1
      WHILE TRUE
        SEQ
          sema1 ! wait
          a[0] := a[0] - 1
          sema2 ! signal

Figure 2. occam++ program with two simple processes
The process running on the host transputer repetitively increases the value of the shared variable a[0] by 2 and prints its value. The other process is executed on transputer 1 and repetitively decrements the value of a[0] by 1. The two semaphores ensure that these processes execute synchronously (step by step), starting with the process of transputer 1. Therefore, the whole occam++ program produces the printout: 1 2 3 4 ...

The occam+ program produced by the OPP2 translator (given the occam++ program of Figure 2) is shown in Figure 3. The protocols SHARED and SEMA, and the procedures shared and semaphore, are automatically incorporated by OPP. The two following occam++ declarations:

  [1]INT a :
  PLACE a :

are translated into:
• the declaration of the virtual channels a.req.r and a.ack.r
• the declaration of the (local) channels a.req.l and a.ack.l
• the process shared, having as parameters the above channels
(Note that .r stands for 'remote' and .l for 'local'.)

The value of the shared variable a[0] is read by using the instructions:

  a.req ! read; 0
  a.ack ? CASE
    ack; a[0]
The write operation on this variable is performed by the instruction:

  a.req ! write; 0; a[0]
In the above instructions, 0 is the index of a[0] in the array a, while a is a local copy of the shared variable. The local copy is declared at the beginning of the process that uses the shared variable.

The following occam++ declarations:

  SEMA sema1 :
  PLACE sema1 :

are translated into:
  #USE userio
  VAL page.size IS 100 :
  PROTOCOL SHARED
    CASE
      read; INT
      write; INT; INT
      ack; INT
  PROTOCOL SEMA
    CASE
      signal
  PROC shared ([]CHAN OF SHARED req.l, ack.l, req.r, ack.r)
  PROC semaphore ([]CHAN OF SEMA wait.l, signal.l, wait.r, signal.r)
  [1]CHAN OF SHARED a.req.r, a.ack.r :
  [1]CHAN OF SEMA sema1.wait.r, sema1.signal.r :
  [1]CHAN OF SEMA sema2.wait.r, sema2.signal.r :
  PLACED PAR
    PROCESSOR 0 T4 -1,-1,1,-1
      [1]CHAN OF SHARED a.req.l, a.ack.l :
      [1]CHAN OF SEMA sema1.wait.l, sema1.signal.l :
      [1]CHAN OF SEMA sema2.wait.l, sema2.signal.l :
      PAR
        shared (a.req.l, a.ack.l, a.req.r, a.ack.r)
        semaphore (sema1.wait.l, sema1.signal.l, sema1.wait.r, sema1.signal.r)
        semaphore (sema2.wait.l, sema2.signal.l, sema2.wait.r, sema2.signal.r)
        [1]INT a :
        CHAN OF SHARED a.req IS a.req.l[0] :
        CHAN OF SHARED a.ack IS a.ack.l[0] :
        CHAN OF SEMA sema1.signal IS sema1.signal.l[0] :
        CHAN OF SEMA sema2.wait IS sema2.wait.l[0] :
        SEQ
          sema1.signal ! signal
          WHILE TRUE
            SEQ
              sema2.wait ! signal
              SEQ
                a.req ! read; 0
                a.ack ? CASE
                  ack; a[0]
                a[0] := a[0] + 2
                a.req ! write; 0; a[0]
              SEQ
                a.req ! read; 0
                a.ack ? CASE
                  ack; a[0]
                write.int (screen, a[0], 4)
              sema1.signal ! signal
    PROCESSOR 1 T8 0,-1,-1,-1
      [1]INT a :
      CHAN OF SHARED a.req IS a.req.r[0] :
      CHAN OF SHARED a.ack IS a.ack.r[0] :
      CHAN OF SEMA sema1.wait IS sema1.wait.r[0] :
      CHAN OF SEMA sema2.signal IS sema2.signal.r[0] :
      WHILE TRUE
        SEQ
          sema1.wait ! signal
          SEQ
            a.req ! read; 0
            a.ack ? CASE
              ack; a[0]
            a[0] := a[0] - 1
            a.req ! write; 0; a[0]
          sema2.signal ! signal
Figure 3. Equivalent occam+ program for the occam++ program of Figure 2

• the declaration of the virtual channels sema1.wait.r and sema1.signal.r
• the declaration of the (local) channels sema1.wait.l and sema1.signal.l
• the process semaphore, having as parameters the above channels

The semaphore operations P and V for sema1 are translated into:

  sema1.wait ! signal

and:

  sema1.signal ! signal

respectively. The same happens with semaphore sema2.
The general structure of the final program produced by OPP1 is shown in Figure 4. The library header contains, in addition to the occam+ protocol definitions, the definition of the protocol MESSAGE. This protocol incorporates the tags channel, back, and sync (explained in the next section) and one or more tags for each virtual channel protocol. In the library proc, the occam+ procedures are defined; the parameters that are virtual channels have been identified, their protocol has been changed to MESSAGE, and their read/write instructions have been modified accordingly. In the library node, the procedure node (described in the next section) is defined. Additionally, an EXE fold has been created for the process of the host transputer, as well as an SC fold for transputer 1 and a special fold for the configuration. All the virtual channels are renamed (abbreviated) into node.in and node.out, which are vectors of local channels connecting the local processes with node. The vectors path, fcid, and bcid are required by the node process. The read/write instructions of the virtual channels have been transformed so that they use the MESSAGE protocol, and each write operation to a virtual channel is followed by the output of a sync signal. Finally, the configuration section is derived from the network topology, as it is defined in the occam+ program.
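For example, the occam+ output on the virtual channel a.req in Figure 3 appears in Figure 4 as an output under the MESSAGE protocol, immediately followed by a sync that is accepted only after the target process has taken the message:

  a.req ! write; 0; a[0]            -- occam+ (SHARED protocol on a virtual channel)

becomes:

  SEQ
    a.req ! SHARED.write; 0; a[0]   -- occam2 (MESSAGE protocol)
    a.req ! sync                    -- completes only after delivery is acknowledged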
IMPLEMENTATION ISSUES
OPP1
The OPP1 translator puts on each transputer the process node, which consists of a set of concurrent processes, as depicted in Figure 5.
  {{{ LIB header
  ...  abbreviations and protocols of the occam+ program
  PROTOCOL MESSAGE
    CASE
      channel; INT
      back; INT
      sync
      SHARED.read; INT
      SHARED.write; INT; INT
      SHARED.ack; INT
      SEMA.signal
  }}}
  {{{ LIB proc  -- output to a virtual (MESSAGE) channel is followed by a sync output
  ...  SC PROC shared
  ...  SC PROC semaphore
  }}}
  {{{ LIB node
  ...  SC PROC node ([]CHAN OF MESSAGE in, out,
                     CHAN OF MESSAGE link0in, link1in, link2in, link3in,
                     CHAN OF MESSAGE link0out, link1out, link2out, link3out,
                     VAL []INT path, fcid, bcid)
  }}}
  {{{ EXE Host
  ...  use userio, header, proc, node
  CHAN OF ANY link0in, link1in, link2in, link3in :
  CHAN OF ANY link0out, link1out, link2out, link3out :
  ...  PLACE link2in, link2out
  ...  declarations for node.in, node.out
  PAR
    ...  abbreviations (virtual channels are mapped to node.in, node.out)
    ...  process running at processor 0 (as in occam+)
    ...  abbreviations for path, fcid, bcid
    node (node.in, node.out, link0in, link1in, link2in, link3in,
          link0out, link1out, link2out, link3out, path, fcid, bcid)
  }}}
  {{{ PROGRAM Network
  {{{ SC application.1
  ...  use userio, header, proc, node
  PROC application.1 (CHAN OF ANY link0in, link1in, link2in, link3in,
                      CHAN OF ANY link0out, link1out, link2out, link3out)
    ...  declarations for node.in, node.out
    PAR
      ...  abbreviations (virtual channels are mapped to node.in, node.out)
      ...  local copy of shared memory (as in occam+)
      ...  abbreviations for virtual channels (as in occam+)
      WHILE TRUE
        SEQ
          SEQ
            sema1.wait ! SEMA.signal
            sema1.wait ! sync
          SEQ
            SEQ
              a.req ! SHARED.read; 0
              a.req ! sync
            a.ack ? CASE
              SHARED.ack; a[0]
            a[0] := a[0] - 1
            SEQ
              a.req ! SHARED.write; 0; a[0]
              a.req ! sync
          SEQ
            sema2.signal ! SEMA.signal
            sema2.signal ! sync
      ...  abbreviations for path, fcid, bcid
      node (node.in, node.out, link0in, link1in, link2in, link3in,
            link0out, link1out, link2out, link3out, path, fcid, bcid)
  }}}
  {{{ configuration
  CHAN OF ANY from0to1, from1to0 :
  PLACED PAR
    PROCESSOR 1 T8
      ...  PLACE from0to1, from1to0
      [4]CHAN OF ANY dummy.in, dummy.out :
      application.1 (from0to1, dummy.in[1], dummy.in[2], dummy.in[3],
                     from1to0, dummy.out[1], dummy.out[2], dummy.out[3])
  }}}
  }}}

Figure 4. Equivalent occam2 program for the occam+ program of Figure 3
Each message of a virtual channel is transmitted as follows. The sender process sends the message to a channel of the vector Local.Input and then tries to send a synchronization signal to the same channel. This signal will be accepted only after the final receiver (target process) takes the message. In this way, virtual channels are equivalent to local ones: a process cannot proceed after sending a message until the receiver acquires it.

Each message from a sender process passes through the Input.Buffers, the Multiplexer, and the Crossbar.Input.Buffers modules, where the required headers are added to the message. The next step of scheduling is performed by the Crossbar.Switch and the Crossbar.Output.Buffers, from where the message is directed to the Output.Links. After passing the Crossbar.Input.Buffers, the Crossbar.Switch, and the Crossbar.Output.Buffers of some other (intermediate) transputers, the message reaches the Demultiplexer of the final (target) transputer. Then the message is transferred to the receiver process through the Output.Buffers. At that time, the Output.Buffers process also sends an acknowledge signal to the channel back.out. This signal follows the inverse path and reaches the Crossbar.Output.Buffers of the sender transputer. There the signal is transferred through the channel back.in to the Input.Buffers process, which then reads the synchronization signal from the sender process. Therefore, the initial sender process can resume its execution, as it is ensured that the message has reached its destination.

The following gives a detailed description of the particular modules of node, as well as of the structure of the messages that circulate within the network. The Input.Buffers process accepts the message from the sender process and blocks the corresponding input buffer channel (with the synchronization signal) until a proper signal arrives on the channel back.in. The message coming from the sender process is transferred to the Multiplexer after a special header is added (containing the sender channel ID, i.e., the channel index in the Local.Input vector). The Multiplexer forwards its input messages to an input channel of the Crossbar.Input.Buffers. The Crossbar.Input.Buffers process changes the header of each message coming from the Multiplexer output and inserts the fcid (forward channel ID) that corresponds to the sender. An fcid consists of the target transputer ID (tid) and the receiver channel ID (cid), i.e., the channel index in the Local.Output vector.

The Input.Links and Output.Links directly correspond to the physical links of each transputer. The incoming messages (from Input.Links) are not modified, as they already carry the proper headers. The back.out channel is used for the receipt of the acknowledge signal with the cid of a receiver process that has taken a message from its input. This signal is transformed (within the Crossbar.Input.Buffers) into a message whose header contains all the information necessary for the signal to reach the Input.Buffers of the initial sender transputer, and whose body is an acknowledge signal containing the cid of the sender process. The Crossbar.Input.Buffers uses the vectors fcid and bcid to modify the headers.
Figure 5. Process node, running on each transputer. (Forward path: Local.Input → Input.Buffers → Multiplexer → Crossbar.Input.Buffers → Crossbar.Switch → Crossbar.Output.Buffers → Output.Links; incoming messages arrive on the Input.Links and reach the Demultiplexer and Output.Buffers, which feed Local.Output; the channels back.in and back.out carry the acknowledge signals.)
The Crossbar.Switch transfers a message (without modification) from an input to an output channel, based on the message header and according to the resident vector path, which gives the output channel for each destination index (tid). This array is automatically constructed by the preprocessor and corresponds to the respective minimal paths of the network.

The Crossbar.Output.Buffers process forwards a message without modification to one of its output links if the message is destined for some other transputer. Any message whose destination is a local process is transferred either to the channel back.in (if it is an acknowledge signal) or to the Demultiplexer input (if it is a regular message). The tid part of the header is removed, as it is no longer necessary.

The Demultiplexer process sends a message, without any header, to the appropriate output channel, while the Output.Buffers process forwards this message to the final (target) process. For each message received by a target process, an acknowledge signal is produced (with the ID of the receiver) and sent through the back.out channel. This signal will eventually be received by the Crossbar.Output.Buffers process of the sender transputer and sent to the back.in channel.
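As an indication of how this routing step can be expressed, the following is a rough occam2 sketch of a Crossbar.Switch loop. It is not the OPP code: it assumes, for illustration only, that the routing header travels as a channel; tid variant of the MESSAGE protocol and that a folded helper copies the remaining parts of the message.

  PROC crossbar.switch ([]CHAN OF MESSAGE in, out, VAL []INT path)
    WHILE TRUE
      ALT i = 0 FOR SIZE in
        INT tid :
        in[i] ? CASE
          channel; tid                        -- header: destination transputer ID
            SEQ
              out[path[tid]] ! channel; tid   -- forward the header unchanged
              ...  copy the rest of the message to out[path[tid]]
  :

Here path plays exactly the role described above: path[tid] names the output channel on the minimal route towards transputer tid.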
OPP2

For each shared variable, OPP2 introduces a process shared, while for each semaphore it introduces a process semaphore. The code for these processes is shown in Figure 6. The process shared accepts the read and write requests and services them by sending a reply or by changing an element of the shared variable. In each semaphore process there is a counter (count) that contains the current value of the semaphore, initialized to 0. Each signal from channel signal.l or signal.r increments the semaphore value. Channels wait.l and wait.r are blocked while the value of the semaphore (count) is 0; each signal from these channels decrements the semaphore value.

The OPP2 preprocessor recognises the shared variables and the semaphores by keeping their identifiers in a symbol table when they are declared and by analysing (syntactically) the occam++ program, according to the grammar given in the Appendix.
  PROC shared ([]CHAN OF SHARED req.l, ack.l, req.r, ack.r)
    [page.size]INT page :
    SEQ
      SEQ i = 0 FOR 100
        page[i] := 0
      INT x :
      WHILE TRUE
        ALT
          ALT i = 0 FOR SIZE req.l
            req.l[i] ? CASE
              read; x
                ack.l[i] ! ack; page[x]
              write; x; page[x]
                SKIP
          ...  the same with req.r and ack.r
  :

  PROC semaphore ([]CHAN OF SEMA wait.l, signal.l, wait.r, signal.r)
    INT count :
    SEQ
      count := 0
      WHILE TRUE
        ALT
          ALT i = 0 FOR SIZE wait.l
            (count > 0) & wait.l[i] ? CASE
              signal
                count := count - 1
          ...  the same with wait.r
          ALT i = 0 FOR SIZE signal.l
            signal.l[i] ? CASE
              signal
                count := count + 1
          ...  the same with signal.r
  :

Figure 6. Structure of the shared and semaphore procedures
Both preprocessors (OPP1 and OPP2) are LALR parsers (similar to the parsers produced by YACC), written in the occam2 language. Their structure is shown in Figure 7. The lexical analyser transforms the source program into tokens, while the syntax analyser uses the parsing tables (for the syntactic analysis) and the symbol table (for the semantic analysis). The parser stack contains the current configuration of the parser.
SYSTEM PERFORMANCE AND APPLICATIONS

It has already been shown that OPP1 creates some necessary processes. These extra processes do not necessarily constitute a system overhead.
Figure 7. General structure of the OPP1 and OPP2 preprocessors. (The input program passes through the lexical analyser and the syntax analyser to produce the output program; the syntax analyser uses the parser stack, the parsing tables, and the symbol table.)
This is the case in many large-scale applications, where excessive virtual links exist among the application processes and message exchanges occur randomly. In such applications special routing processes are always needed, even if OPP1 is not used. If the structure of the application happens to match the structure of the physical network, however, direct message passing among the application processes may be achieved without creating any routing processes; in this case the routing processes created by OPP1 may be considered as overhead.

OPP has been realised in two layers for simplicity of implementation and to provide for future porting of OPP onto more advanced systems that support virtual channels in hardware (e.g., the T9000), thus eliminating OPP1. This layered implementation, however, incurs some overhead in the case of reading a shared variable, which requires four messages although it could be done using only two. The write operation, on the other hand, is not associated with any overhead, because two messages are transferred, the second of which is required to guarantee strong coherence [1].

The performance of the programs produced by OPP can be greatly improved by proper allocation of the shared variables and semaphores among the available processor nodes. Each shared variable and semaphore must be allocated in or near the nodes that access it most frequently. This problem is, in general, too complex to be solved automatically at compile-time; in OPP, the user is responsible for an efficient allocation scheme, according to the nature of each particular application.

The potential application areas of OPP are extensive, as OPP provides the means to support most of the existing parallel algorithms that are based on the shared-memory computation model, with minimal programming effort. OPP also allows the fast and reliable testing of parallel algorithms on inexpensive distributed memory architectures (e.g., for prototyping).

So far the authors have successfully implemented most of the classical synchronization problems (e.g., readers-writers and dining philosophers) using OPP. These problems are important as they are abstractions of many practical applications.
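As an indication of the programming style, a sketch of the classical readers-writers scheme expressed with occam++ semaphores is given below. It is not the authors' implementation; the names readcount, mutex, and wrt are hypothetical, and both semaphores are assumed to have been given an initial value of 1 (a single signal each) before any reader or writer starts.

  [1]INT readcount :                -- shared counter of active readers
  SEMA mutex, wrt :                 -- mutex guards readcount, wrt excludes writers
  ...
  -- a reader
  SEQ
    mutex ! wait
    readcount[0] := readcount[0] + 1
    IF
      readcount[0] = 1
        wrt ! wait                  -- first reader locks out writers
      TRUE
        SKIP
    mutex ! signal
    ...  read the shared data
    mutex ! wait
    readcount[0] := readcount[0] - 1
    IF
      readcount[0] = 0
        wrt ! signal                -- last reader readmits writers
      TRUE
        SKIP
    mutex ! signal

  -- a writer
  SEQ
    wrt ! wait
    ...  write the shared data
    wrt ! signal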
The authors' major effort is currently focused on using OPP for the development of a distributed multi-user database management system. In this system (which is planned to be portable to any network structure), shared semaphores are systematically used for the synchronization of access to the system resources, while shared variables are used for the concurrent access to the system's data.
CONCLUSIONS

The presented approach for achieving DSM is based on source program transformations for the introduction of virtual channels, shared variables (with strong coherency), and global semaphores. The programmer's view of the system is consistent with the shared-memory paradigm, which exhibits considerable advantages over the distributed message-passing paradigm. Applications written in occam++ can be executed on any multitransputer architecture without any significant source modifications.

Shared memory is allocated at compile-time to one or more transputer nodes. If this allocation is reasonable, low overhead can be expected, and many applications will execute faster than when other replication- or migration-based DSM methods are used. Finally, it must be emphasized that the principles of OPP may be applied to any other parallel programming language; occam2 was chosen simply because it is a convenient parallel language for multitransputer architectures.
REFERENCES

1 Stumm, M and Zhou, S 'Algorithms implementing distributed shared memory' Computer (1990) pp 54-64
2 Zhou, S et al. 'Extending distributed shared memory to heterogeneous environments' in Proc. 10th Int. Conf. Distributed Computing Systems, Computer Science Press (1990)
3 Li, K 'Shared virtual memory on loosely coupled multiprocessors' PhD thesis, Department of Computer Science, Yale University, USA (1986)
4 Li, K and Hudak, P 'Memory coherence in shared virtual memory systems' ACM Trans. Comput. Syst. Vol 7 No 4 (November 1989) pp 321-359
5 Bisiani, R and Forin, A 'Multilanguage parallel programming of heterogeneous machines' IEEE Trans. Computers Vol 37 No 8 (August 1988) pp 930-945
6 Bennett, J K et al. 'Adaptive software cache management for distributed shared memory architectures' in Proc. 17th Annual Int. Symp. Computer Architecture (1990) pp 125-134
7 Carter, J B et al. 'Implementation and performance of Munin' Technical Report COMP TR90-150, Rice University, USA (1991)
8 Sinha, P K et al. 'Flexible user definable memory coherency scheme in distributed shared memory of GALAXY' in Proc. 2nd European Conf. Distributed Memory Computing (Lecture Notes in Computer Science Vol 487) Springer-Verlag (1991) pp 52-61
9 Borrmann, L and Istavrinos, P 'Store coherency in a parallel distributed-memory machine' in Proc. 2nd European Conf. Distributed Memory Computing (Lecture Notes in Computer Science Vol 487) Springer-Verlag (1991) pp 32-41
10 Giloi, W K et al. 'A distributed implementation of shared virtual memory with strong and weak coherence' in Proc. 2nd European Conf. Distributed Memory Computing (Lecture Notes in Computer Science Vol 487) Springer-Verlag (1991) pp 23-31
11 Dally, W J et al. 'The J-Machine: a fine-grain concurrent computer' in Ritter, G X (ed) Information Processing 89, Elsevier Science Publishers (1989)
12 Inmos Ltd occam2 reference manual Prentice Hall (1988)
13 Inmos Ltd Transputer reference manual Prentice Hall (1988)
APPENDIX: SYNTAX OF occam+ AND occam++

The syntax of occam+ and occam++ is described below in a modified Backus-Naur Form (BNF). The same formalism has been used elsewhere [12]. With the notation:

  assignment = variable := expression

it is meant that an assignment is a variable, followed by :=, followed by an expression. A vertical bar (|) means 'or'. For example:

  action = assignment | input | output

means that an action is an assignment, an input, or an output. With the notation:

  sequence = SEQ
               { process }

it is meant that a sequence is the keyword SEQ followed by zero or more processes, each on a separate line and indented two spaces beyond SEQ. Similarly, {0, expression} denotes a list of zero or more expressions, separated by commas, and {1, expression} denotes a list of one or more expressions, separated by commas. The following gives only the differences between occam+ and occam2, and between occam++ and occam+.
Syntax of occam+

  occam.plus.program = { library.usage }
                       { abbreviation }
                       { definition }
                       { virtual.channel.declaration }
                       placedpar
  library.usage = #USE library.logical.name
  library.logical.name = name
  virtual.channel.declaration = channel.type {1, name} :
  channel.type = CHAN OF protocol
               | [ expression ] channel.type
  placedpar = PLACED PAR
                { placedpar }
            | PLACED PAR replicator
                placedpar
            | processor
  processor = PROCESSOR expression name {1, expression}
                process

The other syntactic objects are defined as by Inmos [12], except the following:

  parallel = PAR
               { process }
           | PAR replicator
               process
           | PRI PAR
               { process }
           | PRI PAR replicator
               process
Syntax of occam++

  occam.plus.plus.program = { library.usage }
                            { abbreviation }
                            { definition }
                            { shared.memory.or.semaphore.declaration }
                            placedpar
  shared.memory.or.semaphore.declaration = shared.memory.declaration
                                         | semaphore.declaration
  shared.memory.declaration = [ expression ] INT {1, name} :
  semaphore.declaration = SEMA {1, name} :
  processor = PROCESSOR expression name {1, expression}
                { place }
                process
  place = PLACE {1, name} :

The other syntactic objects are defined as in occam+, except the following:

  output = channel ! expression
         | channel ! output.item
         | channel ! {1; output.item}
         | channel ! tag
         | channel ! tag; {1; output.item}
         | port ! expression
         | semaphore ! name
  semaphore = element