The RHODOS DSM system1




MICPRO 1226

Microprocessors and Microsystems 22 (1998) 183–196

J. Silcock*, A. Goscinski
School of Computing and Mathematics, Deakin University, Geelong, Victoria 3217, Australia

Abstract

The RHODOS Distributed Shared Memory (DSM) system forms an easy to program (using sequential programming skills without a need to learn DSM concepts) and transparent environment, and provides high performance computational services. This system also allows programmers to use either the sequential or release consistency model for the shared memory. These attributes have been achieved by integrating DSM into the RHODOS distributed operating system rather than putting it on top of an existing operating system, as have other researchers. In this paper we report on the development of a DSM system integrated into RHODOS and how it supports programmers; the programming of three applications to demonstrate ease of programming; and the results of running these three applications using the two different consistency protocols.

© 1998 Published by Elsevier Science B.V.

Keywords: DSM; Consistency; Write-update; Write-invalidate; Distributed Operating System; Performance

1. Introduction

Distributed Shared Memory (DSM) is one of the three basic approaches to expressing parallelism (the others being parallel programming languages and parallel programming packages). It is the most promising of the three because it allows the programmer to employ the familiar and easy-to-use shared memory programming model and to use existing shared memory software directly. Another very important feature of DSM, which is not addressed directly, is that it allows the management of existing parallelism.

Clusters of workstations (COWs) are well suited to support parallel processing because they scale to larger numbers of machines than tightly coupled systems and demonstrate an excellent ratio of performance to cost. However, the excellent architectural and computational features of COWs are undermined by the lack of good parallel software. As the available parallelism of computer systems increases, the problem is how to exploit this parallelism.

An analysis of the DSM systems developed by other researchers has highlighted that ease of use for application programmers is often sacrificed for the sake of performance. Thus, in order to extract the best performance, Midway [1] requires the programmer to label DSM variables and associate them with specific locks, while Munin [2] requires the programmer to use different consistency protocols for variables according to their access patterns. These approaches clearly require the programmer to gain additional skills and to have an insight into the implementation of the DSM system in order to use it successfully. This feature of DSM could discourage application programmers even more than parallel programming languages, which require learning a new language, and parallel programming tools, which require learning some library routines; programmers could be forced to learn both DSM and an operating system supporting DSM, and to specify data to address the DSM requirements.

The DSM system proposed in this paper places strong emphasis on ease of programming and use, and on transparency, without sacrificing performance. We are convinced that it is not enough to develop an efficient, but sometimes difficult to use, DSM system which the average application programmer simply will not adopt as their execution environment. We claim in this paper that in order to provide application programmers with a convenient environment in which to develop and execute parallel applications, and which does this in a transparent and efficient manner, a DSM should be an integral part of a distributed operating system. (The existing DSM systems supporting parallel execution on COWs have been built on top of operating systems.) The question is whether the operating system based approach is fully feasible and whether it is able to provide results comparable to or even better than those of the DSM systems developed on top of an operating system.

* Corresponding author. Tel.: 00 61 03 5227 1243; fax: 00 61 03 5227 1208; e-mail: jackie, [email protected]
1 This work was partly supported by the Deakin University Research Grant 0504222151.

0141-9331/98/$ - see back matter © 1998 Published by Elsevier Science B.V. All rights reserved. PII S0141-9331(98)00078-7


J. Silcock, A. Goscinski/Microprocessors and Microsystems 22 (1998) 183–196

In order to demonstrate the feasibility of our approach, the DSM system has been developed entirely within the RHODOS Operating System (ResearcH Oriented Distributed Operating System) [3]. RHODOS’ DSM supports application programmers by giving them an environment in which they can easily write shared memory code or adapt existing shared memory code. The system requires only basic initialisation code for the creation of the shared memory region [4]. A single library call from the parent process creates the shared memory, initialises the child processes on remote workstations based on the current system load, and then initialises the synchronisation variables for the DSM system. The synchronisation primitives take two forms: barriers, which ensure that all DSM parallel processes co-ordinate at a certain point before continuing execution, and semaphore-type wait() and signal() for mutual exclusion.

The goals of this paper are (i) to report on the development of a DSM system which allows programmers to achieve the above goals; (ii) to show how RHODOS’ DSM allows programmers to write shared memory code in an easy to use and familiar environment and that this task can be performed by an inexperienced programmer who does not know the DSM system; and (iii) to present the results of tests carried out to show the difference in performance between the sequential and release consistency protocols employed by RHODOS’ DSM.

2. DSM in RHODOS

In this section we describe the semantics of memory coherence and synchronisation in RHODOS’ DSM in order to show that the primitives used by programmers perform operations which are completely invisible to the programmers at user level. We use a simple section of Producer–Consumer code to explain the actions taken by the DSM system when the code is executed. We have employed two consistency models in RHODOS’ DSM, sequential and release consistency. We have used a write-invalidate approach to serialising parallel write operations when implementing sequential consistency and a write-update approach to serialising parallel write operations when implementing release consistency. In addition, we use a skeleton of code containing barriers to describe the semantics of these barriers and how they are employed in RHODOS’ DSM.

In the write-invalidate approach, each workstation has a local shared memory region which may contain writable, readable and invalid (offsite) pages. An attempt by a process to access a page of which it does not have a copy results in a page fault which is handled by the DSM software. The DSM software then retrieves the missing page. If the access that caused the page fault was a write access then all other copies of the page must be invalidated.

In the write-update approach each process has a local shared memory region which contains both writable and readable pages. The processes on different workstations may read from and write to the pages; however, it is the responsibility of the DSM software to ensure that all the shared memory regions in the system are consistent. Write-update models using relaxed consistency protocols, such as release consistency, delay the distribution of updates until they are required by another workstation.

2.1. Design requirements

Three design requirements have been identified as being desirable for a DSM system:

• Ease of programming. The DSM should provide programmers with an easy to use environment in which they are comfortable writing their shared memory code. The programmer should not be forced to go beyond the concepts of sequential shared memory-based programming, supported by such constructs as semaphores, locks or barriers, with which they are familiar. This will allow easy development of new programs and porting of existing sequential programs to the proposed execution environment.
• Transparency. The users should be unaware that the memory they are using is not physically shared and they should not be expected to have any DSM-related input to the program other than barriers, which are easy to use and understand.
• Efficiency. The access time of non-local memory should be as close to the access time of local memory as possible. This is also related to transparency in that the programmer should see no discernible difference between local and non-local memory accesses.
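
To make the first requirement concrete, the kind of familiar shared memory code it has in mind might look like the following minimal sketch. It is not the RHODOS code: Python threads in one process stand in for DSM processes on separate workstations, the wait() and signal() wrappers are hypothetical stand-ins for the semaphore-type primitives named in the Introduction, and the second semaphore (sem[1]), used to order the Producer before the Consumer, is an assumption of this sketch.

```python
import threading


def wait(semaphore):
    """Stand-in for a semaphore-type wait(): block until available."""
    semaphore.acquire()


def signal(semaphore):
    """Stand-in for a semaphore-type signal(): release the semaphore."""
    semaphore.release()


def run_producer_consumer():
    # sem[0] guards the critical section; sem[1] tells the Consumer that
    # the values are ready (this second semaphore is an assumption).
    sem = [threading.Semaphore(1), threading.Semaphore(0)]
    shared = {"x": 0, "y": 0, "z": 0}  # plays the role of shared memory
    consumed = []

    def producer():
        wait(sem[0])
        shared["x"], shared["y"], shared["z"] = 1, 2, 3
        signal(sem[0])
        signal(sem[1])

    def consumer():
        wait(sem[1])
        wait(sem[0])
        consumed.extend((shared["x"], shared["y"], shared["z"]))
        signal(sem[0])

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```

Nothing in the sketch refers to pages, ownership or messages; under the transparency requirement, that machinery would sit entirely below the wait() and signal() calls.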

2.2. A simple example of Producer–Consumer code

In Fig. 1 we show a code segment which we use to illustrate the semantics of the memory consistency model and semaphore-type synchronisation in the RHODOS DSM system. The code shows a simple example of the classic Producer–Consumer problem in which it is clear that the synchronisation primitives are no different from those normally used by programmers in shared memory code. The Producer produces three numbers and places them into the variables x, y and z; the Consumer then consumes the numbers from x, y and z. Since both processes access the three variables, the accesses must take place within a critical section to prevent the two processes from accessing them simultaneously. The semaphores take the form of an array of their names in the user process.

Fig. 1. Producer–consumer code segment.

2.3. Memory management based DSM system in RHODOS

We have decided to embody DSM within an operating system in order to create a transparent and easy to program environment and achieve high execution performance of parallel applications. Operating systems which allow easy extensions and modifications are those which are based on the client–server model. In a microkernel and client–server based operating system the system resources are managed by a set of servers such as a process server, memory server, and interprocess communication server. Shared memory can itself be viewed as a resource which requires management.

RHODOS is a microkernel and client–server based distributed operating system [3]. In RHODOS, the system resources are managed by a set of servers such as the Process Manager, Space (Memory) Manager, and Interprocess Communication (IPC) Manager. The options for placing the DSM system within the operating system are either to build it as a separate server or to incorporate it within one of the existing servers. The first option was rejected because of a possible conflict between two servers (the Space Manager and the DSM system) both managing the same object type, i.e. memory. Synchronised access to the memory in order to maintain its consistency would become a serious issue. Since DSM is essentially a memory management function, the Space Manager is the server into which the DSM system must be integrated.

In order to support memory sharing in a COW, which employs message passing to allow processes to communicate, the DSM system must be supported by the Interprocess Communication Manager. However, the support provided to the DSM system by this server is invisible to application programmers.
Furthermore, because DSM parallel processes must be properly managed, including their creation, synchronisation when sharing a memory object and co-ordination of their execution, the Process Manager must support DSM system activities. The placement of the DSM system in RHODOS and its interaction with the basic servers are shown in Fig. 2.

Fig. 2. DSM system integrated into the Space Manager in RHODOS.

The placement of the DSM system in the memory manager of a client–server and microkernel based distributed operating system helps the DSM system to achieve several design requirements. First, because the DSM system is integrated into the memory management the programmer is able to use the shared memory as though it were physically shared, hence transparency is achieved. Second, because the DSM system is in the operating system itself and is able to use the low level operating system functions, efficiency is achieved.

2.4. Granularity of the shared memory object

The granularity of the shared memory object is an important issue in the design of a DSM system. The proposed DSM system is placed within the Space Manager of RHODOS. The memory unit of the RHODOS Space is a page. Thus, it follows that the most appropriate object of sharing for the DSM system is a page. A single page, rather than multiple pages, is used as the unit of granularity in the proposed system. The reason is that although multiple memory units would reduce the effect of the communication overhead, they would increase the incidence of false sharing and thus reduce the efficiency of the DSM system. False sharing occurs in write-invalidate systems when processes on separate workstations access different variables in the same shared memory unit or page. The consequence is that the memory unit thrashes between the workstations.

2.5. Semaphores and memory coherence in invalidation-based DSM in RHODOS

The events that occur when the producer code segment in Fig. 1 is executed by the User Process on the Source Workstation, using invalidation-based DSM [5], are as follows (Fig. 3).

Fig. 3. Design of semaphores and memory consistency in invalidation-based DSM on RHODOS.

Before performing the write access on the shared variables the wait(sem[0]) call must be executed; the Producer sends a message (Message 1) to the Space Manager on the Source Workstation and blocks. The Space Manager sends a message (Message 2) to the probable owner of the semaphore. The probable owner is the process which, according to this Space Manager’s records, was the last owner of the semaphore. If ownership of the semaphore has been passed on, the message will be forwarded to the new probable owner until the actual owner is located.

The semaphore can be in one of three states: REMOTE, LOCKED or RELEASED. If it is REMOTE the local Space Manager is no longer the owner and the request must be forwarded to the probable owner of the semaphore as shown in the local Space Manager’s records. If the semaphore is LOCKED the request is queued to be handled later when the semaphore is released. If the semaphore is RELEASED the state is changed to REMOTE and a reply (Message 3) is sent to the Space Manager on the Source Workstation. The latter changes the state of its semaphore to LOCKED and restarts the user process by sending a message (Message 4) to the Process Manager.

The page in the shared memory region, on which x, y and z are found, can be in one of three states:

• read-only: the only operation that may be performed on the page is a read operation; a write operation will cause an exception;
• read–write: both read and write operations can be carried out on the page; or
• offsite: a valid copy of the page does not exist in the memory of the workstation; a read or write operation on this page will result in an exception.

When the User Process tries to access memory which is not resident on that workstation (offsite) or attempts to write to a page which has read-only protection, a trap occurs and a message (Message 5) is sent to the Space Manager, which uses an algorithm similar to that used to locate the semaphore owner to find the owner of the page. When the page owner has been located, the Space Manager on the Source Workstation sends a message (Message 6) to the Space Manager on the Page Owner’s Workstation.

If write access is required, an invalidation message (Message 7) is sent by the Space Manager on the Page Owner’s Workstation to all of the workstations in the copyset. The copyset is a list of all workstations which have a copy of the page. The Space Managers on all workstations in the copyset invalidate their copies of the page and send a reply (Message 8) to the Space Manager on the Page Owner’s Workstation. When all the replies have been received a copy of the page is sent to the Source Workstation (Message 9) and the copy on the Owner’s Workstation is invalidated. If read access is requested, the Source Workstation is added to the owner’s copyset and a copy of the requested page is sent to the Source Workstation (Message 9).

When the page is received at the Source Workstation, the Space Manager maps the page into the local space. If write access was requested it changes the ownership of the page to local and the protections on the page to read_write. The Space Manager then sends a message (Message 10) to the Process Manager which unblocks the User Process. To exit the critical region, the User Process executes a signal(sem[0]) user library call and a message (Message 11) is sent to the Space Manager which changes the status of the semaphore to RELEASED.

2.6. Semaphores and memory coherence in update-based DSM in RHODOS

The events that occur when the producer code segment in Fig. 1 is executed by the User Process on the Source Workstation, using update-based DSM [6], are as follows (Fig. 4).

Fig. 4. Logical design of semaphores and memory consistency in update-based DSM on RHODOS.

The protections on the shared memory region, in which x, y and z are found, are read-only. Before performing the write access on the shared variables the wait(sem[0]) call must be executed; the Producer sends a message (Message 1) to the Space Manager on the Source Workstation and blocks. The Space Manager sends a message (Message 2) to the probable owner of the semaphore. If the semaphore is RELEASED the status is changed to REMOTE and a reply (Message 3) is sent to the Space Manager on the Source Workstation. The latter changes the state of its semaphore to LOCKED and restarts the user process by sending a message (Message 4) to the Process Manager.

The attempt to perform a write access on variable x in the shared memory region on the Source Workstation results in a write fault (Message 5) because the page has read-only protection. The write fault is handled by the Space Manager. The Space Manager twins the page and changes the protection for that page to read–write so that further accesses will not trigger a write fault. Twinning involves making an identical copy of the page. In the page record for the page containing x in the DSM table the status field is changed to TWINNED and a pointer to the copy of the page is placed in the Page_ptr field. The Space Manager contacts (Message 6) the Process Manager to restart the User Process. The write operation can then proceed. If variables y and z are on the same page as x, the operations to write to them will also proceed without a page fault as the page now has read–write protection. Thus the page within the shared memory region is altered while the copy of the page pointed to by the DSM table remains unchanged. These two pages will be compared later to identify the changes.

When the User Process on the Source Workstation executes the signal(sem[0]) call to exit the critical section, a message (Message 7) is sent to the Space Manager on the Source Workstation. The Space Manager then builds a Diff Message (Fig. 5): the shared memory region identifier is placed in ID and the number of pages containing differences (num_pages) is initialised to zero. The Space Manager searches sequentially through the page records of the shared memory region in the DSM table for page records with TWINNED status. When one is located, num_pages in the message is incremented and the original page, which is pointed to by the Page_ptr, is compared word by word with the corresponding page in the shared memory region. When a difference is found, the offset of the beginning of the difference from the beginning of the page is stored in the Diff Message (Offset) and the number of differences found on the page (num_diffs) is incremented. The comparison continues until no further difference is found, at which point the length of the “chunk” of differing data and the “chunk” itself are copied to the message. The search through the page for changes is continued to the end of the page and any further differences are added to the message. The status field for the page in the DSM Table is changed to SINGLE. This process is repeated for each twinned page. When this is completed the Space Manager on the Source Workstation sends the Diff Message (Message 8) to all Remote Workstations.

Fig. 5. Overview of Diff message generation.

When a Diff Message is received at a Remote Workstation the shared memory region is updated to contain all the differences. When the message is received, the shared memory region with the same name as ID in the message is located. The page corresponding to Page Number in the message is then found and the address corresponding to Offset is located.
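
The twin-and-diff mechanism described above can be illustrated with a minimal sketch. The field names ID, num_pages, num_diffs and the notions of offset and chunk follow the text; everything else is an assumption made for illustration: the function names, the dictionary layout of the message, the 4-byte word size, and the use of Python byte strings in place of real pages. It is not the RHODOS implementation.

```python
WORD = 4  # assumed word size in bytes (not stated in the paper)


def diff_page(twin, current, word=WORD):
    """Compare a twin with the current page word by word.

    Returns a list of (offset, chunk) pairs, one per run of differing words.
    """
    if len(twin) != len(current) or len(twin) % word:
        raise ValueError("pages must have equal, word-aligned length")
    chunks = []
    start = None  # offset where the current run of differences began
    for off in range(0, len(twin), word):
        differs = twin[off:off + word] != current[off:off + word]
        if differs and start is None:
            start = off
        elif not differs and start is not None:
            chunks.append((start, bytes(current[start:off])))
            start = None
    if start is not None:  # the run extends to the end of the page
        chunks.append((start, bytes(current[start:])))
    return chunks


def build_diff_message(region_id, region, twins):
    """Build a Diff Message covering every TWINNED page of one region.

    region maps page number -> bytearray (the live page);
    twins maps page number -> bytes (the copy made at the write fault).
    """
    message = {"ID": region_id, "num_pages": 0, "pages": []}
    for page_number in sorted(twins):
        chunks = diff_page(twins[page_number], region[page_number])
        message["num_pages"] += 1
        message["pages"].append({"page_number": page_number,
                                 "num_diffs": len(chunks),
                                 "chunks": chunks})
    twins.clear()  # every page returns to SINGLE status
    return message


def apply_diff_message(region, message):
    """Patch a remote copy of the region with the chunks in a Diff Message."""
    for page in message["pages"]:
        target = region[page["page_number"]]
        for offset, chunk in page["chunks"]:
            target[offset:offset + len(chunk)] = chunk
```

In this sketch, adjacent modified words are merged into one chunk, so a page on which x, y and z are contiguous would contribute a single (offset, chunk) pair rather than three.
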
The difference is then placed in the page at that position. This process is continued until the whole message has been consumed. The Space Managers on the Remote Workstations send acknowledgement messages (Message 9) to the Source Workstation. When the Space Manager on the Source Workstation has received all acknowledgements from the Remote Workstations it releases the semaphore and sends a message (Message 10) to the Process Manager on the Source Workstation to restart the User Process. At this point all shared memory should be consistent. If, however, not all Remote Workstations acknowledge receipt of the Diff Message, the Source Workstation will send the message again and await acknowledgements.

2.7. Semantics of barriers in invalidation-based and update-based DSM in RHODOS

Barriers are used in RHODOS to co-ordinate the parallel DSM processes. Processes block until all processes have reached the same barrier; the processes then all continue execution. A piece of skeleton code in Fig. 6 shows the use of barriers. The barriers take the form of an array of their names.

In code written for DSM, as in code written for execution on any shared memory system, a barrier is required at the start and end of execution and there may be any number of barriers during the execution of the application. The barrier at the start of execution is required because, as in the case of a tightly coupled system where the memory is physically shared, none of the processes should begin computation until the shared memory initialisation has been completed. In the same way, a barrier should precede the execution phase of the application to ensure that all processes have completed their initialisation phase. Likewise the final barrier ensures that the processes exit only after all processes have completed the execution of the application. In addition to this, barriers can be used throughout the execution whenever it is necessary that all processes have completed a particular phase of the computation before the start of the next phase.

In RHODOS barriers are managed by a Centralised Barrier Manager. The Barrier Manager receives messages from all processes when they have reached the barrier. Once all these messages have been received the Manager sends a message to all the processes allowing them to unblock and continue execution. The main barrier data structure, the dsm_barrier_table, holds information about the barrier in five fields. The barrier_name field contains a unique identifier for the barrier.
The psn field contains the unique identifier for the parent process from which the barriers were initialised. The barrier_count is used by the Barrier Manager as a count of the number of barrier messages received and barrier_max is the number of processes using the barrier. The barrier_queue field is also used only by the Barrier Manager; it contains a pointer to a queue which contains the locations of all processes using the barrier.

Fig. 6. Code skeleton using barriers.

The events that occur when a dsm_barrier(barrier[0]) primitive is executed by the User Process on the Source Workstation are as follows (Figs. 7 and 8).

Fig. 7. Logical design of barriers in invalidation-based DSM on RHODOS.

In invalidation-based DSM, when a dsm_barrier(barrier[0]) call is executed (Fig. 7) by the User Process on the Source Workstation, the User Process sends a message (Message 1) to the Space Manager on the Source Workstation and blocks. The Space Manager sends a message (Message 2) to the Barrier Manager. This manager is part of the Space Manager on the workstation on which the barriers were initialised. The Barrier Manager increments the barrier_count. If barrier_count equals barrier_max the Barrier Manager has received barrier messages from all workstations listed in its barrier queue and it sends a message (Message 3) to all their Space Managers. The barrier_queue is a list of all workstations from which the Barrier Manager is expecting barrier messages. If the Barrier Manager times out waiting for barrier messages from all workstations in the barrier queue, it can use the barrier queue to identify the workstation which has not sent a message. When all barrier messages are received, the Space Managers send a message (Message 4) to the Process Manager to restart the User Processes.

Fig. 8. Logical design of barriers in update-based DSM on RHODOS.

In RHODOS’ update-based DSM, barriers (Fig. 8) not only synchronise the processes but also serve as points for memory update operations. Entry to a barrier has the same effect on the shared memory as a signal(sem[0]) because the first operation carried out by the barrier primitive is to make the shared memory consistent. Exiting from a barrier has the same effect as a wait(sem[0]) because the protection on the shared memory is changed to read-only. Because the shared memory is made read-only before a barrier is exited, any subsequent attempt to perform a write access on any part of the shared memory after exiting a barrier results in a write fault (Message 1) which is handled by the Space Manager. The Space Manager twins the page and changes the protection for that page to read–write so that further accesses will not trigger a write fault. Twinning involves making an identical copy of the page. The status field in the page record of the DSM table is changed to TWINNED and a pointer to the copy of the page is placed in the Page_ptr field. The Space Manager contacts (Message 2) the Process Manager to restart the User Process. The write operation can then proceed. When the User Process on the Source Workstation executes another dsm_barrier(barrier[0]) call, a message (Message 3) is sent to the Space Manager.
The TWINNED pages are compared with their copies and a Diff Message is generated using the same mechanism as that used to generate Diffs when a signal(sem[0]) is executed. The Diff Message (Message 4) is then sent to the Space Managers on all Remote Workstations. The Space Managers incorporate the changes in the Diff Message into the shared memory. The Space Managers on the Remote Workstations send acknowledgement messages (Message 5) to the Space Manager on the Source Workstation. When the Space Manager has received all these acknowledgements it sends a barrier message (Message 6) to the Barrier Manager. When the Barrier Manager has received barrier messages from all workstations in its barrier queue it sends a message (Message 7) to all their Space Managers. The Space Managers on all workstations then change the protection for the shared memory to read-only and send messages (Message 8) to their respective Process Managers to restart the User Processes.
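
The bookkeeping performed by the Centralised Barrier Manager can be sketched as follows. The field names barrier_name, psn, barrier_count, barrier_max and barrier_queue and the table name dsm_barrier_table come from the description above; the class and method names, the extra arrived set used to find stragglers after a timeout, and the resetting of the count so that a barrier can be reused are assumptions of this sketch. Message passing is replaced by plain method calls, so this is a single-process model rather than the RHODOS code.

```python
from dataclasses import dataclass, field


@dataclass
class BarrierEntry:
    barrier_name: str          # unique identifier of the barrier
    psn: int                   # identifier of the initialising parent process
    barrier_max: int           # number of processes using the barrier
    barrier_queue: list        # workstations expected at the barrier
    barrier_count: int = 0     # barrier messages received so far
    arrived: set = field(default_factory=set)  # assumption: who has arrived


class BarrierManager:
    def __init__(self):
        self.dsm_barrier_table = {}

    def init_barrier(self, name, psn, workstations):
        self.dsm_barrier_table[name] = BarrierEntry(
            name, psn, len(workstations), list(workstations))

    def barrier_message(self, name, workstation):
        """Record one arrival; return the workstations to release, if any."""
        entry = self.dsm_barrier_table[name]
        entry.arrived.add(workstation)
        entry.barrier_count += 1
        if entry.barrier_count < entry.barrier_max:
            return []                      # still waiting for others
        entry.barrier_count = 0            # assumption: reset for reuse
        entry.arrived.clear()
        return list(entry.barrier_queue)   # release everyone

    def missing(self, name):
        """After a timeout, list workstations that have not yet arrived."""
        entry = self.dsm_barrier_table[name]
        return [w for w in entry.barrier_queue if w not in entry.arrived]
```

In the update-based protocol, each workstation would call barrier_message only after its Diff Message had been acknowledged by all Remote Workstations, which is what makes the barrier a memory update point as well as a synchronisation point.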

3. Programming and performance in the RHODOS DSM environment In this section we show how we support application programmers by allowing them to program using the same shared memory code that they would use when programming for shared memory systems whether they be tightly coupled multiprocessors or uniprocessors with concurrent programming. In addition, we present the results for our performance tests which measure the speedup of three applications using RHODOS’ update and invalidation-based DSM. The complete set of performance studies is presented in [7]. The applications used are: • •



Jacobi which uses the SPMD computational model to solve partial differential equations. Quicksort (QS) which has computational phases in which it uses the SPMD model and phases in which it uses the MPMD model. The algorithm sorts an array of elements. Travelling Salesman Problem (TSP) which uses the

Fig. 9. Basic pseudocode for the Jacobi algorithm.

J. Silcock, A. Goscinski/Microprocessors and Microsystems 22 (1998) 183–196

Fig. 10. Data decomposition algorithm.

Multiple Program Multiple Data (MPMD) computational model. TSP is based on a branch and bound algorithm. In TSP all processes share the same queue of partial tours; there is no data decomposition.

The system is currently implemented within the RHODOS operating system running on Sun 3/50 workstations connected by a 10 Mbps Ethernet. The granularity of the shared memory is an 8K page. The experiments were carried out using one to eight workstations for each application, for both the update- and the invalidation-based protocols, with a single DSM process placed on each workstation.

3.1. Jacobi

The Jacobi algorithm uses two matrices, “grid” and “scratch”. It places the average of the surrounding (above, left, right, below) elements of the grid array into the scratch array, and then copies the scratch array into the grid array. This is repeated an arbitrary number of times. The algorithm was written based on an algorithm provided in [8]. The pseudocode for the Jacobi algorithm is shown in Fig. 9.

To achieve parallelism, the arrays are split up into contiguous blocks of rows, and each process is given one block on which to operate. Therefore, for N processes, each process gets approximately 1/N of the arrays. A unique identification number is given to each process to determine its piece of the arrays. Fig. 10 shows the decomposition algorithm, where M is the unique identifier of the process. One shared two-dimensional array of floats is used for the “grid” array, and a local two-dimensional array of floats is given to each process for the “scratch” array. In the experiments a matrix size of 60 × 1024 was used. The boundary values of each process’s section of the array are shared with other processes.

The programmer is required to insert barriers to synchronise the Jacobi processes. As in all cases, pre- and post-execution barriers ensure that the processes commence execution only after initialisation is complete and exit only after all processes have completed execution. The barriers during execution ensure that all processes have completed the computation of their scratch arrays before these are written to the grid array.

Discussion. The results for the Jacobi tests are shown in Fig. 11. The speedups for eight processors using invalidation-based DSM and update-based DSM are 3.6 and 5.2, respectively, for the problem of size 60 × 1024. The experiment was repeated for the invalidation-based version with a matrix size (64 × 2048) deliberately chosen so that rows were page aligned. This eliminates the effects of false sharing, which have also been reported by other researchers using page-based implementations [2,9]. Realistically, such implementation details should not concern the programmer, and in a normal situation there may be more false sharing than in this experiment. The results of this experiment are also shown in Fig. 11; the eight-process speedup of 5.4, compared with 3.6 for the unaligned matrix, demonstrates the negative influence of false sharing on performance.

Fig. 11. Speedup for Jacobi on matrices of 64 × 2048 (INVALIDATE) and 60 × 1024 (SMALL-INVALIDATE and UPDATE) elements.

3.2. Quicksort

Quicksort is a recursive sorting algorithm which repeatedly divides lists into sublists so that the contents of one list are less than the contents of another (Fig. 12). When the size of the sublists reaches some default size (1 kB in our case) they are sorted sequentially using a bubblesort algorithm. The Quicksort code used here was provided by Keleher [10]; we made some adjustments to the code for use on RHODOS DSM.

Fig. 12. Pseudocode for Quicksort.
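The sequential core of this scheme, recursive partitioning that hands any sublist at or below the default size to a bubblesort, can be sketched in Python as follows. This is only an illustrative sketch and not the code supplied by Keleher or the pseudocode of Fig. 12: the function names, the choice of the last element as pivot and the `threshold` parameter (defaulting to 1024 elements) are assumptions made for the example.

```python
def bubblesort(a, lo, hi):
    """Sort the slice a[lo:hi] in place with bubblesort."""
    for end in range(hi - 1, lo, -1):
        swapped = False
        for i in range(lo, end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:  # already sorted, stop early
            break


def partition(a, lo, hi):
    """Partition a[lo:hi] around its last element; return the pivot's final index."""
    pivot = a[hi - 1]
    store = lo
    for i in range(lo, hi - 1):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi - 1] = a[hi - 1], a[store]
    return store


def quicksort(a, lo=0, hi=None, threshold=1024):
    """Sort a[lo:hi] in place; sublists of at most `threshold` elements are bubblesorted."""
    if hi is None:
        hi = len(a)
    if hi - lo <= threshold:
        bubblesort(a, lo, hi)  # small sublist: sort sequentially
        return
    p = partition(a, lo, hi)
    quicksort(a, lo, p, threshold)      # elements smaller than the pivot
    quicksort(a, p + 1, hi, threshold)  # elements not smaller than the pivot
```

In a parallel version the two recursive calls would instead become entries in a shared task queue, so that different processes can pick up the two sublists.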

Quicksort was implemented using three shared data structures: the array being sorted; the task queue containing the indices of the subarrays to be sorted; and a count of the number of processes waiting for work. The task queue serves to parallelise Quicksort because it holds the details of the unsorted sublists. The processes continually remove these details and either partition the sublists or, if they are of the default size, sort them sequentially. The programmer must insert a semaphore to protect accesses to the task queue. A condition variable is used to signal when there is work to be done. As in all applications, barriers are used to synchronise the start and finish of execution. The sequential bubblesorts performed on the sublists of the default size are carried out without synchronisation, as each process is sorting a section of the array which will only be accessed by that process. The default size in this experiment was set at 1024 elements.

Discussion. Quicksort displays false sharing when more than one process attempts to sort different subarrays which happen to be on the same page. This means that the pages thrash between workstations. The speedups achieved by invalidation- and update-based DSM are shown in Fig. 13; for eight workstations they are 2.8 and 3.8, respectively. The speedup drops away from the linear because of the large number of synchronisation accesses performed. The task queue is always accessed in a critical section, and whenever a task is removed from the task queue the critical section is exited. However, if the task removed requires further subdivision, the critical section must be re-entered after the subdivision is performed so that the new tasks can be added to the queue, and then exited again. Each time a critical section is entered, messages must be sent to gain access to the synchronisation variable. Exiting the critical section causes the operating system to go through the updating process.

Fig. 13. Speedup for Quicksort using invalidation and update-based DSM on an array of 256K elements.

3.3. Travelling Salesman Problem (TSP)

The Travelling Salesman has a group of cities to visit. The problem is to find the shortest path which the salesman must take in order to visit each city only once. Each path is known as a tour. The algorithm maintains a priority queue which contains partially evaluated tours. The head element of the queue is the tour with the smallest lower bound for the remaining portion of the tour. If the number of nodes remaining to complete the tour is below the threshold, the remainder of the tour is computed sequentially; otherwise the partial tour is expanded by a single node and the resulting partial tours are put into the tour queue. The current best tour is maintained. As each tour is removed from the tour queue, a lower bound on the remaining part of the tour is computed. If the sum of this lower bound and the current length exceeds the length of the current best tour, the tour is rejected. The code for TSP was originally from [10]; we have adapted it for RHODOS DSM (Fig. 14).

Fig. 14. Pseudocode for TSP.

The input for TSP is an array which represents the distances between the cities which the salesman must visit. The shared data structures are: the length of the global minimum tour and the tour itself; an array containing partially evaluated tours and unused tours; a priority queue of pointers to partially evaluated tours; and the tour stack, the elements of which point to unused tour structures. Effectively, the larger the threshold, the larger the portion of the work to be completed sequentially. These sequential portions are then carried out in parallel. In this case, we use a threshold of 13, which in [2] is referred to as a coarse-grained solution and results in better speedups than the finer-grained solutions, which use a smaller threshold.

Discussion. The major bottleneck in the TSP code is access to the priority queue. Processes must wait for the semaphore before accessing the queue itself. This semaphore is not released until new tours have been put back onto the queue. We used an 18-city tour. The speedups achieved by RHODOS’ invalidation- and update-based DSM for TSP are shown in Fig. 15; for eight workstations they are 4.2 and 6.9, respectively.

Fig. 15. Speedup for TSP using invalidation and update-based DSM for an 18-city tour.
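The branch and bound scheme of Section 3.3 can be sketched sequentially in Python as follows. This is a minimal illustration under stated assumptions, not the code of [10] or Fig. 14. The tour is taken to return to its starting city (city 0). The lower bound is a deliberately simple one (each city still to be left costs at least its shortest edge). The queue is ordered by current length plus that bound. All names, and the default threshold of 3, are hypothetical. The sketch assumes at least two cities.

```python
import heapq
import itertools


def lower_bound(dist, last, unvisited):
    """Cheap bound on the remaining length: each city still to be left
    contributes at least its shortest outgoing edge."""
    n = len(dist)
    return sum(min(dist[c][d] for d in range(n) if d != c)
               for c in [last, *unvisited])


def finish_sequentially(dist, path, length, unvisited, best):
    """Exhaustively complete a partial tour; return the best (length, tour) seen."""
    if not unvisited:
        total = length + dist[path[-1]][path[0]]  # close the tour
        return (total, path) if total < best[0] else best
    for city in sorted(unvisited):
        step = length + dist[path[-1]][city]
        if step < best[0]:  # otherwise this branch cannot improve on the best tour
            best = finish_sequentially(dist, path + [city], step,
                                       unvisited - {city}, best)
    return best


def tsp(dist, threshold=3):
    """Return (length, tour) of a shortest closed tour starting at city 0."""
    n = len(dist)
    best = (float("inf"), None)
    tie = itertools.count()  # tie-breaker so the heap never compares paths
    rest = frozenset(range(1, n))
    queue = [(lower_bound(dist, 0, rest), next(tie), 0, [0], rest)]
    while queue:
        bound, _, length, path, unvisited = heapq.heappop(queue)
        if bound >= best[0]:
            continue  # reject: cannot beat the current best tour
        if len(unvisited) < max(threshold, 1):
            # Few cities left: compute the remainder of the tour sequentially.
            best = finish_sequentially(dist, path, length, unvisited, best)
            continue
        for city in sorted(unvisited):  # expand the partial tour by one city
            new_len = length + dist[path[-1]][city]
            remaining = unvisited - {city}
            new_bound = new_len + lower_bound(dist, city, remaining)
            if new_bound < best[0]:
                heapq.heappush(queue, (new_bound, next(tie), new_len,
                                       path + [city], remaining))
    return best
```

In the parallel program, `queue` and `best` correspond to the shared priority queue and global minimum tour, and each removal or insertion would take place inside the semaphore-protected critical section that the discussion identifies as the bottleneck.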

4. Outcome analysis and related work

In this section, we discuss programming and performance aspects of RHODOS’ DSM. We confine our references to related work to these two areas. The comparison of performance is made difficult by the use of different hardware.

Initially, TSP and Quicksort were converted for RHODOS DSM by J. Silcock. The conversion involved the removal of all TreadMarks, Munin or IVY related code, followed by the addition of the initialisation and synchronisation code for RHODOS DSM. Performance tests were then carried out on these two applications, measuring their speedup. Subsequently, the task of implementing the remainder of the applications (by either converting existing DSM code or writing the DSM code using a supplied algorithm) was given to an undergraduate student who had completed the first two years of a computer science degree and was therefore a relatively inexperienced programmer. The task was part of the duties given to this student after he was awarded a vacation scholarship. He had completed a number of units which involved some programming, including Basic Programming Concepts, Data Structures and Algorithms, and Operating Systems [11]. The latter unit involved some concurrency issues, while all three units used C as the programming language. This student was given basic background in general DSM issues and shown the Quicksort and TSP implementations before being asked to write the Jacobi code, using an algorithm given in [8], for execution in the RHODOS DSM environment. With his knowledge of the C language and the introduction to concurrency provided by his course work, he was easily able to complete the task of converting and writing the required code. Notably, he had little knowledge of the implementation details of the RHODOS DSM system, yet he was able to perform his programming task easily without this knowledge, thereby demonstrating the transparency of the DSM system. While it is difficult to quantify ease of programming and use, this experience indicates that the implementation of RHODOS’ DSM system has met the goals of ease of use, ease of programming and transparency.

Much of the research effort in DSM has concentrated on improving the performance of the DSM system, particularly through more relaxed consistency models, rather than on researching a more integrated DSM design [12].
In order to extract the best performance from the DSM system, many implementations require programmers to go beyond their normal shared memory based practices. In particular, Midway [1] is one such system, which requires programmers to learn it thoroughly. Midway provides a DSM system which uses entry consistency. Memory that uses the entry consistency model is made consistent at the entry point to a critical section. The programmer must provide the system with some mechanism to identify the variables which are shared and need to be updated before the critical section may be entered. In addition to code development, programmers need to annotate variables and associate them with particular locks.

TreadMarks [13] uses a new consistency model, lazy release consistency, which delays updating the shared memory until entry to a critical region in which those data will be used. Lazy release consistency appears to improve the performance of applications. However, the implementation needs specialised hardware, since a large amount of memory is required on each workstation to store unapplied updates.


In addition, a presumably time-consuming garbage collection function must run at intervals in order to clear the memory of a backlog of updates. The frequency of the garbage collection runs is related to the size of the local memory: the smaller the memory, the more often this function must run. Our update-based DSM uses the release consistency model. TreadMarks uses locks rather than the semaphores used in RHODOS DSM. However, it is difficult to compare our speedups with those reported in the Keleher thesis and subsequent papers, as the network hardware used in the TreadMarks tests has a total throughput of 1.2 Gbps compared with our 10 Mbps Ethernet.

In [2] the results of tests comparing seven applications using message passing, “traditional DSM” and Munin on Sun 3/60s on a 10 Mbps Ethernet are reported. The “traditional DSM” used for this comparison is a page-based DSM which uses an invalidation-based protocol similar to RHODOS’ invalidation-based DSM. Thus, the results generated in this part of Carter’s tests can be compared with our performance results for TSP and Quicksort. For TSP using eight processors, the speedup shown by Carter is approximately 6.5, while our speedup is 4.22. For Quicksort using eight processors, Carter’s speedup is approximately 2.4, while our speedup is 2.8. The major difference between RHODOS’ invalidation-based DSM and Carter’s traditional DSM is that the latter’s synchronisation is implemented through locks which are attached to particular shared variables, instead of semaphores which define an area of code in which shared variables are accessed. Our solution provides the latter, semaphore-like, synchronisation. The different synchronisation methods could account for the variation in the speedup of TSP.

Our implementation of DSM separates the synchronisation mechanism from the paging mechanism, while Carter’s implementation integrates the two. In the latter the locks are implemented as variables within the shared memory. In order to gain access to the lock, a writable copy of the page containing the lock variable must be moved to the workstation requesting access. This page also contains the data to be protected if the lock variable declaration has been placed adjacent to the data being protected. Thus, when the requesting workstation obtains the lock it also gets a writable copy of the page it will require in the critical section. This is, in effect, prefetching the page. While this implementation has improved performance, the improvement is only seen if application programmers place the lock variable in the correct position. Carter’s implementation has sacrificed transparency for the sake of efficiency.

We have aimed to provide a DSM which requires as little additional input from the application programmer as possible, while maximising the efficiency of the system. Our solution provides transparency, one of the requirements we have identified as desirable for DSM design. Any application programmer who is familiar with shared memory programming using semaphores will be able to use RHODOS’ DSM with no additional instructions or understanding of the application.
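The semaphore-based style described above, in which a semaphore brackets a region of code rather than being attached to a particular variable, can be illustrated with ordinary Python threads. This is an analogy only: the threads stand in for DSM processes on separate workstations, the dictionary stands in for shared memory, and none of the names below are RHODOS primitives.

```python
import threading


def run_workers(data, n_workers=4):
    """Sum `data` with n_workers threads; only the shared total is synchronised."""
    shared = {"total": 0}                  # stands in for a shared variable
    mutex = threading.Semaphore(1)         # binary semaphore guarding a code region
    start = threading.Barrier(n_workers)   # pre-execution barrier
    finish = threading.Barrier(n_workers)  # post-execution barrier

    def worker(worker_id):
        start.wait()                       # begin only once every worker is ready
        local = sum(data[worker_id::n_workers])  # private work, no synchronisation
        mutex.acquire()                    # "wait": enter the critical section
        try:
            shared["total"] += local       # the only access to shared state
        finally:
            mutex.release()                # "signal": leave the critical section
        finish.wait()                      # leave only after all workers are done

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"]
```

Nothing in the worker says where the shared variable lives or how it is kept consistent. The programmer supplies only the barriers and the semaphore calls, which is the kind of input the text above describes.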


Munin, also described in Ref. [2], is an object-based DSM which has multiple consistency protocols. Programmers must select the protocol best suited to each variable in the application. For Quicksort on eight processors, Munin shows a speedup of 6.25, while our update-based DSM shows a speedup of 3.8. For TSP on eight processors, Munin shows a speedup of approximately 6.8, while our update-based DSM shows a speedup of 6.85. The multiple consistency protocol approach of Munin places a considerable load on application programmers. In order to extract optimal performance from the DSM, programmers must select the correct consistency protocol for each variable. This approach clearly requires the programmer to gain additional skills and to have a thorough insight into the implementation of the DSM system they are using in order to make successful use of it. In fact, programmers require a deeper insight into the application’s data sharing patterns than would normally be expected in shared memory programming.
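The eight-processor figures quoted in this section can be put side by side with a few lines of arithmetic. The sketch below computes the parallel efficiency (speedup divided by the number of processors) and the ratio of the RHODOS speedup to the other system’s speedup. It uses only numbers stated in the text, several of which are approximate, and the function and field names are illustrative. The measurements were taken on different hardware, so the ratios are indicative only.

```python
PROCESSORS = 8

# (application, other system, its speedup, RHODOS protocol, RHODOS speedup)
REPORTED = [
    ("TSP", "traditional DSM [2]", 6.5, "invalidation", 4.22),
    ("Quicksort", "traditional DSM [2]", 2.4, "invalidation", 2.8),
    ("TSP", "Munin [2]", 6.8, "update", 6.85),
    ("Quicksort", "Munin [2]", 6.25, "update", 3.8),
]


def efficiency(speedup, processors=PROCESSORS):
    """Parallel efficiency: the fraction of ideal linear speedup achieved."""
    return speedup / processors


def compare(rows=REPORTED):
    """Return one summary record per reported pair of speedups."""
    summary = []
    for app, other, s_other, protocol, s_rhodos in rows:
        summary.append({
            "application": app,
            "other": other,
            "protocol": protocol,
            "other_efficiency": efficiency(s_other),
            "rhodos_efficiency": efficiency(s_rhodos),
            "ratio": s_rhodos / s_other,  # above 1 means RHODOS is faster
        })
    return summary


if __name__ == "__main__":
    for row in compare():
        print(f"{row['application']:<10} {row['other']:<20} "
              f"RHODOS {row['protocol']:<12} "
              f"efficiency {row['rhodos_efficiency']:.2f} "
              f"vs {row['other_efficiency']:.2f}, ratio {row['ratio']:.2f}")
```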

5. Conclusions

Distributed shared memory provides a convenient abstraction for shared memory on a COW in which the memory is physically distributed. If developed properly, a DSM system can improve the performance of application execution. The issue is whether the programming can be made easy and transparent and the execution efficient.

In this paper we have reported on the programming and performance aspects of RHODOS’ DSM. RHODOS’ DSM exhibits good transparency and programmability, as it allows application programmers to write code in a familiar environment without the need to gain additional skills or to understand the underlying DSM mechanisms. Programmers are able to choose sequential consistency through the use of invalidation-based DSM or release consistency through the use of update-based DSM. Both DSM implementations are integrated into the RHODOS operating system. The programmer’s input is in the form of initialisation code and synchronisation primitives.

Our results demonstrate that even for a system made up of workstations with small memory, both invalidation- and update-based DSM achieve good speedups: between 2.8 and 6.9 on eight workstations, depending on the application and protocol. The performance of applications using invalidation-based DSM is worse than that of applications using update-based DSM, despite the fact that so many messages are needed to carry out the memory operations for update-based DSM. Operating system-based DSM makes operations transparent and almost entirely removes the involvement of the programmer beyond the classical activities needed to deal with shared memory; only barrier-based synchronisation of parallel processes is needed. From all these results we can conclude that DSM, when implemented within a distributed operating system, is one of the most promising approaches to parallel processing and can deliver substantial performance improvements with minimal involvement of the programmer.

References

[1] B. Bershad, M. Zekauskas, W. Sawdon, The Midway distributed shared memory system, in: Proceedings of the IEEE COMPCON Conference, IEEE, 1993.
[2] J. Carter, J. Bennett, W. Zwaenepoel, Techniques for reducing consistency-related communication in distributed shared-memory systems, ACM Transactions on Computer Systems 13 (3) (1995).
[3] D. De Paoli, M. Hobbs, A. Goscinski, Microkernel and kernel server support for parallel execution and global scheduling on a distributed system, in: Proceedings IEEE First International Conference on Algorithms and Architectures for Parallel Processing, April 1995.
[4] J. Silcock, A. Goscinski, The influence of the ratio of processes to workstations on the performance of DSM, in: Proceedings of the 21st Australasian Computer Science Conference (ACSC ’98), February 1998.
[5] J. Silcock, A. Goscinski, Invalidation-based distributed shared memory integrated into a distributed operating system, in: Proceedings of IASTED International Conference Parallel and Distributed Systems (Euro-PDS ’97), June 1997.
[6] J. Silcock, A. Goscinski, Update-based distributed shared memory integrated into RHODOS’ memory management, in: Proceedings of the Third International Conference on Algorithms and Architecture for Parallel Processing (ICA3PP ’97), December 1997.
[7] J. Silcock, A. Goscinski, Performance studies of distributed shared memory embedded in the RHODOS’ operating system, in: Proceedings of the Fourth Australasian Conference on Parallel and Real-Time Systems (PART ’97), September 1997.
[8] C. Amza, A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel, TreadMarks: shared memory computing on networks of workstations, IEEE Computer 29 (2) (1996) 18–28.
[9] H. Lu, Message passing versus distributed shared memory on networks of workstations, PhD Thesis, Rice University, May 1995.
[10] P. Keleher, Personal communication, April 1996.
[11] A. Goscinski, P. Horan, S. Kutti, D. Newlands, J. Teague, G. Webb, W. Zhou, Computer science/software development and information systems curricula within the School of Computing and Mathematics, School of Computing and Mathematics, Deakin University, August 1995.
[12] L. Iftode, J.P. Singh, Shared virtual memory: progress and challenges, Technical Report, TR-552-97, Department of Computer Science, Princeton University, October 1997.
[13] P. Keleher, Lazy release consistency for distributed shared memory, PhD Thesis, Rice University, January 1995.

Jackie Silcock received a B.Sc. degree in Chemistry from the University of Cape Town, South Africa. She completed a Graduate Diploma of Computing at Deakin University. Her Ph.D. has been approved by the examiners and will be awarded in October, 1998. The topic of her thesis was the design and implementation of a user friendly and efficient Distributed Shared Memory system integrated into a distributed operating system. Her current research areas are distributed processing on clusters of workstations and Distributed Shared Memory. She teaches undergraduate courses in basic programming, data structures and information systems in organisations.

Andrzej M. Goscinski received the M.Sc. degree in Automatic Control, the Ph.D. degree in Control Engineering and Computer Science and the D.Sc. degree in Computer Science from the Staszic University of Mining and Metallurgy, Krakow, Poland. In January 1992, he took up a Chair in Computing at Deakin University. At the beginning of 1993 he became Head of School. Professor Goscinski is actively involved in research and teaching. He is a member of ACM and IEEE Computer Society. Professor Goscinski’s current research activities are in parallel and distributed processing on clusters of workstations, distributed operating systems, applications of distributed systems, networks, and communication protocols. He has been leading research projects (i) to develop a theory of distributed operating systems; (ii) to study, design and implement a high performance microkernel-based distributed operating system (RHODOS) in order to study design issues of distributed operating systems; and (iii) to study parallel execution environments and parallelism management in applications executing on clusters of workstations. He has published extensively in refereed journals and conference proceedings, and presented papers at conferences. He teaches undergraduate courses in data networks, operating systems and distributed systems. He supervises postgraduate students who carry out research leading toward PhD and MSc degrees.