An efficient logging and recovery scheme for lazy release consistent distributed shared memory systems

Future Generation Computer Systems 17 (2000) 265–278

Taesoon Park a, Heon Y. Yeom b,∗

a Department of Computer Engineering, Sejong University, Seoul 143-747, South Korea
b Department of Computer Science, School of Computer Science and Engineering, Seoul National University, San 56-1, Sinlim-dong, Gwanak-gu, Seoul 151-742, South Korea

Received 12 May 2000; received in revised form 29 May 2000; accepted 20 June 2000

夽 An earlier version of this work appeared in the Proceedings of the Ninth Symposium on Parallel and Distributed Processing, 1998.
∗ Corresponding author. Tel.: +82-2-880-5583; fax: +82-2-871-4912. E-mail address: [email protected] (H.Y. Yeom).

Abstract

Checkpointing and logging are widely used techniques for providing fault-tolerance in distributed systems. However, logging can impose too much overhead on normal processing to be a practical solution. In this paper, we propose a low-overhead logging scheme for distributed shared memory systems based on the lazy release consistency memory model. Unlike previous schemes, in which logging is performed whenever a new data item is accessed by a process, stable logging in the proposed scheme is performed only when a lock grant causes an actual dependency relation between processes, which significantly reduces the logging frequency. Also, instead of making a stable log of the accessed data items, a process stably logs only some access information, while the accessed data items are saved in the volatile log. For recovery from a failure, the correct version of the accessed data items can be effectively traced using the logged access information. As a result, the amount of logged information is also reduced. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Distributed shared memory system; Fault-tolerance; Message logging; Lazy release consistency; Rollback-recovery

1. Introduction

Distributed shared memory (DSM) systems [16] provide a simple means of programming for networks of workstations, which are gaining popularity due to their cost-effective high computing power. However, as the number of workstations participating in a DSM system increases, the probability of failure also increases, which could render the system useless for long-running applications. Hence, for DSM systems to be of any practical use, it is important for the system to be recoverable, so that the processes do not have to restart from the beginning when there is a failure [27].

One approach to providing fault-tolerance for DSM systems is to use checkpointing and rollback recovery. Checkpointing is an operation that saves intermediate system states into stable storage, which is not affected by system failures. With periodic checkpointing, the system can recover to one of the saved states, called a checkpoint, after a failure, instead of restarting the computation from the initial state. The activity of resuming the computation from one of the previous checkpoints is called rollback.

In a DSM system, the computational state of a process becomes dependent on the state of another process through accesses to a common data item.


An important factor characterizing the inter-process dependency is the memory consistency model. Though the DSM system has its own merits, remote memory accesses still have a performance impact, and hence many relaxed memory consistency models have been developed for DSM systems [2,7,12]. The lazy release consistency (LRC) memory model is one of these relaxed models. It relaxes traditional sequential consistency by allowing copies of the same data on different nodes to be temporarily inconsistent; inter-process dependency hence arises only when the data accesses are explicitly synchronized by lock release and acquire operations.

Such computational dependency in DSM systems makes a set of related processes roll back together to reach a consistent recovery line when a process fails. The processes may even have to roll back recursively to reach the recovery line, which is called a domino-effect [19], unless the checkpointing activities of those processes have been carefully coordinated. In the worst case, the consistent recovery line consists of the set of initial points, i.e., the total loss of the computation in spite of the checkpointing effort.

One solution to cope with the domino-effect is coordinated checkpointing, in which, each time a process takes a checkpoint, it coordinates with the related processes so that they take consistent checkpoints together [1,3,4,8,10,14]. Since each checkpointing coordination under this approach produces a consistent recovery line, processes are never involved in a domino-effect. One drawback of this approach is that the processes have to be blocked from normal computation during the coordination. Also, some rollback propagation, though limited, still occurs. Another solution is communication-induced checkpointing, in which a process takes a checkpoint whenever a new dependency on another process occurs [9,24,26,27]. This scheme ensures no domino-effect, since there is a checkpoint for each communication point, and it also ensures no rollback propagation to the other processes. However, the overhead caused by frequent checkpointing may severely degrade the system performance.

Yet another solution is to use message logging with independent checkpointing [20]. Processes in this scheme log all the accessed data items on stable storage, in addition to taking periodic checkpoints. After a failure occurs, the process rolls back to its latest checkpoint and regenerates the same sequence of computation by reprocessing the logged data.

Since the process can reproduce exactly the same computation even after a failure, the rollback of one process does not affect the other processes. However, since the logging itself causes non-negligible overhead, many schemes have been suggested to reduce the amount of logged data and the frequency of logging. For the sequentially consistent DSM system, instead of logging the accessed data items repetitively, it has been suggested that the process log each accessed data item once, when it is first accessed, and log the valid duration of the data when the item is invalidated [25]. To further reduce the logging overhead, in [17], logging of the accessed data item and its valid duration is delayed until the item is invalidated in the system, and in [11], the data item is logged when it is written, instead of each time it is accessed. In [18], the logging overhead is substantially reduced by logging the data item at the site where it is produced, at the time when the item is invalidated.

For the lazy release consistent DSM system, three logging schemes have been proposed in the literature: shared-read logging (SRL) [20], shared-access tracking (SAT) [25], and lightweight logging (LWL) [6]. In SRL, all the data items accessed by a process are logged, while SAT tries to reduce the logging overhead by logging the accessed data items and their valid duration only when a new dependency relation may be formed, such as at lock release time or when a process transfers a data item to another process. LWL, on the other hand, logs the data items and their valid duration in the volatile memory of the writer process to remove the overhead of stable logging. However, this scheme incurs a higher recovery cost, since the log is distributed over the system, and it cannot handle concurrent failures of a reader's site and a writer's site.

In this paper, we propose a new logging scheme for DSM systems with the LRC memory model. The proposed scheme incurs much lower logging overhead than schemes such as SRL and SAT, and, unlike LWL, it tolerates multiple site failures. In the LRC memory model, inter-process dependency can occur only when the data accesses are explicitly synchronized by lock operations.


Hence, we suggest that logging be performed only when a lock release operation following a write is performed by a process, which reduces the frequency of stable logging. Also, volatile logging of the data items is performed by the writer process. Only the access information, such as the valid duration of the data and the vector time used in the LRC model, is logged into stable storage. The correct version of a data item needed during recomputation can be traced using the vector time. As a result, the amount of the stable log can be drastically reduced.

In the following section, the system model and the definition of a consistent recovery line are introduced, and in Section 3, the proposed logging and recovery scheme is presented. The performance of the proposed scheme is discussed in Section 4, using simulation results as well as results from an implementation. Section 5 concludes the paper.

2. Background

2.1. System model

We consider a DSM system consisting of a number of nodes connected through a communication network. Each node consists of a processor, a volatile main memory and a non-volatile secondary memory. The processors in the system do not share any physical memory and communicate by message passing.


The communication subsystem is assumed to be reliable, i.e., the message delivery is error-free and virtually lossless. However, no assumption is made on the message delivery order. Fig. 1 shows a typical DSM configuration.

The nodes in the system are fail-stop [21], i.e., when a node fails, it simply stops and does not perform any malicious actions. The failures considered in the system are transient, i.e., when a node recovers from a failure and re-executes the computation, the same failure is not likely to occur again. We also assume an independent failure mode, in which the failure of one node does not affect other nodes. We do not make any assumption on the number of simultaneous node failures. When a node fails, the register contents and the main memory contents are lost. However, the contents of the secondary memory are preserved, and the secondary memory is used as a stable storage.

Logically, the system can be viewed as a set of processes running on the nodes and communicating by accessing a logically shared memory. Each process can be considered as a sequence of state transitions from the initial state to the final state. An event is an atomic action that causes a state transition within a process, and a sequence of events is called the computation. In a DSM system, the computation of a process can be characterized as a sequence of read/write operations accessing the shared memory.

Fig. 1. A distributed shared memory system.


The computation performed at each process is assumed to be piecewise deterministic, i.e., the computational states generated by a process are fully determined by the sequence of data values provided for its sequence of read operations.

The shared memory space the system provides is composed of a set of fixed-size pages. For memory consistency, we assume the invalidation-based lazy release consistency model [12]. A number of different memory semantics for DSM systems have been proposed, including sequential [16], processor, weak, and release consistency [2], as well as causal coherence [7]. However, in this paper, we focus on the LRC memory model for its enhanced performance.

In the LRC model, concurrent and overlapping executions of read/write operations are allowed, and hence the actual order of the operations on the shared data may not be the one expected; such data races may result in unexpected program outputs. To guarantee the correct execution order between conflicting operations, synchronization operations, such as the acquire, release, and barrier operations, are used in the program. By appropriately inserting lock acquire and lock release operations, the execution order among the conflicting operations can be made sequential. To synchronize the execution timing of all the processes, a barrier operation is used.

Also, in the LRC model, the invalidation of a data page by a write operation is observed lazily by other processes. If a process writes on a data page and releases a lock, another process invalidates its old copy of the page and can read the new data page only after it acquires the lock. The LRC model also employs a multiple-reader, multiple-writer protocol to avoid the ping-pong effect, in which two processes competing to write on one data page keep transferring the page back and forth. Hence, in this model, each of two processes can have a writable copy in its local memory, as long as the written portions of the page do not overlap.
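To make these semantics concrete, the following minimal sketch illustrates the write–release–acquire–read ordering that LRC guarantees. The names dsm_acquire, dsm_release and shared_x are hypothetical (stubbed here with a local mutex so the sketch compiles); they are not the API of any particular DSM package.

#include <mutex>

// Hypothetical DSM primitives, stubbed with a local mutex so this compiles;
// a real DSM package provides distributed implementations of these calls.
static std::mutex dsm_locks[8];
void dsm_acquire(int lock) { dsm_locks[lock].lock(); }   // a real DSM also applies write notices here
void dsm_release(int lock) { dsm_locks[lock].unlock(); } // a real DSM records the interval here

int shared_x = 0;   // stands in for a location on a shared page

// Process p_i: the write goes to p_i's local writable copy of the page.
void producer() {
    dsm_acquire(0);
    shared_x = 42;      // W(X)
    dsm_release(0);     // the modification becomes visible to the next acquirer
}

// Process p_j: acquiring the same lock invalidates p_j's old copy, so the
// following read faults, fetches the diffs, and is guaranteed to see 42.
void consumer() {
    dsm_acquire(0);
    int v = shared_x;   // R(X)
    dsm_release(0);
    (void)v;
}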

2.2. Consistent recovery line

We now define a state interval, denoted by I(i, a), as the computation sequence between the (a − 1)th and the ath synchronization operations of a process pi, where a ≥ 1 and the 0th synchronization means the initial state of pi. In DSM systems based on the LRC model, the effect of a write operation in I(i, a) can be viewed by another process pj either after a release operation following I(i, a) and an acquire operation following the release are performed by pi and pj, respectively, or after pi and pj perform the same barrier operation following I(i, a). Hence, the computational dependency between state intervals can be defined as follows:

Definition 1. A state interval I(i, a) is dependent on another state interval I(j, b) if one of the following conditions is satisfied:
1. i = j and a = b + 1.
2. I(j, b) ends with a release(x) and I(i, a) begins with an acquire(x), and pi in I(i, a) accesses a data page written by pj in I(j, b).
3. I(j, b) ends with a barrier(x) and I(i, a) begins with a barrier(x), and pi in I(i, a) accesses a data page written by pj in I(j, b).
4. I(i, a) is dependent on I(k, c) and I(k, c) is dependent on I(j, b).

The dependency relation between state intervals may cause inconsistency problems when a process rolls back and performs its recomputation. Fig. 2 shows a typical example of inconsistent rollback recovery [5]. The notations R(X1) and W(X1) in the figure represent the read and write operations on a data page X1, respectively, and the notations U(A) and L(A) represent the release and acquire operations on a lock A, respectively. Now, suppose the process pi in Fig. 2 rolls back to its latest checkpoint Ci due to its own failure but cannot regenerate the same computation for the W(X1) performed before the failure. Then, the consistency between pi and pj is violated, since pj's computation depends on pi's invalidated computation. Such a case is called an orphan message case, and a process is said to recover to a consistent recovery line if no process in the system is involved in an orphan message case after the rollback recovery.

Fig. 2. An inconsistent recovery line.


3. Protocol description

3.1. Logging protocol

Processes performing a piecewise deterministic computation can regenerate the same computation sequence if the sequence of data values used for the shared memory accesses can be logged and replayed at the same access points [20]. Hence, for a consistent recovery, each process must log two types of information: one is the contents of the data pages it has accessed, and the other is the access information, such as the computational point at which each page has been accessed.

A data page can be logged either at the process which accessed it (a reader process) or at the process which produced it (a writer process). If the reader process logs the page, the log must be stable, since it has to survive the reader's own failure. However, if the writer process logs the data page, a volatile log can be used, since the page is needed to recover from the reader's failure, not the writer's own; in case the writer process fails, it can regenerate the same contents of the page by performing a correct recovery. Moreover, since a data value written by one process is usually read by many processes, it is more efficient for the one writer to log the value than for the many readers to log it.

For the volatile logging of data pages, the diff structure which the LRC memory model provides can be used. In the LRC model, when a process writes on a data page, it first creates a copy of the page, called a twin, and then performs the write operation. Later, when another process requests the data page, the modified page is compared with its twin, and the modified portion of the page, called a diff, is created and sent to the requesting process. The requesting process collects the diffs from every process which has written on the page, and applies them in chronological order to create the up-to-date version of the page. Such a diff structure is maintained in the volatile storage of the writer process until the system periodically discards the diffs which are no longer required, which is called garbage collection. Hence, in the proposed scheme, the diffs maintained in the volatile storage are used as the volatile log of data pages, and only the diffs discarded by the garbage collection are saved into stable storage, as a part of checkpoints.
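As an illustration of the twin/diff mechanism just described, the following sketch (our own simplification, not CVM's code) creates a twin before the first write, encodes the modified bytes as a diff for the writer's volatile log, and applies a received diff at the reader.

#include <cstdint>
#include <cstring>
#include <vector>

const size_t PAGE_SIZE = 4096;

struct Diff {                   // one run of modified bytes; real systems use a compact encoding
    size_t offset;
    std::vector<uint8_t> bytes;
};

// Before the first write, keep an unmodified copy of the page (the "twin").
void make_twin(const uint8_t* page, uint8_t* twin) {
    std::memcpy(twin, page, PAGE_SIZE);
}

// When another process requests the page, encode only the modified bytes.
// The writer keeps the returned diffs in its volatile log until garbage collection.
std::vector<Diff> make_diffs(const uint8_t* page, const uint8_t* twin) {
    std::vector<Diff> diffs;
    size_t i = 0;
    while (i < PAGE_SIZE) {
        if (page[i] != twin[i]) {
            Diff d;
            d.offset = i;
            while (i < PAGE_SIZE && page[i] != twin[i]) d.bytes.push_back(page[i++]);
            diffs.push_back(d);
        } else {
            ++i;
        }
    }
    return diffs;
}

// The reader applies the diffs collected from all writers, in chronological
// (vector-time) order, to build the up-to-date version of the page.
void apply_diff(uint8_t* page, const Diff& d) {
    std::memcpy(page + d.offset, d.bytes.data(), d.bytes.size());
}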


The other information to be logged is the association of each data access point with the correct version of the data. For an efficient implementation, the vector time [22] employed in the LRC memory model is directly used to represent this information. The vector time Ti of a process pi is an array of integers, Ti = (ti1, . . . , tii, . . . , tin), where n is the number of processes in the system and each entry of Ti is initialized to zero. The value of tii in Ti is incremented by one when pi releases a lock following any write operation, and each tik is updated to the maximum of tik and tjk in Tj when pi acquires a lock from the last releaser of the lock, say pj. As a result, the vector time associated with a synchronization operation reflects the causal order between those operations.

In the LRC based DSM system, the inter-process dependency specified in Definition 1 is reflected in the vector time associated with each state interval. For example, if each entry of Ti associated with an interval I(i, a) is less than or equal to the corresponding entry of Tj for the interval I(j, b), then I(j, b) must be dependent on I(i, a), and vice versa. That is, if a process accesses a data page when its vector time is T, the page must reflect all and only the updates (diffs) made before the time T. Hence, if the values of the vector time associated with the diffs and the data access points are logged, the correct version of the shared data can be retrieved during the recovery. Moreover, since the vector time can be updated only when a synchronization operation is performed, the logging needs to be performed only for the synchronization operations.

To log the vector time associated with each synchronization operation, each process pi maintains a synchronization operation counter, denoted by Syni, which counts the synchronization operations that have happened at pi. When pi performs a synchronization operation, the new vector time is logged with the value of Syni, if the vector time of pi has been updated. Fig. 3 shows an example of the logging for a system consisting of three processes, pi, pj and pk. Ti:a = (i, j, k) in Fig. 3 denotes the vector time of pi with Syni value a. Since the vector time needs to be logged only when its value is updated, four logging operations are performed in the system.

From the figure, we can easily see how the processes can recover from a failure. For example, suppose that process pk fails and has to recover up to the release operation of lock A (denoted by U(A)).


Fig. 3. An example of logging.

Then, from the log, pk can retrieve the information that its first lock acquire operation (denoted by L(A)) is associated with the vector time (1,1,0), and hence it knows that for R(X) and R(Y), it has to fetch the diffs made before the vector time (1,1,0). Since each diff in the system carries the vector time of its creation, pk can safely include the diffs made by pi and pj before vector time (1,1,0) for its read operations.

However, unlike the data page contents (diffs), the access information cannot be kept in the volatile storage of the writer, since such information cannot be regenerated after the writer's failure. In case the reader process makes a stable log of this information, the frequency of logging becomes an important performance factor, even though the amount of information to be logged is very small. To reduce the logging frequency, in the proposed scheme, the vector time associated with the Syn value is temporarily logged in the volatile storage of the reader process, and stable logging is performed only when the process gets a new dependent. If a process has lost some state intervals due to a failure, the processes dependent on the lost intervals have to roll back together to reach a consistent recovery line [15]. However, if a process has lost some state intervals but no process depends on them, then arbitrary recomputation of those intervals does not cause any inconsistency with the other processes; that is, there is no need for logging those state intervals. Hence, a process can delay the logging of the access information for some state intervals until there is a process dependent on those intervals. From the definition, a state interval I(i, a) can have a dependent interval I(j, b) only if the following sequence of operations is performed: W(X) in I(i, a); U(A) (or barrier(A)) following I(i, a); L(A) (or barrier(A)) at pj following the U(A) of pi; I(j, b) following the L(A) at pj; and then R(X) in I(j, b).

To perform stable logging efficiently, each process pi in the proposed scheme maintains a write-since-last-logging flag, denoted by WSLi, which indicates whether there has been a write operation since the last stable logging of pi. WSLi is set to one when pi performs a write operation and reset to zero when pi performs stable logging. Process pi performs stable logging only when pi grants a lock to another process and WSLi is equal to one. Hence, in Fig. 3, volatile logging is performed at four computational points, but stable logging is performed at only two synchronization points, the U(A) of pi and the U(A) of pj. The proposed logging protocol is summarized in Fig. 4.
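Putting the rules together, the following sketch shows how the vector time, the Syn counter, and the WSL flag might interact. All names are ours, and the stable flush is deliberately confined to the lock-grant path, as in Fig. 4.

#include <algorithm>
#include <utility>
#include <vector>

struct LoggingState {
    int id;                                   // this process's index i
    std::vector<int> T;                       // vector time T_i, all zeros initially
    int syn = 0;                              // Syn_i: synchronization operation counter
    bool wsl = false;                         // WSL_i: write since last stable logging
    std::vector<std::pair<int, std::vector<int>>> vlog;  // volatile log: (Syn, vector time)

    LoggingState(int i, int n) : id(i), T(n, 0) {}

    void on_write() { wsl = true; }           // set WSL_i on every write

    // Lock release following any write: advance t_ii and log (volatile only).
    void on_release(bool wrote_since_last_release) {
        ++syn;
        if (wrote_since_last_release) {
            ++T[id];
            vlog.emplace_back(syn, T);
        }
    }

    // Lock acquire from the last releaser p_j: component-wise maximum; the
    // write notices received here would be logged alongside the vector time.
    void on_acquire(const std::vector<int>& Tj) {
        ++syn;
        for (size_t k = 0; k < T.size(); ++k) T[k] = std::max(T[k], Tj[k]);
        vlog.emplace_back(syn, T);
    }

    // Granting the lock to another process is what creates the actual
    // dependency: flush the volatile access information to stable storage,
    // but only if a write has happened since the last flush.
    void on_lock_grant() {
        if (wsl) {
            // write vlog entries to stable storage here (omitted in this sketch)
            wsl = false;
        }
    }
};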

Fig. 4. The proposed logging protocol.


3.2. Checkpointing

Each process in the system periodically takes a checkpoint to reduce the amount of recomputation in case of a system failure. A checkpoint includes the intermediate state (context) of the process, its current vector time, the values of Syn and WSL, the contents of the data values (diffs) the process currently holds, and any other information required for the memory consistency protocol, as sketched below. The checkpointing among the related processes need not be performed in a coordinated way. However, if the checkpointing is incorporated into the barrier operation or the garbage collection, its overhead can be reduced.
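Under these assumptions, a checkpoint could be represented roughly as follows; the field names and types are illustrative only, chosen to mirror the list above.

#include <cstdint>
#include <vector>

struct DiffRecord {                    // a volatile-log diff with its creation vector time
    int page_id;
    std::vector<int> vector_time;
    std::vector<uint8_t> bytes;        // encoded modified portion of the page
};

struct Checkpoint {
    std::vector<uint8_t> context;      // process context: registers, stack, heap image
    std::vector<int> vector_time;      // current T_i
    int syn;                           // current Syn_i
    bool wsl;                          // current WSL_i
    std::vector<DiffRecord> diffs;     // diffs the process currently holds
    // ... plus any other state the consistency protocol needs (e.g., page tables)
};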

3.3. Rollback-recovery

When a process recovers from a failure, it first restores its state using the latest checkpoint and begins the recomputation as follows (a sketch of the replay step follows this description):

At the lock acquire time: The process increments its Syn value by one and searches its log for the entry whose sequence number matches the current Syn value. It then resets its current vector time to the one saved in the log entry.

At a data page access miss: The process broadcasts a data page request to the other processes with its current vector time. Each of the other processes then replies with the diffs of the data page it has produced before the given vector time. The recovering process arranges the received diffs in timing order and applies them to the data page to create an up-to-date version, as performed during normal execution. The created version can be used until the vector time of the process changes. To prevent unnecessary invalidation of the created data pages, the write notices obtained at the lock acquire time can be used. The write notices include the identifiers of the data pages which have been invalidated since the last acquire operation of the process. Hence, by logging the write notices with the vector time at the acquire point, the process can invalidate only the necessary pages at acquire time during the recovery. Logging the write notices slightly increases the amount of data to be logged; however, it does not affect the frequency of logging.

At the lock release time or at a barrier: The process increments its Syn value and searches the log for the entry whose sequence number matches its current Syn value. It then resets its current vector time to the one saved in the log entry. Since, for each synchronization operation, the process can retrieve the same vector time as the one created before the failure, it can retrieve the same diffs from the other processes and create the same data pages for its read and write operations.
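A minimal sketch of the replay step at each synchronization operation follows, assuming the write notices were logged together with the vector time as described above; stable_log, invalidate, and the entry layout are our illustrative names.

#include <map>
#include <vector>

struct LogEntry {
    std::vector<int> vector_time;      // vector time saved before the failure
    std::vector<int> write_notices;    // page ids invalidated since the last acquire
};

struct RecoveringProcess {
    std::map<int, LogEntry> stable_log;  // Syn value -> logged access information
    std::vector<int> T;                  // vector time being replayed
    int syn = 0;

    // Called at every replayed synchronization operation (acquire, release, barrier).
    void on_sync() {
        ++syn;
        auto it = stable_log.find(syn);  // an entry exists only if T was updated here
        if (it != stable_log.end()) {
            T = it->second.vector_time;           // restore the pre-failure vector time
            invalidate(it->second.write_notices); // drop only the pages invalidated then
        }
    }

    // On a page access miss, the request is broadcast with T; every process
    // replies with the diffs it produced no later than T, and the diffs are
    // applied in timing order exactly as in normal execution.
    void invalidate(const std::vector<int>& /*pages*/) { /* drop stale local copies */ }
};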


Fig. 5. The rollback-recovery protocol.


The rollback-recovery protocol is summarized in Fig. 5.

3.4. Correctness

We now prove the correctness of the proposed logging and rollback-recovery scheme.

Theorem. The recovery line produced under the proposed scheme is consistent.

Proof. For a recovery line to be consistent, there must not be any orphan message case in the system. In the proposed scheme, the vector time of a state interval is used as the identifier for the data values created during that interval, and it is also used as the identifier for each data access point. Since the vector time is stably logged with a unique sequence number for each state interval, the data values and their identifiers can be retrieved after a failure. Hence, it is enough to prove that any access information indicating a dependency relation is stably logged before the dependency is actually formed. In the proposed scheme, a process performs stable logging when it receives an acquire request from another process. Hence, all access information preceding the acquire request is stably logged before the acquire of the other process, at which point the actual dependency is formed. Therefore, for any dependency relation, the data values and the other access information can be retrieved after a failure, and the recovery line produced under the proposed scheme is consistent. □

4. Performance study

To verify the claim that the proposed logging scheme reduces the logging overhead, two sets of experiments have been performed. A simple trace-driven simulator based on MINT has been built to examine the logging behavior of various parallel programs running on a DSM system, and the logging protocols have then been implemented on top of the CVM system to measure the effects of logging under an actual system environment. We have compared the behavior of the proposed scheme with the schemes proposed in [20,25].

4.1. Simulation environments

We have built a trace-driven simulator using MINT and compared the performance of the new logging scheme with the SRL and SAT schemes. There is one more scheme for the lazy release consistent model, LWL [6]; however, we did not include LWL because it cannot handle multiple failures. The logging schemes compared in the simulation analysis are as follows:

Shared-read logging (SRL) [20]: Each process logs each accessed data value to its volatile memory whenever it performs the access operation. The log in the volatile memory is flushed to stable storage when the process transfers a data value it has written to another process.

Shared-access tracking (SAT) [25]: Each process logs each accessed data value to its volatile memory once, when it is transferred from another process, and logs the write notices at lock acquire time. The log in the volatile memory is flushed to stable storage when the process transfers a data value it has written to another process, or when it receives a lock acquire request after releasing the lock.

Reduced-stable logging (RSL): This is the scheme we propose in this paper. Each process logs only the vector time and write notices into the volatile log when it performs a synchronization operation. Stable logging is performed only when a new dependency relation is actually formed, that is, when the process transfers its write notices for the lock acquire of another process following its own lock release.

We have run the simulation with two different sets of traces. The first is a trace we have synthetically generated using random numbers, and the second is obtained from real parallel program executions. Two performance measures are used: one is the amount of diffs which have to be logged, and the other is the frequency of stable logging. From the simulation results, it is evident that the new logging scheme incurs much smaller overhead for both measures.

4.2. Simulation using synthetic traces

To generate the synthetic traces, a system consisting of 16 processes and 5000 data pages has been simulated. Each process consists of a sequence of read/write operations on the data pages.


Fig. 6. The effect of R/W ratio and locality on the amount of logged data.

To generate the sequence, a read/write ratio and a locality parameter are used. A read/write ratio of 0.9 means that 90% of the operations are reads and 10% are writes. A locality of 0.9 means that 90% of the operations are on local data pages. A synchronization operation is generated for every ten read/write operations, and eight different locks are used in the simulation. One simulation run consists of 10 000 workload records, and the simulation was repeated with various read/write ratio and locality values.

Fig. 6 shows the effect of the read/write ratio and the locality on the amount of data values to be logged. In the SRL scheme, data pages used for read operations, as well as pages transferred from another site for write operations, are logged. Hence, the figure shows 100% logging for a read ratio of 100%, and the amount of log decreases as the write ratio increases. In the SAT scheme, only the pages transferred after an access miss are logged. Hence, as the write ratio increases, which increases the data page invalidations, the amount of log also increases. In both cases, the figure shows performance degradation as the locality decreases. Since low locality means more accesses to data pages at remote sites, there must be more data transfers, which increase the logging.

For the various read/write ratio and locality values, the proposed scheme shows no stable logging of the data pages.

Fig. 7 shows the frequency of stable logging, which counts the accesses to stable storage not only for the data values but also for the other access information. Comparing the three schemes, the SAT scheme shows the most drastic increase, since in that scheme not only the data value transfers but also the write notice transfers cause logging. In the SRL scheme, only the data value transfers cause logging, and in the proposed scheme, only the write notice transfers cause logging. Compared with data value transfers, write notice transfers are much less frequent. Moreover, in the proposed scheme, stable logging is not performed for every write notice transfer, but only when there has been a write operation.

4.3. Simulation using parallel program traces

To further validate our claim, we have also run the simulation using traces from a multiprocessor execution. The traces contain references produced by a 32-processor MP running four programs: FFT, barnes-hut, mp3d and radix.


Fig. 7. The effect of R/W ratio on the frequency of logging.

Using these traces, we have run the simulation for the three schemes. Fig. 8 shows the logging frequency of the three schemes. It can be seen that the logging frequency of the proposed scheme is much lower than that of the other schemes: compared to the other schemes, about 50–90% of the logging frequency can be reduced in the proposed scheme. Looking at the performance of SAT and SRL, we can see that their performance is affected by the characteristics of the application program. With FFT, where the data locality is low, SAT outperforms SRL, while the reverse is true for the other applications, since they tend to show high data locality.

Fig. 8. The frequency of logging for MP traces.


4.4. Experimental results from implementation

To examine the performance of the proposed logging protocol under an actual system environment, the proposed logging protocol (RSL) and the protocol proposed in [25] (SAT) have been implemented on top of a DSM system. To implement an LRC based DSM system, we use the CVM (coherent virtual machine) package [13], which supports the lazy release consistency memory models as well as the sequential consistency memory model. CVM is written in C++ and well modularized, so it was straightforward to add the logging scheme. The basic high level classes are the CommManager class and the Msg class, which handle the network operations; the MemoryManager class, which handles the memory management; and the Page class and the DiffDesc class, which handle the page management. The protocol classes, such as LMW, LSW and SEQ, inherit from the high level classes and support operations according to each consistency protocol. For our experiment, we have selected the LMW (lazy release consistency + multiple writer) protocol. The classes related to the LMW protocol are LmwInterval, LmwNotice, DiffHeap, DiffManager, IntervalManager, LmwProtocol and LmwPage. We have modified the LmwInterval, IntervalManager and LmwProtocol classes so that remote accesses are logged.

Our experimental environment is a 40-processor IBM SP/2 running AIX 4.1.2 with 5.8 GB of total memory and 134 GB of disk space. Eight of the processors are wide nodes with 512 MB of memory, and the rest are thin nodes with 128 MB of main memory. All the processors are connected through the IBM SP/2 high-performance switch (HPS). The HPS is a two-level cross-bar switch and provides a point-to-point bandwidth of 40 MB/s [23]. We have used eight thin nodes residing on the same frame for our experiment.

We have run three application programs, barnes-hut, FFT, and TSP, with and without logging, compared the execution times of the programs, and measured the amount of logged data for each application. To measure the effect of the system size on the performance, we have run each application with different numbers of nodes: two, four, and eight.


Table 1
The logging overhead (SAT vs RSL)

            Logging frequency          Logging amount
            SAT        RSL             SAT            RSL
Barnes      266        84              57 284 456     4 480
FFT         182        72              50 740 800     2 940
TSP         717        637             43 208 332     13 920

Table 1 shows the overhead of the two logging schemes in terms of the logging frequency and the amount of logged data. The amount of logged data is closely related to the number of acquire/release operations and barrier operations performed during the execution of the applications. Since RSL does not require data value logging, there is a huge difference in the logging amount, even though the difference in logging frequency compared with SAT is relatively small.

Figs. 9–11 show the execution times of the application programs barnes, FFT, and TSP, respectively, with and without logging. From the figures, we can observe that in all cases the execution time under the proposed logging scheme is only slightly longer than the one under no logging. Analyzing the overhead of the proposed logging, each process spends some CPU time to insert its current vector time into the volatile log space when a synchronization operation is performed, and to copy the write notices into the volatile log space at acquire time. Since the CPU time spent for each synchronization operation is very small, the overhead caused by those operations is not high. The other overhead incurred is related to the stable logging. In the proposed scheme, stable logging is performed only when an actual data dependency is formed by the sequence of write–release–acquire–read operations, while in the SAT scheme, stable logging is performed more frequently, at each lock or data transfer. As a result, the increase in execution time is much larger in the SAT scheme. For our scheme, the overhead ranges from 2.4 to 9.6%, with an average of 5.4%; the overhead of SAT ranges from 14.0 to 49.9%, with an average of 31.3%.


Fig. 9. Experimental results (barnes).

Fig. 10. Experimental results (FFT).

Fig. 11. Experimental results (TSP).


5. Conclusions

In this paper, we have proposed a new logging scheme for lazy release consistent DSM systems, which tolerates multiple failures with minimal logging overhead. In the proposed scheme, the data values produced by write operations are logged in the volatile storage of the writer process. By associating a recoverable identifier with each data value and stably logging that information, not only the data value but also its identifier can be retrieved after a failure of the writer. Hence, there is no need for stable logging of the data values themselves. To efficiently trace the correct version of a data value during recovery, the vector time provided by the LRC model is used as the access information. Also, stable logging of the access information is performed only when an actual dependency relation between processes is formed. Hence, the frequency of stable logging is reduced. The experimental results show that the proposed scheme incurs much lower logging overhead than the other schemes.

Acknowledgements The authors wish to acknowledge the financial support of the Korea Research Foundation given in the program year of 1997 (1997-001-E00409).

References

[1] R.E. Ahmed, R.C. Frazier, P.N. Marinos, Cache-aided rollback error recovery (CARER) algorithms for shared-memory multiprocessor systems, in: Proceedings of the 20th Symposium on Fault-Tolerant Computing, June 1990, pp. 82–88.
[2] S.V. Adve, M.D. Hill, Weak ordering — a new definition, in: Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990, pp. 2–14.
[3] G. Cabillic, T. Priol, I. Puaut, The performance of consistent checkpointing in distributed shared memory systems, in: Proceedings of the 14th Symposium on Reliable Distributed Systems, September 1995, pp. 95–105.
[4] J.B. Carter, A.L. Cox, S. Dwarkadas, E.N. Elnozahy, D.B. Johnson, P. Keleher, S. Rodrigues, W. Yu, W. Zwaenepoel, Network multicomputing using recoverable distributed shared memory, in: Proceedings of the IEEE International Conference CompCon'93, February 1993.


[5] M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Trans. Comput. Syst. 3 (1) (1985) 63–75.
[6] M. Costa, P. Guedes, M. Sequeira, N. Neves, M. Castro, Lightweight logging for lazy release consistent distributed shared memory, in: Proceedings of the USENIX Second Symposium on Operating Systems Design and Implementation, October 1996.
[7] K. Gharachorloo, D.E. Lenoski, J. Laudon, P. Gibbons, A. Gupta, J.L. Hennessy, Memory consistency and event ordering in scalable shared-memory multiprocessors, in: Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990, pp. 15–26.
[8] G. Janakiraman, Y. Tamir, Coordinated checkpointing-rollback error recovery for distributed shared memory multicomputers, in: Proceedings of the 13th Symposium on Reliable Distributed Systems, October 1994, pp. 42–51.
[9] B. Janssens, W.K. Fuchs, Relaxing consistency in recoverable distributed shared memory, in: Proceedings of the 23rd Annual International Symposium on Fault-Tolerant Computing, June 1993, pp. 155–163.
[10] B. Janssens, W.K. Fuchs, Reducing interprocessor dependence in recoverable shared memory, in: Proceedings of the 13th Symposium on Reliable Distributed Systems, October 1994, pp. 34–41.
[11] S. Kanthadai, J.L. Welch, Implementation of recoverable distributed shared memory by logging writes, in: Proceedings of the 16th International Conference on Distributed Computing Systems, May 1996, pp. 116–123.
[12] P. Keleher, A.L. Cox, W. Zwaenepoel, Lazy release consistency for software distributed shared memory, in: Proceedings of the 19th Annual International Symposium on Computer Architecture, May 1992, pp. 13–21.
[13] P. Keleher, CVM: The Coherent Virtual Machine. http://www.cs.umd.edu/projects/cvm.
[14] A. Kermarrec, G. Cabillic, A. Gefflaut, C. Morin, I. Puaut, A recoverable distributed shared memory integrating coherence and recoverability, in: Proceedings of the 25th International Symposium on Fault-Tolerant Computing Systems, June 1995, pp. 289–298.
[15] J.L. Kim, T. Park, An efficient algorithm for checkpointing recovery in distributed systems, IEEE Trans. Parallel Distrib. Syst. 4 (8) (1993) 955–960.
[16] K. Li, Shared virtual memory on loosely coupled multiprocessors, Ph.D. Thesis, Department of Computer Science, Yale University, September 1986.
[17] T. Park, S.B. Cho, H.Y. Yeom, An improved logging and checkpointing scheme for recoverable distributed shared memory, in: Proceedings of the Second Asian Computing Science Conference, December 1996, pp. 74–83.
[18] T. Park, S.B. Cho, H.Y. Yeom, An efficient logging scheme for recoverable distributed shared memory systems, in: Proceedings of the 17th International Conference on Distributed Computing Systems, May 1997.
[19] B. Randell, P.A. Lee, P.C. Treleaven, Reliability issues in computing system design, ACM Comput. Surv. 10 (2) (1978) 123–165.


[20] G.G. Richard III, M. Singhal, Using logging and asynchronous checkpointing to implement recoverable distributed shared memory, in: Proceedings of the 12th Symposium on Reliable Distributed Systems, October 1993, pp. 58–67.
[21] R.D. Schlichting, F.B. Schneider, Fail-stop processors: an approach to designing fault-tolerant computing systems, ACM Trans. Comput. Syst. 1 (3) (1983) 222–238.
[22] R. Schwarz, F. Mattern, Detecting causal relationships in distributed computations: in search of the holy grail, Technical Report SFB124-1592, Department of Computer Science, University of Kaiserslautern, 1992.
[23] C.B. Stunkel, et al., The SP2 high-performance switch, IBM Syst. J. 34 (2) (1995) 185–204.
[24] M. Stumm, S. Zhou, Fault tolerant distributed shared memory, in: Proceedings of the Second IEEE Symposium on Parallel and Distributed Processing, December 1990, pp. 719–724.
[25] G. Suri, B. Janssens, W.K. Fuchs, Reduced overhead logging for rollback recovery in distributed shared memory, in: Proceedings of the 25th Annual International Symposium on Fault-Tolerant Computing, June 1995.
[26] V.O. Tam, M. Hsu, Fast recovery in distributed shared virtual memory systems, in: Proceedings of the 10th International Conference on Distributed Computing Systems, May 1990, pp. 38–45.
[27] K.L. Wu, W.K. Fuchs, Recoverable distributed shared memory, IEEE Trans. Comput. 39 (4) (1990) 460–469.

Heon Y. Yeom is an Associate Professor with the Department of Computer Science and Engineering, Seoul National University. He received his BS degree in computer science from Seoul National University in 1984, and the MS and PhD degrees in computer science from Texas A&M University in 1986 and 1992, respectively. From 1986 to 1990, he worked with the Texas Transportation Institute as a Systems Analyst, and from 1992 to 1993, he was with Samsung Data Systems as a Research Scientist. He joined the Department of Computer Science, Seoul National University in 1993, where he currently teaches and does research on distributed systems, multimedia systems, and transaction processing.

Taesoon Park received the BS degree in computer engineering from Seoul National University, Seoul, Korea, in 1987, and the MS and PhD degrees in computer science from Texas A&M University, College Station, TX, in 1989 and 1994, respectively. She is currently an Assistant Professor at Sejong University, Seoul, South Korea. Her research interests are in the areas of fault-tolerant and distributed computing systems, and distributed database systems.