Journal Pre-proof
To appear in: Future Generation Computer Systems
PII: S0167-739X(19)30829-5; DOI: https://doi.org/10.1016/j.future.2019.11.031; Reference: FUTURE 5303
Received: 31 March 2019; Revised: 20 September 2019; Accepted: 25 November 2019
Please cite this article as: J. Posner, L. Reitz and C. Fohry, A comparison of application-level fault tolerance schemes for task pools, Future Generation Computer Systems (2019), doi: https://doi.org/10.1016/j.future.2019.11.031.
© 2019 Elsevier B.V. All rights reserved.

A Comparison of Application-Level Fault Tolerance Schemes for Task Pools

Jonas Posner, Lukas Reitz, Claudia Fohry

Research Group Programming Languages / Methodologies University of Kassel, Germany

Abstract

Fault tolerance is an important requirement for successful program execution on exascale systems. The common approach, checkpointing, regularly saves a program's state, such that the execution can be restarted after permanent node failures. Checkpointing is often performed at system level, but its deployment at application level can reduce the running time overhead. The drawback of application-level checkpointing is a higher programming expense. It pays off if the checkpointing is applied to reusable patterns. We consider task pools, which exist in many variants. The paper assumes that tasks are generated dynamically and are free of side effects. Further, the final result must be computed from individual task results by reduction. Moreover, the pools must be distributed with private queues and adopt work stealing. The paper describes and evaluates three application-level fault tolerance schemes for task pools. All use uncoordinated checkpointing and regularly save information in a resilient store. The first scheme (called AllFT) saves descriptors of all open tasks; the second scheme (called IncFT) selectively and incrementally saves only part of them; and the third scheme (called LogFT) logs stealing events and writes checkpoints in parallel to task processing. All schemes have been implemented by extending the Global Load Balancing (GLB) library of the “APGAS for Java” programming system. In experiments with the UTS, NQueens, and BC benchmarks with up to 672 workers, the running time overhead during failure-free execution, compared to a non-resilient version of GLB, was typically below 6%. The recovery cost was negligible, and there was no clear winner among the three schemes. A more detailed performance analysis with synthetic benchmarks revealed that IncFT and LogFT are superior in scenarios with large task descriptors.

Keywords: HPC Programming Languages, Libraries and Tools

* This paper is an extended version of: C. Fohry, J. Posner, L. Reitz: A Selective and Incremental Backup Scheme for Task Pools. Int. Conf. on High Performance Computing & Simulation (HPCS), 2018.
Email addresses: [email protected] (Jonas Posner), [email protected] (Lukas Reitz), [email protected] (Claudia Fohry)

1. Introduction

In the future, large programs will run on exascale machines, which have a higher failure probability due to their higher component count. We consider permanent hardware failures of one or several cluster nodes, which are expected to be increasingly frequent. For instance, Herault and Robert [1] estimate that a machine with 1,000,000 nodes will experience a node crash every 53 minutes on average, even if a single node crashes only once in a century.

The current standard technique for coping with permanent node failures is system-level checkpoint/restart. It is provided by libraries such as DMTCP [2] and BLCR [3], and does not require any changes to the program. Checkpointing libraries periodically save the full state of a running program on disc such that, after failures, it can be restarted from the last checkpoint [1, 4]. The main drawback of system-level checkpointing is a high overhead. Depending on various parameters, the time for writing a checkpoint to disc may, for instance, be 30 minutes [5].

Application-level checkpointing provides an alternative. If the user is able to identify a subset of the state that is sufficient for recovery [1], the running time overhead may be reduced significantly. Moreover, application-level checkpointing allows the program execution to continue after failures, without the need for a restart. The obvious drawback of application-level checkpointing is an increased programming expense. The effort pays off if the technique is applied to reusable software such as a library, which may, e.g., implement a programming pattern.

We consider the task pool pattern, which has long been used for load balancing of irregular applications. Nowadays, a second important usage of task pools is in runtime systems of task-based parallel programming languages, which are gaining popularity on both shared-memory and distributed-memory machines [6, 7]. Briefly stated, the task pool pattern uses a fixed set of workers to process a large number of tasks. A task is a subcomputation that can be executed in parallel to

other tasks. Workers typically correspond to processes or threads. Task pools come in many variants, which differ in their concept of tasks and in the mechanism for load balancing. This paper considers fine-grained tasks with running times on the order of nanoseconds to seconds. We assume that tasks may spawn other tasks, but there are no other task dependencies. All tasks must be free of side effects, and the final result must be calculated from individual task results by reduction with an associative and commutative binary operator. Furthermore, all tasks must carry out the same code, parameterized by a task descriptor. We assume that tasks are internally sequential, although this requirement can be relaxed as long as the task behavior appears deterministic from the outside.

Load balancing is illustrated in Figure 1: We assume that each worker operates on a local pool, from which it takes and into which it inserts open tasks. The local pools are private, i.e., a worker may access its own pool only. A worker processes at most one task at a time, and runs this task to completion. If the local pool is empty, it asks a co-worker for tasks. The co-worker actively responds by sending tasks (called loot) or a reject message. This scheme is commonly denoted as cooperative work stealing, and the two workers are called thief and victim, respectively. Initially, there is at least one task in one of the local pools. During the computation, each worker accumulates task results into a worker result. At the end, the worker results are combined into the final result.

Figure 1: Cooperative work stealing

The paper presents and compares three alternative application-level fault tolerance schemes for this type of task pool. All schemes cope with permanent node failures. Network communication is assumed to be reliable. We assume that failures are detected at system level and reported to the application. The basic approach of the three schemes has been adopted from our own previous work [8]. It is a form of application-level checkpointing, in which the local pool contents and the current values of the worker results are saved. Backups are written to a resilient store. They are written independently by each worker, periodically and in the event of work stealing. After a failure, all tasks that have been processed after the worker's last backup, including the currently executed task, are re-executed.

We call the schemes AllFT, IncFT and LogFT, respectively: AllFT corresponds to JFT GLB from [8], but the algorithm has been re-implemented in a more efficient and scalable way. AllFT backups contain the complete task pool contents at the time of their writing. IncFT, in contrast, saves only those task descriptors that were contained in the pool for a while and that have not been saved before. For that, IncFT constantly monitors the pool size. IncFT imposes some additional constraints (see Section 2.2). LogFT, unlike the other schemes, does not update backups in the event of stealing. Instead, it records time stamps of stealing events and saves them in the resilient store. Moreover, LogFT writes backups in parallel to task processing. LogFT imposes some additional constraints as well, but different ones.

All fault tolerance schemes include recovery procedures. They are integrated into the program execution such that, after failures, the program continues with a smaller number of workers (often called shrinking recovery). The recovery procedures were adapted from previous work [8]. They guarantee that programs either crash, or compute the correct result despite any number of failures. Although our schemes can handle any failure scenarios, crashes may still occur due to crashes of the environment, such as the resilient store.

This paper extends its original version [9] by several novel contributions:

• task pool variant-independent descriptions of the fault tolerance schemes,
• inclusion of a third scheme, LogFT,
• re-implementations of AllFT and IncFT,
• inclusion of a third benchmark, NQueens, and
• updated and more comprehensive experiments.

Although our fault tolerance schemes are general, we had to choose a concrete task pool variant for implementation. In accordance with reference [8], we selected lifeline-based global load balancing, also called the lifeline scheme [10]. It is an advanced task pool variant that couples random victim selection with an overlay graph for termination detection. When a worker has no tasks left, it contacts up to w random victims, followed by up to z graph neighbors. If the worker does not discover tasks this way, it becomes inactive. When all workers are inactive, the

overall computation ends and the final result is computed. Otherwise, inactive workers may be reactivated by others. The lifeline scheme was originally implemented in the Global Load Balancing (GLB) library of the parallel programming language X10 [11]. Later, it was ported to the related “APGAS for Java” programming system [12]. We refer to the APGAS version, where APGAS stands for Asynchronous Partitioned Global Address Space, a wellknown parallel programming model that represents each cluster node by one or several places. The APGAS version of GLB has about the same performance as the X10 version [8] and, like that, runs one worker per place. If not otherwise noted, GLB stands for the APGAS version of GLB in this paper. Our GLB implementations of the three fault tolerance schemes are denoted by AllFTGLB, IncFTGLB, and LogFTGLB, respectively. These programs have been carefully engineered for high performance. They implement the resilient store by the IMap data structure of the Hazelcast framework, which is a distributed in-memory store. Experiments were conducted with the same benchmarks as in reference [8]: Unbalanced Tree Search UTS [13], NQueens [14], and Betweenness Centrality BC [15]. The experiments concentrated on the fault tolerance overhead during failure-free operation, compared to a non-resilient version of GLB. Considering a particular UTS instance and a particular environment with 144 places as an example, this overhead was reported as 12.87% in reference [8]. Later, it was reduced to 7.09% [9]. In the current experiments, we measured an overhead of 3.46% with AllFTGLB, 3.94% with IncFTGLB, and 2.89% with LogFTGLB. Beyond the example, we carried out experiments on both the cluster from references [8, 9], and on a larger cluster where we could start up to 672 places. While previous work had used strong scaling, we used weak scaling, which allowed the consideration of larger inputs. Overall, we observed overheads of at most 7%. This maximum value belonged to LogFTGLB, but in general there was no clear ranking between the three schemes. To better understand the relative merits of the three schemes, we performed a second group of experiments with synthetic benchmarks that allow to control the task descriptor sizes and thus the backup volume. For large sizes, both IncFTGLB and LogFTGLB significantly improved on AllFTGLB, without a clear ranking between the two. This outcome holds for both running time and memory consumption. The latter is of interest since it permits the execution of larger program instances. For all benchmarks and fault tolerance schemes, the restore overhead after failures was negligible. We conclude that all three schemes perform better than the general-purpose alternative of system-level checkpointing. Among the schemes, differences in efficiency are low. Thus, in scenarios with small task descriptors, AllFT is preferable, since it is easiest to implement and does not

impose additional constraints. In scenarios with large task descriptors, either IncFT or LogFT can be used. The choice depends on the compliance with the additional constraints.

The paper is organized as follows. Section 2 describes AllFT, IncFT, and LogFT in a novel, task pool variant-independent way. Afterwards, Section 3 adapts the schemes to lifeline-based global load balancing and provides details about their implementation in GLB and “APGAS for Java”. This section also includes some background. Section 4 describes our experimental setting, and presents and discusses results. Finally, Section 5 surveys related work, and Section 6 concludes the paper.

2. Fault Tolerance Schemes

In addition to the assumptions from Section 1, we impose the following technical requirements. As explained below, they can be established for any task pool variant, possibly at the price of a loss in efficiency:

(R1) While a worker's local pool is not empty, the worker must perform a sequence of worker steps, or briefly steps. A step typically comprises multiple of the task processing operations from Figure 1, and consists of the following worker actions:

• take out one or several tasks from the pool,
• process all tasks taken, in any order,
• combine the results of these tasks with the worker result, and
• insert all child tasks that were generated during task processing into the local pool.

When all tasks taken have been handled by the worker this way, the step ends. As illustrated in Figure 2, between the end of a worker step and the beginning of the next one, there is a gap, during which the worker is allowed to communicate. Within a step, however, communication by the worker is forbidden. In particular, the worker must neither deliver nor accept loot.

(R2) Only one steal from the same thief to the same victim may be in progress at a time.

(R3) A steal should leave at least one task in the local pool.
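For illustration, the following is a minimal Java sketch of a worker loop that respects (R1); the Task interface, the long-valued results, and the gap handling are illustrative placeholders, not part of GLB.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class WorkerSketch {
  interface Task {
    long result();          // partial result of this task
    List<Task> children();  // child tasks generated while processing it
  }

  private final Deque<Task> pool = new ArrayDeque<>(); // private local pool
  private long workerResult = 0;                       // combined with + (associative, commutative)

  void run() {
    while (!pool.isEmpty()) {
      // ---- worker step: no communication allowed in here ----
      Task t = pool.pollFirst();                        // take out one task
      workerResult += t.result();                       // combine its result with the worker result
      for (Task child : t.children()) {                 // insert generated child tasks
        pool.addFirst(child);
      }
      // ---- gap between steps: steals, backups, ... ----
      handleGap();
    }
  }

  private void handleGap() { /* answer steal requests, write backups if due */ }
}
```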

Figure 2: Steps, gaps, and relevant times

To establish (R1), recall from Section 1 that we restrict our consideration to cooperative work stealing. Since victim and thief actively participate in the stealing, they can postpone their activities until the end of a worker step. For (R2), thieves can remember open steal requests and remove duplicates. For (R3), the victim can reject steal requests that would leave the pool empty. Requirement (R3) is not strictly necessary, but appears sensible and occasionally simplifies the bookkeeping of our fault tolerance schemes.

2.1. AllFT

This scheme was already introduced in reference [8], where it was described specifically for the lifeline scheme. The following description generalizes it to any task pool variant that fulfills our requirements. AllFT is composed of checkpointing and a recovery procedure. The checkpointing is uncoordinated, i.e., each worker autonomously decides when to write a next local backup. The term backup refers to both the saved data and the event of their writing. Any new backup replaces the previous one. Backups contain copies of the local pool contents and the current worker result at the time of their writing, as well as some status information explained later. They are written in the gaps between two worker steps. As illustrated in Figure 2, we occasionally refer to these gaps as (relevant) times. Note that backups always capture a consistent worker state that includes the complete outcome of all previous tasks (result, child tasks). Backups are written on the following occasions:


• right after initialization of the worker (called initial backups),
• at regular time intervals (called regular backups),
• in the event of stealing on both the victim and thief sides (called steal backups),
• during restore (called restore backups), and
• right before the worker becomes inactive (called final backups).

Initial backups contain the initial tasks assigned to the worker and an empty worker result. Final backups do not contain any tasks, but the final worker result. The length of the time period between successive regular backups is denoted by r. It is measured in seconds, which is a minor technical difference from [8], where it was measured in steps. We think that seconds make it easier for a user to choose an appropriate value. In each gap between worker steps, the worker checks whether the current time period is over. If so, it writes a regular backup. If a steal or restore backup was performed during the time period, the regular backup is postponed accordingly. Similarly, if both a regular and another type of backup are scheduled for the same gap, the regular backup is omitted.

Backups are saved in the resilient store by a synchronous write operation. No particular type of store is required, but the store must support

• failure-safe storage and retrieval of data,
• transactions to access multiple pieces of data in concert, and
• concurrent accesses by multiple workers.

Figure 3: AllFT steal protocol

Steal backups are part of a steal protocol, which is illustrated in Figure 3 and works as follows:

1. The thief contacts the victim, asking for tasks.
2. The victim answers at its earliest convenience. It either sends a reject message (not shown in the figure), or decides to share tasks.
3. In the second case, the victim extracts the loot from its local pool, and saves it in the resilient store (independently of backups).
4. The victim writes a steal backup.
5. The victim delivers the loot to the thief.
6. The thief inserts the loot into its local pool.
7. The thief writes a steal backup.
8. The thief notifies the victim about task adoption.
9. The victim removes the loot from the resilient store.

While a piece of loot is kept in the resilient store, it is called open.
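The following Java sketch outlines the victim and thief sides of this protocol; ResilientStore, TaskPool, and Backup are hypothetical stand-ins for the resilient store and the messaging used in the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class AllFtStealSketch {
  // Hypothetical resilient store: synchronous, failure-safe puts and removes.
  interface ResilientStore {
    void put(String key, Object value);
    void remove(String key);
  }
  interface TaskPool {
    List<Long> extractLoot();               // remove loot from one end of the pool
    void insertLoot(List<Long> loot);
    List<Long> snapshot();                  // copy of the current pool contents
  }
  static final class Backup {               // pool contents + worker result
    final List<Long> tasks; final long result;
    Backup(List<Long> tasks, long result) { this.tasks = new ArrayList<>(tasks); this.result = result; }
  }

  ResilientStore store; TaskPool pool; long result; String id;

  // Victim side, steps 3-5: extract loot, save it as open, write a steal backup, deliver.
  List<Long> answerSteal(String thiefId) {
    List<Long> loot = pool.extractLoot();
    store.put("loot:" + id + ":" + thiefId, loot);
    store.put("backup:" + id, new Backup(pool.snapshot(), result));
    return loot;                                          // delivered to the thief
  }

  // Thief side, steps 6-9: insert loot, write a steal backup, notify the victim
  // (here the notification is folded into a direct remove of the open loot).
  void receiveLoot(List<Long> loot, AllFtStealSketch victim) {
    pool.insertLoot(loot);
    store.put("backup:" + id, new Backup(pool.snapshot(), result));
    victim.store.remove("loot:" + victim.id + ":" + id);
  }
}
```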

All resilient store entries have a unique owner. For backups, it is the worker whose data are saved. For open loot, it is the respective victim. During failure-free operation, all accesses are performed by the owner. Thus, there is no need for synchronization. After failure of a worker x, other workers take care for x’s entries. To avoid interference with accesses by x, which may arise late because of network delays, each accessing non-owner marks x’s entries as done. When this flag is set, owner accesses are discarded. This behavior can be programmed with a transaction. The recovery procedure has been introduced in reference [8] for a particular task pool variant, and is quite complex. In the following, we concentrate on generalization issues, whereas the original paper [8] provides further details and discusses correctness. The recovery procedure assumes that all workers are notified when a worker x failed, although not necessarily at the same time. If a system lacks support for global notification, a worker who observes the failure should notify the others. After notification, a worker p records the failure, to avoid future communication with x. In some task pool variants, further actions may be required to adjust future victim selection. If p has already sent a steal request to x, it considers the request as rejected. If p is currently the victim in a steal protocol with x, p inspects x’s backup and, if needed, takes back the stolen tasks by inserting them into its own pool. As explained before, p marks x’s backup as done before it performs any actions, and performs them within a transaction. The transaction also includes a restore backup that saves the local pool contents after the task adoption. In addition to loot sent to x, recovery must deal with loot sent from x, as well as with the tasks in x’s backup. The result in x’s backup need not be dealt with, but it is simply kept in the resilient store until the final reduction. The loot from x and the tasks in x’s backup are handled by a designated backup partner. This role can be taken by any worker, and possibly at a later time, since the data are held in the resilient store. From an efficiency point of view, timely recovery may pay off, though. A definition of backup partners must meet the following requirements:

• Each place x must have a unique backup partner. If the backup partner fails before or during its business, succession must be clear.
• A successor of a failed backup partner must not re-execute any actions.
• The backup partner must be able to process the adopted tasks.

Reference [8] uses a ring structure to choose backup partners, and substitutes failed backup partners by their nearest neighbor. It uses transactions to prevent re-execution of a failed backup partner's actions. A simpler deployment may designate worker 0 as the backup partner of all others, and crash the program when worker 0 fails.
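A minimal sketch of the ring-based partner choice, under the assumption that workers are numbered 0 to numWorkers - 1; the failed-set bookkeeping is illustrative.

```java
import java.util.Set;

final class BackupPartnerSketch {
  /** Backup partner of worker x: its nearest alive successor on a ring of worker IDs. */
  static int partnerOf(int x, int numWorkers, Set<Integer> failed) {
    int candidate = x;
    do {
      candidate = (candidate + 1) % numWorkers;   // next worker on the ring
    } while (failed.contains(candidate) && candidate != x);
    return candidate;                              // adopts x's saved tasks and open loot
  }
}
```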
No matter how a backup partner p is selected, it inserts x's tasks into its own pool and writes a restore backup. Moreover, it re-sends all open loot from x to its respective thieves, since such loot may or may not have been sent. Any receiver makes sure that it does not incorporate the same piece of loot twice, by inspecting a loot identifier (lid). The lids are consecutive numbers sent along with loot deliveries. Each worker records the most recent lid of loot sent, and the most recent lids of loot received. From requirement (R2), only the most recent lids must be covered. They are held locally and are included in backups.

2.2. IncFT

IncFT resembles AllFT, but reduces the backup volume by saving fewer tasks. The scheme imposes some additional constraints on the task pool variant:

(I1) The owner must operate on one end of the local pool, and stealings and task deliveries must operate on the other. For simplicity, we denote the owner end as top, and the other as bottom.

(I2) Each worker step must process a single task.

(I3) The reduction operator should be approximately size-preserving, i.e., the result should have about the same number of bits as each operand.

Constraints (I1) and (I2) are needed for correctness, whereas constraint (I3) impacts efficiency. IncFT has been introduced in a previous version of the present paper [9]. It combines two ideas:

• backups cover the worker state at some suitable time in the recent past, and
• backups are written incrementally.

The following section explains IncFT for regular backups, and the next section extends the scheme to stealing and restore.

2.2.1. Regular Backups

Let us consider any particular local pool and its worker. From constraints (R1) and (I2), each worker step removes the topmost task from the pool, possibly adds one or several tasks at the top, and keeps the rest of the tasks in the pool untouched. Figure 4 depicts an example for the evolution of pool contents over time. Only gaps between worker steps, i.e., the so-called (relevant) times, are shown. At times t̃ and t, successive regular backups are written. The figure depicts pool examples P0 . . . P5, where P1 denotes the pool right after t̃, and P5 denotes the pool right before t. For each depicted pool, R denotes the current worker result. The topmost task A is drawn as a brown grid, and the other tasks are represented by a green striped area. The number of “striped” tasks is denoted by s. These tasks remain in the pool during a worker step, and thus we call them stable. While s denotes their number, S denotes the actual tasks.


Figure 4: Selective and incremental backup scheme

At any (relevant) time, the state of a computation can be represented by the triple (A, S, R). Note that such a triple includes the outcome of all previous tasks and thus captures a valid state. Our selective backup scheme is based on the following idea: Whenever a regular backup is due at a time t, the state from a recent time t' ≤ t is written, where t' minimizes the backup size. To determine t', each active worker monitors s. A and R need not be monitored since their sizes are approximately constant (from (I3)). At t̃, and whenever s reaches a minimum, the worker takes a snapshot, i.e., it locally saves the tuple (A, s, R). Snapshot times are marked by “snap” in the figure. They include times when the same minimum is encountered again, since then R and A are more recent. Each snapshot replaces its predecessor in the local store. Note that the second parameter of a snapshot is a number, whereas states contain tasks. From a snapshot, the corresponding state can be reconstructed by taking s tasks from the bottom of the pool. At backup time t, a minstate (denoted min_t) is defined as the state that belongs to the current snapshot. Thus, in Figure 4, min_t belongs to snapshot (A_t', s_t', R_t'). At t, min_t can be reconstructed by taking the bottommost s_t' tasks, since the pool contents did not fall below s_t' between t' and t.

IncFT combines the above idea with incremental backups, i.e., the scheme does not re-send tasks that are already contained in the current backup. That backup was written at t̃ and contains the state at the last snapshot time t̃' of the preceding time interval (or was an initial backup). We distinguish two cases:

1. s_t̃' ≤ s_t': The bottommost s_t̃' tasks (highlighted in Figure 4) stayed in the pool from P0 to P4, because of the minstate property. Therefore, they are not re-sent. Instead, the backup at t consists of the data marked by a circle in the figure: A_t', the s_t' − s_t̃' upper striped tasks, and R_t'.

2. s_t̃' > s_t' (not shown in the figure): Since the bottommost s_t' tasks stayed in the pool from P0 to P4, they are not re-sent. So the backup at t includes: A_t', s_t' (just a number!), and R_t'. Task A_t' is included, since there may have been a previous minstate of the same size.

The backup at t updates the previous backup in the resilient store by inserting or deleting the respective tasks. A slight drawback of IncFT over AllFT strikes after failures, when the failed computation must be repeated from t' instead of from t. However, the additional time period is limited by |t − t'| ≤ |t − t̃| ≈ r.

2.2.2. Extension to Stealing and Recovery

Since stealing and recovery are only performed in gaps, the worker's state is clearly defined then. IncFT deploys the same steal protocol as AllFT, except that the backups contain fewer tasks. A new snapshot is taken after all types of backup writings. In the following, we modify the notation from Figure 4 as follows:

• t denotes the time at which the current (steal) backup back_t is written.
• t' denotes the time at which the current snapshot was taken.

• t̃' denotes the time at which the pool was in the state that is represented by the previous backup back_t̃ = (A_t̃', S_t̃', R_t̃').
• s_loot denotes the loot size.

We distinguish several cases:

a) Victim side, s_loot ≤ s_t̃' and s_loot ≤ s_t': Since the loot tasks stayed in the pool from t̃' to t, back_t equals (A_t', s_loot, R_t'), plus administrative information such as a hint to case a). After receipt, the bottommost s_loot tasks are removed from the saved backup.

b) Victim side, s_loot > s_t̃' and s_loot ≤ s_t': After t, the worker's computation can no longer be reconstructed from back_t̃. So the backup is based on the minstate, i.e., back_t = (A_t', Ŝ, R_t'), where Ŝ denotes the s_t' − s_loot tasks above the loot in the pool at t. This backup replaces back_t̃.

c) Victim side, s_loot > s_t': The tasks that remain in the pool after stealing have been generated after t'. Thus, neither back_t̃ nor the minstate are suitable to reconstruct the state at t. Thus, back_t = (A_t, Ŝ, R_t), where Ŝ denotes the s_t − s_loot tasks above the loot in the pool at t. Again, this backup replaces back_t̃.

d) Thief side, empty local pool: Backup back_t consists of the loot, including its topmost task, and the current worker result. It replaces back_t̃.

e) Thief side, non-empty local pool: This case may occur in some task pool variants such as ahead-of-time stealing [16]. The backup sent contains the loot only, and back_t̃ is updated by including these tasks.

Only backups according to cases b), c) and d) postpone the next regular backup. Otherwise, if backups of different types are due at the same time, the regular backup is written first, followed by victim-side steal backups, and thief-side steal backups (different from AllFT).
The same recovery procedure as in AllFT can be applied since, like there, we always have a valid backup from which a failed worker's computations can be reproduced. The fact that the backup is possibly older does not matter for recovery. We apply one modification: Restore backups, which are written after task adoptions, save fewer tasks than in AllFT. They are technically the same as steal backups at the thief side, and are handled by cases d) and e).
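To make the snapshot bookkeeping of Section 2.2.1 concrete, here is a small Java sketch; the Snapshot class and the representation of tasks as plain longs are illustrative simplifications, not the IncFTGLB data structures.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class IncFtSnapshotSketch {
  static final class Snapshot {
    final long topTask; final int stableCount; final long result;
    Snapshot(long a, int s, long r) { topTask = a; stableCount = s; result = r; }
  }

  private final Deque<Long> pool = new ArrayDeque<>(); // first = top (owner end), last = bottom (steal end)
  private long workerResult = 0;
  private Snapshot snapshot;                            // current snapshot (A, s, R)
  private int minSinceBackup = Integer.MAX_VALUE;       // minimum of s since the last backup

  /** Called in the gap after every small step (constraint (I2)). */
  void afterStep() {
    if (pool.isEmpty()) return;
    int s = pool.size() - 1;                            // stable tasks below the topmost task A
    if (s <= minSinceBackup) {                          // <=: a repeated minimum refreshes A and R
      minSinceBackup = s;
      snapshot = new Snapshot(pool.peekFirst(), s, workerResult);
    }
  }

  /** After every backup, a new snapshot baseline starts. */
  void onBackupWritten() {
    minSinceBackup = Integer.MAX_VALUE;
    afterStep();
  }

  /** The minstate is the bottommost stableCount tasks plus A and R. */
  Snapshot currentSnapshot() { return snapshot; }
}
```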

2.3. LogFT

Like IncFT, LogFT reduces the backup volume. However, it deploys a different technique: logging time stamps of steals. Moreover, LogFT writes backups in parallel to task processing. The scheme imposes two additional constraints, which differ from those of IncFT:
(L1) All tasks that are in a local pool at a time must originate from the same task delivery, or from the initial task assignment, respectively. We call such a set of tasks a task bag.

(L2) Computations inside worker steps and local pool accesses must be deterministic. In particular, a take operation must always yield the same task(s), including task order, when applied to the same pool. Moreover, the tasks must be processed in the same order, and child tasks must be inserted into the pool immediately after their generation.

Constraint (L1) can be established by incorporating some additional handshaking between victim and thief. For instance, the thief may reject any task deliveries if its pool is non-empty.

LogFT reduces the backup volume of steal backups at the victim side, whereas initial, final, regular, and thief-side steal backups are identical to their AllFT counterparts. We occasionally denote these (identical) backups as standard. Restore backups are not needed, as will be explained later. In both previous schemes, victim-side steal backups contained tasks: more tasks in AllFT, and fewer tasks in IncFT. In LogFT, victim-side steal backups never contain any tasks. Instead, their main content is a time stamp for the stealing event. Time stamps specify the number of worker steps that have been executed by the respective worker thus far. Thus, they uniquely correspond to times. If multiple steals are answered at the same time, a separate backup is written for each of them, and the time stamps are supplemented by a sequence number to clarify ordering (see below). From now on, we denote victim-side steal backups as logs. In addition to time stamps, logs may contain the loot size, if it is not clear otherwise. We will see later that a sequence of logs, together with the last standard backup, allows reproducing the victim pool contents after the steals.

Besides changing the contents of steal backups, LogFT differs from AllFT by asynchronous backup writing. For that, a local copy of the tasks and the worker result is created, and then task processing continues while these data

Figure 5: LogFT steal protocol

are sent. Consequently, backup writing is separated into two parts: starting the write, and waiting for its completion. Due to the asynchrony, successive write operations to the resilient store may overtake each other. To avoid race conditions among standard backups, a worker always waits for the completion of the previous backup before starting the next one. Concurrency between standard backups and logs will be discussed later.

The LogFT steal protocol is depicted in Figure 5. It is asynchronous and combines the formerly independent victim-side and thief-side steal backups into a single transaction:

1. The thief waits for the completion of the previous backup and then sends a steal request.
2. The victim answers with a reject message (not shown), or decides to share tasks.
3. In the second case, the victim extracts the loot from the pool. Then, it invokes a transaction on the resilient store, which is composed of:
3a) a log, which writes the time stamp (and loot size) to the victim's store entry, and
3b) a thief-side steal backup, which writes the loot to the thief's store entry.
The transaction is performed asynchronously. While it is in progress, the victim continues task processing, but it is not allowed to invoke another transaction.
4. When the transaction is completed, the victim asynchronously delivers the loot to the thief.

5. The thief inserts the loot into its local pool and starts processing it.

Step 1 ensures that the thief-side steal backup (Step 3b) is written after the previous standard backup. The use of transactions in Step 3 ensures that logs cannot overtake each other. Consequently, sequence numbers are clearly defined. To untangle concurrently written backups and logs, backups are extended by a time stamp as well, which reflects their startup time. As in AllFT, backups are written before logs if due at the same time.

The recovery procedure resembles that of AllFT. Like there, a backup partner is responsible for handling the failed worker's backup and logs in the resilient store. It first marks these entries as done, and then collects the backup and all logs into a replay unit. The following proposition describes how the failed worker's state can be reproduced from the replay unit. Afterwards, we discuss the overall recovery procedure, including the question of who performs this recovery.

Proposition 1. From a replay unit, the victim pool contents after the contained steals can be reproduced.

Proof. First, all late logs, i.e., logs that have been overtaken by a backup, are removed, since the backup already contains the effects of their steals. Then, without loss of generality, let us consider the first time t_s at which one or several steals of the replay unit took place. The standard backup was written at t_b ≤ t_s. If t_b = t_s, then it was written first. Otherwise, tasks have been processed during (t_b, t_s). Their calculations can be repeated by re-starting the task pool computation from the pool in the standard backup. From (L2), this yields the same pool contents as in the original execution. From (L1) and (R3), no loot was received during the time interval (t_b, t_s]. Next, the steals are re-applied to the pool, ordered by sequence numbers. For each steal, as many tasks as indicated by the loot size are extracted from the pool and thrown away. From (L2), this yields the same pool contents as in the original execution. The process is repeated for all times at which steals took place (in chronological order).

Note that a replay unit may contain early logs, i.e., logs that overtook their standard backup. Proposition 1 re-applies them as any others. This is justified as the correctness of the method from Proposition 1 (described in the proof) does not depend on the frequency of regular backups. Moreover, from Step 4 of the steal protocol, all logs refer to the current task bag. Finally, early logs can be safely consumed, since their “right” backup will not arrive anymore.

The AllFT recovery procedure includes occasional task adoptions: 1) A victim may need to re-adopt the loot sent, and 2) The backup partner may need to adopt the failed worker's saved tasks. Obviously, case 2) is different in LogFT, insofar as the tasks must first be reproduced from a replay unit.

Beyond that, task adoptions can violate constraint (L1). To account for that, the recovery procedure omits all task insertions into a non-empty pool. Instead, the corresponding worker creates a description record, which contains sufficient information to carry out the adoption later. For instance, the description record may contain a link to the failed worker’s entries in the resilient store. The description record is inserted into a replay list, which is saved in the resilient store. This list must support concurrent accesses. Entries in the replay list are processed at a more suitable time later. Any worker can do this processing. To avoid increasing the running time of failure-free runs, one may, e.g., adopt the following scheme: The creator of a description record locally saves a link to this record. When it later runs out of tasks or receives a steal request, it resorts to the linked tasks in place of a normal loot. At the very end, a designated worker makes sure that no records are left in the list.
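The following Java sketch illustrates the replay from Proposition 1; the Log and StandardBackup types, and the assumption that loot is split off one end of a deque, are simplifications of the actual LogFTGLB data structures.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

final class ReplaySketch {
  static final class Log { long stepStamp; int lootSize; long seqNr; }
  static final class StandardBackup { List<Long> tasks; long stepStamp; }

  /** Re-execution hook: deterministically runs worker steps on the pool up to a given step count (L2). */
  interface DeterministicExecutor { void runSteps(Deque<Long> pool, long untilStep); }

  static Deque<Long> reproducePool(StandardBackup backup, List<Log> logs, DeterministicExecutor exec) {
    Deque<Long> pool = new ArrayDeque<>(backup.tasks);          // state at backup time
    logs.removeIf(l -> l.stepStamp <= backup.stepStamp);        // drop late logs, already covered by the backup
    logs.sort(Comparator.comparingLong((Log l) -> l.seqNr));    // re-apply steals in their original order
    for (Log log : logs) {
      exec.runSteps(pool, log.stepStamp);                       // recompute the pool up to the steal
      for (int i = 0; i < log.lootSize; i++) {
        pool.pollLast();                                        // discard the stolen tasks; assumes loot is
      }                                                         // split off one end, like our deque pools
    }
    return pool;
  }
}
```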

3. Implementation

We implemented our fault tolerance schemes in the GLB library of the “APGAS for Java” programming system. APGAS uses the Hazelcast library underneath, and so we deployed Hazelcast’s IMap as a resilient store. This section starts with background on the three systems, putting emphasis on lifeline-based global load balancing, the task pool variant underlying GLB. Afterwards, we discuss the adaptation of the fault tolerance schemes to this variant. Finally, we sketch some implementation details. Our source codes will be published in a git repository upon acceptance of the paper [17].

3.1. Background APGAS Library. The term Asynchronous Partitioned Global Address Space (APGAS) denotes both a programming model [18] and a library that implements the model [19]. The model describes a parallel machine as a set of places, each of which consists of a memory partition and associated computing resources. All places can access all memory partitions, but local accesses are faster. For running an APGAS program on a real machine, the user must map one or several places to each cluster node. In the APGAS model, computations are performed by light-weight activities. A program starts with a single activity at place 0. Other activities are spawned dynamically, giving rise to a tree. Both activities and data are assigned to places by the programmer, either explicitly or implicitly, and can not migrate. The APGAS library was developed in a branch of the X10 language project. Java and Scala versions exist, we refer to the Java version [19]. APGAS implements each place by a Java Virtual Machine (JVM). Place-internally, activities are mapped to Java threads with the help of Java’s Fork-Join pool.

Parallelization and distribution constructs are almost identical to those of X10, but APGAS realizes them with Java lambdas. Direct accesses to remote data are not allowed, but all communication is based on active messages. They are realized with constructs asyncAt and at, which transfer a piece of code and parameters to a remote place and invoke a corresponding activity there. Communication with at is synchronous, i.e., the parent activity waits for the return of the child, and communication with asyncAt is asynchronous. For global synchronization, APGAS provides a finish construct, which waits for the termination of all activities spawned inside a block and their descendants. Placeinternal synchronization relies on Java functionalities such as synchronized and wait() / notify(). Remote data access is facilitated by global references. Moreover, a PlaceLocalObject construct supports the programming of distributed data structures such as arrays. APGAS supports a resilience mode, in which user applications are notified of failures. For instance, programs can register a placeFailureHandler on each place, and APGAS invokes these handlers automatically. In rare cases, the APGAS runtime crashes after failures: 1) when the origin place of a finish fails, and 2) when internal bookkeeping information is lost.
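As an illustration, a minimal APGAS-for-Java program in the style described above (assuming the static constructs finish, asyncAt, places, and here from apgas.Constructs):

```java
import static apgas.Constructs.*;

import apgas.Place;

public class HelloPlaces {
  public static void main(String[] args) {
    // One activity per place; finish waits for all of them and their descendants.
    finish(() -> {
      for (final Place place : places()) {
        asyncAt(place, () -> System.out.println("running on " + here()));
      }
    });
  }
}
```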

Hazelcast. The Hazelcast library is internally used by APGAS for interconnecting places [20]. The library includes a resilient store, called IMap, in which APGAS saves internal bookkeeping information. We also use the IMap for our resilient store. IMap entries are key-value pairs. For each entry, a configurable number of replicas is maintained, called the backup count. The entries and their replicas are distributed to different nodes, so as to balance load and provide resilience. If an entry cannot be restored after failures, a user-defined PartitionLostListener is triggered. IMap accesses automatically keep the entry's replicas consistent. They can be performed concurrently by different threads. Function executeOnKey() groups multiple accesses to the same entry into a critical section, to protect them from concurrent access. Similarly, transactions group multiple accesses to different entries, for the same purpose.

GLB. The Global Load Balancing (GLB) library was originally proposed for X10 [11], but we refer to the APGAS version of the library here [8, 12]. GLB provides inter-place load balancing, and thus complements the intra-place load balancing of the APGAS Fork/Join pools. GLB tasks are different from APGAS activities. A GLB user must program these tasks by extending some interface. The GLB tasks adopt the task model from Section 1, i.e., they may generate other tasks, are free of side effects, and contribute to a final result that is computed by reduction.

GLB maintains one local pool for each worker, from which the worker takes and into which it inserts tasks. The pool data structure must be coded by the GLB user, who must implement a set of functions such as function split() for extracting loot from the pool. GLB allows only one worker per place. Nevertheless, in parallel to its activity, concurrent activities can be performed on the same place. Therefore, steal requests can be received during task processing [8].
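The following sketch indicates the general shape of such a user-provided task queue; the method names (process, split, merge) are simplified and do not reproduce the exact GLB interface.

```java
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.Deque;

final class MyQueue implements Serializable {
  private final Deque<long[]> tasks = new ArrayDeque<>();  // task descriptors
  private long result = 0;                                 // worker result (sum reduction)

  /** Process up to n tasks; corresponds to one step. */
  boolean process(int n) {
    for (int i = 0; i < n && !tasks.isEmpty(); i++) {
      long[] t = tasks.pollFirst();                        // owner end
      result += compute(t);                                // may also push child tasks via tasks.addFirst(...)
    }
    return !tasks.isEmpty();
  }

  /** Extract loot from the other end of the deque. */
  MyQueue split() {
    MyQueue loot = new MyQueue();
    int share = tasks.size() / 2;                          // 50% here; benchmark-dependent in practice
    for (int i = 0; i < share; i++) {
      loot.tasks.addLast(tasks.pollLast());
    }
    return loot;
  }

  /** Insert received loot. */
  void merge(MyQueue loot) {
    loot.tasks.forEach(tasks::addLast);
  }

  private long compute(long[] descriptor) { return descriptor.length; }  // dummy task body
}
```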

Lifeline Scheme. The task pool variant that is implemented by GLB is called lifeline-based global load balancing, or briefly the lifeline scheme. It combines victim selection with termination detection in a quite sophisticated way [10]. The following description, like the original publication of the lifeline scheme, refers to the APGAS model and to cooperative work stealing: A steal request starts an activity on the victim place, which checks an empty flag and either rejects the request straight away, or records it in a local queue. Loot deliveries likewise are recorded in a queue. In parallel to this communication, the worker activity runs a loop, in which it alternatingly processes up to n tasks (corresponds to worker steps from Section 2), and handles the requests that were recorded in the meantime (corresponds to gap from Section 2). Typically, a GLB user sets n > 1, to avoid frequent synchronization, which is required to protect the accesses to pool and queues [8]. After having processed its last task, a worker starts stealing. For that, it contacts up to w random victims, followed by up to z so-called lifeline buddies. If all w + z steal attempts failed, the worker becomes inactive. It may be reactivated later by a lifeline buddy that delivers tasks. Lifeline buddies are preselected and form some suitable lifeline graph [10]. The lifeline buddies store steal requests that they had to reject and answer them later, if possible. A global finish construct recognizes the inactivity of all workers. Afterwards, the final reduction is performed and the program ends.
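A simplified sketch of this victim selection; requestFrom() stands for the actual steal message, and the parameters w and z are chosen arbitrarily here.

```java
import java.util.List;
import java.util.Random;

final class LifelineStealSketch {
  private final int w = 2;                 // random steal attempts
  private final int z = 1;                 // lifeline buddies to contact
  private final Random random = new Random();

  boolean tryToSteal(int me, int numWorkers, List<Integer> lifelineBuddies) {
    for (int i = 0; i < w; i++) {                              // random victims first
      int victim = random.nextInt(numWorkers);
      if (victim != me && requestFrom(victim)) return true;
    }
    for (int i = 0; i < Math.min(z, lifelineBuddies.size()); i++) {
      if (requestFrom(lifelineBuddies.get(i))) return true;    // then lifeline buddies
    }
    return false;                                              // nothing found: become inactive
  }

  private boolean requestFrom(int victim) { return false; }    // placeholder for the steal message
}
```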

3.2. Adaptation of Fault Tolerance Schemes

Whether or not the lifeline scheme fulfills the requirements from Section 2 depends on details of its own implementation, as well as on that of the local pool data structure that is provided by the user alongside the application. In our benchmarks, we implemented the local pools as deques. In all cases, access functions for stealing and deliveries operate on one end of the pool, and access functions for the worker's own task processing operate on the other. Steal requests extract at most 50% of the tasks in the pool (benchmark-dependent). We adopt a help-first strategy, i.e., child tasks are inserted into the pool and the processing continues with the parent [21]. GLB leaves open whether the n tasks for a worker step are taken from the pool as a block or individually. Our implementations take a single task at a time, and completely process it before taking the next one.

In the following, we comment on the validity of our fault tolerance schemes’ requirements. Thereafter, we fill gaps in the algorithms with specific detail, where Section 2 was confined to generic arguments. Our GLB implementations are denoted by AllFTGLB, IncFTGLB, and LogFTGLB, respectively. Requirement (R1) is naturally met by the lifeline scheme’s computation structure. Each step corresponds to the processing of n tasks, and workers correspond to worker activities. GLB synchronization ensures that the worker activities do not communicate during a step. Each step enters all child tasks into the pool and combines the task results with the worker result. Requirement (R2) may only be violated when a random steal request is sent to a lifeline buddy that has already recorded a lifeline request before. Like reference [8], we enforce (R2) by discarding the random request on the victim side and treating the lifeline request as if it would have arrived just now. Requirement (R3) is always fulfilled as we extract at most 50% of the pool contents. Likewise, our local pool implementation meets Constraint (I1). Constraint (I2) could in principle be established by setting GLB parameter n = 1. However, that would increase the synchronization costs. Therefore, we introduce two levels of steps: “Small’ steps” are the steps according to requirement (R1), i.e., they process one task and incorporate its children/result. “Large steps”, in contrast, are the units after which communication operations are allowed, i.e., the length n of a large step expresses how many tasks must have been processed before communication is allowed. Thus, we typically have n > 1, as in GLB. Constraint (I2) is fulfilled for the small steps. The fact that communication is more rare does not compromise (R1)’s correctness. Note that the two-level step structure can only be imposed if the user program takes one task from the pool at a time, as we do. This is a restriction to IncFTGLB usage, though. Constraint (I3) is application-dependent. Our benchmarks use the sum operator, which is size-preserving. The lifeline scheme may violate constraint (L1), since task deliveries from lifeline buddies may arrive at any time after the steal request. Therefore, we included some additional handshaking between victim and thief, but only into LogFTGLB. In particular, before loot delivery a lifeline buddy first asks its partner whether it is still in need of tasks by spawning an activity on the partner’s place. This activity may have to wait for ongoing communication before the answer is clear. Constraint (L2) is naturally fulfilled by our benchmark implementations.
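The two-level step structure can be pictured as follows; the method names are illustrative, and n = 511 is just one of the benchmark settings from Section 4.

```java
final class TwoLevelStepSketch {
  private final int n = 511;                 // GLB parameter: tasks per large step

  void mainLoop() {
    while (hasTasks()) {
      for (int i = 0; i < n && hasTasks(); i++) {
        processOneTask();                    // small step: one task, children inserted, result merged
        updateSnapshot();                    // IncFT bookkeeping after every small step (I2)
      }
      handleCommunication();                 // gap after the large step: steals, backups, ...
    }
  }

  private boolean hasTasks() { return false; }
  private void processOneTask() { }
  private void updateSnapshot() { }
  private void handleCommunication() { }
}
```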

The application of AllFT checkpointing to the lifeline scheme is for the most part obvious. Hazelcast's IMap nicely fits the stated requirements for a resilient store. Owner accesses to the done flag are performed with function executeOnKey(). The recovery procedure receives global failure notifications with placeFailureHandlers, and uses IMap transactions wherever needed. The choice of backup partners and their successors corresponds to reference [8]. The backup partners are always able to process adopted tasks; for that, inactive workers are reactivated when needed. In rare cases, multiple failures may dissect the lifeline graph, such that workers from some subgraph cannot steal anymore from the others [22]. In this case, load balancing malfunctions, but all tasks are still processed by the rest of the workers. Thus, the efficiency drops, but the correctness is not compromised. It may pay off to occasionally reconstruct the graph, as suggested in reference [23]. We did not implement that.

IncFT does not require any adaptations. As explained above, a snapshot is taken after each “small step”.

The LogFT description in Section 2.3 did not prescribe the design and handling of description records. In our implementation, a backup partner with a non-empty pool creates a description record that just contains the failed worker's number. It further maintains a local counter for the number of description records that were created but not yet processed. Whenever the backup partner's pool runs empty or it receives a steal request, it consults and possibly decreases this number, and resorts to the corresponding tasks. At the very end, worker 0 checks the resilient store for any left entries.

3.3. Technical Issues

Our implementations were carefully engineered for high performance. Here are some examples of implementation details:

Distribution of IMap entries. Internally, Hazelcast assigns IMap entries to a fixed number of partitions, and evenly distributes these partitions across places. Our IMaps contain as many entries as places, and therefore the standard distribution is imbalanced. Our implementations equate the number of partitions with the number of places, and deploy a user-defined distribution that saves the backup of place i on place i + 1. As usual, replicas are distributed randomly.

ExactlyOnceExecutor. Our fault-tolerant GLB versions use Hazelcast function executeOnKey() for backup writing. Occasionally, Hazelcast runs this function multiple times, which would render IncFTGLB incorrect. We implemented an exactly-once guarantee with the help of system-wide unique IDs. InMemoryFormat. Hazelcast supports two storage formats for entries: Binary format is more efficient for accesses to whole entries, and Object format is more efficient for entry processing. To facilitate our use of executeOnKey(), we changed the default setting to Object.
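For illustration, a Hazelcast 3.x configuration along these lines might look as follows; the map name and place count are illustrative, and the snippet is a sketch rather than the exact configuration used.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.MapConfig;

public class BackupMapConfig {
  public static Config configure(int places) {
    Config config = new Config();
    // One partition per place, so that the user-defined distribution can map backup i to place i + 1.
    config.setProperty("hazelcast.partition.count", String.valueOf(places));

    MapConfig mapConfig = new MapConfig("glb-backups");
    mapConfig.setBackupCount(1);                        // number of synchronous replicas per entry
    mapConfig.setInMemoryFormat(InMemoryFormat.OBJECT); // speeds up entry processing via executeOnKey()
    config.addMapConfig(mapConfig);
    return config;
  }
}
```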

RestartDaemon. The APGAS runtime uses asyncAt to invoke placeFailureHandlers. Thus, they run outside any finish of the user program. In our programs, the placeFailureHandlers may have to restart inactive workers which, for termination detection, must run inside a global finish. As a workaround, we implemented a restart daemon, which is an additional activity inside finish. If needed, a placeFailureHandler sends a message to this daemon, which thereupon restarts the corresponding worker. Furthermore, the restart daemon delays termination until all placeFailureHandlers have been executed and no open steals are left [8]. We more efficiently re-implemented these checks with Hazelcast’s ICompletableFuture construct.

4. Experiments

We compared the new fault-tolerant GLB versions to the original non-resilient one to evaluate (a) running times of failure-free runs, (b) memory footprints, (c) restore overheads, and (d) correctness. Experiments were run on two clusters:

• Kassel [24]: Our department's partition of the Kassel University Cluster comprises 12 homogeneous, Infiniband-connected nodes. Each node contains two 6-core Intel Xeon E5-2643 v4 CPUs (a total of 12 cores per node), and 256 GB of main memory. We deployed up to 144 places, which we mapped cyclically onto the lowest possible number of nodes (at most 12 places per node). For instance, we mapped 24 places onto two nodes: place 0 onto node 0, place 1 onto node 1, place 2 onto node 0, and so on.

• SuperMUC Phase 2 [25]: Each node of this petascale machine contains two 14-core Intel Xeon E5-2697 v3 CPUs (a total of 28 cores per node), and 64 GB of main memory. We allocated up to 24 nodes within a single island, and started up to 672 places. The mapping was analogous to that on Kassel, here with up to 28 places per node.

We deployed Java in version 11.0.2 and Hazelcast in version 3.10.6. For APGAS, we used the latest revision (March 29, 2019) from an own repository [26]. This repository is a fork of the official one with some additional features and bug fixes. The resilience mode of APGAS was switched on for runs with the fault-tolerant GLB versions only. Thus, the numbers reported below include both the overheads of our own schemes, and those of resilient APGAS.

As benchmarks, we used Unbalanced Tree Search UTS [13], NQueens [14], Betweenness Centrality BC [15], and two synthetic benchmarks of our own called DynamicSyn and StaticSyn, respectively. UTS dynamically generates a highly irregular tree from SHA1 values and counts the number of nodes. NQueens calculates the number of non-threatening placements of N queens on an N × N chessboard. BC calculates a centrality score for each node of a given graph. The two synthetic benchmarks compute π with a Monte Carlo algorithm and user-defined precision. They were developed with the aim of simulating large pools, and therefore their task descriptor size is adjustable with a dummy ballast. DynamicSyn, like UTS, starts with a single task at place 0. Each task spawns multiple child tasks dynamically, until a given depth is reached. StaticSyn, like BC, generates all tasks at program start and evenly distributes them among workers. All benchmarks reduce task results with the sum operator. For UTS, NQueens and the synthetic benchmarks, the result is a single long value. For BC, the result is a long array with one entry per graph node. The UTS, NQueens and BC benchmarks were already available for GLB and AllFTGLB [8]; we slightly adapted them for IncFTGLB and LogFTGLB. The synthetic benchmarks are our own developments.

In the following, we list our parameter settings, including GLB parameter n for the number of tasks per step. Values for n were determined experimentally, so as to minimize the running time for different steal rates and task granularities. Similarly, we experimentally determined the percentages of tasks that are extracted in steals as 10% for UTS and the synthetic benchmarks, and 50% for NQueens and BC. The experiments deployed weak scaling, i.e., we increased the problem size with the number of places by adjusting benchmark-specific parameters, such that all running times are in the 100 . . . 1000 seconds range. The ellipses below indicate the corresponding parameter ranges, and Tables 1 and 2 contain examples of concrete running times. The synthetic benchmarks were always run with 144 places. Instead, we increased the dummy ballast. Here are the benchmark parameters:

• UTS: geometric tree shape, branching factor b = 4, random seed s = 19, tree depth d = 15 . . . 19, n = 511
• NQueens: N = 16 . . . 18, threshold t = 10 . . . 12, n = 511

Jo

ˆ BC: random seed s = 2, number of graph nodes N = 216...19 , n = 127 ˆ DynamicSyn: number of child tasks randomly selected from [1, ..., 27], tree depth d = 7, precision g = 1, n = 127, ballast b = 0 . . . 10 MB ˆ StaticSyn: number of tasks t = 100000, precision g = 300, n = 12, ballast b = 0 . . . 0.4 MB

In all runs, GLB parameter r, which specifies the time period between successive regular backups, was set to 10 seconds. It was estimated with the Daly formula [27], which, depending on inputs such as the system MTBF, gave us r = 10 . . . 1000 seconds. We conservatively used the minimum from this range, to avoid reporting overly optimistic results. The impact of r on the running time is discussed further in Section 4.1.
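For orientation, a commonly used first-order form of this estimate is (our restatement, not a formula taken from the paper; δ and M are our symbols):

    r_opt ≈ sqrt(2 · δ · M) − δ

where δ is the time to write one backup and M is the mean time between node failures (MTBF). With backup costs in the sub-second to second range and MTBFs of several hours, this yields intervals of roughly tens to hundreds of seconds, which is consistent with the range cited above.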

[Figure 6: UTS weak scaling: Running time overhead of AllFTGLB, IncFTGLB, and LogFTGLB over non-resilient GLB; (a) Kassel, (b) SuperMUC. Y-axes: overhead over non-resilient GLB in %; x-axes: places.]

[Figure 7: UTS weak scaling: Number of (a) steals and (b) backups per place and second on Kassel. X-axes: places.]

[Figure 8: NQueens weak scaling: Running time overhead of AllFTGLB, IncFTGLB, and LogFTGLB over non-resilient GLB; (a) Kassel, (b) SuperMUC. Y-axes: overhead over non-resilient GLB in %; x-axes: places.]


[Figure 9: BC weak scaling: Running time overhead of AllFTGLB, IncFTGLB, and LogFTGLB over non-resilient GLB; (a) Kassel, (b) SuperMUC. Y-axes: overhead over non-resilient GLB in %; x-axes: places.]

[Figure 10: DynamicSyn and StaticSyn: Running time overhead of AllFTGLB, IncFTGLB, and LogFTGLB over non-resilient GLB with 144 places on Kassel; (a) DynamicSyn, (b) StaticSyn. Y-axes: overhead over non-resilient GLB in %; x-axes: ballast per task in megabytes.]

[Figure 11: DynamicSyn and StaticSyn: Memory footprint of AllFTGLB, IncFTGLB, and LogFTGLB over non-resilient GLB with 144 places on Kassel; (a) DynamicSyn, (b) StaticSyn. Y-axes: overhead over GLB in %; x-axes: ballast per task in megabytes.]


Previous experiments were carried out with IMap backup counts 1 and 6, but the distinction did not lead to interesting insights [8, 9]. Therefore, we carried out the current experiments with backup count 1 only (except in Section 4.3).
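The backup count is a standard Hazelcast IMap setting. A minimal configuration sketch for Hazelcast 3.x is shown below; it is our illustration, and the map name is a placeholder rather than the one used by GLB:

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class BackupCountSketch {
        public static void main(String[] args) {
            Config config = new Config();
            // Keep one synchronous backup replica per IMap entry (backup count 1);
            // the crash experiments in Section 4.3 used a backup count of 6 instead.
            config.getMapConfig("glbBackups").setBackupCount(1); // "glbBackups" is a placeholder name
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            // ... create and use resilient maps via hz.getMap(...) ...
            hz.shutdown();
        }
    }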


4.1. Running Times of Failure-Free Runs


This section refers to overheads instead of absolute running times, to make the presentation clearer. The overheads are calculated as time_xFTGLB / time_GLB − 1, and expressed as a percentage. Figures 6, 8 and 9 depict our results for UTS, NQueens, and BC on Kassel (left) and SuperMUC (right), respectively. In the figures, grey areas mark place ranges that were run with the same weak scaling parameters.

For UTS on Kassel, the overheads are at most 2.96% (for IncFTGLB with 2 places). The overheads are lowest for LogFTGLB, closely followed by AllFTGLB, while IncFTGLB has the largest overheads. To explain the results, Figure 7 shows the number of (a) steals and (b) backups per place and second. The curves share some similarities with those in Figure 6a, such as the peak at 72 places. As Figure 7a shows, the steal rate increases within each grey area. The reason is that with the same total work and an increasing number of places, each worker gets fewer tasks, and thus has to steal more often. Figure 7b shows that the number of backups per place and second grows in the same way as the steal rate, which is due to a growing number of steal backups. Additionally, one can see that regular backups, which are written every 10 seconds, play a minor role in the running time overhead. For example, with 144 places, each place writes a backup about every 3 seconds. Correspondingly, Table 3 shows that about 88% of the backups are steal backups.

On SuperMUC (Figure 6b), the overheads are at most 7.03% (for LogFTGLB with 224 places), and fluctuate in a similar way as on Kassel. Since Kassel and SuperMUC have completely different hardware (CPU speed, number of cores per node, network speed, etc.), the exact shape of the curves differs. Occasionally we measured negative overheads, but only in single-node runs. They can be explained by the non-deterministic nature of GLB's work stealing algorithm, due to which a certain variance in processing time is normal.

For NQueens (Figure 8), the overheads are at most 1.39% on Kassel (for AllFTGLB with 72 places), and at most 2.89% on SuperMUC (for LogFTGLB with 448 places). The overheads fluctuate without a clear winner.
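As a concrete illustration of the overhead metric (our own arithmetic, based on the Kassel running times in Table 1): for UTS with 144 places,

    overhead(AllFTGLB) = 890.53 s / 886.78 s − 1 ≈ 0.0042, i.e., about 0.42 %.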

The BC overheads (Figure 9) on Kassel are at most 6.06% (for LogFTGLB on 132 places). Above 24 places, they are quite stable, except for IncFTGLB, with a relatively clear ranking: AllFTGLB performs best with overheads up to 0.86%, followed by IncFTGLB with overheads up to 4.13%, and LogFTGLB with overheads up to 5.57% (all with 144 places). On SuperMUC, the overheads are at most 7.11% (for AllFTGLB with 672 places). Unlike on Kassel, there is no clear ranking. In general, the BC overheads are higher and more stable than those of UTS and NQueens. This can be attributed to different benchmark characteristics: First, all tasks are known from the beginning, and thus the steal rate is significantly lower. In consequence, the number of steal backups is smaller than the number of regular backups (see Table 3), and backups are written only about every r seconds. Second, with an increasing number of graph nodes (parameter N), the task granularity increases considerably (see Table 3). For instance, with 144 places, processing n tasks takes 25 seconds, and therefore backups are written only about every 25 seconds. Finally, in contrast to UTS and NQueens, the result is not a single long value but an array. Since the result is contained in backups, they are larger than for UTS and NQueens (Table 3). For instance, with 144 places, a single backup has 4 MB, in contrast to UTS and NQueens backups with 2 KB and 30 KB, respectively.

For the synthetic benchmarks (Figure 10), the overheads tend to increase with ballast, as expected. Table 3 shows that the backup sizes increase up to 235 MB (DynamicSyn) and 53 MB (StaticSyn), respectively. Both values are significantly higher than those of the other benchmarks. DynamicSyn shows a clear ranking: LogFTGLB performs best with overheads of up to 32.95%, IncFTGLB ranks second with overheads up to 59.27%, and AllFTGLB loses with large overheads up to 317.12% (all with 10 MB ballast per task). Obviously, the reduced backup volume of IncFTGLB and LogFTGLB pays off. Results for StaticSyn show a clear ranking as well, although a different one. Here, IncFTGLB performs best with overheads up to 7.47%, and LogFTGLB ranks second with overheads up to 16.05%. AllFTGLB again loses clearly, with overheads up to 103.04% (with 0.35 MB or 0.4 MB ballast per task, respectively).

Overall, there is no clear ranking between our three fault tolerance schemes; the results are benchmark- and machine-dependent. This outcome can be explained by different pros and cons of the three algorithms: IncFTGLB and LogFTGLB have a lower backup volume than AllFTGLB, and thus lower communication costs. The difference is, however, only noticeable for large pools, since otherwise communication costs are dominated by latency. On the downside, IncFTGLB causes additional overhead for monitoring, and LogFTGLB causes additional overhead for local task copying and handshaking. The latter is required to avoid task deliveries to non-empty pools.

In general, our overheads are remarkably low compared to the results of previous experiments [8, 9]. To put this into concrete terms, consider the execution of a particular UTS instance (d = 17) with 144 places on Kassel. For this instance, reference [8] reported an overhead of 12.87% for a precursor of AllFTGLB, and reference [9] reported overheads of 7.09% for a precursor of AllFTGLB and 8.88% for a precursor of IncFTGLB. In the current experiments, we measured overheads of 3.46% for AllFTGLB, 3.94% for IncFTGLB, and 2.89% for LogFTGLB. The improvements can be explained by more carefully engineered re-implementations.

Finally, consider the impact of GLB parameter r. As stated above, r should be calculated with the Daly formula, which gave us a [10, 1000] seconds range. We performed most experiments with r = 10 seconds, but also tried r = 500 seconds. With this setting, the BC overheads of our fault tolerance schemes are only 0.16% for AllFTGLB, 1.23% for IncFTGLB, and 0.98% for LogFTGLB (all with 144 places on Kassel).

Table 1: Running times in seconds with 144 places on Kassel

                      non-resilient GLB   AllFTGLB   IncFTGLB   LogFTGLB
    UTS                      886.78        890.53     898.85     889.00
    BC                       718.56        724.73     748.23     758.57
    BC, realistic r          718.56        719.71     727.39     725.57
    NQueens                  541.82        543.21     543.99     543.62
    DynamicSyn               503.78       2101.37     802.37     669.78
    StaticSyn                 56.73        115.18      57.78      65.83

Table 2: Running times in seconds with 672 places on SuperMUC

                      non-resilient GLB   AllFTGLB   IncFTGLB   LogFTGLB
    UTS                      430.31        445.73     451.48     453.65
    NQueens                  237.42        241.97     240.27     243.62
    BC                       357.76        383.20     369.13     368.43

Table 3: Average task processing time (task granularity), average backup size per place, and percentage of steal backups in relation to all backups on Kassel

                  Single Task                 n Tasks                 Backup Size     Steal Backups
    UTS           0.10 – 0.11 microseconds    55 – 60 microseconds    1 – 2 KB        12 – 88 %
    NQueens       70 – 120 microseconds       36 – 60 milliseconds    23 – 30 KB      83 – 96 %
    BC            5 – 200 milliseconds        0.65 – 25 seconds       0.60 – 4 MB     3 – 16 %
    DynamicSyn    0.70 – 15 milliseconds      0.08 – 2 seconds        2 KB – 235 MB   32 – 37 %
    StaticSyn     83 – 221 milliseconds       1 – 2.60 seconds        5 KB – 53 MB    60 – 96 %

4.2. Memory Footprint

Besides running time, the memory footprint is relevant, since it determines the maximum problem size and, respectively, the number of places that can be assigned to each node. We calculated memory footprints by taking the average over the peak memory consumptions of all places during a job's runtime. As in the previous section, we report overheads of the fault-tolerant GLB versions over GLB, and express them as a percentage. For clearer results, the experiments for this section were performed with the synthetic benchmarks. The results are depicted in Figure 11, for the same parameters as before.

DynamicSyn results vary for small task descriptor sizes, but above 4.5 MB ballast per task the picture is clear: IncFTGLB needs the least memory with overheads of at most 38.48%, followed by LogFTGLB with overheads of at most 53.89%, and AllFTGLB with overheads of at most 147.74% (all with 10 MB ballast per task). Absolute values are given in Table 4. StaticSyn yielded similar results. Here, a clear picture arises above 0.1 MB ballast: IncFTGLB has the lowest memory footprint overheads with at most 69.17%, followed by LogFTGLB with at most 161.14%, and AllFTGLB with up to 441.92%. Again, absolute values can be found in Table 4.
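Per-place peak memory can, for instance, be sampled with the standard JMX memory bean. The following sketch is our illustration and not the instrumentation actually used in GLB:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    // Sketch (ours): track the peak heap usage of one place.
    public class PeakHeapSampler {
        private long peakBytes = 0;

        // Call periodically, e.g. from a background thread; keeps the maximum seen so far.
        public synchronized void sample() {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            long used = memory.getHeapMemoryUsage().getUsed();
            peakBytes = Math.max(peakBytes, used);
        }

        public synchronized long peakBytes() {
            return peakBytes;
        }
    }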

4.3. Restore Overhead

In our third group of experiments, we determined the time for handling place failures. It includes failure detection, execution of the recovery procedure, and re-processing of the lost tasks. For that, we compared the running times of three program executions:

Table 4: Memory footprint of one place

                          non-resilient GLB   AllFTGLB       IncFTGLB     LogFTGLB
    DynamicSyn (10 MB)        6826.50 MB      16 911.80 MB   9453.64 MB   10 505.59 MB
    StaticSyn (0.4 MB)        3278.70 MB      17 768.01 MB   5546.52 MB    8561.91 MB

Table 5: Running times in seconds for UTS with d=19 and backup count=6 on Kassel

         Places     AllFTGLB   IncFTGLB   LogFTGLB
    A    144          904.54     909.80     902.12
    B    132          984.79     989.87     983.84
    C    144 − 24    1017.79    1025.93    1012.96

A: 144 places without crashes,
B: 132 places without crashes, and
C: 144 places with 24 place crashes after half of A's running time.

Since executions B and C use the same total amount of computing resources, the restore overhead can be estimated by t_C − t_B. This approximation is somewhat rough, since the second half of A's execution has more idle workers than the first one, and so the reduction in resources hurts less in that phase. The value of 24 for the number of place crashes is unrealistically large; we used it to obtain measurable results. To avoid a program crash despite the large value, we set the backup count to 6. Two places per physical node were crashed by calling System.exit(). Table 5 depicts the measured running times in cases A, B and C. With the above formula, we obtain an estimated restore overhead of 3.35% for AllFTGLB, 3.64% for IncFTGLB, and 2.96% for LogFTGLB (all for 24 failures). Thus, the restore overheads for a single failure are negligible in all cases.
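For illustration, these percentages can be reproduced from Table 5 by normalizing t_C − t_B with t_B (our reading of the calculation):

    AllFTGLB: 1017.79 s − 984.79 s = 33.00 s;  33.00 / 984.79 ≈ 3.35 %
    IncFTGLB: 1025.93 s − 989.87 s = 36.06 s;  36.06 / 989.87 ≈ 3.64 %
    LogFTGLB: 1012.96 s − 983.84 s = 29.12 s;  29.12 / 983.84 ≈ 2.96 %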

4.4. Correctness Tests

Correctness tests were performed with 144 places on Kassel. We tested the same scenarios as listed in reference [8], plus some additional one-place crashes after steal backup writing. Again, place crashes were provoked with System.exit(). After the program runs, we checked log files to make sure that the protocols were obeyed.
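For illustration, a place crash of this kind can be provoked by running a task on the victim place that terminates its JVM. The sketch below uses the APGAS constructs asyncAt and place; it is our illustration rather than the exact test harness of the paper, and assumes an APGAS launch with at least two places:

    import static apgas.Constructs.asyncAt;
    import static apgas.Constructs.place;

    public class CrashInjector {
        public static void main(String[] args) {
            // Simulate a permanent failure of place 1 by terminating its JVM,
            // as in the correctness tests described above.
            asyncAt(place(1), () -> System.exit(1));
        }
    }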

4.5. Summary

All three fault tolerance schemes have low overheads in failure-free runs, typically below 6%. Furthermore, the restore overheads after failures are negligible. Thus, all three schemes constitute an efficient alternative to system-level checkpointing for task pool-based applications. The difference between the three schemes is less clear, however. In the practically relevant case of small pools, they have similar performance. Here, AllFT seems to be the best choice, since it is easiest to implement and does not impose any additional constraints. In particular, an AllFT step may process multiple tasks, and thus the scheme is suitable for compact task representations [11]. For large pools, in contrast, IncFTGLB and LogFTGLB significantly outperform AllFTGLB. The choice between the two schemes depends on a task pool scheme's compliance with the additional constraints, or on the costs of establishing them.

5. Related Work

Fault tolerance research has attracted increasing attention in recent years [1, 5, 28]. Its scope is broader than ours and includes, e.g., silent error handling and failure prediction. As noted in Section 1, system-level checkpoint/restart is the current de-facto standard for coping with permanent node failures. It is provided by libraries such as BLCR [3], DMTCP [2], and FTI [29]. Many papers suggest performance improvements for system-level checkpointing [4, 30, 31]. In particular, multi-level checkpointing combines different storage media for saving backups. Incremental checkpointing avoids re-sending data. Uncoordinated checkpointing does not perform a global synchronization, but complements local checkpoints with message logging. Our fault tolerance schemes share some similarities with these approaches. For instance, our steal backups serve a similar purpose as message logs.

Application-level fault tolerance is supported by more programming systems than APGAS, e.g., ULFM for MPI [32], Resilient X10 [33], and Charm++ [34]. Failure notification allows a user program to implement application-specific fault tolerance techniques. An example is the exploitation of redundancy in matrix computations, which is often denoted as algorithm-based fault tolerance (ABFT) [35].

Load balancing and non-fault-tolerant task pools have been studied intensively, e.g. [36, 16], but there is still little work on fault-tolerant task pools. Some of it is targeted at silent errors [37, 38, 39]. Other work considers permanent failures for different task models: For tasks with side effects, re-execution after failures must not update the same data twice [40]. Independent tasks that are a-priori known and controlled by a single master can be re-run under the master's supervision, e.g. in MapReduce [41]. This approach extends to hierarchical master/worker patterns [42]. Vishnu et al. store tasks and data in a distributed array with automatic replication [43]. Their experiments with a chemical application showed an overhead of less than 15% with up to 4096 processes and 2 process failures. Kabir and Goswami suggest a fault-tolerant variant of the A* algorithm, which has a similar computation structure as task pools [44].

In fork/join programs, tasks return their result to a parent, which consequently can manage re-execution after failures [45, 46]. This approach was recently improved by Kestor, Krishnamoorthy and Ma [47]. Their proposal avoids re-execution of stolen children of failed tasks, which instead return their result to the grandparent (or another ancestor). Ancestors are discovered with the help of history information that is piggybacked onto loot deliveries. After failures, this information is globally collected in a so-called steal tree, and the proper ancestor is determined. The approach can be applied to reduction-based task pools, where steal trees reflect the steal relation instead of ancestry [48, 49]. The LogFT design was inspired by that scheme, but we save the history information in a resilient store instead of sending it around. This requires more messages, but avoids unbounded information loss and the need for global reduction.

AllFT has been introduced previously by two of the authors [8]. For the present paper, it was re-implemented in a more efficient way. Formerly, our group developed fault-tolerant task pool variants for the X10 version of GLB [22]. They did not use a resilient store, but manually distributed backups to other workers. In this context, a precursor of IncFT was sketched, alongside a rudimentary implementation [50]. In reference [23], a fault-tolerant X10 task pool was extended by malleability, i.e., the ability to add and remove resources at runtime. LogFT was originally proposed in the bachelor thesis of one of the authors [51].

There are several alternatives to the Hazelcast IMap, e.g. Infinispan [52]. X10 provides two resilient data stores, one based on Hazelcast and one implemented in pure Resilient X10 [53]. Other fault-tolerant data structures include arrays [54].

6. Conclusions

This paper has described and compared three application-level fault tolerance schemes for reduction-based distributed task pools: AllFT, IncFT and LogFT. All schemes periodically write backups to a resilient store. The backups include the task pool contents (possibly in a condensed form) and the current worker result. They are updated in the event of stealing. Upon failures, the failed worker's tasks are adopted by co-workers, such that the program computes the same result as in non-failure executions.

Our three schemes differ in the concrete information that is included in backups, as well as in some details of their recovery procedures. Two of the schemes impose additional constraints. The paper has formulated the schemes in a generic way, such that they can be applied to a class of task pool variants. Moreover, we implemented the schemes for a particular variant, namely lifeline-based global load balancing. For that, we extended the GLB library of the "APGAS for Java" programming system into program versions AllFTGLB, IncFTGLB and LogFTGLB.

Experiments with the three programs showed that all have a negligible restore overhead after failures. Thus, our experiments concentrated on running times in failure-free runs. Based on five benchmarks, we found that the performance of the schemes is comparable for small local pools, whereas for large pools, IncFTGLB and LogFTGLB outperform AllFTGLB. Thus, the choice of a scheme should be based on its compliance with the constraints, which often favors AllFT.

Future work may experimentally compare our schemes to conceptually different fault tolerance approaches, such as system-level checkpointing and the approach of reference [47], based on a common set of benchmarks. Another open issue is the application of the three schemes to other task pool variants, including hybrid task pools [55]. Finally, further benchmarks and real applications, such as the combinatorial search problems considered in [56], may be investigated.

References

[1] T. Herault, Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Springer, 2015. doi:10.1007/978-3-319-20943-2.
[2] J. Ansel, K. Arya, G. Cooperman, DMTCP: Transparent checkpointing for cluster computations and the desktop, in: Int. Symp. on Parallel & Distributed Processing, IEEE, 2009, pp. 1–12. doi:10.1109/ipdps.2009.5161063.
[3] P. H. Hargrove, J. C. Duell, Berkeley lab checkpoint/restart (BLCR) for Linux clusters, Journal of Physics: Conference Series 46 (2006) 494–499. doi:10.1088/1742-6596/46/1/067.
[4] F. Shahzad, M. Wittmann, M. Kreutzer, T. Zeise, G. Hager, G. Wellein, A survey of checkpoint/restart techniques on distributed memory systems, Parallel Processing Letters 23 (04) (2013) 1340011–1340030. doi:10.1142/s0129626413400112.
[5] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, M. Snir, Toward exascale resilience: 2014 update, Supercomputing Frontiers and Innovations 1 (1) (2014) 5–28. doi:10.14529/jsfi140101.
[6] P. Thoman, K. Dichev, T. Heller, et al., A taxonomy of task-based parallel programming technologies for high-performance computing, The Journal of Supercomputing 74 (4) (2018) 1422–1434. doi:10.1007/s11227-018-2238-4.
[7] C. Fohry, An overview of task-based parallel programming models, tutorial at HiPEAC 2019.
[8] J. Posner, C. Fohry, A Java task pool framework providing fault-tolerant global load balancing, Int. Journal of Networking and Computing (IJNC) 8 (1) (2018) 2–31. doi:10.15803/ijnc.8.1_2.
[9] C. Fohry, J. Posner, L. Reitz, A selective and incremental backup scheme for task pools, in: Int. Conf. on High Performance Computing & Simulation (HPCS), 2018, pp. 621–628. doi:10.1109/HPCS.2018.00103.
[10] V. A. Saraswat, P. Kambadur, S. Kodali, D. Grove, S. Krishnamoorthy, Lifeline-based global load balancing, in: Proc. ACM Symp. on Principles and Practice of Parallel Programming (PPoPP), 2011, pp. 201–212. doi:10.1145/1941553.1941582.
[11] W. Zhang, O. Tardieu, D. Grove, B. Herta, T. Kamada, V. Saraswat, M. Takeuchi, GLB: Lifeline-based global load balancing library in X10, in: Proc. ACM Workshop on Parallel Programming for Analytics Applications (PPAA), 2014, pp. 31–40. doi:10.1145/2567634.2567639.
[12] J. Posner, C. Fohry, Cooperation vs. coordination for lifeline-based global load balancing in APGAS, in: Proc. ACM SIGPLAN Workshop on X10, 2016, pp. 13–17. doi:10.1145/2931028.2931029.
[13] S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, C.-W. Tseng, UTS: An unbalanced tree search benchmark, in: Languages and Compilers for Parallel Computing, Springer LNCS 4382, 2006, pp. 235–250. doi:10.1007/978-3-540-72521-3_18.
[14] E. J. Gik, Schach und Mathematik, 1st Edition, Thun, 1987.
[15] L. C. Freeman, A set of measures of centrality based on betweenness, Sociometry 40 (1) (1977) 35. doi:10.2307/3033543.
[16] A. Prell, Embracing explicit communication in work-stealing runtime systems, Ph.D. thesis, University of Bayreuth (2016).
[17] J. Posner, Extended APGAS applications repository, https://github.com/posnerj/PLM-APGAS-Applications (2019).
[18] V. Saraswat, G. Almasi, G. Bikshandi, et al., The asynchronous partitioned global address space model, in: Proc. ACM SIGPLAN Workshop on Advances in Message Passing, 2010.
[19] O. Tardieu, The APGAS library: resilient parallel and distributed programming in Java 8, in: Proc. ACM SIGPLAN Workshop on X10, 2015, pp. 25–26. doi:10.1145/2771774.2771780.
[20] Hazelcast, The leading open source in-memory data grid, http://hazelcast.org (2019).
[21] Y. Guo, R. Barik, R. Raman, V. Sarkar, Work-first and help-first scheduling policies for async-finish task parallelism, in: Proc. IEEE Int. Parallel & Distributed Processing Symp. (IPDPS), 2009, pp. 1–12. doi:10.1109/IPDPS.2009.5161079.
[22] C. Fohry, M. Bungart, P. Plock, Fault tolerance for lifeline-based global load balancing, Journal of Software Engineering and Applications 10 (13) (2017) 925–958. doi:10.4236/jsea.2017.1013053.
[23] M. Bungart, C. Fohry, A malleable and fault-tolerant task pool framework for X10, in: Proc. IEEE Int. Conf. on Cluster Computing, Workshop on Fault Tolerant Systems, 2017, pp. 749–757. doi:10.1109/CLUSTER.2017.27.
[24] University of Kassel, Scientific data processing, https://www.uni-kassel.de/its-handbuch/en/daten-dienste/wissenschaftliche-datenverarbeitung.html (2019).
[25] Leibniz-Rechenzentrum (LRZ), SuperMUC petascale system, https://www.lrz.de/services/compute/supermuc/ (2019).
[26] J. Posner, Extended APGAS library repository, https://github.com/posnerj/PLM-APGAS (2019).
[27] J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems 22 (3) (2006) 303–312. doi:10.1016/j.future.2004.11.016.
[28] S. Hukerikar, C. Engelmann, Resilience design patterns: A structured approach to resilience at extreme scale, Supercomputing Frontiers and Innovations 4 (3) (2017) 4–42. doi:10.14529/jsfi170301.
[29] L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, S. Matsuoka, FTI: High performance fault tolerance interface for hybrid systems, in: Proc. Int. Conf. for High Performance Computing, Networking, Storage and Analysis, ACM Press, 2011, pp. 1–32. doi:10.1145/2063384.2063427.
[30] T. Herault, Y. Robert, A. Bouteiller, D. Arnold, K. B. Ferreira, G. Bosilca, J. Dongarra, Checkpointing strategies for shared high-performance computing platforms, Int. Journal of Networking and Computing 9 (1) (2019) 28–52. doi:10.15803/ijnc.9.1_28.
[31] I. P. Egwutuoha, D. Levy, B. Selic, S. Chen, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, The Journal of Supercomputing 65 (3) (2013) 1302–1326. doi:10.1007/s11227-013-0884-0.
[32] W. Bland, A. Bouteiller, T. Herault, G. Bosilca, J. Dongarra, Post-failure recovery of MPI communication capability, The Int. Journal of High Performance Computing Applications 27 (3) (2013) 244–254. doi:10.1177/1094342013488238.
[33] D. Cunningham, D. Grove, B. Herta, A. Iyengar, K. Kawachiya, H. Murata, V. Saraswat, M. Takeuchi, O. Tardieu, Resilient X10, ACM SIGPLAN Notices 49 (8) (2014) 67–80. doi:10.1145/2692916.2555248.
[34] L. V. Kale, S. Krishnan, CHARM++: A portable concurrent object oriented system based on C++, in: SIGPLAN, Vol. 28, ACM, 1993, pp. 91–108. doi:10.1145/165854.165874.
[35] G. Bosilca, R. Delmas, J. Dongarra, J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing 69 (4) (2009) 410–416. doi:10.1016/j.jpdc.2008.12.002.
[36] M. Korch, T. Rauber, A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice and Experience 16 (1) (2003) 1–47. doi:10.1002/cpe.745.
[37] C. Cao, T. Herault, G. Bosilca, J. Dongarra, Design for a soft error resilient dynamic task-based runtime, in: IEEE Int. Parallel and Distributed Processing Symp. (IPDPS), 2015, pp. 765–774. doi:10.1109/ipdps.2015.81.
[38] Y. Wang, W. Ji, F. Shi, Q. Zuo, A work-stealing scheduling framework supporting fault tolerance, in: Proc. Design, Automation & Test in Europe Conf. Exhibition (DATE), 2013, pp. 695–700. doi:10.7873/date.2013.150.
[39] M. C. Kurt, S. Krishnamoorthy, K. Agrawal, G. Agrawal, Fault-tolerant dynamic task graph scheduling, in: Proc. Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2014, pp. 719–730. doi:10.1109/SC.2014.64.
[40] W. Ma, S. Krishnamoorthy, Data-driven fault tolerance for work stealing computations, in: Proc. ACM Int. Conf. on Supercomputing, 2012, pp. 79–90. doi:10.1145/2304576.2304589.
[41] B. Memishi, S. Ibrahim, M. S. Pérez, G. Antoniu, Fault tolerance in MapReduce: A survey, in: Computer Communications and Networks, Springer, 2016, pp. 205–240. doi:10.1007/978-3-319-44881-7_11.
[42] A. Bendjoudi, N. Melab, E.-G. Talbi, FTH-B&B: A fault-tolerant hierarchical branch and bound for large scale unreliable environments, IEEE Transactions on Computers 63 (9) (2014) 2302–2315. doi:10.1109/tc.2013.40.
[43] A. Vishnu, H. J. J. van Dam, W. A. de Jong, Scalable fault tolerance using over-decomposition and PGAS models, https://pdfs.semanticscholar.org/816e/2068eadb17995b34f7d51d7dc4497d706b67.pdf.
[44] U. Kabir, D. Goswami, Identifying patterns towards algorithm based fault tolerance, in: Proc. Int. Conf. on High Performance Computing & Simulation, IEEE, 2015, pp. 508–516. doi:10.1109/hpcsim.2015.7237083.
[45] R. D. Blumofe, P. A. Lisiecki, Adaptive and reliable parallel computing on networks of workstations, in: Proc. of the Annual Conf. on USENIX, 1997.
[46] R. V. van Nieuwpoort, G. Wrzesińska, C. J. H. Jacobs, H. E. Bal, Satin: a high-level and efficient grid programming model, ACM Transactions on Programming Languages and Systems 32 (3). doi:10.1145/1709093.1709096.
[47] G. Kestor, S. Krishnamoorthy, W. Ma, Localized fault recovery for nested fork-join programs, in: IEEE Int. Parallel and Distributed Processing Symp. (IPDPS), 2017, pp. 397–408. doi:10.1109/ipdps.2017.75.
[48] M. Dratwa, Uebertragung eines fehlertoleranten Algorithmus fuer Fork/Join-Programme auf reduktionsbasierte Taskpools, Master's thesis, University of Kassel, Germany (2017).
[49] L. Reitz, Design and evaluation of work-stealing-based fault tolerance for task pools, Master's thesis, University of Kassel, Germany (in preparation).
[50] C. Fohry, M. Bungart, J. Posner, Towards an efficient fault-tolerance scheme for GLB, in: Proc. ACM SIGPLAN Workshop on X10, 2015, pp. 13–17. doi:10.1145/2771774.2771779.
[51] L. Reitz, An asynchronous backup scheme tracking work-stealing for reduction-based task pools, Bachelor's thesis, University of Kassel, Germany (2018).
[52] Red Hat, Infinispan, http://infinispan.org (2019).
[53] D. Grove, S. S. Hamouda, B. Herta, A. Iyengar, K. Kawachiya, J. Milthorpe, V. Saraswat, A. Shinnar, M. Takeuchi, O. Tardieu, Failure recovery in resilient X10, Tech. rep., IBM (2017).
[54] A. Chien, P. Balaji, P. Beckman, et al., Versioned distributed arrays for resilience in scientific applications: Global view resilience, Procedia Computer Science 51 (2015) 29–38. doi:10.1016/j.procs.2015.05.187.
[55] J. Posner, C. Fohry, Hybrid work stealing of locality-flexible and cancelable tasks for the APGAS library, The Journal of Supercomputing 74 (4) (2018) 1435–1448. doi:10.1007/s11227-018-2234-8.
[56] B. Archibald, P. Maier, R. Stewart, P. Trinder, Implementing YewPar: A framework for parallel tree search, in: Proc. Euro-Par Parallel Processing, Springer LNCS 11725, 2019, pp. 184–196. doi:10.1007/978-3-030-29400-7_14.

Jonas Posner


Jonas Posner received his Bachelor's and Master's degrees in Computer Science from the University of Kassel, Germany, in 2014 and 2016, respectively. Currently, he is a full-time Ph.D. student and works for the Programming Languages / Methodologies research group at the University of Kassel. Jonas' research interests include programming languages and high performance computing, with a focus on PGAS, task-based systems, load balancing, fault tolerance, and malleability.

Claudia Fohry

Claudia Fohry is an Associate Professor of Computer Science at the University of Kassel, Germany, where she leads the research group on Programming Languages / Methodologies. She received a Diploma in Mathematics in 1990 and a Ph.D. (Dr. rer. nat.) in Computer Science in 1992, both from Humboldt-Universität zu Berlin, Germany. In 2003, she habilitated in Computer Science at Friedrich-Schiller-Universität Jena, Germany. Her research interests include parallel programming and parallel algorithms. Recently, she concentrated on PGAS and task-based parallel programming systems, and studied issues such as resilience and load balancing.

Lukas Reitz

Lukas Reitz received his Bachelor's degree in Computer Science in 2018 from the University of Kassel, Germany, and is currently a Master's student in Computer Science at the University of Kassel. His research interests are in the field of high performance computing and fault tolerance. Since 2017, he has been working with the Programming Languages / Methodologies research group at the University of Kassel and is involved in research projects on fault tolerance. During this work he has co-published several papers in the field of fault tolerance and high performance computing.
