
CLPKM: A checkpoint-based preemptive multitasking framework for OpenCL kernels☆

Ming-Tsung Chiu, Yi-Ping You∗
Department of Computer Science, College of Computer Science, National Chiao Tung University, Hsinchu, Taiwan

Journal of Systems Architecture 98 (2019) 53–62

☆ An extension of an ICPP-EMS '18 workshop paper [1]. The implementation of the proposed framework (CLPKM) is available at https://github.com/ypyou/clpkm.git.
∗ Corresponding author. E-mail address: [email protected] (Y.-P. You).
https://doi.org/10.1016/j.sysarc.2019.06.008
Received 28 November 2018; Received in revised form 31 May 2019; Accepted 28 June 2019; Available online 29 June 2019.
© 2019 Elsevier B.V. All rights reserved.

Keywords: GPGPU, OpenCL, Preemption, Software checkpointing

Abstract

Heterogeneous computing has become popular in the past decade. Many frameworks have been proposed to provide a uniform way to program accelerators such as GPUs, DSPs, and FPGAs. Among them, an open and royalty-free standard, OpenCL, is widely adopted by the industry. However, many OpenCL-enabled accelerators, and the standard itself, do not support preemptive multitasking. To the best of our knowledge, previously proposed techniques are either not portable or cannot handle ill-designed kernels (the code executed on the accelerators) that never finish. This paper presents a framework, called CLPKM, that provides an abstraction layer between OpenCL applications and the underlying OpenCL runtime to enable preemption of a kernel execution instance based on a software checkpointing mechanism. CLPKM comprises (1) an OpenCL runtime library that intercepts OpenCL API calls, (2) a source-to-source compiler that performs the preemption-enabling transformation, and (3) a daemon that schedules OpenCL tasks using priority-based preemptive scheduling techniques. Experiments demonstrated that CLPKM reduced the slowdown of high-priority processes from 4.66x to 1.52–2.23x under up to 16 low-priority, heavy-workload processes running in the background, while causing an average slowdown of 3.02–6.08x for low-priority processes.

1. Introduction

With the rise of high-performance computing, techniques such as machine learning that were once considered impractical have become game changers. With their high availability and power efficiency, graphics processing units (GPUs) hold a competitive market share relative to central processing units (CPUs) and application-specific integrated circuits. General-purpose computing on GPUs (GPGPU) has dominated heterogeneous computing over the past decade thanks to its power efficiency and throughput.

To GPGPU programmers, tasks running on GPUs can be categorized into data-transfer and computation tasks. The former synchronize data between the main memory and the device memory on GPUs, whereas the latter run a programmer-specified number of work-items (or threads), each of which executes a specified entry function, called a kernel, on GPUs. A typical task sequence is transferring data into a GPU, computing, and retrieving the result back to the main memory. Modern GPUs are able to transfer data and compute concurrently, so a carefully written program can overlap data-transfer and computation tasks to improve the overall throughput. Nevertheless, the execution time of the two kinds of tasks is not equally predictable. If the source, destination, and length of a data-transfer task are valid, we can expect the task to finish at some point in the future. Things are more complicated for computation tasks: if a kernel contains some form of infinite control-flow behavior, it may never finish. Unfortunately, GPUs did not have the capability to evict a running kernel until recently. Despite the preemption support that NVIDIA brought to its latest GPUs [2], most other GPUs do not ship with this feature.

Although GPUs offer high computing power, programming a GPU is considerably harder than programming a CPU, and newcomers are unlikely to get things right on the first try. The popular programming models for GPGPU, CUDA [3] and OpenCL [4], also provide no way to stop a running task. To make things worse, mainstream GPUs are capable of running only a single computing task at a time. As a result, an erroneous GPGPU task can block other GPGPU tasks that rely on the same GPU, which can be a common problem on platforms used for training and development of GPGPU programming. Some previous studies have brought priority management to the popular programming frameworks (e.g., leveraging the capability of multiple GPUs [5,6], proposing hardware extensions to support preemption [7–9], using customized drivers to ensure that high-priority tasks can be scheduled at any time [10], and spatially dividing workloads to reduce response time [11]) so as to address the blocking issue.



However, the aforementioned approaches either need special hardware extensions or fail to handle inherently endless GPGPU tasks; in other words, they do not achieve portability and effectiveness at the same time. In this paper, we propose a pure software framework, called CLPKM, based on a software checkpointing mechanism, to enable preemption of OpenCL tasks on any OpenCL-enabled platform and to manage ill-designed, endless OpenCL tasks. The primary concept of the framework is to instrument code that saves the live values of a low-priority task at a checkpoint, so that the task can be preempted at the checkpoint and later (e.g., after a high-priority task has been executed) restored from the checkpoint by a priority-aware task scheduler.

2. Related work

Considering the shortcomings of GPGPU platforms and programming models, many previous studies have sought to enable preemption of GPGPU tasks. We summarize some representative studies in this section.

Chimera [7] introduces a hardware extension, called streaming multiprocessor (SM) flushing, that enables an SM to drop the progress of a running thread block so that a new task can be dispatched to the SM right after the flushing. Chimera automatically decides which SM to preempt so as to minimize the runtime cost on a thread-block by thread-block basis. GPU-SAM [5] leverages split-and-merge execution on multiple GPUs in order to reduce response time. It splits a kernel into disjoint work-spaces and replaces some of the OpenCL built-in functions to maintain correctness. GPU-SAM also uses a runtime manager that automatically assigns an appropriate execution mode based on offline profile data. RGEM [10] is a user-space runtime engine that splits memory copy operations into chunks so that the operations can be preempted at chunk boundaries, since DMA copies are usually non-preemptive and buggy programs can easily issue large memory copy transactions. A runtime kernel scheduler takes priority into account; however, high-priority kernels are still subject to influence from previously launched low-priority kernels. PKM [11] follows an approach similar to that of RGEM. Unlike RGEM, PKM allows data-transfer and computation tasks to run concurrently. PKM divides kernels into smaller sub-kernels, in a spatial partitioning manner, to facilitate finer-grained preemption. The system administrator is responsible for setting an appropriate sub-kernel size as a trade-off between finer-grained preemption and overhead. EffiSha [9] uses a source-to-source compilation strategy that transforms a kernel into a form that uses persistent threads, such that the transformed kernel voluntarily returns between so-called block-tasks (the sets of work done by a thread block) if the runtime decides to evict it. Since the evictions take place at the ends of block-tasks, no intermediate values need to be saved, which avoids the checkpointing cost. EffiSha also avoids the overhead of re-launching a kernel if the kernel does not need to be evicted. Our proposed approach is similar to PKM and EffiSha, but our study proposes a temporal partitioning of a kernel execution instance.

The aforementioned studies are promising, but they either fail to preempt kernels that are endless by nature or require special hardware support. In other words, they fail to achieve effectiveness and portability at the same time. To achieve both, despite the runtime overhead expected for low-priority processes, we propose to use a time-honored technique, source-to-source compilation, to provide a solution on top of the vendor's OpenCL runtime library. Provided that preemptive multitasking of OpenCL kernel execution is supported, many scheduling enhancements from the literature become applicable and more practical on OpenCL devices. For example, when a preempted kernel gets resumed, the kernel might be migrated to another device for better load balancing or performance, using the scheduling techniques proposed by Du et al. [12] and Pham et al. [13]. When setting the priority and execution order of OpenCL programs, scheduling with precedence graphs could yield further performance gains [14–17].

Fig. 1. High-level workflow of CLPKM.

3. Design and implementation

3.1. Overview

CLPKM is a framework that provides preemptive scheduling for OpenCL kernels with process-based priority levels. In order to preempt kernels of low-priority processes, it transforms the kernels to offer finer preemption granularity, such that these kernels can yield to kernels of high-priority processes. In other words, a kernel of a low-priority process is instrumented with checkpoint-manipulation code that saves the execution states of the kernel and forces the kernel to break its execution at a checkpoint when the kernel is preempted, and that restores the execution states and restarts the kernel from the checkpoint when the kernel is resumed. CLPKM consists of three components: (1) a transcompiler (a source-to-source compiler) that performs the preemption-enabling transformation, (2) a daemon, which is responsible for scheduling, and (3) a runtime library, which intercepts OpenCL API calls, invokes the transcompiler on demand, and coordinates with the daemon. Fig. 1 shows the high-level workflow of CLPKM.

3.2. The transcompiler

Every time the API clBuildProgram() is called, the CLPKM runtime library delegates to the CLPKM transcompiler to perform the preemption-enabling transformation. The transformation makes a kernel voluntarily relinquish the OpenCL device that it runs on (i.e., suspend its execution) at a checkpoint if the elapsed execution time of the kernel exceeds a threshold. The transcompiler also instruments code that saves the execution states of the kernel before the kernel is preempted at a checkpoint and restores the execution states for restarting from the checkpoint after the kernel is resumed. In this study, we place checkpoints only within loops, which usually account for a dominant fraction of program execution time and are thus considered the candidates to be transformed into a preemption-enabled form.
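As a running example for the rest of this section, the following is a minimal OpenCL C kernel of the kind the transcompiler targets; the grid-stride loop makes it a candidate for checkpoint insertion. The kernel is illustrative and not reproduced from the paper's figures:

    // vector_add with a grid-stride loop; each work-item may process
    // several elements, so checkpoints can be placed inside the loop.
    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c,
                             uint n) {
        for (size_t i = get_global_id(0); i < n; i += get_global_size(0))
            c[i] = a[i] + b[i];
    }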

3.2.1. Clang extension

The transcompiler is implemented based on Clang [18], the C front end of the LLVM compiler infrastructure. With the source-range information provided by Clang and the help of the Rewriter class, we can modify or instrument code constructs within OpenCL kernel functions. However, Clang does not always consider semicolons part of a source range, which leads to problems when instrumenting code. We therefore extended Clang with new APIs, such as getStmtLocEnd(), which is similar to getLocEnd(), to address this issue.


3.2.2. Preprocessing

Before the transformation is performed, some preliminary work, including macro expansion, shadowed-variable renaming, and function inlining, is carried out in order to make the transformation simple and effective.

3.2.2.1. Macro expansion. A macro use may be expanded into one or more loops, which typically account for the majority of the execution time of most applications and are thus candidates for placing checkpoints within each loop iteration. However, it would be difficult to place checkpoints inside a macro function in a way that is general enough to fit all circumstances. Enforcing macro expansion beforehand is a straightforward way to increase the coverage of the code to be made preemption-enabled.

3.2.2.2. Shadowed-variable renaming. The key function of the transcompiler is to instrument code that saves and restores the execution states of a kernel execution instance at a checkpoint. A main source of the execution states is live variables. However, a problem arises if an outer-scope variable is shadowed by an inner-scope variable (i.e., a variable declared within a certain scope has the same name as a variable declared in an outer scope), because the value of the outer variable is then neither preserved nor restored at a checkpoint. We therefore deploy a simple renamer, which renames shadowed variables, to solve the issue (a sketch follows after Section 3.2.2.3).

3.2.2.3. Function inlining. Similar to macro functions, it may be necessary to place checkpoints into functions that contain time-consuming loops. However, if checkpoints were put directly into a function, the caller would have to distinguish whether the function returned because its time slice expired or because it terminated normally. Inlining a function, thereby exposing its control flow to the caller, is a simpler solution. The source-level inliner replaces a function call with a statement expression that contains the body of the callee function, with an appropriate prologue and epilogue to set up the arguments and the return value associated with the call. Fig. 2 shows a simple example of source-level function inlining, in which the statement expression is in red text.
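To make the shadowing issue of Section 3.2.2.2 concrete, here is a minimal sketch; the kernel and the renaming suffix scheme are illustrative, not the transcompiler's actual output:

    // Before renaming: the inner `acc` shadows the outer one, so naive
    // save/restore code could not refer to both values by name.
    __kernel void shadow_demo(__global float *buf, uint n) {
        float acc = 0.0f;
        for (uint i = 0; i < n; ++i) {
            float acc = buf[i] * 2.0f;   // shadows the outer acc
            buf[i] = acc;
        }
        buf[0] += acc;
    }
    // After renaming, the inner declaration might become, for example:
    //     float acc_1 = buf[i] * 2.0f;
    //     buf[i] = acc_1;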

Fig. 2. An example of source-level inline expansion.
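Since Fig. 2 is not reproduced here, the following sketch shows the flavor of the transformation, under the assumption that GNU statement expressions are available in the target OpenCL C dialect (Clang accepts them as an extension); the helper and the generated names are illustrative:

    // The callee to be inlined; its loop should host checkpoints.
    float scale_sum(__global const float *p, uint n) {
        float s = 0.0f;
        for (uint i = 0; i < n; ++i) s += p[i];
        return s;
    }
    // A call site `y = scale_sum(a, n);` is rewritten into a statement
    // expression: a prologue copies the arguments, the callee body is
    // pasted in, and the last expression yields the return value.
    y = ({ __global const float *__arg_p = a;
           uint __arg_n = n;
           float __ret;
           {
               float s = 0.0f;
               for (uint i = 0; i < __arg_n; ++i) s += __arg_p[i];
               __ret = s;
           }
           __ret; });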

3.2.3. Preemption-enabling transformation

The transcompiler enables preemption of a kernel execution instance by transforming the kernel into a form in which the kernel execution can be suspended at a checkpoint and later resumed from that checkpoint. The transformation involves four main changes: (1) extending the parameter list of kernel functions, (2) injecting the preliminary skeleton code of the checkpointing mechanism, (3) inserting code for checkpoint manipulation (e.g., saving and restoring execution states) within loops, and (4) resolving potential problems caused by barriers. The details of the transformation are given as follows.

3.2.3.1. Parameter extension. The parameter list of the kernel function to be transformed is extended with five additional parameters so as to pass execution states of the kernel to/from the CLPKM runtime library. The five parameters are the minimum checkpoint interval (threshold), two buffers (loc_live_val_buf and prv_live_val_buf), a size table of __local arguments (loc_arg_size_tbl), and a progress vector (progress), as shown in red text in Fig. 3. The minimum checkpoint interval determines the minimum time slice allotted to the kernel. loc_live_val_buf and prv_live_val_buf are buffers in the global memory for storing the live values of variables that are allocated in the local and private memory, respectively, when a checkpoint is taken (the local and private memory are not guaranteed to be persistent across kernel invocations). loc_arg_size_tbl is a read-only table that stores the sizes (in bytes) of all __local arguments of the kernel, where the sizes of these __local buffers are decided by programmers through an OpenCL API call (clSetKernelArg()). The CLPKM runtime library must pass the sizes to the kernel so that the kernel knows how much space is needed for saving values in the buffers at a checkpoint. progress is a flag vector that indicates the progress of each work-item within the grid defined by the NDRange. Each flag in the vector is a signed integer and is initialized to FIRST_RUN, which indicates that the corresponding work-item is about to execute the kernel for the first time; it is changed to DONE right before the work-item finishes its task. At each checkpoint position, the work-item updates its flag with a unique number if the checkpoint is taken, so as to keep track of the execution progress. How the flags are updated is discussed in the following two subsections.
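Applied to the vector_add example above, the extended signature would look roughly as follows; the parameter names follow the paper, but the types are our assumption, since Fig. 3 is not reproduced:

    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c, uint n,
                             /* parameters added by the transcompiler: */
                             ulong threshold,                   // minimum checkpoint interval (cycles)
                             __global char *loc_live_val_buf,   // saved __local live values
                             __global char *prv_live_val_buf,   // saved __private live values
                             __global const uint *loc_arg_size_tbl, // sizes of __local arguments
                             __global int *progress);           // per-work-item progress flags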

3.2.3.2. Injection of the preliminary skeleton for checkpointing. Before any checkpoint is placed into the kernel to be transformed, the kernel body is enclosed with a switch-case construct along with a preceding prologue and a concluding epilogue. Fig. 3 gives the code after injecting the preliminary skeleton into the vector_add kernel shown in Fig. 2a. The prologue (shown in green background color in Fig. 3) includes the preparation of CLPKM-related information, such as initializing the checkpoint timer (which logs the time taken by the current invocation of the kernel), calculating the linear ID of each thread, and locating where live values should be stored. Following the prologue, a switch-case construct (with the execution progress of the running work-item, progress[global_id], as its control variable, shown in cyan background color in Fig. 3) encloses the original kernel body and determines where the execution flow should continue. If a work-item executes the kernel for the first time (its progress flag is initially FIRST_RUN, as mentioned in Section 3.2.3.1), the work-item simply executes the body. In contrast, if it is not the first time that a work-item executes the kernel, the work-item takes a case label that indicates the program point where the work-item last met a checkpoint and re-executes the body from that checkpoint. When a work-item finishes its task (i.e., the original function body), the progress flag of the work-item is set to DONE. The epilogue (shown in red background color in Fig. 3) is a sequence (with a leading label) that stores the values of all variables within the kernel that are allocated in the local memory to the global memory. All return statements in the original body (and also the return statements added for checkpointing) are replaced with a goto statement that jumps to the leading label, in order to ensure that the epilogue is executed at the end of each invocation of the kernel, even if a work-item has finished its task. This is necessary because other work-items may need the updated values in the local memory to continue their work in the next run(s).

Fig. 3. The result after injecting the preliminary skeleton of the checkpointing mechanism to the vector_add kernel in Fig. 2a.
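Since Fig. 3 is not reproduced here, the following is a minimal sketch of the injected skeleton; the sentinel values of FIRST_RUN and DONE and the prologue/epilogue details are assumptions:

    #define FIRST_RUN 0           /* sentinel values are illustrative */
    #define DONE      0x7fffffff
    __kernel void vector_add(__global const float *a, __global const float *b,
                             __global float *c, uint n,
                             ulong threshold, __global char *loc_live_val_buf,
                             __global char *prv_live_val_buf,
                             __global const uint *loc_arg_size_tbl,
                             __global int *progress) {
        size_t global_id = get_global_id(0);  /* prologue: linear ID, ...   */
        ulong ckpt_timer = 0;                 /* ... and checkpoint timer   */
        switch (progress[global_id]) {        /* where should we continue?  */
        case FIRST_RUN:
            /* original body goes here; each checkpoint adds a `case <id>:`
               label inside it (see Section 3.2.3.3) */
            for (size_t i = global_id; i < n; i += get_global_size(0))
                c[i] = a[i] + b[i];
            progress[global_id] = DONE;       /* task finished */
        }
    epilogue:
        /* store __local live values to loc_live_val_buf; every return in
           the body is rewritten into `goto epilogue;` */
        return;
    }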

3.2.3.3. Checkpoint insertion. As mentioned at the beginning of Section 3.2, checkpoints are placed only within loops in this study. More specifically, a checkpoint is inserted at the end of a loop body and consists of a checkpointing sequence and a resuming sequence. Fig. 4 shows an example of inserting a checkpoint (shown in green background color) at the end of a loop body; the loop is assumed to be part of the body of a kernel function. The checkpointing sequence (shown in red text in Fig. 4) first updates the checkpoint timer, either by increasing the value of the timer by a compile-time estimate of the cost of one loop iteration or by leveraging hardware registers. If the checkpoint timer reaches the minimum checkpoint interval (i.e., the time slice of the work-item expires), the sequence changes the work-item's progress flag to the case label produced in the resuming sequence, stores the live values in the private memory to the global memory, and then jumps to the injected epilogue (described in Section 3.2.3.2).

Fig. 4. An example of placing a checkpoint into a loop body.
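Continuing the sketch from Section 3.2.3.2 (again with illustrative names, since Fig. 4 is not reproduced), a loop body with one checkpoint might look as follows; the fragment sits inside the injected switch construct, which is what makes the case label reachable:

    for (size_t i = global_id; i < n; i += get_global_size(0)) {
        c[i] = a[i] + b[i];            /* original loop body */
        /* checkpointing sequence */
        ckpt_timer += 12;              /* per-iteration cost estimate (illustrative) */
        if (ckpt_timer >= threshold) {
            progress[global_id] = 2;   /* unique id of this checkpoint */
            /* save live values (e.g., i) to prv_live_val_buf */
            goto epilogue;
        }
        /* resuming sequence: skipped in normal flow, entered only by a
           jump from the enclosing switch when progress[global_id] == 2 */
        if (0) {
        case 2:
            /* restore live values (e.g., i) from prv_live_val_buf */
            ;
        }
    }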

The resuming sequence (shown in blue text in Fig. 4) restores the live values saved by the checkpointing sequence of the previous run and resumes the execution of the remaining loop iterations. The resuming sequence is contained within an "if-zero" statement carrying a case label with an integer number (a unique number for each checkpoint), so the resuming sequence is executed only when it is jumped to from the enclosing switch statement.

3.2.3.4. Resolving issues for barriers. The OpenCL specification requires that when a barrier (a call to barrier()) is placed within a kernel function, all the work-items belonging to the same work-group execute the barrier before any of them continues execution beyond it. Unfortunately, this requirement cannot be satisfied if some work-items of the work-group have been checkpointed before they reach the barrier; left untreated, the barrier synchronization would collapse. Therefore, the transcompiler must ensure that all the work-items in the same work-group reach a barrier together if a checkpoint occurs before the barrier. This is done by forcing a work-item that is about to reach the barrier to stop before entering it and to exit the kernel as if it had been checkpointed, whenever some other work-item(s) in the same work-group have been checkpointed and not yet resumed. Fig. 5 shows how a barrier is transformed according to this idea. We introduce a barrier stop, a counter shared across the work-group and initialized to the size of the work-group, for each single call to the barrier function. The barrier stop for a specific barrier is accessed prior to the barrier with three operations (called a barrier-stop sequence): decrement by one, fetch the value, and increment by one. If all the work-items belonging to the same work-group have executed the barrier-stop sequence (i.e., they have all reached the barrier), the value fetched in the second operation is zero, and all the work-items can proceed; otherwise, the work-items that have reached the barrier-stop sequence are "checkpointed"; we call this type of checkpoint a "barricaded checkpoint". Unlike a regular checkpoint, when a barricaded checkpoint occurs, the progress flag is set to a negative integer whose absolute value is the case label preceding the resuming sequence. The injected switch-case construct disallows a work-item with a negative progress flag from making any progress until the CLPKM runtime library finds that all the work-items in the work-group have the same negative progress flag and flips their signs. From then on, the work-items can continue from the resuming sequence of the checkpoint and go beyond the barrier.

Fig. 5. An example of transforming a barrier.
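A minimal sketch of the barrier-stop sequence, using OpenCL 1.x atomics; the placement of the counter in __local memory and its initialization are our assumptions (Fig. 5 is not reproduced):

    /* barrier_stop is shared by the work-group and assumed to be
       initialized to the work-group size before first use; this fragment
       sits inside the transformed kernel body. */
    __local volatile int barrier_stop;
    /* ... */
    atomic_dec(&barrier_stop);            /* decrement by one */
    int not_arrived = barrier_stop;       /* fetch the value  */
    atomic_inc(&barrier_stop);            /* increment by one */
    if (not_arrived == 0) {
        barrier(CLK_LOCAL_MEM_FENCE);     /* everyone has arrived */
    } else {
        /* barricaded checkpoint: negative flag whose absolute value is
           the case label of the corresponding resuming sequence */
        progress[global_id] = -2;
        /* save live values */
        goto epilogue;
    }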

3.3. The daemon

The CLPKM daemon provides a simple priority-aware scheduling service for OpenCL kernel execution instances over the sd-bus API, a lightweight D-Bus IPC client library that is part of systemd [19]. The daemon tracks the resource utilization of high-priority processes, such as their utilization of compute units and memory buses of OpenCL devices, with a global bitmap, in which each bit indicates whether a resource is allocated to high-priority processes. The daemon also provides methods for a process to announce a resource demand (e.g., SetHighPriorityTaskBitmap() for high-priority processes to notify the daemon of their work). When a high-priority process is about to start a new OpenCL task, it makes a request with a bitmap to the daemon. The daemon collects all the requests made by high-priority processes and summarizes them into the global bitmap, which is broadcast to all low-priority processes whenever it changes. Low-priority processes then examine the global bitmap and start an OpenCL task only when the bitmap of the task has no conflict with the global bitmap. Introducing a daemon to manage priorities and maintain the global state of the system on which OpenCL applications run is simpler than doing the same with only the CLPKM runtime library, since certain cleanup operations, such as descheduling accidentally terminated high-priority processes, can be performed automatically by the daemon to preserve the global stability of the system.

3.4. The runtime library

The CLPKM runtime library (or CLPKM runtime for short) is a wrapper library on top of the vendor's OpenCL runtime library. It interacts with the OpenCL applications, the CLPKM transcompiler, the CLPKM daemon, and the vendor's OpenCL runtime library, as illustrated in Fig. 1. We discuss in this section how the CLPKM runtime transparently interacts with OpenCL applications. Most of the issues concern interaction with low-priority processes, which are the ones to be preempted when a high-priority process arrives.

3.4.1. Building programs

When a low-priority OpenCL application calls clBuildProgram() to build a program executable, the CLPKM runtime intercepts the call, invokes the CLPKM transcompiler to perform the preemption-enabling transformation of the source program, and finally passes the transformed source program, as a shadow program on behalf of the original program, to the vendor's OpenCL runtime library to build the executable. We call the kernel objects created from a shadow program shadow kernels.
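The paper does not detail the interposition mechanism; one common way to realize such a wrapper on Linux is symbol interposition, sketched below. clpkm_transpile() and the overall flow are hypothetical; only the clBuildProgram() signature and dlsym() are real APIs.

    #define _GNU_SOURCE
    #include <CL/cl.h>
    #include <dlfcn.h>

    /* Hypothetical helper: runs the CLPKM transcompiler over the program
     * source and returns a shadow program built from the result. */
    extern cl_program clpkm_transpile(cl_program original);

    cl_int clBuildProgram(cl_program program, cl_uint num_devices,
                          const cl_device_id *device_list, const char *options,
                          void (CL_CALLBACK *pfn_notify)(cl_program, void *),
                          void *user_data)
    {
        /* Locate the vendor's implementation hidden behind this wrapper. */
        cl_int (*vendor_build)(cl_program, cl_uint, const cl_device_id *,
                               const char *,
                               void (CL_CALLBACK *)(cl_program, void *), void *)
            = (cl_int (*)(cl_program, cl_uint, const cl_device_id *,
                          const char *,
                          void (CL_CALLBACK *)(cl_program, void *), void *))
              dlsym(RTLD_NEXT, "clBuildProgram");

        cl_program shadow = clpkm_transpile(program);  /* build the shadow */
        return vendor_build(shadow, num_devices, device_list,
                            options, pfn_notify, user_data);
    }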


3.4.2. Enqueuing kernels

When an OpenCL application calls clEnqueueNDRangeKernel() to enqueue a command to execute a kernel that has been transformed into a shadow kernel, the CLPKM runtime must intercept the call and enqueue a series of commands that execute the shadow kernel in a way that works with the checkpoint manipulation within the shadow kernel and that conforms to the OpenCL specification in terms of asynchrony. Fig. 6 illustrates a four-phase sequence diagram of enqueuing such a kernel with the CLPKM runtime; the four phases are the initialization, first-run, later-runs, and finalization phases.

When such a kernel is enqueued for execution by an OpenCL application, the CLPKM runtime starts the initialization phase, in which it allocates space in the global memory for the extended parameters of the corresponding shadow kernel (including progress, loc_arg_size_tbl, loc_live_val_buf, and prv_live_val_buf) and then enqueues two write-buffer commands, which initialize the progress vector with all entries set to FIRST_RUN and the buffer pointed to by loc_arg_size_tbl with the sizes of the arguments declared with the __local qualifier, respectively. Later, in the first-run phase, the CLPKM runtime enqueues two dependent commands (denoted as a shadow-kernel-launch operation), which wait on the completion of the two aforementioned write-buffer commands: (1) a kernel-launch command for the shadow kernel, which performs the actual computation with the same kernel configuration parameters as well as the extended parameters, and (2) a read-buffer command, which reads back the progress vector and registers a callback function for the completion of the read-buffer command. The callback function determines whether another shadow-kernel-launch operation must be enqueued. It is worth noting that the buffers allocated by the CLPKM runtime are delivered as user data of the callback function, such that the buffers can be released once the computation of the kernel has fully completed. Right after enqueuing the first shadow-kernel-launch operation, the CLPKM runtime creates a user event, enqueues a marker command that waits on the user event, and returns the event object of the marker command as the event that identifies the kernel execution instance of the intercepted call.

Once the aforementioned callback function (part of the CLPKM runtime) has been called, which indicates a transition to the later-runs phase, it checks whether the progress vector contains only flags with the value DONE (i.e., whether all work-items of the kernel have finished). If not, a shadow-kernel-launch operation is enqueued again to resume the checkpointed kernel execution instance. In addition, the callback function also enqueues a write-buffer command, prior to the shadow-kernel-launch operation, that flips the signs of all the flags in the progress vector belonging to the same work-group if those flags hold the same negative number, as described at the end of Section 3.2.3.4. Once a callback function determines that the progress vector is filled with DONE, which indicates a transition to the finalization phase, it releases the buffers allocated by the CLPKM runtime in the initialization phase and sets the execution status of the user event to CL_COMPLETE, which indirectly announces the termination of the kernel execution instance created by the intercepted clEnqueueNDRangeKernel().

Fig. 6. A sequence diagram of enqueuing a kernel, which is transformed into a shadow kernel, with the CLPKM runtime library.

3.4.3. Kernel object pool

As described in Section 3.4.2, the intercepted call to clEnqueueNDRangeKernel() returns after the first kernel-launch command (for the first run) is enqueued, rather than after the kernel execution instance is complete, in order to preserve the asynchrony of clEnqueueNDRangeKernel(). With the CLPKM framework, the kernel execution instance that is indirectly created by the call to clEnqueueNDRangeKernel() from an OpenCL application might be divided into several shadow kernel execution sub-instances (or sub-instances for short). However, it is possible for the kernel object that is passed as an argument of clEnqueueNDRangeKernel() to be set with new arguments for other purposes after the call; this means that the shadow kernel object created in the first place cannot be reused as an argument for later sub-instances, since the shadow kernel object might have been changed in the meantime, leading to incorrect behavior of the sub-instances. Hence, we deploy a kernel object pool that contains only one type of kernel object: exact duplicates of the shadow kernel object created in the first place for the intercepted call to clEnqueueNDRangeKernel(). Every time the CLPKM runtime (on behalf of the application) needs to enqueue a shadow kernel, it draws one kernel object from the pool, passes the kernel object as user data of the callback function, and returns the object to the pool on termination. If there is no spare kernel object in the pool, the CLPKM runtime creates one for the request.

3.4.4. Shadow queue

As discussed in Section 3.4.2, a kernel-launch command (denoted as K) enqueued by an OpenCL application is interpreted by the CLPKM runtime as a set of operations and commands (denoted as SK1, SK2, ..., SKn), which launch the kernel execution sub-instances that make up the kernel execution instance launched by K. In principle, these commands (SK1–SKn) may be interleaved in time with some other command (denoted as C) that is enqueued after K by the OpenCL application. If SK1–SKn were enqueued to the very same command queue that contains K and C, the application might experience serious problems that lead to incorrect results. If the command queue is an in-order queue, C (which is supposed to be executed after the completion of K) could be enqueued and executed before SKn (which launches the last kernel execution sub-instance) is enqueued and executed; this breaks the in-order requirement, since C would start before K (or SKn) finishes. Another problem occurs when a barrier command (enqueued by calling clEnqueueBarrier()) follows a kernel-launch command in an out-of-order command queue; now C refers to the barrier command. Under such a circumstance, the barrier command is likely enqueued between the marker command (which is enqueued in the first-run phase) and SKn (which launches the last kernel execution sub-instance in the later-runs phase). However, according to the OpenCL specification, the barrier command requires the marker command to finish before SKn can start, while the CLPKM runtime defines that the marker command waits on the completion of SKn; this circular dependency results in a deadlock. To solve these issues, the CLPKM runtime creates an out-of-order shadow queue for each command queue and enqueues the SK1–SKn commands to the shadow queue and all other commands to the original queue. Since the event object of the marker command identifies the kernel execution instance created by K (as described in Section 3.4.2), whether the command queue is in-order or out-of-order, C is guaranteed to wait on the completion of K before it starts.
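To make the first-run/later-runs loop of Section 3.4.2 concrete, here is a minimal host-side sketch. The bookkeeping struct and the helpers all_done(), flip_barricaded_flags(), and release_clpkm_buffers() are hypothetical; clEnqueueNDRangeKernel(), clEnqueueReadBuffer(), clSetEventCallback(), and clSetUserEventStatus() are the real OpenCL entry points involved.

    #include <CL/cl.h>

    struct shadow_ctx {                   /* hypothetical bookkeeping */
        cl_command_queue shadow_queue;
        cl_kernel        shadow_kernel;   /* drawn from the kernel pool */
        cl_mem           progress_buf;
        cl_int          *progress;        /* host copy of the flags */
        size_t           num_items, global_size, local_size;
        cl_event         user_event;      /* backs the returned marker */
    };

    static void enqueue_shadow_kernel_launch(struct shadow_ctx *ctx);
    /* hypothetical helpers, defined elsewhere: */
    extern int  all_done(struct shadow_ctx *ctx);   /* every flag == DONE? */
    extern void flip_barricaded_flags(struct shadow_ctx *ctx);
    extern void release_clpkm_buffers(struct shadow_ctx *ctx);

    /* Called when the read-back of the progress vector completes. */
    static void CL_CALLBACK on_progress_read(cl_event ev, cl_int st, void *ud)
    {
        struct shadow_ctx *ctx = ud;
        if (all_done(ctx)) {                         /* finalization phase */
            release_clpkm_buffers(ctx);
            clSetUserEventStatus(ctx->user_event, CL_COMPLETE);
        } else {                                     /* later-runs phase */
            flip_barricaded_flags(ctx);  /* write-buffer cmd, cf. 3.2.3.4 */
            enqueue_shadow_kernel_launch(ctx);       /* resume */
        }
    }

    static void enqueue_shadow_kernel_launch(struct shadow_ctx *ctx)
    {
        cl_event k_done, read_done;
        clEnqueueNDRangeKernel(ctx->shadow_queue, ctx->shadow_kernel, 1, NULL,
                               &ctx->global_size, &ctx->local_size,
                               0, NULL, &k_done);
        clEnqueueReadBuffer(ctx->shadow_queue, ctx->progress_buf, CL_FALSE, 0,
                            ctx->num_items * sizeof(cl_int), ctx->progress,
                            1, &k_done, &read_done);
        clSetEventCallback(read_done, CL_COMPLETE, on_progress_read, ctx);
    }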


3.4.5. Coordination with the daemon

The preemption-enabling transformation improves the scheduling granularity of kernels. To make it effective, however, low-priority processes must be aware of the existence of high-priority tasks and make way for them; otherwise, low-priority processes would still compete against high-priority processes. High-priority processes, in turn, need to express their resource demands with a bitmap, as described in Section 3.3. The CLPKM runtime library automatically spawns a worker thread in the background that coordinates with the daemon to maintain the global bitmap. When a high-priority process is about to enqueue a new task, the CLPKM runtime informs the worker to announce that a high-priority task is about to start; low-priority processes, with the help of the CLPKM runtime, stop enqueuing tasks that conflict with the global bitmap until the conflict is resolved.
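The bitmap encoding is not specified in the paper; assuming one bit per device resource, the conflict test a low-priority process performs before enqueuing a task reduces to a bitwise intersection:

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if the task needs a resource that the global bitmap
     * marks as claimed by high-priority work (encoding is illustrative). */
    static bool clpkm_has_conflict(uint64_t task_bitmap, uint64_t global_bitmap)
    {
        return (task_bitmap & global_bitmap) != 0;
    }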

3.5. Limitations

The CLPKM framework can yield unexpected results if atomic operations, instead of barriers, are used as a synchronization primitive within a kernel function of a low-priority process. This is because the kernel execution instance after transformation is not aware of the absence of work-items that have been checkpointed. A solution similar to the one discussed in Section 3.2.3.4 could be applied to make sure all work-items are present before proceeding to access the data manipulated by the atomic operations. However, such data are inherently accessible only through pointers according to the OpenCL specification, and pointer aliasing makes it unclear which memory accesses need to be considered; a conservative approach would likely lead to a severe slowdown. Hence, we leave this as an avenue to be explored thoroughly in future work.

4. Evaluation

4.1. Experimental setup

All of the evaluations were conducted on a computer with two Intel Xeon E5-2620 CPUs (each containing six cores operating at 2.0 GHz), 64 GB of RAM, and one NVIDIA Tesla K20c (operating at 706 MHz), running the Linux 4.14 kernel and an NVIDIA 387.34 GPU driver. The CLPKM transcompiler was implemented based on Clang 5.0.1 and built by the same version of Clang, whereas the CLPKM runtime library and the CLPKM daemon were built by GCC 7.2.1 20171128. The benchmarks used to evaluate our proposed CLPKM framework were selected from the Parboil benchmark suite [20] version 2.5, which provides a self-validation mechanism. Some benchmarks were not included because they contain no loops (lbm and stencil) or fall outside the limitations of the CLPKM framework (histo, mri-gridding, and tpacf). The cost of a loop iteration for updating the checkpoint timer was measured with the PTX clock register (%clock); hence, the minimum checkpoint interval (i.e., the minimum time slice allotted to a kernel) was also expressed in clock cycles.

4.2. Overhead

4.2.1. Execution time

Fig. 7 presents the normalized execution times for six Parboil benchmarks (with different datasets) individually running with the CLPKM framework, compared to those without CLPKM. Each benchmark was individually treated as a high-priority (HP) or low-priority (LP) process in order to identify the overhead introduced by CLPKM. When a benchmark ran as a low-priority process, in which periodic checkpointing was enabled for each OpenCL task, five different configurations were used to control the minimum checkpoint interval: LP-10K, LP-100K, LP-1000K, LP-10000K, and LP-100000K, in which the interval was set to "10 K" (i.e., 10 × 1024), "100 K", "1,000 K", "10,000 K", and "100,000 K" clock cycles, respectively. In addition to these five configurations, a special configuration that uses a code caching mechanism to eliminate the time overhead of the preemption-enabling transformation, assuming the transformed code has been cached, was adopted so as to outline how much the transformation contributed to the execution time of a benchmark. Since the transformation time for a specific kernel function is constant and independent of the minimum checkpoint interval, the results with the code caching mechanism enabled are displayed only for the configuration with the largest minimum checkpoint interval; this configuration is denoted as "LP-100000K (cc)".

The execution times of high-priority processes running with the CLPKM framework were almost the same as those running with the native (or vendor's) OpenCL runtime library, with a geometric mean of 1.03; this is reasonable because when the CLPKM runtime library receives a task request from a high-priority process, it merely notifies the worker to inform the CLPKM daemon of the demand of the high-priority task and immediately delegates to the vendor's OpenCL runtime library. By contrast, the geometric mean of the execution times of low-priority processes running with the CLPKM framework, over the different checkpoint intervals, ranged from 3.02 to 6.08 when the code caching mechanism was not enabled. We examined all the execution logs and believe that the introduced overhead mainly came from (1) the transformation time spent by the CLPKM transcompiler, (2) the negative impact of missing some opportunities for the OpenCL kernel compiler to optimize a transformed kernel, and, of course, (3) the periodic checkpointing mechanism itself.

For small benchmarks (e.g., bfs_(NY), bfs_(UT), sad_(default), and sgemm_(small)), the transformation time of the kernel function within a benchmark was about as long as the execution time of the unmodified kernel function. Therefore, we deployed a code caching mechanism that stores the transformed code, thereby eliminating the transformation time whenever a previously transformed kernel function exists. With the "LP-100000K (cc)" configuration, the normalized execution times were significantly reduced and ranged from 1.00 to 6.16, with a geometric mean of 1.35, across the benchmarks and datasets. On the other hand, since it is difficult to quantify the impact of missing optimization opportunities, we analyzed the PTX code generated by the OpenCL kernel compiler for both the original and transformed versions of the kernels within the benchmarks and found some evidence for our earlier claim. Take sgemm, which contains only a single kernel function consisting almost entirely of a simple for loop, as an example: the OpenCL kernel compiler generated efficient PTX code for the kernel function by unrolling the loop up to 16 times, renaming registers to avoid data hazards, eliminating some conditional branches to mitigate the impact of control hazards, and scheduling instructions to alleviate structural hazards. However, the preemption-enabling transformation disturbed the control flow of the kernel function by allowing the function to return at the end of any iteration of the loop; hence the OpenCL compiler could no longer optimize the kernel function as easily. A possible solution to this problem is to apply a source-level optimizer or to perform the transformation in the post-optimization stage (e.g., prior to the code generation phase of the OpenCL compiler).

Another source of overhead, which did not apply to all the scenarios in the experiments conducted, was the extra operations performed by the checkpointing mechanism, such as launching multiple kernel execution instances for an originally single kernel execution instance (which introduced additional kernel-launching overhead) and restoring/saving execution states for almost every checkpointed kernel execution instance. Table 1 lists the numbers of kernel execution instances created during execution under the different checkpointing configurations, to further identify how the checkpointing mechanism influenced the benchmarks. For some benchmarks and datasets (including bfs_(NY/UT/1M), sgemm_(small), and spmv_(small/medium/large)), checkpoints occurred because of the allotted time slice expiring (indicated with a "c" inside parentheses following the number of instances) only when the minimum checkpoint interval was set to 10 K cycles, which is about 14 μs (at 706 MHz) and considered a small (but aggressive) checkpoint interval. For some other benchmarks and datasets (e.g., mri-q_(small/large) and sad_(default/large)), the checkpoint mechanism took effect when the minimum checkpoint interval was less than or equal to 100 K cycles. Furthermore, as the checkpoint interval was decreased, a sudden increase in the number of kernel execution instances was observed for all scenarios once checkpoints had been triggered, and another sudden, significant increase could be seen for some benchmarks (e.g., mri-q) once checkpoints occurred in most loop iterations within the kernel function. With the LP-10K configuration, all the benchmarks were checkpointed, and the slowdown ranged from 2.23x (for bfs_(SF)) to 199.98x (for cutcp_(large)), which represented the worst-case scenario among all benchmarks, datasets, and configurations. cutcp suffered a slowdown of 5–9x even with the small dataset and the largest checkpoint interval; this is attributable to the frequent and numerous regular checkpoints in the worst-case scenario, as well as to barricaded checkpoints taking place under all configurations.


Fig. 7. Normalized execution times for six Parboil benchmarks (with different datasets) running with the CLPKM framework, compared to those without CLPKM. HP, high-priority; LP-x, low-priority with the minimum checkpoint interval of x cycles; cc, code caching enabled.

Table 1
Numbers of kernel execution instances created during execution when using different checkpointing configurations; the letters within parentheses refer to the type(s) of checkpoints occurring during execution: 'c' for regular checkpoints due to time slice expiring and 'b' for barricaded checkpoints.

Benchmark (dataset) | Native OpenCL | CLPKM-100000K | CLPKM-10000K | CLPKM-1000K  | CLPKM-100K    | CLPKM-10K
bfs (NY)            | 632           | 632           | 632          | 632          | 632           | 4673 (c)
bfs (UT)            | 817           | 817           | 817          | 817          | 817           | 5906 (c)
bfs (1M)            | 1999          | 1999          | 1999         | 1999         | 1999          | 17,969 (c)
bfs (SF)            | 1003          | 1003          | 1003         | 1003         | 1011 (c)      | 9645 (c)
cutcp (small)       | 11            | 253 (b)       | 253 (b)      | 457 (b, c)   | 22,248 (b, c) | 22,272 (b, c)
cutcp (large)       | 26            | 598 (b)       | 598 (b)      | 1180 (b, c)  | 60,577 (b, c) | 60,671 (b, c)
mri-q (small)       | 4             | 4             | 4            | 19 (c)       | 283 (c)       | 15,382 (c)
mri-q (large)       | 3             | 3             | 3            | 15 (c)       | 571 (c)       | 10,259 (c)
sad (default)       | 3             | 3             | 3            | 3            | 5 (c)         | 382 (c)
sad (large)         | 3             | 3             | 3            | 3            | 6 (c)         | 405 (c)
sgemm (small)       | 1             | 1             | 1            | 1            | 1             | 65 (c)
sgemm (medium)      | 1             | 1             | 1            | 1            | 8 (c)         | 989 (c)
spmv (small)        | 50            | 50            | 50           | 50           | 50            | 563 (c)
spmv (medium)       | 50            | 50            | 50           | 50           | 50            | 138 (c)
spmv (large)        | 50            | 50            | 50           | 50           | 50            | 2256 (c)

Table 2
Sizes of extra global-memory buffers allocated by the CLPKM framework.

Benchmark (dataset) | MiB
bfs (NY)            | 0.14
bfs (UT)            | 0.14
bfs (1M)            | 0.11
bfs (SF)            | 23.26
cutcp (small)       | 7.57
cutcp (large)       | 42.29
mri-q (small)       | 1.13
mri-q (large)       | 9
sad (default)       | 5.9
sad (large)         | 478.95
sgemm (small)       | 1.33
sgemm (medium)      | 70.13
spmv (small)        | 0.07
spmv (medium)       | 0.78
spmv (large)        | 9.53


4.2.2. Memory usage

As mentioned in Section 3.2.3, a low-priority process requires additional global-memory buffers to record its execution states, such as its execution progress and live values. The size of these additionally required buffers varies across kernel functions. Table 2 summarizes the maximum size required among the kernel functions of each benchmark. It is noticeable that sad_(large) required significantly more memory in terms of the extra buffers than the others. This is because the benchmark launched a kernel execution instance with a large NDRange containing approximately eight million work-items, each of which requires only about 60 bytes to store its execution states (roughly 8 M × 60 bytes ≈ 480 MiB, consistent with Table 2). We plan to apply a workload partitioning mechanism along with the proposed software checkpointing mechanism in the future in order to reduce the size of the extra buffers required.

4.3. Effectiveness

We evaluated the effectiveness of the CLPKM framework by measuring the total execution time of the six benchmarks individually and consecutively running as high-priority processes, with all datasets and all configurations, while different numbers of low-priority processes (randomly selected from the same set of benchmarks, with the largest dataset and the same configuration) ran in the background. We disabled the validation function of the benchmarks by appending the --no-check option when running them, in order to keep the GPU as busy as possible. We also started the NVIDIA Persistence Daemon [21] before the evaluation so as to eliminate the cost of GPU reinitialization.

Fig. 8 displays the normalized total execution times of the high-priority processes with different numbers of low-priority processes running in the background.


Fig. 8. Normalized execution times of high-priority processes with different numbers of low-priority processes running in the background.

The high-priority processes running with the vanilla OpenCL stack (or the native OpenCL runtime library) experienced a slowdown of up to 4.66x when 16 low-priority processes were competing against them. When running with the CLPKM framework, which made the GPU partly exclusive to high-priority processes, the slowdown was reduced to 2.23x under the heaviest workload condition when using the largest (or loosest) checkpoint interval (100,000 K cycles), which was expected to rarely trigger checkpoints. The preemption granularity was further improved by using the smallest (tightest) checkpoint interval, leading to a reduced slowdown of 1.52x.

5. Conclusions and future work

We have presented a pure software framework, called CLPKM, that enables preemptive multitasking of OpenCL kernel execution without any modification to OpenCL applications. The CLPKM framework divides a kernel execution instance of a low-priority process into multiple sub-instances, each of which ends when a checkpoint is taken (i.e., when the time slice given to the kernel execution instance expires), and allows the kernel execution instance to yield the OpenCL device it runs on to a kernel execution instance of a high-priority process during the transition from one sub-instance to another. This is done by instrumenting checkpoints into the kernel function with the CLPKM transcompiler, manipulating the checkpoints with the CLPKM runtime library, and managing the priorities of kernel execution instances with the CLPKM daemon. The experimental results show that our CLPKM framework reduced the slowdown of high-priority processes under heavy workloads from 4.66x to 1.52–2.23x; that is, it provided a speedup of 2.09–3.07x. In contrast, low-priority processes suffered an average slowdown of between 3.02x and 6.08x when using checkpoint intervals between the largest and smallest predefined values. We also adopted a code caching mechanism to further reduce the slowdown of low-priority processes from 3.02x to 1.35x. We believe the proposed framework is portable (requiring no hardware support) and capable of interrupting an ill-designed kernel that contains an infinite loop, the latter of which is probably a common use case on platforms for training or development of OpenCL programming. With support for preemptive multitasking of OpenCL kernel execution, there will be opportunities for further study on scheduling techniques (such as process migration and scheduling with precedence graphs) for enhancing the performance of OpenCL devices. We consider these interesting issues and leave them for future study.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study was partially supported by the Ministry of Science and Technology, Taiwan, under grant MOST 107-2221-E-009-011-MY3.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.sysarc.2019.06.008.

References

[1] M.-T. Chiu, Y.-P. You, Enabling OpenCL preemptive multitasking using software checkpointing, in: Proceedings of the 47th International Conference on Parallel Processing Companion (ICPP-EMS '18), ACM, New York, NY, USA, 2018, pp. 15:1–15:7.
[2] NVIDIA, NVIDIA Tesla P100 Whitepaper, 2016. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf.
[3] J. Nickolls, I. Buck, M. Garland, K. Skadron, Scalable parallel programming with CUDA, ACM Queue 6 (2008) 40–53.
[4] Khronos OpenCL Working Group, The OpenCL Specification, version 2.2, 2017.
[5] W. Han, H.S. Chwa, H. Bae, H. Kim, I. Shin, GPU-SAM: leveraging multi-GPU split-and-merge execution for system-wide real-time support, Journal of Systems and Software 117 (2016) 1–14.
[6] Y.-P. You, H.-J. Wu, Y.-N. Tsai, Y.-T. Chao, VirtCL: a framework for OpenCL device abstraction and management, in: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '15), ACM, New York, NY, USA, 2015, pp. 161–172.
[7] J.J.K. Park, Y. Park, S. Mahlke, Chimera: collaborative preemption for multitasking on a shared GPU, in: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15), ACM, New York, NY, USA, 2015, pp. 593–606.
[8] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, M. Valero, Enabling preemptive multiprogramming on GPUs, in: Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14), IEEE Press, 2014, pp. 193–204.
[9] G. Chen, Y. Zhao, X. Shen, H. Zhou, EffiSha: a software framework for enabling efficient preemptive scheduling of GPU, in: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17), ACM, 2017, pp. 3–16.
[10] S. Kato, K. Lakshmanan, A. Kumar, M. Kelkar, RGEM: a responsive GPGPU execution model for runtime engines, in: Proceedings of the 2011 IEEE 32nd Real-Time Systems Symposium (RTSS '11), IEEE, 2011, pp. 57–66.
[11] C. Basaran, K.-D. Kang, Supporting preemptive task executions and memory copies in GPGPUs, in: Proceedings of the 2012 24th Euromicro Conference on Real-Time Systems (ECRTS '12), IEEE, 2012, pp. 287–296.
[12] C. Du, X.-H. Sun, M. Wu, Dynamic scheduling with process migration, in: Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGRID '07), IEEE Computer Society, Washington, DC, USA, 2007, pp. 92–99, doi:10.1109/CCGRID.2007.46.
[13] N.-S. Pham, Y. Kim, K.-H. Baek, C.-G. Lee, Reduction of task migrations and preemptions in optimal real-time scheduling for multiprocessors by using dynamic T-L plane, Journal of Systems Architecture 79 (2017) 19–30.
[14] S.-H. Chien, Y.-M. Chang, C.-C. Yang, Y.-S. Hwang, J.-K. Lee, Graph support and scheduling for OpenCL on heterogeneous multi-core systems, in: Proceedings of the 47th International Conference on Parallel Processing Companion (ICPP-EMS '18), ACM, New York, NY, USA, 2018, pp. 14:1–14:7, doi:10.1145/3229710.3229724.
[15] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Trans. Parallel Distrib. Syst. 13 (3) (2002) 260–274, doi:10.1109/71.993206.
[16] H. Topcuoglu, S. Hariri, M.-Y. Wu, Task scheduling algorithms for heterogeneous processors, in: Proceedings of the Eighth Heterogeneous Computing Workshop (HCW '99), IEEE Computer Society, Washington, DC, USA, 1999, p. 3.
[17] Y.C. Lee, A.Y. Zomaya, Minimizing energy consumption for precedence-constrained applications using dynamic voltage scaling, in: Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID '09), IEEE Computer Society, Washington, DC, USA, 2009, pp. 92–99, doi:10.1109/CCGRID.2009.16.
[18] C. Lattner, et al., "clang" C Language Family Frontend for LLVM. https://clang.llvm.org/.
[19] L. Poettering, et al., systemd—System and Service Manager. https://www.freedesktop.org/wiki/Software/systemd/.
[20] J.A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G.D. Liu, W.-m. W. Hwu, Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing, Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, 2012.
[21] NVIDIA, Driver Persistence: Persistence Daemon, 2017. https://docs.nvidia.com/pdf/Driver_Persistence.pdf.

Ming-Tsung Chiu received his M.S. degree in computer science from National Chiao Tung University, Taiwan in January 2018, where he also received his B.S. (2013) in computer science. He joined Andes Technology Corporation as a technical engineer in 2018.

Yi-Ping You received his Ph.D. in computer science from National Tsing Hua University, Taiwan in February 2007, where he also received an M.S. (2002) in computer science. He received his B.S. in computer science and information engineering from National Chi Nan University in 2000. He was a research scientist at the IC Design Technology Center, National Tsing Hua University from March to July 2007. He joined the Department of Computer Science, National Chiao Tung University in 2008. His current research interests include optimizing compilers, runtime systems, and GPGPU techniques.