Taking Lessons Learned from a Proxy Application to a Full Application for SNAP and PARTISN


International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland

Geoff Womeldorff*, Joshua Payne, and Ben Bergen

Los Alamos National Laboratory

* This work is designated as LA-UR-16-29442.

Abstract

SNAP is a proxy application which simulates the computational motion of a neutral particle transport code, PARTISN. In this work, we have adapted parts of SNAP separately; we have re-implemented the iterative shell of SNAP in the task-model runtime Legion, showing an improvement to the original schedule, and we have created multiple Kokkos implementations of the computational kernel of SNAP, displaying similar performance to the native Fortran. We then translate our Kokkos experiments in SNAP to PARTISN, necessitating engineering development, regression testing, and further thought.

1 Introduction

One of the paramount tasks in the discipline of scientific computing is in adapting computer codes to new and future computer architectures. This is important in and of itself, but especially in the light of the rapidly changing aspects of modern architectures, such as increasingly deep memory hierarchies and exploding core counts, from multi-core to many-core. It is the expectation of many practitioners of scientific computing that, due to hardcoded algorithmic choices and optimizations, computer codes that run performantly on current high-performance systems will not do so on future ones. Thus, if we are to protect the investment that we have in the physics codes we use on a daily basis, we must assess how to move them forward into the future.

Of the many challenges in modernizing code, one of the greatest is that the platforms we wish to run code performantly on do not exist yet. Thus we turn to research products and libraries that abstract away some of the architecture-specific decisions without compromising performance, or, perhaps, offering a performance penalty for the abstraction that we find acceptable. An additional difficulty is that the computer codes we are truly interested in optimizing are of a size and age such that a refactoring is not a task taken without a clear direction.


In order to address this problem, we produce less complex programs which reproduce the memory and computational patterns of the programs which spawned them, and we denote these proxy applications. By starting our modernization experiments with proxy apps, we are able to prototype various designs much more rapidly and test them. Specifically, the path we took to meet this challenge was two-fold, using Kokkos, an execution model and data management library, and Legion, an asynchronous task-based runtime system.

We chose to focus on two aspects of SNAP [4], Los Alamos National Laboratory's neutral particle transport proxy application: on-node performance and cross-node performance. For the cross-node performance aspect of our study, we re-implemented SNAP in its entirety in Legion, using an abstraction layer, Dragon, that provides a simplified interface to much of the set-up code involved in writing in plain Legion (as opposed to the more recent DSL, Regent, from the Legion developers). When we examined on-node performance, we chose to focus solely on the most representative computational kernel in SNAP, dim3_sweep, and to reimplement it in Kokkos. Since productivity was one of our study's metrics, we made a series of Kokkos implementations, to mimic the path a user or group of users might take in modernizing a code using Kokkos.

We felt encouraged by our success with SNAP, and so turned our Kokkos-themed on-node efforts towards its parent code PARTISN [1]. As PARTISN is a mature code implemented in Fortran, this necessitated language interoperability efforts, as well as ensuring that we maintained the numerical behaviour of the code through our changes. These interoperability experiments were quite successful; however, in terms of achieving a performance gain with a kernel executed on a GPU, we were limited by the expression of concurrency in PARTISN, primarily the interaction between shared and distributed memory parallelism.

The rest of this work is organized as follows: we discuss a small amount of background information on SNAP, Legion, and Kokkos in Sections 2, 3, and 4. We then discuss our experiments and their results, for our Legion implementation in Section 5, and for our Kokkos versions in Section 6. Finally, we discuss our efforts in PARTISN in Section 7, and conclusions in Section 8.

2 SNAP Proxy Application

SNAP serves as a proxy application to model the performance of a modern discrete ordinates neutral particle transport application. SNAP may be considered an update to Sweep3D, intended for hybrid computing architectures. It is modeled on the Los Alamos National Laboratory code PARTISN, which solves the linear Boltzmann transport equation (TE), a governing equation for determining the number of neutral particles (e.g., neutrons and gamma rays) in a multi-dimensional phase space. SNAP itself is not a particle transport application; SNAP incorporates no actual physics in its available data, nor does it use numerical operators specifically designed for particle transport. Rather, SNAP mimics the computational workload, memory requirements, and communication patterns of PARTISN. The equation it solves has been composed to use the same number of operations, use the same data layout, and load elements of the arrays in approximately the same order. Although the equation SNAP solves looks similar to the TE, it has no real world relevance.

3 Legion Background

Legion [2] is a data-centric programming model for writing high-performance applications for distributed heterogeneous architectures. Making the programming system aware of the structure of program data gives Legion programs three advantages. The first is that the user can specify the data properties using abstractions which allow for organization, partitioning, privileges, and coherence. Unlike current programming systems in which these properties are implicitly managed by programmers, Legion makes them explicit and provides the implementation for programmers. Second, current programming models require developers to explicitly specify parallelism and issue data movement operations; both responsibilities can easily lead to the introduction of bugs in complex applications. By understanding the structure of program data and how it is used, Legion can implicitly extract parallelism and issue the necessary data movement operations in accordance with the application-specified data properties, thereby removing a significant burden from the programmer. Finally, by providing abstractions for representing both tasks and data, Legion makes it easy to describe how to map applications onto different architectures. Legion provides a mapping interface which gives programmers direct control over all the details of how an application is mapped and executed. Furthermore, Legion's understanding of program data makes the mapping process orthogonal to correctness. This simplifies program performance tuning and enables easy porting of applications to new architectures.

3.1 Legion Abstractions

There are three important abstractions in the Legion programming model.

3.1.1 Logical Regions

Logical regions are the fundamental abstraction used for describing program data in Legion applications. Logical regions support a relational model for data. Each logical region is described by an index space of rows (either unstructured pointers or structured 1D, 2D, or 3D arrays) and a field space of columns. Unlike other relational models, Legion supports a different class of operations on logical regions: logical regions can be arbitrarily partitioned into sub-regions based on index space or sliced on their field space. Data structures can be encoded in logical regions to express locality with partitioning and slicing describing data independence.
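As a concrete illustration, the sketch below creates a small logical region using the publicly documented Legion C++ runtime API; it is not code from this work or its Dragon layer, and the field ids and extent are illustrative.

```cpp
#include "legion.h"
using namespace Legion;

enum FieldIDs { FID_FLUX = 0, FID_SOURCE = 1 };

// Build a logical region: an index space of rows crossed with a field space
// of columns. No memory is named here; placement is decided later by mappers.
LogicalRegion make_region(Context ctx, Runtime *runtime, int ncells) {
  // Rows: a structured 1D index space of ncells entries.
  IndexSpace is = runtime->create_index_space(ctx, Rect<1>(0, ncells - 1));
  // Columns: two double-precision fields.
  FieldSpace fs = runtime->create_field_space(ctx);
  {
    FieldAllocator alloc = runtime->create_field_allocator(ctx, fs);
    alloc.allocate_field(sizeof(double), FID_FLUX);
    alloc.allocate_field(sizeof(double), FID_SOURCE);
  }
  return runtime->create_logical_region(ctx, is, fs);
}
```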

3.1.2 Tree of Tasks using Regions

Every Legion program executes as a tree of tasks with a top-level task spawning sub-tasks which can recursively spawn further sub-tasks. All tasks in Legion must specify the logical regions they will access as well as the privileges and coherence for each logical region. Legion enforces a functional requirement on privileges which enables a hierarchical and distributed scheduling algorithm that is essential for scalability.
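A minimal sketch of such a sub-task launch, again using the public Legion C++ API rather than this work's code; the task id and field id are illustrative.

```cpp
#include "legion.h"
using namespace Legion;

enum TaskIDs  { SWEEP_TASK_ID = 1 };
enum FieldIDs { FID_FLUX = 0 };

// Launch a sub-task that declares, up front, which logical region it will
// access and with what privilege and coherence; Legion uses this declaration
// to schedule the task and issue any required data movement.
void launch_sweep_task(Context ctx, Runtime *runtime, LogicalRegion lr) {
  TaskLauncher launcher(SWEEP_TASK_ID, TaskArgument(nullptr, 0));
  launcher.add_region_requirement(
      RegionRequirement(lr, READ_WRITE, EXCLUSIVE, lr));
  launcher.add_field(0, FID_FLUX);  // field used by the first region requirement
  runtime->execute_task(ctx, launcher);
}
```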

3.1.3 Mapping Interface

Legion makes no implicit decisions concerning how applications are mapped onto target hardware. Instead, mapping decisions regarding how tasks are assigned to processors and how physical instances of logical regions are assigned to memories are made entirely by mappers. Mappers are part of application code and implement a mapping interface. Mappers are queried by the Legion runtime whenever any mapping decision needs to be made. Legion guarantees that mapping decisions only impact performance and are orthogonal to correctness, which simplifies tuning of Legion applications and enables easy porting to different architectures.


4 Kokkos Background

Kokkos implements a programming model in C++ for writing performance-portable applications targeting all major HPC platforms. For that purpose it provides abstractions for both parallel execution of code (e.g., parallel_for) and data management (the Kokkos View). Kokkos is designed to target complex node architectures with N-level memory hierarchies and multiple types of execution resources. For parallel execution, it allows for use of OpenMP, Pthreads, CUDA, and other threading APIs as backend programming models, while for architecture-specific optimizations, including but not limited to memory layout and vector instructions, it supports x86 Haswell Xeons, Xeon Phi (KNC and KNL), Power8, and other emerging architectures. (At the time of writing, the supported architectures were, for Intel, Knight's Corner, Knight's Landing, Sandy Bridge, Haswell, Broadwell, and Skylake; for Power and OpenPower, Power7, Power8, and Power9; for NVIDIA, Kepler (3.0, 3.2, 3.5, and 3.7) and Maxwell (5.0 and 5.3); and for ARM, ARM v8.0 and ARM v8.1.)
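A minimal, self-contained example of these two abstractions (not taken from SNAP): a Kokkos View holds the data, and parallel_for dispatches the same loop body to whichever backend (OpenMP, Pthreads, CUDA, ...) Kokkos was configured with.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Data management: 1D views allocated in the default memory space.
    Kokkos::View<double*> x("x", n), y("y", n);

    // Parallel execution: the loop body is portable across backends.
    Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) += 2.0 * x(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```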

5 Legion Investigation

5.1 SNAP - Legion

The Legion implementation of SNAP for this work is designed to test Legion developer productivity, provide a target example for our Legion abstraction layer, and test ideas for system-level portability and performance improvements over MPI. We implemented approximately 90% of the original Fortran version of SNAP in C++ using the Dragon abstraction layer (an abstraction layer that eases the implementation of Legion applications, developed at LANL by Joshua Payne). Since Legion is a task-graph model, we needed to choose implementation strategies for data layout, task-execution layout, and which of the original Fortran subroutines to wrap in tasks.

One major goal here was to implement the KBA-sweep using only data dependencies between decomposed domains. By correctly identifying the data requirements of each task performing the KBA-sweep, the runtime can build an execution graph which not only correctly performs a complicated sweep, but also finds locations where sweeps in multiple directions can be performed simultaneously and places where non-related tasks can be executed while parts of the sweep are still executing.

For this implementation we chose to create a different logical region for each “chunk” of the decomposed domain. In terms of SNAP's original MPI spatial decomposition, this corresponds to a different logical region for each rank. It would be possible to create a single logical region for the entire domain, and then partition that region into subregions for each rank, but at the time of this work this was not feasible due to constraints on index space size being limited to 32-bit pointers. In the following example we create a logical region that contains the regions for each sub-domain. We then loop over all of the sub-domains and create a logical region for that sub-domain.

The second design choice we made for this implementation is how to break the problem down into asynchronous tasks. For the inner and outer source calculations, the convergence tests, and the tasks to save the fluxes from the last iteration, this was fairly straightforward. These functions were broken up into a single task per sub-domain, although they could be further divided up into one task per sub-domain per group. However, at that point the workload becomes too fine and the overhead of the task launch outweighs the cost of the task. For the KBA-sweep, there is one task per sub-domain per octant, or nCx * nCy * nCz * noct tasks.
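Since the original Dragon-layer listing is not reproduced here, the following sketch shows the same idea in plain Legion: loop over the decomposed sub-domains and create one logical region per chunk (field ids and sizes are illustrative).

```cpp
#include <vector>
#include "legion.h"
using namespace Legion;

enum FieldIDs { FID_FLUX = 0 };

// Create one logical region per decomposed sub-domain ("chunk"), mirroring
// SNAP's per-rank MPI spatial decomposition.
std::vector<LogicalRegion>
make_chunk_regions(Context ctx, Runtime *runtime,
                   int nCx, int nCy, int nCz, int cells_per_chunk) {
  FieldSpace fs = runtime->create_field_space(ctx);
  {
    FieldAllocator alloc = runtime->create_field_allocator(ctx, fs);
    alloc.allocate_field(sizeof(double), FID_FLUX);
  }
  std::vector<LogicalRegion> chunks;
  // Loop over all sub-domains and create a logical region for each one.
  for (int c = 0; c < nCx * nCy * nCz; ++c) {
    IndexSpace is =
        runtime->create_index_space(ctx, Rect<1>(0, cells_per_chunk - 1));
    chunks.push_back(runtime->create_logical_region(ctx, is, fs));
  }
  return chunks;
}
```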


For example, in a run with a domain decomposed into 8x8x8 chunks, the total number of tasks is 7168.

5.2 Performance

On first inspection the performance of the Legion version of SNAP leaves something to be desired: the Legion implementation is at best 8x slower than the Fortran version and 50x slower at worst. However, there are several very important things that should be understood before dismissing the Legion runtime. First, this comparison is between a first-pass Legion implementation and a highly optimized MPI-OpenMP Fortran version. Second, we were experiencing significant issues with the default mapper, and thus were not able to run multi-node tests. Third, Legion developers have not yet implemented an OpenMP-style “processor”, and thus we were unable to test fine-grained parallelism; we are not engaging as many cores as we could be if we had a fine-grain parallel model embedded within the task. Fourth, at the time of our experiments, Legion's default mapper performed very poorly with respect to memory management, and it is likely that we are seeing many more memory copies than are strictly necessary.

We do not believe that any of the above performance comments are the result of design decisions in Legion's model, i.e., they can be solved with additional development time for Legion's implementation, but they do reinforce the concerns that we are relaying to the Legion developers. In order of importance to us, we desire that Legion have a better default mapper, support for fine-grained parallel models within Legion tasks, and support for MPI-Legion interoperability.

5.3 Portability

One of the major selling points of Legion is portability of a “correct” algorithm. In essence, the application developer does not need to be aware of which node, NUMA domain, GPU, or, generically, execution space the data might live on, because this information is abstracted away by the runtime. Again, the default mapper currently hinders work in this area, but with some minor changes we were able to simultaneously generate both GPU and CPU versions of simple kernels using the Dragon abstraction layer. The design of Dragon is such that it will enable us to generate kernels able to take advantage of multi-core, multi-GPU, and multi-node systems.

5.4 Productivity

Productivity is difficult to quantify when comparing two implementations that differ by both language and runtime. Implementing SNAP in base Legion would be onerous, so we chose to first develop an abstraction layer, Dragon, to hide much of the boilerplate code required by the Legion runtime. Using this abstraction layer we were able to produce a rewritten version of SNAP that required approximately 2/3 of the lines of the original Fortran code. There are, however, some issues, mainly associated with the difficulty in debugging Legion applications and the default mapper. It can sometimes be a complex task to determine the cause of a runtime error. We have experienced many errors caused by bugs in the default mapper, and a few related to garbage collection. Tracking down the precise cause of errors can be both extremely difficult and frustrating. However, we wish to take our travails and transmute them into feedback for the Legion developers, to reduce the incidence of frustrations in the future.


6 Kokkos Investigation

6.1 SNAP - Kokkos

The Kokkos [3] implementation of SNAP for this work is designed to test ideas for node-level portability and performance improvements over more direct, hard-coded implementation strategies. Starting from the original Fortran version of SNAP with hand-tuned OpenMP, we began by isolating the core KBA sweep kernel, dim3_sweep, and removing the OpenMP directives from it. Using this kernelized version as a basis, we then implemented two different versions of the sweep algorithm in C++: (1) a direct version that simply wraps each for-loop with Kokkos parallel invocation calls, and (2) a more sophisticated, hierarchically parallel version that seeks to expose the maximum parallelism available from this numerical method. With expert help from the Kokkos developers, a third version was developed: a modification of the hierarchical parallelism version with performance enhancements. These implementations are discussed in more detail in the following sections.

6.1.1 Direct Implementation

In the direct implementation we simulated the effects of giving an application kernel to an application programmer who would not have foreknowledge of fine-grained parallel execution or data layout abstraction APIs. We assumed that they would proceed, more or less, line by line and transform the implicit parallel loops of the kernelized Fortran version by writing a functor for each class of implicit array algebra. That is to say, we assumed enough general computing knowledge to facilitate learning the basics of Kokkos (how to allocate data in a Kokkos View, how to invoke a parallel region using a functor (including the Kokkos specific aspects thereof), how to generalize and classify array updates), but not enough sophistication to use all of the advanced features of Kokkos. We then designed the generic functors to fit the roles of the array algebra in the kernelized Fortran version and replicated it using these functors. It should be noted that this type of implementation does what is colloquially known as “parallelizing the inner loop” and is known to perform poorly due to thread contention. However, given the assumptions we started with, this is the type of implementation that would result. Moreover, we are given the ability to measure the difference in performance, relatively and absolutely, as compared to later versions.
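As an illustration of this direct, loop-by-loop style, the sketch below wraps one such implicit array update in a generic Kokkos functor; the variable names and the specific update are illustrative, not taken from dim3_sweep.

```cpp
#include <Kokkos_Core.hpp>

// A generic "array update" functor of the kind described above: it replaces
// one implicit Fortran array operation such as psi(:) = qm(:) + pc*mu(:).
struct AxpyUpdate {
  Kokkos::View<double*> psi;
  Kokkos::View<const double*> qm, mu;
  double pc;
  KOKKOS_INLINE_FUNCTION
  void operator()(const int i) const { psi(i) = qm(i) + pc * mu(i); }
};

void direct_update(Kokkos::View<double*> psi,
                   Kokkos::View<const double*> qm,
                   Kokkos::View<const double*> mu, double pc) {
  // "Parallelizing the inner loop": each small array operation becomes its
  // own parallel dispatch, which is simple but invites thread contention.
  Kokkos::parallel_for("psi_update", psi.extent(0),
                       AxpyUpdate{psi, qm, mu, pc});
}
```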

6.1.2 Hierarchical Parallelism Implementation

The hierarchical parallelism version (hereafter, HP) was designed to use the set of features that would give the basis of performance using Kokkos. HP in Kokkos allows for three layers of parallelism, to wit: the outermost layer is threads (typically, real cores), next is threads in a thread group (e.g., hyperthreads), and then innermost is vectorized instructions. We identified the appropriate places in SNAP where both parallel performance and correctness could be achieved by mapping SNAP to three-level parallelism. Energy groups were mapped to threads, diagonals to thread groups, and (what was implicit in Fortran) array arithmetic to vector regions. The intention was that this version, HP, should correctly use Kokkos, and allow mapping of parallelism to multiple architectures with a single codebase, but leave on the table some room for Kokkos-specific, but still portable, optimizations.
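A hedged sketch of the three-level mapping just described, using Kokkos team policies: energy groups at the outer (league) level, cells along a sweep diagonal on threads within a team, and the innermost arithmetic on vector lanes. The view shapes and names are illustrative, not SNAP's actual dim3_sweep data.

```cpp
#include <Kokkos_Core.hpp>

using team_policy = Kokkos::TeamPolicy<>;
using member_type = team_policy::member_type;

// One sweep diagonal: groups -> league, diagonal cells -> team threads,
// angles -> vector lanes. The update itself is a stand-in for the real one.
void sweep_diagonal(Kokkos::View<double***> psi,        // (group, cell, angle)
                    Kokkos::View<const double***> q,
                    int ngroups, int ncells_on_diag, int nang) {
  Kokkos::parallel_for("diag_sweep", team_policy(ngroups, Kokkos::AUTO),
    KOKKOS_LAMBDA(const member_type& team) {
      const int g = team.league_rank();                 // energy group
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, ncells_on_diag),
        [&](const int c) {                              // cell on the diagonal
          Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, nang),
            [&](const int a) {                          // vectorized angle loop
              psi(g, c, a) = q(g, c, a);
            });
        });
    });
}
```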




6.1.3 Optimized HP Implementation

The Optimized Hierarchical Parallelism version was developed by consulting Kokkos developers (specifically, Christian Trott) and sharing initial performance numbers of the HP implementation. Christian was able to suggest optimizations to be made, including thread-specific scratch memory, and even a fix for a parallel correctness issue which resulted from a misunderstanding of which parallel regions could be active simultaneously in the Kokkos HP model. This implementation is expected to have the best performance of the three, and to perform approximately as well as the original OpenMP Fortran kernelized version.
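One of those suggestions, team-specific scratch memory, looks roughly like the sketch below; this is a hedged illustration of the Kokkos scratch-space mechanism, not the actual optimized SNAP kernel, and all names and sizes are invented.

```cpp
#include <Kokkos_Core.hpp>

using team_policy  = Kokkos::TeamPolicy<>;
using member_type  = team_policy::member_type;
using scratch_view = Kokkos::View<double*,
    Kokkos::DefaultExecutionSpace::scratch_memory_space,
    Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

// Each team stages a small working array in fast scratch space instead of
// repeatedly touching global memory.
void scratch_example(Kokkos::View<double**> flux, int nteams, int nang) {
  const size_t bytes = scratch_view::shmem_size(nang);
  team_policy policy(nteams, Kokkos::AUTO);
  Kokkos::parallel_for("scratch_sweep",
    policy.set_scratch_size(0, Kokkos::PerTeam(bytes)),
    KOKKOS_LAMBDA(const member_type& team) {
      scratch_view work(team.team_scratch(0), nang);    // team-private scratch
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nang),
        [&](const int a) { work(a) = 0.0; });
      team.team_barrier();
      // The real kernel would accumulate angular contributions into work here.
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nang),
        [&](const int a) { flux(team.league_rank(), a) += work(a); });
    });
}
```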

6.2 Performance

We tested three different configurations of problem size against the three Kokkos implementations and the Fortran, across the various hardware architectures available. The problem size configurations tested were denoted A, B, and C. Configuration A featured a spatial domain of 8x8x8, 40 energy groups, and 240x8 angles. B tested a domain of 16x16x16, 80 energy groups, and 360x8 angles. Finally, C tested a 32x32x32 domain, 120 energy groups, and 480x8 angles. We made use of computing resources hosted in the Darwin cluster (a heterogeneous test cluster at Los Alamos National Laboratory) and the White cluster (a cluster of PowerPC 8 nodes equipped with NVIDIA K40 GPUs at Sandia National Laboratories). We tested the kernelized version of Fortran SNAP both with (F O) and without OpenMP (F NO). We tested all three implementations of Kokkos SNAP, including the Direct (DK), Hierarchical Parallelism (KHP), and Optimized HP (OPT), using thread backends of Serial (S), OpenMP (O), and PThreads (T).

The direct implementation (DK) fared poorly, as expected. In fact, when run on Haswell-equipped nodes, in OpenMP and Pthreads mode, it would actually take more time than when run in Serial mode, due to thread contention, which was verified using Intel's diagnostic tool, VTune. The HP Kokkos version was an improvement over the direct implementation, and the Optimized HP (OPT) was the best of the three when tested on Haswell processors. In the end, there were combinations of problem size and threading model for which OPT performed competitively and better than Fortran (with OpenMP turned back on).

6.3 Portability

We had access to compute nodes that included Haswell (similar to those in ATS1, Trinity, http://www.lanl.gov/projects/trinity/), Xeon Phi (KNC), Nvidia K40, and Power8. Of those four architectures, we were able to run the same code base of the Kokkos implementations on all but the K40. There were additional modifications needed, above and beyond using Kokkos datatypes, in order to support sweep scheduling data structures that would be necessary to successfully run using the CUDA Kokkos device.

6.4 Productivity

If Kokkos-equipped code were to be written in the same fashion as the direct implementation, it is possible that productivity could be either lower (more lines of code necessary) or higher (re-use of generic functors). For the HP implementations, both base and optimized, sections of computational code expressed as lambdas in Kokkos format are generally the same number of lines of code as Fortran, with similar appearance, including only an extra line at the start of the vectorized parallel region.

6.5 GPU Compatibility Experiments

Following the above portion of the work, we overcame our challenges in portability and ran experiments which showed great speedup for running the SNAP kernel on a Tesla K40. We make use of these experiments in Section 7, which follows. The best speedup gained was approximately 20 times faster than the original Fortran version of the kernel run with OpenMP, for the largest problem size.

7 PARTISN Engineering and Experiments

PARTISN and SNAP are similar in many ways, especially with respect to array sizes, function signatures, and iterative structures. However, PARTISN is more complex than SNAP, because it contains decades of development, and so the changes that could be effected with an engineering effort similar to the SNAP experiments are smaller in scope. The technical decision was made to focus on the interface layer and the C++ kernel, as both of those would offer lasting value to the code team responsible for PARTISN's development, were they interested in refining the efforts of this study. This limited the available performance, by choosing to use the current version of Kokkos rather than native CUDA, and by not rewriting the threading section of PARTISN due to time constraints.

In working with SNAP, we deduced a simple translation of a kernel with generic input argument sets. In working with PARTISN, we worked backwards from the kernel arguments (both the function signature and, implicitly, MODULE USE statements) to trace memory allocation to its source. In addition, threaded regions in PARTISN were more complex, and, since we were working with a production code and not a proxy, maintaining the numerical behaviour of the code was mandatory. With MPI embedded at the thread level in PARTISN, and Kokkos not supporting (at the time of the experiments) multi-threaded access to CUDA-UVM (Unified Virtual Memory, an advanced CUDA feature which allows for host-backed device memory allocations with runtime-managed bus transfer), our choices were either to allow only single-threaded behaviour and use CUDA-UVM through Kokkos to prevent memory duplication, or to allow multi-threaded access through native CUDA, which would have the best chance of superior computational performance, with an increase in engineering scope. For an even further increase in performance, we could have lifted the scope of concurrency explored in this experiment to the start of the threaded region in PARTISN. However, this would have required embedding MPI in threaded regions on the CPU and/or the GPU, which would in turn greatly increase our engineering scope and again limit our toolchain options.

It should be noted that while both SNAP and PARTISN are written in Fortran, the Kokkos SNAP kernels earlier in this work were written solely in C++. Thus, much of the complexity of our work stemmed from language interoperability. Despite this, we were able to interface C++ memory allocations with a Fortran code, and maintain convergence behaviour for the following three scenarios: 1) the original Fortran version, with native memory allocations and kernel, 2) Fortran with C++ memory allocations (in CUDA-UVM through Kokkos) and a native kernel, and 3) both the C++ memory allocations and the Kokkos opt3_sweep kernel. These various exercises helped to expose the intricacies and pitfalls in translating idiomatic logic between distinct programming languages, and to highlight the kernel debugging complexity that follows from the requirement of numerical compatibility.


7.1 Modifying Production Code

In the following two portions of the work, Sections 7.1.1 and 7.1.2, we discuss, in the context of modifying an expansive production code, the changes that we had to make structurally for memory access and allocation, and then the design of the kernel and its interface, respectively.

7.1.1 Structural Design and Changes

We started by identifying the memory arrays that would need to be used by a GPU kernel version of opt3_sweep. We traced that memory back to its allocation sites, and made note of the native routines which allocated them. With the detective work completed, we replicated the functionality of the allocation routines with new versions (substituting CUDA-UVM memory allocated by C++ for CPU-allocated Fortran memory). Once the new allocations were in place, we tested the modified code with the new memory and the Fortran version of the kernel.

The allocation routine interface takes the dimensions (a single dimension, in this case) and a string of the array name, and returns a C pointer to the raw allocation (safe to use on GPU memory only because it is CPU-backed UVM) and a C pointer to the view (retained to pass back in as a kernel argument later). Then, in the implementation, we have the wrapper for our C++ allocation, followed by the intrinsic Fortran subroutine c_f_pointer, from ISO_C_BINDING interoperability, which causes a Fortran pointer to be assigned to point to a C pointer (here, the C pointer points to the CPU side of the UVM memory allocation), given both pointers and the array shape. It is important to note that this requires that the C pointer point to memory which is laid out in the same fashion as Fortran would lay out a multidimensional array, as Fortran does not have the same facilities as C with respect to arbitrary layouts of multidimensional arrays.

Next we discuss the C++ allocation routine. We check the dimensions to prevent allocating a dimension as zero (as zero-length allocations are legal in Fortran but not in C). Next the character array is converted to a string. Then, the string and array length are used to allocate the view. deep_copy is invoked to zero out the allocated memory on both sides. Finally, the raw pointer to the CPU portion of the UVM allocation is obtained and stored to be passed back to Fortran for use with c_f_pointer (as mentioned earlier). Not shown is that embedded in the view_1d_t type is the specification for the memory allocation location (for this work, CUDA-UVM) and the layout order (LayoutLeft in Kokkos parlance), which maps to the Fortran memory order.
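A hedged sketch of what the C++ side of such an allocation wrapper can look like, assuming a CUDA-enabled Kokkos build; the symbol name partisn_alloc_1d, the handle passing, and the omitted string-length handling are illustrative, not PARTISN's actual interface.

```cpp
#include <Kokkos_Core.hpp>
#include <string>

// Fortran-callable allocation wrapper: allocates a 1D LayoutLeft view in
// CUDA-UVM, zeroes it, and hands back (a) the raw CPU-visible pointer, to be
// associated with a Fortran pointer via c_f_pointer, and (b) an opaque handle
// to the view, to be passed back in later as a kernel argument.
using view_1d_t =
    Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::CudaUVMSpace>;

extern "C" void partisn_alloc_1d(int n, const char* name,
                                 double** raw_ptr, void** view_handle) {
  if (n < 1) n = 1;            // zero-length allocations are legal in Fortran, not here
  auto* v = new view_1d_t(std::string(name), n);
  Kokkos::deep_copy(*v, 0.0);  // zero the allocation on both host and device sides
  *raw_ptr = v->data();        // CPU side of the UVM allocation, for c_f_pointer
  *view_handle = static_cast<void*>(v);
}
```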

7.1.2 Kernel Design and Implementation

In Fortran, the colon operator can mean “all of the array elements in this dimension”. This can be used in a handful of ways, which are transliterated distinctly. It can be used on both sides of an assignment operator, given that the dimensions referred to by the colons are compatible. Rewriting this type of usage in a parallel kernel is straightforward: one simply adds an iterator index in place of the colon. It is important to note that ordering is not implicit in the parallel region.

Since the colon can be used to subset an array's dimension(s), it is idiomatic in Fortran to use this ability when reducing the dimensionality of an array between execution scopes. For example, the driver of a routine might feature a 4D array, where one dimension signifies multiple time values for a numerical method requiring such, and the other three are spatial dimensions. When passing this 4D array to a physics routine, multiple time values might not be needed, and so the array is subsetted upon entry into the subroutine. In rewriting this behaviour in C++ from Fortran, we use a feature of Kokkos known as the subview. A subview is a full-class view, a reference-counted pointer to memory, that can point to a subset of the view that is used to create the subview.
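A hedged sketch of the subview idiom just described (the array shape, the placement of the time dimension, and the update are illustrative): a 4D view is reduced to a 3D slice for one time level, mirroring the Fortran habit of passing arr(:,:,:,it) into a physics routine.

```cpp
#include <Kokkos_Core.hpp>

// Take the 3D spatial slice of a 4D (x, y, z, time) view at time level t and
// update it; the subview references the same memory as the parent view.
void zero_time_level(Kokkos::View<double****, Kokkos::LayoutLeft> arr, int t) {
  auto slice = Kokkos::subview(arr, Kokkos::ALL(), Kokkos::ALL(),
                               Kokkos::ALL(), t);
  const int nx = static_cast<int>(slice.extent(0));
  const int ny = static_cast<int>(slice.extent(1));
  const int nz = static_cast<int>(slice.extent(2));
  Kokkos::parallel_for("zero_slice",
    Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {nx, ny, nz}),
    KOKKOS_LAMBDA(const int i, const int j, const int k) {
      slice(i, j, k) = 0.0;    // explicit indices replace the Fortran colons
    });
}
```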

7.2 Lessons Learned

We tested the modified PARTISN code and showed that it provided physical results compatible with the original code. This was a great engineering success for us: it let us know that it is possible to run PARTISN in a GPU environment. We tested the modified version (using C++-backed memory allocations) both with and without the GPU kernel enabled, and showed success via numerical compatibility in both cases. In our experiments, we found the modified version to be much slower than the original version, because of the lack of threaded concurrency and the greatly reduced kernel launch sizing parameters. Above, in SNAP, we were able to launch virtually the entire sweep all at once, but with PARTISN, only a tiny fraction thereof. This discrepancy in the size of the kernel launches led to a slowdown in PARTISN when running the sweep kernel on a GPU versus an equivalent CPU.

Some care had to be taken to ensure the correct level of pointer dereferencing was being used at various stages of the interface between Fortran and C++. We learned to use a technique of testing the pointer level of a Kokkos view by calling a member function. If the member function was linked correctly, then we knew we had the correct level of dereferencing for use in a parallel_for construct. Subviews are pointers to views, and this technique helped especially with subviews whose type was deduced with the auto keyword.

We felt that many of the array allocations could have been created with automated tools. Scripts and/or parsers could have been written to separate a representative ALLOCATE statement into component allocation routines. While not necessary in our experiment, such tooling would be invaluable to a production team.

One of the more complex aspects of this work was in aligning production code team desires with experimental team desires, and the programming environment toolchain, between the two groups. The production team preferred, for example, the Intel compiler, but was willing to manage GCC support for the experimental team. The experimental team preferred rewriting kernel sections as lambdas, instead of functors, because we feel that this preserves, to the greatest extent possible, the look and feel of the code, which would improve maintainability for subject matter experts. Lambdas in CUDA are an experimental feature in CUDA 7.5, and are still experimental in the CUDA 8.0 Release Candidate (which was current as of the time of the work); they were added to CUDA by NVIDIA at the behest of the Kokkos developers. In 7.5, nvcc (the CUDA compiler) only worked with GNU compilers (and only those below major version 5), while in 8.0 it works with Intel (and others) as well. However, the experimental features (which we needed for lambdas) were only available with GCC. In addition, we required nested lambdas (to write the kernel in the most future-proof fashion, were we able to expose greater concurrency in the kernel and surrounding regions of PARTISN code), and this mandated the 8.0 RC of CUDA. This versioning matrix gives some insight into the complexity of the toolchain choices we faced. We were pleased to find at least one combination of toolchain selections which supported our endeavours, and we have every reason to believe that the number of choices will expand with time, as NVIDIA has shown willingness to increase compiler family support in nvcc from 7.5 to 8.0.

In supporting many different physics solvers and modes, PARTISN has a fairly complex set of memory allocation sites.
There are places where an array, if not needed by a particular physics mode, is allocated with zero-length dimensions. This preserves array dimensionality for subroutine arguments, but reduces memory allocation demands for efficiency. This is perfectly correct, idiomatic Fortran, but it presents difficulty in transliterating because the same idiom is not present in C++ (not the least reason being that C++ does not have multi-dimensional arrays as a first-class data type). To work around this, we identified the areas where this issue was relevant for our work and modified them to produce arrays with length one in those dimensions instead of zero. Then, as in the native PARTISN code, we safely ignored those shortened arrays.

8 Conclusions

We have argued for the importance of modernizing code, and given examples of how that modernization might be effected. Specifically, we have shown how a proxy app can be used to explore techniques such as tasking run-time systems and fine-grained parallelism. We have shown that, for a complex kernel in a proxy app, speedup can be gained by reimplementing the kernel in Kokkos to execute it on a GPU. Finally, when we applied the same technique to the code from which the proxy was derived, we were not able to realize the same speedup due to the parent code's complexity. In fact, we achieved a slowdown, due to the lack of available concurrency in the necessary programming scope. We reason that if modernizing techniques are to be applied to a code in ways that require language interoperability and the interleaving of shared and distributed memory parallelism, then one must be willing to make deeply invasive changes to that code.

Acknowledgments

We thank Joe Zerr, Randy Baker, and Rob Lawrie for their explanations of PARTISN and willingness to help us debug our efforts to support numerical compatibility between the original and modified versions of the code. We would like to thank our reviewers for their constructive comments, which improved the technical accuracy and readability of this work. This work was performed under US Government contract DE-AC52-06NA25396 for Los Alamos National Laboratory, which is operated by Los Alamos National Security, LLC, for the U.S. Department of Energy (LA-UR-15-27949, LA-UR-15-26757, LA-UR-16-21207, LA-UR-16-27057, LA-UR-16-29442).

References

[1] Ray E. Alcouffe, Randal S. Baker, Jon A. Dahl, Scott A. Turner, and Robert Ward. PARTISN: A time-dependent, parallel neutral particle transport code system. Los Alamos National Laboratory, LA-UR-05-3925, May 2005.
[2] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 66. IEEE Computer Society Press, 2012.
[3] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing, 74(12):3202–3216, 2014.
[4] R.J. Zerr and R.S. Baker. SNAP: SN (discrete ordinates) application proxy: Description. Los Alamos National Laboratory, Tech. Rep. LA-UR-13-21070, 2013.
