Revisiting conventional task schedulers to exploit asymmetry in multi-core architectures for dense linear algebra operations


Luis Costero a, Francisco D. Igual a,∗, Katzalin Olcoz a, Sandra Catalán b, Rafael Rodríguez-Sánchez b, Enrique S. Quintana-Ortí b

a Depto. de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain
b Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain


Article history: Received 14 October 2016; Revised 9 May 2017; Accepted 1 June 2017.
Keywords: Linear algebra; Task parallelism; Runtime task schedulers; Asymmetric architectures

Abstract

Dealing with asymmetry in the architecture opens a plethora of questions related to the performance- and energy-efficient scheduling of task-parallel applications. While there exist early attempts to tackle this problem, for example via ad-hoc strategies embedded in a runtime framework, in this paper we take a different path, which consists in addressing the asymmetry at the library level by developing a few asymmetry-aware fundamental kernels. The appealing consequence is that the architecture heterogeneity remains hidden from the task scheduler. To illustrate the advantage of our approach, we employ two well-known matrix factorizations, key to the solution of dense linear systems of equations. From the perspective of the architecture, we consider two low-power processors, one of them equipped with ARM big.LITTLE technology; furthermore, we include in the study a different scenario, in which the asymmetry arises when the cores of an Intel Xeon server operate at two distinct frequencies. For the specific domain of dense linear algebra, we show that dealing with asymmetry at the library level is not only possible but delivers higher performance than a naive approach based on an asymmetry-oblivious scheduler. Furthermore, this solution is also competitive in terms of performance with an ad-hoc asymmetry-aware scheduler furnished with sophisticated scheduling techniques.

1. Introduction

The end of Dennard scaling [1] has promoted heterogeneous systems into a mainstream approach to leverage the steady growth of transistors on chip dictated by Moore's law [2,3]. Asymmetric multicore processors (AMPs) are a particular class of heterogeneous architectures that combine two (or more) distinct types of cores, but differ from multicore servers equipped with hardware accelerators in that the AMP cores share the same instruction set architecture (single-ISA) and a strongly coupled memory subsystem. A representative example of these architectures is the family of ARM-based AMPs, consisting of a cluster of high-performance (big) cores plus a cluster of low-power (LITTLE) cores. These architectures can, in theory, deliver much higher performance for the same power budget and adapt better to heterogeneous application ecosystems [4].


A different scenario leading to asymmetric configurations arises when a conventional system equipped with a symmetric multicore architecture, e.g. an Intel Xeon multi-socket server, has to regulate its sockets/cores to operate at two (or more) voltage-frequency scaling (VFS) levels, during a certain period, in order to meet a given power budget.

Task parallelism has been reported as an efficient means to leverage the considerable number of cores in current processors. Several efforts, pioneered by Cilk [5], aim to ease the development and improve the performance of task-parallel programs by tackling task scheduling from inside a runtime (framework). The benefits of this approach for complex dense linear algebra (DLA) operations have been demonstrated, among others, by OmpSs [6], StarPU [7], PLASMA/MAGMA [8,9], and libflame [10]. In general, the runtimes underlying these tools decompose DLA routines into a collection of numerical kernels (or tasks), and then take into account the dependencies between the tasks in order to correctly issue their execution to the system cores. The tasks are therefore the "indivisible" scheduling unit, while the cores constitute the basic computational resources.

Task-based programming models and asymmetric architectures (specifically, low-power ones such as the ARM big.LITTLE processors) have recently attracted the interest of the scientific community to tackle the programmability and energy consumption challenges, respectively, on the road towards Exascale [11,12]. Two questions arise when combining both approaches: first, will these architectures require a complete redesign of existing runtime task schedulers? Second, how will conventional (asymmetry-oblivious) runtimes behave when operating on asymmetric architectures?

In this paper we address both questions by introducing an innovative and efficient approach to execute task-parallel DLA routines on AMPs via conventional asymmetry-oblivious schedulers. Conceptually, our solution aggregates the cores of the AMP into a number of symmetric virtual cores (VCs), which become the only basic computational resources that are visible to the runtime scheduler. In addition, a specialized implementation of each type of task, embedded into an asymmetry-aware DLA library, partitions each numerical kernel into a series of fine-grain computations, which are efficiently executed by the asymmetric aggregation of cores of a single VC. Our work thus makes the following specific contributions:

• We target the Cholesky and LU factorizations [13], two complex operations for the solution of linear systems that are representative of many other DLA computations. Therefore, we are confident that our approach and results carry over to a significant fraction of the DLA functionality covered by LAPACK (Linear Algebra PACKage) [14].
• For these two DLA operations, we describe how to leverage the asymmetry-oblivious task-parallel runtime scheduler in OmpSs, in combination with a data-parallel instance of the BLAS-3 (basic linear algebra subprograms) [15] in the BLIS library specifically designed for AMPs [16,17].
• We provide practical evidence that, compared with an ad-hoc asymmetry-conscious scheduler recently developed for OmpSs [18], our solution yields higher performance for the multi-threaded execution of the LU and Cholesky factorizations on four different asymmetric configurations, including state-of-the-art instances of ARM big.LITTLE and VFS-configured Intel Xeon systems.
• In conclusion, compared with previous work [16,18], this paper demonstrates that, for the particular domain of DLA, it is possible to hide the difficulties intrinsic to dealing with an asymmetric architecture (e.g., workload balancing for performance, energy-aware mapping of tasks to cores, and criticality-aware scheduling) inside an asymmetry-aware implementation of the BLAS-3. In consequence, our solution allows any conventional (asymmetry-agnostic) scheduler to be reused to exploit the task parallelism present in complex DLA operations.

This paper is an extension of the work presented in [19]. In particular, the experimental analysis of the LU factorization with partial pivoting, as well as the study with ARM 64-bit AMPs and VFS-asymmetric scenarios, were not covered in [19].

The rest of the paper is structured as follows. Section 2 presents the target asymmetric configurations, including two ARM big.LITTLE AMPs and the Intel Xeon architecture. Section 3 reviews the state-of-the-art in runtime-based task scheduling and DLA libraries for (heterogeneous and) asymmetric architectures. Section 4 introduces the approach to reuse conventional runtime task schedulers on AMPs by relying on an asymmetric DLA library. Section 5 reports the performance results attained by the proposed approach on different architectures for both the Cholesky and the LU with partial pivoting factorizations; more specifically, Section 5.3 dives into performance details for a given architecture (Exynos) and operation (Cholesky factorization). Finally, Section 6 closes the paper with the concluding remarks.

2. Asymmetric configurations and software support

2.1. Target asymmetric configurations

The architectures employed to evaluate our solution can be divided into two broad groups: physically-asymmetric architectures, in which the big and LITTLE clusters differ in the microarchitecture and (possibly) in the processor frequency; and virtually-asymmetric architectures, where the two clusters differ only in their frequency. Table 1 summarizes the main characteristics of the three target architectures that yield the four asymmetric configurations for the experimentation.

In the first group, Exynos is an ODROID-XU3 board comprising a 32-bit Samsung Exynos 5422 SoC with an ARM Cortex-A15 quad-core processing cluster (1.3 GHz) and a Cortex-A7 quad-core processing cluster (1.3 GHz). Each core has a private 32+32-Kbyte L1 (instruction+data) cache. The four A15 cores share a 2-Mbyte L2 cache, while the four A7 cores share a smaller 512-Kbyte L2 cache.


Table 1. Description of the architectures and asymmetric configurations employed in the experimental evaluation. The intra- or inter-socket distribution for each VC indicates whether the cores that compose a VC are part of the same socket or distributed across different sockets, respectively.

Name    | Performance cluster (P)          | Low energy cluster (L)           | Virtual Cores (VC)
        | Cores  Arch.         Freq. (GHz) | Cores  Arch.         Freq. (GHz) | Number (distribution)
Exynos  | 4      Cortex-A15    1.3         | 4      Cortex-A7     1.3         | 4  (1P+1L, intra-socket)
Juno    | 2      Cortex-A57    1.1         | 4      Cortex-A53    0.85        | 2  (1P+2L, intra-socket)
XeonA   | 7      Xeon E5-2695  2.3         | 7      Xeon E5-2695  1.2         | 7  (1P+1L, intra-socket)
XeonB   | 14     Xeon E5-2695  2.3         | 14     Xeon E5-2695  1.2         | 14 (1P+1L, inter-socket)

[Figure: three schematic panels showing what the OS/task scheduler sees under each mode: (a) CSM, two asymmetric clusters, only one active at a time; (b) CPUM, 4 VCs with one physical core per VC active at a time; (c) GTS, 8 asymmetric cores, any active at a time.]

Fig. 1. Operation modes for recent big.LITTLE architectures.

Juno is an ARM 64-bit development board featuring a Cortex-A57 dual-core processing cluster (1.1 GHz) and a Cortex-A53 quad-core processing cluster (850 MHz). Each A57 core has a private 48+32-Kbyte L1 (instruction+data) cache; each A53 core has a private 32+32-Kbyte L1 (instruction+data) cache. The two A57 cores share a 2-Mbyte L2 cache, while the four A53 cores share a smaller 1-Mbyte L2 cache.

To evaluate scenarios involving virtually-asymmetric architectures, we leverage the ability of recent Intel Xeon processors to apply VFS on a per-core basis. For this purpose, we operate on a Xeon E5-2695v3 platform equipped with two 14-core sockets to configure two experimental platforms: XeonA employs only the 14 cores in one of the sockets, divided into a 7-core Performance (P) cluster operating at the processor base frequency (2.3 GHz), and a 7-core Low-energy (L) cluster in which the cores operate at the processor lowest frequency (1.2 GHz); XeonB employs the full system (28 cores), with one of the sockets operating at the base frequency and the second socket at the lowest; it can thus be viewed as an asymmetric processor composed of 14 VCs, each one featuring a Performance plus a Low-energy core. In an abuse of notation, hereafter we will refer to the groups of cores operating at high and low frequencies as big and LITTLE, respectively, in all target architectures.

2.2. Execution modes on ARM big.LITTLE architectures

Current ARM big.LITTLE-based SoCs offer a number of software execution models with support from the operating system (OS):

1. Cluster Switching Mode (CSM): The processor is logically divided into two clusters, one containing the big cores and the other the LITTLE cores, but only one cluster is active at any moment in time. The OS transparently activates/deactivates the clusters depending on the workload to balance performance and energy consumption.

2. CPU migration (CPUM): The physical cores are fused into several VCs, each consisting of a few fast cores and a number of slow ones, exposing a symmetric architecture to which the OS maps threads. At any given moment, only one type of physical core (fast or slow) is active per VC, depending on the requirements of the workload. The In-Kernel Switcher (IKS) is the solution for this model from Linaro (https://www.linaro.org), a collaborative software ecosystem for ARM platforms that includes an ad-hoc GNU/Linux kernel.

3. Global Task Scheduling (GTS): This is the most flexible model. All cores are available for thread scheduling, and the OS maps the threads to any of them depending on the specific nature of the workload and core availability. ARM's implementation of GTS is referred to as big.LITTLE MP.

Fig. 1 offers a schematic view of these three execution models for an ARM big.LITTLE architecture, such as that in the Exynos 5422, with a big cluster and a LITTLE cluster consisting of 4 cores each. CSM is the least flexible solution, since only one cluster is available at any time. GTS is the most flexible solution, allowing the OS scheduler to map threads to any available core or group of cores. GTS exposes the complete pool of 8 (fast and slow) cores in the Exynos 5422 SoC to the OS.



void cholesky( double *A[s][s], int b, int s )
{
  for ( int k = 0; k < s; k++ ) {
    po_cholesky( A[k][k], b, b );                // Cholesky factorization (diagonal block)

    for ( int j = k + 1; j < s; j++ )
      tr_solve( A[k][k], A[k][j], b, b );        // Triangular solve

    for ( int i = k + 1; i < s; i++ ) {
      for ( int j = i + 1; j < s; j++ )
        ge_multiply( A[k][i], A[k][j],
                     A[i][j], b, b );            // Matrix multiplication
      sy_update( A[k][i], A[i][i], b, b );       // Symmetric rank-b update
    }
  }
}

Listing 1. C implementation of the blocked Cholesky factorization.

This allows a straightforward migration of existing threaded applications, including runtime task schedulers, to exploit all the computational resources in this AMP, provided the multi-threading technology underlying the software is based on conventional tools such as, e.g., POSIX threads or OpenMP. Attaining high performance on asymmetric architectures, even with a GTS setup, is not trivial, and thus becomes one of the goals of this paper. Alternatively, CPUM proposes a pseudo-symmetric view of the Exynos 5422, transforming this 8-core asymmetric SoC into 4 symmetric multicore processors (SMPs), which are logically exposed to the OS scheduler. (In fact, as this model only allows one active core per VC, but the type of the specific core that is in operation can differ from one VC to another, CPUM is still asymmetric.)

In practice, application-level runtime task schedulers can mimic or approximate any of these OS operation modes. A straightforward model is obtained by simply following the principles governing GTS to map ready tasks to any available core. With this solution, load imbalance can be tackled via ad-hoc (i.e., asymmetry-aware) scheduling policies embedded in the runtime that map tasks to the most "appropriate" resource.

To close this section, we recognize that Intel does not support the previous software execution models but, when an Intel Xeon server is transformed into a VFS-asymmetric system, the principles underlying CPUM and GTS are directly applicable.

3. Parallel execution of DLA operations on multi-threaded architectures

In this section we briefly review several software efforts, in the form of task-parallel runtimes and libraries, that were specifically designed for DLA or have been successfully applied in this domain, when the target is (a heterogeneous system or) an AMP.

3.1. Runtime task scheduling of complex DLA operations

3.1.1. Task scheduling of DLA matrix factorizations

We start by describing how to extract task parallelism during the execution of a DLA operation, using the Cholesky factorization as a workhorse example. This particular operation, which is representative of several other factorizations for the solution of linear systems, decomposes an n × n symmetric positive definite matrix A into the product A = U^T U, where the n × n Cholesky factor U is upper triangular [13].

Listing 1 displays a simplified C code for the factorization of an n × n matrix A, stored as s × s (data) sub-matrices of dimension b × b each. This blocked routine decomposes the operation into a collection of building kernels: po_cholesky (Cholesky factorization), tr_solve (triangular solve), ge_multiply (matrix multiplication), and sy_update (symmetric rank-b update). The order in which these kernels are invoked during the execution of the routine, and the sub-matrices that each kernel reads/writes, result in a directed acyclic graph (DAG) that reflects the dependencies between tasks (i.e., instances of the kernels) and, therefore, the task parallelism of the operation. For example, Fig. 2 shows the DAG with the tasks (nodes) and data dependencies (arcs) intrinsic to the execution of Listing 1, when applied to a matrix composed of 4 × 4 sub-matrices (i.e., s = 4).

The DAG associated with an algorithm/routine is a graphical representation of the task parallelism of the corresponding operation, and a runtime system can exploit this information to determine the task schedules that satisfy the DAG dependencies.
For this purpose, in OmpSs the programmer employs OpenMP-like directives (pragmas) to annotate routines appearing in the code as tasks, indicating the directionality of their operands (input, output or input/output) by means of clauses. The OmpSs runtime then decomposes the code (transformed by the Mercurium source-to-source compiler) into a number of tasks at run time, dynamically identifying the dependencies among them, and issuing ready tasks (those with all their dependencies satisfied) for execution to the processor cores of the system. Listing 2 shows the annotations a programmer needs to add in order to exploit task parallelism using OmpSs; see in particular the lines labelled with "#pragma omp". The clauses in, out and inout denote data directionality, and help the task scheduler to keep track of data dependencies between tasks during the execution.


Fig. 2. DAG with the tasks and data dependencies resulting from the application of the code in Listing 1 to a 4 × 4 blocked matrix (s = 4). The labels specify the type of kernel/task with the following correspondence: “C” for the Cholesky factorization, “T” for the triangular solve, “G” for the matrix multiplication, and “S” for the symmetric rank-b update. The subindices (starting at 0) specify the sub-matrix that the corresponding task updates, and the color is used to distinguish between different values of the iteration index k.

#pragma omp task inout([b][b] A)
void po_cholesky( double *A, int b, int ld )
{
  static int INFO = 0;
  static const char UP = 'U';
  dpotrf( &UP, &b, A, &ld, &INFO );              // LAPACK Cholesky factorization
}

#pragma omp task in([b][b] A) inout([b][b] B)
void tr_solve( double *A, double *B, int b, int ld )
{
  static double DONE = 1.0;
  static const char LE = 'L', UP = 'U', TR = 'T', NU = 'N';
  dtrsm( &LE, &UP, &TR, &NU, &b, &b,             // BLAS-3 triangular solve
         &DONE, A, &ld, B, &ld );
}

#pragma omp task in([b][b] A, [b][b] B) inout([b][b] C)
void ge_multiply( double *A, double *B, double *C, int b, int ld )
{
  static double DONE = 1.0, DMONE = -1.0;
  static const char TR = 'T', NT = 'N';
  dgemm( &TR, &NT, &b, &b, &b,                   // BLAS-3 matrix multiplication
         &DMONE, A, &ld, B, &ld, &DONE, C, &ld );
}

#pragma omp task in([b][b] A) inout([b][b] C)
void sy_update( double *A, double *C, int b, int ld )
{
  static double DONE = 1.0, DMONE = -1.0;
  static const char UP = 'U', TR = 'T';
  dsyrk( &UP, &TR, &b, &b, &DMONE, A, &ld,       // BLAS-3 symmetric rank-b update
         &DONE, C, &ld );
}

Listing 2. Labeled tasks involved in the blocked Cholesky factorization.

We note that, in this implementation, the four kernels simply contain calls to four fundamental computational kernels for DLA from LAPACK (dpotrf) and the BLAS-3 (dtrsm, dgemm and dsyrk).

The LU factorization is the analogous decomposition for the solution of unsymmetric linear systems. The blocked algorithm for this factorization includes partial pivoting for numerical stability [13] and is composed of four building kernels: ge_lu (LU factorization), la_swp (row permutations), ge_multiply, and tr_solve. The rationale to extract task parallelism is the same for this operation: the order in which these kernels appear in the execution and their input/output operands define a collection of dependencies in the form of a DAG that a runtime system can employ to produce a correct schedule of the tasks; see, e.g., [20] for details.
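To make this task decomposition more concrete, the following sketch (ours, not taken from the paper) casts a right-looking blocked LU with partial pivoting in the same style as Listing 1. The kernel names come from the text above, but their argument lists, the panel-wide scope of ge_lu and la_swp, and the helper array col are assumptions introduced for illustration; the formulation actually used by the authors is the one described in [20].

/* Illustrative sketch: blocked right-looking LU with partial pivoting.
 * Assumed prototypes (signatures are not from the paper):                     */
void ge_lu( double *panel[], int nblk, int *ipiv, int b );       /* LU of a panel of nblk blocks   */
void la_swp( double *col[], int nblk, const int *ipiv, int b );  /* apply row interchanges         */
void tr_solve( double *L, double *B, int b, int ld );            /* triangular solve (unit lower)  */
void ge_multiply( double *A, double *B, double *C, int b, int ld ); /* C := C - A * B              */

void lu( int s, int b, double *A[s][s], int *ipiv[s] )
{
  double *col[s];                                   /* scratch list of blocks in one column        */

  for ( int k = 0; k < s; k++ ) {
    for ( int i = k; i < s; i++ ) col[i-k] = A[i][k];
    ge_lu( col, s-k, ipiv[k], b );                  /* factorize panel k (with partial pivoting)   */

    for ( int j = 0; j < s; j++ ) {                 /* apply the row swaps to all other columns    */
      if ( j == k ) continue;
      for ( int i = k; i < s; i++ ) col[i-k] = A[i][j];
      la_swp( col, s-k, ipiv[k], b );
    }

    for ( int j = k + 1; j < s; j++ ) {
      tr_solve( A[k][k], A[k][j], b, b );           /* solve with the unit lower factor of A[k][k] */
      for ( int i = k + 1; i < s; i++ )
        ge_multiply( A[i][k], A[k][j], A[i][j], b, b ); /* trailing sub-matrix update              */
    }
  }
}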


void gemm( double A[m][k], double B[k][n], double C[m][n],
           int m, int n, int k, int mc, int nc, int kc )
{
  double *Ac = malloc( mc * kc * sizeof( double ) ),
         *Bc = malloc( kc * nc * sizeof( double ) );

  for ( int jc = 0; jc < n; jc += nc ) {            // Loop 1
    int jb = min( n - jc, nc );

    for ( int pc = 0; pc < k; pc += kc ) {          // Loop 2
      int pb = min( k - pc, kc );

      pack_buffB( &B[pc][jc], Bc, pb, jb );         // Pack B -> Bc

      for ( int ic = 0; ic < m; ic += mc ) {        // Loop 3
        int ib = min( m - ic, mc );

        pack_buffA( &A[ic][pc], Ac, ib, pb );       // Pack A -> Ac

        gemm_kernel( Ac, Bc, &C[ic][jc],
                     ib, jb, pb, mc, nc, kc );      // Macro-kernel
      } } } }

Listing 3. High performance implementation of gemm in BLIS.

3.1.2. Task scheduling in heterogeneous and asymmetric architectures

For servers equipped with one or more graphics accelerators, (specialized versions of) the schedulers underlying OmpSs, StarPU, MAGMA, Kaapi and libflame distinguish between the execution target being either a general-purpose core (CPU) or a GPU, assigning tasks to each type of resource depending on their properties, and applying techniques such as data caching or locality-aware task mapping; see, among many others, [20–24].

The designers/developers of the OmpSs programming model and the Nanos++ runtime task scheduler recently introduced a new version of their framework, hereafter referred to as Botlev-OmpSs, specifically tailored for AMPs [18]. This asymmetry-conscious runtime embeds a scheduling policy named CATS (Criticality-Aware Task Scheduler) that relies on bottom-level longest-path priorities, keeps track of the criticality of the individual tasks, and leverages this information, at execution time, to assign a ready task to either a critical or a non-critical queue. In this solution, tasks enqueued in the critical queue can only be executed by the fast cores. In addition, the enhanced scheduler integrates uni- or bi-directional work stealing between fast and slow cores. According to the authors, this sophisticated ad-hoc scheduling strategy for heterogeneous/asymmetric processors attains remarkable performance improvements in a number of applications; see [18] for further details.

When applied to a task-parallel DLA routine, the asymmetry-aware scheduler in Botlev-OmpSs maps each task to a single (big or LITTLE) core, and simply invokes a sequential DLA library to conduct the actual work. On the other hand, we note that this approach required an important redesign of the underlying scheduling policy (and thus, a considerable programming effort for the runtime developer) in order to exploit the heterogeneous architecture. In particular, detecting the criticality of a task at execution time is a nontrivial question.
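For reference, the notion of a bottom-level priority used by CATS can be illustrated with a small, generic computation over a task DAG; this is a textbook formulation written by us, not code from the Nanos++/Botlev-OmpSs runtime. The bottom level of a task is the length of the longest path from that task to an exit node of the DAG, so tasks with a large bottom level lie on (or near) the critical path. The dag_t layout and cost estimates below are assumptions for the example.

/* Generic sketch: bottom levels on a task DAG stored as successor lists.
 * blevel[t] = cost[t] + max over successors s of blevel[s].                  */
#include <stdlib.h>

typedef struct {
    int     ntasks;
    int    *nsucc;    /* nsucc[t]: number of successors of task t             */
    int   **succ;     /* succ[t][i]: i-th successor of task t                 */
    double *cost;     /* cost[t]: estimated execution time of task t          */
} dag_t;

static double blevel( const dag_t *g, int t, double *memo )
{
    if ( memo[t] >= 0.0 ) return memo[t];
    double best = 0.0;
    for ( int i = 0; i < g->nsucc[t]; i++ ) {
        double b = blevel( g, g->succ[t][i], memo );
        if ( b > best ) best = b;
    }
    return memo[t] = g->cost[t] + best;
}

/* Returns a newly allocated array with the bottom level of every task. */
double *bottom_levels( const dag_t *g )
{
    double *memo = malloc( g->ntasks * sizeof( double ) );
    for ( int t = 0; t < g->ntasks; t++ ) memo[t] = -1.0;
    for ( int t = 0; t < g->ntasks; t++ ) blevel( g, t, memo );
    return memo;
}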

3.2. Data-parallel libraries of fundamental (BLAS-3) DLA kernels

3.2.1. Multi-threaded implementation of the BLAS-3

An alternative to the runtime-based (i.e., task-parallel) approach to execute DLA operations on multi-threaded architectures consists in relying on a library of specialized kernels that statically partitions the work among the computational resources, or leverages a simple scheduling mechanism such as those available, e.g., in OpenMP. For DLA operations with few and/or simple data dependencies, as is the case of the BLAS-3, and/or when the number of cores in the target architecture is small, this option can avoid the costly overhead of a sophisticated task scheduler, providing a more efficient solution. Currently this is the preferred option for all high performance implementations of the BLAS for multicore processors, being adopted in both commercial and open source packages such as, e.g., AMD ACML, Intel MKL, GotoBLAS/OpenBLAS [25,26] and BLIS.

BLIS in particular mimics GotoBLAS to orchestrate all BLAS-3 kernels (including the matrix multiplication, gemm) as three nested loops around two packing routines, which accommodate the data in the higher levels of the cache hierarchy, and a macro-kernel in charge of performing the actual computations. Internally, BLIS implements the macro-kernel as two additional loops around a micro-kernel that, in turn, is essentially a loop around a rank-1 update. For the purpose of the following discussion, we will only consider the three outermost loops in the BLIS implementation of gemm for the multiplication C := C + A · B, where A, B, C are respectively m × k, k × n and m × n matrices, stored in arrays A, B and C; see Listing 3. In the code, mc, nc, kc are cache configuration parameters that need to be adjusted for performance taking into account, among others, the latency of the floating-point units, the number of vector registers, and the size/associativity degree of the cache levels.
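As a concrete illustration of the simple, loop-level scheduling mentioned above, the fragment below shows one possible OpenMP parallelization of Loop 3 of Listing 3 on a symmetric multicore. This is our own sketch, not the actual BLIS threading code (which drives its loops through an internal control tree); the per-thread packing buffer Ac_local is an assumption introduced to keep the example race-free, and the surrounding variables (m, pc, jc, jb, pb, mc, nc, kc, Bc, A, C) are those of Listing 3.

/* Sketch: symmetric loop-level parallelization of Loop 3 (the ic loop) of Listing 3.
 * Each thread packs its own block of A into a private buffer before calling the
 * macro-kernel; Bc has already been packed and is shared (read-only).               */
#pragma omp parallel
{
  double *Ac_local = malloc( mc * kc * sizeof( double ) );   /* private packing buffer   */

  #pragma omp for schedule(static)                           /* static 1-D partitioning  */
  for ( int ic = 0; ic < m; ic += mc ) {
    int ib = min( m - ic, mc );
    pack_buffA( &A[ic][pc], Ac_local, ib, pb );              /* Pack A -> Ac_local       */
    gemm_kernel( Ac_local, Bc, &C[ic][jc],
                 ib, jb, pb, mc, nc, kc );                   /* Macro-kernel             */
  }

  free( Ac_local );
}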


[Figure: performance in GFLOPS vs. problem dimension n (1000 to 4000), with one curve per worker-thread count (1 to 8 worker threads), for the Cholesky factorization with OmpSs and the sequential BLIS.]

Fig. 3. Performance of the Cholesky factorization using the conventional OmpSs runtime and a sequential implementation of BLIS on Exynos.

3.2.2. Data-parallel libraries for asymmetric architectures

The implementation of gemm in BLIS has been demonstrated to deliver high performance on a wide range of multicore and many-core SMPs [27,28]. These studies offered a few relevant insights that guided the parallelization of gemm (and also other Level-3 BLAS) on the ARM big.LITTLE architecture under the GTS software execution model. Concretely, the architecture-aware multi-threaded parallelization of gemm in [16] integrates the following three techniques:

• A dynamic 1-D partitioning of the iteration space to distribute the workload in either Loop 1 or Loop 3 of BLIS gemm between the two clusters.
• A static 1-D partitioning of the iteration space that distributes the workload of one of the loops internal to the macro-kernel between the cores of the same cluster.
• A modification of the control tree that governs the multi-threaded parallelization of BLIS gemm in order to accommodate different loop strides for each type of core architecture.

The strategy is general and can be applied to a generic AMP, consisting of any combination of fast/slow cores sharing the main memory, as well as to all other Level-3 BLAS operations.

To close this section, we emphasize that our work differs from [16,18] in that we address sophisticated DLA operations, with a rich hierarchy of task dependencies, by leveraging a conventional runtime scheduler in combination with a data-parallel asymmetry-conscious implementation of the BLAS-3.

4. Combining conventional runtimes and asymmetry-aware libraries to exploit heterogeneity

In this section, we initially perform an evaluation of the task-parallel Cholesky routine in Listings 1 and 2, executed on top of the conventional (i.e., default) scheduler in OmpSs linked to a sequential instance of BLIS, on Exynos. The outcome of this study motivates the development and experiments presented in the remainder of the paper.

4.1. Motivation: performance evaluation of conventional runtimes on AMPs

Fig. 3 reports the performance, in terms of GFLOPS (billions of flops per second), attained with the conventional OmpSs runtime when the number of worker threads (WT) varies from 1 to 8 and the mapping of worker threads to cores is delegated to the OS. We evaluated a range of block sizes (b in Listing 1), but for simplicity we report only the results obtained with the value of b that optimized the GFLOPS rate for each problem dimension. All the experiments hereafter employed IEEE double precision. Furthermore, we ensured that the cores operate at the highest possible frequency by setting the appropriate cpufreq governor. The conventional runtime of OmpSs corresponds to release 15.06 of the Nanos++ runtime task scheduler. For this experiment, it is linked with the "sequential" implementation of BLIS in release 0.1.5. (For the experiments with the multi-threaded/asymmetric version of BLIS in the later sections, we will use specialized versions of the codes in [16] for slow+fast VCs.)

The results in the figure reveal the increase in performance as the number of worker threads is raised from 1 to 4, which the OS maps to the (big) Cortex-A15 cores. However, when the number of threads exceeds the number of fast cores, the OS starts binding the threads to the slower Cortex-A7 cores, and the improvement rate is drastically reduced, in some cases even showing a performance drop. This is due to load imbalance, as tasks of uniform granularity, possibly lying on the critical path, are assigned to slow cores.

Clearly, some type of asymmetry-awareness is necessary in the runtime task scheduler (or in the underlying libraries) in order to tackle this source of inefficiency.

Stimulated by this first experiment, we recognize that an obvious solution to this problem consists in adapting the runtime task scheduler (more specifically, the scheduling policy) to exploit the SoC asymmetry [18]. Nevertheless, we part ways with this solution, exploring an architecture-aware alternative that leverages a(ny) conventional runtime task scheduler combined with an underlying asymmetric library. We discuss this option in more detail in the following section.

4.2. Our solution: combining conventional runtimes with asymmetric libraries

4.2.1. General overview

Our proposal operates physically under the GTS model but is conceptually inspired by CPUM; see Section 2.2. Concretely, our task scheduler regards the computational resources as a number of truly symmetric VCs, each composed of one fast and one or more slow cores. Unlike CPUM, however, both physical cores within each VC remain active and collaborate to execute a given task. Furthermore, our approach exploits concurrency at two levels: task-level parallelism is extracted by the runtime in order to schedule tasks to the four symmetric VCs, and each task/kernel is internally divided to expose data-level parallelism, distributing its workload between the two asymmetric physical cores within the VC in charge of its execution.

Our solution thus only requires a conventional (and thus asymmetry-agnostic) runtime task scheduler, e.g. the conventional version of OmpSs, where instead of spawning one worker thread per core in the system, we adhere to the CPUM model, creating only one worker thread per VC. Internally, whenever a ready task is selected for execution by a worker thread, the corresponding BLIS routine spawns as many threads as cores exist in the VC, binds them to the appropriate big and LITTLE cores within the VC using the thread-to-core affinity tools offered by the OS, and asymmetrically divides the work between the fast and the slow physical cores of the VC. Following this idea, the architecture exposed to the runtime is symmetric, and the kernels in the BLIS library act as a "black box" that abstracts the architecture asymmetry from the runtime scheduler.

In summary, in a conventional setup, the core is the basic computational resource for the task scheduler, and the "sequential" tasks are the minimum work unit to be assigned to these resources. Compared with this, in our approach the VC is the smallest (basic) computational resource from the point of view of the scheduler, and tasks are further divided into smaller units, executed in parallel by the physical cores inside the VCs.

4.2.2. Comparison with other approaches

Provided an underlying asymmetry-aware library is available to the developer, it is possible to combine it with an existing conventional runtime task scheduler in order to exploit the underlying architecture. In the enhanced version of OmpSs in [18], tasks were actually invocations to a sequential library, and asymmetry was exploited via sophisticated scheduling policies especially designed for asymmetric architectures. Following our new approach, asymmetry is exploited at the library level, not at the runtime level. To accommodate this, tasks are cast in terms of invocations to a multi-threaded asymmetry-aware library (in our case, BLIS).
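To make the mechanism of Section 4.2.1 concrete, the following sketch (ours, not code from OmpSs or BLIS) outlines how a task body invoked by a worker thread pinned to big core c could spawn one helper thread and bind the pair to the big/LITTLE cores of its VC, following the Exynos numbering described in Section 4.2.3 (big cores 4 to 7, LITTLE cores 0 to 3, so the LITTLE partner of core c is c − 4). The routine asym_blas3_kernel is a hypothetical stand-in for the asymmetry-aware BLIS kernel that splits the work between the two cores; in the authors' actual implementation this logic lives inside BLIS and relies on the binding capabilities of OmpSs.

#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu, sched_setaffinity, CPU_ZERO, CPU_SET  */
#include <omp.h>

/* Hypothetical placeholder for an asymmetry-aware BLAS-3 kernel. */
void asym_blas3_kernel( double *A, double *B, double *C, int b, int who );

/* Hypothetical task body, executed by a worker thread already pinned to big core c. */
void task_on_virtual_core( double *A, double *B, double *C, int b )
{
    int c = sched_getcpu();                   /* big core this worker is pinned to   */

    #pragma omp parallel num_threads(2)
    {
        int id   = omp_get_thread_num();      /* 0 -> big thread, 1 -> LITTLE thread */
        int core = ( id == 0 ? c : c - 4 );   /* Exynos numbering: LITTLE = c - 4    */

        cpu_set_t set;
        CPU_ZERO( &set );
        CPU_SET( core, &set );
        sched_setaffinity( 0, sizeof( set ), &set );   /* bind the calling thread    */

        /* Each thread now computes its (asymmetric) share of the kernel. */
        asym_blas3_kernel( A, B, C, b, id );
    }
}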
Our approach features a number of advantages for the developer:

• The runtime is not aware of asymmetry, and thus a conventional task scheduler will work transparently with no special modifications.
• Any existing scheduling policy (e.g. cache-aware mapping or work stealing) in an asymmetry-agnostic runtime, or any enhancement technique, will directly impact the performance attained on an AMP.
• Any improvement in the asymmetry-aware BLIS implementation will directly impact the performance on an AMP. This applies to different ratios of big/LITTLE cores within a VC, different operating frequencies, or even the introduction of more than two levels of asymmetry.

Obviously, there is also a drawback in our proposal, as a tuned asymmetry-aware DLA library must exist in order to reuse conventional runtimes. In the scope of DLA, this drawback is easily tackled with BLIS. We recognize that, in more general domains, an ad-hoc implementation of the application's fundamental kernels becomes mandatory in order to fully exploit the underlying architecture.

4.2.3. Requisites on the BLAS-3 implementation

We finally note that certain requirements are imposed on a multi-threaded implementation of BLIS that operates under the CPUM mode. To illustrate this, consider the gemm kernel and the high-level description of its implementation in Listing 3. For our objective, we still have to distribute the iteration space between the big and the LITTLE cores but, since there is typically only one resource of each type per VC, there is no need to partition the loops internal to the macro-kernel. An obvious exception is Juno, in which each VC is composed of one big and two LITTLE cores; in that case, we effectively partition an inner loop accordingly, and distribute its iteration space among the LITTLE cores. Furthermore, we note that the optimal strides for Loop 1 are in practice quite large (e.g., nc is on the order of a few thousand for ARM big.LITTLE cores), while the optimal values for Loop 3 are much smaller (for example, mc is 32 for the Cortex-A7 and 156 for the Cortex-A15 in Exynos). Therefore, we target Loop 3 in our data-parallel implementation of BLIS for VCs, which we can expect to easily yield proper workload balancing.
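The fragment below sketches how Loop 3 of Listing 3 could be split asymmetrically between the big and the LITTLE core of a VC, using a different mc stride per core type as just described. It is our own, deliberately simplified, static variant of this idea (the code in [16] partitions the loop dynamically): the thread identifier id (0 for the big core, 1 for the LITTLE one), the per-thread buffer Ac_mine, and the assumption that both threads execute Loops 1 and 2 redundantly (with the packing of Bc and the required synchronization omitted) are ours; the strides 156 and 32 are the Exynos values quoted in the text.

/* Sketch: asymmetric partitioning of Loop 3 between the two cores of a VC.
 * The cores walk the ic iteration space in interleaved rounds, each taking a
 * chunk whose size matches its own optimal mc.                               */
int mc_big = 156, mc_little = 32;              /* optimal mc per core type (Exynos) */
int step   = mc_big + mc_little;               /* work consumed per round           */
int mine   = ( id == 0 ? mc_big : mc_little ); /* my chunk size                     */
int offset = ( id == 0 ? 0 : mc_big );         /* my start within each round        */

for ( int ic = offset; ic < m; ic += step ) {
    int ib = min( m - ic, mine );
    pack_buffA( &A[ic][pc], Ac_mine, ib, pb );         /* pack my block of A        */
    gemm_kernel( Ac_mine, Bc, &C[ic][jc],
                 ib, jb, pb, mine, nc, kc );           /* macro-kernel on my chunk  */
}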


Table 2. Optimal block sizes for the Cholesky factorization using the conventional OmpSs runtime and a sequential implementation of BLIS.

                 Problem dimension (n)
                 512   1024  1536  2048  2560  3072  3584  4096  4608  5120  5632  6144  6656  7168  7680
Exynos - 4 wt    128   128   192   192   192   320   320   320   320   448   320   448   448   384   448
Juno   - 2 wt    128   128   192   192   192   256   256   320   320   448   448   320   448   448   448
XeonA  - 7 wt    192   192   192   192   192   192   192   320   384   384   384   448   448   448   448
XeonB  - 14 wt   128   192   320   192   192   320   320   320   320   256   320   448   448   384   448

Table 3. Optimal block sizes for the LU factorization using the conventional OmpSs runtime and a sequential implementation of BLIS.

                 Problem dimension (m=n)
                 512   1024  1536  2048  2560  3072  3584  4096  4608  5120  5632  6144  6656  7168  7680
Exynos - 4 wt    160   160   160   160   160   224   224   256   224   288   256   288   288   256   256
Juno   - 2 wt    128   128   160   192   160   192   224   224   224   224   192   224   224   224   224
XeonA  - 7 wt    96    64    96    160   128   160   160   192   224   256   256   224   256   256   384
XeonB  - 14 wt   96    128   192   256   256   192   128   192   160   160   192   160   224   224   256

Additionally, thread-to-core binding is a key issue in this scenario. Internally, the basic asymmetry-conscious implementation of BLIS deploys as many fast and slow threads as dictated by the user, and binds each of them to the corresponding core depending on its type. When combined with a runtime task scheduler, each BLIS thread needs to be bound to a specific core depending on the exact core to which the OmpSs worker thread is initially bound. Consider, for example, the Exynos target: the OS numbers the LITTLE cores with identifiers from 0 to 3, and the big cores with identifiers from 4 to 7. When the execution starts, we need to leverage the binding capabilities of OmpSs to associate each worker thread with a fast core c ∈ {4, 5, 6, 7}; internally, when a BLIS call is invoked, the newly generated fast thread will bind itself to core c, while the additional slow thread will be bound to core c − 4. This technique is adapted depending on the number of cores in the architectures and their distribution between fast and slow cores.

5. Experimental results

5.1. Performance evaluation for asymmetric BLIS

Let us open this section by reminding that, at execution time, OmpSs decomposes the routine for the Cholesky factorization into a collection of tasks that operate on sub-matrices (blocks) with a granularity dictated by the block size b; see Listing 1. Each one of these tasks typically performs a single call to a fundamental kernel of the BLAS-3, in our case provided by BLIS, or LAPACK; see Listing 2. Similar remarks apply to the LU factorization with partial pivoting.

The first step in our evaluation aims to provide a realistic estimation of the potential performance benefits of our approach (if any) on the different target architectures. A critical factor from this perspective is the range of block sizes, say b_opt, that are optimal for the conventional OmpSs runtime. In particular, the efficiency of our hybrid task/data-parallel approach is strongly determined by the performance attained with the asymmetric BLIS implementation when compared against that of its sequential counterpart, for problem dimensions n that are on the order of b_opt.

Tables 2 and 3 report the optimal block sizes b_opt for the Cholesky and LU factorizations, respectively, with problems of increasing matrix dimension, using the conventional OmpSs runtime linked with the sequential BLIS, and a number of worker threads that matches the number of VCs in the architecture. Note that, except for the smallest problems, the optimal block sizes vary between 192 and 448. These dimensions offer a fair compromise, exposing enough task-level parallelism while delivering high "sequential" performance for the execution of each individual task via the sequential implementation of BLIS. In the following, each experiment was repeated 10 times and we report the average of the observed performance; no significant deviations were observed between different rounds of experiments.

The key insight gained from these initial experiments is that, in order to extract fair performance from a combination of the conventional OmpSs runtime task scheduler with a multi-threaded asymmetric version of BLIS, the kernels in this instance of the asymmetric library must outperform their sequential counterparts for matrix dimensions on the order of the block sizes in Tables 2 and 3. The plots in Fig. 4 show the performance attained with the BLAS-3 tasks involved in the Cholesky and LU factorizations (gemm, syrk, and trsm) for our range of dimensions of interest.
There, the multi-threaded asymmetry-aware kernels run concurrently on one big core plus one LITTLE core (except in the case of Juno, for which the combination consisting of one big core plus two LITTLE cores is also explored), while the sequential kernels operate exclusively on a single big core. In general, the three BLAS-3 routines exhibit a similar trend for all architectures: the kernels from the sequential BLIS outperform their asymmetric counterparts for small problems (typically in the range of m, n, k = 128 to 256); but, beyond that dimension, the use of the slow core starts paying off. The interesting aspect here is that the cross-over threshold between both performance curves lies within the range of b_opt (usually at an early point of it); see Tables 2 and 3.


[Figure: a 4 × 3 grid of plots, one row per architecture (Exynos, Juno, XeonA, XeonB) and one column per kernel (dgemm, dsyrk, dtrsm), reporting GFLOPS vs. problem dimension. Each plot compares the sequential BLIS on a single big core (Cortex-A15, Cortex-A57 or Xeon@2300 MHz) against the asymmetric BLIS on one big plus one LITTLE core (Cortex-A7, Cortex-A53 or Xeon@1200 MHz); on Juno, the combination of one A57 plus two A53 cores is also shown.]

Fig. 4. Performance of the BLAS-3 kernels in the sequential and the multi-threaded/asymmetric implementations of BLIS on the four target architectures.

This implies that the asymmetric BLIS can potentially improve the performance of the overall computation. Moreover, the gap in performance grows with the problem size, stabilizing for problem sizes around m, n, k ≈ 400.

5.2. Performance comparison versus an asymmetry-aware task scheduler

Given the potential benefits in performance observed from the execution of each individual task implementation, the next round of experiments aims to assess the actual advantage of different task-parallel executions of the Cholesky and LU factorizations via OmpSs. Concretely, we consider:

1. the conventional task scheduler linked with the sequential BLIS ("Conventional OmpSs + Sequential BLIS");
2. the conventional task scheduler linked with our multi-threaded asymmetric BLIS, which views the SoC as a set of symmetric virtual cores ("Conventional OmpSs + Asymmetric BLIS"); and
3. the criticality-aware task scheduler in Botlev-OmpSs linked with the sequential BLIS ("Botlev OmpSs + Sequential BLIS").


[Figure: four plots of GFLOPS vs. problem dimension, for the Cholesky factorization (left) and the LU factorization with partial pivoting (right) on Exynos (top, 4+4/4 WT) and Juno (bottom, 2+4/2 WT), comparing "Conventional OmpSs + Asymmetric BLIS", "Conventional OmpSs + Sequential BLIS" and "Botlev OmpSs + Sequential BLIS".]

Fig. 5. Performance (in GFLOPS) for the Cholesky (left plots) and LU (right plots) factorizations using the conventional OmpSs runtime linked with either the sequential BLIS or the multi-threaded/asymmetric BLIS, and the ad-hoc asymmetry-aware version of the OmpSs runtime (Botlev-OmpSs) linked with the sequential BLIS, in Exynos and Juno. The labels of the form "y+x WT" refer to an execution with y big cores and x LITTLE cores.

Fig. 5 shows the performance attained by the aforementioned alternatives on the ARM-based architectures (Exynos and Juno), and Fig. 6 reports equivalent performance numbers for the Intel-based architectures (XeonA and XeonB). The results can be divided into three groups according to the problem dimension, offering similar qualitative conclusions for all architectures:

• For small matrices (n = 512–1024 for ARM-based architectures, and n < 4096 for Intel-based architectures), the conventional runtime that employs exclusively big cores (that is, linked with a sequential BLIS library for task execution) attains the best results in terms of performance. This was expected and was already observed in Fig. 7; the main reason is the small optimal block size, enforced by the reduced problem size, that is necessary in order to expose enough task-level parallelism. This invalidates the use of our asymmetric BLIS implementation due to its low performance for very small matrices; see Fig. 4.
• For medium-sized matrices (n = 2048–4096 for ARM-based architectures, and n = 4096–6144 for Intel-based architectures), the gap in performance between the different approaches is reduced. The variant that relies on the asymmetric BLIS implementation outperforms the alternative implementations for n = 4096 by a narrow margin. For these problem ranges, Botlev-OmpSs is competitive, and also outperforms the conventional setup.
• For large matrices (n = 6144–7680 for ARM-based architectures, and n > 10,240 for Intel-based architectures) the previous trend is consolidated, and both asymmetry-aware approaches deliver significant performance gains with respect to the conventional setup. Comparing both asymmetry-aware solutions, our mechanism delivers comparable and, in most cases, higher performance rates.

Note that our solution, combining a conventional runtime and asymmetry-aware task implementations, is consistently competitive with the ad-hoc Botlev approach. This validates the idea of revisiting conventional runtimes for asymmetric architectures, alleviating the effort of specific runtime implementations. The only exception is Juno, in which the Botlev runtime behaves consistently better than our approach. The small extra performance added by the introduction of the second LITTLE core for each individual task in BLIS (see Fig. 4) explains this behavior and hints at the dependence of our approach on the specific performance of the underlying asymmetry-aware BLAS implementation.


[Figure: four plots of GFLOPS vs. problem dimension, for the Cholesky factorization (left) and the LU factorization with partial pivoting (right) on XeonA (top, 7+7/7 WT) and XeonB (bottom, 14+14/14 WT), comparing "Conventional OmpSs + Asymmetric BLIS", "Conventional OmpSs + Sequential BLIS" and "Botlev OmpSs + Sequential BLIS".]

Fig. 6. Performance (in GFLOPS) for the Cholesky (left plots) and LU (right plots) factorizations using the conventional OmpSs runtime linked with either the sequential BLIS or the multi-threaded/asymmetric BLIS, and the ad-hoc asymmetry-aware version of the OmpSs runtime (Botlev-OmpSs) linked with the sequential BLIS, in XeonA and XeonB. The labels of the form "y+x WT" refer to an execution with y big cores and x LITTLE cores.

To summarize, our proposal to exploit asymmetry improves portability and programmability by avoiding a revamp of the runtime task scheduler specifically targeting AMPs. In addition, our approach renders performance gains which are, for all problem cases, comparable with those of ad-hoc asymmetry-conscious schedulers. Furthermore, for medium to large matrices, it clearly outperforms the efficiency attained with a conventional asymmetry-oblivious scheduler.

5.3. Extended performance analysis: blocked Cholesky factorization on Exynos

Given the large number of results obtained from the combination of architectures, implementations and problem sizes, we next provide further details on the performance behavior of the runtime configurations for a specific target architecture (Exynos) and operation (Cholesky factorization). We note that many of the insights and qualitative observations extracted from this detailed analysis were also observed for the rest of the architectures, operations and matrix dimensions.

5.3.1. Details on the integration of asymmetric BLIS in a conventional task scheduler

In order to analyze the actual benefits of our proposal, we next evaluate the conventional OmpSs task scheduler linked with either the sequential implementation of BLIS or its multi-threaded asymmetry-aware version. Hereafter, the BLIS kernels from the first configuration always run on a single Cortex-A15 core while, in the second case, they exploit one Cortex-A15 plus one Cortex-A7 core. Fig. 7 reports the results for both setups, using an increasing number of worker threads from 1 to 4. For simplicity, we only report the results obtained with the optimal block size.

The solution based on the multi-threaded asymmetric library outperforms the sequential implementation for relatively large matrices (usually for dimensions n > 2048) while, for smaller problems, the GFLOPS rates are similar. The reason for this behavior can be derived from the optimal block sizes reported in Table 2 and the performance of BLIS reported in Fig. 4: for that range of problem dimensions, the optimal block size is significantly smaller, and both BLIS implementations attain similar performance rates.

The quantitative difference in performance between both approaches is reported in Tables 4 and 5. The first table illustrates the absolute gap, while the second one shows the difference per Cortex-A7 core introduced in the experiment.

[Fig. 7 shows four panels, one per number of worker threads (1 to 4), plotting GFLOPS against the problem dimension for the asymmetry-agnostic OmpSs runtime linked with the asymmetric and the sequential BLIS.]

Fig. 7. Performance of the Cholesky factorization using the conventional OmpSs runtime linked with either the sequential or the multi-threaded/asymmetric implementations of BLIS, using respectively one Cortex-A15 core and one Cortex-A15 plus one Cortex-A7 core, in Exynos.

Table 4
Absolute performance improvement (in GFLOPS) for the Cholesky factorization using the conventional OmpSs runtime linked with the multi-threaded/asymmetric BLIS with respect to the same runtime linked with the sequential BLIS in Exynos.

        Problem dimension (n)
        512     1024    2048    2560    3072    4096    4608    5120    5632    6144    6656    7680
1 wt   −0.143   0.061   0.218   0.289   0.326   0.267   0.259   0.313   0.324   0.340   0.348   0.300
2 wt   −0.116  −0.109   0.213   0.469   0.573   0.495   0.454   0.568   0.588   0.617   0.660   0.582
3 wt   −0.308  −0.233  −0.020   0.432   0.720   0.614   0.603   0.800   0.820   0.866   0.825   0.780
4 wt   −0.421  −0.440  −0.274   0.204   0.227   0.614   0.506   0.769   0.666   0.975   0.829   0.902

Let us consider, for example, the problem size n = 6144. In that case, the performance improves by roughly 0.975 GFLOPS when the 4 slow cores are added to aid the base 4 Cortex-A15 cores. This translates into a performance increase of 0.243 GFLOPS per slow core, which is slightly below the improvement that could be expected from the results reported in the previous section. Note, however, that the performance per Cortex-A7 core is reduced from 0.340 GFLOPS, when adding just one slow core, to 0.243 GFLOPS, when simultaneously using all four slow cores.

5.3.2. General task execution overview

We next investigate further the sources of the performance differences reported in the previous sections, through a detailed analysis of actual execution traces of the parallel executions in Exynos. The execution traces in this section were all extracted with the Extrae instrumentation tool and analyzed with the visualization package Paraver [29]. The results correspond to the Cholesky factorization of a single problem with matrix dimension n = 6144 and block size b = 448.
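To put the traces in context, the following back-of-the-envelope computation (our own estimate, not data reported in the paper) gives the number of tasks generated for this problem instance, assuming a square grid of s = ceil(n/b) blocks and a right-looking blocked Cholesky factorization:

    #include <stdio.h>

    /* Rough task-count estimate for the traced problem (n = 6144, b = 448),
     * assuming a right-looking blocked Cholesky over an s x s grid of blocks:
     * s dpotrf tasks, s(s-1)/2 dtrsm tasks, s(s-1)/2 dsyrk tasks and
     * s(s-1)(s-2)/6 dgemm tasks.  Illustration only. */
    int main(void)
    {
      const int n = 6144, b = 448;
      const int s = (n + b - 1) / b;                      /* 14 block rows/columns */
      const int n_potrf = s;
      const int n_trsm  = s * (s - 1) / 2;
      const int n_syrk  = s * (s - 1) / 2;
      const int n_gemm  = s * (s - 1) * (s - 2) / 6;
      printf("s = %d: %d dpotrf, %d dtrsm, %d dsyrk, %d dgemm (%d tasks in total)\n",
             s, n_potrf, n_trsm, n_syrk, n_gemm, n_potrf + n_trsm + n_syrk + n_gemm);
      return 0;
    }

For this instance the estimate yields 14 dpotrf, 91 dtrsm, 91 dsyrk and 364 dgemm tasks, i.e., a few hundred tasks in total, with the dgemm updates clearly dominating the workload visualized in the traces.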

Table 5
Performance improvement per slow core (in GFLOPS) for the Cholesky factorization using the conventional OmpSs runtime linked with the multi-threaded/asymmetric BLIS with respect to the same runtime linked with the sequential BLIS in Exynos.

        Problem dimension (n)
        512     1024    2048    2560    3072    4096    4608    5120    5632    6144    6656    7680
1 wt   −0.143   0.061   0.218   0.289   0.326   0.267   0.259   0.313   0.324   0.340   0.348   0.300
2 wt   −0.058  −0.054   0.106   0.234   0.286   0.247   0.227   0.284   0.294   0.308   0.330   0.291
3 wt   −0.102  −0.077  −0.006   0.144   0.240   0.204   0.201   0.266   0.273   0.288   0.275   0.261
4 wt   −0.105  −0.110  −0.068   0.051   0.056   0.153   0.126   0.192   0.166   0.243   0.207   0.225

Fig. 8. Execution traces of the three runtime configurations for the Cholesky factorization (n = 6144, b = 448) in Exynos. The timeline in each row collects the tasks executed by a single worker thread. Tasks are colored following the convention in Fig. 2; phases colored in white between tasks represent idle times. The green flags mark task initialization/completion. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8 reports complete execution traces for each runtime configuration. At a glance, a number of coarse remarks can be extracted from the traces:

• From the perspective of total execution time (i.e., time-to-solution), the conventional OmpSs runtime combined with the asymmetric BLIS implementation attains the best results, followed by the Botlev-OmpSs runtime configuration. It is worth pointing out that an asymmetry-oblivious runtime which simply spawns 8 worker threads, with no further considerations, yields the worst performance by far. In this case, the load imbalance and the long idle periods, especially as the amount of concurrency decreases in the final part of the trace, entail a severe performance penalty.

• The flag marks indicating task initialization/completion reveal that the asymmetric BLIS implementation (which employs the combined resources of a VC) requires less time per task than the two alternatives based on a sequential BLIS. An effect to note in the Botlev-OmpSs configuration is the difference in performance between tasks of the same type, depending on whether they are executed by a big core (worker threads 5–8) or a LITTLE one (worker threads 1–4).

• The Botlev-OmpSs task scheduler features a complex scheduling strategy that includes priorities, advancing the execution of tasks on the critical path and, whenever possible, assigning them to fast cores (see, for example, the tasks for the factorization of diagonal blocks, colored in yellow). This yields an execution timeline that is more compact during the first stages of the parallel execution, at the cost of longer idle times when the degree of concurrency decreases (last iterations of the factorization). Although possible, a priority-aware technique was not applied in our experiments with the conventional OmpSs setups but remains part of future work (a sketch of how such priorities could be expressed is given below).
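As a hint of how such a priority-aware technique could be expressed on top of the conventional OmpSs setup, the fragment below extends the Cholesky sketch of Section 5.3.1 with priority clauses. This is a hypothetical illustration, not code evaluated in this paper: the numeric priority values are arbitrary, and a priority-aware Nanos++ scheduling policy must be selected at launch time for these hints to have any effect.

    /* Hypothetical priority hints for the conventional OmpSs sketch of Section
     * 5.3.1 (the variables Ab, s, b, k and i come from that loop nest).  Tasks on
     * or near the critical path are promoted over the trailing updates. */
    #pragma omp task inout(Ab[k*s+k][0;b*b]) priority(2)
    chol_block(Ab[k*s+k], b);                  /* dpotrf: highest priority      */

    #pragma omp task in(Ab[k*s+k][0;b*b]) inout(Ab[i*s+k][0;b*b]) priority(1)
    trsm_block(Ab[k*s+k], Ab[i*s+k], b);       /* dtrsm: intermediate priority  */

    /* The dsyrk/dgemm trailing updates keep the default (lowest) priority. */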

Table 6
Average time (in ms) per task and worker thread in the Cholesky factorization (n = 6144, b = 448), for the three runtime configurations.

        OmpSs - Seq. BLIS (4 worker threads)    OmpSs - Asym. BLIS (4 worker threads)   Botlev-OmpSs - Seq. BLIS (8 worker threads, 4+4)
        dgemm    dtrsm    dsyrk    dpotrf       dgemm    dtrsm    dsyrk    dpotrf       dgemm     dtrsm     dsyrk     dpotrf
wt 0    89.62    48.12    47.14    101.77       79.82    42.77    44.42    105.77       406.25    216.70    –         –
wt 1    88.96    48.10    47.14    –            78.65    42.97    44.56    76.35        408.90    207.41    212.55    –
wt 2    89.02    48.36    47.18    87.22        79.14    43.14    44.60    85.98        415.31    230.07    212.56    –
wt 3    90.11    48.51    47.42    –            79.28    43.10    44.59    67.73        410.84    216.95    216.82    137.65
wt 4    –        –        –        –            –        –        –        –            90.97     48.97     48.36     –
wt 5    –        –        –        –            –        –        –        –            90.61     48.86     48.16     90.78
wt 6    –        –        –        –            –        –        –        –            91.28     49.43     47.97     89.58
wt 7    –        –        –        –            –        –        –        –            91.60     49.49     48.62     95.43
Avg.    89.43    48.27    47.22    94.49        79.22    42.99    44.54    83.96        250.72    133.49    119.29    103.36

Fig. 9. Task time duration histogram obtained with Paraver for dgemm tasks on the three studied runtime configurations for the Cholesky factorization (n = 6144, b = 448) in Exynos. In each plot, each row corresponds to a different worker thread in the execution and the columns correspond to time bins. The color gradient indicates the number of tasks in the corresponding bin (ranging from light green, for a small number of tasks, to dark blue, for a large number of tasks). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

5.3.3. Impact of task duration

Table 6 reports the average execution time per type of task for each worker thread, for the same experiment analyzed in Section 5.3.2. These results show that the execution time per individual type of task is considerably shorter for our multi-threaded/asymmetric BLIS implementation than for the alternatives based on a sequential version of BLIS. The only exception is the factorization of the diagonal block (dpotrf): as this is a LAPACK-level routine, it is not supported by BLIS. Inspecting the task execution times of the Botlev-OmpSs configuration, we observe a remarkable difference depending on the type of core the tasks are mapped to. For example, the average execution time for dgemm ranges from more than 400 ms on a LITTLE core to roughly 90 ms on a big core. This behavior is visible for all types of tasks.

To illustrate these facts more clearly, Fig. 9 shows a detailed histogram of the execution times of the dgemm task instances in the task-parallel executions. Each bin corresponds to the number of tasks with a given execution time, and each row corresponds to a different worker thread. Comparing first the conventional runtimes (traces (a) and (b)), there is a clear shift of the execution times towards faster executions; that is, the average time to execute a task using the asymmetry-aware BLIS implementation is shorter. The histogram for Botlev (trace (c)) exhibits two distinct peaks, corresponding to tasks executed on the big and on the LITTLE cores, respectively. Note that tasks executing on the fast cores attain the same performance rate as those of the conventional runtime setup using a sequential BLIS implementation. A similar behavior was observed for the rest of the BLAS-3 tasks involved in the LU and Cholesky factorizations.
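As a rough consistency check (our own estimate, not a measurement from the paper), the average dgemm durations of Table 6 can be converted into per-task GFLOPS rates by assuming that each dgemm task multiplies two b x b blocks and therefore performs about 2b^3 floating-point operations:

    #include <stdio.h>

    /* Back-of-the-envelope conversion of the average dgemm task durations of
     * Table 6 (n = 6144, b = 448) into per-task GFLOPS.  The durations come from
     * the table; the flop count per task (~2*b^3) is an assumption made here. */
    int main(void)
    {
      const double b     = 448.0;
      const double flops = 2.0 * b * b * b;            /* ~1.8e8 flops per dgemm task */

      const double t_a15_ms = 89.43;    /* conventional OmpSs + sequential BLIS (A15)     */
      const double t_vc_ms  = 79.22;    /* conventional OmpSs + asymmetric BLIS (A15+A7)  */
      const double t_a7_ms  = 406.25;   /* Botlev-OmpSs, dgemm task mapped to a Cortex-A7 */

      printf("dgemm on one Cortex-A15:     %.2f GFLOPS\n", flops / (t_a15_ms * 1e6));
      printf("dgemm on a Cortex-A15+A7 VC: %.2f GFLOPS\n", flops / (t_vc_ms  * 1e6));
      printf("dgemm on one Cortex-A7:      %.2f GFLOPS\n", flops / (t_a7_ms  * 1e6));
      return 0;
    }

The resulting rates, roughly 2.0, 2.3 and 0.4 GFLOPS per task, are consistent with the qualitative picture above: the big+LITTLE virtual core is moderately faster than a single Cortex-A15, and a task mapped to a Cortex-A7 runs between four and five times slower than on a Cortex-A15.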


5.3.4. Task scheduling policies and idle times

Fig. 10 illustrates the task execution order determined by the Nanos++ task scheduler. Here tasks are depicted using a color gradient, according to the order in which they are encountered in the sequential code, from the earliest to the latest.

Fig. 10. Task execution order of the three studied runtime configurations for the Cholesky factorization (n = 6144, b = 448) in Exynos. In the trace, tasks are ordered according to their appearance in the sequential code, and depicted using a color gradient, with light green indicating early tasks and dark blue the late tasks. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 7
Percentage of time per worker thread in idle or running state for the different runtime configurations for the Cholesky factorization (n = 6144, b = 448). Note that wt 0 is the master thread and is therefore never idle; for this thread, the remaining time up to 100% is devoted to synchronization, scheduling and thread creation, whereas for the other threads it corresponds to runtime overhead.

        OmpSs - Seq. BLIS          OmpSs - Asym. BLIS         Botlev-OmpSs - Seq. BLIS
        (4 worker threads)         (4 worker threads)         (8 worker threads, 4+4)
        idle      running          idle      running          idle      running
wt 0    –         98.41            –         97.85            –         86.53
wt 1    5.59      94.22            5.51      94.29            13.63     86.28
wt 2    3.14      96.67            5.27      94.53            13.94     85.98
wt 3    5.77      94.07            5.17      94.62            13.43     86.47
wt 4    –         –                –         –                19.26     80.51
wt 5    –         –                –         –                21.12     78.69
wt 6    –         –                –         –                20.84     78.97
wt 7    –         –                –         –                20.09     79.70
Avg.    4.84      95.89            5.32      94.90            17.47     82.89

At runtime, the task scheduler in Botlev-OmpSs issues tasks for execution in an order that strongly depends on their criticality. The main idea behind this scheduling policy is to track the criticality of each task and, when possible, prioritize the execution of critical tasks by assigning them to the fast Cortex-A15 cores. Consistent with this strategy, out-of-order execution appears more frequently in the timelines of the big cores than in those of the LITTLE cores. With the conventional runtime, out-of-order execution is dictated only by the order in which the data dependencies of the tasks are satisfied.

From the execution traces, we can observe that the Botlev-OmpSs alternative suffers a significant performance penalty due to idle periods in the final stages of the factorization, when the concurrency is greatly diminished. This problem is not present with the conventional scheduling policies. In the first stages of the factorization, however, the priority-aware policy of the Botlev-OmpSs scheduler effectively reduces idle times. Table 7 reports the percentage of time each worker thread spends in running or idle state.


In general, the relative amount of time spent in the idle state is much higher for Botlev-OmpSs than for the conventional implementations (17% vs. 5%, respectively). Note also the remarkable difference in the percentage of idle time between the big and LITTLE cores (20% and 13%, respectively), which leads to the conclusion that the fast cores stall waiting for the completion of tasks executed on the LITTLE cores. This can be confirmed in the final stages of the Botlev-OmpSs trace.

The previous observations pave the road to a combination of scheduling policy and execution model for AMPs in which asymmetry is exploited through ad-hoc scheduling policies during the first stages of the factorization, when the potential parallelism is massive, and later through asymmetry-aware kernels and coarse-grain VCs in the final stages of the execution, when concurrency is scarce. The two approaches are not mutually exclusive, but complementary, depending on the level of task concurrency available at a given execution stage.

6. Conclusions

In this paper, we have addressed the problem of refactoring existing runtime task schedulers to exploit task-level parallelism in novel AMPs, focusing on ARM big.LITTLE systems-on-chip and on virtually-asymmetric configurations that arise from applying VFS at the core/socket level on Intel Xeon architectures. We have demonstrated that, for the specific domain of DLA, an approach that delegates the burden of dealing with asymmetry to the library (in our case, an asymmetry-aware BLIS implementation) does not require any revamp of existing task schedulers and can deliver high performance. This proposal paves the road towards reusing conventional runtime schedulers for SMPs (and all the associated improvement techniques developed over the past few years), as the runtime only has a symmetric view of the hardware. Our experiments reveal that, in most cases, this solution is competitive with, or even improves upon, the results obtained with an asymmetry-aware scheduler for DLA operations.

Acknowledgments

The researchers from Universidad Complutense de Madrid are supported by the EU (FEDER) and the Spanish MINECO, under grants TIN2015-65277-R and TIN2012-32180. The researchers from Universidad Jaume I were supported by projects CICYT TIN2011-23283 and TIN2014-53495-R as well as the EU H2020 671602 project "INTERTWinE".

References

[1] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, A. LeBlanc, Design of ion-implanted MOSFET's with very small physical dimensions, IEEE J. Solid-State Circuits 9 (5) (1974) 256–268.
[2] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, D. Burger, Dark silicon and the end of multicore scaling, in: Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA'11, 2011, pp. 365–376.
[3] M. Duranton, D. Black-Schaffer, K.D. Bosschere, J. Maebe, The HiPEAC vision for advanced computing in Horizon 2020, 2013. http://www.hipeac.net/roadmap.
[4] T. Morad, U. Weiser, A. Kolodny, M. Valero, E. Ayguade, Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors, Comput. Archit. Lett. 5 (1) (2006) 14–17.
[5] Cilk project home page, (http://supertech.csail.mit.edu/cilk/). Last visit: July 2015.
[6] OmpSs project home page, (http://pm.bsc.es/ompss). Last visit: July 2015.
[7] StarPU project home page, (http://runtime.bordeaux.inria.fr/StarPU/). Last visit: July 2015.
[8] PLASMA project home page, (http://icl.cs.utk.edu/plasma/). Last visit: July 2015.
[9] MAGMA project home page, (http://icl.cs.utk.edu/magma/). Last visit: July 2015.
[10] FLAME project home page, (http://www.cs.utexas.edu/users/flame/). Last visit: July 2015.
[11] N. Rajovic, P.M. Carpenter, I. Gelado, N. Puzovic, A. Ramirez, M. Valero, Supercomputing with commodity CPUs: are mobile SoCs ready for HPC?, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC'13, ACM, New York, NY, USA, 2013, pp. 40:1–40:12, doi:10.1145/2503210.2503281.
[12] N. Rajovic, et al., The Mont-Blanc prototype: an alternative approach for HPC systems.
[13] G.H. Golub, C.F.V. Loan, Matrix Computations, 3rd edition, The Johns Hopkins University Press, Baltimore, 1996.
[14] E. Anderson, et al., LAPACK Users' Guide, 3rd edition, SIAM, 1999.
[15] J.J. Dongarra, J. Du Croz, S. Hammarling, I. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Softw. 16 (1) (1990) 1–17.
[16] S. Catalán, F.D. Igual, R. Mayo, R. Rodríguez-Sánchez, E.S. Quintana-Ortí, Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors, Cluster Comput. 19 (3) (2016) 1037–1051.
[17] F.G. Van Zee, R.A. van de Geijn, BLIS: a framework for rapidly instantiating BLAS functionality, ACM Trans. Math. Softw. 41 (3) (2015) 14:1–14:33.
[18] K. Chronaki, A. Rico, R.M. Badia, E. Ayguadé, J. Labarta, M. Valero, Criticality-aware dynamic task scheduling for heterogeneous architectures, in: Proceedings of ICS'15, 2015.
[19] L. Costero, F.D. Igual, K. Olcoz, S. Catalán, R. Rodríguez-Sánchez, E.S. Quintana-Ortí, Refactoring conventional task schedulers to exploit asymmetric ARM big.LITTLE architectures in dense linear algebra, in: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016, pp. 692–701.
[20] R.M. Badia, J.R. Herrero, J. Labarta, J.M. Pérez, E.S. Quintana-Ortí, G. Quintana-Ortí, Parallelizing dense and banded linear algebra libraries using SMPSs, Concurrency Comput. 21 (18) (2009) 2438–2456.
[21] G. Quintana-Ortí, E.S. Quintana-Ortí, R.A. van de Geijn, F.G. Van Zee, E. Chan, Programming matrix algorithms-by-blocks for thread-level parallelism, ACM Trans. Math. Softw. 36 (3) (2009) 14:1–14:26.
[22] C. Augonnet, S. Thibault, R. Namyst, P.-A. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency Comput. 23 (2) (2011) 187–198.
[23] S. Tomov, R. Nath, H. Ltaief, J. Dongarra, Dense linear algebra solvers for multicore with GPU accelerators, in: Proceedings of the IEEE Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010, pp. 1–8.
[24] T. Gautier, J.V.F. Lima, N. Maillard, B. Raffin, XKaapi: a runtime system for data-flow task programming on heterogeneous architectures, in: Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS'13, 2013, pp. 1299–1308.
[25] K. Goto, R.A. van de Geijn, Anatomy of a high-performance matrix multiplication, ACM Trans. Math. Softw. 34 (3) (2008) 12:1–12:25.
[26] OpenBLAS, (http://xianyi.github.com/OpenBLAS/). Last visit: July 2015.


[27] F.G. Van Zee, T.M. Smith, B. Marker, T.M. Low, R.A. van de Geijn, F.D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, L. Killough, The BLIS framework: experiments in portability, ACM Trans. Math. Softw. (2014). Accepted. Available at http://www.cs.utexas.edu/users/flame.
[28] T.M. Smith, R. van de Geijn, M. Smelyanskiy, J.R. Hammond, F.G. Van Zee, Anatomy of high-performance many-threaded matrix multiplication, in: Proceedings of the IEEE 28th International Symposium on Parallel and Distributed Processing, IPDPS'14, 2014, pp. 1049–1059.
[29] Paraver: the flexible analysis tool, (http://www.cepba.upc.es/paraver).
