Dynamic look-ahead in the reduction to band form for the singular value decomposition

Parallel Computing 81 (2019) 22–31
https://doi.org/10.1016/j.parco.2018.11.001

Andrés E. Tomás a, Rafael Rodríguez-Sánchez b,∗, Sandra Catalán a, Rocío Carratalá-Sáez a, Enrique S. Quintana-Ortí a

a Dept. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain
b Dep. Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain
∗ Corresponding author.


Article history: Received 24 May 2018; Revised 11 September 2018; Accepted 28 November 2018; Available online 29 November 2018.
Keywords: Singular value decomposition; Two-stage reduction; Look-ahead; Runtime; Dynamic scheduling; Multicore processors.

Abstract: We investigate the introduction of look-ahead in two-stage algorithms for the singular value decomposition (SVD). Our approach relies on a specialized reduction for the first stage that produces a band matrix with the same upper and lower bandwidth instead of the conventional upper triangular–band matrix. In the case of a CPU-GPU server, this alternative form accommodates a static look-ahead into the algorithm in order to overlap the reduction of the "next" panel on the CPU and the "current" trailing update on the GPU. For multicore processors, we leverage the same compact form to formulate a version of the algorithm that advances the reduction of "future" panels, yielding a dynamic look-ahead that overcomes the performance bottleneck that the sequential panel factorization represents.

1. Introduction

The Singular Value Decomposition (SVD) is a handy numerical tool for the computation of the matrix rank, the calculation of low-rank approximations to a matrix, and the solution of ill-conditioned least squares problems. These linear algebra kernels arise in medicine, geosciences, material sciences, crystallography, security, and information retrieval, among others [1–3].

Given a dense matrix A ∈ R^{m×n}, the standard algorithm to compute its SVD first obtains a reduced bidiagonal matrix B ∈ R^{m×n}, using a sequence of Householder reflectors that are applied from the left- and right-hand sides of A as:

   A = U B V^T,    (1)

where U ∈ R^{m×m} and V ∈ R^{n×n} are both orthogonal and B ∈ R^{m×n} is upper bidiagonal [4]. (Without loss of generality, hereafter we will assume that m ≥ n. Otherwise, we can simply compute the SVD of A^T.) The cost of this direct two-sided reduction (TSR) algorithm is 4n^2(m − n/3) floating-point operations (flops). The singular values of A are then computed from the bidiagonal matrix B, via the QD algorithm or some divide-and-conquer variant, adding a minor cost to the flop count (unless the singular vectors are also required) [4].

An algorithm for the direct reduction to bidiagonal form A → B, via Householder reflectors, will necessarily perform a significant amount of flops in terms of the Level-2 BLAS (basic linear algebra subprograms) [6].



As a consequence, this procedure will generally deliver a small fraction of the sustainable performance of current architectures. (A clear example that illustrates the low performance of this approach is routine dgebrd from LAPACK [5], which computes this single-stage reduction.)

Two-stage algorithms [7] tackle the low-performance problem of the direct reduction via an alternative strategy that aims to attain high performance by first transforming A into an intermediate upper triangular–band matrix B_T, with upper bandwidth w_T, to then reduce this by-product to the sought-after bidiagonal form. The advantage of this two-stage TSR reduction A → B_T → B is that the first stage mostly consists of efficient Level-3 BLAS [8]. In addition, provided w_T ≪ m, n and the singular vectors are not desired, the cost of the second stage is small. Hereafter we will focus on the first stage of the TSR algorithm.

The first stage of TSR algorithms for the SVD was initially studied in [9] for clusters and, later, for multicore processors in [10]. A similar approach has been elaborated for the solution of symmetric eigenvalue problems [7,10–13]. In practice, as the number of cores grows, the panel factorization that appears at each iteration of the first stage of the reduction rapidly becomes a performance bottleneck. In order to overcome this, it is possible to apply a sort of software pipelining, usually known as look-ahead [14], in order to overlap the factorization of the next panel with the update of the trailing submatrix (with respect to the current panel).

For multicore processors, in [15] we introduced an alternative reduction A → B_B → B, where the intermediate by-product B_B is a band matrix with lower and upper bandwidth w_B. The result is a TSR algorithm that allows the integration of static look-ahead into the first stage of the factorization.

In this paper we revisit the first stage of the TSR algorithm, extending our previous work in [15] and making the following two main contributions:



• We demonstrate that the reduction to the specific band form B_B also accommodates a general form of look-ahead which advances the factorization of several "future" panels to the "current" iteration, via a runtime-assisted parallelization. The result is a parallel algorithm for multicore processors with dynamic look-ahead, much in the spirit of similar techniques applied to the LU, Cholesky and QR factorizations as part of the SuperMatrix, StarPU, OmpSs, and PLASMA efforts [16–19].
• We develop an implementation of the first stage of the TSR using a state-of-the-art runtime system such as OmpSs. The two-sided matrix factorization performed by the TSR algorithm exhibits a much richer set of data dependencies than well-known one-sided factorizations for linear systems such as LU, Cholesky and QR. Nonetheless, with due care and the use of block "representants" [20], these dependencies can be specified and tackled using the existing mechanisms in OmpSs (and OpenMP).
• We provide an experimental evaluation of the algorithm on a platform equipped with two Intel E5-2630 v3 sockets, with 8 cores each, showing considerable performance benefits compared with a solution that only exploits static look-ahead.

In addition, we make the following secondary contributions:





• We briefly review the algorithms for the reduction to band form, linking this presentation with an analysis of the advantages and drawbacks of distinct alternatives for the accumulation of orthogonal matrices from the point of view of performance and the application of look-ahead. Given the amount of past work on the topic, such a review of the parallelization of the SVD is relevant and, in our case, motivates the development of our alternative solution with dynamic look-ahead.
• We summarize the results in [15] and [21], which leverage a static look-ahead on multicore processors and on servers equipped with a graphics processing unit (GPU), discussing the main differences with a truly dynamic application of the technique.

Overall, this paper has the primary goal of exposing the advantages of the proposed alternative band form to the scientific community, fostering the development of high-performance software for the subsequent stages.

The rest of the paper is structured as follows. In Section 2 we review the reduction to band form for the SVD and the use of three different compact representations of orthogonal matrices. In Section 3 we describe the introduction of static look-ahead into the TSR algorithm and its practical implementation on a CPU-GPU server, and provide a brief experimental study to expose the advantages of this technique on this type of heterogeneous platform. In Section 4 we compare the static and dynamic forms of look-ahead, in the context of an execution on multicore processors. Furthermore, in that section we provide experimental evidence that demonstrates the benefits of a runtime-assisted parallelization in terms of performance. We close the paper with a few concluding remarks in Section 5.

2. Reduction to band forms

The first stage of the two-stage TSR algorithms for the SVD was analyzed in [9]. In that work, the authors proposed the conventional triangular–band reduction for the SVD, which computes an upper triangular matrix B_T with upper bandwidth w = w_T. However, in the same work, the authors also described an alternative algorithm that produces a band matrix B_B with the same upper and lower bandwidth w = w_B. The latter algorithm was only theoretically formulated in that work as, apparently, it did not offer any special advantage.

In [15], we showed that the alternative band form is actually the key to obtaining a TSR reduction algorithm, enhanced with look-ahead, and with no constraints on the relation between w and b.

In that work, we also leveraged this property to accelerate the execution of the first stage on multicore processors. Later, in [21], we exploited the same algorithm to obtain a high-performance, concurrent execution of the first stage on a CPU-GPU server. We next review this reduction procedure, which is also the basis for the dynamic scheduling strategy described later in the paper.

Fig. 1. Partitioning of the matrix during one iteration of the reduction to band form for the solution of the SVD.

2.1. The TSR algorithm

Consider that the first k − 1 rows/columns of A have already been reduced to band form, and assume for simplicity that k + w + b − 1 ≤ m, n; see Fig. 1. During the current iteration of the procedure, b new rows/columns of the band matrix are computed as follows (for brevity, we do not state explicitly the dimensions of some of the matrix blocks/factors in the following, as they can be easily derived from the context and Fig. 1):

1. Left Panel Factorization. Compute the QR factorization

   C_0 = U R,    (2)

   where U = I_i + W_U Y_U^T is orthogonal, I_i denotes the (square) identity matrix of order i, and R is upper triangular. This implicit representation of matrix U in terms of the factors W_U, Y_U ∈ R^{i×b}, known as the WY transform [4], is crucial for the efficient and economic update of the trailing submatrix in the following step.

2. Left Trailing Update. Apply U to the trailing submatrix:

   C_1 := U^T C_1 = C_1 + Y_U (W_U^T C_1);    (3)
   E := U^T E = E + Y_U (W_U^T E).    (4)

3. Right Panel Factorization. Compute the LQ factorization

   D_0 = L V^T,    (5)

   with V = I_j + W_V Y_V^T orthogonal, W_V, Y_V ∈ R^{j×b}, and L lower triangular.

4. Right Trailing Update. Apply V to the trailing submatrix:

   D_1 := D_1 V = D_1 + (D_1 W_V) Y_V^T;    (6)
   E := E V = E + (E W_V) Y_V^T.    (7)

Provided w ≪ m, n, and the target bandwidth is w_B, this alternative algorithm requires about the same amount of flops as the conventional triangular–band reduction with target bandwidth w_T = 2w_B.

In the reductions to both conventional triangular–band forms and band forms, the computation of the panel factorizations represents a bottleneck because these operations are mostly sequential while, in contrast, the trailing updates are highly parallel. Thus, as the number of computational resources grows, the panel factorization constrains the global performance of both TSR algorithms.
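To make the structure of the trailing updates concrete, the following sketch casts the left update of Eq. (3) as two calls to the Level-3 BLAS kernel GEMM. It is our own illustration, not code from the paper: the function name, argument layout, column-major storage and the caller-provided workspace are all assumptions.

```c
/* Sketch of the left trailing update C1 := U^T C1 = C1 + Y_U (W_U^T C1)
 * of Eq. (3), expressed as two GEMMs. The analogous right update of
 * Eq. (6) swaps the roles of the factors and multiplies from the right. */
#include <cblas.h>

void left_trailing_update(int i, int n1, int b,
                          const double *Wu, int ldwu,  /* W_U is i x b  */
                          const double *Yu, int ldyu,  /* Y_U is i x b  */
                          double *C1, int ldc1,        /* C1  is i x n1 */
                          double *work)                /* scratch, b x n1 */
{
    /* work := W_U^T * C1  (b x n1) */
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                b, n1, i, 1.0, Wu, ldwu, C1, ldc1, 0.0, work, b);
    /* C1 := C1 + Y_U * work */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                i, n1, b, 1.0, Yu, ldyu, work, b, 1.0, C1, ldc1);
}
```

Both calls are matrix-matrix products with one small dimension (b), which is why the trailing updates are highly parallel, in contrast with the panel factorizations discussed above.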


2.2. Efficient representation of orthogonal matrices

When employing the WY transform, each panel factorization requires a separate pair of factors W, Y. Here, Y_U, Y_V are obtained as by-products of the corresponding panel factorizations, (2) and (5), at no additional cost, and can be basically stored using the annihilated entries of the original matrix. In contrast, W_U, W_V have to be built and explicitly maintained, increasing the computational cost and storage needs. Furthermore, the assembly of the orthogonal factors W_U, W_V comprises Level-2 BLAS kernels, which rapidly become a performance bottleneck as these are memory-bound operations with limited parallel scalability.

In order to tackle this, we can instead rely on the compact WY transform [22], because of its reduced storage needs and lower computational cost. Concretely, in the compact WY transform, we obtain two triangular factors T_U, T_V ∈ R^{b×b} so that

   U = I_i + W_U Y_U^T = I_i + Y_U T_U Y_U^T,
   V = I_j + W_V Y_V^T = I_j + Y_V T_V Y_V^T.    (8)

Unfortunately, the computation of T_U, T_V is still composed of Level-2 BLAS kernels, and does not scale well in case it has to be computed in parallel (e.g., on a GPU). For this reason, in [21] we proposed to leverage a variant of the compact WY transform, known as the UT transform [23–25]. In this alternative, we directly obtain the inverse factors T̄_U = T_U^{-1}, T̄_V = T_V^{-1} so that

   U = I_i + W_U Y_U^T = I_i + Y_U T̄_U^{-1} Y_U^T,
   V = I_j + W_V Y_V^T = I_j + Y_V T̄_V^{-1} Y_V^T.    (9)
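The following short derivation, which is ours and is not reproduced from the paper, makes explicit why this representation can be assembled with Level-3 kernels; it uses only Eq. (9) and the orthogonality of U, and assumes that Y_U has full column rank.

```latex
% Why \bar{T}_U is a Level-3 computation (derived from Eq. (9)):
\begin{aligned}
I_i = U^T U
  &= \bigl(I_i + Y_U \bar{T}_U^{-T} Y_U^T\bigr)
     \bigl(I_i + Y_U \bar{T}_U^{-1} Y_U^T\bigr) \\
  &= I_i + Y_U\, \bar{T}_U^{-T}
     \bigl(\bar{T}_U + \bar{T}_U^{T} + Y_U^T Y_U\bigr)
     \bar{T}_U^{-1}\, Y_U^T
\quad\Longleftrightarrow\quad
\bar{T}_U + \bar{T}_U^{T} = -\,Y_U^T Y_U .
\end{aligned}
```

Since T̄_U is triangular, its strictly triangular part can therefore be read off directly from −Y_U^T Y_U, a single symmetric matrix product (SYRK, a Level-3 kernel), while its diagonal is −1/2 times the diagonal of that product; only O(b^2) additional work is required. The same reasoning applies to T̄_V.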

The advantage of this alternative compact representation of the orthogonal factors is that the inverse factors T̄_U, T̄_V can be efficiently computed via the Level-3 BLAS. Moreover, the (theoretical) cost of computing and applying the UT representation is exactly the same as that of the compact WY form, and the numerical stability of both procedures is equivalent [25].

3. Static look-ahead in the band reduction (on GPUs)

3.1. Band forms

Triangular–band form. The first stage of the conventional two-stage TSR reduction for the SVD computes an upper triangular matrix with upper bandwidth w = w_T; see, e.g., [9,15]. The problem of this algorithm is that, in order to fully exploit look-ahead, one must select w ≥ 3b, where b denotes the algorithmic block size [15]. This restriction is relevant because, in practice, w must be kept small in order to limit the cost of the second stage, but b must be large enough to ensure high performance from the Level-3 BLAS during the first stage.

Band form. In [15], we identified two strategies to integrate static look-ahead into the algorithm for this type of reduction, yielding two variants. For simplicity, we only describe Variant V1, which requires that w ≥ 2b (Variant V2 imposes no condition on the relation between the algorithmic block size and the bandwidth), as this is the basis for the static look-ahead strategy for CPU-GPU servers in [21] as well as for the dynamic generalization of the strategy described later in the paper. Concretely, the condition imposed by this variant implies that the panels to be factorized in the next iteration lie entirely within C_1 and D_1; see Fig. 1. Therefore, the update and factorization of these panels can be overlapped with the updates performed on E from the left and right.

To illustrate this, assume that w = 2b. Fig. 2 then displays the operations and dependencies that appear in this algorithm, showing that, at any given iteration, it is possible to overlap the factorization of the next left and right panels with the current trailing updates. For example, let us consider that the initial decompositions C_0 = U_0 R_0 and D_0 = L_0 V_0^T have already been computed. The factorization C_1 = U_1 R_1 can then be overlapped with the updates

   U_2^0 ≡ C_2 := U_0^T C_2,   U_3^0 ≡ C_3 := U_0^T C_3,   U_4^0 ≡ C_4 := U_0^T C_4,   ...

with respect to the factorization of C_0 (i.e., the orthogonal factor U_0). Analogously, the factorization D_1 = L_1 V_1^T can proceed simultaneously with the updates

   V_2^0 ≡ D_2 := D_2 V_0,   V_3^0 ≡ D_3 := D_3 V_0,   V_4^0 ≡ D_4 := D_4 V_0,   ...

with respect to the factorization of D_0 (i.e., the orthogonal factor V_0). It is easy to see that the same type of overlapping between the "next" panel factorization and the current trailing matrix update is also possible in subsequent iterations.

3.2. Static look-ahead on GPUs

Algorithm. In order to obtain an algorithm for the band reduction that exploits the computational resources of a server equipped with both multicore processor(s) and a GPU, we need to ensure a balanced partitioning of the work between the CPU and the GPU, while the communication schedule should overlap the data transfers with the computations.



Fig. 2. Matrix partitioning (top) and dependencies (bottom) for the reduction to band form (w = 2b).

Let us suppose that the original matrix initially resides in the GPU memory, and consider, for simplicity, that w = 2b. Then, according to the previous discussion of the scheme in Fig. 2, at iteration k, the factorizations of the next panels C_{k+1}, D_{k+1} can be computed on the CPU while (the assembly of the UT transforms and) the update of the trailing matrix E with respect to the panel factorizations of C_k, D_k proceeds on the GPU. As for the data transfers occurring at each iteration, before the factorizations of the next panels inside the loop body, their contents are retrieved from the GPU to the CPU. Then, at the end of the iteration, the orthogonal factors are transferred in the opposite direction, from CPU to GPU, in preparation for the next iteration. This communication pattern results in the output band matrix being partitioned between the CPU and GPU. The data can be collected in one of the two memory workspaces during the computation or as part of a final step.
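The skeleton below shows one way to organize this CPU-GPU overlap with CUDA streams. It is a sketch under our own naming, not the implementation evaluated in [21]: the helper routines are empty placeholders and the data offsets are elided, while the CUDA and cuBLAS calls themselves are standard.

```c
/* Schematic static look-ahead loop on a CPU-GPU server (hedged sketch). */
#include <cuda_runtime.h>
#include <cublas_v2.h>

static void factor_next_panels_cpu(double *h_panels, int rows, int b)
{
    /* QR of C_{k+1} and LQ of D_{k+1} (e.g., LAPACK dgeqrf/dgelqf) go here. */
    (void)h_panels; (void)rows; (void)b;
}

static void enqueue_trailing_update_gpu(cublasHandle_t h, double *d_A,
                                        int n, int k, int w, int b)
{
    /* cublasDgemm calls applying U_k, V_k to E (Eqs. (4) and (7)) go here;
     * they only enqueue work on the stream bound to the handle. */
    (void)h; (void)d_A; (void)n; (void)k; (void)w; (void)b;
}

void band_reduction_static_lookahead(double *d_A,      /* matrix on the GPU  */
                                     double *h_panels, /* pinned host buffer */
                                     int n, int w, int b)
{
    cudaStream_t s_comp, s_copy;
    cublasHandle_t handle;
    cudaStreamCreate(&s_comp);
    cudaStreamCreate(&s_copy);
    cublasCreate(&handle);
    cublasSetStream(handle, s_comp);  /* GPU updates run on their own stream */

    for (int k = 0; k + w + b - 1 < n; k += b) {
        size_t bytes = (size_t)(n - k) * 2 * b * sizeof(double);

        /* 1. Retrieve the next panels C_{k+1}, D_{k+1} to the CPU. */
        cudaMemcpyAsync(h_panels, d_A /* + panel offset */, bytes,
                        cudaMemcpyDeviceToHost, s_copy);
        cudaStreamSynchronize(s_copy);

        /* 2. Enqueue the trailing update w.r.t. C_k, D_k on the GPU ...   */
        enqueue_trailing_update_gpu(handle, d_A, n, k, w, b);

        /* 3. ... and, concurrently, factorize the next panels on the CPU. */
        factor_next_panels_cpu(h_panels, n - k, b);

        /* 4. Return the new orthogonal factors to the GPU and synchronize
         *    both streams before starting the next iteration.             */
        cudaMemcpyAsync(d_A /* + factor offset */, h_panels, bytes,
                        cudaMemcpyHostToDevice, s_copy);
        cudaStreamSynchronize(s_copy);
        cudaStreamSynchronize(s_comp);
    }

    cublasDestroy(handle);
    cudaStreamDestroy(s_copy);
    cudaStreamDestroy(s_comp);
}
```

The essential point is that the trailing update is only enqueued on its own stream, so the host thread remains free to run the panel factorizations while the GPU works.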

Performance evaluation. We next provide a brief experimental assessment of the benefits of a concurrent execution on three distinct CPU-GPU servers via the introduction of a static look-ahead. (For a full report, see [21].) For this purpose, we employ platforms equipped with a high-end NVIDIA Tesla P100 (Pascal); a cost-effective NVIDIA GeForce Titan X (Pascal); and an older NVIDIA Tesla K20c (Kepler). The Tesla P100 is paired with two Intel Xeon E5-2620 v4 processors (8+8 cores); the Titan X with an Intel Core i7-3770K processor (4 cores); and the K20c with an Intel i7-3930K processor (6 cores). The codes were compiled with version 8.0 of the CUDA development environment toolkit. Optimized BLAS and LAPACK implementations were provided by NVIDIA cuBLAS 8.0 for the GPU and Intel MKL 2017 for the CPU. Hyper-threading was disabled as suggested by the MKL library documentation. The GNU C compiler was the default version provided by the Linux operating system installed on those computers. The optimizations made by this compiler are mostly irrelevant for this study, because all the performance-sensitive code is implemented inside the cuBLAS and MKL libraries.

Fig. 3 compares the performance of the band reduction computing the UT transform on the GPU with and without the look-ahead strategy (solid and dashed lines, respectively), using single and double precision arithmetic (left- and right-hand side plots, respectively). In these plots (as well as those in the following section), performance is reported in GFLOPS (billions of flops per second), using 4n^2(2n/3 − w) for the flop count (taking into account that m = n). Since the comparison between variants is performed in a scenario with fixed w, this is a reasonable approach to obtain a scaled version of the execution time, with the differences being more visible for smaller problem sizes than they would have been using the execution time itself. The improvement obtained by exploiting the CPU-GPU concurrency, via the look-ahead strategy, depends on the performance ratio between the CPU and GPU. In all cases the benefits provided by the look-ahead strategy are clear, and the variant with look-ahead is considerably faster than its counterpart without it.
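As a side note, the GFLOPS figure reported in these plots follows directly from the stated flop count; the tiny helper below (our own illustration, with made-up sample numbers) shows the computation.

```c
#include <stdio.h>

/* GFLOPS for the band reduction of an n x n matrix with bandwidth w,
 * using the flop count 4*n^2*(2*n/3 - w) stated in the text. */
static double band_reduction_gflops(double n, double w, double seconds)
{
    double flops = 4.0 * n * n * (2.0 * n / 3.0 - w);
    return flops / seconds / 1.0e9;
}

int main(void)
{
    /* Example: n = 10000, w = 512, 25 seconds (hypothetical figures). */
    printf("%.1f GFLOPS\n", band_reduction_gflops(10000.0, 512.0, 25.0));
    return 0;
}
```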

Fig. 3. Performance of the band reduction routine using the UT transform with and without look-ahead (solid and dashed lines, respectively) compared to the trailing matrix update products (dotted line) using single and double precision arithmetic (left- and right-hand side plots, respectively).

4. Static versus dynamic look-ahead in the band reduction (on multicore processors)

4.1. Static look-ahead

The static look-ahead strategy discussed in the previous section, which overlaps the execution of the next panel factorization on the CPU with the computation of the current trailing updates on the GPU, was also leveraged in [15] to exploit the parallelism of current multicore processors.


The solution there explicitly encoded a static look-ahead strategy into the first stage of the TSR algorithm, creating two separate execution paths (or branches) which decoupled the next panel factorizations (together with the prior necessary updates) from the (remaining part of the) trailing updates. The computational resources, in the form of threads/cores, were then split into two disjoint "asymmetric" teams, respectively with a few and most of the threads/cores, that synchronized at the end of each iteration. Furthermore, the team comprising a small number of cores was in charge of the largely sequential panel factorizations (and corresponding updates) while the counterpart team, containing most of the platform's cores, performed the highly parallel trailing updates. A property of this approach is that, at each iteration, the major part of the trailing update can be performed in collaboration by the corresponding team of threads invoking a couple of multi-threaded kernels from BLAS, which results in a better use of shared resources such as the cache memories [15,26].

4.2. Dynamic look-ahead

Alternatively, we can rely on a programming model such as OpenMP, OmpSs, StarPU, etc. [17,18,27], and its supporting runtime, to exploit task-level parallelism in the TSR algorithm for the SVD. The idea is to define each one of the operations in the dependency graph represented in Fig. 2 (bottom) as a single (sequential) task, and to specify the dependencies via the input/output arguments to each task. The runtime then employs this information to orchestrate, at execution time, a parallel schedule of the operations that is correct while optimizing the exploitation of task-level parallelism.

We remark the following subtle difference: in the static look-ahead strategy, the major part of the trailing update is integrated into a single "task", performed in parallel by a team of threads via a multi-threaded implementation of BLAS. In the runtime-based solution, the trailing update is split into multiple tasks, and the runtime dynamically distributes these operations among the threads, with each thread executing the tasks assigned to it by invoking one (or more) kernels from a sequential instance of BLAS.

From the programming point of view, the static look-ahead requires a reorganization of the algorithm, to enforce that the execution of the next panel factorization occurs in the same iteration as the current trailing update. Furthermore, the original trailing update has to be decomposed into two parts, to expose the part of the matrix that will be affected by the next panel factorization [15]. Compared with this, the runtime-assisted parallelization does not require significant changes in the code, except for the introduction of the appropriate parallelization directives, much in the style of a parallelization with OpenMP.
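To make the task-based formulation concrete, the sketch below expresses the dependencies with standard OpenMP 4.5 task directives rather than the OmpSs syntax actually used in our implementation; the helper routines are empty stand-ins, and, for readability, only the chain of left panels is shown (the right panels and the two-sided updates of E generate analogous tasks). Following [20], the first entry of each block acts as its "representant" in the depend clauses.

```c
/* Task-parallel sketch of the reduction with dynamic look-ahead
 * (hedged illustration, not the paper's code). */
#include <omp.h>

static void panel_qr(double *panel, int rows, int b)
{   /* panel factorization, Eq. (2); largely sequential */
    (void)panel; (void)rows; (void)b;
}

static void block_update(double *blk, const double *panel, int rows, int b)
{   /* block-wise trailing update, Eqs. (3)-(4) */
    (void)blk; (void)panel; (void)rows; (void)b;
}

/* blocks[k] is the k-th panel of the matrix; nb is the number of panels. */
void reduce_with_dynamic_lookahead(double **blocks, int nb, int rows, int b)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k + 1 < nb; k++) {
            double *panel = blocks[k];

            #pragma omp task firstprivate(panel) depend(inout: panel[0])
            panel_qr(panel, rows, b);

            /* As soon as blocks[k+1] has been updated, its factorization
             * (generated in the next loop iteration) becomes ready, so the
             * runtime overlaps it with the updates still pending for
             * blocks[k+2], ..., blocks[nb-1]: a dynamic look-ahead. */
            for (int j = k + 1; j < nb; j++) {
                double *blk = blocks[j];
                #pragma omp task firstprivate(panel, blk) \
                        depend(in: panel[0]) depend(inout: blk[0])
                block_update(blk, panel, rows, b);
            }
        }
    }  /* the implicit barrier of the single region waits for all tasks */
}
```

In OmpSs the same dependencies are expressed with the in/out/inout clauses of its task directive, and the code structure is otherwise unchanged.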
Fig. 4. Execution trace of the first three iterations of the reduction to band form, parallelized with OmpSs to integrate dynamic look-ahead, using 8 threads, and applied to a square matrix of order 5,000, with w = 512 and b = 96. This trace was obtained using the automatic instrumentation function of OmpSs.

Fig. 4 exposes the practical consequences of the dynamic look-ahead strategy from the perspective of an overlapped execution of the panel factorizations. The plot displays a fragment of an execution trace obtained for the TSR algorithm with dynamic look-ahead parallelized using OmpSs. Following the color scheme used in Fig. 1, the left and right panel factorizations are depicted in blue and red, respectively. The update of the next panel C_1 is shown in dark blue, while the update of the next panel D_1 is shown in dark red. The remainder of the left trailing update is marked in yellow, and the remainder of the right trailing update in orange. At this point we remark that an iteration finalizes when the right trailing update is completed, and that the left and right trailing updates cannot be in execution at the same time because both update the same memory region.


This figure reveals that, for this particular experiment, up to five left panel factorizations and four right panel factorizations are carried out during the first iteration. Then, during the second iteration, one left and two right panel factorizations are performed, and from that point on only one left and one right panel factorization can be performed per iteration. The maximum look-ahead depth (i.e., the capacity to perform panel factorizations from subsequent iterations in parallel with the updates in the current one) directly depends on the relation between the bandwidth w and the block size b. For this particular experiment the depth is w/b = 512/96 ≈ 5, which means that, at iteration k, the algorithm can be performing the factorizations of panels k + 1, k + 2, ..., k + 5.

4.3. Performance evaluation on multicore processors

Setup. The following experiments were performed using IEEE double-precision arithmetic, on a server with two Intel Xeon E5-2630 v3 processors (8+8 cores, running at a nominal frequency of 2.4 GHz). The implementations were linked with optimized instances of BLAS and LAPACK provided by the Intel MKL library [28], and hyper-threading was disabled as suggested in the MKL documentation. The Intel C and Fortran 2017 compilers were used for the conventional multi-threaded version without look-ahead, while the C and Fortran compilers from OmpSs 16.06 were used for the dynamic look-ahead implementation. The specific optimizations made by these compilers are mostly irrelevant for this study, because all the performance-sensitive code is implemented inside the MKL library. We employed one thread per core in all executions.

In the experiments, we employ square matrices (m = n) with uniformly distributed random entries, and dimensions from 500 up to 10,000 in steps of 500. The performance depends on the TSR algorithm as well as on the bandwidth w and block size b. For this reason, we decided to test the algorithms using three different bandwidths: w = 64, 128, and 256. For each TSR algorithm and problem size, the block size b was then evaluated using values ranging from 16 up to w/2 in steps of 16. The plots reflect the performance attained by the optimal block size b for each implementation–bandwidth pair.

Fig. 5. Performance of the TSR algorithm using OmpSs and the compact WY representation compared with the conventional WY representation (solid and dotted lines, respectively).

Representation of orthogonal matrices. The first experiment aims to expose the benefits of leveraging the compact WY representation in the TSR algorithm with dynamic look-ahead instead of the conventional WY representation. (In a task-parallel implementation of the TSR algorithm for multicore processors, the use of the UT transform is not so crucial: although the compact WY transform is slightly slower to compute, it does not present scalability problems because it is computed by a single core.) In theory, the dynamic look-ahead strategy may be implemented using both representations, but in practice the latter has a large memory consumption, making it impractical for large problem dimensions. Therefore, in our implementations the maximum look-ahead depth is 1 for the conventional WY representation, while it is w/b for the compact WY representation (see Section 2 for more details).

The performance results using both representations are plotted in Fig. 5 for three different bandwidths. The first conclusion that can be extracted from this plot is that, for medium to large problem dimensions, both representations deliver similar performance results, which means that, as the problem dimension grows, the maximum depth of the look-ahead strategy becomes less important from the viewpoint of performance. However, for small problem dimensions, the conclusions are completely different as, in those cases, the depth of the look-ahead strategy directly affects the overall performance.
In the plot, we can observe that, as the bandwidth is incremented, the improvements in performance also grow, since increasing the bandwidth implies that the maximum look-ahead depth can also be higher. In summary:





• The conventional WY transform exhibits much higher memory requirements than its compact counterpart, constraining the depth of the look-ahead and, consequently, the performance when the panel factorization dominates the computational cost.
• The panel factorization is indeed the cost-dominant factor for problems of small dimension. Conversely, when the problem size is large, the cost is dominated by the trailing update, and the benefits of look-ahead are blurred.
• The dimension of the bandwidth exerts an influence on the performance impact of look-ahead since a larger bandwidth increases the number of future panel factorizations which can be computed "in advance" (i.e., the depth of look-ahead). In addition, restricting the bandwidth reduces the algorithmic block size which, in turn, may reduce the performance of the matrix-matrix multiplication kernel (GEMM) that realizes the application of the block Householder transformations.

Dynamic, runtime-assisted look-ahead. The second experiment in this section is designed to demonstrate the benefits of employing the dynamic look-ahead strategy in comparison with the conventional multi-threaded execution without look-ahead, in which parallelism is exclusively extracted from within a multi-threaded instance of the BLAS kernels (in our case, that of Intel MKL).

Fig. 6. Performance of the TSR algorithm using OmpSs compared with the conventional version without look-ahead that extracts all parallelism from a multi-threaded version of MKL (solid and dotted lines, respectively). Both implementations use the compact WY representation of orthogonal matrices.

The results of this evaluation, in Fig. 6, show noticeable improvements for all problem dimensions, independently of the target bandwidth. In particular, these improvements are more relevant for the small problems, as in those cases the panel factorizations severely constrain the overall performance of the TSR algorithm without look-ahead. With the runtime-assisted TSR algorithm with dynamic look-ahead, the panel factorizations are overlapped with the execution of the trailing updates and, in consequence, are removed from the critical path of the TSR algorithm. As a result, for these small problem dimensions the dynamic look-ahead strategy doubles the performance reported by the conventional approach without look-ahead. On the other hand, for large problem dimensions, where the overall execution time is dominated by the trailing updates, the performance differences are less remarkable, though they are still visible.

For reference, Fig. 6 also includes a line showing the performance of the matrix-matrix multiplication routine in Intel MKL (label "GEMM"). This demonstrates that, provided the bandwidth is large enough, the performance of the factorization routines with look-ahead approaches that of GEMM, exposing the scalability of the algorithm.

The conclusions extracted in this second experiment are in agreement with those extracted in the first experiment. For large problem dimensions, the overall execution time is dominated by the trailing updates and moderate performance improvements are reported when introducing the look-ahead strategy; therefore the depth of look-ahead is not so relevant. On the other hand, for small problem dimensions, the overall execution time is dominated by the panel factorizations, and the look-ahead strategy can introduce large performance improvements, proportional to the depth of the look-ahead technique.

5. Conclusions

In this paper we depart from the conventional intermediate triangular–band form for the initial step in the two-stage reduction for the SVD and instead compute a by-product with upper and lower banded structure as a result of this reduction.


In combination with the decoupling of the algorithmic block size from the bandwidth dimension, and the use of the compact WY and UT representations for the orthogonal factors, this allows us to develop high-performance algorithms for CPU-GPU platforms and multicore processors that can fully exploit look-ahead to overlap the mostly sequential panel factorizations with the efficient trailing matrix updates.

For multicore processors, we show that our specialized band form also accommodates a dynamic form of look-ahead, which can advance the factorization of several "future" panels to the "current" iteration, via a runtime-assisted parallelization using OmpSs. Our results on different servers equipped with recent multicore technology from Intel and NVIDIA show the performance throughput of the new codes, with GFLOPS curves that closely follow the sustainable peak performance on these systems.

Acknowledgments

The researchers from Universidad Jaume I were partially sponsored by the EU H2020 project 732631 OPRECOMP, the CICYT projects TIN2014-53495-R and TIN2017-82972-R of the MINECO and FEDER, and the FPU program of MECD. The researcher from Universidad Complutense de Madrid was supported by project TIN2015-65277-R of the MINECO and FEDER. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the NVIDIA GeForce Titan X (Pascal) GPU used for this research.

References

[1] D. O'Leary, Matrix factorization for information retrieval, lecture notes for a course on Advanced Numerical Analysis, University of Maryland, 2006. Available at https://www.cs.umd.edu/users/oleary/a600/yahoo.pdf.
[2] A. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996, doi:10.1137/1.9781611971484.
[3] V. Novaković, Parallel Jacobi-type Algorithms for the Singular and the Generalized Singular Value Decomposition, Ph.D. thesis, University of Zagreb, Croatia, Faculty of Science, Dept. of Mathematics, 2017.
[4] G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd ed., The Johns Hopkins University Press, Baltimore, 1996.
[5] E. Anderson, Z. Bai, L.S. Blackford, J. Demmel, J.J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, D.C. Sorensen, LAPACK Users' Guide, 3rd ed., SIAM, 1999.
[6] J.J. Dongarra, J. Du Croz, S. Hammarling, R.J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Softw. 14 (1) (1988) 1–17.
[7] C.H. Bischof, B. Lang, X. Sun, Algorithm 807: the SBR toolbox—software for successive band reduction, ACM Trans. Math. Softw. 26 (4) (2000) 602–616, doi:10.1145/365723.365736.
[8] J.J. Dongarra, J. Du Croz, S. Hammarling, I. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Softw. 16 (1) (1990) 1–17.
[9] B. Grosser, B. Lang, Efficient parallel reduction to bidiagonal form, Parallel Comput. 25 (8) (1999) 969–986, doi:10.1016/S0167-8191(99)00041-1.
[10] A. Haidar, J. Kurzak, P. Luszczek, An improved parallel singular value algorithm and its implementation for multicore hardware, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, ACM, New York, NY, USA, 2013, pp. 90:1–90:12, doi:10.1145/2503210.2503292.
[11] P. Bientinesi, F.D. Igual, D. Kressner, M. Petschow, E.S. Quintana-Ortí, Condensed forms for the symmetric eigenvalue problem on multi-threaded architectures, Concurr. Comput.: Pract. Exper. 23 (7) (2011) 694–707.
[12] D. Davidović, E.S. Quintana-Ortí, Applying OOC techniques in the reduction to condensed form for very large symmetric eigenproblems on GPUs, in: Proceedings of the 20th Euromicro Conference on Parallel, Distributed and Network-based Processing – PDP 2012, 2012. To appear.
[13] A. Haidar, H. Ltaief, J. Dongarra, Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels, in: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, pp. 1–11, doi:10.1145/2063384.2063394.
[14] P. Strazdins, A Comparison of Lookahead and Algorithmic Blocking Techniques for Parallel Matrix Factorization, Tech. Rep. TR-CS-98-07, Department of Computer Science, The Australian National University, Canberra 0200 ACT, Australia, 1998.
[15] R. Rodríguez-Sánchez, S. Catalán, J.R. Herrero, E.S. Quintana-Ortí, A.E. Tomás, Look-ahead in the two-sided reduction to compact band forms for symmetric eigenvalue problems and the SVD, Numerical Algorithms. Available online at https://doi.org/10.1007/s1107.
[16] FLAME project home page, http://www.cs.utexas.edu/users/flame/.
[17] StarPU project, http://runtime.bordeaux.inria.fr/StarPU/.
[18] OmpSs project home page, http://pm.bsc.es/ompss.
[19] PLASMA project home page, http://icl.cs.utk.edu/plasma.
[20] R.M. Badia, J.R. Herrero, J. Labarta, J.M. Pérez, E.S. Quintana-Ortí, G. Quintana-Ortí, Parallelizing dense and banded linear algebra libraries using SMPSs, Concurr. Comput.: Pract. Exper. 21 (2009) 2438–2456.
[21] A.E. Tomás, R. Rodríguez-Sánchez, S. Catalán, E.S. Quintana-Ortí, Reduction to band form for the singular value decomposition on graphics accelerators, in: Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM'18, ACM, New York, NY, USA, 2018, pp. 51–60, doi:10.1145/3178442.3178448.
[22] R. Schreiber, C. Van Loan, A storage-efficient WY representation for products of Householder transformations, SIAM J. Sci. Stat. Comput. 10 (1) (1989) 53–57.
[23] Implementation of the GMRES method using Householder transformations, SIAM J. Sci. Stat. Comput. 9 (1) (1988) 152–163, doi:10.1137/0909010.
[24] C. Puglisi, Modification of the Householder method based on the compact WY representation, SIAM J. Sci. Stat. Comput. 13 (3) (1992) 723–726, doi:10.1137/0913042.
[25] T. Joffrain, T.M. Low, E.S. Quintana-Ortí, R. van de Geijn, F.G. Van Zee, Accumulating Householder transformations, revisited, ACM Trans. Math. Softw. 32 (2) (2006) 169–179, doi:10.1145/1141885.1141886.
[26] S. Catalán, J.R. Herrero, E.S. Quintana-Ortí, R. Rodríguez-Sánchez, R.A. van de Geijn, A case for malleable thread-level linear algebra libraries: the LU factorization with partial pivoting, CoRR abs/1611.06365, http://arxiv.org/abs/1611.06365.
[27] OpenMP Architecture Review Board, OpenMP application program interface version 4.5, 2015, http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf.
[28] Intel, Math Kernel Library, 2017, https://software.intel.com/en-us/intel-mkl.