Parallel Computing 48 (2015) 108–124
A data-driven paradigm for mapping problems

Peng Zhang a,∗, Ling Liu b, Yuefan Deng c,d

a Biomedical Engineering Department, Stony Brook University, NY 11794-8151, United States
b Marine and Atmospheric Science Department, Stony Brook University, NY 11794-5000, United States
c Applied Mathematics Department, Stony Brook University, NY 11794-3600, United States
d National Supercomputer Center in Jinan, Shandong 250101, China

Article history: Received 30 September 2013; Revised 13 January 2015; Accepted 5 May 2015; Available online 13 May 2015

Keywords: Data mapping; Task mapping; Parallel computing; Data movement matrix
Abstract

We present a new data-driven paradigm for solving mapping problems on parallel computers. This paradigm targets mapping data modules, instead of task modules, onto multiple processing cores. By analyzing the dependencies of data modules, we devise a data movement matrix that reduces the need to manipulate task program modules at the expense of handling data modules. To visualize and quantify the complex maneuvers, we adopt the parallel activities trace graphs introduced earlier. To demonstrate the procedure and algorithmic value of our paradigm, we test it on the Strassen matrix multiplication and Cholesky matrix inversion algorithms. Mapping tasks has been widely studied, while mapping data is a new approach that appears to be more efficient for the data-intensive applications that are becoming prevalent on today's parallel computers with millions of cores. Published by Elsevier B.V.
1. Introduction

Cellular networks such as torus and mesh interconnects [1–5] are deeply exploited in the communication sub-systems that couple the millions of processing cores in most of today's advanced parallel computers [1,6]. The growing complexity of such communication sub-systems, coupled with their intensified performance requirements, makes it a serious challenge to design state-of-the-art communication sub-systems and programming models for handling large data sets [7–15]. The programming models [16,17] currently adopted in many applications already face tight bottlenecks at a parallelism of tens of thousands of processes. For emerging supercomputers with millions of cores, we need to develop new programming paradigms and, unfortunately, untangling and streamlining the processes of non-uniform remote memory accesses, heterogeneous I/O, and numerical computation for such large-scale systems [17] is a daunting task.

Addressing one aspect of this challenge, researchers have studied the task mapping paradigm (TMP), which maps task modules onto processing cores to minimize unnecessary communications and process idling [7,18]; tests on various applications on many parallel computers demonstrated significant performance improvements [8–15,19]. TMP is generally formulated in graph-theoretic terms [7,14], in which both the application requirements and the network information are represented as graphs. The parallel application is represented as a directed acyclic graph (DAG), in which a vertex is a compute task module that encapsulates a set of operations on a specific data set. The network is represented as a graph whose vertices represent processors and whose edges represent the direct links between adjacent processors. Task mapping seeks a map of the task graph onto the network graph, i.e., an assignment of task modules to processors. The objective of task mapping is essentially to reduce communication cost by strategically placing task modules on the appropriate
processors, while balancing the compute load across processors. For this purpose, the hop-bytes metric is usually used to measure the solution quality of a mapping algorithm [7,9,10,12–14,18]. The TMP formulation thus naturally concentrates on optimizing communications by manipulating the placement of task modules rather than data modules.

In this paper, we study the mapping problem from the data perspective and describe a new paradigm for representing an algorithm. This paradigm substitutes data modules for task modules and data movement for inter-task communication and dependency. Our new data-driven mapping paradigm (DMP) seeks the best assignment of data modules to processor cores to minimize the total execution time, including local computation, data availability, and idling.

This paper is organized as follows. Section 2 discusses the basic concepts and formalism of the data-driven mapping paradigm, in which parallel activities trace graphs are devised to help decouple the computation and communication activities. Section 3 demonstrates the application of the DMP to the Strassen matrix multiplication and Cholesky matrix inversion algorithms; this section also illustrates and compares the basic concepts of data and task dependency graphs and matrices. Discussions and conclusions are drawn in the last section.

2. Formulation

Like any numerical algorithms, parallel algorithms need to balance accuracy, efficiency, and programming flexibility, among other factors. The data-driven mapping paradigm (DMP) we formulate for representing, managing and optimizing algorithms will demonstrate such balance and programmability. Section 2.1 introduces the formalism, while Section 2.2 presents the mechanism of the parallel activities trace (PAT) graphs for the parallel and highly tangled data movements.

2.1. Data-driven mapping paradigm

Definition 1 (data module). An algorithm is assumed to consist of a finite set of data modules D = {ds} (1 ≤ s ≤ n) and task modules. A data module is the smallest unit of data that can be communicated as a whole packet over the network and processed by a single processing core. If a data module exists before the algorithm starts processing, it is referred to as an initial data module; otherwise, it is referred to as an intermediate data module. Clearly, the expected results of an algorithm are a collection of intermediate data modules.

Definition 2 (data dependence). The generation of an intermediate data module ds is assumed to depend on one or several data modules ds1, …, dsm, and the generation method fs for ds is specified by the algorithm: ds = fs(ds1, …, dsm). Each intermediate data module is associated with one specific method. Then ds depends on ds1, …, dsm; ds is a dependent of dsi and dsi is an antecedent of ds, where i ∈ [1, m]. Data dependence is always unidirectional. Initial data modules have no antecedents, while an intermediate data module has at least one antecedent and may have multiple dependents.

Definition 3 (data dependence graph and matrix). The data dependence is represented as a directed acyclic graph Gd(Vd, Ed). The vertices in Vd represent data modules and the edges in Ed represent direct dependences between data modules. If dj depends on di, there is a directed edge eij from di to dj. The adjacency matrix of Gd is referred to as the data dependence matrix A = [aij]n×n, whose entry aij = 1 if and only if di is an antecedent of dj, and aij = 0 otherwise.
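As a concrete illustration of Definitions 1–3, the following C++ sketch builds a data dependence matrix from a list of generation methods ds = fs(ds1, …, dsm). The module numbering, the structure names and the tiny example algorithm in main() are hypothetical and introduced only for illustration; this is a minimal sketch, not the authors' implementation.

#include <cstdio>
#include <vector>

// One generation rule: module `result` is produced from its antecedents (Definition 2).
struct Method {
    int result;                    // index of the intermediate data module d_s
    std::vector<int> antecedents;  // indices of d_s1, ..., d_sm
};

// Build the n-by-n data dependence matrix A with a[i][j] = 1 iff d_i is an antecedent
// of d_j (Definition 3). Modules that never appear as a result are the initial modules.
std::vector<std::vector<int>> dependenceMatrix(int n, const std::vector<Method>& methods) {
    std::vector<std::vector<int>> a(n, std::vector<int>(n, 0));
    for (const Method& m : methods)
        for (int i : m.antecedents)
            a[i][m.result] = 1;
    return a;
}

int main() {
    // Hypothetical toy algorithm: d2 = f2(d0, d1), d3 = f3(d2, d0); d0 and d1 are initial.
    std::vector<Method> methods = { {2, {0, 1}}, {3, {2, 0}} };
    auto a = dependenceMatrix(4, methods);
    for (const auto& row : a) {
        for (int v : row) std::printf("%d ", v);
        std::printf("\n");
    }
}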
The number of vertices representing intermediate data modules in DMP equals the number of atomic tasks in TMP.

Definition 4 (data relevance). If two data modules di and dj are both antecedents of the same data module dk, these two data modules are related to each other through dk. All of the antecedents of an intermediate data module dk form the relevant set of dk, denoted r(dk). The number of computing tasks in TMP can be calculated from the data dependency matrix A, as shown by the following two theorems:

Theorem 1. Given a data dependency matrix A = [aij]n×n of an algorithm, the relevant set of dk is r(dk) = {di | aik = 1}.
Theorem 2. Given a data dependency matrix A = [aij]n×n of an algorithm, let K = {k ∈ [1, n] | r(dk) = ∅}; then the total number of computing tasks is n − |K|, where |K| is the number of elements in the set K and ∅ is the empty set.

Definition 5 (accomplishment criterion of an algorithm). It is always assumed in TMP that a processor executes a task that is assigned to it, and thus an algorithm is accomplished after all tasks are executed. Similarly, in DMP, we assume that a processor executes the method fk(r(dk)) for dk as soon as it has received all of the data modules dk1, …, dkm in r(dk) = {dk1, …, dkm}. Therefore, an algorithm is accomplished once every relevant set has been completely assembled on some processor. We may assume, without loss of generality, that any intermediate data module dk depends on exactly two data modules, i.e., |r(dk)| = 2. Thus, an algorithm is accomplished once every pair of related data modules meets at the same processor.

Definition 6 (data movement matrix). As an algorithm is executed in parallel, a task needs to communicate with other tasks in TMP; similarly, a data module must traverse the interconnection network to join other relevant modules in DMP. Given k processors, p1 through pk, allocated to an algorithm with data modules D = {ds} (1 ≤ s ≤ n), a matrix that directs data module movement between processors is denoted M = [mij]k×n, whose entry mij = 2 iff dj is initially mapped to pi; mij = 1 iff pi requests dj; and mij = 0 otherwise. Here, pi requesting dj means that any other processor that already has dj intends to send dj to pi.
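Continuing the sketch above, the relevant sets of Theorem 1 and the task count of Theorem 2 can be read directly off the data dependence matrix. The function names below are again our own, hypothetical illustrations rather than the paper's code:

#include <vector>

// Relevant set of d_k (Theorem 1): all antecedents d_i with a[i][k] = 1.
std::vector<int> relevantSet(const std::vector<std::vector<int>>& a, int k) {
    std::vector<int> r;
    for (int i = 0; i < (int)a.size(); ++i)
        if (a[i][k] == 1) r.push_back(i);
    return r;
}

// Number of computing tasks (Theorem 2): n - |K|, where K collects the modules
// with empty relevant sets, i.e., the initial data modules.
int taskCount(const std::vector<std::vector<int>>& a) {
    int n = (int)a.size(), emptySets = 0;
    for (int k = 0; k < n; ++k)
        if (relevantSet(a, k).empty()) ++emptySets;
    return n - emptySets;
}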
Fig. 1. Programming skeleton of an algorithm utilizing DMP.
Theorem 3. Given a data movement matrix M = [mij]k×n, a data module dβ is an initial data module if there exists α such that mαβ = 2.
To accomplish an algorithm, a data movement matrix has to satisfy the criterion in Definition 5, leading to:

Theorem 4 (qualification of a data movement matrix). Given k processors allocated to an algorithm with data modules D = {ds} (1 ≤ s ≤ n), the completion of the algorithm requires a data movement matrix M = [mij]k×n to satisfy the following condition: for every intermediate data module dl, there exists a processor index α (1 ≤ α ≤ k) such that mαj > 0 for every dj ∈ r(dl). A data movement matrix that satisfies this qualification condition is referred to as a qualified data movement matrix.

Through these data-based mapping concepts, we provide an associated programming paradigm:

Definition 7 (data mapping programming paradigm). The data-based programming flowchart of an algorithm is illustrated in Fig. 1, where the data movement matrix of DMP plays the crucial role. As usual, a parallel computer is abstracted as compute nodes consisting of processor cores and communication interfaces. At the start of an algorithm, an initial data module dj is loaded onto the destined processor pi iff mij = 2. By searching their locally buffered data modules, the processors associate a relevant set r(dk) with its specific method fk, assemble them as a new task, and pipeline the task into the processor queues. On one node, there can be multiple queues assigned to multiple cores and used for prioritizing tasks. After a task completes, the newly generated intermediate data module dk may be output as a final result, flow over the network for further processing, or both. With network adaptors capable of managing data flows, a network switch can query other nodes for a specific data module and respond to such queries according to its data movement matrix.

In the same manner, the paradigm can also be applied to a multi-core system for designing multithreaded programs. Specifically, the initial data modules are loaded into shared memory. By searching the shared memory, a main thread looks for the next available tasks and pipelines them to worker threads for concurrent execution. Upon completion of a task, the newly generated intermediate data modules may be output as final results or pushed into the shared memory, through which the threads communicate. With the specified data dependency matrix, the main thread can manage the data flows to complete the algorithm. Therefore, for parallel programming of an algorithm, a data dependency matrix (for multithreading) and a data movement matrix (for communication) are sufficient, given that the sequential methods of the algorithm have been coded [16]. A program flowchart for DMP is shown in Fig. 2. This alleviates programming complexity by separating the considerations of computation and communication, and it can expedite performance tuning through convenient manipulation of the communication strategies defined by the data movement matrix.

Definition 8 (data mapping problem). Given k processors allocated to an algorithm with data modules {ds} (1 ≤ s ≤ n), find a qualified data movement matrix M that maps the initial data modules and issues real-time data requests on the processors such that every relevant set of an intermediate data module assembles on the same processor.

Definition 9 (objective of the data mapping problem). The total computing load of a specific algorithm is usually fixed regardless of the number of processing cores involved. Parallel computing adds overheads such as communications and idling due to computing and communication load imbalance among the processing cores.
program DMP
  var
    - M = [mij]: data movement matrix;
    - k: current processor core ID;
    - a list of well-established methods fl with their relevant sets Ωr(dl).
  begin 3 steps in parallel:
    Step 1: Communication in network switch {
      Wait for any di that satisfies mki = 1; end when all such di have arrived and
      have been input into the pool of buffered data modules.
    }
    Step 2: Communication in network switch {
      Input every di that satisfies mki = 2 into the pool of buffered data modules.
      Examine each newly generated dl: when mk'l = 1, query whether core k' still needs it;
      if yes, send dl to processor core k'. Examination ends when no new task is pending
      and Step 1 has ended.
    }
    Step 3: Computation in processors {
      Search for a new complete Ωr(dl) in the pool of buffered data modules.
      If a new Ωr(dl) is found, execute fl(Ωr(dl)) to generate dl, then input dl into the
      pool as a new intermediate data module. Searching ends when Step 2 has ended.
    }
  end

Fig. 2. Program flowchart for DMP.
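To make the shared-memory variant of Definition 7 and the computation step of Fig. 2 concrete, the following C++ sketch shows a minimal single-node scheduler: the main thread scans the pool of buffered data modules for complete relevant sets and dispatches the corresponding methods to a small FIFO worker pool. All class and function names here are ours, introduced only for illustration; the paper's own package is described in Appendix A.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <vector>

// A ready task: method f_l applied to an assembled relevant set, producing module `out`.
struct Task { int out; std::function<void()> run; };

class DmpScheduler {
public:
    explicit DmpScheduler(int workers) {
        for (int i = 0; i < workers; ++i)
            pool_.emplace_back([this] { workerLoop(); });
    }
    ~DmpScheduler() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : pool_) t.join();
    }
    // FIFO policy: ready tasks are executed in the order they became available.
    void submit(Task t) {
        { std::lock_guard<std::mutex> lk(m_); fifo_.push(std::move(t)); }
        cv_.notify_one();
    }
private:
    void workerLoop() {
        for (;;) {
            Task t;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !fifo_.empty(); });
                if (done_ && fifo_.empty()) return;
                t = std::move(fifo_.front());
                fifo_.pop();
            }
            t.run();  // execute f_l on the assembled relevant set
        }
    }
    std::vector<std::thread> pool_;
    std::queue<Task> fifo_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

// Main-thread scan (Fig. 2, Step 3): a module d_l becomes a ready task once every
// antecedent listed in the dependence matrix is already in the buffered pool.
std::vector<int> readyModules(const std::vector<std::vector<int>>& a,
                              const std::set<int>& buffered,
                              const std::set<int>& alreadyIssued) {
    std::vector<int> ready;
    for (int l = 0; l < (int)a.size(); ++l) {
        if (buffered.count(l) || alreadyIssued.count(l)) continue;
        bool intermediate = false, allPresent = true;
        for (int i = 0; i < (int)a.size(); ++i)
            if (a[i][l] == 1) {
                intermediate = true;
                if (!buffered.count(i)) { allPresent = false; break; }
            }
        if (intermediate && allPresent) ready.push_back(l);
    }
    return ready;
}

In the full DMP program of Fig. 2, Steps 1 and 2 would additionally feed the buffered pool from the network according to the data movement matrix; the sketch above covers only the single-node computation step.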
Fig. 3. Data module partition.
The objective of parallel computing is to minimize such overheads; in other words, we want all participating processor cores busy computing the given algorithm, thereby minimizing the total processor idle time. Accordingly, the objective of DMP algorithms is to seek, among all qualified data movement matrices, one that minimizes the total processor idle time.

The objective of classical TMP is to seek the best task mapping that minimizes the total communication time [18]. Although related at a high level, the two are quite different approaches, because DMP is a potential superset of TMP: the TMP objective of minimizing communication is naturally part of the DMP objective. It is not difficult to argue that minimizing communication is a necessary, but not sufficient, condition for minimal overall execution time. For example, desynchronized communication and computation can be realigned to overlap, or the execution order can be permuted, to achieve the same objective of minimizing total processor idle time. TMP complicates the task mapping and scheduling processes by incorporating many coupled impact factors and is thus difficult to understand and implement. DMP addresses this by concentrating on the intrinsic goal of parallelism: saturating all processor cores with computing tasks by feeding them the necessary and sufficient data modules at the optimal moments. In Section 3, we demonstrate how the data dependency matrix is used to improve programmability and achieve the expected parallel efficiencies. Before that, we use a simpler example, tiled matrix multiplication, to illustrate the new terms introduced above.

Example 1 (matrix multiplication). The multiplication of two matrices (2 × 2 tiles) is used to illustrate the above definitions. The initial and final data module partitions are shown in Fig. 3. In this case, the submatrices of A and B are assumed to be stored in row-major and column-major order, respectively, for fast memory access. Otherwise, if A were stored column-major, a parallel matrix transpose might be needed to redistribute the submatrices of the operand [20]. This algorithm involves a total of 8 initial and 12 intermediate data modules and defines two methods, matrix addition and matrix multiplication (Fig. 4). Fig. 5 shows the data dependence matrix for multithreaded parallelism, and Fig. 6 shows a qualified data movement matrix for interprocess communication. In Fig. 6, a gray element (i, j) means that processor i loads data module j as an initial data module; for example, element (2, 2) is gray, implying that processor 2 loads data module 2 during initialization. A black element (i, j) means that processor i needs data module j as an intermediate data module; for example, element (4, 15) is black, indicating that processor 4 needs data module 15 to continue some operations. The remaining elements are white (empty): a white element (i, j) means that processor i never needs data module j; for example, processor 1 never needs module 2, so element (1, 2) is white (an empty block in Fig. 6).

The qualification of the data movement matrix (Theorem 4) can be checked as follows. For example, d9 = d1 × d5 implies r(d9) = {d1, d5} (Theorem 1). In the figure, we find processor 1 with m11 = 2 > 0 and m15 = 2 > 0 (Definition 6), confirming that d9 satisfies the condition of Theorem 4; in other words, data modules 1 and 5 can appear at the same processor for producing data module 9.
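The qualification check just described is mechanical, so it can be automated. The sketch below, again hypothetical and written only to illustrate Theorem 4, verifies that for every intermediate data module some processor row of M covers its entire relevant set with nonzero entries:

#include <vector>

// Check Theorem 4: M (k-by-n, entries 0/1/2) is qualified with respect to the
// dependence matrix A (n-by-n) if, for every intermediate module l, some processor
// row covers the whole relevant set r(d_l) with entries greater than zero.
bool isQualified(const std::vector<std::vector<int>>& a,
                 const std::vector<std::vector<int>>& m) {
    int n = (int)a.size(), k = (int)m.size();
    for (int l = 0; l < n; ++l) {
        std::vector<int> r;                       // relevant set of d_l (Theorem 1)
        for (int i = 0; i < n; ++i)
            if (a[i][l] == 1) r.push_back(i);
        if (r.empty()) continue;                  // initial module: nothing to check
        bool covered = false;
        for (int p = 0; p < k && !covered; ++p) {
            bool all = true;
            for (int j : r)
                if (m[p][j] <= 0) { all = false; break; }
            covered = all;                        // processor p can assemble r(d_l)
        }
        if (!covered) return false;
    }
    return true;
}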
The same procedure is applied to all other modules to verify the qualification of the data movement matrix. After the matrix multiplication algorithm completes, the submatrices of the resulting matrix C are distributed in a circular fashion, with each participating processor buffering one submatrix.
Fig. 4. Data dependency graph.
Fig. 5. Data dependency matrix.
Fig. 6. Data movement matrix.
Fig. 7. Parallel activities trace of the matrix multiplication in Example 1, based on the data movement matrix in Fig. 6.
One can reassemble the resulting matrix in one location by gathering the submatrices, or one can directly continue with other algebraic operations based on this distribution.

2.2. Parallel activities trace (PAT)

It is difficult to describe, let alone analyze, parallel algorithms quantitatively. We use a graphic system, which we call the parallel activities trace (PAT) graph, to pictorialize the computing and communication activities as well as the data dependency and movement. Similar systems were introduced before [21–23], each with its own unique features. In our work, the PAT, defined as a two-dimensional graph with the horizontal axis for wall-clock time and the vertical axis for the IDs of processors or cores (depending on the level of detail we wish to examine), is sufficient to describe these activities. Each processor or core has two horizontal bars: the upper bar, which we call the computation bar, indicates local computation on a specific processor core or computing unit, and the lower bar, which we call the communication bar, indicates the inter-process communication activities described as data module movements. Naturally, the two ends of the computation bar indicate the starting and ending times of the computation of a particular task, subroutine, or function, and thus the length of each bar shows the amount of time the underlying serial computation takes. Above the computation bar, one may write the name of the subroutine being executed, and colors may be used to differentiate functions. An arrowed line between communication bars indicates a data module movement from one processor core to another. We use the PAT graph for algorithmic analysis and to record vividly, a priori or a posteriori, the time series of a parallel algorithm on a multicore architecture. The fact that many state-of-the-art parallel computers can perform message passing and local computation simultaneously justifies the separation of the two bars. With such a PAT graph, one can easily estimate the amount of local computation, the amount and direction of communication, the load distribution, and so on. As a result, one can visually consider, or obtain guidance for, the minimization of processor idle time and load imbalance. We will follow this straightforward PAT graphic system to help describe, design, analyze, and assess our DMP and TMP approaches.

The PAT in Fig. 7 shows the algorithm discussed in Figs. 3–6. Given a matrix block of size L × L, the arithmetic complexities are Tmulti = O(L3) for matrix multiplication and Tadd = Tcomm = O(L2) for addition and communication, so the naïve matrix multiplications dominate the cost.

3. Applications on the algorithms

The Strassen matrix multiplication (MM) [24–26] and Cholesky matrix inversion [27–29] algorithms are implemented to demonstrate the applicability and features of DMP.

3.1. Experiments

To clarify the concepts, we present the details of prototyping DMP on multicore architectures. A code1 is available to demonstrate the Strassen matrix multiplication (MM) and Cholesky matrix inversion examples.

1 URL: https://code.google.com/p/data-driven-mapping/. The code includes the naïve and Strassen matrix multiplication and the Cholesky matrix inversion algorithms, together with the necessary guidelines.
Fig. 8. Flowchart of the code for demonstrating DMP in matrix multiplication and inversion algorithms.
Fig. 9. Data module partition for Strassen MM (2 × 2 tiles).
Technically, the code works in multithreaded mode and allows the number of concurrent threads to be varied for optimal performance. A sketch of the code is shown in Fig. 8. In the code, concurrent computing is managed by the thread pool pattern, and the available data modules are buffered. A scheduler checks for the possibility of creating new jobs based on the availability of data modules: a new job is created once all of its required data modules are ready, and the new job is then inserted into the pool. In this design, we adopt a FIFO policy for prioritizing jobs in the pool. We have also tested random and most-dependencies policies; the results showed no significant impact on the performance of the tests in this work, so we use the FIFO policy in the following tests. Appendix A presents more details.

Algorithmically, initial data mapping matrices are first generated for the naïve and Strassen matrix multiplications (2 × 2 tiles) and for the Cholesky matrix inversion algorithm (4 × 4 tiles). In particular, the initial data mapping matrices for the Strassen MM and Cholesky inversion algorithms are built following [24] and [27], in which the input matrices are partitioned into submatrices and the participating cores load specified submatrices (namely, initial data modules) during initialization. Here, individual submatrices are defined as data modules and assigned unique integer identifiers. Second, the data dependency matrices are built from the nature of the algorithm. For example, if there is an operator F between two operands d1 and d2, resulting in d3 = F(d1, d2), then d3 depends on d1 and d2 (Definition 3). In this manner, the data dependency matrix is built from the algorithm while the different operators (namely, the functions in the code) are recorded and compiled. Upon establishment of the initial data mapping and data dependency matrices, the data movement matrices are constructed by searching the relevant sets r(dk) of the unfinished data modules dk (Definition 4). Once there are no more unfinished data modules, the code completes on the participating processor cores.

3.2. Strassen matrix multiplication

Matrix multiplication (MM) is at the core of many linear algebraic algorithms such as matrix inversion and eigenvalue decomposition. Many MM algorithms have been introduced to reduce the computational complexity, but they cause substantial parallelization complications. One such algorithm is the Strassen method, or S-method [24,25]. The S-method is parallelized by grouping a matrix multiplication with several specified additions, resulting in much less structured communication patterns [24] than those of the naïve MM algorithm of Example 1. Using the same example as in [24], we apply our DMP approach to the S-method for multiplying two matrices (2 × 2 tiles) on a multi-core architecture. The data module partition is shown in Fig. 9, in which data modules 1 to 8 are the initial data modules. Fig. 10 compares the task and data dependency graphs side by side: the task dependency graph (TDG) shows the flowchart of a conventional S-method [26], and the data dependency graph (DDG) is created accordingly. The conversion rule between TDG and DDG is straightforward: a data module C depends on data modules A and B if and only if there is a function such that C = F(A, B), where F can be ADD or MUL. Accordingly, a data dependency matrix (shown in Fig. 11) is generated by examining the data dependency graph.
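For reference, the seven block products that let the S-method use only seven (rather than eight) submatrix multiplications are listed below in the standard textbook form; the indexing follows the usual 2 × 2 tile partition and may differ in ordering from the variant implemented in [24,26].

\begin{align*}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), & M_2 &= (A_{21}+A_{22})B_{11}, \\
M_3 &= A_{11}(B_{12}-B_{22}),          & M_4 &= A_{22}(B_{21}-B_{11}), \\
M_5 &= (A_{11}+A_{12})B_{22},          & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}), \\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}), & & \\
C_{11} &= M_1+M_4-M_5+M_7,             & C_{12} &= M_3+M_5, \\
C_{21} &= M_2+M_4,                     & C_{22} &= M_1-M_2+M_3+M_6.
\end{align*}

Each product Mi, together with the submatrix sums feeding it, gives rise to intermediate data modules in the DDG of Fig. 10, which is why at most seven concurrent multiplications are available to the scheduler.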
Fig. 10. The task and data dependence graphs in TMP and DMP (ADD/MUL: matrix addition/multiplication).
A data movement matrix (shown in Fig. 12) directs all data-request-based communications among the seven processes according to the parallelism pattern designed in [24]. The PAT in Fig. 13 illustrates the algorithmic activities based on this data movement matrix and the local computations. Compared with Fig. 7, we can see that both the Strassen and naïve MM algorithms need two initial data modules when initializing. Fig. 14 shows the performance results of running the multi-threaded S-method (2 × 2 tiles) on a multi-core system (Quad-Core i7-2920XM at 2.5 GHz with 16 GB RAM). The results clearly show the following:

(1) The detailed PAT analysis in Fig. 13 and the timing results in Fig. 14 confirm the expected fact that 7 threads perform the S-method (2 × 2 tiles) the fastest. The performance trends are stable across matrix sizes, as illustrated in Fig. 14.

(2) Users can run different multiplication methods simply by supplying different data dependency matrices, without recoding. In this example, the S-method and the naïve matrix multiplication (MM) method are performed by the same program with different data dependency matrices (Figs. 5 and 11, respectively). Re-coding is unnecessary in DMP, which improves the programmability of novel, complicated algorithms given legacy libraries.

(3) Fig. 15 compares the computing performance of the S-method and naïve MM. The results show that DMP reproduces the expected features of the algorithms very well. For example, in single-threaded mode, the S-method is clearly better than MM because it saves one submatrix multiplication. The 7-threaded mode is the best performer because the S-method (2 × 2 tiles) needs at most 7 multiplication operations. When the number of concurrent threads is less than 7, the performance improvement of the S-method over MM is not obvious.

(4) Executing the S-method requires buffering several intermediate submatrices [24,25], and memory space management for these intermediate submatrices is not a trivial job. In DMP, the program can automatically figure out when memory space should be allocated or deallocated by searching the data dependency matrix.
Fig. 11. Data dependency matrix.
Fig. 12. Data movement matrix.
For example, an intermediate module is allocated as soon as all of its antecedents are ready, and it is deallocated as soon as it is no longer in any outstanding relevant set. Fig. 15(C) shows that a multi-threaded S-method program requires more memory space than a single-threaded one, since more intermediate results need to be buffered concurrently. DMP thus saves the program from complex memory management. Overall, these results show DMP to be a practical strategy for the rapid programming of complex algorithms.
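One simple way to realize this automatic memory management, sketched below under our own naming conventions (not the paper's code), is to keep a per-module count of unfinished dependents derived from the data dependency matrix and to release a module's buffer when that count reaches zero:

#include <vector>

// Per-module bookkeeping for automatic allocation/deallocation in DMP.
// remaining[i] counts the dependents of d_i that have not been produced yet;
// when it drops to zero, d_i's buffer can be released. Final results, which
// have no dependents, are never released by this rule.
struct ModuleMemory {
    std::vector<int> remaining;            // outstanding dependents per module
    std::vector<std::vector<double>> buf;  // submatrix storage, empty = not allocated

    explicit ModuleMemory(const std::vector<std::vector<int>>& a)
        : remaining(a.size(), 0), buf(a.size()) {
        for (size_t i = 0; i < a.size(); ++i)
            for (size_t j = 0; j < a.size(); ++j)
                if (a[i][j] == 1) ++remaining[i];   // d_j depends on d_i
    }

    // Store the newly produced d_k and release antecedents that are no longer needed.
    void produce(int k, std::vector<double> data,
                 const std::vector<std::vector<int>>& a) {
        buf[k] = std::move(data);
        for (size_t i = 0; i < a.size(); ++i)
            if (a[i][k] == 1 && --remaining[i] == 0)
                std::vector<double>().swap(buf[i]); // free the buffer of d_i
    }
};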
Fig. 13. PAT of Strassen matrix multiplication on a multicore architecture.
Fig. 14. Performance of the Strassen algorithm on a multi-core system: (A) wallclock time (in seconds), (B) speedup, and (C) parallel efficiency, each versus the number of threads. In the legend, 4000 means a square matrix of size 4000 × 4000, and likewise for 5000. The data type is double-precision floating-point.
3.3. Cholesky matrix inversion

The Cholesky matrix inversion algorithm needs more functions and more complex steps than the Strassen algorithm. Optimizing the parallelism of this algorithm is challenging [27–29], not to mention achieving optimal concurrent computing. We use DMP to illustrate an effective tiled Cholesky matrix inversion algorithm (4 × 4 tiles) on multicore platforms for computing the inverse of a symmetric positive definite matrix S [27,28]. The Cholesky inversion algorithm involves three successive steps (Fig. 17): Cholesky factorization to compute L such that S = LLT; inversion of the lower triangular matrix to obtain L−1; and the product of the lower triangular matrices to obtain S−1 = (L−1)T L−1.
Fig. 15. Performance comparison between the naïve (MM) and Strassen algorithms: (A) wallclock time (in seconds) versus the number of threads, (B) performance improvement ratio of Strassen over the naïve algorithm versus the number of threads, and (C) peak memory use (in MB) for different matrix sizes (4000 and 5000) versus the number of threads. The data type is double-precision floating-point. In (A) and (B), the matrix size is 5000. In (C), "size = 5000" means a matrix size of 5000, and likewise for "size = 4000".
Fig. 16. Data module partition for Cholesky inversion (4 × 4 tiles).
One native approach (denoted "native") is to perform these three steps sequentially; this is naturally a drawback for parallelization since the three steps are kept separate. To deliver better parallelism, one has to interleave the steps while adhering to the task dependencies within each step and between successive steps (this approach is denoted "interleaved"). In TMP, this goal has been achieved through a thorough critical-path approach [27]. In DMP, where the intricate task dependencies are embodied in the data dependencies, we achieve the same goal naturally by summing the data dependence matrices of the three steps, and we can then validate the tightness of the lower bounds of the Cholesky matrix inversion algorithm in the PAT and in the programs.

Fig. 16 shows the data module partition of the Cholesky inversion algorithm (4 × 4 tiles). Fig. 17 shows, in its first row, the task dependency graphs (TDG) of the Cholesky factorization, inversion and multiplication steps following [27,30]; in its second row, the data dependency graph for each step; and, in its third row, the corresponding data dependency matrices. Using TMP, producing an optimal interleaved task dependency graph from the three separate graphs (Fig. 17) is a tough challenge that was considered in previous works [27,30]. Using DMP, we produce the final interleaved data dependency matrix by simply summing the three data dependency matrices, as shown in Fig. 18. Fig. 19 intuitively shows the performance difference between the native and interleaved approaches for the Cholesky matrix inversion algorithm. In the numerical experiments, Fig. 20 shows the performance of the interleaved approach on the multicore computer and Fig. 21 compares the performance of the native and interleaved approaches. These results show the following:

(1) The algorithmic analysis in Fig. 19 and the timing results in Fig. 21 predict the dramatic performance improvement of the interleaved approach over the native approach for the same algorithm. As the number of threads increases, the performance improves coherently toward the maximal parallelism.

(2) The data dependency matrix (Fig. 18) is the only necessary input argument to the program. The dependencies are guaranteed by DMP, and complex management of task dependencies is unnecessary. In this respect, DMP is effective at reducing programming complexity.

(3) DMP gives a stable performance for the Cholesky matrix inversion algorithm. Fig. 20 shows that the timing results, speedups and parallel efficiencies of the program behave the same for different matrix sizes; DMP is a reliable performer for complex algorithms.

(4) Memory management is more complex in this case than for the S-method, and it is more challenging than the task management. Here, we use the dense matrix format to buffer each submatrix. Fig. 21(C) shows how the peak memory utilization (in MB) changes with the number of threads. The results reaffirm that a multi-threaded program stresses the memory space more than a single-threaded one, because more intermediate results need to be buffered concurrently. Once the maximal performance is achieved, the peak memory usage also becomes stable.
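The interleaving step above reduces, in DMP, to combining the three per-step data dependence matrices into one. A minimal sketch of that combination is given below; the 0/1 clipping is our own convention (an entry is set whenever any of the three steps introduces the dependence), not necessarily the exact operation used to produce Fig. 18.

#include <vector>

// Merge the data dependence matrices of the three Cholesky-inversion steps
// (factorization, triangular inversion, triangular product) into one interleaved
// matrix: entry (i, j) is 1 if any step makes d_j depend on d_i.
std::vector<std::vector<int>> interleave(
        const std::vector<std::vector<std::vector<int>>>& steps) {
    size_t n = steps.front().size();
    std::vector<std::vector<int>> merged(n, std::vector<int>(n, 0));
    for (const auto& a : steps)                       // sum over the step matrices
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                if (a[i][j] == 1) merged[i][j] = 1;   // clip the sum to 0/1
    return merged;
}

The merged matrix feeds the same scheduler as before, so tasks from different steps start as soon as their relevant sets are complete, which is exactly the interleaving that TMP obtains only through critical-path analysis.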
Fig. 17. The task dependency graphs from [27], data dependency graphs, and data dependency matrices (rows) of the three steps of the Cholesky matrix inversion algorithm (columns): Step 1, Cholesky factorization A = LLT; Step 2, triangular inversion S = L−1; Step 3, multiplication A−1 = STS.
These benefits might inspire a solution for untangling complex data structures and dependencies in computer simulators [31], in which atoms could be clustered into encapsulated data modules and multiple timestepping sizes could be employed for individual atom clusters [32]. Collectively, the improved programmability of DMP might facilitate the modeling of complex biological systems on modern parallel architectures [33].

4. Discussions and conclusions

The greatest disadvantage of TMP is the potential for deadlocks that may result if task dependencies are not properly constrained. DMP automatically eradicates this problem because an antecedent data module always appears in the system before its dependent data modules and vanishes as soon as the requests for it are satisfied. Furthermore, processors in the DMP approach never wait for data from others, because they work independently on the locally buffered data modules. For this reason, DMP is superior to TMP at handling intricate task dependence constraints.

The second benefit of the DMP approach is its strength in handling algebraic problems. In TMP, pipelining the Cholesky matrix inversion algorithm requires a thorough analysis of the critical paths in the Cholesky factorization, matrix inversion and multiplication and then compacting the intertwined tasks that can be parallelized at every opportunity (Section 3.3); the performance is improved at the cost of increased algorithmic and programming complexity. In DMP, simply summing the three data dependence matrices produces the final data dependency matrix, which automatically starts computing tasks when the requested data have arrived.

The third benefit is that DMP allows dynamic loading strategies to be developed. In Section 3.2, each processor is assumed to load three initial submatrices (the memory requirement of [24]). As the program starts, a processor may request an initial submatrix from other processors.
Fig. 18. Data dependence matrix of the Cholesky inversion algorithm.
Fig. 19. PAT of the Cholesky inversion algorithm on the multicore architecture.
For example, processor #7 requires the submatrix B22, which is already held by #1, #3 and #5. In this case, a dynamic loading strategy is necessary to obtain the data from the earliest available processor. This is automatically assured in DMP, because the processor that obtains B22 earliest will proactively send a query to #7 for it. In TMP, however, such a dynamic loading strategy has to be handled manually.
Fig. 20. Performance of the interleaved approach for the Cholesky inversion algorithm on a multi-core system: (A) wallclock time (in seconds), (B) speedup, and (C) parallel efficiency, each versus the number of threads. In the legend, 4000 means a square matrix of size 4000 × 4000, and likewise for 5000. The data type is double-precision floating-point.
Fig. 21. Performance comparison between the native and interleaved approaches for the Cholesky matrix inversion algorithm: (A) wallclock time (in seconds) versus the number of threads, (B) performance improvement ratio of the interleaved approach over the native approach versus the number of threads, and (C) peak memory use (in MB) for the interleaved approach versus the number of threads (all submatrices are stored as dense matrices). The data type is double-precision floating-point. In (A) and (B), the matrix size is 5000. In (C), "size = 5000" means a matrix size of 5000, and likewise for "size = 4000".
We may develop the DMP formulation into a data mapping language (DML). If an algorithm is described in DML, it can be effectively translated to machine language together with the sequential processing part and the network data flow controlling parts.

No single mapping paradigm is the best representation of every parallel algorithm. Task-oriented and data-oriented mapping paradigms and task scheduling strategies have been proposed and have demonstrated superior utilization. The data-based mapping paradigm (DMP) continues and extends these discoveries.

Appendix A. Implementing DMP on a multicore architecture

We present a software package prototyping the DMP implementation on a multicore architecture. The program works in multi-threaded parallel mode and allows the user to vary the number of threads. Fig. A.1 shows the flowchart of the program. The software is available as the supplemental data to the paper. To demonstrate the applications, the software includes all of the necessary computing routines for the matrix–matrix (MM) multiplication and Cholesky inversion algorithms, together with the data mapping matrices for the naïve and Strassen matrix multiplications (2 × 2 tiles) and for the Cholesky inversion algorithm (4 × 4 tiles). This helps validate the applicability of DMP by parallelizing the Strassen and Cholesky inversion algorithms.

In the software implementation, concurrent computing is managed by the thread pool pattern, shown as the FIFO thread pool in Fig. A.1 and coded based on the Boost Pool Library. A specified number of worker threads are created to perform tasks from a first-in first-out queue, the so-called thread pool. As soon as a thread completes its task, it requests the next task from the queue, until all tasks have been completed. The number of threads is a parameter that can be tuned for the best performance. A buffer is allocated to contain all of the available data modules (marked as "buffered data modules" in the figure).
Fig. A.1. Flowchart of the program.
Fig. A.2. Performance of the Strassen multiplication algorithm under different task-prioritizing policies (vertical axis: wallclock time in seconds; horizontal axis: task-prioritizing policy).
A separate thread works in a loop to check the availability of new tasks based on the availability of data modules. A new task is available if and only if all of its required data modules are ready; once a task is available, it is inserted into the thread pool for execution. Here, the FIFO policy is used for the pool. We have also tested the random and most-dependencies policies for the pool: Figs. A.2 and A.3 compare the performance of the Strassen multiplication and Cholesky inversion algorithms under the different task-prioritizing policies FIFO, RAND (random) and MDEP (most dependencies). The average values and standard deviations are calculated from five samples of each case. These results consistently show that the task-prioritizing policy has no significant impact on the performance of the tests in this work; thus, we use the FIFO policy for the pool throughout.

In addition to concurrent computing, memory space is also dynamically managed and optimized for minimal use. Actual memory space is not allocated to a data module until the first time the module is required by another module, and the memory attached to a data module is deallocated as soon as no other module depends on it. In this manner, dynamic memory allocation and deallocation are managed automatically.

Further, Fig. A.4 shows an efficiency analysis of the scheduler for the tasks of the Strassen algorithm using seven concurrent threads. Efficiency is calculated, as a percentage, as the ratio of busy time to total time for each thread, with wallclock time measured in seconds. The average efficiency of all participating threads is 95.75% and the standard deviation is 3.77%, which is relatively small. This demonstrates that the scheduler handles concurrent computing efficiently.
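In symbols, the per-thread efficiency used in Fig. A.4 and its summary over the P participating threads can be written as follows; the notation is ours, and the population form of the standard deviation is shown (the paper does not specify which form it uses):

\[
E_t = \frac{T^{\mathrm{busy}}_t}{T^{\mathrm{total}}_t} \times 100\%, \qquad
\bar{E} = \frac{1}{P}\sum_{t=1}^{P} E_t, \qquad
\sigma_E = \sqrt{\frac{1}{P}\sum_{t=1}^{P}\left(E_t - \bar{E}\right)^2},
\]

which, for the seven-thread Strassen run reported above, gives the values 95.75% and 3.77% quoted in the text.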
Fig. A.3. Performance of the Cholesky inversion algorithm under different task-prioritizing policies (vertical axis: wallclock time in seconds; horizontal axis: task-prioritizing policy).
Fig. A.4. Efficiency analysis for scheduling tasks of Strassen algorithm (matrix size = 2000 × 2000, partition = 4 × 4 tiles, 7 work threads).
References

[1] TOP500, Top 500 Supercomputer Site. Available: http://www.top500.org.
[2] N.R. Adiga, M.A. Blumrich, D. Chen, P. Coteus, A. Gara, M.E. Giampapa, P. Heidelberger, S. Singh, B.D. Steinmacher-Burow, T. Takken, M. Tsao, P. Vranas, Blue Gene/L torus interconnection network, IBM J. Res. Dev. 49 (2005) 265–276.
[3] R. Brightwell, K.T. Pedretti, K.D. Underwood, T. Hudson, SeaStar interconnect: balanced bandwidth for scalable performance, IEEE Micro 26 (2006) 41–57.
[4] Y. Ajima, S. Sumimoto, T. Shimizu, Tofu: a 6D mesh/torus interconnect for exascale computers, Computer 42 (2009) 36–40.
[5] P. Zhang, R. Powell, Y. Deng, Interlacing bypass rings to torus networks for more efficient networks, IEEE Trans. Parallel Distrib. Syst. 22 (2011) 287–295.
[6] Y. Deng, P. Zhang, C. Marques, R. Powell, L. Zhang, Analysis of Linpack and power efficiencies of the world's TOP500 supercomputers, Parallel Comput. 39 (2013) 271–279.
[7] S.H. Bokhari, On the mapping problem, IEEE Trans. Comput. 30 (1981) 207–214.
[8] S.Y. Lee, J.K. Aggarwal, A mapping strategy for parallel processing, IEEE Trans. Comput. 36 (1987) 433–442.
[9] P. Sadayappan, F. Ercal, Nearest-neighbor mapping of finite element graphs onto processor meshes, IEEE Trans. Comput. C-36 (1987) 1408–1424.
[10] J.M. Orduna, F. Silla, J. Duato, A new task mapping technique for communication-aware scheduling strategies, in: International Conference on Parallel Processing Workshops, 2001, pp. 349–354.
[11] B.E. Smith, B. Bode, Performance effects of node mappings on the IBM BlueGene/L machine, in: Euro-Par 2005 Parallel Processing, LNCS 3648, 2005, pp. 1005–1013.
[12] G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J.C. Sexton, R. Walkup, Optimizing task layout on the Blue Gene/L supercomputer, IBM J. Res. Dev. 49 (2005) 489–500.
[13] H. Yu, I.-H. Chung, J. Moreira, Topology mapping for Blue Gene/L supercomputer, in: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, Tampa, Florida, 2006.
[14] T. Agarwal, A. Sharma, A. Laxmikant, L.V. Kale, Topology-aware task mapping for reducing communication contention on large parallel machines, in: 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006, p. 10.
[15] Y. Chen, Y. Deng, Task mapping on supercomputers with cellular networks, Comput. Phys. Commun. 179 (2008) 479–485.
[16] T. Mattson, B. Sanders, B. Massingill, Patterns for Parallel Programming, Addison-Wesley Professional, 2004.
[17] T. Eidson, J. Dongarra, V. Eijkhout, Applying aspect-oriented programming concepts to a component-based programming model, in: International Parallel and Distributed Processing Symposium, 2003, p. 7.
[18] P. Zhang, Y. Gao, J. Fierson, Y. Deng, Eigen analysis-based task mapping on parallel computers with cellular networks, Math. Comput. 83 (2014) 1727–1756.
[19] P. Zhang, Y. Deng, R. Feng, X. Luo, J. Wu, Evaluation of various networks configurated by adding bypass or torus links, IEEE Trans. Parallel Distrib. Syst. 26 (4) (2015) 984–996, doi:10.1109/TPDS.2014.2315201.
[20] J. Choi, J.J. Dongarra, D.W. Walker, Parallel matrix transpose algorithms on distributed memory concurrent computers, in: Proceedings of the Scalable Parallel Libraries Conference, 1993, pp. 245–252.
[21] J.T. Padding, W.J. Briels, Systematic coarse-graining of the dynamics of entangled polymer melts: the road from chemistry to rheology, J. Phys. Condens. Matter 23 (2011) 233101.
[22] J.C. de Kergommeaux, B. Stein, P.E. Bernard, Paje, an interactive visualization tool for tuning multi-threaded parallel applications, Parallel Comput. 26 (2000) 1253–1274.
[23] W.E. Nagel, A. Arnold, M. Weber, H.C. Hoppe, K. Solchenbach, VAMPIR: visualization and analysis of MPI resources, Supercomputer 12 (1996) 69–80.
[24] C.C. Chou, Y.F. Deng, G. Li, Y. Wang, Parallelizing Strassen's method for matrix multiplication on distributed-memory MIMD architectures, Comput. Math. Appl. 30 (1995) 49–69.
[25] S. Huss-Lederman, E.M. Jacobson, J.R. Johnson, A. Tsao, T. Turnbull, Implementation of Strassen's algorithm for matrix multiplication, in: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, 1996, p. 32.
[26] B. Dumitrescu, J.-L. Roch, D. Trystram, Fast matrix multiplication algorithms on MIMD architectures, Parallel Algorithms Appl. 4 (1994) 53–70.
[27] S. Tomov, R. Nath, H. Ltaief, J. Dongarra, Dense linear algebra solvers for multicore with GPU accelerators, in: 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010, pp. 1–8.
[28] J. Kurzak, A. Buttari, J. Dongarra, Solving systems of linear equations on the CELL processor using Cholesky factorization, IEEE Trans. Parallel Distrib. Syst. 19 (2008) 1175–1186.
[29] E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou, L. Rosenberg, Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures, in: J.L. Palma, M. Daydé, O. Marques, J. Lopes (Eds.), High Performance Computing for Computational Science – VECPAR 2010, LNCS 6449, Springer, Berlin, Heidelberg, 2011, pp. 129–138.
[30] H. Bouwmeester, J. Langou, A critical path approach to analyzing parallelism of algorithmic variants. Application to Cholesky inversion, arXiv:1010.2000, 2010.
[31] P. Zhang, C. Gao, N. Zhang, M. Slepian, Y. Deng, D. Bluestein, Multiscale particle-based modeling of flowing platelets in blood plasma using dissipative particle dynamics and coarse grained molecular dynamics, Cell. Mol. Bioeng. (2014) 1–23.
[32] P. Zhang, N. Zhang, Y. Deng, D. Bluestein, A multiple time stepping algorithm for efficient multiscale modeling of platelets flowing in blood plasma, J. Comput. Phys. 284 (2015) 668–686, doi:10.1016/j.jcp.2015.01.004.
[33] P. Zhang, J. Sheriff, J.S. Soares, C. Gao, S. Pothapragada, N. Zhang, Y. Deng, D. Bluestein, Multiscale modeling of flow induced thrombogenicity using dissipative particle dynamics and coarse grained molecular dynamics, in: ASME 2013 Summer Bioengineering Conference, 2013, V01BT36A002.