J. Parallel Distrib. Comput. 73 (2013) 1649–1660

Generating data transfers for distributed GPU parallel programs

F. Silber-Chaussumier*, A. Muller, R. Habel

Institut Mines Télécom, Télécom SudParis, Computer Science Department, 91011 Évry, France

Highlights

• We automatically generate heterogeneous communications for distributed-memory architectures.
• Communication generation is based on static compiler analysis and runtime decisions.
• Accurate heterogeneous communications are generated for regular applications.
• Heterogeneous communications deal with accelerator-based GPU data transfers and message-passing for transfers between CPUs.

Article info

Article history:
Received 16 October 2012
Received in revised form 2 May 2013
Accepted 19 July 2013
Available online 26 August 2013

Keywords: Distributed memory; Data transfer; Source-to-source transformation; Parallel execution; Compiler directives; GPU

Abstract

Nowadays, high-performance applications exploit multi-level architectures, due to the presence of hardware accelerators like GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node between the CPU and the accelerators, and between computing nodes. We consider the case where the intra-node parallelism is handled with HMPP compiler directives and message-passing programming with MPI is used to program the inter-node communications. This way of programming such a heterogeneous architecture is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into heterogeneous HMPP + MPI programs exploiting multiple GPUs located on different computing nodes. The STEP tool focuses on generating communications, combining both powerful static analyses and runtime execution to reduce the volume of communications. Our source-to-source transformation is implemented inside the PIPS workbench. We detail the generated source program of the Jacobi kernel and show that the execution times and speedups are encouraging. Finally, we give some directions for the improvement of the tool.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

While OpenMP [25] and MPI [23] still remain de facto standards to target traditional shared-memory and distributed-memory architectures, different programming languages have emerged to exploit hardware accelerators like GPUs (Graphics Processing Units), for instance CUDA (Compute Unified Device Architecture) proposed by NVIDIA and OpenCL (Open Computing Language). The high-performance computing community once again faces the dilemma of writing highly tuned programs while targeting several possible parallel architectures. Development costs are then very high, and parallel programs must be rewritten for new heterogeneous clusters that combine multi-cores and accelerators. Programming with compiler directives has re-emerged as a promising alternative to low-level accelerator programming. Different sets of directives are currently available to program accelerators, for instance HMPP [11], PGI [28] and OpenACC [30], expressing kernel parallelism and data transfers.




Compiler directives for accelerators are designed to exploit a single accelerator-based computing node. To execute parallel programs on multiple computing nodes, message-passing primitives must be inserted into the program annotated with accelerator directives in order to deal with data transfers between computing nodes. The combination of high-level and low-level programming in the resulting heterogeneous program brings back the drawbacks of low-level programming: it is costly and error-prone. In this context we propose our STEP (Système de Transformation pour l'Exécution Parallèle: Transformation System for Parallel Execution) tool. Based on the success of compiler directives for exploiting one computing node, we propose to generate message-passing programs from input parallel programs annotated with compiler directives, in order to exploit multiple computing nodes. We focus on regular parallel programs whose parallelism can be expressed with parallel loop directives. The objective is to automatically generate the low-level message-passing primitives from the high-level expression of parallelism given by the user. This decreases the development cost of parallel programs.


Fig. 1. The STEP transformation.

Our current prototype targets two different distributed-memory architectures: clusters of CPUs and clusters of GPUs. For clusters of CPUs (Central Processing Units), input programs are written with OpenMP directives. For clusters of GPUs, input programs are written with HMPP directives. In both cases the generated output program contains both compiler directives and MPI message-passing primitives to exploit multiple computing nodes.

Compared to the existing parallel programming models based on compiler directives, our STEP transformation (see Fig. 1) focuses on the generation of communications. The partitioning of computations is entirely specified by the user, with the OpenMP for directive and its schedule clause for CPU programs and with the HMPP codelet and gridify directives for GPU programs. Compilation technology is thus specifically used to generate communications, not for parallelism detection or for parallelism specification such as data or computation distribution. In our prototype, the distribution of data is managed through a distribute directive close to the HPF one. Combining the expression of both data and computation allows our STEP tool to generate efficient programs. Implemented inside the PIPS compiler workbench, the generation of communications makes extensive use of static dataflow analyses, combined with runtime execution to reduce the volume of communications.

This paper is organized as follows: Section 2 describes the compiler techniques for parallel directive transformation and compares the existing compiler directives for parallel programming with our STEP programming model proposal. The PIPS workbench is described in Section 3. Section 4 describes the design of our STEP tool. Section 5 then describes the experiments we have made with the Jacobi kernel and comments on the results. Section 6 suggests some directions for improvement of the STEP tool and concludes.

2. Compiler techniques and compiler directives for parallel programming

2.1. Compiler techniques

In the past, efficiently exploiting vector machines led to powerful vectorization techniques in compilers and to evolutions in programming languages. While vectorization is still widely used, MIMD (Multiple Instruction Multiple Data) distributed-memory machines have become popular since the 1990s. The vast automatic-parallelization and HPF-standardization efforts showed that the dataflow analyses developed for vectorization are directly useful for MIMD distributed-memory machines. Nevertheless, the dataflow analysis requirements are stronger for MIMD distributed machines than for vectorization. In this section, we describe these requirements.

Dataflow analysis for distributed-memory machines is useful to detect parallelism, as for vectorization, but also to generate communications between the processes in order to exchange up-to-date data.

In parallel programs, data accesses can trigger communications and thus result in significant overhead. Dataflow analysis becomes crucial for this kind of architecture to reduce both the volume and the number of communications.

The main vectorization techniques [26,19,5,20], such as data dependence tests, statement reordering, reduction recognition, data alignment and loop transformations, work at a fine-grain loop level both to detect parallelism and to generate efficient vectorized code. Most of these techniques rely on an accurate dataflow analysis that determines which variables may be used or modified in a given part of the code (statement, block of statements, procedure).

MIMD distributed-memory machines allow coarse-grain parallelism. The computation/communication ratio must be high in order to obtain an efficient parallel program. Thus, to detect parallelism or generate parallel code, the dataflow analysis must be accurate on coarse-grain blocks of statements, particularly those containing function calls and control branches. Finally, the memory organization has an impact on the whole program: when compiling for distributed-memory architectures, the whole program needs to be analyzed, whereas vectorization can be performed on small parts of the program. This increases the complexity of the dataflow analysis.

Interprocedural dataflow analysis: interprocedural dataflow analysis consists in computing summary data that identify the effects of a procedure call [2]. Summary data must then be translated at call sites, dealing with aliasing, effective parameters from different calling contexts and array reshaping.

Array access representation: because most high-performance applications that target distributed-memory architectures are based on massive data arrays, data analysis must precisely identify which elements inside an array are accessed in order to reduce the volume of communications. The main idea is to avoid ending up with either "whole array" or "none of the array" when computing summary data at the call point of a procedure. Such imprecise results would prevent parallelism detection and/or efficient code generation, especially when generating communications. Havlak and Kennedy [14] summarize different ways of representing array access sets. The challenge is to find a representation for subarrays which allows any kind of array access to be precisely represented, whose size is independent of the number of array elements and which allows fast operations. Array region approximations combine a compact representation and fast operations. The array regions used in the PIPS tool (detailed in Section 3.3) as well as "regular section" array regions are presented in [14]. In addition, to identify subarray regions, subscript expressions must be analyzable; non-linear subscript expressions and subscripts containing array references are difficult to analyze [27]. When subscripts are unanalyzable, array regions cannot be precisely determined.
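To make the stakes concrete, here is a small C example of ours (not one of the paper's benchmarks) showing why summary data finer than "whole array" matters: a region-based summary of set_column() records that each call writes exactly one column, which is what allows the iterations of the caller loop to be recognized as independent and keeps the corresponding communications down to one column per call.

```c
#define N 256

/* Writes exactly one column of m. A precise interprocedural summary records
 * the effect "WRITE m[0..N-1][j]"; a coarse summary ("writes some or all of m")
 * would force the caller loop to be treated as sequential. */
void set_column(double m[N][N], int j, double v)
{
    for (int i = 0; i < N; i++)
        m[i][j] = v;
}

void init_matrix(double m[N][N])
{
    /* Independent iterations, provided the per-call summaries are exact. */
    for (int j = 0; j < N; j++)
        set_column(m, j, (double)j);
}
```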


Dealing with pointers: modern programming languages provide pointers, which are very often used to define data arrays. In Fortran programs, aliasing appeared because several variables could share the same memory locations at procedure call sites. With pointers, additional analyses must deal with cases where a single variable may refer to different memory locations.

All these compiler techniques are combined with compiler directives in order to generate efficient parallel programs.

2.2. Parallel programming with directives

In addition to the core functionality of the program, parallel programming requires dealing with computation distribution, data distribution and data communication. It is thus often low-level and error-prone. Compiler directives combined with advanced compilation techniques can handle some of these aspects. In this section, we recall the HPF experience and present how the existing parallel directives interact with the compiler.

Table 1 compares the existing directive sets with our proposal on four criteria: expression of parallelism (loop or task), expression of computation distribution, data distribution and data communication. All directive sets propose a way to express parallelism, i.e. none of them relies on a fully automatic parallelization approach. HPF automatically generates computation distribution and data communication. XcalableMP proposes directives to express all four criteria. Directive sets for accelerator-based architectures such as PGI and CAPS HMPP propose directives to express data movements between host and accelerator and do not define any notion of data distribution. The originality of our approach lies in taking into account both computation and data layout to automatically generate data communication. A detailed discussion of each set of directives follows.

High Performance Fortran: models and techniques for distributed-memory compilation resulted in the elaboration of the High Performance Fortran standard in the 1990s. Kennedy et al. [18] stated that "HPF never achieved a sufficient level of acceptance among [...] users". HPF was designed for different classes of parallel machines including MIMD machines. It provides the user with directives expressing information on the parallel iterations of a loop, with the INDEPENDENT directive, and on data distribution, allowing block and block-cyclic distribution of arrays through the distribute directive. Based on this information, the compiler effectively distributes the data, decides the computation distribution and generates the communications. One major drawback of HPF is that proper compilation not "only" required the extensive global analysis of distributions and the generation of communications, but also required the partitioning of computations and optimizations such as redundant computations or communication/computation overlap to reduce the execution time. This too often led to the generation of correct but inefficient programs.

OpenMP: the OpenMP directives [25] are designed for shared-memory architectures and are widely used. Computation distribution is expressed by the user with the C for directive and the schedule clause; the compiler distributes the work among threads. There is no need for data distribution and communication since the standard targets shared-memory architectures. Let us note that for Non-Uniform Memory Access machines, expressing data placement could be useful to improve performance (OpenMP 4.0 may propose the environment variable OPENMP_MEMORY_PLACEMENT to control data placement). The OpenMP standard also provides the task directive, which notably allows parallelism to be exploited in irregular applications.
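As a small illustration of ours, the following C loop shows how parallelism and computation distribution are expressed with the OpenMP for directive and its schedule clause; this is also the form of the input programs that STEP accepts for clusters of CPUs (Section 4).

```c
/* Parallelism and computation distribution expressed by the user: the
 * iterations are shared among threads in static blocks. */
void axpy(int n, double alpha, const double *x, double *y)
{
    int i;
#pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```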
XcalableMP: XcalableMP [33] is a set of directives and routines whose design is based on the experience gained with HPF and OpenMP.


The idea is to propose a programming model for distributed-memory architectures, providing data distribution with the distribute and align directives, data communication with the gmove and bcast directives, and computation distribution with the loop on and task directives. Besides, XcalableMP provides the concept of shadow area of an array, used to maintain the consistency of global data.

The Portland Group compiler: the Portland Group compiler [28] provides directives to specify the regions of code in Fortran and C programs that can be offloaded from a host CPU to an attached accelerator. In this hybrid architecture, data must be transferred from host to accelerator. Parallelism is expressed with the region directive between the host and the accelerator and the for directive inside the accelerator. Data distribution is not explicitly expressed, but data communication is expressed through several clauses: copy, copyin, copyout, mirror, local and update. It is also possible to specify subarrays using range expressions for each array dimension.

CAPS HMPP: the CAPS HMPP product [11] gives programmers a portable interface for developing parallel applications over the available specialized and heterogeneous cores. HMPP provides a set of directives to express which kernels should be offloaded to the accelerator and to express data transfers. Computation distribution is done through the codelet and callsite directives between the host and the accelerator and through the gridify directive inside the accelerator. Data movements between host and accelerator are expressed through the advancedload and delegatedstore directives. As for the PGI compiler, it is also possible to specify subarrays using range expressions for each array dimension.

OpenACC: OpenACC [30] was founded by a group of companies (including the previously presented Portland Group and CAPS), which created the OpenACC Application Program Interface specification to program platforms with accelerators. OpenACC directives include computation distribution between the host and the accelerator through kernel directives and parallelism inside the accelerator with the loop directive. Data distribution is not explicitly expressed, but data communication is expressed through several clauses: copy, copyin, copyout, mirror, local and update. It is also possible to specify subarrays using range expressions for each array dimension.

Our proposal: we propose to generate message-passing MPI programs from OpenMP programs in the case of CPU programs, or from HMPP programs in the case of GPU programs. Our STEP transformation deals only with the generation of communications. The partitioning of computations is entirely specified by the user:

• using the OpenMP for directive with the schedule clause for clusters of CPUs,

• using the HMPP codelet and gridify directives for clusters of GPUs.

Compilation technology is thus specifically used for communication generation, not for parallelism detection nor for parallelism specification such as data or computation distribution. In a currently developed prototype, the distribution of data is managed through a distribute directive like HPF's. Combining the expression of both data and computation allows our STEP tool to generate efficient programs. The STEP tool uses the powerful dataflow analyses provided by the PIPS compiler framework.

3. The PIPS workbench and its dataflow analysis

We implement our STEP transformation inside the PIPS workbench. The PIPS (Paralléliseur Interprocédural de Programmes Scientifiques) workbench [16,15] targets real programs written in the Fortran or C languages and provides powerful code transformations and interprocedural analyses.


Table 1
Comparing the expressiveness of the sets of directives and compiler actions.

                                     Parallelism   Computation distribution   Data distribution   Data communication
HPF (distributed-memory)             Loop          Compiler generated         Yes                 Compiler generated
OpenMP (shared-memory)               Loop/task     Yes                        No (a)              No (a)
XcalableMP (distributed-memory)      Loop/task     Yes                        Yes                 Yes
PGI / HMPP / OpenACC (accelerator)   Loop          Yes                        No (b)              Yes
Our proposal (distributed-memory)    Loop          Yes                        Yes (c)             Compiler generated

(a) No data management because the target architectures are shared-memory.
(b) Can be deduced from the data communication directives; not automatically handled.
(c) Currently developed.

Originally designed for Fortran programs, PIPS has been extended to the C language [8]. Its interprocedural analyses were designed to investigate automatic parallelization. In this section, the computation of interprocedural array regions is described. Many other analyses as well as program transformations are provided inside the PIPS framework but are not described here.

3.1. Interprocedural PIPS simple effects

Effects describe the memory operations performed by a given part of the code. In PIPS, interprocedural dataflow analysis [32] is based on two levels of effects: statement and procedure levels. Statement effects are lists of read or written variables. If the statement is a call to a procedure, then the corresponding statement effect is the translation of the procedure effect of the called procedure into the context of the caller. Procedure effects gather into a list of read or written variables the statement effects of the procedure statements, but the effects performed on variables local to the module are ignored.

For each procedure, the computation of statement and procedure effects is performed in a bottom-up manner over the call graph for READ, WRITE and IN effects, and in a top-down manner for OUT effects. This is possible because PIPS assumes that the program contains no recursion, so the call graph is a directed acyclic graph. The idea is that the effects of a given procedure are computed after those of all the procedures it calls; as a consequence, the last procedure handled is the main program.

3.2. Interprocedural PIPS predicates: transformers and preconditions

To capture the execution context, PIPS also computes predicates, represented by affine equations between variables, that express constraints on integer scalar variables. As these variables are often used as loop indices or array indices, the predicates help refine the analyses, particularly cumulated effects at loop or procedure level. Two kinds of predicates are used in PIPS.

Transformer predicates capture the transformation of variables at statement or procedure level. Transformers are computed in a bottom-up manner over the call graph. A transformer can be a summary transformer, corresponding to information gathered at the level of the procedure body after discarding information about local variables. When a call to a procedure is encountered, the summary transformer of the procedure is translated into the caller's context. More details about transformer computations can be found in [1].

Precondition predicates hold just before a statement is executed. They are computed in a top-down manner over the call graph by recursively applying transformers to preconditions. Details about these predicate computations can be found in [15].
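As an illustration (a sketch written for this text, not actual PIPS output, and the notation only loosely follows PIPS conventions), the comments below show the kind of affine information carried by transformers and preconditions on a small C fragment:

```c
void zero(double a[], int n)
{
    int i = 0;              /* transformer: i == 0                  */
    while (i < n) {         /* precondition at the test: 0 <= i     */
        a[i] = 0.0;         /* precondition: 0 <= i, i <= n - 1     */
        i = i + 1;          /* transformer: i == i#init + 1         */
    }
    /* precondition after the loop: 0 <= i, n <= i                  */
}
```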

3.3. Convex polyhedral array regions

For a given array, effects are approximated by array regions as described in Section 2.1. An array region is approximated by a convex polyhedron, represented in PIPS by a system of affine inequalities [32]. For a given statement, the values of the array indices corresponding to accessed elements are represented by a Φ symbol for each array dimension (the "PHIs"). Inequalities are built from the Φ symbols and the subscript expressions. For accurate array regions, subscript expressions must be affine combinations of integer scalars. In addition, some knowledge about the execution context, capturing the values of scalar variables and represented with predicates (see Section 3.2), is taken into account for a better approximation; these affine equations are added to the system representing the array region. Convex polyhedra are suitable to represent diagonals, upper and lower triangular parts, and blocks of matrices. In other cases, this representation may lose information: for instance, when representing even parity in an index expression, the corresponding convex polyhedron is the entire array.

Two drawbacks of the array region representation can be mentioned. First, the information is not always precise, but it can be exploited conservatively at the expense of some performance. Second, computing the convex hull to aggregate regions can be exponential. In practice, this limitation has not appeared for our purpose yet.
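For example (an illustration of ours, using the Φ notation informally), the elements of a matrix A read by a loop nest with 1 ≤ i ≤ n and i ≤ j ≤ n, accessing A[i][j], form its upper triangular part and are captured exactly by the convex region

<A[Φ1][Φ2]-READ-{1 ≤ Φ1, Φ1 ≤ n, Φ1 ≤ Φ2, Φ2 ≤ n}>

whereas an access such as A[2*i], which touches only even indices, can only be over-approximated by a region covering the whole dimension.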

3.4. Interprocedural array regions

For a given array, the resulting convex polyhedron is first computed at the statement level. For a simple statement, statement array regions can be MUST if all elements are certainly accessed or MAY if elements are only potentially accessed. If the statement is a call to a procedure, then the corresponding statement array region is the translation of the procedure array region of the called procedure into the context of the caller. To compute a procedure array region, the union of the statement array regions is performed over all the statements of the procedure. The union consists in computing the convex hull of the corresponding convex polyhedra, which can lead to an over-approximation of the array region. Over-approximation generates MAY regions, under-approximation generates MUST regions; otherwise regions are EXACT [1,31]. READ, WRITE and IN array regions are built bottom-up, intra- and interprocedurally, from elementary statements in leaf procedures of the call graph to the largest compound statement in the main program. OUT array regions are built in a top-down manner.

3.5. Translating array regions at call sites

The translation of array regions at call sites is very important for the interprocedural dataflow analysis. As a matter of fact, array arguments and formal parameters do not always conform. A first simple case is when the formal array fits in a subarray of the argument; for instance, a formal vector can be recognized as a matrix column. This translation has been improved to take into account offsets in declarations, offsets in the references at the call site and array reshaping [9].

3.6. PIPS in practice


In practice, PIPS is structured in phases that are applied to program modules which may use and/or produce resources [15].


Fig. 2. The STEP transformation uses PIPS resources as input and produces new resources: SEND and RECV array regions.

In order to generate communications in our STEP tool, we will use the following two different resources:

• convex effects: READ, WRITE, IN and OUT array regions;
• the dependence graph, called dependence chains when it is computed by the PIPS region chain phase. Nodes represent statements and each edge represents a dependence between two statements. Edges are labeled with the memory effects of the two linked statements; memory effects can be simple effects representing variables or convex effects representing array regions.

Resources and phase ordering: as explained above, computing array regions relies on the precondition and transformer resources. For the user, determining which resources to compute, and when, can be difficult. In practice, the phase ordering is automatically deduced by PIPS from the resource dependences declared in the phase specification file (the pipsmake-rc.tex file) [16]. Fig. 2 illustrates the input and output resources of the STEP transformation. Our STEP transformation is based on the READ, WRITE, IN and OUT array regions as well as the dependence graph. In the phase specification file, we declare these resources as inputs of our transformation; PIPS then automatically computes the necessary resources, such as simple effects, preconditions and transformers, and schedules the different phases.

4. STEP: transformation system for parallel execution

In the PIPS workbench, we propose the STEP transformation, focusing on the generation of communications for distributed-memory machines. It generates MPI programs from parallel programs annotated with OpenMP or HMPP directives.

4.1. The STEP programming model

As STEP focuses on generating communications, the partitioning of computations as well as the data distribution are specified by the user in the input programs. We target two different parallel architectures: clusters of CPUs and clusters of GPUs. For clusters of CPUs, input programs express computation partitioning with the OpenMP for directive and the schedule clause; the STEP tool generates an OpenMP + MPI program. For clusters of GPUs, input programs express computation partitioning with the HMPP codelet and gridify directives; the STEP tool generates HMPP + MPI programs.


In a currently developed prototype, the distribution of data is managed through a distribute directive expressing a distribution pattern, as in HPF. Combining the expression of both data and computation allows our STEP tool to generate efficient programs.

4.2. STEP dataflow analysis

Using the powerful dataflow analysis provided by the PIPS compiler workbench, STEP computes two new PIPS resources in order to determine the necessary communications between worksharing regions: SEND array regions and RECV array regions. To do so, STEP uses all the types of array region resources available in PIPS: READ, WRITE, IN and OUT array regions at statement, compound statement and procedure level.

SEND array regions: the SEND array regions of a given worksharing region represent the array regions updated inside the worksharing region which are used afterwards. They are thus included in the OUT array regions. Nevertheless, due to the PIPS interprocedural array region analysis, the OUT region of an array A that is a parameter of a procedure P refers to the region of A updated (and used afterwards) by all the calls to P throughout the whole program, whereas the SEND region of A corresponds to only one call to P. Thus the SEND region of array A for procedure P is the intersection of the OUT region and the WRITE region of A: SEND(A) = OUT(A) ∩ WRITE(A).

Fig. 3 represents a parallel loop updating a 1D array A on three processes. Computations (loop iterations) are distributed by blocks over the processes, and loop iterations directly map to accesses of array A. SEND array regions are thus blocks of array A described by a constraint system depending on the loop bounds on each process; such a constraint system is represented on the right part of Fig. 3. A SEND array region is EXACT when the analysis can precisely represent the array region. Otherwise the SEND array region is approximated (SEND-MAY); in this case, the SEND array region is entirely considered as modified and potentially sent to other processes, and an inspector/executor method is used so that only the actually modified data are updated.
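As an illustration of ours, written with the Φ notation of Section 3.3: if process p executes the block of iterations lb_p ≤ i ≤ up_p of the loop of Fig. 3 and iteration i writes A[i], the SEND region of A on process p is the block

<A[Φ1]-SEND-EXACT-{lb_p ≤ Φ1, Φ1 ≤ up_p}>

so that, with two processes and N iterations split into halves, process 0 sends {Φ1 | 0 ≤ Φ1 ≤ N/2 − 1} and process 1 sends {Φ1 | N/2 ≤ Φ1 ≤ N − 1}.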


Fig. 3. SEND and RECV array regions and corresponding communications in full and partial communication modes between two worksharing directives.


Fig. 4. Interprocedural propagation of SEND and RECV array regions.

RECV array regions: the RECV array regions of a given procedure represent the array regions that are used by the computation. In the common case, RECV array regions are the PIPS IN array regions. If the SEND array region of the worksharing region is approximated (SEND-MAY), it must also be updated before the computation in order to later transfer up-to-date data. Thus the RECV array region is computed as follows: RECV(A) = IN(A) ∪ SEND-MAY(A).

Interprocedural propagation of SEND and RECV array regions: we perform interprocedural propagation of SEND and RECV array regions in a bottom-up manner over the call graph. At each call site, a list of translated SEND and RECV array regions is built. Each array region at the module level is associated with the corresponding array region at the statement level; it is thus not a cumulated resource as PIPS defines them, since cumulated resources gather information without keeping track of the original statements. In Fig. 4, the statement S1 contains the worksharing region. SEND and RECV array regions are computed for the statement S1 and propagated to the call site of the function FUNCTION(). This propagation provides the association between the call site and the statement containing the worksharing region; A′ is the translation of A in the caller context. This technique allows us to perform an interprocedural analysis in order to accurately determine the communication mode, and to attach the communication mode down to the statement containing the worksharing region.

4.3. The STEP execution model

Our STEP tool is based on a simple parallel execution model for clusters of CPUs, which is then adapted to clusters of GPUs.

Execution model on a cluster of CPUs: each MPI process executes the sequential code (between parallel sections) together with its share of the workshare constructs; data are updated after each workshare construct by means of communications, so as to be usable later on. Related work on distributed execution of OpenMP programs [29,3,12,17,7,4] is described in [21].

Execution model on a cluster of GPUs: this execution model is adapted to a cluster of GPUs: each codelet is executed in parallel on several GPUs. Data are updated after each parallel codelet execution by means of communications, first between the GPU and the host CPU, then between CPUs, and finally between the host CPU and the GPU. Fig. 5 illustrates two successive HMPP codelet executions on one computing node (including host and GPU). After the STEP transformation, the program is executed on two computing nodes and the codelets are parallelized on two GPUs. After the parallel executions of the codelets, data transfers are generated.
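The per-node structure of this execution model can be sketched as follows (an illustrative fragment written in the style of Listing 3, not STEP-generated code; kernel, A, iterations and exchange_up_to_date_data() are hypothetical placeholders for a codelet, its data and the generated MPI communications):

```c
for (it = 0; it < iterations; it++) {
    /* 1. run this node's share of the codelet on its own GPU */
#pragma hmpp kernel callsite
    kernel(A);
    /* 2. GPU -> host CPU: bring the updated region back to the host */
#pragma hmpp delegatedstore, args[kernel::A]
    /* 3. CPU <-> CPU: exchange up-to-date data between nodes with MPI */
    exchange_up_to_date_data(A);
    /* 4. host CPU -> GPU: push the received data back before the next step */
#pragma hmpp advancedload, args[kernel::A]
}
```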

4.4. Full communication mode

A simple way to generate correct message-passing programs is to update, after each worksharing region, all the data that have been modified. This is what STEP does after a worksharing region when using the full communication mode: the communicated array regions are equal to the SEND array regions. Fig. 3 illustrates full communications: the SEND array regions of array A are transferred to every process, and A is entirely updated on all nodes after each worksharing region.

Determination of communicated array regions at compile time: in full communication mode, the communicated array regions of a given worksharing region are determined at compile time. This has two main advantages. First, it does not add overhead at execution time. Second, as STEP is a source-to-source transformation, the SEND array regions are described inside the generated source, which allows the programmer to modify the communicated regions in the source; this is useful to reduce the execution time when the generated array regions are not precise enough.

Despite the exact computation of the array regions representing the modified data, the main drawback of full communications is that they do not take into account the way the data are later used. If a process later uses only a part of the array that no other process has modified, the corresponding communications are unnecessary. Thus we propose another way to generate communications, using partial communications.
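To fix ideas, full communications for a block-distributed 1-D array amount to the following (a minimal sketch of ours with plain MPI, not the code emitted by STEP): every process contributes the block it has just computed, and afterwards the whole array is up to date everywhere.

```c
#include <mpi.h>
#include <stdlib.h>

/* After a worksharing region in which process p computed the block
 * [p*n/size, (p+1)*n/size) of a[], update a[] entirely on every process. */
void full_update(double *a, int n, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    int *counts = malloc(size * sizeof *counts);
    int *displs = malloc(size * sizeof *displs);
    for (int p = 0; p < size; p++) {
        displs[p] = p * n / size;
        counts[p] = (p + 1) * n / size - displs[p];
    }
    /* In-place allgather: each process sends its own block (its SEND region)
     * and receives every other block. */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   a, counts, displs, MPI_DOUBLE, comm);
    free(counts);
    free(displs);
}
```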

4.5. Partial communication mode

To avoid the unnecessary communications generated in full communication mode, we propose to generate partial communications when possible. Fig. 3 shows a case that would greatly benefit from the partial communication mode: two successive parallel loops are executed and, before the second region, only the boundaries of the array should be communicated, whereas in full communication mode the whole array is updated on all nodes.

Runtime determination of communicated array regions: to reduce the volume of communications, we need to know, depending on the data dependences, from which worksharing regions the data involved in a given communication phase come. The data produced by a worksharing region might be involved in several different communications because of branches in the program. For this reason, we chose to compute the communicated array regions at runtime, based on the statically determined SEND and RECV array regions. The runtime system maintains two new kinds of array regions: UPTODATE and COMM. Each computing node stores, for all computing nodes, UPTODATE(Pi, A) representing the up-to-date array regions of array A on computing node Pi. In the same way, each computing node stores, for all computing nodes, COMM(Pi → Pj, A) representing the array regions of A to be communicated from Pi to Pj.

Fig. 5. HMPP codelet execution on one GPU and its corresponding MPI + HMPP execution on two GPUs after STEP transformation.

Table 2
Runtime UPTODATE and COMM array regions: computation based on SEND and RECV array regions computed at compile time.

UPTODATE(Pi, A) (after worksharing region)  =  UPTODATE(Pi, A) ∪ SEND(Pi, A) − ∪_{Pj ≠ Pi} SEND(Pj, A)
COMM(Pi → Pj, A)                            =  (UPTODATE(Pi, A) ∩ RECV(Pj, A)) − UPTODATE(Pj, A)
UPTODATE(Pi, A) (after communication)       =  UPTODATE(Pi, A) ∪ RECV(Pi, A)

After each worksharing region, computing node Pi updates its UPTODATE array region. Before each worksharing region, the communicated array regions are computed and the data are exchanged; afterwards the UPTODATE array region is updated once again. These array regions are computed at runtime based on the static SEND and RECV array regions, as described in Table 2. Fig. 6 illustrates the computation of communicated array regions in partial communication mode: while for worksharing region 2 the communications involve only data updated in worksharing region 1, for worksharing region 3 the communicated array regions depend on the computations of both previous worksharing regions 1 and 2.
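The COMM computation of Table 2 can be sketched as follows, restricted to one-dimensional regions represented as integer intervals (an illustration of ours; the actual STEP runtime manipulates multi-dimensional convex regions and different data structures):

```c
#include <stdio.h>

typedef struct { int lo, hi; } Interval;          /* empty when lo > hi */

static int is_empty(Interval a)  { return a.lo > a.hi; }
static int min_i(int a, int b)   { return a < b ? a : b; }
static int max_i(int a, int b)   { return a > b ? a : b; }
static Interval inter(Interval a, Interval b)
{ return (Interval){ max_i(a.lo, b.lo), min_i(a.hi, b.hi) }; }

/* Interval difference a - b: at most two pieces (left and right remainders). */
static void diff(Interval a, Interval b, Interval out[2])
{
    if (is_empty(b)) { out[0] = a; out[1] = (Interval){ 1, 0 }; return; }
    out[0] = (Interval){ a.lo, min_i(a.hi, b.lo - 1) };   /* left of b  */
    out[1] = (Interval){ max_i(a.lo, b.hi + 1), a.hi };   /* right of b */
}

/* COMM(Pi -> Pj, A) = (UPTODATE(Pi, A) ∩ RECV(Pj, A)) − UPTODATE(Pj, A) */
static void comm_region(Interval uptodate_i, Interval recv_j,
                        Interval uptodate_j, Interval out[2])
{
    diff(inter(uptodate_i, recv_j), uptodate_j, out);
}

int main(void)
{
    /* P0 holds [0,511] up to date, P1 needs [511,1023] (its block plus one
     * halo row) and already holds [512,1023]: only element 511 must travel. */
    Interval comm[2];
    comm_region((Interval){ 0, 511 }, (Interval){ 511, 1023 },
                (Interval){ 512, 1023 }, comm);
    for (int k = 0; k < 2; k++)
        if (!is_empty(comm[k]))
            printf("COMM(P0 -> P1): [%d, %d]\n", comm[k].lo, comm[k].hi);
    return 0;
}
```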

Fig. 6. Communicated array regions may involve several worksharing array regions in the data dependence graph: UPTODATE and COMM array regions are computed at runtime, SEND and RECV array regions at compile time.

Fig. 7. Communication mode (partial or full) in simple cases with one dependence edge.

4.6. Static determination of the communication mode

A communication mode is attached to a SEND array region for each worksharing region at compile time. This makes it possible to mix, inside the same program, partial communications where possible and full communications elsewhere. In order to determine the communication mode, the data dependence graph is used: for each pair of vertices (representing parts of the user program) in this graph, if the SEND array region of the source vertex matches a RECV array region of every destination vertex, then the communication mode associated with the source vertex is partial. SEND and RECV array regions are associated with each vertex thanks to the interprocedural propagation described above. The partial communication mode is meant to be used between two data-dependent worksharing constructs; in other cases, we use the full communication mode, as shown in Fig. 7.

In general, several dependence edges leave a given vertex of the data dependence graph. If one of the destination vertices has no matching RECV array region, then the communication mode of the source vertex is forced to full. Fig. 8 illustrates two cases where two edges start from a given source vertex corresponding to a worksharing region. In the first case, matching RECV array regions are found in all destination vertices, so the communication mode is partial. In the second case, one of the destination vertices has no match, so the communication is forced to full for both destination vertices.
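This rule can be summarized by the following sketch of ours (Vertex, Region and region_matches() are simplified placeholders, not STEP data structures): the source vertex keeps the partial mode only if every data-dependence successor exposes a matching RECV region.

```c
#define MAX_SUCC 8

typedef struct { int lo, hi; } Region;      /* simplified 1-D region */

typedef struct Vertex Vertex;
struct Vertex {
    Region send, recv;                      /* statically computed regions */
    int nsucc;
    Vertex *succ[MAX_SUCC];                 /* data-dependence successors  */
    enum { COMM_FULL, COMM_PARTIAL } mode;
};

/* "Matches" is simplified here to interval overlap. */
static int region_matches(Region s, Region r)
{
    return s.lo <= r.hi && r.lo <= s.hi;
}

void choose_comm_mode(Vertex *v)
{
    v->mode = COMM_PARTIAL;
    for (int k = 0; k < v->nsucc; k++)
        if (!region_matches(v->send, v->succ[k]->recv)) {
            /* one successor without a matching RECV region: force full mode */
            v->mode = COMM_FULL;
            break;
        }
}
```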


Fig. 8. Communication mode (partial or full) in the case of several dependence edges: full communication mode is forced when there is a data dependence towards a non parallel region.

As a result, the STEP tool computes information at compile time (SEND and RECV array regions and communication modes) and uses the runtime to precisely determine the array regions to communicate. This combination of static analysis and runtime determination helps generate precise communications: branches are resolved at execution time, which simplifies the generated program, while the computation of the SEND/RECV array regions is done at compile time to reduce as much as possible the overhead of the runtime system. This technique for generating communications is used for both clusters of CPUs and clusters of GPUs.

5. Results

To evaluate the STEP prototype, we first show that the transformation has been successfully applied to HMPP C programs: the generated programs correctly exploit distributed-memory machines. Second, we show that the generated communications are accurate and allow the generated message-passing programs to scale.

5.1. Generated source programs

In [21,22], we detail the STEP transformation of several OpenMP Fortran benchmarks: matrix multiplication, a molecular dynamics program and the NAS Fortran benchmarks MG and FT. Our prototype has since been extended to the transformation of C programs as well as the transformation of HMPP programs for execution on GPUs. In this paper, we demonstrate only the transformation of HMPP C programs using a Jacobi kernel.

Input HMPP Jacobi program: the HMPP Jacobi program is composed of two codelets executed inside a for loop. The codelets and the main program are printed in Listings 1, 2 and 3. This program can be executed on a single computing node. Data transfers between the CPU and the GPU are explicitly programmed with the advancedload and delegatedstore directives. Fig. 9 represents the HMPP codelets and communications on a single computing node and, after STEP generation, on multiple computing nodes.

Listing 1: HMPP jacobi codelet

#pragma hmpp jacobi codelet, args[inT].io=in, args[outT].io=out
void jacobi(double inT[W][L], double outT[W][L])
{
    int i, j;
#pragma hmppcg gridify(i, j)
    for (i = I; i < W - I; i++)
        for (j = J; j < L - J; j++) {
            outT[i][j] = f(inT[i-I][j], inT[i][j-J], inT[i][j+J], inT[i+I][j]);
        }
}

Listing 2: HMPP normReduce codelet

#pragma hmpp diffnorm codelet, args[outT].io=in, args[inT].io=inout, args[diffNorm].io=inout
void diffNormReduce(double inT[W][L], double outT[W][L], double diffNorm[1])
{
    int i, j;
    double diffsum = diffNorm[0];
#pragma hmppcg gridify(i, j), reduce(+:diffsum)
    for (i = I; i < W - I; i++)
        for (j = J; j < L - J; j++) {
            double diff;
            diff = outT[i][j] - inT[i][j];
            diffsum += diff * diff;
            inT[i][j] = outT[i][j];
        }
    diffNorm[0] = diffsum;
}

Listing 3: HMPP main program

double myInT[W][L], myOutT[W][L];

int main()
{
    init(myInT, myOutT);
    int index, iterations = NBITER;
    double theDiffNorm = 1;
#pragma hmpp allocate, args[diffnorm::inT].size={W,L}, args[diffnorm::outT].size={W,L}
#pragma hmpp advancedload, args[jacobi::inT], args[jacobi::inT].addr="myInT"
    for (index = 0; (index < iterations) && (theDiffNorm >= RefDiffNorm); index++) {
#pragma hmpp jacobi callsite, args[inT;outT].noupdate=true
        jacobi(myInT, myOutT);
        theDiffNorm = 0.0;
#pragma hmpp diffnorm callsite, args[inT;outT].noupdate=true
        diffNormReduce(myInT, myOutT, &theDiffNorm);
    }
#pragma hmpp delegatedstore, args[diffnorm::inT], args[diffnorm::inT].addr="myInT"
#pragma hmpp release
    displayRegion(myInT);
    display(theDiffNorm);
    return 0;
}

Fig. 9. Execution of the Jacobi kernel on one GPU and on two GPUs after STEP transformation.

Result of the program transformation: in order to generate MPI programs, STEP parallelizes the execution of both codelets. The codelet definitions and pragmas are modified to take the new loop bounds into account. For the Jacobi codelet, the codelet pragma is modified in Listing 4 (the codelet is renamed jacobi_step and the loop-bound parameters STEP_i_LOW and STEP_i_UP are added to the in arguments), and the codelet definition is modified accordingly (new parameters and new loop bounds). The same transformation is applied to the diffNorm codelet in Listing 5.

Listing 4: HMPP jacobi codelet after STEP transformation

#pragma hmpp jacobi_step codelet, args[inT;STEP_i_LOW;STEP_i_UP].io=in, args[outT].io=out
void jacobi_CODELET(double inT[1+1022+1][1+1022+1], double outT[1+1022+1][1+1022+1],
                    int STEP_i_LOW, int STEP_i_UP)
{
    int i, j;
#pragma hmppcg gridify(i, j)
    for (i = STEP_i_LOW; i <= STEP_i_UP; i += 1)
        for (j = 1; j <= 1022; j += 1) {
            double neighbor = f(inT[i-1][j], inT[i][j-1], inT[i][j+1], inT[i+1][j]);
            outT[i][j] = neighbor / 3;
        }
}

Listing 5: HMPP normReduce codelet after STEP transformation

#pragma hmpp diffnorm_step codelet, args[outT;STEP_i_LOW;STEP_i_UP].io=in, args[inT;diffNorm].io=inout
void diffNormReduce_CODELET(double inT[1+1022+1][1+1022+1], double outT[1+1022+1][1+1022+1],
                            double diffNorm[1], int STEP_i_LOW, int STEP_i_UP)
{
    int i, j;
    double diffsum = diffNorm[0];
#pragma hmppcg gridify(i, j), reduce(+:diffsum)
    for (i = STEP_i_LOW; i <= STEP_i_UP; i += 1)
        for (j = 1; j <= 1022; j += 1) {
            double diff;
            diff = outT[i][j] - inT[i][j];
            diffsum += diff * diff;
            inT[i][j] = outT[i][j];
        }
    diffNorm[0] = diffsum;
}

Each codelet call site is replaced by a STEP function that contains the call site of the generated codelet as well as all the STEP runtime calls needed to generate and execute the MPI communications. Listing 6 prints out the resulting STEP function including the Jacobi codelet call site:

• SEND array regions for the outT array are computed (the STEP_SR_outT assignments and STEP_SET_SENDREGIONS),
• RECV array regions for the inT array are computed (the STEP_RR_inT assignments and STEP_SET_RECVREGIONS),
• the communication mode is partial (the STEP_REGISTER_ALLTOALL_PARTIAL calls for inT and outT),
• runtime calls are inserted that compute the array regions to be communicated (UPTODATE and COMM array regions) and perform the MPI communications (the STEP_FLUSH calls),
• GPU–CPU data transfers are executed (STEP_SET_HMPP_IN_FUNC and STEP_SET_HMPP_OUT_FUNC),
• the new loop bounds are instantiated (STEP_GET_LOOPBOUNDS with the process rank),
• the new call site is executed (the jacobi_step callsite pragma and the call to simpleJacobi_CODELET).

Listing 6: Jacobi call site after STEP transformation

void simpleJacobi_HYBRID(double inT[1+1022+1][1+1022+1], double outT[1+1022+1][1+1022+1])
{
    /* Perform some declarations and initializations */
    int STEP_i_LOW, STEP_i_UP, STEP_COMM_SIZE, STEP_COMM_RANK, IDX,
        STEP_RR_inT[STEP_MAX_NB_LOOPSLICES][2][STEP_INDEX_SLICE_UP],
        STEP_SR_outT[STEP_MAX_NB_LOOPSLICES][2][STEP_INDEX_SLICE_UP];

    STEP_CONSTRUCT_BEGIN(STEP_PARALLEL_DO);
    STEP_INIT_ARRAYREGIONS(inT, STEP_REAL8, 2, 0, 1+1022+1-1, 0, 1+1022+1-1);
    STEP_INIT_ARRAYREGIONS(outT, STEP_REAL8, 2, 0, 1+1022+1-1, 0, 1+1022+1-1);
    STEP_GET_COMMSIZE(&STEP_COMM_SIZE);
    STEP_COMPUTE_LOOPSLICES(1, 1022, 1, STEP_COMM_SIZE);

    /* Getting the RECV regions */
    for (IDX = 1; IDX <= STEP_COMM_SIZE; IDX += 1) {
        STEP_GET_LOOPBOUNDS(IDX-1, &STEP_i_LOW, &STEP_i_UP);
        /* Constraint system of the RECV MAY array region for inT:
         * <inT[PHI1][PHI2]-RECV-MAY-{0<=PHI1, PHI1<=1023, STEP_i_LOW<=PHI1+1,
         *  PHI1<=STEP_i_UP+1, 0<=PHI2, PHI2<=1023, STEP_i_LOW<=1022,
         *  STEP_i_LOW<=STEP_i_UP, 1<=STEP_i_UP}> */
        STEP_RR_inT[IDX-1][1-1][STEP_INDEX_SLICE_LOW-1] = MAX(0, STEP_i_LOW - 1);
        STEP_RR_inT[IDX-1][1-1][STEP_INDEX_SLICE_UP-1] = MIN(1023, STEP_i_UP + 1);
        STEP_RR_inT[IDX-1][2-1][STEP_INDEX_SLICE_LOW-1] = 0;
        STEP_RR_inT[IDX-1][2-1][STEP_INDEX_SLICE_UP-1] = 1023;
    }
    STEP_SET_RECVREGIONS(inT, STEP_COMM_SIZE, (STEP_ARG *)STEP_RR_inT);
    STEP_SET_HMPP_IN_FUNC(inT, step_gpu_recv_jacobi_step_inT);
    STEP_REGISTER_ALLTOALL_PARTIAL(inT, STEP_NBLOCKING_ALG, STEP_TAG_DEFAULT);
    STEP_FLUSH();

    /* Getting the SEND regions */
    for (IDX = 1; IDX <= STEP_COMM_SIZE; IDX += 1) {
        STEP_GET_LOOPBOUNDS(IDX-1, &STEP_i_LOW, &STEP_i_UP);
        /* SEND EXACT array region for outT */
        STEP_SR_outT[IDX-1][1-1][STEP_INDEX_SLICE_LOW-1] = MAX(STEP_i_LOW, 1);
        STEP_SR_outT[IDX-1][1-1][STEP_INDEX_SLICE_UP-1] = MIN(STEP_i_UP, 1022);
        STEP_SR_outT[IDX-1][2-1][STEP_INDEX_SLICE_LOW-1] = 1;
        STEP_SR_outT[IDX-1][2-1][STEP_INDEX_SLICE_UP-1] = 1022;
    }
    STEP_SET_SENDREGIONS(outT, STEP_COMM_SIZE, (STEP_ARG *)STEP_SR_outT);
    STEP_SET_HMPP_OUT_FUNC(outT, step_gpu_send_jacobi_step_outT);

    STEP_GET_RANK(&STEP_COMM_RANK);
    STEP_GET_LOOPBOUNDS(STEP_COMM_RANK, &STEP_i_LOW, &STEP_i_UP);

    /* Codelet call site */
#pragma hmpp jacobi_step callsite, args[outT;inT].noupdate=true
    simpleJacobi_CODELET(inT, outT, STEP_i_LOW, STEP_i_UP);

    STEP_REGISTER_ALLTOALL_PARTIAL(outT, STEP_NBLOCKING_ALG, STEP_TAG_DEFAULT);
    /* Where communications are performed */
    STEP_FLUSH();
    STEP_CONSTRUCT_END(STEP_PARALLEL_DO);
}

Generated communications: the partial communication mode results in the communication of the boundaries of the output array at each iteration, as illustrated in Fig. 9. The reduce clause of the diffNorm codelet results in an MPI reduction. At the end, the entire output array is gathered on one computing node to use the results.

5.2. Performance of STEP generated programs

Measurements are performed on a Bull NOVA cluster using one GPU, two GPUs on two different nodes and four GPUs on two different nodes. The GPUs are Tesla M2050 with 3 GB of memory. Each node is composed of:

• two NVIDIA Tesla M2050 GPUs, each GPU containing 448 cores and 3 GB of global memory;
• two Intel Xeon X5640 quad-core Westmere EP 2.66 GHz CPUs and 24 GB of memory;
• two dedicated PCI-e 16× connections (8 GB/s) for each GPU;
• two double InfiniBand QDR connections between blades.

Jacobi measurements: to measure the performance of the Jacobi benchmark, we used double precision matrices with three different matrix sizes: 1000 × 1000, 5000 × 5000 and 10 000 × 10 000. Fig. 10 represents the obtained execution times and speedups. For the 10 000 × 10 000 matrices, the elapsed time of the whole Jacobi benchmark decreases from 98.95 s using one GPU to 33.46 s using four GPUs. This execution time includes inter-node communications using MPI and data transfers between CPU and GPU. The speedup on four GPUs (compared to the execution on one GPU) is equal to 2.96.

Matrix multiplication measurements: to measure the performance of the matrix multiplication benchmark, we used single precision matrices with two different matrix sizes: 10 000 × 10 000 and 15 000 × 15 000. Fig. 11 represents the obtained execution times and speedups. For the 15 000 × 15 000 matrices, the elapsed time of the whole matrix multiplication benchmark decreases from 51.24 s using one GPU to 17.5 s using four GPUs. The original matrix multiplication program on one GPU has been improved using loop transformations via the unroll, jam and split HMPP directives. This execution time could be further improved using HMPP directives to achieve performance close to the MAGMA [24] or CUBLAS [10] matrix multiplication; in this paper, however, we focus on the generation of communications. The multiple-GPU version generated by STEP performs a monodimensional row computation with communication of the results at the end. This execution time includes inter-node communications using MPI and data transfers between CPU and GPU. The speedup on four GPUs (compared to the execution on one GPU) is equal to 2.92.

Discussion: the resulting speedups are close to 3 on 4 GPUs. We always measure the speedup by comparing the execution of the whole application on one GPU and on multiple GPUs. These speedups are very encouraging concerning the technique used to generate communications. The two benchmarks used in this article illustrate two typical cases of regular applications:

• gathering of blocks of rows of the resulting matrix (matrix multiplication),
• reduction and communication of borders in an iterative algorithm (Jacobi).

These results could be further improved by reducing the overhead due to the underlying communication mechanism, which involves first a communication between GPU and CPU, then an MPI communication over the interconnection network between the CPUs, and then a communication between CPU and GPU. This could be reduced in the future with communication technologies that allow direct communications between GPUs. For instance, the NVIDIA GPU Direct technology [13] introduces support for accelerated communications between GPUs through the network. While this technology will greatly improve performance, the transfer of data still needs to be specified by the program; this transfer of data will be automatically generated by the STEP tool.


Fig. 10. Jacobi benchmarks.

Fig. 11. Matrix multiplication benchmarks.

6. Conclusion and future work

Designed to generate message-passing programs, the STEP transformation presented in this paper deals with heterogeneous data transfers: starting from a source code with directives, it generates a new source code with additional accelerator data transfer directives and message-passing MPI primitives. This transformation is included in the PIPS workbench, which provides powerful dataflow analyses. Based on the PIPS resources, we have designed a technique to generate communications that relies both on statically computed resources (SEND and RECV array regions) and on information computed at runtime to handle program branches. This transformation was demonstrated for clusters of CPUs on different benchmarks in a previous article [22]. In this paper, we specifically show that the transformation is successful for clusters of GPUs on two HMPP C benchmarks: a Jacobi kernel and matrix multiplication. First, the generated communications are heterogeneous: data transfers use HMPP for GPU–CPU communications and MPI between CPUs. Second, the generated communications are precise enough to allow applications to scale, despite the overhead of the multiple levels of communications.

Future work will be oriented in several directions. First of all, we want to demonstrate that while fully automatic parallelization is difficult to achieve, combining information given by the user with compiler analysis can yield good results. To further demonstrate this, we want to generate communications from the new OpenACC standard and also to exploit more information given by the user by means of directives. As a matter of fact, our tool currently does not use any hint on the input or output data of a given worksharing region. Directive sets for accelerator-based architectures all propose ways to describe the data that should be exchanged. This could be used by the user of our transformation to circumvent the inspector/executor mechanism used when the results of our analyses are not precise enough. Secondly, the STEP tool is able to generate point-to-point communications at different levels: at the GPU level with HMPP directives and at the CPU level with MPI primitives. The performance of the generated communications could be improved in different ways: on the one hand, STEP could generate communications based on direct communications between GPUs; on the other hand, compiler analysis could be used to detect communication optimizations such as collective communications or communication/computation scheduling to organize communication/computation overlap. Finally, we are currently working on bigger applications to demonstrate the potential of the compiler analysis; in particular, we work on an application using a fast algorithm based on high frequency asymptotics for computational electromagnetics [6].

Acknowledgment

The research presented in this paper has been partially supported by the French Ministry for Economy, Industry and Employment (DGCIS) through the ITEA2 Call 4 project "H4H" (No. 09011, October 2010–October 2013).

References

[1] B. Apvrille-Creusillet, Regions exactes et privatisation de tableaux, Master's Thesis, École Nationale Supérieure des Mines de Paris, 1994.
[2] J.M. Barth, An interprocedural data flow analysis algorithm, in: Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, 1977.

[3] A. Basumallik, S. Min, R. Eigenmann, Towards OpenMP execution on software distributed shared memory systems, in: International Workshop on OpenMP: Experiences and Implementations, WOMPEI, 2002.
[4] A. Basumallik, S. Min, R. Eigenmann, Programming distributed memory systems using OpenMP, in: 12th International Workshop on High-Level Parallel Programming Models and Supportive Environments, 2007.
[5] W. Blume, R. Eigenmann, Performance analysis of parallelizing compilers on the perfect benchmarks programs, IEEE Trans. Parallel Distrib. Syst. 3 (6) (1992).
[6] A. Boag, C. Letrou, Multilevel fast physical optics algorithm for radiation from non-planar apertures, IEEE Trans. Antennas Propag. 53 (6) (2005).
[7] T. Boku, M. Sato, M. Matsubara, D. Takahashi, OpenMPI—OpenMP like tool for easy programming in MPI, in: Sixth European Workshop on OpenMP, 2004.
[8] B. Creusillet, Integrating pointer analyses in PIPS effect analyses for an effective parallelization of C programs, in: Presentation at the PIPS Days, 2010.
[9] B. Creusillet, F. Irigoin, Interprocedural array region analyses, in: Eighth Workshop on Languages and Compilers for Parallel Computing, LCPC, 1995.
[10] CUDA CUBLAS Library, NVIDIA Corporation, 2010.
[11] R. Dolbeau, S. Bihan, F. Bodin, HMPP: a hybrid multi-core parallel programming environment, in: Proceedings of the Workshop on General Purpose Processing on Graphics Processing Units, GPGPU, 2007.
[12] R. Eigenmann, J. Hoeflinger, R.H. Kuhn, D. Padua, A. Basumallik, S. Min, J. Zhu, Is OpenMP for grids, in: NSF Next Generation Systems Program Workshop, 2002. URL http://polaris.cs.uiuc.edu/publications-b1.html.
[13] GPU Direct Technology, NVIDIA Corporation, 2012.
[14] P. Havlak, K. Kennedy, An implementation of interprocedural bounded regular section analysis, IEEE Trans. Parallel Distrib. Syst. 2 (3) (1991).
[15] F. Irigoin, Interprocedural analyses for programming environments, in: Workshop on Environments and Tools for Parallel Scientific Computing, 1992.
[16] F. Irigoin, P. Jouvelot, R. Triolet, Semantical interprocedural parallelization: an overview of the PIPS project, in: Proceedings of the 5th International Conference on Supercomputing, ICS, 1991.
[17] Y. Kee, J. Kim, S. Ha, ParADE: an OpenMP programming environment for SMP cluster systems, in: Conference on High Performance Networking and Computing, 2003.
[18] K. Kennedy, C. Koelbel, H. Zima, The rise and fall of high performance Fortran: an historical object lesson, Commun. ACM 54 (11) (2007).
[19] D. Levine, D. Callahan, J. Dongarra, A comparative study of automatic vectorizing compilers, Parallel Comput. 17 (10–11) (1991).


[20] S. Maleki, Y. Gao, M.J. Garzaran, T. Wong, D.A. Padua, An evaluation of vectorizing compilers, in: International Conference on Parallel Architectures and Compilation Techniques, PACT, 2011.
[21] D. Millot, A. Muller, C. Parrot, F. Silber-Chaussumier, STEP: a distributed OpenMP for coarse-grain parallelism tool, in: International Workshop on OpenMP, IWOMP, OpenMP in a New Era of Parallelism, Purdue University, USA, 2008.
[22] D. Millot, A. Muller, C. Parrot, F. Silber-Chaussumier, From OpenMP to MPI: first experiments of the STEP source-to-source transformation, in: Mini-Symposium in International Conference on Parallel Computing (ParCO)—Parallel Programming Tools for Multi-Core Architectures, 2009.
[23] MPI-2: extensions to the message-passing interface, Message Passing Interface Forum, 1997. URL http://www.mpi-forum.org.
[24] R. Nath, S. Tomov, J. Dongarra, An improved magma gemm for Fermi graphics processing units, Int. J. High Perform. Comput. Appl. 24 (4) (2010).
[25] OpenMP application program interface, Version 3.1, OpenMP Architecture Review Board, 2011. URL http://www.openmp.org.
[26] D.A. Padua, M.J. Wolfe, Advanced compiler optimizations for supercomputers, Commun. ACM 29 (12) (1986). Special issue on parallelism.
[27] P.M. Petersen, D.A. Padua, Static and dynamic evaluation of data dependence analysis, in: Proceedings of the 7th International Conference on Supercomputing, ICS, 1993.
[28] PGI Fortran & C Accelerator Programming Model Current Features, Portland Group, 2010. URL http://www.pgroup.com/resources/articles.htm.
[29] M. Sato, S. Satoh, K. Kusano, Y. Tanaka, Design of OpenMP compiler for an SMP cluster, in: Proceedings of the First European Workshop on OpenMP, EWOMP, 1999. URL www.it.lth.se/ewomp99.
[30] The OpenACC Application Programming Interface Version 1.0, 2011. URL http://www.openacc-standard.org/.
[31] R. Triolet, F. Irigoin, and many other contributors, PIPS high-level software interface pipsmake configuration, Technical Documentation, 2012.
[32] R. Triolet, F. Irigoin, P. Feautrier, Direct parallelization of call statements, in: Proceedings of the 1986 SIGPLAN Symposium on Compiler Construction, 1986.
[33] XcalableMP Language Specification Version 1.0, XcalableMP Specification Working Group, 2011. URL http://www.xcalablemp.org/.

F. Silber-Chaussumier received her Ph.D. in 2001 from École Normale Supérieure, Lyon, France. She has been working since 2002 at Télécom SudParis, located in Évry, France, in the Computer Science department, in the Parallelism and High Performance team. Her research interests focus on parallel programming. She has been involved in several research projects and is currently involved in two European ITEA2 projects: H4H (Hybrid programming environment for Heterogeneous computing clusters) and MANY (Many-core programming and resource management for high-performance Embedded Systems).

A. Muller received a master's degree in computer science and an engineering degree from École Nationale Supérieure d'Informatique pour l'Industrie et l'Entreprise (ENSIIE) in 2002. He has been a research engineer in the Parallelism and High Performance team in the Computer Science department at Télécom SudParis since 2006.

R. Habel received a master's degree in computer science from University Paris 6, Pierre and Marie Curie, in 2010. During his master's thesis he worked in collaboration with Queen's University Belfast on accelerating scientific applications using GPU accelerators. He has been a Ph.D. student in the Parallelism and High Performance team in the Computer Science department at Télécom SudParis since 2010.