An environment for OpenMP code parallelization


Parallel Computing: Software Technology, Algorithms, Architectures and Applications. G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors). © 2004 Elsevier B.V. All rights reserved.


An environment for OpenMP code parallelization

C.S. Ierotheou (a), H. Jin (b), G. Matthews (b), S.P. Johnson (a) and R. Hood (b)

(a) Parallel Processing Research Group, University of Greenwich, London SE10 9LS, UK
(b) NASA Advanced Supercomputing Division, NASA Ames Research Center, Moffett Field, CA 94035, USA

In general, the parallelization of compute intensive Fortran application codes using OpenMP is relatively easier than using a message passing based paradigm. Despite this, it is still a challenge to use OpenMP to parallelize application codes in a way that yields an effective, scalable performance gain when executed on a shared memory system. If the time to complete the parallelization is to be significantly reduced, an environment is needed that will assist the programmer in the various tasks of code parallelization. In this paper the authors present a code parallelization environment in which a number of tools are available to address the main tasks of code parallelization, debugging and optimization. The parallelization tools include ParaWise and CAPO, which enable the near automatic parallelization of real world scientific application codes for shared and distributed memory-based parallel systems. One focus of this paper is the use of ParaWise and CAPO to transform the original serial code into an equivalent parallel code that contains appropriate OpenMP directives. Additionally, since user involvement can introduce errors, a relative debugging tool (P2d2) is also available and can be used to perform near automatic relative debugging of an OpenMP program that has been parallelized either using the tools or manually.
In order for these tools to be effective across a range of applications, a high quality, fully interprocedural dependence analysis, as well as user interaction, is vital both to the generation of efficient parallel code and to the optimization of the backtracking and speculation process used in relative debugging. Results for parallelized NASA codes are presented and show the benefits of using the environment.

1. INTRODUCTION

Today the most popular parallel systems are based on shared memory, distributed memory or hybrid distributed-shared memory architectures. For a distributed memory parallelization, a global view of the whole program can be vital when using a Single Program Multiple Data (SPMD) paradigm [2]. The whole parallelization process can be very time consuming and error-prone. For example, to use the available distributed memory efficiently, data placement is an essential consideration, while the placement of explicit communication calls requires a great deal of expertise. Parallelization for a shared memory system is only relatively easier. Data placement may appear to be less crucial than for a distributed memory parallelization, and a more local, loop level view may be sufficient in many cases, but the process is still error-prone, time-consuming and still requires a detailed level of expertise. The main goal

for developing tools that can assist in the parallelization of serial application codes is to embed this expertise within automated algorithms that perform much of the parallelization in a far shorter time frame than would be required by a parallelization expert doing the same task manually. In addition, the toolkit should be capable of generating generic, portable, parallel source code from the original serial code [1]. In this paper we discuss the tools that have been developed, and their interoperability, to assist with OpenMP code parallelization, specifically targeted at shared memory machines. These include an interactive parallelization tool for message passing based parallelizations (ParaWise) that also contains dependence analysis capabilities and many valuable source code browsers; an OpenMP code generation module (CAPO) with a range of techniques that aid in the production of efficient, scalable OpenMP code; and a relative debugger built on P2d2 that is capable of handling hundreds of parallel processes.

2. PARAWISE, CAPO AND P2D2 TOOLS

The tools in this environment have been used to successfully parallelize a number of Fortran application codes for distributed memory [1, 2, 4] and shared memory [5, 6, 7] systems, based on distributing arrays and/or loop iterations across a number of processors/threads. A detailed description of the tools is not given here but can be found elsewhere [2, 5]; instead, an overview is presented.

2.1. Overview

Figure 1 shows an overview of the various tools, their functions and the nature of the interactions between them. Note that the dependence analysis and directive insertion engines provide facilities that are used by other components (e.g. symbolic algebra, proofs, etc.). The expert assistant and the profiling and tracing tools are not completely integrated into the environment and are part of an ongoing, longer term project (see section 4).
2.2. CAPO code generation features

As part of the code generation process CAPO will automatically identify a number of essential characteristics of the code. In particular, CAPO automatically classifies the loops in the code into four basic types. Serial - the loop contains a loop-carried true data dependence that forces its serialization; other reasons for a loop to be defined as serial include the presence of I/O or loop-exiting statements within the loop body. Covered serial - as with a serial loop, but the loop contains, or is contained in, nested parallel loops; if the serial loop can be made parallel then the parallelism may be defined at a higher level. Chosen parallel - parallel loops at which the OMP DO directive is defined, including parallel pipeline and reduction loops. Not chosen parallel - parallel loops not selected for application of the OMP DO directive because they are surrounded by other parallel loops at a higher nesting level.

All of the code generation is automatic and includes: identifying parallel loops, using the interprocedural dependence analysis to define parallelism at a coarser level and to determine where to place the OMP DO directive; creating PARALLEL regions based on the identified parallel loops; merging consecutive PARALLEL regions into a single region to reduce the overheads of thread start-up and shut-down; detecting and producing the NOWAIT clause on an END DO to reduce barrier synchronizations where this is proven legal; and identifying and defining the scoping of all variables in PARALLEL regions, such as SHARED, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, THREADPRIVATE, etc.

Figure 1. Overview of the environment indicating the interactions between the different tools.

The quality of the parallel source code generated derives from many of the features provided by ParaWise. For example, the dependence analysis is fully interprocedural and value-based [3] (i.e. the analysis detects the flow of data rather than just the memory locations accessed) and allows the user to assist with essential knowledge about program variables. There are many reasons why an analysis may fail to determine the non-existence of a dependence accurately: incorrect serial code, a lack of information on the program input variables, limited time to perform the analysis, and limitations in the current state-of-the-art dependence algorithms. For these reasons it is essential to allow user interaction as part of the process, particularly if scalability is required on a large number of processors. For instance, a lack of knowledge about a single variable that is read into an application can lead to a single assumed data dependence that serializes a single loop, which in turn greatly affects the parallel scalability of the application code. An example illustrating the benefit of a quality interprocedural analysis is shown in the next section.

2.3. User interaction

Apart from the obvious need for user interaction to enable the removal of loop-serializing dependences, efficient parallel OpenMP code requires the consideration of other factors. Due to the relative expense of starting threads at a PARALLEL region definition, and of barrier synchronizations within the region, it is essential to attempt to place directives so that these overheads are significantly reduced.
To this end, sophisticated algorithms have been developed for CAPO to improve the merging of PARALLEL regions and to prove the independence of threads in loops within the same PARALLEL region, so that a barrier synchronization can be

avoided. These algorithms rely heavily on the dependence analysis; if a dependence that inhibits these optimizations has been assumed to exist, then user control is necessary to provide additional information to the tool.

Figure 2 shows a sample of code taken from an ocean modelling code [6] (due to space restrictions the sample also includes the final OpenMP directives discussed later). For the serial code, the variables being read in at run time mean that static analysis alone is unable to determine that the variable np (used in statement S2) cannot be equal to nc or nm (used in statement S1) for a given main time step iteration. In this case CAPO will define PARALLEL regions around the k loops in the BAROCLINIC subroutine. If the user provides information to ParaWise such that np can never be equal to nc or nm in a given time step iteration, some assumed dependences are removed and the i and j loops can be executed as parallel loops. This optimization enables parallelism at an outer level (the j loop), where the PARALLEL region includes 12 nested parallel loops as well as calls to other subroutines. The optimization goes a step further: since the entire BAROCLINIC subroutine is defined within a single PARALLEL region, the interprocedural nature of the tools means that the final PARALLEL region can be defined in a routine higher up in the call graph, in this case a calling routine, as indicated in Figure 2.

Although user interaction is essential, it does allow for the possible introduction of errors into the parallel code. The incentive to enable parallelism can lead to the user being over-optimistic. Apart from the erroneous deletion of a dependence or incorrect bounds being supplied for a variable, the user can be presented with a choice of solutions. For example, a loop that is serial due to the re-use of a non-privatizable variable between iterations can be executed in parallel if the variable can be privatized or if the re-use dependence can be proven non-existent. An incorrect choice by the user can lead to erroneous results in the parallel execution.

2.4. Automatic relative debugging

As the user is likely to introduce errors into the parallel code, tools are needed in the environment to assist the user in identifying the incorrect decisions made during the parallelization. Automatic relative debugging achieves this by comparing the execution of the serial program with that of the parallelized version, automatically finding the first difference in the parallel program that may have caused an incorrect value that was originally identified. Automation of the search for the first difference relies on the availability of an interprocedural dependence analysis of the program and the ability to determine mappings from the serial to the parallel program [8]. Here, the automatic relative debugging approach is applied to shared-memory OpenMP programs and determines the first PARALLEL region that contains a difference that may have caused an identified incorrect value. Previous work [8] provides a thorough presentation of the algorithm used to determine the first difference between the serial (one processor parallel) execution and an N-processor parallel execution for a distributed memory, message passing based code. Steps (1)-(4) below are repeated until the earliest difference is found: (1) find the possible definition points of the earliest incorrect value observed so far, using dependence analysis information; (2) examine the variable references on the right-hand sides of those definitions to determine a set of suspect variable references to monitor in a re-execution;

      read *, nm, np, nc
      do loop = 1, mxiter_time
         call TIMESTEP
      enddo

      SUBROUTINE TIMESTEP
200   continue
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP&  PRIVATE(jc,LOWLIM,itbtp,itbt,nnc0,nnm0,nnp0,c2dtbt,ntbtp0,
!$OMP&          ntbt2,ic,k,m,bmf,smf,ncon,stf,n)
!$OMP&  SHARED(dy2r,dx2r,grav,np0,dtbt,dchkbd,nm0,dx,nc0,dy,dyr,dxr,
!$OMP&         ntbt,mxpas2,c2dtuv,np,nc,nm,pcyc,c2dtts,gamma)
      call BAROCLINIC
      ...
      nnc = np
      nnm = nc
      nnp = nm
      np = nnp
      nc = nnc
      nm = nnm
      if (maxits .and. eb) then
         nc = nnp
         np = nnm
         maxits = .false.
         goto 200
      endif

      SUBROUTINE BAROCLINIC
!$OMP DO
      do j = jsta, jend
         do i = ista, iend
            do k = 1, kmc
S1             ... v(k,i-1:i+1,j,nc), v(k,i-1:i+1,j,nm),
                   v(k,i,j-1:j+1,nc), v(k,i,j-1:j+1,nm) ...
            enddo
            do k = 1, kmc
S2             ... v(k,i,j,np) ...
            enddo
         enddo
      enddo

Figure 2. Pseudo code showing the CAPO generated OpenMP directives.

(3) instrument the suspect variable references in both the serial and parallel versions of the program; (4) execute the instrumented programs, stopping when a difference (i.e. a bad value) is detected at an instrumentation point; if any bad value encountered has not been previously observed then it is used as input for step (1), otherwise debugging is terminated.

The automated search for the first difference begins after all necessary dependence and directive information has been obtained. An incorrect variable at a specific location in the execution is used as the starting point of the search, typically at an output statement where the output differs between the serial and the parallel execution. The steps above are performed by the relative debugging controller and P2d2; the controller guides backtracking [9] and re-execution based on dependence analysis information, and P2d2 retrieves and compares values. When necessary, the controller makes use of directive information by asking the CAPO library to determine, for example, whether a variable is PRIVATE or SHARED at a given location in the application code [10]. P2d2 is used to manage control of the serial and parallel executions at the location of the earliest known difference.

Two optimizations are implemented to reduce the number of re-executions needed to search for differences. First, the suspect variables determined in step (2) were computed in previous program states and are therefore candidates for examination by tracing their values without re-execution (backtracking). The memory overhead of tracing is undesirable, so a limited form of backtracking is used that utilizes only the current program state to retrieve values that have not been overwritten (based on dependence analysis). If any value retrieved in this way is found to be incorrect then that value is used as input to step (1), saving a re-execution.

Table 1
Summary of codes parallelized using the parallelization tools

                        LU, SP, BT    FT, CG, MG    CTM           GCEM3D         OVERFLOW
Code size               3K lines      2K lines      16K lines     18K lines      100K lines
                        benchmark     benchmark     105 routines  100 routines   851 routines
Dependence analysis     0.5-1 hr      0.5-1 hr      1 hr          6.5 hrs        25 hrs
Code generation         under 5 mins  under 5 mins  10 mins       30 mins        30 mins
User effort/tuning      1 day         1 day         2 days        14 days        4 days
Total manual time       3 weeks       3 weeks       5 months      1 month        8 months
Performance (cf.        within 5-10%  within 10-35% better by 30% factor 8       slightly
manual version)                                                   better         better
Sample speed-up         BT: 30 on 32  CG: 22 on 32  3.5 on 4      24 on 32       16 on 32
For values of suspect variables that have been overwritten, a second optimization is employed. Such values are treated as if they were incorrect, and steps (1) and (2) are speculatively applied to the definition points of each to determine whether any of their right-hand side values can be retrieved with backtracking. Note that this optimization can be applied recursively for any right-hand side value that has been overwritten. If a bad value is found among the right-hand side values then it is assumed that the overwritten value was also incorrect, and the overall search for the first difference continues with the bad right-hand side value as input for step (1). For a CAPO-parallelized code the stored dependence graph and other information can be used directly. The tool can also be applied to manually parallelized OpenMP codes, where a dependence analysis and directive extraction are performed on initialization of the tool.
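The backward search of steps (1)-(4) can be modelled with a toy sketch in C (our own simplification, not the P2d2/controller implementation): each definition point records the value it produced in the serial and parallel runs, plus the definition points its right-hand side reads, and the search walks backwards from an observed bad value until it reaches a definition whose inputs all agree.

```c
#define MAXDEP 4

/* Toy model of the search for the first difference. A real
 * implementation compares values captured during instrumented
 * re-executions; here they are simply stored in the struct. */
typedef struct {
    double serial, parallel;  /* value at this definition point      */
    int ndeps;                /* number of right-hand-side inputs    */
    int deps[MAXDEP];         /* their definition-point indices      */
} Def;

int first_difference(const Def *defs, int bad) {
    for (;;) {
        int earlier = -1;
        /* Steps (1)-(2): examine the right-hand-side values that
         * fed the current bad definition. */
        for (int k = 0; k < defs[bad].ndeps; k++) {
            int d = defs[bad].deps[k];
            /* Steps (3)-(4): a serial/parallel mismatch means an
             * earlier bad value exists; restart the search from it. */
            if (defs[d].serial != defs[d].parallel) {
                earlier = d;
                break;
            }
        }
        if (earlier < 0)
            return bad;  /* all inputs agree: first difference found */
        bad = earlier;
    }
}
```

In the real tool each step of this walk may require tracing, limited backtracking in the current program state, or a full instrumented re-execution, which is why the two optimizations above matter.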

3. RESULTS

A number of codes have been parallelized both manually and using the parallelization tools. The codes include the NAS parallel benchmarks, a suite of well-used benchmark programs; CTM, a NASA Goddard code used for ozone layer climate simulations; GCEM3D, a NASA Goddard code used to model the evolution of cloud systems under large scale thermodynamic forces; and OVERFLOW, the NASA Ames version used for aerospace CFD simulations. Table 1 summarizes the approximate time taken for the various efforts involved in parallelizing these applications. In nearly all cases the quality of the code generated was at least as good as that of the manually parallelized version, and was achieved with a significantly reduced user effort. Unlike the other codes, the performance of the GCEM3D code was enhanced by using the Paraver [11] profiling tool together with the existing environment tools. For all the larger codes (CTM, GCEM3D and OVERFLOW), user interaction was required to identify more parallelism, enable effective scalability and produce a significant speed-up. This was performed by the authors, taking great care in the decisions made.

To examine the effectiveness of the relative debugging facility, an error was deliberately introduced into the parallel NAS LU code and the first indication of incorrect output was used as the start point for the debugging algorithm. After 3 re-executions for successively earlier instrumentation points, a variable at a particular location was identified as the first difference in the code. This related directly to the erroneous user interaction deliberately introduced as part of the earlier parallelization, exposing the likelihood that it was incorrect.

4. FUTURE WORK AND CONCLUDING REMARKS

The quality of the code generated yields performance comparable to a manual parallelization effort and, since almost all of the work is automated, the total time to parallelize an application is significantly reduced when using the tools. The authors are working on an "expert assistant" that will guide the user to the reasons why, for example, loops are serialized or variables are non-privatizable, by asking pertinent questions that the user can attempt to answer. The aim of the expert assistant is to exploit any parallelism that is not immediately apparent. It is also envisaged that the questions can be prioritized by interacting with a profiling tool that can indicate inefficiencies in the parallel execution, such as loops exhibiting a poor speed-up, frequently executed PARALLEL region start/stop overheads, and barrier synchronizations within PARALLEL regions. These focus the user on code sections that have a significant effect on parallel performance.

5. ACKNOWLEDGEMENTS

The authors would like to thank their colleagues involved in the many different aspects of this work, including Gabriele Jost and Jerry Yan (NASA Ames), Dan Johnson, Wei-Kuo Tao and Steve Steenrod (NASA Goddard), and Emyr Evans, Peter Leggett, Jacqueline Rodrigues and Mark Cross (Greenwich). Finally, the funding for this project from AMTI subcontract No. SK-03N-02 and NASA contract DTTS59-99-D-00437/A61812D is gratefully acknowledged.

REFERENCES

[1] E.W. Evans, S.P. Johnson, P.F. Leggett and M. Cross, Automatic and effective multi-dimensional parallelisation of structured mesh based codes. Parallel Computing, 26, 677-703, 2000.
[2] C.S. Ierotheou, S.P. Johnson, M. Cross and P.F. Leggett, Computer aided parallelisation tools (CAPTools) - conceptual overview and performance on the parallelisation of structured mesh codes. Parallel Computing, 22, 197-226, 1996.
[3] S.P. Johnson, M. Cross and M. Everett, Exploitation of symbolic information in interprocedural dependence analysis. Parallel Computing, 22, 197-226, 1996.
[4] S.P. Johnson, C.S. Ierotheou and M. Cross, Computer aided parallelisation of unstructured mesh codes. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, H.R. Arabnia et al. (eds.), CSREA, vol. 1, 344-353, 1997.
[5] H. Jin, M. Frumkin and J. Yan, Automatic generation of OpenMP directives and its application to computational fluid dynamics codes. International Symposium on High Performance Computing, Tokyo, Japan, October 16-18, 2000, Lecture Notes in Computer Science, vol. 1940, 440-456.
[6] C.S. Ierotheou, S.P. Johnson, P.F. Leggett and M. Cross, Using an interactive parallelisation toolkit to parallelise an ocean modelling code. FGCS, vol. 19, 789-801, 2003.
[7] H. Jin, G. Jost, D. Johnson and W.-K. Tao, Experience on the parallelization of a cloud modeling code using computer-aided tools. NASA Technical Report NAS-03-006, 2003.
[8] G. Matthews, R. Hood, S.P. Johnson and P.F. Leggett, Backtracking and re-execution in the automatic debugging of parallelized programs. Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, Edinburgh, Scotland, July 2002.
[9] H. Agrawal, Towards automatic debugging of computer programs. Ph.D. Thesis, Department of Computer Sciences, Purdue University, West Lafayette, IN, 1991.
[10] G. Matthews, R. Hood, H. Jin, S.P. Johnson and C.S. Ierotheou, Automatic relative debugging of OpenMP programs. To appear in the proceedings of EWOMP 2003, 2003.
[11] http://www.cepba.upc.es/paraver