Porting to HPF: Experiences with DBETSY3D within PHAROS

Dr. Thomas Brandes* and Dr. Kadri Krause+

*Institute for Algorithms and Scientific Computing (SCAI), GMD National Research Center for Information Technology, 53754 St. Augustin, Germany

+debis Systemhaus, Dept. CCS RSW-ST/HVS, 70771 Leinfelden-Echterdingen, Germany
Abstract

In this paper we describe DBETSY3D, outline the Esprit project PHAROS, and recap the steps in porting DBETSY3D from Fortran 77 to High Performance Fortran (HPF). We then present our results and discuss the advantages of HPF.

1. DBETSY3D

DBETSY3D is a Fortran 77 program that calculates stresses and deformations on the surfaces of objects using the theory of linear elastostatics in three-dimensional applications. The code was developed by Mercedes Benz and is used extensively there for calculating linear stresses [1]. It is based on the Boundary Element Method (BEM). BEM reduces the problem from volumes to surfaces by using so-called fundamental solutions within the object and replacing the governing differential equations by integral equations over the surface only. The advantage of the BEM over finite element methods is that only the surface, and not the entire volume, needs to be modelled. DBETSY3D started in 1977 as the Fortran IV program BETSY at the Forschungsvereinigung Verbrennungskraftmaschinen (Research Group for Combustion Engines) in Frankfurt, Germany.

The sequential version of DBETSY3D consists of six programs; one DBETSY3D calculation runs all six in sequence. The whole package consists of nearly 28,000 lines of Fortran 77 code. To stay within the constraints of the PHAROS project, we have chosen to focus on the fifth program of this series of six. DBETSY5 is the heart of the code, where many matrices are solved using LU decompositions. Of the six programs, DBETSY5 requires the most CPU time. It consists of 1591 lines of code in 19 subroutines.

DBETSY5 solves the linear systems based on the residual matrix technique. The algorithm works on submatrices that cover at least all the non-zero elements of the full (virtual) matrix. Each submatrix is reduced to the degrees of freedom that it has in common with other submatrices by means of forward substitution. (These submatrices are underdetermined.) These individual residual matrices are combined to form the residual matrix. This reduced equation system is solved. Subsequently the partial equation systems are solved by means of backward substitution. Pseudocode for DBETSY5 follows:

    read in problem sizes
    for each submatrix
        LU decompose it
        copy part of the LU decomposition into the residual matrix
    endfor
    solve the residual matrix
    for each submatrix
        use solution of residual matrix to make the submatrix square
        perform backward substitution using the LU decomposition
        write submatrix solution into solution of the original problem
    endfor

Even with this technique of residual matrices, it is not possible to store a complete submatrix in core; out-of-core routines are necessary for the forward and backward substitution. We chose to do a coarse-grain parallelization over the submatrices (a short HPF sketch of this scheme is given after Section 2).

2. PHAROS

The PHAROS project was funded by the European Union's ESPRIT program for research and development in information technology. It aimed to assess HPF as a paradigm for porting large Fortran 77 scientific applications to distributed memory architectures, in comparison to message-passing programming. To this end, four major commercial Fortran 77 application codes were ported to HPF within PHAROS by using European tools which are available on the market today.

There were nine partners from five countries working together within PHAROS. The three tools that were used were FORESYS from Simulog (France), the HPF Mapper from NA Software (England), and VAMPIR from Pallas (Germany). The two HPF expertise providers were VCPC (Austria) and the GMD (Germany). The industrial code providers were CISE (Italy), Matra (France), Semcap (France), and debis Systemhaus (Germany). PHAROS started in January 1996 and runs until December 1997. More information about PHAROS and its results can be obtained via the web at http://www.vcpc.univie.ac.at/activities/projects/PHAROS/.

All four industrial codes were ported using the general steps that form the following section headings. The details in the sections themselves refer only to the DBETSY3D port.
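Looking ahead to the HPF port described in the following sections, the coarse-grain scheme of Section 1 amounts to asserting that the two loops over submatrices in the pseudocode are INDEPENDENT. The following free-form Fortran fragment is a minimal sketch of that structure only; the subroutine names are illustrative and do not correspond to the actual DBETSY5 routines.

    ! Illustrative sketch only; subroutine names are not the actual DBETSY5 routines.
    SUBROUTINE SOLVE_ALL_SUBMATRICES(NSUB)
      INTEGER, INTENT(IN) :: NSUB          ! number of submatrices
      INTEGER :: I

      ! forward phase: each iteration LU-decomposes one submatrix (out of core)
      ! and copies its contribution into the residual matrix
!HPF$ INDEPENDENT
      DO I = 1, NSUB
        CALL REDUCE_SUBMATRIX(I)
      END DO

      ! the reduced (residual) system is solved; this part stays sequential
      CALL SOLVE_RESIDUAL_MATRIX()

      ! backward phase: make each submatrix square using the residual solution
      ! and perform the backward substitution
!HPF$ INDEPENDENT
      DO I = 1, NSUB
        CALL BACKSUBSTITUTE_SUBMATRIX(I)
      END DO
    END SUBROUTINE SOLVE_ALL_SUBMATRICES

The INDEPENDENT assertion is the programmer's responsibility; it holds here because each submatrix is reduced and back-substituted without reference to the others.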
3. FIRST STEP IN THE PORTING: CONVERT FORTRAN 77 CODE TO FORTRAN 90

This conversion was done with the help of a powerful tool, FORESYS, at the syntax level, with the algorithm remaining unchanged. FORESYS produced the new Fortran 90 syntax. It declared all variables (IMPLICIT NONE). It eliminated most of the GOTOs. It provided intent (IN, OUT, INOUT) for the dummy arguments in subroutines. It described the in-out behavior of COMMON variables. It indented the code and pointed out unused variables. It also identified dead code. In short, FORESYS made the code much more readable.

4. SECOND STEP: CLEAN UP BY HAND

We did much to improve readability. Most important was to introduce dynamic arrays that are two-dimensional. DBETSY3D had many conceptually two-dimensional arrays that were stored as one-dimensional ones; this was necessary because Fortran 77 does not support dynamic arrays. The cleanup was not straightforward, since the workspace array in the blank common was used in two different ways in different parts of the program. In addition we removed sequence association so that the HPF Mapper could compile correctly. We reorganized I/O. Through intimate knowledge of the code, we were able to remove a loop dependency so that all iterations of the LU decomposition could be done in parallel. We also made numerous other small changes to improve readability.

5. THIRD STEP: CONVERT TO HPF BY HAND

This step was fairly easy. Only a few HPF directives were necessary, mainly to tell the HPF compiler that the for loops in the pseudocode can be executed independently.

6. FOURTH STEP: HPF MAPPER

Not all the features of the HPF language that we needed were supported by the NA Software HPF Mapper at the beginning of PHAROS. However, part of PHAROS was to see how well the tools performed on actual industrial codes. Through the input of the four industrial code providers, NA Software saw which features were important to our real-life codes and then prioritized their development of new features accordingly. The features that we needed and which NA Software added were EXTRINSIC(HPF_SERIAL) and the parallel execution of such subroutines.

7. LAST STEP: FINE TUNING

We expected load balancing to become a problem, and visualization with VAMPIR confirmed that it did: the VAMPIR traces made the idle processors easy to see. Although indirect distributions were not supported, we found a solution for load balancing that reduced the execution time dramatically. We exploited a two-dimensional boolean array that specifies the mapping of submatrices to processors. The first dimension is indexed by the number of the submatrix and the other dimension, which is block distributed, is indexed by the number of the processor.
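A minimal sketch of this kind of mapping is given below; the array OWNS, the routine PROCESS_SUBMATRIX and the sizes are illustrative assumptions rather than the actual DBETSY5 identifiers.

    ! Sketch only: names and sizes are illustrative, not the DBETSY5 code.
    PROGRAM LOAD_BALANCE_SKETCH
      INTEGER, PARAMETER :: NSUB  = 40     ! number of submatrices (example value)
      INTEGER, PARAMETER :: NPROC = 5      ! number of processors  (example value)
      ! OWNS(I,P) is true if submatrix I is handled by processor P;
      ! the processor dimension is block distributed
      LOGICAL :: OWNS(NSUB, NPROC)
!HPF$ DISTRIBUTE OWNS(*, BLOCK)

      INTERFACE
        EXTRINSIC(HPF_SERIAL) SUBROUTINE PROCESS_SUBMATRIX(I)
          INTEGER, INTENT(IN) :: I
        END SUBROUTINE PROCESS_SUBMATRIX
      END INTERFACE

      INTEGER :: I, P

      ! ... set OWNS so that large and small submatrices are mixed on each processor ...

!HPF$ INDEPENDENT
      DO P = 1, NPROC
        DO I = 1, NSUB
          ! serial, out-of-core work for each submatrix assigned to processor P
          IF (OWNS(I, P)) CALL PROCESS_SUBMATRIX(I)
        END DO
      END DO
    END PROGRAM LOAD_BALANCE_SKETCH

The intent is that, with the second dimension of OWNS block distributed, each iteration of the outer loop is executed on the processor owning the corresponding column, so filling OWNS amounts to assigning submatrices to processors explicitly.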
8. RESULTS

The following table shows the results for DBETSY5 for a crank shaft under gas pressure. These cases were run on the IBM SP2 at the GMD. As one can see, there is a nice speedup of about 3.2 for 5 processors. Because the submatrices differ in size, it is inherent in the coarse-grain parallelization that more speedup is not attainable for this example.

Table 1
Speedup on the IBM SP2

Processors   Elapsed seconds   Speedup
    1              590            -
    2              314           1.88
    3              261           2.26
    4              190           3.11
    5              184           3.21
9. EFFORT ANALYSIS

The bulk of the effort in porting DBETSY3D to HPF was spent on the conversion from Fortran 77 to Fortran 90. This had the added benefit of much cleaner and more readable code. The conversion from Fortran 90 to HPF was faster and easier than writing MPI code, and we expect the HPF code to be much easier to maintain than the MPI code. The HPF code has the huge advantage that it can run both sequentially and in parallel: there is just one code to maintain, and this one code can run on a single workstation, on a cluster of workstations, or even in parallel on a supercomputer.

10. FUTURE WORK

During PHAROS we also experimented with fine-grain parallelism, but with disappointing results. In the future, when more HPF language features are supported by HPF compilers, especially HPF_LOCAL, and when there is better support for INDEPENDENT loops, we can exploit fine-grain parallelism when solving the residual matrix, which is currently done sequentially.

REFERENCES

1. H. Butenschon, W. Mohrmann, and W. Bauer, Advanced Stress Analysis by a Commercial BEM Code, in P.K. Banerjee and R.B. Wilson, editors, Industrial Applications of Boundary Element Methods, Developments in Boundary Element Methods-5, Elsevier Science Publishers, Ltd., London (1989) 231-261.