Computer Physics Communications 180 (2009) 555–558

Computational physics with PetaFlops computers

Norbert Attig

Jülich Supercomputing Centre (JSC), Institute for Advanced Simulation (IAS), Forschungszentrum Jülich GmbH, 52425 Jülich, Germany


Article history:
Received 9 September 2008
Accepted 18 December 2008
Available online 25 December 2008

Driven by technology, Scientific Computing is rapidly entering the PetaFlops era. The Jülich Supercomputing Centre (JSC), one of three German national supercomputing centres, is focusing on the IBM Blue Gene architecture to provide computer resources of this class to its users, the majority of whom are computational physicists. Details of the system will be discussed and applications will be described which significantly benefit from this new architecture. © 2008 Elsevier B.V. All rights reserved.

PACS: 07.05.Bx; 83.10.Rs

Keywords: Supercomputing; IBM Blue Gene; CPMD

1. Introduction

In many areas of physics, numerical simulation has become an essential tool for advancing theoretical research. Driven by the rapid development of computer technology, this insight has dramatically raised the expectations of computational scientists with respect to application performance, memory, data storage, and data transfer capabilities [1,2]. Currently, only high-performance supercomputers with a large number of processors are capable of fulfilling these needs. The Jülich Supercomputing Centre has implemented a dual supercomputer strategy to provide computational scientists with adequate computing resources. First, a general-purpose supercomputer, currently realised as a moderately parallel cluster with a peak performance of 8.4 TeraFlop/s, serves about 150 German and European user groups from universities and research institutions. This system allows the development of parallel codes as well as the execution of small to mid-size projects. Second, for applications which scale up to tens of thousands of processors (capability computing) and which tackle Grand Challenge problems, an IBM Blue Gene system with a peak performance of 223 TeraFlop/s is available, serving as a leadership-class system geared to petascale problems. On this system, a much smaller number of projects is granted time, giving selected researchers the opportunity to gain new insights into complex problems that were previously out of reach. Both supercomputer systems are integrated into a common user environment and have access to a common general parallel file system, a functionality provided by a dedicated file server.

2. Integration of JSC in existing HPC networks and alliances

Since 1986 the primary mission of the Jülich Supercomputing Centre has been the provision of supercomputer resources of the highest performance class to the scientific and engineering research communities at national and European universities and research institutions. This includes the provision of a state-of-the-art technical infrastructure as well as optimal user support. The appropriate allocation of the corresponding resources is ensured by an international peer-review process. This process is the responsibility of the John von Neumann Institute for Computing (NIC) [3], a virtual institute founded by three partners of the Helmholtz Association, among them Forschungszentrum Jülich. NIC is managed by a Board of Directors with members from each partner institution and the Director of JSC as acting director. The NIC Scientific Council, consisting of renowned international scientists, gives recommendations with respect to the scientific programme of NIC and the allocation of supercomputing resources. With the foundation of the Gauss Centre for Supercomputing (GCS) in 2006, Germany created a new and powerful structure for its three national supercomputing centres in Garching, Jülich and Stuttgart to take a leading role in Europe [4]. The Gauss Centre's members, who signed an agreement to found a registered association, will follow a common direction in this organisation. The procurement of hardware will be closely coordinated, applications for computing time will be scientifically evaluated on a common basis, and software projects will be jointly developed. Another key area will be training. The work of specialist researchers will be supported and promoted by harmonising the services and organising joint schools, workshops, and conferences on simulation techniques.



Methodologically oriented user support is also a major concern of the Gauss Centre. The GCS association represents Germany as a single legal entity in the European supercomputing infrastructure initiative PRACE (Partnership for Advanced Computing in Europe), which aims at the creation and sustained operation of a pan-European Tier-0 supercomputing service and its full integration into the European HPC ecosystem. Jülich is a leading partner within PRACE and aspires to become the first European supercomputing centre with PetaFlops capability in 2009/2010 [5]. For leading supercomputing centres like JSC it has become indispensable to be well integrated in, and even to take the lead in, strategic alliances and networks on different levels. This integration is necessary to establish the consistently high visibility and prestige that are essential preconditions for competing successfully with other centres for adequate and sustainable funding.

Table 1
Comparison of the BG/L and BG/P systems.

Property                                  Blue Gene/L                    Blue Gene/P

Node properties
  Node processors                         2 × PowerPC 440                4 × PowerPC 450
  Processor frequency                     0.7 GHz                        0.85 GHz
  Coherency                               Software managed               SMP
  L3 cache size (shared)                  4 MB                           8 MB
  Main store                              512 MB / 1 GB                  2 GB / 4 GB
  Main store bandwidth                    5.6 GB/s                       13.6 GB/s
  Peak performance                        5.6 GF/node                    13.6 GF/node

Torus network
  Bandwidth                               6 × 2 × 175 MB/s = 2.1 GB/s    6 × 2 × 425 MB/s = 5.1 GB/s
  Hardware latency (nearest neighbour)    200 ns (32 B packet),          100 ns (32 B packet),
                                          1.6 μs (256 B packet)          800 ns (256 B packet)
  Hardware latency (worst case)           6.4 μs (64 hops)               3.2 μs (64 hops)

Tree network
  Bandwidth                               2 × 350 MB/s = 700 MB/s        2 × 0.85 GB/s = 1.7 GB/s
  Hardware latency (worst case)           5.0 μs                         3.5 μs
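As a cross-check that is not part of the original table, the per-node peak figures follow from the clock rate, the number of cores per node, and the four floating-point operations per cycle (two fused multiply-adds) delivered by each core's double FPU described in Section 3:

\[
  2 \times 0.7\,\text{GHz} \times 4\,\tfrac{\text{flop}}{\text{cycle}} = 5.6\,\text{GFlop/s (BG/L)}, \qquad
  4 \times 0.85\,\text{GHz} \times 4\,\tfrac{\text{flop}}{\text{cycle}} = 13.6\,\text{GFlop/s (BG/P)},
\]

and the 16,384 BG/P nodes of a 16-rack system give 16,384 × 13.6 GFlop/s ≈ 222.8 TFlop/s, the JUGENE peak quoted in Section 3.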

3. IBM Blue Gene systems at JSC

When the IBM Blue Gene technology became available in 2004/2005, the Jülich Supercomputing Centre quickly recognised the potential of this architecture as a leadership-class system for capability computing applications. In early summer 2005, Jülich started testing a single Blue Gene/L rack with 2048 processors [6]. It soon became obvious that many more applications than initially expected could be ported to run efficiently on the Blue Gene architecture. Therefore, in January 2006 the system was expanded to 8 racks with 16,384 processors, funded by the Helmholtz Association. The 8-rack system has been in successful operation for two years. About 30 research groups, carefully selected with respect to their scientific quality, run their applications on the system using job sizes between 1024 and 16,384 processors.

In early 2007, Research Centre Jülich decided to order a powerful next-generation Blue Gene system. In October 2007, a 16-rack Blue Gene/P system with 65,536 processors was installed, mainly financed by the Helmholtz Association and the State of North Rhine-Westphalia. With its peak performance of 222.8 TFlop/s, Jülich's Blue Gene/P – alias JUGENE – is currently the biggest supercomputer in Europe and was ranked No. 6 in the June 2008 edition of the Top500 list of the most powerful supercomputers worldwide [7–9]. The main characteristics of the Blue Gene architecture and the major differences between Blue Gene/L and Blue Gene/P are summarised in Table 1. A detailed description of the Blue Gene/P project can be found in a corresponding IBM report [10].

Besides the main characteristics, two important features should also be mentioned. The first is the Double Hummer floating point unit (FPU) – available on BG/L and BG/P – which allows parallel floating point operations on pairs of doubles, e.g. a multiplication of two complex numbers in two instructions. The FPU processes 64-bit variables only, while loads and stores are possible for 32- or 64-bit variables. The second is the Direct Memory Access (DMA) engine – only available on BG/P – which allows messages to be sent to other nodes, or to the node itself, without any processor intervention (direct puts and gets). The DMA engine interfaces directly with the torus network and has separate access to the L3 cache. The MPI commands MPI_ISEND and MPI_IRECV implicitly use the DMA; however, the engine can also be controlled by lower-level communication interfaces.

Key features of the Blue Gene systems are their balanced architecture with respect to processor, memory and network speed, and their scalability towards PetaFlops computing based on low power consumption [11], a small footprint and a reasonable price/performance ratio.
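To make the two-instruction remark about complex multiplication concrete, the following sketch (illustrative only, not taken from the paper or from any of the codes discussed below) writes the product in the parallel-multiply / cross-multiply-add form that the double FPU is built for; whether the XL compiler actually emits the two paired SIMD instructions depends on compiler flags and data alignment.

typedef struct { double re, im; } cplx;

/* (a + bi)(c + di), arranged as a parallel multiply followed by a cross
 * multiply-add.  On BG/L and BG/P the Double Hummer FPU can, in principle,
 * execute each of the two steps as one SIMD instruction operating on a
 * pair of doubles, so the full complex product costs two FPU instructions. */
static cplx cmul(cplx x, cplx y)
{
    cplx z;
    double t_re = x.re * y.re;     /* step 1: parallel multiply (x.re*y.re, x.re*y.im) */
    double t_im = x.re * y.im;
    z.re = t_re - x.im * y.im;     /* step 2: cross multiply-add                       */
    z.im = t_im + x.im * y.re;
    return z;
}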

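The benefit of the DMA engine is most visible when communication is overlapped with computation. The following minimal sketch (again purely illustrative, not one of the production codes mentioned in this article) posts a nearest-neighbour halo exchange with MPI_Irecv/MPI_Isend, performs the interior update while the transfers are in flight, and only then waits for completion; the ring topology, array size and update rule are arbitrary choices made for the example.

/* Overlap of computation and communication via non-blocking MPI calls.
 * On Blue Gene/P the DMA engine can progress these transfers without
 * processor intervention. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024                        /* local problem size (arbitrary) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *field = calloc(N + 2, sizeof(double));  /* 1D field + 2 halo cells */
    int left  = (rank - 1 + size) % size;           /* simple ring of ranks    */
    int right = (rank + 1) % size;

    MPI_Request req[4];
    /* Post receives and sends for the halo cells; the data can move while
       the processor keeps computing. */
    MPI_Irecv(&field[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&field[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&field[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&field[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* The interior update does not touch the halo cells, so it proceeds
       concurrently with the communication. */
    for (int i = 2; i < N; i++)
        field[i] = 0.5 * (field[i - 1] + field[i + 1]);

    /* Complete the exchange before the boundary cells are updated. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    field[1] = 0.5 * (field[0] + field[2]);
    field[N] = 0.5 * (field[N - 1] + field[N + 1]);

    free(field);
    MPI_Finalize();
    return 0;
}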
4. User support

With the increasing prevalence of architectures based on massively parallel and multi-core processor topologies, many simulation scientists are compelled to take scalability into account when developing new models or when porting long-established codes to machines like the Blue Gene. This poses significant problems for the small research groups which make up the majority of users of the JSC computing facilities and which typically do not have the resources or expertise for application petascaling. To address the urgent software challenges posed by supercomputing in the PetaFlops era, a new strategy for high-level user support is mandatory.

Traditionally, JSC's user support is structured in three levels. Basic support is the first level to be contacted for all questions and problems that may arise. If necessary, the problem is forwarded to a JSC specialist who provides advanced support and may help with more specific questions, concerning in particular methodological and optimisation aspects. Furthermore, each project is assigned an expert advisor, i.e. a staff member of JSC with a scientific background close to the research field of the project, who can also discuss scientific questions with the project members and form a long-term partnership. Unfortunately, the third level suffers from the increasing workload placed on the advisors with respect to the selection and implementation of efficient, highly scalable algorithms for massively parallel systems.

JSC plans to remedy this problem by establishing several so-called Simulation Laboratories (SLs). A Simulation Laboratory is a community-oriented research and support structure for a scientific community. It is an integral part of the community, strengthening it by working closely with its members to assist them in performing simulations on supercomputers. It consists of a core group located at a supercomputer centre and a number of associated scientists outside [12]. The SL is staffed to the level of a small, self-contained research team with high-level expertise in community-related codes and algorithms. The know-how provided to a research community by a Simulation Laboratory thus far exceeds the support level of a traditional expert advisor. The traditional user-support model is not completely replaced by Simulation Laboratories but is naturally augmented by them. The mission of a Simulation Laboratory is defined and monitored by its community. For this purpose a Steering Committee will be established jointly by the community and the supercomputing centre, whose task is to evaluate, select and rank incoming work packages.

The whole high-level user support structure at JSC is complemented by a rich portfolio of training and education activities. Winter schools, symposia and scaling workshops are of major importance in this context.


While the first two kinds of events mainly address the advanced education of junior scientists, scaling workshops have proved to be an excellent instrument for further exploiting the scalability of applications intended to run on massively parallel systems [13,14]. Since 2006, JSC has held four of these workshops, each attracting between 30 and 50 users. During these workshops, optimisation experts from Argonne National Laboratory, IBM and JSC – all members of the Blue Gene consortium – joined forces and tuned codes in collaboration with users towards effective execution on more than 10,000 processors of JSC's Blue Gene systems. Computational scientists from many research areas take the chance to improve their codes during these events and later apply for significant shares of Blue Gene computer time to tackle unresolved questions which were out of reach before.

5. Running applications on Blue Gene

Because the Blue Gene architecture is well balanced in terms of processor speed, memory latency and network performance, parallel applications scale reasonably on these systems up to large numbers of processors. Still, it is surprising how many applications can be ported to and run efficiently on this new architecture, whose forerunner was mainly designed to run lattice quantum chromodynamics (LQCD) codes. Blue Gene applications at JSC cover a broad spectrum, ranging from LQCD to molecular dynamics codes like CPMD and VASP, materials science, protein folding codes, kinetic plasma simulation, fluid flow research, quantum computing and many others. In the following, three examples are discussed in some detail.

Codes used for the investigation of strong interactions on 4D space-time lattices (LQCD codes) have ever-increasing demands with respect to compute power. For decades they have been continuously optimised and adapted early to new architectures. They usually show excellent scaling behaviour and the highest performance. One of these codes, which has been running at JSC for several years [15], was ported early to the Blue Gene/P system. The code, a Hybrid Monte Carlo algorithm with Symanzik-improved gauge action and dynamical ultraviolet-filtered Clover fermions, is mainly written in C; communication is implemented in assembler and SPI (no MPI). The Clover sparse matrix multiplication, which takes about 80% of the execution time, is fully written in assembler. Special features of the code are the use of low-level compiler macros for compute-intensive parts, an efficient overlap of computation and communication, and heavy use of the Double Hummer FPU. Furthermore, it is ensured that the lattice fits the torus network of the Blue Gene. With this code, perfect linear strong scaling was observed up to the full 16-rack JUGENE system (65,536 processors), at a utilisation of nearly 37% of its peak performance, which translates to about 80 TeraFlop/s.

A code which is widely used at JSC is the Car-Parrinello Molecular Dynamics (CPMD) package [16]. CPMD is a parallelised plane wave/pseudopotential implementation of Density Functional Theory, particularly designed for ab initio molecular dynamics. The code is mainly written in Fortran77, is well parallelised and runs on many different platforms. Special features for Blue Gene architectures are a hierarchical taskgroup parallelisation, the use of parallel linear algebra routines and a parallel initialisation. Individual support and tuning for IBM architectures is ensured by Alessandro Curioni, who is both an IBM research staff member and a member of the CPMD development team.
He continuously improves CPMD together with research colleagues who have additional requests or ideas for optimisation. According to user reports, CPMD is implemented in a highly efficient version on JUGENE. The CPMD package shows excellent scaling behaviour up to tens of thousands of processors; however, the most interesting scientific questions are best tackled by runs using between 1000 and 4000 processors. This is also true when simulations are performed with a similar code, the Vienna Ab-initio Simulation Package (VASP).
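As a brief quantitative aside that is not spelled out in the paper, scaling statements such as those above are usually expressed through the strong-scaling speed-up and parallel efficiency relative to a reference processor count \(p_0\):

\[
  S(p) = \frac{T(p_0)}{T(p)}, \qquad E(p) = \frac{p_0\,T(p_0)}{p\,T(p)},
\]

where \(T(p)\) is the time to solution on \(p\) processors and perfect linear strong scaling corresponds to \(E(p) \approx 1\). The sustained-performance figure quoted for the LQCD code is consistent with the machine data given earlier: \(0.37 \times 222.8\,\text{TFlop/s} \approx 82\,\text{TFlop/s}\), i.e. "about 80 TeraFlop/s".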


Using CPMD on Blue Gene and more than six million processor hours, the research team of Dominik Marx at Ruhr University Bochum obtained outstanding new results on the synthesis of peptides in aqueous media under extreme thermodynamic conditions [17].

The computational fluid dynamics solver XNS is used to simulate the blood flow in Ventricular Assist Devices (VADs). Tuning these devices so that the release of haemoglobin into the bloodstream (haemolysis) and the clotting of blood (thrombosis) are avoided is of the highest importance. Extended parameter studies with the XNS code enable major design improvements of the VADs. The XNS code is based on finite element techniques using stabilised formulations, unstructured three-dimensional meshes and iterative solution strategies. Furthermore, a number of novel features are integrated to obtain reliable simulation results which can be mapped to real devices. The simulations are so complex that it is crucial to exploit very large numbers of processors simultaneously. After porting the code to the Blue Gene system, acceptable scaling was initially observed only up to about 1000 processors. During one of the scaling workshops the code was analysed intensively together with performance analysis experts from JSC. It turned out that good scaling behaviour can be obtained beyond 4000 processors by improving the communication pattern of XNS. This already allows simulation data to be acquired four times faster [18], and holds potential for further optimisation.

6. Summary

In this contribution, the rapid development of computer technology and its impact were outlined. It was argued that only centres which are fully embedded in local, regional, national and international (European) networks and alliances are able to provide supercomputing resources of the highest performance class continuously. The Jülich Supercomputing Centre (JSC) in Germany is such a centre; today it concentrates on leadership-class supercomputers of the IBM Blue Gene type, making PetaFlops computing possible. It serves a significant number of selected research groups from the computational sciences in Germany and Europe who perform leading-edge science with outstanding supercomputing resources. Applications on Jülich's Blue Gene systems profit not only from the enormous scalability of the systems, the high-speed network and the balanced system architecture, but also from a newly restructured, community-oriented high-level user support; altogether an excellent basis for performing cutting-edge science.

Acknowledgements

The author would like to thank the organisers of the Conference on Computational Physics (CCP 2008) for the invitation and for providing the opportunity to present his experiences with large-scale Blue Gene systems to a broad scientific audience. The author would also like to thank Alessandro Curioni, Rüdiger Esser, Paul Gibbon, Stefan Krieg and Dominik Marx for stimulating discussions on this topic.

References

[1] A. Bode, W. Hillebrandt, T. Lippert, Petaflop-Computing mit Standort Deutschland im europäischen Forschungsraum, Bedarf und Perspektiven aus Sicht der computergestützten Natur- und Ingenieurwissenschaft, "Scientific Case" im Auftrag des BMBF, Bonn, 2005.
[2] K. Koski, et al., European Scientific Case for high-end computing; see HPC in Europe Task Force, http://www.hpcineuropetaskforce.eu/draftdeliverables.
[3] The John von Neumann Institute for Computing (NIC); see http://www.fz-juelich.de/nic.
[4] The Gauss Centre for Supercomputing; see http://www.gauss-centre.de.
[5] The PRACE project; see http://www.prace-project.eu.


[6] N. Attig, K. Wolkersdorfer, IBM Blue Gene/L in Jülich: A first step to Petascale computing, inSiDE 3 (2) (2005) 18–19.
[7] M. Stephan, K. Wolkersdorfer, IBM BlueGene/P in Jülich: The next step towards Petascale computing, inSiDE 5 (2) (2007) 46–47.
[8] N. Attig, F. Hoßfeld, Towards PetaFlops computing with IBM Blue Gene, in: PASA09 Proceedings, vol. 124, ISBN 978-3-88579-218-5, 2008, pp. 11–13.
[9] TOP500 Supercomputer Sites; see http://www.top500.org.
[10] IBM Blue Gene Team, Overview of the IBM Blue Gene/P project, IBM J. Res. & Dev. 52 (1/2) (2008) 199–220.
[11] The Green500 list; see http://www.green500.org.
[12] N. Attig, R. Esser, P. Gibbon, Simulation laboratories: An innovative community-oriented research and support structure, in: CGW'07 Proceedings, ISBN 978-83915141-9-1, 2008, pp. 1–9.
[13] W. Frings, M.-A. Hermanns, B. Mohr, B. Orth (Eds.), Blue Gene/L Scaling Workshop 2006, Technical Report IB-2007-02, 2007. For an online version see www.fz-juelich.de/jsc/files/docs/ib/ib-07/ib-2007-02.pdf.

[14] W.D. Gropp, W. Frings, M.-A. Hermanns, E. Jedlicka, K.E. Jordan, F. Mintzer, B. Orth, Scaling science applications on Blue Gene, in: ParCo 2007 Proceedings, Advances in Parallel Computing, vol. 15, IOS Press, ISBN 978-1-58603-796-3, 2008, pp. 583–584.
[15] S. Krieg, Optimizing Lattice QCD Simulations on Blue Gene/L, in: ParCo 2007 Proceedings, Advances in Parallel Computing, vol. 15, IOS Press, ISBN 978-1-58603-796-3, 2008, pp. 543–550.
[16] CPMD consortium; see http://www.cpmd.org.
[17] E. Schreiner, N.N. Nair, D. Marx, Influence of extreme thermodynamic conditions and pyrite surfaces on peptide synthesis in aqueous media, J. Am. Chem. Soc. 130 (9) (2008) 2768–2770.
[18] M. Behbahani, M. Nicolai, M. Probst, M. Behr, Simulation of blood flow in a ventricular assist device, inSiDE 5 (1) (2007) 20–23.