Computer Physics Communications 128 (2000) 55–66

Multithreaded shared memory parallel implementation of the electronic structure code GAMESS ✩

Barry Bolding a, Kim Baldridge b,∗

a CRAY Inc., 411 First Ave South, Suite 600, Seattle, WA 98104, USA
b San Diego Supercomputer Center, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA

✩ This paper is published as part of a thematic issue on Parallel Computing in Chemical Physics.
∗ Corresponding author.

Abstract

The parallelization of electronic structure theory codes has become a very active area of research over the last decade. Until recently, however, most of this research has dealt with distributed memory parallel architectures. The objective of this work is to provide general information about the parallel implementation of the General Atomic and Molecular Electronic Structure System, GAMESS, for the TERA multithreaded shared memory architecture. We describe the programming paradigm of this machine, the strategies used to parallelize GAMESS, general performance results obtained to date, and conclusions and future work on this project. © 2000 Elsevier Science B.V. All rights reserved.

1. Introduction

High-performance computational technologies have been growing in capability and power at a rate unimaginable a decade ago. The degree of complexity in various types of molecular applications has brought a greater focus on advanced high-performance computing methods, such as massive parallelization, more sophisticated visualization, and robust networked communications for data transfer. The resulting methodologies have facilitated laboratory investigations and paved the way for technological advancements in fields of computational science that involve complexities far beyond analytic/numerical solution or human perseverance. There is no question that advancements in the field of quantum mechanics would not have been so dramatic without the significant developments in hardware architectures and capabilities. The importance of parallel computers in increasing the size and complexity of chemical systems that can be treated with quantum mechanical techniques has been recognized for some time now by many groups [1–15]. Much of this work has focused on developing algorithms and techniques for harnessing cycles on distributed memory architectures. Exploitation of large-scale distributed memory platforms has clearly been a milestone for computational chemistry, allowing the computational study of molecular targets exceeding 1000 basis functions (1 kilobasis, or 1 Pople) [16] in a scalable fashion [17–21]. Massively parallel processors (MPPs) now routinely deliver performance exceeding that of serial technology by more than 100-fold. Previous studies have reported successful parallelization strategies for the energy and analytical gradient of most types of ab initio wavefunctions, for analytic RHF, ROHF, and GVB Hessian computations, and for coupled-cluster theory [22–37]. As documented in these studies and review articles, there are many techniques available for parallelizing quantum mechanical algorithms, weighted heavily by the particular hardware architecture one is faced with and typically employing one of several message-passing library toolkits.

Most recently, increased focus has been placed on more efficient ways of exploiting parallelism, either through specialized architectures or through significant efforts to restructure algorithms to better handle the large memory requirements of solving the quantum mechanical equations. An example of the latter direction is the use of a distributed data storage technique within the GAMESS (General Atomic and Molecular Electronic Structure System) program [32], which more effectively uses parallel computers for correlated wavefunctions with large basis sets.

Of particular interest here is the exploitation of large multithreaded shared memory architectures for parallel electronic structure theory. Multithreading has received considerable attention in recent years [38] as a promising way to hide memory latency in high-performance computers while providing access to a large and uniform shared memory. The intended benefits of such an architecture are high processor utilization, scalable performance on applications that are inherently difficult to parallelize, and reduced programming effort. The combination of these advantages promises an extra 1–2 orders of magnitude improvement in performance over conventional MPP architectures. We report on these and other issues in our efforts to port GAMESS to the TERA multithreaded architecture (MTA) [38].

2. Design features of the TERA massively parallel multithreaded architecture

The TERA MTA, designed and built by TERA Computer Co. of Seattle, is the state-of-the-art multithreaded computer investigated here. Its multithreaded processors and high-bandwidth network are designed to hide the latency to a large and uniform shared memory. Sophisticated compilers and performance tuning tools are available to extract parallelism from a broad spectrum of applications. Expected benefits of the MTA include high processor utilization, near linear scalability, and reduced programming effort compared to distributed memory architectures. This work was carried out on two machines: one at TERA Computer in Seattle, consisting of 2 processors and 2 gigabytes of memory, and a system housed at the San Diego Supercomputer Center (SDSC) on the campus of the University of California, San Diego (UCSD), consisting of 8 processors and 8 gigabytes of memory.

2.1. TERA MTA characteristics

The TERA Multi-Threaded Architecture (MTA) represents a radical departure from traditional vector- or cache-based computers in that it supports neither a data cache nor local memory. Instead, processors are connected via a network to commodity memory, configured in a shared memory fashion. Hardware multithreading is used to tolerate high latencies to memory, typically on the order of 150 clock cycles; even higher latencies can be tolerated by instruction lookahead [38]. Each processor on the MTA has up to 128 hardware streams. Each stream holds the context for one thread in a program counter and 32 registers. The processor switches from one stream to another every clock period, executing instructions from non-blocked streams in an impartial fashion. A minimum of 21 streams is required to keep a processor fully utilized. If a program is well parallelized and provides enough work to sustain 21 or more streams per processor, then the TERA MTA can issue an instruction every single clock, thereby achieving unprecedented processor efficiency. On all other present architectures, such sustained efficiency is difficult to attain. On parallel/vector architectures, high efficiency is only attainable for long, simple vector loops, while incorporating parallelism entails adding a layer of microtasking on top of the vector code. On RISC architectures, moderate efficiency is only attainable by programming to maximize cache efficiency. The present design of the MTA, while highly efficient for multithreaded codes, has the disadvantage for serial codes that a program can execute an instruction only once every 21 clocks, the length of the instruction pipeline. Also, the lack of a data cache exacerbates memory latency for serial execution. The consequence of this design is that single-threaded performance on the MTA is poor relative to other architectures with comparable clock speeds.


Scientific codes typically exhibit a high level of instruction-level parallelism, so the compiler is designed to use software pipelining to schedule memory references ahead of their uses, thereby achieving the maximum lookahead of 7. This means that 7 subsequent instructions from the same stream can be executed before the memory reference of the current instruction must be completed. This lookahead is highly effective in hiding memory latency. In each clock period, a processor can issue an instruction containing 1 memory reference and 2 other operations, one of which may be a fused multiply-add. Thus, the theoretical peak speed of a processor is three floating-point operations per clock (two adds and one multiply); in practice, one typically sees no more than two floating-point operations per clock on realistic computations.

2.2. Networking/Memory

The network connecting the processors to memory is a partially connected 3D torus that has multiple routing nodes per processor, rather than the reverse configuration that is increasingly utilized on distributed memory systems. The number of nodes is at least p^1.5, where p is the number of processors. This means that the bisection bandwidth scales linearly with p, while the network latency scales as p^0.5. The 8-processor system employed at SDSC is currently configured with 2 × 4 × 8 = 64 nodes. Of these, 8 nodes are attached to compute processors, 8 are attached to I/O processors, 16 are attached to memory boards, and 32 are not attached to any resources; the extra nodes provide links between resources to increase bandwidth. The maximum bandwidth from memory to a processor is 8 bytes times the clock speed. This particular system has one GB of memory per processor, corresponding to two memory boards. Each memory board has 64 banks, and memory references by the processors are randomly scattered among all of the banks of all the memory boards (to avoid stride-induced hotspots), except for instruction fetches, which access the processor instruction cache via a dedicated data path. As a result, memory is equally accessible to each processor, and memory latency is independent of stride.
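As a back-of-the-envelope check of these figures for the 8-processor SDSC system (our own arithmetic, restating the numbers quoted above rather than adding any new machine specification):

\[
p = 8: \qquad p^{1.5} = 8^{1.5} \approx 22.6 \le 64 = 2 \times 4 \times 8 \ \text{routing nodes}, \qquad \text{latency} \propto p^{0.5} = \sqrt{8} \approx 2.8 .
\]

Similarly, one plausible reading of the stream count quoted in Section 2.1 is that 21 streams issuing in round-robin fashion, each with a lookahead of 7, can cover roughly 21 × 7 ≈ 147 clock periods of outstanding memory latency, consistent with the ~150-cycle latency cited there.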


2.3. Operating System (OS) environment

The TERA OS is a parallel version of Unix [39], based on Berkeley sources and modified to use a concurrent microkernel developed by TERA. A two-tier scheduler is incorporated in the microkernel to support execution of multiple tasks, both large and small, without manual intervention. Large tasks running on more than a single processor are scheduled via a bin-packing scheme, while small tasks are scheduled using a traditional Unix approach. Fortran, C, and C++ compilers are all available, with separate front ends but a common back end. The sophisticated back end of the compiler ‘automatically’ parallelizes serial code by decomposing it into threads. Generally, it is necessary to augment such ‘automatic’ parallelization by inserting pragmas or explicit parallel constructs into the code. One can also take explicit control of thread management for finer control over the parallel profile of the application, although this was not done in our initial work with GAMESS.

One feature of the MTA which we use extensively is the atomic update. This allows the programmer to take advantage of lightweight hardware instructions to update shared memory locations using a simple Fortran directive. The C$TERA UPDATE directive uses either the FETCH_ADD instruction, to update shared integers, or the extremely fast full-empty bit hardware, to update floating point variables (a schematic example is given at the end of this section).

Two tools are provided within the TERA environment: (1) Canal, which provides an annotated version of the source code showing how the decomposition into threads has taken place and how it has been mapped to the hardware, and (2) Traceview, which provides a post-mortem analysis of an execution trace, showing how well the available hardware has been used by the code that executed.
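As an illustration of the atomic update described above, the following minimal Fortran sketch accumulates contributions into a shared array from inside a parallel loop. The directive spellings are modeled on the c$tera notation used later in Schemes 1 and 2 and may not match the exact compiler syntax; the routine, variable names, and loop body are hypothetical and are not taken from the GAMESS source.

      subroutine accum(n, idx, contrib, shared_sum, counter)
c     Illustrative sketch only: each stream adds its contribution into a
c     shared array and increments a shared counter; the c$tera update
c     directive guards each update so that concurrent streams cannot
c     interleave their read-modify-write sequences.
      integer n, idx(n), counter, i
      double precision contrib(n), shared_sum(*)
c$tera assert parallel
c$tera assert local i
      do i = 1, n
c$tera update
         shared_sum(idx(i)) = shared_sum(idx(i)) + contrib(i)
c$tera update
         counter = counter + 1
      enddo
      return
      end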

3. GAMESS overview

GAMESS, the General Atomic and Molecular Electronic Structure System, is a quantum chemistry program which solves the Schrödinger equation at various levels of theory, leading to direct quantitative predictions of chemical phenomena from first principles (ab initio).


Table 1
Breakdown of functionalities within GAMESS according to algorithm type available

Task                 RHF [1]    ROHF [2]   UHF [3]    GVB [4]    MCSCF [5]
Energy               CD-DP      CD-DP      CD-DP      CD-DP      CD-DP
Analytic gradient    CD-DP      CD-DP      CD-DP      CD-DP      CD-DP
Analytic Hessian     CD-DP      CD-DP                 CD-DP
Numerical Hessian    CD-DP      CD-DP      CD-DP      CD-DP      CD-DP
MP2 energy           CD-DP      CD-DP      CD-DP                 C
MP2 gradient         CD
CI energy            CD-DP      CD-DP                 CD-DP      CD-DP
CI gradient          CD
MOPAC [1] energy     CD         CD
MOPAC gradient       CD         CD

D: direct evaluation of AO integrals. DP: parallel distributed memory architecture execution. C: conventional storage of AO integrals on disk.

Table 1 gives a summary of the wavefunctions available in GAMESS, together with information about the availability of analytic determination of gradients, Hessians, second order Møller–Plesset theory (MP2), configuration interaction (CI), and multiconfigurational wavefunctions (MCSCF). Where analytical gradients are available, GAMESS can be used to calculate stationary points, intrinsic reaction coordinates (IRCs), and numerical Hessians. Complete details for GAMESS can be found elsewhere.

Parallelization efforts on GAMESS have been extensive [20,21,32,34–37]; Table 1 also summarizes the current program capabilities. These efforts began with the implementation of direct (D) SCF methods [40] over conventional (C) SCF for the processing of atomic orbital (AO) integrals. In the direct method, one recomputes the batches of integrals during each cycle of the SCF, while in the conventional method the integrals are computed once and stored on disk. The disadvantage of the conventional method on most parallel architectures is the limitation in I/O speed when processing the integrals at each iteration. A third possibility for processing the large number of atomic orbital integrals, which has not been rigorously addressed because of the limited memory and disk typical of distributed parallel platforms, is the exploitation of fully InCore techniques. This method computes all integrals once, as in the conventional method, but instead of moving them off to disk stores them in memory for faster parallel access. The InCore method can be simple to program on shared memory architectures and is efficient when access to memory is fast. The InCore method is an ideal methodology for the TERA MTA and will be discussed in this work.

Parallel (P) implementations have primarily focused on distributed memory platforms and clusters of workstations. In the past, shared memory implementations have only been addressed by running message passing parallel programs on SMP computer enclosures, or by use of distributed data storage (MEMDDI) [41], in which each process owns only a portion of the total array. The T3E supports this latter implementation via the SHMEM library. An accompanying paper discusses implementation of this technique on traditional Unix workstation clusters connected by Ethernet using socket calls.

4. TERA MTA parallel strategies

Our principal objective in this work was to determine whether a multithreaded computer architecture, embodied in the TERA MTA, would allow the electronic structure theory package GAMESS to achieve significant performance enhancements on large-scale systems.


It is difficult for many existing implementations of GAMESS on distributed memory architectures to achieve peak speed. The same is true for implementations on parallel/vector machines. This occurs (a) because data cannot be moved from memory to the processors fast enough to keep them fully busy, and (b) because the time to retrieve data from memory (i.e. memory latency) is not improving as fast as the processor clock period. Thus, management of memory latency is increasingly important to obtain high, sustained speeds. Promising approaches to manage memory latency in a scalable way include novel hardware architectures such as multithreading, discussed here, or algorithmic modifications such as a distributed data interface, which is described in another of the contributed articles in this special issue.

As has been identified in many previous studies [40,42], the computational bottlenecks for SCF energies and gradients lie in the computation of the two-electron integrals and two-electron gradient integrals, respectively. Clearly, then, these are the portions of the code that received the most attention in this first stage of research with the TERA MTA. However, where possible, we are also addressing other sections of the code. For example, we have also parallelized the one-electron integrals, since the procedure is similar to that of the two-electron integrals, and in the general course of ‘automatic’ parallelization by the TERA compiler, many routines were additionally addressed.

Significant attention has been paid to the handling of I/O in previous parallelization efforts on distributed parallel platforms. These implementations take a master-slave approach, in which node one reads the input, broadcasts it to the other nodes, and is additionally responsible for collecting and writing the output. These issues no longer arise on the TERA architecture, since I/O becomes trivial under the multithreaded shared memory model: one stream reads the input and stores the variables in memory, all threads then have access to these data, and the broadcasting step is avoided altogether.

During the course of this work, it became apparent that more than one strategy existed for parallelization due to the unique architecture of the TERA platform.


Over the past several years, parallel algorithm development has accommodated the distributed-memory model because of its cost effectiveness and portability (using such libraries as the Message Passing Interface, MPI). This has pushed algorithm development toward DIRECT methods, which allow recalculation of integrals, thereby avoiding I/O and memory bottlenecks. While these methods are not the most intuitive to the scientist, they effectively maximize efficiency for distributed-memory architectures, at the expense of data replication and communication. Several years ago, shared-memory platforms spurred the development of InCore algorithms, which are the most efficient and intuitive for small problems (e.g., ∼300 basis functions for approximately 8 GBytes of memory), but which are limited by memory for extremely large problems and by memory scalability if memory access becomes non-uniform. Our approach to parallelization has been to take the existing Direct SCF code and parallelize it efficiently for the shared-memory TERA architecture. This is important because, even in an InCore scheme, the integrals need to be calculated fully at least once during any SCF convergence cycle. In addition, using a simple technique, we have coded an InCore methodology into GAMESS which takes full advantage of the shared-memory and multithreading architecture intrinsic to this platform.

4.1. Energy

The specifics of the SCF procedure within GAMESS are detailed elsewhere [32]. The primary concept to focus on in this work is that, within a shared memory architecture, arrays such as the FOCK matrix and density matrix no longer have to be replicated on each of the nodes of the machine [32,35–37]. Thus, limitations on the molecular size that can be accommodated on such architectures are much less restrictive. The following sections of the SCF code were modified to run in parallel: one-electron integrals, two-electron integrals, matrix multiplications, matrix diagonalization, one-electron gradient integrals, and two-electron gradient integrals.

Parallelization of SCF energies within GAMESS for the TERA MTA architecture is conceptually simple because of the shared-memory paradigm. The existing parallel code within GAMESS was entirely designed for distributed-memory architectures. Therefore, some recoding of GAMESS was required to remove intrinsic programming assumptions regarding the data locality of some variables.


The most common modification needed was the removal of COMMON blocks used to pass information between subroutines within the parallel region. Within GAMESS the two-electron integrals are calculated by a four-index calculation over atomic orbitals (AO). The iterations of this four-index loop can easily be distributed across streams on the MTA by including a set of parallel directives defining the parallel loop and declaring which variables will be shared or local to the individual streams. Note that, conceptually, on the MTA it is easiest to think in terms of streams rather than processors, the way one would on a more conventional architecture. Also, only one level of parallelism is needed, rather than the two required for a parallel/vector architecture (microtasking and vectorization). The FOCK and density matrices are shared among all streams, and updates to these are done with atomic updates to ensure proper summation.

This parallelization is at a high level. The four-index loop calls several layers of subroutines and functions, which consider all the different types of integrals to be calculated. As a practical matter, some portions of these routines needed to be modified to reduce the use of common blocks for passing information to called subroutines. Previous programming paradigms emphasized distributed-memory architecture, where each processor would have an independent copy of a common block. This assumption does not hold on the TERA MTA, and although one can use TaskCommon directives, we chose not to, since not all variables in a given common block were to be shared. We primarily removed the COMMON blocks and simply passed the appropriate information through the subroutine calls themselves. Those COMMON blocks containing the FOCK and density matrices were left untouched, since these arrays are shared.

The two-electron integrals are calculated within the TWOEI subroutine. The basic structure of the parallel region is illustrated in Scheme 1. One notes the similarity of the TERA directive notation with Cray, SGI, or OpenMP directives. On the MTA, each iteration is scheduled in such a way that, as streams become available, the iterations are handed out and the work performed, regardless of the number of processors; all that matters is the availability of streams. The program will thus run efficiently on one or multiple processors, as long as there is sufficient work to be distributed among the available streams.

c$TERA assert parallel
c$TERA assert local {variable list}
      do {i>j>k>l}
         ...logic associated with AOs...
         if (condition1)
            call POPLE_method --> (which calls other routines)
         else
            call HONDO_method --> (which calls other routines)
         endif
         call UPDATE_FOCK_and_DENSITY (use atomic updates)
      enddo

Scheme 1.

Local variables within subroutines inside the parallel region are by definition local to each stream, unless they occur in a COMMON or DATA statement. On the MTA, COMMON blocks are assigned static addresses and are therefore by definition global data structures shared by all streams. The same is true of any variable defined in a DATA statement, since a DATA statement assigns a value to a variable at program initiation.

Since each four-index term is independent, the complex loop is highly parallel. We have fused the outer i and j loops and, for a moderate sized problem, this provides enough loop iterations to keep several thousand streams busy (a sketch of such a fused loop is given below). Load balance is not an issue, since the scheduler keeps handing out iterations as streams become available; load balance would only be a problem if the number of iterations were comparable to the number of streams being used.
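The following Fortran fragment sketches one way such a fusion of the triangular i > j loops can be written. It is illustrative only: the routine and array names are hypothetical, the directive spellings follow the notation of Scheme 1, and GAMESS itself may recover i and j arithmetically from the fused index rather than from precomputed lists.

      subroutine fused_ij(nao, ilist, jlist)
c     Illustrative sketch: fuse the outer i and j loops (i > j) into a
c     single loop over nij = nao*(nao-1)/2 pairs, so that one parallel
c     loop exposes enough iterations to keep thousands of streams busy.
      integer nao, ilist(*), jlist(*)
      integer nij, i, j, ij
c     enumerate the triangular (i,j) pairs once
      nij = 0
      do i = 2, nao
         do j = 1, i - 1
            nij = nij + 1
            ilist(nij) = i
            jlist(nij) = j
         enddo
      enddo
c$TERA assert parallel
c$TERA assert local {ij, i, j}
      do ij = 1, nij
         i = ilist(ij)
         j = jlist(ij)
c        ... inner k and l loops, integral evaluation, and atomic
c        updates of the shared FOCK and density matrices ...
      enddo
      return
      end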


There is a small potential for a hotspot within the atomic update (which entails a single stream locking a shared memory location and storing to it), but this would only occur for a very large number of streams attempting to access a small FOCK matrix. In this part of the code we are taking the integral values and summing them into the FOCK and density matrices. Since many separate four-index terms can contribute to an individual element of these matrices, if many streams attempt to LOCK and write to the same address at the same time, then a bottleneck will occur. On the MTA this atomic update is highly efficient. This, combined with the fact that the calculations involved in each iteration are extensive, involving thousands of lines of code, minimizes the probability that multiple streams will collide on the same memory location simultaneously. This would be more problematic in a small loop in which the streams remain highly synchronized.

4.2. Gradient

The RHF gradients in GAMESS are handled in a fashion similar to the energies, with a four-index loop. The gradient code performs an update of the derivative matrix in much the same fashion that the FOCK and density matrices are updated in the energy portion of the code. On the MTA, parallelization was achieved using the same method, with a simple directive and a definition of local variables. Several common blocks had to be removed in order for variables to remain local to each stream. The portions of GAMESS parallelized were the one-electron gradient integrals and the two-electron gradient integrals.

4.3. InCore method

While parallelization of the DIRECT method within GAMESS is important and enlightening, it does not fully utilize the MTA’s shared memory architecture. In fact, the DIRECT method is ideal for distributed memory architectures, providing good scalability with minimal communication. In order to maximize the MTA’s resources, we undertook the implementation of an InCore algorithm within GAMESS. The InCore method involves calculating the two-electron integrals once (using the already parallel two-electron energy routine) and storing them in memory for use by subsequent iterations of the SCF convergence cycle. InCore methods have previously been implemented within other ab initio codes, but this is the first implementation within GAMESS. The InCore method is typically communication bound on distributed memory machines because it entails moving millions of integrals from processor to processor at each iteration step. On the MTA, it is extremely simple to implement: each stream working on the two-electron energy packs the four indices and the value of the integral into two global arrays (a possible packing step is sketched at the end of this section). At each subsequent iteration of the energy calculation, GAMESS unpacks the indices and integral values, using them for the energy calculation with a new set of coefficients. This reading from memory and unpacking (done in the HSTAR subroutine) is performed in parallel using a simple directive. A schematic of this loop is shown in Scheme 2.


c$tera assert parallel
      do intcount = 1, numint
c$tera assert local variables
c        read the value of the integral
         VAL = val_mem(intcount)
c        read and unpack the i,j,k,l indices
         LABEL = in_mem(intcount)
         I = ISHFT(LABEL, -48)
         J = IAND(ISHFT(LABEL, -32), 65535)
         K = IAND(ISHFT(LABEL, -16), 65535)
         L = IAND(LABEL, 65535)
c$tera update (ensures guarded update of the Fock matrix)
         Fock_Matrix update
      enddo

Scheme 2.

This reading of integrals is highly efficient, since it involves reading successive elements of two large arrays. The read is performed using approximately 80 streams per processor without loss of efficiency. Our results for a system with 294 basis functions (using approximately 4.3 Gbytes of memory to store 250 million integrals and their indices) indicate that the InCore method is the most efficient for the MTA. It is widely known that, as the number of basis functions increases, there will be a crossover point at which DIRECT methods will outperform InCore methods, owing to the cost of the immense number of memory references. On the MTA, with its large flat memory, multithreaded access to memory, and lookahead hiding any latency, this crossover is not even close to being reached for a 300 basis function problem. Future research will attempt to determine the location of this crossover on the MTA.
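For completeness, the packing step that fills the two global arrays read in Scheme 2 might look like the fragment below. This is a sketch only: the array names follow Scheme 2, but the atomically incremented slot counter and the IOR/ISHFT packing are our illustration of one way four 16-bit indices could be combined into a 64-bit label, not a quotation of the GAMESS source.

c$tera update
      intcount = intcount + 1
c     pack four 16-bit AO indices into one 64-bit label (the inverse of
c     the ISHFT/IAND unpacking shown in Scheme 2) and store the value
      LABEL = IOR(ISHFT(I, 48),
     &        IOR(ISHFT(J, 32),
     &        IOR(ISHFT(K, 16), L)))
      in_mem(intcount)  = LABEL
      val_mem(intcount) = VAL

With this layout each stored integral occupies 16 bytes (8 for the value and 8 for the packed label), so 250 million integrals require roughly 4 Gbytes, consistent with the approximately 4.3 Gbytes quoted above.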

5. Results

Our initial efforts in the parallelization of GAMESS involved DIRECT methods, so it is interesting to compare timings of DIRECT versus InCore on the TERA platform. In general, one sees an incredible performance enhancement using InCore over the DIRECT methodology.


For example, an RHF/DH(d,p) energy plus gradient calculation on formamide, a relatively small molecule, runs 7.8 times faster using InCore than using DIRECT on one node. Larger molecules have demonstrated similar performance differences.

A larger example chosen to illustrate performance is Firefly Luciferin [43] (Fig. 1). This molecule, among others, has been used as the benchmark molecule for many of our parallelization efforts on the energy and gradient tasks. The molecule is biologically active when the carboxyl group is above the plane of the molecule [43].

Fig. 1. Firefly Luciferin structure.

The basis set is 6–31G(D) [44,45], with 294 basis functions. Storage of the integrals on disk for this molecule requires 5.9 Gbytes. The timing results on IBM-SP clusters (Power2 and Power3 chips) and on the TERA platform are shown in Table 2(a). The calculation was run in DIRECT mode on the IBM platforms, since this provides the best performance there, and in InCore mode on the TERA, which is the best method on that platform. Table 2(a) compares the SCF convergence time for the IBM-SP platforms, using MPI message passing, with that of our TERA results. The IBM Power3 SP in this table is a 12-node machine with 8 processors per node in a shared memory configuration, giving a capability of 96 processors at 255 MHz. The IBM Power2 SP is a 128-node distributed memory machine with 160 MHz processors. In the former case, we have performed timings between processors across nodes as well as between processors of a single node. In general, one sees only about a 1% improvement in speed within the processors of a node, for small numbers of processors. The increase in performance between the Power2 and the Power3 is on average 1.5 for this particular application.
Fig. 2.


Fig. 3.

Fig. 4.


Table 2(a)
Total wall clock time spent performing SCF iterations (15) for Luciferin. All times are in seconds.

               IBM Power2 SP          IBM Power3 SP          MTA TERA a             MTA TERA a
               DIRECT SCF             DIRECT SCF             InCore SCF             InCore SCF/Opt HSTAR
# Processors   Time (s)   Speedup     Time (s)   Speedup     Time (s)   Speedup     Time (s)   Speedup
 1             6290       1           4178       1           1650       1           1046       1
 2             3161       1.99        2125       1.97         964       1.71         544       1.92
 4             1594       3.95        1262       3.31         645       2.56         287       3.64
 8              815       7.71         564       7.41                                170       6.15
16              418       15.0         291       14.3
32              222       28.3         151       27.7

a The MTA/TERA at SDSC has a total of 8 nodes.

Table 2(b)
Total wall clock time spent in the HSTAR routine. All times are in seconds.

# Nodes   Optimized HSTAR routine (s)   Speedup
1         67.4                          1
2         33.9                          1.99
4         16.9                          4.00
8          8.5                          7.93

The speedup on either IBM platform is quite good, with essentially linear performance up to somewhere between 8 and 16 nodes, at which point the performance begins to degrade; the efficiency drops to about 85% at 32 nodes of the IBM Power2. The MTA presents promising results, both in terms of scalability and raw performance. The first column of MTA data represents a simple approach in which the compiler parallelizes the HSTAR routine. Performance and scalability were good, but because the vast majority of execution time was spent in the parallel loop mentioned above (Scheme 2), we rearranged the loop to minimize instructions and maximize register reuse. We also replaced combined IAND(ISHFT) calls with TERA_BIT_PACK intrinsics. Table 2(b) shows the scalability now achieved within the HSTAR routine alone, where most of the InCore time is spent. One sees near perfect scaling up through 8 nodes.

With the further improvement of the HSTAR routine, the overall results for the SCF are shown in the second MTA column of Table 2(a). The scalability over processors appears poorer than what is displayed in Table 2(b). One must remember that the MTA is running the parallel portions within this region on 80–90 streams per processor. The formation of the FOCK matrix (Scheme 2) runs with almost perfect scaling on 90–720 streams (1–8 processors). On one processor, this loop accounts for nearly 90% of the computational time and can process 2.7 million integrals per second per processor. Several small routines are only partially parallelized (SOGRAD, SOTRAN, SONEWT, and some file I/O), and this leads to some degradation in scaling on a per-processor basis. This demonstrates both the strength and the weakness of the MTA. The MTA can achieve spectacular performance when streams can be utilized to their full efficiency, but this excellent performance in parallel regions can cause small non-parallel routines to become important for larger numbers of processors. The net effect is that the programmer (or compiler) must parallelize subroutines which most platforms tend to ignore. Fortunately, this parallelization is very simple on the MTA, typically involving a few directives to aid the compiler in finding parallel work. We are now in the process of parallelizing these other routines.
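Writing the parallel efficiency as the measured speedup divided by the processor count, the entries of Tables 2(a) and 2(b) quantify the cost of these remaining serial sections (our own arithmetic, using the tabulated values):

\[
E(p) = \frac{S(p)}{p}, \qquad E_{\mathrm{HSTAR}}(8) = \frac{7.93}{8} \approx 0.99, \qquad E_{\mathrm{InCore\;SCF}}(8) = \frac{6.15}{8} \approx 0.77 .
\]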


For illustration purposes, Traceview and scalability results have been obtained for a smaller basis set calculation on the Luciferin example (186 basis functions). Fig. 2 is a Traceview (TERA performance tool) image for the InCore portion of an SCF calculation. The red (light gray in b/w) line near 0.9 shows the available resources and therefore the maximum performance achievable for this run. This availability of 0.9 is typical and is due to contention with other jobs running simultaneously on the machine. The blue (black in b/w) lines represent the instructions issued by the GAMESS job. In each iteration, the bulk of the time is spent in HSTAR (blue horizontal lines at approximately 0.85), reading integrals and forming the FOCK matrix. The vertical blue (black in b/w) lines at each iteration are the lower efficiency portions of the calculation; this less parallel region is expanded in Fig. 3 and more extensively in Fig. 4. There we see several distinct regions which correspond to the SOGRAD, SONEWT, and SOTRAN routines. While somewhat parallel, the overall efficiency of this portion of the calculation is low enough to hinder scalability for large numbers of streams. These routines are currently the object of intensive optimization work.

The more efficient InCore method is shown to have excellent scaling and compares well with the best methodologies on other parallel platforms. The overall speedup on the MTA using the InCore method is 6.2 going from 1 to 8 nodes, the degradation mostly stemming from several small routines yet to be parallelized. This additional parallelization is ongoing, and we expect near linear performance when it is finished. The performance on the latest IBM platform also scales very well; however, even though the DIRECT methodology scales well, it performs many extra calculations during the SCF iterations. Thus, an additional advantage seen in the TERA performance is the ability to do the overall calculation much faster.

6. Present standing and future pursuits

The combination of multithreaded processors and a uniform shared memory affords a very attractive parallel programming model for GAMESS on the MTA. Multithreading is the only type of parallelism that need be considered, regardless of the number of processors. Thus an application like GAMESS can be tuned to multithread well on a single processor or multiple processors without fundamental additional algorithmic modifications. The MTA is a scalable system, with this particular design ultimately intended to be built with up to 256 processors.


This scalability, combined with the very favorable per-processor performance we have found for this code (as well as for other applications), holds much promise for this type of architecture in the coming decade. Given the results shown in this paper, we see that excellent results are achievable on even a small number of processors when using the simple shared-memory programming model. Additionally, it is clear how to improve the scalability of this code by working on well defined subroutines. This simplicity of the programming model, combined with the promise of outstanding performance on much larger MTA systems than those used in this paper, gives reason to hope that we may soon see problems being solved in computational chemistry that are 1–2 orders of magnitude more computationally intensive than those being performed on the largest distributed memory machines in use today.

While several computationally intensive portions of the GAMESS package have been addressed here, much work remains to be done. Further optimization of existing algorithms and general tuning are needed to enhance the performance of the code on the TERA platform. Future work on the MP2 and CI energies and gradients, as well as on numerical and analytic Hessians, is important in order to improve the parallel functionality of GAMESS on the TERA MTA. In addition, further work expanding the functionality of the InCore algorithm and testing its limitations is vital for characterizing the potential of the TERA for such applications.

Acknowledgements

This work is supported by the US National Science Foundation (ASC-9613855), the San Diego Supercomputer Center, and the TERA Computer Company. During the course of the review process, TERA Computers acquired the CRAY business of SGI Inc. The official name of the combined company is CRAY Inc., and the TERA MTA is now called the CRAY MTA. The authors would also like to acknowledge Gary Montry for the preliminary investigations done to prepare for the port of GAMESS to the TERA, and Giri Chukkpalli for his advice and aid in obtaining the IBM Power3 results presented in this work.


References

[1] M. Dupuis, J.D. Watts, Theoret. Chim. Acta 7 (1987) 91.
[2] M.F. Guest, E. Apra, D.E. Bernholdt, H.A. Fruchtl, R.J. Harrison, R.A. Kendall, R.A. Kutteh, J.B. Nicholas, J.A. Nichols, M.S. Stave, A.T. Wong (1994).
[3] R. Ernenwein, M.M. Rohmer, M. Benard, Comput. Phys. Commun. 58 (1990) 305.
[4] M.M. Rohmer, J. Demuynck, M. Benard, R. Wiest, C. Bachmann, C. Henriet, R. Ernenwein, Comput. Phys. Commun. 60 (1990) 127.
[5] R.J. Harrison, J. Chem. Phys. 94 (1991) 5021.
[6] M.D. Cooper, L.H. Hillier, J. Comput.-Aided Mol. Design 5 (1991) 171.
[7] S. Kindermann, E.O. Michel, J. Comput. Chem. 13 (1992) 414.
[8] H.P. Luthi, J.E. Mertz, Supercomput. Rev. (1992).
[9] M. Feyereisen, R.A. Kendall, Theoret. Chim. Acta 84 (1993) 289.
[10] J.D. Watts, M. Dupuis, J. Comput. Chem. 9 (1988) 158.
[11] R.J. Harrison, E. Stahlberg, CSCC Update (1992).
[12] R.J. Harrison, R.A. Kendall, Theoret. Chim. Acta 79 (1991) 337.
[13] A.P. Rendell, T.J. Lee, R. Lindh, Chem. Phys. Lett. 194 (1992) 84–94.
[14] M. Schueler, T. Kovar, H. Lischka, R. Shepard, R.J. Harrison, Theoret. Chim. Acta 84 (1993) 489.
[15] R.J. Harrison, R. Shepard, Ann. Rev. Phys. Chem. 45 (1994) 623–658.
[16] K.K. Baldridge, J.S. Siegel, Theoret. Chem. Acc. 97 (1997) 67–71.
[17] I.T. Foster, J.L. Tilson, A.F. Wagner, R.L. Shepard, J. Comput. Chem. 17 (1996) 109–123.
[18] R.J. Harrison, M.F. Guest, R.A. Kendall, D.E. Bernholdt, J. Comput. Chem. 17 (1996) 124–132.
[19] H. Lischka, H. Dachsel, R. Shepard, R.J. Harrison, Future Gen. Comput. Systems 11 (1995) 445–450.
[20] K.K. Baldridge, J.H. Jensen, N. Matsunaga, M.W. Schmidt, M.S. Gordon, T.L. Windus, J.A. Boatz, T. Cundari, Applications of Parallel GAMESS, in: T. Mattson (Ed.), Vol. 592 (Amer. Chem. Soc., 1995) pp. 29–46.
[21] K.K. Baldridge, JNNIE: The Joint NSF-NASA Initiative on Evaluation B-51 (1994).
[22] R.A. Whiteside, J.S. Binkley, M.E. Colvin, H.F. Schaefer III, J. Chem. Phys. 86 (1987) 2185.
[23] R. Wiest, J. Demuynck, M. Benard, M.M. Rohmer, R. Ernenwein, Comput. Phys. Commun. 62 (1991) 107.
[24] L.A. Covick, K.M. Sando, J. Comput. Chem. 11 (1990) 1151.
[25] M. VonArnim, R. Ahlrichs, J. Comput. Chem. 19 (1998) 1746–1757.
[26] Y.S. Li, M.C. Wrinn, J.M. Newsam, M.P. Sears, J. Comput. Chem. 16 (1995) 226–234.
[27] M.D. Cooper, N.A. Burton, R.J. Hall, I.H. Hillier, J. Mol. Struct. 121 (1994) 97–107.
[28] Y. Wu, S.A. Cuccaro, P.G. Hipes, A. Kuppermann, Chem. Phys. Lett. 168 (1990) 429–440.
[29] Y. Wu, S.A. Cuccaro, P.G. Hipes, A. Kuppermann, Theoret. Chim. Acta 79 (1991) 225–239.
[30] D.K. Hoffman, O.A. Sharafeddin, D.J. Kouri, M. Carter, Theoret. Chim. Acta 79 (1991) 297–311.
[31] T.G. Mattson, Parallel Computing in Computational Chemistry, Vol. 592 (Amer. Chem. Soc., Washington, DC, 1995).
[32] M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, S.T. Elbert, J. Comput. Chem. 14 (1993) 1347.
[33] H. Dachsel, H. Lischka, R. Shepard, J. Nieplocha, J. Comput. Chem. 18 (1997) 430–448.
[34] G.D. Fletcher, A.P. Rendell, P. Sherwood, Mol. Phys. 91 (1997) 431–438.
[35] T.L. Windus, M.W. Schmidt, M.S. Gordon, Chem. Phys. Lett. 216 (1993) 375.
[36] T.L. Windus, M.W. Schmidt, M.S. Gordon, Chem. Phys. Lett. 216 (1994) 375.
[37] T.L. Windus, M.W. Schmidt, M.S. Gordon, Parallel Implementation of the Electronic Structure Code GAMESS, in: T. Mattson (Ed.), Vol. 592 (Amer. Chem. Soc., 1995) pp. 16–28.
[38] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B. Smith, The Tera Computer System (Amsterdam, 1990).
[39] G. Alverson, P. Briggs, S. Coatney, S. Kahan, P. Korry, Tera Hardware-Software Cooperation (San Jose, CA, 1997).
[40] J. Almlöf, K. Faegri Jr., K. Korsell, J. Comput. Chem. 3 (1982) 385.
[41] See contribution in this special CPC issue.
[42] J. Ochterski, C.P. Sosa, J. Carpenter, Cray User Group Proceedings, B. Winget, K. Winget (Eds.) (Shepherdstown, WV, 1996) p. 108.
[43] E.H. White, F. Capra, W.D. McElroy, J. Amer. Chem. Soc. 83 (1961) 2402–2403.
[44] R. Ditchfield, W.J. Hehre, J.A. Pople, J. Chem. Phys. 54 (1971) 724–728.
[45] P.C. Hariharan, J.A. Pople, Theoret. Chim. Acta 28 (1973) 213.