Numerical computation of molecular integrals via optimized (vectorized) FORTRAN code

Numerical computation of molecular integrals via optimized (vectorized) FORTRAN code

Nuclear Instruments and Methods in Physics Research A 389 (1997) 117 120 ELSEWIER NUCLEAR INSTRUMENTS 8 METHooS IN PHYSICS RESEARCH SectIonA Numeri...

408KB Sizes 1 Downloads 48 Views

Nuclear Instruments and Methods in Physics Research A 389 (1997) 117 120

ELSEWIER

NUCLEAR INSTRUMENTS 8 METHooS IN PHYSICS RESEARCH SectIonA

Numerical computation of molecular integrals via optimized (vectorized) FORTRAN code T.C. Scott”**, M.B. Monagan”,

I.P. Grantb, V.R. Saundersd

Abstract The calculation of molecular properties based on quantum mechanics is an area of fundamental research whose horizons have always been determined by the power of state-of-the-art computers. A computational bottleneck is the numerical calculation of the required molecular integrals to sufficient precision. Herein, we present a method for the rapid numerical evaluation of molecular integrals using optimized FORTRAN code generated by Maple. The method is based on the exploitation of common intermediates and the optimization can be adjusted to both serial and vectorized computations.

1. Introduction The motivation behind this work started from a project whose goal was to find efficient means of solving the Dirac equation, i.e. the relativistic version of the Schrijdinger equation, for molecular systems [l]. The resulting iterative program required the fast and efficient computation of molecular integrals at each iterative cycle. In this work, we will focus on a particular tool developed for this project, namely the use of Maple to generate an optimized Fortran code. In the past, the interface between computer algebra systems such as Maple. REDUCE. Macsyma, and Mathematics, and numerical languages such as Fortran and C has been shown to be very useful in various applications. The GENTRAN package [2] has been developed as a code generation facility for applications in REDUCE and Macsyma. The MacroC and Macrofort [3] packages have been developed for applications in Maple. What is novel about the work shown here is that, with a few modifications, this interface can be used to generate optimized code for rwtorixd machines. Clearly, such a tool is very handy since ambitious calculations often make use of vectorization to make actual computations feasible.

* Corresponding

author

Program and algorithmic optimization is an area of intensive research. Good compilers can perform sophisticated optimizations. However, they use algorithms with time complexity O(n’) and 0(n3) where II is the size of the input expressions. For very large codes, these algorithms may fall flat on their faces. Yet very large problems are the very ones which optimization is most needed as optimization by hand is the most difficult. The Maple system provides a scalar optimizer which is limited to common subexpression optimization but is very fast: it uses built-in hashing of expressions resulting in an overall time complexity of only O(n). Additional benefits of the Maple optimizer result from the exploitation of internal symmetries. If the given mathematical problem can be decomposed into fundamental “buildingblocks”, the resulting computational savings can result in enormous savings. Additionally, Maple provides tools to manipulate an expression into different forms. e.g. Horner form, factored form, etc. This allows us to “tailor” code at will. Due to space restrictions, it is not possible to discuss all the major ingredients of this project [ 1.41.Herein, we focus on the particular objective of efficient computation of a quantity ubiquitous to molecular integral calculation for Gaussian-type basis functions, namely the R-integrals. However, this illustrates a number of points as stated in the conclusions.

0168-9002/Y7’$17.00 Copyright (:‘ 1997 Elsevier Science B.V. All rights reserved P/I SOI 68.900’(97)00059-4

Ilf. AUTOMATIC CALCULATION

118

T.C. Scott et al. /Nucl. Instr. and Meth. in Phys. Res. A 389 (1997) II 7-120

2. R-integrals The R integrals

are given by the recurrence

relations

c51 R~[~+~~~~v]=?~R~+~[T~~~V]+~R~+~[~-~~~~~]~ Rj[T./J+ 1,111 = _VRj+ ICT> I*,VI+ @j+ ICT,P - 1,$‘I> Rj[T.~~,v+l]=~Rj+~[~,~,V]+vRj+,[s,~.v-l]~ (1)

Rj CO, 0,0]

= ( -2p)'Fj(aQp2)

where

1 exp[-Ws2].~2md.~,

F,(W) =

(2)

0 where the vector QP = -(X, y, 2) and tl are “weighted” averaged geometrical quantities, and F is the Boys’ function for which there already exist various algorithms for its rapid computation [6]. The summation indices are bounded by zero and T + p + v I L and the total number of R functions is given by Nmaxp = (L + l)(L + 2) (L + 3)/6 where L is the total angular quantum number. There is no unique path through the coupled threedimensional recurrence relations for the R’s, and much work has been done to find the “best” way of calculating them, i.e. finding the “optimal path” which leads to the minimum number of FLOPS for a given L. It is essential to find this optimum: any slight deviation from it leads to CPU times radically larger than the minimum time by orders of magnitude as L increases. Combinatorial analysis has been used to find the optimal path [7] but the combinatorial complexity of the problem grows extremely fast, and this is only a feasible approach for low angular quantum numbers at the time this project was undertaken. Nonetheless, this earlier work provided us with the true mathematical optimum and benchmark on which to make comparisons as realized by Saunders’ Fortran module RCALS. This program was limited to L = 6, the highest attainable value when the program was made. Beyond that. only a “semi-optimal” result could be obtained using a routine called ZCALS, the latter being impossible to vectorize. However, since high-quality quantum chemistry type calculations require a range 0 I L 5 16, the L 5 6 limitation severely restricted the vectorized application of this approach to relatively low quantum numbers. The alternative here breaks this limitation while being much simpler: we implement the recurrence relations of Eq. (1) in Maple [4], obtain explicit expressions for the R-integrals and convert them into Fortran modules. Using a Maple program, we have generated explicit Fortran code for L I 16 using only moderate computer time and memory. Note that for Fortran code, it is preferable to map the indices (t. u, u) unto an effective index through a bijection.

We now illustrate the effect of Maple’s optimizer with the following specific example. In the example, Maple’s cost procedure produces statistics about how many floating operations (FLOPS) are involved at any stage. Ignoring vectorization at this stage, we examine the resulting Fortran code obtained for case L = 8. Due to space restrictions, only the Fortran code for R(O,O, 8) (or R(157) according to our bijection) is shown for the un-optimized case. For simplicity, we restrict ourselves to the diatomic case where QPZ is the Maple variable for the z-component of the vector QP. The variable FS[t] is the Boys’ F-function of order t, which is previously tabulated, and ALPHA is x as defined previously. The non-zero R’s for a given L are collected within an expression sequence ds which we analyse:

166

additio”‘

+ I624

m,,tiplic.tio”.

+ 266

‘“b.cript‘.

66

..*iS”‘.“”

The above requires a total cost of 1680 FLOPS. Repeating this same exercise with the optimized option, Maple found 56 “intermediates”, i.e. temporary variables denoted tl, t2, , t56):

> r~sdlib(optini.*):

> fortr.n~cd*l.optiniied~; tl - ALPHA**2 t2

-

t1r.2

t3 t4

-

FS(6) t2rt3

S(1)

.

Lb(LO.DO*L’I

R(6)

.

240.w*t4

R(131)

-

101)

R(I32) RC‘33)

-

1(32) SC331

The total cost is now reduced to 213 FLOPS, roughly a factor of 8 less than the un-optimized result. Maple found that certain entries of R being identical require only re-assignments without further computations.

T.C. Scott et al. JNucl. Instr. and Meth. in Ph?s. Res. A 389 (1997) II 7-1X

II’)

Table I Performance table for R-code L

Time (lo-‘s) No optimization from Fortran compiler

4

5 6 7 x

Full optimization from Fortran compiler

I

2

3

4

0.0674 0.1093 0.1786 I.2551 I .x739

0.1957 0.5739 I .7031 4.3676 IO.01669

0.1350 0.2468 0.4704 0.8804 1.4367

0.1036 0.1942 0.3181 0.7056 1.1204

Y

IO

I

7

3

3

(1.03I2 0.0478 0.236 0.3 I Y 0.445 0.568

0.0605 0.107 0.255 0.549

0.0741 0.0X78 0.196 0.454 0.852

0.03 I2 0.0576 0.131 0.22x 0.475 0.793

I .2X,8

Settmgs: ‘x= 1, QP =(l;lO, 1.2). 1 = RCALS (ZCALS) (V. Saunders). 2 = Maple Generated Code from Eq. (I). ? = (2) + Maple’s optimizer. 4 = (3) but with Eq. (3) instead.

Further investigations show that the reductions in the number of FLOPS is roughly proportional to L. This is understandable: the greater the number of terms with common intermediates, the greater the number of objects which can be exploited by the optimizer and thus, the greater the computational reductions. Table 1 compares the performance times for the R’s by RCALS (ZCALS for L > 6) and by the method shown here for particular settings for r and QP on a DEC-alpha. This illustrates the effect of Maple’s optimizer and the Fortran optimizer, separately and together. The DEC-alpha optimizer by itself offered no advantage ovrr Maple’s optimizer and for high L. the effects of both optimizers are harmoniously additive. The optimized Fortran code generated with Maple required a total of 6486 lines of Fortran code for the range L = 0. _., I?. Without the optimized option, this same code was in excess of 30000, i.e. much too large to be compiled as a single module on the Dee-Alpha (this is why the un-optimized results shown do not exceed L = 8 in the table). Note that the growth of the amount of Fortran code as a function of L is considerably also of reduced complexity with Maple’s optimizer. The CPU times from the code generated by Maple compares favorably when compared to RCALS and in the region L 2 7. a bit better than ZCALS, the semi-optimal result. These results were obtained with relatively little effort. Maple avoids the tedious combinatorial analysis and can produce results up to L = 16 in much the same way as previously available results up to L = 6. A significantly improved version of this code can be obtained by a modification of the recurrence relations

for the R’s in Eq. (1) according Sj[?+

l,p,\‘]

=

S,+,[t./L,\‘]

i

to

+TSj+l[T-

1,/L,\‘]

when

X2Sj+ 1[r, ,u,v] + tS,.+1[s - 1, p, 111 when

r is even, T

is odd, (3)

and similarly for the J and 2 directions. The R’s are obtained from the S functions according to x

=



1

i 7

if i is even, (4)

if i is odd,

and similarly

for Y, and Zi and

sj[r> p> ~‘1= RjCr, K ~~]/‘(XrYPZ,).

(5)

The results of the modified code are also shown in the table. The running times are now much closer to the true optimal results of RCALS in the regime L = 4, $6, the minor differences resulting largely from quirks of Maple’s optimizer and those of the compiler. Overall, this modified code yields the best results which are closest to the true optimum. Furthermore, the resulting modified code is even more compact than before, only 4090 lines of Fortran code for the region L = 0, ,12. An obvious benefit of directly “tailoring” code for a simpler case instead of the general case is that computational redundancy is avoided. For example, the Fortran module generated for the diatomic case is roughly three titnrsjirstrr than RCALS since the latter was hand-coded for the general multi-center case.

Hf. AUTOMATIC CALCULATION

T.C. Scott et al. /Nucl.

120

I&r.

and Meth. in Phys. Res. A 389 (1997) II 7-120

2. Vectorization

Vectorization proceeds by introducing a vectorization index, denoted M, which applies to all geometrical quantities. The Fortran code obtained from the compiler is then “sandwiched” within DO-loops that are readily generated using Maple’s formatted print statements. The explicit code generated by Maple can be uectorized all the way up to I. = 16, thus way beyond the previous barrier at L = 6. Vectorized R code (diatomic

17

RC(11.967)

-

RC(R,968)

-

Z,DO*tS*tl*t242

8xX11.969)

-

FS(II,l)

case)

I.DO*tB.t2*tl94*LC(I.936)

COlTIlUE ElDIF

Although the CRAY Y-MP is now vastly outstripped by other machines, it is important to note that the FLOP rate was about 85% of its peak efficiency. The CRAY compiler was “smart” enough to recognize that the intermediates to, t 1, etc., created by Maple’s optimizer were vectorized quantities and consequently, there was no need to index them. Beyond L 2 24, the numerical coefficients become very large, and there is a danger that the entire scheme becomes numerically unstable. However, a limit of L = 24 is more than enough for most practical calculations.

3. Conclusions

which performs relativistic HartreeeFock calculations [l]. The use of Maple accelerated the development of this work: within six months this program reached the stage of actually performing true molecular calculations. To answer the question raised by seasoned programmers in Fortran: “there are optimizers already provided by Fortrdn compilers, so why bother with Maple”? There are several reasons: (i) Maple’s optimizer is more efficient than that of Fortran compilers for large expressions. (ii) When a large number of common subexpressions are present, the amount of code is severely reduced and consequently “digestible” to the Fortran compiler. The optimization of the Fortran compiler usually yields added improvement. (iii) The symbolic manipulations tools like expand, collect,, factor, etc. are not available in Fortran. These allow us to manipulate an expression into an optimal form before their conversion in Fortran code. (iv) By “exploration” of the various mathematical quantities with Maple, one can isolate various symmetries and properties which could then be exploited to yield huge computational reductions. Finally, the tools shown are general. Any ambitious problem requiring intensive numerical calculations and which can be mathematically broken down into smaller structures is a potentially successful candidate for these methods.

References Cl1 H.M. Quiney, H. Skaane, 1.P. Grant and T.C. Scott, Development of multicentre Gaussian integral algorithms using symbolic manipulation techniques, 1996, submitted. PI B.L. Gates, A numerical code generation facility for REDUCE, Proc. SYMSAC’86 (ACM Press, New York, 1986).

We exploited the existence of common intermediates within molecular integrals to generate the R-integrals in terms of Maple-generated optimized Fortran code. The results apply to both serial or vectorized computation. We used Maple to simplify the algebraic complexity of the problem, and generate, in a rather unorthodox and novel way, verifiable Fortran code. Maple gives us the means to tackle a difficult problem whose complexity is too great to warrant a preliminary investigation (by hand), and provided us with a platform by which we can derive a large number of subsidiary results. These have allowed the design of a portable molecular program

c-11P. Capolsini and C. Gomez. MacroC and Macrofort:

C and Fortran Code generation within Maple, MapleTech 3 (Birkhauser, 1996) 14. M T.C. Scott, M.B. Monagan. I.P. Grant and V.R. Saunders, Generation of optimized vectorized Fortran code for molecular integrals of Gaussian-type functions, MapleTech (Birkhauser. 1997). to appear. Molecular CSI V.R. Saunders, in: Methods in Computational Physics, eds. G.H.F. Diercksen and S. Wilson (Riedel, Dordrecht, Holland, 1983) 1. C61 F.E. Harris, Int. J. Quantum Chem. 23 (1983) 1469. [71 B.G. Johnson, P.M.W. Gill and J.A. Pople, Int. J. Quantum Chem. 40 (1991) 809.