Computer Physics Communications 72 (1992) 14-28
North-Holland
Large-scale ab initio total energy calculations on parallel computers

L.J. Clarke
Edinburgh Parallel Computing Centre, University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, UK

I. Štich ¹ and M.C. Payne
Cavendish Laboratory (TCM), University of Cambridge, Madingley Road, Cambridge CB3 0HE, UK

Received 30 March 1992

Correspondence to: M.C. Payne, Cavendish Laboratory (TCM), University of Cambridge, Madingley Road, Cambridge CB3 0HE, UK.
¹ Also at: Institute of Inorganic Chemistry of the Slovak Academy of Sciences, Bratislava, Czechoslovakia.
An implementation of a set of total energy plane-wave pseudopotential codes on a parallel computer is described which allows calculations to be performed for systems containing many hundreds of atoms in the unit cell. Possible parallelisation strategies are discussed and it is shown that assigning parts of real and Fourier space across the processors is the least restricted approach. The performance of our parallel codes is demonstrated by timing tests carried out on several i860-based parallel machines and these are compared with tests performed on conventional sequential supercomputers. Ab initio computations on systems which are beyond the power of conventional supercomputers as well as new perspectives for first-principles molecular dynamics are discussed.
1. Introduction
Computer modelling is an important tool in understanding the properties of condensed matter at a microscopic level (see e.g. ref. [1]). Simulation studies are increasingly applied to predict properties of real materials. This application makes high demands on the quality of the interatomic potential to be used in conjunction with simulation techniques such as energy minimisation (EM), molecular dynamics (MD), and Monte Carlo (MC). The great majority of calculations are based on the Born-Oppenheimer (BO) separation of electronic and ionic degrees of freedom (see e.g. ref. [2]), i.e. the ionic degrees of freedom move on a potential energy surface Φ[{R_I}], I = 1, 2, ..., N_a, which represents the ground-state solution with respect to the electronic degrees of freedom for the ionic configuration {R_I}. However, even so, to find the many-body potential Φ[{R_I}] for each ionic configuration {R_I} encountered in EM, MD or MC is a highly non-trivial and computationally demanding task. Empirical modelling of Φ[{R_I}] may be successful in certain cases, examples being the Lennard-Jones potential for closed-shell systems [3, p. 398] or the shell model for ionic materials (see e.g. ref. [3]). Generally, however, Φ[{R_I}] has to be constructed quantum-mechanically employing ab initio methods. Recent studies have shown that classical empirical models for Φ[{R_I}] may yield results which are significantly different from those obtained by an ab initio approach [4]. A fully ab initio approach to EM, MD or MC for system sizes of practical interest is generally beyond the power of conventional supercomputers, thus making the use of
parallel computers inevitable. In this paper we concentrate on this aspect of large-scale ab initio calculations and analyse it in detail.

The plane-wave pseudopotential method within density functional theory (DFT) [5] has proven particularly useful in the context of ab initio EM and MD studies of condensed matter, using the approach pioneered by Car and Parrinello [6]. Within DFT the many-body potential Φ[{R_I}] is given by the minimum with respect to the electronic single-particle orbitals {ψ_i} of the energy functional [5]

   E[\{\psi_i\},\{R_I\}] = \sum_i f_i \int d\mathbf{r}\, \psi_i^*(\mathbf{r}) \left(-\tfrac{1}{2}\nabla^2\right) \psi_i(\mathbf{r}) + \int d\mathbf{r}\, V_{\rm ion}(\mathbf{r})\, \rho_e(\mathbf{r}) + \tfrac{1}{2} \int d\mathbf{r}\, d\mathbf{r}'\, \frac{\rho_e(\mathbf{r})\, \rho_e(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|} + E_{xc}[\rho_e] + \tfrac{1}{2} \sum_{I \neq J} \frac{Z_I Z_J}{|\mathbf{R}_I - \mathbf{R}_J|},    (1)

where ρ_e(r) denotes the valence electron charge density

   \rho_e(\mathbf{r}) = \sum_i f_i\, |\psi_i(\mathbf{r})|^2,    (2)

{f_i} are the single-particle occupation numbers, V_ion(r) is the total ionic (generally non-local) pseudopotential [7] acting on the valence electrons, {Z_I} are the ion-core charges, and E_xc[ρ_e] is the exchange-correlation energy functional [5]. We adopt here the local-density approximation (LDA) [5] for E_xc[ρ_e]. The state sums in eqs. (1) and (2) extend over all the occupied electronic states and over all the k-points used in the Brillouin-zone integrals. The single-particle orbitals {ψ_i} are subject to the orthonormality constraints

   \int d\mathbf{r}\, \psi_i^*(\mathbf{r})\, \psi_j(\mathbf{r}) = \delta_{ij}.    (3)

The minimisation of E is equivalent to solving the Kohn-Sham (K-S) equations [8], but the minimisation can be performed without explicitly solving the K-S equations. All the methods suitable for large-scale computations [9-11] have several common features:
(i) minimisation of the DFT functional by an iterative improvement of {ψ_i};
(ii) use of Fast Fourier Transforms (FFT). The kinetic energy term in eq. (1) is diagonal in Fourier space whereas the other terms are diagonal in real space. Thus each term is computed in the "easy space".
Such an approach has two characteristic scalings. For small systems the operations associated with the use of the FFT techniques, scaling as O(16 N_b M log₂ M) #1 (N_b being the number of occupied K-S orbitals and M the basis set size; typically N_b ≪ M for a plane-wave representation), will dominate, whereas large-scale calculations are dominated by the computation of the orthonormality constraints of eq. (3), which require O(N_b² M) operations. It should be noted that a number of authors have investigated methods that scale linearly with the system size [12]. However, at present these methods do not yield results of comparable accuracy to conventional ab initio methods.

#1 The number of FFT grid points is equal to 16 × M, M being the number of plane waves used to expand the wave functions.

Several algorithms for the minimisation of E for large systems have been developed over the last five years [9-11]. We adopt here the conjugate gradient (CG) method [11], which is a robust and efficient method of minimising the density functional E. This method uses CG techniques to iteratively update {ψ_i}. At the minimum of E with respect to {ψ_i} one can compute the ionic many-body potential Φ[{R_I}], the many-body forces {F_I} exerted on the ions {R_I},

   \mathbf{F}_I = -\nabla_{\mathbf{R}_I} \Phi[\{\mathbf{R}_I\}],    (4)

stresses, etc. Until recently such an approach was limited to of the order of N_a ≈ 100 atoms in the unit cell and to observation times in MD simulations of the order of τ ≈ 10 ps using conventional supercomputers such as the CRAY or the IBM-3090 (see e.g. refs. [4,9]). These are
extremely restrictive and for many systems one needs to go beyond these limitations. Ideally one would like to carry out simulations for systems with many hundreds of atoms in the unit cell (N_a ≈ 500-1000), i.e. to approach the system sizes normally used in simulations operating with classical empirical potentials. This objective can be achieved in a very cost-effective way on parallel computers. We describe here how the problem of the ab initio computation of Φ[{R_I}] required in EM and MD simulations can be mapped onto a network of processors. In particular, we shall be concerned with parallelising the plane-wave pseudopotential calculations for a 64-node Meiko i860 Computing Surface #2, a supercomputer with a theoretical performance of 5 Gflops which has 1 Gbyte of distributed memory. However, the strategies described here can be applied to an arbitrarily large number of computing nodes and thus represent an important step towards the use of the Teraflop technologies which will become available in the near future.

#2 The codes described here (CETEP - Cambridge-Edinburgh Total Energy Package) have been developed for a 64-node Meiko i860 Computing Surface located at the University of Edinburgh.

The rest of the paper is organised as follows. In section 2 we analyse possible parallelisation strategies for plane-wave pseudopotential calculations; examples of the modifications required to parallelise key parts of our codes are shown explicitly. Performance tests and comparisons with conventional sequential supercomputers are presented in section 3, where perspectives for first-principles MD on parallel computers are also discussed. Finally, our conclusions are presented in section 4.
2. Parallel implementation of total energy pseudopotential calculations

In this section we discuss the parallel implementation of total energy pseudopotential calculations and describe the approach we have taken in our program. We begin by demonstrating the complexity of the serial computations performed in the simulation. Using the plane-wave basis representation we have

   \psi_i(\mathbf{r}) \equiv \psi_{n,\mathbf{k}}(\mathbf{r}) = \frac{1}{\sqrt{\Omega}} \sum_{\mathbf{G}} c_i^{\mathbf{k}+\mathbf{G}}\, e^{i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}},    (5)

where n is the band index, k labels the k-points used to sample the Brillouin zone (BZ), Ω is the unit cell volume, and {G} are the reciprocal lattice vectors. The wave functions are represented by a complex array

   C(M, N_b, N_k),    (6)

where M is the number of plane waves used to expand the wave functions, N_b is the number of electronic bands, and N_k is the number of k-points. These quantities typically scale with the number of atoms N_a in the following way:

   M → (100-1000) × N_a,
   N_b → (1-10) × N_a,
   N_k → 1/N_a.

The time complexity of direct diagonalisation of the K-S Hamiltonian is O(M³) whereas that of iterative improvement is O(N_b² M), which indicates that direct diagonalisation is at least 10⁴ times more expensive (for M ≈ 10⁵ and N_b ≈ 10³ the ratio M³/N_b²M = (M/N_b)² ≈ 10⁴). These considerations make it clear that the use of iterative methods is essential for large-scale calculations.

Observing the structure of the wave function representation (eq. (6)) it is easily seen that one can consider a number of data-driven approaches to parallelising the calculations. We shall now discuss four such approaches, paying particular attention to the computations of the Fourier transform and the wave function orthonormalisation.

(a) Divide by k-point. Computations performed at each of the N_k k-points are relatively independent. We could exploit this fact by utilising a machine containing N_k processing nodes and assigning a different k-point to each node. The computations relevant to each k-point, including the FFT and orthonormalisation, are performed concurrently. This approach would require communication between nodes in order to sum the charge density contributions computed by each node, and in computing the total energies, total ionic forces, etc. However, the scaling properties described above indicate that in large systems N_k is very small #3. Hence we suggest that this approach might be applicable to small shared memory systems such as appear in multi-processor workstations and multi-headed vector machines, but cannot yield truly significant benefits from the use of parallel computing.

#3 We estimate that convergence with respect to the BZ sampling can be achieved with a single k-point, even for metals, for unit cells as large as N_a ≈ 100 atoms.

(b) Divide by band. The computations performed with respect to different electronic bands (at the same k-point) are largely independent and can therefore be performed simultaneously on different nodes. We could attempt to exploit this by using a machine containing N_b nodes and assigning a single electronic band to each node (or, on a smaller machine, by assigning more than one band to each node). Note that in this case we would perform the minimisation of the K-S Hamiltonian with respect to every electronic band simultaneously, and we would not be using the band-by-band conjugate gradient method [11] employed in our calculations. This would require communication of charge densities and wave functions during the orthonormalisation calculation in addition to the synchronisations identified above. One appealing aspect of this approach is that each FFT computation can be performed serially, with up to N_b such computations occurring concurrently, which would permit the ready exploitation of special purpose hardware at each node. The problem of orthogonalising each electronic band to every other electronic band can be solved in parallel using the Gram-Schmidt method. We can orthogonalise band i to every band j < i as follows, presented in a disjoint memory framework (the k-point subscript is omitted for clarity). (1) Node i sends the wave function for band i to all nodes j < i. (2) Each node j < i computes the overlap of its own band with band i. (3) Each node j < i forms a vector V equal to minus this overlap times its own band; the corresponding vector on node i is identified with ψ_i, i.e. V = ψ_i. (4) Every node j ≠ i participates in a global summation of V. (5) Node i copies V into ψ_i. This scheme can be optimised by avoiding the use of V in node i, making (5) redundant, and arranging for the result of (4) to be delivered to node i alone. The second and third steps each take a time O(M), whereas the first and fourth steps can each be implemented to take a time O(M) with a secondary O(log₂ N_b) contribution. In order to orthogonalise all bands to one another, the procedure is repeated for each band in turn, in order of decreasing band index. Thus the asymptotic time complexity for the all-bands orthonormalisation is O(N_b M), using O(N_b) nodes. The major drawback of this approach, at least for our purposes, is the total memory requirement of the parallel program, since the FFT mesh must be replicated across all nodes, leading to an aggregate space requirement of O(16 M N_b).
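The band-per-node scheme just described maps naturally onto broadcast and reduction operations. The following fragment is a minimal illustration only, not the authors' code: it uses MPI (which postdates this work) in place of the machine-specific primitives, assumes one band of length m per node with the node number equal to the band index, uses placeholder data, and assumes that the lower-numbered bands are already mutually orthonormal.

      program band_gs
c     Sketch of approach (b): one band per node; orthogonalise the band
c     held on node "iband" against all bands on lower-numbered nodes.
c     MPI is used here purely for illustration.
      include 'mpif.h'
      integer m
      parameter (m = 1024)
      complex*16 psi(m), contrib(m), result(m), s
      integer rank, nprocs, iband, i, ierr
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
c     fill psi with this node's band (placeholder data only)
      do i = 1, m
         psi(i) = dcmplx(dble(rank + i), 0.d0)
      end do
c     orthogonalise the highest band against all lower ones
      iband = nprocs - 1
c     step (1): node iband broadcasts its band to all nodes
      if (rank .eq. iband) then
         do i = 1, m
            contrib(i) = psi(i)
         end do
      end if
      call MPI_Bcast(contrib, m, MPI_DOUBLE_COMPLEX, iband,
     &               MPI_COMM_WORLD, ierr)
c     steps (2)-(3): node j < iband forms -<psi_j|psi_iband> psi_j;
c     node iband keeps the band itself; higher nodes contribute zero
      if (rank .lt. iband) then
         s = (0.d0, 0.d0)
         do i = 1, m
            s = s + dconjg(psi(i))*contrib(i)
         end do
         do i = 1, m
            contrib(i) = -s*psi(i)
         end do
      else if (rank .gt. iband) then
         do i = 1, m
            contrib(i) = (0.d0, 0.d0)
         end do
      end if
c     steps (4)-(5): global sum, delivered to node iband only
      call MPI_Reduce(contrib, result, m, MPI_DOUBLE_COMPLEX, MPI_SUM,
     &                iband, MPI_COMM_WORLD, ierr)
      if (rank .eq. iband) then
         do i = 1, m
            psi(i) = result(i)
         end do
      end if
      call MPI_Finalize(ierr)
      end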
(c) Divide real and reciprocal space. The final straightforward option is to consider spreading the wave function within each electronic band over the processing nodes, which also necessitates simultaneously spreading the charge densities over the nodes in a compatible manner. This approach exploits the fact that the K-S Hamiltonian can be divided into a part which is local in real space and a second part which is local in reciprocal space. In this approach we should ideally like to divide the FFT mesh into P distinct regions of equal volume, P being the number of processing nodes, and assign a single such region to each node. The wave function is then spread by simply assigning to each node the computations in respect of the same regions of the wave function. Note that the regions do not need to be contiguous, and we may use the freedom of defining the regions in order to assign approximately equal numbers of wave function components to each node. In this approach we again address the problem of computing in parallel the orthogonalisation of band i to all other bands j < i: (1) every node computes the partial overlaps S_j of band i with each band j < i over the assigned region of the wave function, (2) every node participates in a global sum of the S_j, and (3) every node then subtracts the corresponding projections from its own region of band i.
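Under this division the orthogonalisation becomes a sum of node-local partial overlaps followed by a global reduction, which is the pattern used later in table 2. The fragment below is again only an illustrative sketch, with MPI standing in for the machine-specific global sum; it assumes that every node holds a slice of length mloc of each of nbands bands.

      subroutine ortho_slice(mloc, nbands, c)
c     Sketch of orthonormalisation for approach (c): each node stores
c     a slice of length mloc of every band in c(mloc,nbands).
c     MPI is used here purely for illustration.
      include 'mpif.h'
      integer mloc, nbands, i, j, ig, ierr
      complex*16 c(mloc, nbands), cov(nbands), covg(nbands)
      double precision anorm, ag
      do i = 1, nbands
c        partial overlaps <psi_j|psi_i> on this node's slice
         do j = 1, i-1
            cov(j) = (0.d0, 0.d0)
            do ig = 1, mloc
               cov(j) = cov(j) + dconjg(c(ig,j))*c(ig,i)
            end do
         end do
c        one global sum for all i-1 overlaps at once
         if (i .gt. 1) call MPI_Allreduce(cov, covg, i-1,
     &        MPI_DOUBLE_COMPLEX, MPI_SUM, MPI_COMM_WORLD, ierr)
c        subtract the projections from this node's slice of band i
         do j = 1, i-1
            do ig = 1, mloc
               c(ig,i) = c(ig,i) - covg(j)*c(ig,j)
            end do
         end do
c        normalise band i
         anorm = 0.d0
         do ig = 1, mloc
            anorm = anorm + dble(dconjg(c(ig,i))*c(ig,i))
         end do
         call MPI_Allreduce(anorm, ag, 1, MPI_DOUBLE_PRECISION,
     &        MPI_SUM, MPI_COMM_WORLD, ierr)
         ag = 1.d0/dsqrt(ag)
         do ig = 1, mloc
            c(ig,i) = c(ig,i)*ag
         end do
      end do
      end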
(d) Divide by band and by region. Finally, approaches (b) and (c) can be combined by arranging the nodes into band groups and region groups, each node holding one region of the bands belonging to its band group. The orthogonalisation of band i then proceeds along the lines of (b) and (c): (1) every node in band group i sends the assigned region of band i to every node in the same region group and in a band group j < i; (2) each such node computes the partial overlap of its region of band j with the received region; (3) every node participates in the summation of the overlaps S over the members of its band group; (4) every node in every band group j < i forms the corresponding projection, and a summation over band groups within each region group delivers the orthogonalised region of band i back to band group i.
In choosing between these strategies we must take account of three features of our target machines: (a) the i860 node processor, (b) the node memory of 16 Mbytes and (c) the message-passing nature of the architecture. The characteristics of the communication subsystem of, and the communication algorithms used for, the Meiko Computing Surface are compared with the Intel iPSC/860 in the appendix.

We wish to study large scale systems, with N_a taking values in the range 500-1000. In such simulations the number of FFT mesh points in each of the three spatial directions is typically in the range 50-100, indicating that a single copy of the FFT mesh represented in FORTRAN COMPLEX*16 data requires approximately 16 Mbytes of store. We have already discounted approach (a), and this simple calculation indicates that we cannot pursue approach (b).

We have followed the spirit of approach (c), as opposed to (d), in our program, for the following reasons: the number of processing nodes is not large, the maximum value being 64 in the Meiko Computing Surface; the aggregate memory is limited and we wish to retain the maximum store for the wave function array; and the communication performance of each machine is rather poor in comparison to the anticipated computational performance. We have modified the strategy described in (c) to further take account of features of the i860 architecture, specifically the 8 Kbyte data cache, and now describe in more detail some of the computationally dominant parts of our parallel program. These are the computation of the 3D FFT, the orthonormalisation of the wave functions, and the calculation of the non-local energies. We also describe a standard parallel computation of the potential energy.

We can express a three-dimensional FFT as three sets of one-dimensional transforms, corresponding to transforms in the x-, y- and z-directions. Each 1D transform in a set is independent of every other transform in that set, and can therefore be performed in parallel; however, the sets must be transformed in sequence. We noted above that the edge length of the FFT box is expected to be in the range 50-100, and we therefore anticipate that the 1D FFT calculation will be able to make good use of the i860 data cache.

Consider now the wave function for a typical large scale system. With around 500 atoms and 1000 bands we have M ≈ 100 000 and, given 64 nodes, the storage at each node for its portion of the wave function of a single band is around 25 Kbytes (100 000 plane waves × 16 bytes / 64 nodes). We do not expect that the orthonormalisation calculation will be able to make particularly good use of the data cache. The distribution of data is primarily determined by the requirements of the 3D FFT, with the freedom in the details of that distribution determined by the orthonormalisation.
Fig. 1. Example of the division of parts of reciprocal space (upper panel) and real space (lower panel) over nodes. We assume for simplicity a cubic unit cell, a uniform 9×9×9 FFT grid and four computing nodes. The x-column index in reciprocal space and the yz-plane index in real space are indicated by the indices (1-9) in square boxes; the node index (1-4) is also shown.
Figure 1 shows a typical example of the division of parts of real and reciprocal space over nodes. Each processor carries a set of x-columns of G-vectors in Fourier space and a set of yz-planes (planes perpendicular to the x-direction) in real space. The Fourier-space cell is distributed over nodes by assigning a number of x-columns to each node. It is not necessary to distribute real space in the same way, and we choose to assign a set of yz-planes to each node. The wave function is distributed such that a wave function plane-wave coefficient is allocated to the node which owns the reciprocal-space x-column in which that coefficient is located. This method of distribution does not require any spatial relationship between the particular
x-columns assigned to nodes in the reciprocal-space cell. The detailed distribution is determined by our wish to load balance computations such as the orthonormalisation, which means that we must assign equal portions of the wave function to each node. The wave functions contain plane-wave components only up to a cutoff energy, hence the plane waves occupy a sphere in reciprocal space. To ensure a faithful representation of the charge density the FFT must include components in all directions up to twice the cutoff used for the wave functions, from which it follows that the computations over the wave functions do not occur at all points of Fourier space. We therefore allocate x-columns to nodes with two goals: the number of columns allocated to each node should be as nearly equal as possible, and the number of wave function coefficients allocated to each node should be as nearly equal as possible.

In actual fact it would be very difficult to satisfy the second of these criteria optimally. We have used a simple algorithm which we find works in a satisfactory manner. We first divide the x-columns into those which contain at least one wave function plane wave, and label these as hard; we label the rest as easy. We then simply hand out the x-columns in turn to the nodes, starting with the hard columns and continuing with the easy columns until all have been assigned (a sketch of this hand-out is given below). In a large-scale system we have thousands of x-columns and very little load imbalance occurs.

The detailed distribution of yz-planes is also unimportant; we have simply allocated an as nearly equal as possible number of planes to each node. We note here that the distribution of real space by yz-planes allows us to minimise the amount of communication involved in the 3D FFT, but it leads to a load imbalance for computations performed in real space. On our target machines the saving of communication time outweighed the node idle time due to load imbalance. However, on a large machine the extreme case may arise where the number of nodes is larger than the number of yz-planes. In this case it would be straightforward to perform FFTs on y- and z-columns at the expense of increasing the communication load; the final distribution in real space would then be by z-columns.
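A serial sketch (ours, not the production routine) of the hard/easy hand-out just described: columns flagged as hard are dealt out to the nodes first and the easy ones afterwards, so that consecutive columns of each kind go to consecutive nodes. In CETEP the flag would be derived from the wave-function cutoff test; here it is simply an input array.

      subroutine assign_columns(ncols, ishard, nnodes, owner)
c     Round-robin assignment of x-columns to nodes: hard columns
c     (those containing at least one wave-function plane wave,
c     ishard = 1) are handed out first, then the easy ones.
c     owner(icol) returns the node (0..nnodes-1) owning column icol.
      integer ncols, nnodes
      integer ishard(ncols), owner(ncols)
      integer icol, inode, ipass
      inode = 0
      do ipass = 1, 2
         do icol = 1, ncols
            if ((ipass .eq. 1 .and. ishard(icol) .eq. 1) .or.
     &          (ipass .eq. 2 .and. ishard(icol) .eq. 0)) then
               owner(icol) = inode
               inode = mod(inode + 1, nnodes)
            end if
         end do
      end do
      end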
The data division implies that the wave functions {ψ_i}, the electronic charge density ρ_e(r), the K-S Hamiltonian H, etc. are also distributed over the processors as shown in fig. 1, and that each node performs computations only on the part of {ψ_i}, ρ_e(r), H, etc. which is physically stored on it. From this discussion it should be clear that the division of operations between real and reciprocal space and the distribution of both spaces over the processors (fig. 1) is of paramount importance in our approach. Figure 2 shows the flow diagram of our computations and the division of operations between the two spaces.

Fig. 2. Flow diagram of the computations. The principal operations performed are shown in part (a); part (b) details the operations performed in the band-by-band conjugate gradients optimisation [11] of the wave functions. V_xc(r) = δE_xc[ρ_e]/δρ_e(r) is the exchange-correlation potential, V_H(r) = ∫dr′ [ρ_e(r′)/|r − r′|] is the Hartree potential, and V_ion,loc(r) = Σ_I v_ion,loc(r − R_I) is the sum of the local parts of the ionic pseudopotential. The square boxes on the left indicate the space in which the operation is performed (F - Fourier space, R - real space) and the square boxes on the right indicate the type of operation (S - sequential, P - parallel).

The procedure for the calculation of the 3D FFT is shown schematically in table 1.
Table 1
A schematic illustration of the parallel version of the 3D FFT. The example here illustrates the transform from Fourier space into real space; for the reverse transform the order of the steps must be reversed.

   Perform a 1D FFT of the x-columns of G-vectors on each node.
   Reorganise storage from columns to yz-planes.
   CALL global exchange routine so that each node contains a number of complete yz-planes.
   Perform a 1D FFT in the y-direction.
   Transpose data so that z-points are sequential.
   Perform a 1D FFT in the z-direction.
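To make the structure of table 1 concrete, the following serial sketch expresses a 3D transform as three sets of 1D transforms along x, y and z. A naive O(n²) discrete Fourier transform stands in here for the optimised (assembler-coded) 1D FFT, and the parallel code of table 1 inserts the global exchange between the x set and the y set; this is an illustration of the decomposition only, not the routine used in CETEP.

      subroutine dft3d(nx, ny, nz, a)
c     3D transform expressed as three sets of 1D transforms (x, y, z).
c     Each 1D transform within a set is independent of the others;
c     the sets themselves must be done in sequence.
c     Assumes edge lengths of at most 1024.
      integer nx, ny, nz, ix, iy, iz
      complex*16 a(nx, ny, nz), line(1024)
      do iz = 1, nz
         do iy = 1, ny
            do ix = 1, nx
               line(ix) = a(ix, iy, iz)
            end do
            call dft1d(nx, line)
            do ix = 1, nx
               a(ix, iy, iz) = line(ix)
            end do
         end do
      end do
      do iz = 1, nz
         do ix = 1, nx
            do iy = 1, ny
               line(iy) = a(ix, iy, iz)
            end do
            call dft1d(ny, line)
            do iy = 1, ny
               a(ix, iy, iz) = line(iy)
            end do
         end do
      end do
      do iy = 1, ny
         do ix = 1, nx
            do iz = 1, nz
               line(iz) = a(ix, iy, iz)
            end do
            call dft1d(nz, line)
            do iz = 1, nz
               a(ix, iy, iz) = line(iz)
            end do
         end do
      end do
      end

      subroutine dft1d(n, x)
c     Naive 1D discrete Fourier transform, used only as a stand-in
c     for a proper FFT.  Assumes n <= 1024.
      integer n, j, k
      complex*16 x(n), y(1024)
      double precision pi, arg
      pi = 4.d0*datan(1.d0)
      do k = 1, n
         y(k) = (0.d0, 0.d0)
         do j = 1, n
            arg = -2.d0*pi*dble((j-1)*(k-1))/dble(n)
            y(k) = y(k) + x(j)*dcmplx(dcos(arg), dsin(arg))
         end do
      end do
      do k = 1, n
         x(k) = y(k)
      end do
      end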
With the data organisation shown in fig. 1, each of the 1D FFTs runs on a single node and there is only one massive data exchange, from x-columns to yz-planes. The fact that the wave functions use a smaller plane-wave cutoff than the charge density enables a short cut when transforming a wave function: it is not necessary to perform the 1D FFT on columns in which all the coefficients are zero in the first set of transforms and, more importantly, neither is it necessary to subsequently communicate the values in those columns. The distribution of real space by yz-planes rather than, for example, z-columns thus reduces the data communication overhead of the FFT by a factor of about six.

The procedure for the wave function orthonormalisation is shown schematically in table 2. Each processor performs computations on its region of the Fourier-space representation of the wave functions. Results needed by all the processors, such as the matrix elements of the overlap matrix coverl (⟨ψ_j,k|ψ_i,k⟩), are globally summed over all the nodes. The global summations required for the orthonormalisation sum single complex numbers, and N_b² such global sums are required. To reduce the N_b² communication setups the orthonormalisation has been structured as shown in table 2: up to N_b complex numbers are communicated at a time and only N_b communication setups are necessary. As discussed previously, the computational cost of the orthonormalisation will dominate for large systems.
Table 2
A schematic illustration of the orthonormalisation of nbands wave functions at the point k in the Brillouin zone. A given processor carries nplwkp(k) plane-wave coefficients of the wave functions at the point k. zdotc (double precision dot product of a complex conjugated vector and another complex vector), zaxpy (double precision complex vector-scalar product and vector addition), and zdscal (double precision multiplication of a complex vector by a real scalar) are standard BLAS routines [15].

   compute ⟨ψ_j,k|ψ_i,k⟩
         do 10 i = 1, nbands
            do 20 j = 1, i-1
               coverl(j) = zdotc(nplwkp(k), c(1, j, k), 1, c(1, i, k), 1)
   20       continue
            CALL global sum of coverl for i-1 bands over all the nodes
   compute ψ_i,k − Σ_{j<i} ⟨ψ_j,k|ψ_i,k⟩ ψ_j,k
            do 30 j = 1, i-1
               CALL zaxpy(nplwkp(k), -coverl(j), c(1, j, k), 1, c(1, i, k), 1)
   30       continue
   compute ψ_i,k / √⟨ψ_i,k|ψ_i,k⟩  (normalisation)
            anorm = dble(zdotc(nplwkp(k), c(1, i, k), 1, c(1, i, k), 1))
            CALL global sum of anorm over all the nodes
            anorm = 1.0d0/dsqrt(anorm)
            CALL zdscal(nplwkp(k), anorm, c(1, i, k), 1)
   10    continue

The calculation of the energy terms which are a function of the charge density ρ_e(r) is straightforward. As an example we show in table 3 the calculation of the potential energy ∫dr V_ion,loc(r) ρ_e(r) due to the local part of the total ionic potential, V_ion,loc(r) = Σ_I v_ion,loc(r − R_I). Again each processor performs computations on its portion of the real-space representation of V_ion,loc(r) and ρ_e(r), and the result is globally summed over all the processors.

Table 3
A schematic illustration of the calculation of the potential energy enpot. A given processor carries numprt points of the real-space representation of the electronic charge density, stored in a complex array chdenr, and of the local part of the total ionic pseudopotential, stored in a complex array cv. zdotu (double precision dot product of two complex vectors) is a standard BLAS routine [15].

   enpot = dble(zdotu(numprt, chdenr, 1, cv, 1))
   CALL global sum of enpot over all the nodes

The computation of the energies due to the non-local character of the ionic pseudopotential on a parallel computer is more involved. We assume here that the non-local pseudopotentials v_non-loc(r, r′) are in the separable form suggested by Kleinman and Bylander [13]. We discuss here a real-space implementation of the non-local potentials [14] which reduces the workload from O(N_b N_a M), typical of a Fourier-space implementation, to O(N_b N_a). This significantly reduces the cost of the operations associated with the non-local potential, which would otherwise be the most computationally expensive part of the calculation. As an example we show the computation of the contribution to the non-local energy from the Ith atom and band n,k,

   E^l_{I,n,\mathbf{k}} = E_l \sum_{m=-l}^{l} Z^{lm}_{I,n,\mathbf{k}}\, Z^{lm\,*}_{I,n,\mathbf{k}},    (7)

where E_l is an angular-momentum-dependent scaling factor and

   Z^{lm}_{I,n,\mathbf{k}} = \int_{\rm core} \zeta_l(r)\, Y_{lm}(\theta,\phi)\, \psi_{n,\mathbf{k}}(\mathbf{r}+\mathbf{R}_I)\, d\mathbf{r} \equiv \int_{\rm core} R_{lm}(\mathbf{r})\, \psi_{n,\mathbf{k}}(\mathbf{r}+\mathbf{R}_I)\, d\mathbf{r}.    (8)

Here ζ_l(r) are (modified) radial projection functions which vanish beyond some critical radius R_c and Y_lm are spherical harmonics. Following our general parallelisation strategy, the projections (eq. (8)) can easily be computed by first calculating R_lm(r) around each ion.
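The essence of eq. (8) on the FFT grid is a weighted sum over the grid points inside the core sphere of one ion. The following toy routine is an illustration under our own naming, not the production routine of table 4: it assumes that ζ_l(r), Y_lm and the real and imaginary parts of ψ_{n,k} have already been tabulated at those points and that dvol is the volume element of the grid; on a parallel machine each node would pass only the points it stores and the partial results would then be globally summed.

      subroutine zproj(npts, zeta, ylm, psir, psii, dvol, z)
c     Discrete form of eq. (8):
c       Z = sum over grid points of zeta_l(r)*Y_lm(r)*psi(r+R_I)*dV,
c     taken over the npts points inside the core sphere of one ion.
c     psir/psii hold the real and imaginary parts of the wave
c     function at those points.
      integer npts, ip
      double precision zeta(npts), ylm(npts), psir(npts), psii(npts)
      double precision dvol
      complex*16 z
      z = (0.d0, 0.d0)
      do ip = 1, npts
         z = z + dcmplx(zeta(ip)*ylm(ip)*psir(ip),
     &                  zeta(ip)*ylm(ip)*psii(ip))
      end do
      z = z*dvol
      end

The non-local energy of eq. (7) then follows by accumulating E_l |Z|² over the m components.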
Fig. 3. Schematic illustration of the parallel computation of the projectors Z^{lm}_{I,n,k} (defined in the text) in real space. We assume for simplicity a cubic unit cell, a uniform 9×9×9 FFT grid and four computing nodes. The yz-plane index in real space is indicated by the indices (1-9) in square boxes; the node index (1-4) is also shown. Note the effect of the periodic boundary conditions.
Since the planes of real-space points are distributed over the nodes, so is R_lm(r), as illustrated in fig. 3. The integral in eq. (8) is therefore performed by calculating the partial sum on each node, followed by summing these contributions over all the nodes. This is shown schematically for the s-wave component in table 4. Because of the distribution of ions and r-points over the nodes described above, the loops over ions run only over ions which have r-points on the given node, and the loops over r-points run only over the r-points physically stored on that node. In a real application the structure of the routine differs from that shown in table 4: in order to reduce the number of communication setups, the calculation for each band is first carried out for all the ions on a given node and each angular momentum component l, and the results are then globally summed over all the nodes. Implementations of other types of operation involving v_non-loc(r, r′), such as the contributions to the ionic forces from the non-local pseudopotential, are carried out using modifications of the strategy illustrated in table 4.

Table 4
A schematic illustration of the calculation of the s-wave component of the non-local energy due to ion ni and band n,k. The call of this routine is preceded by a call of the routine which computes the number of grid points nrlppi for each ion ni on each node, and R_lm(r), stored in vrlgrd, for all the grid points lying in the sphere of radius R_c around each ion on each node. zrwfpi contains a real-space representation of ψ_n,k(r − R_I) on the grid points stored on a given node around ion ni; the real array zrwfpi contains the real part followed by the imaginary part of ψ_n,k(r − R_I). ddot (dot product of two real vectors) is a standard BLAS routine [15].

   compute the part of Z^{00}_{I,n,k} around ion ni which is physically stored on a given node
      dum1 = ddot(nrlppi(ni), vrlgrd(1, 1, ni), 1, zrwfpi(1), 2)
      dum2 = ddot(nrlppi(ni), vrlgrd(1, 1, ni), 1, zrwfpi(2), 2)
      cesave(1) = dcmplx(dum1, dum2)
      CALL global sum of Z^{00}_{I,n,k} over all the nodes
   compute Z^{00}_{I,n,k} Z^{00*}_{I,n,k} and add it, properly scaled, to the s-wave non-local energy
      dum1 = dble(cesave(1)*conjg(cesave(1)))
      calculate the proper scaling, scale
      epss(ni, k) = epss(ni, k) + dum1*scale

These examples show that the parallel routines bear a striking resemblance to their sequential counterparts. The main difference between the
sequential and parallel versions of the routines is that in the parallel version the variables and arrays have only a local (node) meaning. Occasionally we need to access distributed data whose ordering is not determined by the 3D FFT, such as the x-, y- and z-components of the vectors in Fourier or real space. As an example we show in table 5 the modifications necessary for the parallelisation of three nested do loops over the x-, y- and z-components of the reciprocal-space box.

Table 5
A schematic illustration of the modifications necessary for the parallelisation of three nested do loops over the x-, y- and z-components of the G-vectors in Fourier space. The meaning of the symbols is self-explanatory.

   Serial:
      do iz = 1, ngz
         do iy = 1, ngy
            do ix = 1, ngx
               ...
            end do
         end do
      end do

   Parallel:
      do icol = 1, ncol(inode)
         iz = izcol(icol, inode)
         iy = iycol(icol, inode)
         do ix = 1, ngx
            ...
         end do
      end do

3. Results and performance tests
The strategy described in section 2 was used to obtain a set of parallel total energy pseudopotential codes (CETEP) which can be used for EM and MD calculations. Given an array of n processors we would hope that the application running on all processors would be a factor n faster than the program running on a single processor. In practice this is never achievable because of:

(i) Residual sequential computations - often the same calculation is carried out at every processor. An example is the assignment of ions and grid points to processors in calculations involving the non-locality of the pseudopotential, or the integration of the equations of motion for the ionic degrees of freedom in MD.

(ii) Uneven distribution of computations - the column distribution of Fourier space, and therefore of the wave functions, was designed to minimise this effect for computations in reciprocal
space. Due to the plane distribution of real space, an imbalance arises if the number of grid points in the x-direction is not a multiple of n.

(iii) Overheads of processor synchronisation - this occurs when summing data over all the processors, where no processor can proceed without the results of some calculations performed at other processors. This is the case, e.g., in the orthonormalisation of the wave functions.
(iv) Overheads of data communication - processors are inactive while communications are taking place. The communication overhead is significant during, e.g., the data exchange of the 3D FFT.
Within a given parallelisation strategy, the efficiency of the parallel codes can be increased by improving the node performance and the communications. We now turn to that point. Most of the program routines are written in FORTRAN77. Typically the single-node i860 performance for applications written in high-level languages is well below 10 Mflops #4. In order to improve the node performance, parts of the FORTRAN code in all the routines relevant for the code performance were substituted by the corresponding BLAS routines [15]. Typically the single-node performance of those parts of the code has been increased to about 25-30 Mflops. Similarly, the node performance was improved by implementing assembler-coded 1D FFT transforms #5. We have also paid very close attention to an efficient implementation of the communications, since the communication performance of the two target machines is relatively slow compared to their compute speeds. The implementations are briefly described in the appendix.

#4 The Green Hills Fortran compiler has been used.
#5 The 1D FFTs have been assembler-coded by W.H. Purvis at the Daresbury Laboratory.

Fig. 4. Variation of CPU time with basis set size for the Meiko i860 Computing Surface (upper panel) and for the Intel iPSC/860 (lower panel). In both cases the performance for 16 computing nodes is indicated by dotted lines and for 32 computing nodes by dashed lines; for comparison the performance of a single CRAY X-MP processor is shown by solid lines. The timing refers to the conjugate gradients optimisation of the electronic K-S orbitals.

We now describe and test the part of the simulation code which deals with the construction of the ground-state potential Φ[{R_I}]. The system chosen for this test was a 64-atom cubic cell
containing Si atoms arranged in the diamond structure. There are 128 electronic orbitals {ψ_i} optimised by the program. The Brillouin-zone sampling was done by taking only the Γ point of the supercell. The energy cutoff was taken to be 6, 12 and 18 Ryd, corresponding to 2109, 6031 and 11075 plane waves, respectively. In all the calculations Kerker pseudopotentials [16] were used with the s-wave component of the non-local potential treated as local, retaining three "p" and five "d" projectors. This may be taken as representative of a medium-size system; this size was chosen to enable a comparison with runs performed on conventional supercomputers, since a larger system would not fit onto a sequential computer. In order to assess the performance of our codes we show in fig. 4 the variation of CPU time with the size of the basis set for the two parallel machines, the Meiko i860 Computing Surface and the Intel
iPSC/860 #6, using 16 and 32 nodes. For comparison, the timing for a single CRAY X-MP processor is also shown in fig. 4. Generally, on both machines 16 nodes outperform a single CRAY processor. The figure also shows the importance of a proper loading of the processors. For the smallest basis-set size the performance of both 16 and 32 nodes on both machines is comparable to a single CRAY processor; this is an example of a communication-dominated calculation. However, for more adequately loaded nodes, such as for the 18 Ryd cutoff system, doubling the number of nodes results in approximately halving the execution time.

#6 The tests described here have been performed on the Intel iPSC/860 hypercube at the Daresbury Laboratory.

From a practical point of view the performance relative to conventional sequential supercomputers is of interest. These tests show that on all 64 (optimally loaded) Meiko i860 nodes our codes achieve the performance of about 5 single CRAY X-MP processors; this number would be more than six for the Intel iPSC/860. The difference is mainly due to the communications used by the two machines and the extremely non-local nature of the communications required in our application (see appendix). We note that applications which require only "nearest neighbour" communications (such as QCD lattice simulations [17]) can achieve a faster communication speed on the Meiko.

We have shown that the construction of the many-body potential Φ[{R_I}] can be mapped onto a network of processors and computed efficiently in parallel. It is straightforward to use Φ[{R_I}] for EM with respect to the ionic degrees of freedom {R_I} using a steepest-descent type of algorithm,

   M_I \dot{\mathbf{R}}_I(t) = -\nabla_{\mathbf{R}_I} \Phi[\{\mathbf{R}_J(t)\}].    (9)

Here the dot indicates the time derivative and {M_I} are the masses of the ions. Such an approach has been successfully applied to systems as large as 400 ions [18], which are clearly beyond the power of conventional supercomputers. We now show in more detail that the CG-based optimisation of Φ[{R_I}] can be efficiently applied
also in the context of finite-temperature MD simulations. The objective is to solve the Newton equations of motion for the ionic degrees of freedom {R_I},

   M_I \ddot{\mathbf{R}}_I(t) = -\nabla_{\mathbf{R}_I} \Phi[\{\mathbf{R}_J(t)\}].    (10)

With a first-order optimisation of the electronic degrees of freedom, such as the CG method, the accuracy of the electronic quench to the instantaneous ground state must be well controlled: with a poorly converged electronic minimisation, a systematic damping of the ionic motion occurs [19]. Such problems do not arise if a Car-Parrinello dynamical (second-order) optimisation of {ψ_i} [6] is used. We adopt here the solution to this problem proposed by Arias, Payne and Joannopoulos [20]. At any ionic step the electronic CG minimisation is initialised from a set of trial wave functions {ψ̃_i(r, t + Δt)} constructed by a two-dimensional extrapolation of the wave functions from the last three time steps [20],

   \tilde{\psi}_i(\mathbf{r}, t+\Delta t) = \psi_i(\mathbf{r}, t) + \alpha\left[\psi_i(\mathbf{r}, t) - \psi_i(\mathbf{r}, t-\Delta t)\right] + \beta\left[\psi_i(\mathbf{r}, t-\Delta t) - \psi_i(\mathbf{r}, t-2\Delta t)\right].    (11)

This procedure generates a non-orthonormal set of trial vectors {ψ̃_i(r, t + Δt)} which need to be orthonormalised. The extrapolation coefficients {α, β} are fixed by minimising the norm ||R̃_I(t + Δt) − R_I(t + Δt)||, with {R̃_I(t + Δt)} being the ionic coordinates extrapolated in the same way as {ψ̃_i(r, t + Δt)}. The extrapolation in eq. (11) constitutes an excellent starting set of trial wave functions for the CG minimisation and eliminates any "drag force" on the ionic dynamics deriving from the accumulation of errors in the electronic minimisation [19].

In fig. 5 we show the temporal evolution of the system temperature, the potential energy, and the microcanonical constant of motion E_cons(t) = Σ_I ½ M_I Ṙ_I²(t) + Φ[{R_I(t)}] for a system of 8 silicon atoms equilibrated at a temperature of 300 K. The same pseudopotential as in the tests with the 64-atom cell and a star of four special k-points in the BZ [21] were used. Equation (10) was discretised using the Verlet algorithm [22] with a time step of 1 fs.
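A serial sketch of how the coefficients α and β of eq. (11) can be obtained: the same two-term extrapolation is applied to the ionic coordinates and the deviation from the actual positions at t + Δt is minimised, which reduces to a 2×2 linear least-squares problem. The variable names are ours and the fallback for a vanishing determinant is a simple linear extrapolation; this is an illustration of the prescription of ref. [20], not the code itself.

      subroutine extrap_coeff(na, r0, r1, r2, rnew, alpha, beta)
c     Least-squares determination of alpha, beta in eq. (11):
c     minimise  sum_I | r0_I + alpha*(r0_I - r1_I)
c                        + beta*(r1_I - r2_I) - rnew_I |**2
c     where r0 = R(t), r1 = R(t-dt), r2 = R(t-2dt), rnew = R(t+dt).
      integer na, i, k
      double precision r0(3,na), r1(3,na), r2(3,na), rnew(3,na)
      double precision alpha, beta
      double precision u, v, d, suu, svv, suv, sud, svd, det
      suu = 0.d0
      svv = 0.d0
      suv = 0.d0
      sud = 0.d0
      svd = 0.d0
      do i = 1, na
         do k = 1, 3
            u = r0(k,i) - r1(k,i)
            v = r1(k,i) - r2(k,i)
            d = rnew(k,i) - r0(k,i)
            suu = suu + u*u
            svv = svv + v*v
            suv = suv + u*v
            sud = sud + u*d
            svd = svd + v*d
         end do
      end do
c     solve the 2x2 normal equations
      det = suu*svv - suv*suv
      if (dabs(det) .gt. 1.d-12) then
         alpha = (sud*svv - svd*suv)/det
         beta  = (svd*suu - sud*suv)/det
      else
c        degenerate case: fall back to linear extrapolation
         alpha = 1.d0
         beta  = 0.d0
      end if
      end

The trial wave functions ψ̃_i(t + Δt) are then assembled from ψ_i(t), ψ_i(t − Δt) and ψ_i(t − 2Δt) with the same α and β, and orthonormalised before the CG minimisation starts.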
Fig. 5. Temporal evolution of the system temperature T (upper panel), the potential energy (middle panel), and the microcanonical constant of motion E_cons (lower panel) for a system of 8 silicon atoms.
The precision in quenching the electronic degrees of freedom was ε ≈ 10⁻⁶ eV/atom, requiring 2 CG iterations per ionic step. An 8-atom system obviously does not represent a reasonable statistical ensemble but is sufficient to demonstrate the main features of our approach. We note that there is no drift in the microcanonical constant of motion on the time scale of this simulation, which is an indication that our approach yields a correct microcanonical dynamics. We conclude that the present approach is of comparable efficiency to the adiabatic dynamics approach of Car and Parrinello [6]. The rationale behind this is that each electronic CG step is about a factor of 4-5 more expensive than a single adiabatic dynamics step; this difference is compensated by the length of the ionic dynamics step Δt, which in the present approach can safely be taken a factor of 5 larger. In fact, the tolerance of 10⁻⁶ eV/atom in the electronic minimisation is a very conservative choice, and keeping this high accuracy of the potential energy is not productive. By relaxing ε to a value commonly accepted in the adiabatic dynamics approach (ε ≈ 10⁻⁵ eV/atom), a single electronic CG optimisation would suffice, thus making the two approaches about equally efficient. Given the efficiency of computing Φ[{R_I}] in parallel demonstrated above, our approach opens new perspectives for ab initio MD simulations with several hundreds of atoms in the unit cell. Such simulations are currently underway.
4. Conclusions

We have described an implementation of total energy plane-wave pseudopotential codes on a parallel computer. Several approaches to parallelising the calculations have been discussed, with a view to the eventual implementation of such codes on teraflop machines. We argue that for the particular parallel machines we are using, a Meiko i860 Computing Surface and an Intel iPSC/860, the approach of assigning parts of real and Fourier space across the processors is the least restricted one. Within that approach we have discussed in detail an efficient implementation of the computationally dominant parts of our parallel codes, namely the 3D FFT, the orthonormalisation of the wave functions, and the calculations involving the non-locality of the pseudopotentials.

The efficiency of our parallel codes has been demonstrated by performing timing tests on our two target i860-based parallel machines and comparing these to tests performed on conventional sequential computers. Particular attention has been paid to the implementation of ab initio molecular dynamics. We have shown that our general strategies, based on the conjugate gradients optimisation of the many-body interatomic potential, generate an accurate and efficient ionic dynamics. In conclusion, our approach opens the possibility of performing ab initio calculations, such as energy minimisation or molecular dynamics, for systems containing many hundreds of atoms in the unit cell.
Acknowledgements

The work reported here was performed as part of the "Grand Challenge" collaborative project, coordinated by Prof. D.J. Tildesley. We acknowledge financial support by the Science and
Engineering Research Council under grants GR/G 32779 and B18534: NACC. We are grateful for the help we have received from Dr. N.M. Harrison and Dr. W.H. Purvis at the Daresbury Laboratory, from Dr. K.C. Bowler and Dr. S.P. Booth at the University of Edinburgh, and from Mr. D. Hewson at Meiko Scientific Ltd.
Appendix. Comparison of communications in the Intel iPSC/860 and the Meiko i860 Computing Surface

The two target machines are very similar. Each consists of a host computer plus a number of nodes, each node consisting of an application processor, the Intel i860, private memory and some communications hardware. The communication systems provided in these machines have, however, very different properties, which have consequences for the implementation of the communication structures required by our program. We shall now briefly describe the two communication systems and the implementation of the key communication structures of our program.

In the iPSC/860, communications are provided by the custom Direct Connect Module (DCM) hardware. These units are connected into a fixed hypercube topology, and provide efficient cut-through routing between arbitrary pairs of nodes with a channel bandwidth of 2.8 Mbytes/s.

In the Computing Surface, communications are provided by a pair of Inmos T800 transputers. The two transputers and the i860 processor communicate by shared memory and hardware interrupts; each node thus provides eight Inmos links which may be connected to links provided at other nodes. Each of these links provides one outgoing channel and one incoming channel, and is able to achieve a channel bandwidth of about 1 Mbyte/s when transmitting messages along both channels. The transputer contains no hardware support for message routing; this is provided by the custom Computing Surface Network (CSN) software, which is essentially a store-and-forward packet switching system. The system software provides the programmer with considerable flexibility in the configuration of the hardware. We chose to use a hypercube topology as in the iPSC/860, since efficient algorithms for the communication structures required by our program are known for this geometry. An n-dimensional hypercube contains N = 2ⁿ nodes, labelled 0, ..., N − 1. The nodes are connected such that channel d connects node i to node j = i XOR 2^d, where XOR is the bitwise exclusive OR operator.

Broadcast communication, used in program initialisation, is provided directly by the NX/2 node operating system in the iPSC/860, and we have used these facilities where the program needs to broadcast data. This functionality is not provided by the CSN, and in this case we used the following standard tree algorithm. Each node computes the XOR of its node number and the number of the source node (always zero in our case). The parent channel is determined as that corresponding to the least significant bit set in the result, and the child channels as all of those corresponding to more significant bits. Each node receives from the parent, unless it is the source, and forwards to all children, if any.

Global summation, used in the orthonormalisation and in computing total energies etc., is also provided directly by the NX/2 node operating system, and again we use these facilities wherever required in the program. This functionality is not provided by the CSN, and we have used another straightforward hypercube algorithm: each node loops over the n dimensions of the hypercube, exchanges partial sums with the neighbouring node in the current dimension, and accumulates the two partial sums. We implemented this algorithm in C directly on the transputer communication hardware, incorporating it into the CSN. This avoids the expense of repeatedly transferring control between the i860 and the user interface of the CSN running on the transputer during execution of the algorithm. In the case where there is a large number of global summations to be performed simultaneously, a faster algorithm is obtained by using the in-tree corresponding to the out-tree of our broadcast algorithm and pipelining the summations; we did not implement this algorithm.
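The dimension-loop global summation described above can be written very compactly. The sketch below uses MPI point-to-point calls purely for illustration (the production version is implemented in C inside the CSN): at step d each node exchanges its partial sum with the neighbour reached by flipping bit d of its node number and accumulates the result, so after n steps every node holds the full sum.

      subroutine hcube_gsum(x, ndim)
c     Global summation of a double precision scalar x over N = 2**ndim
c     nodes by looping over the hypercube dimensions.  MPI is used
c     here only for illustration.
      include 'mpif.h'
      integer ndim, d, partner, rank, ierr
      integer status(MPI_STATUS_SIZE)
      double precision x, xr
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      do d = 0, ndim-1
c        neighbour in dimension d: flip bit d of the node number
         partner = ieor(rank, 2**d)
         call MPI_Sendrecv(x, 1, MPI_DOUBLE_PRECISION, partner, 0,
     &                     xr, 1, MPI_DOUBLE_PRECISION, partner, 0,
     &                     MPI_COMM_WORLD, status, ierr)
         x = x + xr
      end do
      end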
Global exchange, used to redistribute data during the 3D FFT, is not provided directly by either the NX/2 node operating system or the CSN. This is an intensive communication structure and the key to efficient execution is keeping the communication channels busy. The different switching capabilities of the two machines lead directly to the requirement for different algorithms.

In the Intel iPSC/860 we have used the optimal cut-through algorithm, which consists of a sequence of N pairwise exchanges. During step m, node i and node j = i XOR m exchange messages, except during step m = 0, when each node simply performs a memory-to-memory copy.

In the Meiko Computing Surface we have used an optimal store-and-forward algorithm which consists of a sequence of N/2 neighbour exchanges. During each step, every node sends a different message to, and receives a different message from, each of its neighbours. The steps at which the different messages are injected into this procedure from the application are calculated to avoid channel contention. Again we implemented this algorithm in C directly on the communication transputers, incorporating it into the CSN, for performance purposes. This resulted in a speed-up of the global exchange of the 3D FFT by a factor of 3.
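The pairwise-exchange schedule for the global exchange can be sketched in the same illustrative MPI style (again not the production code, which talks to the DCM or the CSN directly): at step m each node exchanges one block with the partner whose number is its own XOR m, and step 0 is the local copy. The schedule assumes the number of nodes is a power of two, as in a hypercube.

      subroutine global_exchange(nblk, sbuf, rbuf, nprocs)
c     All-to-all exchange by the pairwise (i XOR m) schedule.
c     sbuf(nblk,0:nprocs-1) holds one outgoing block per destination;
c     rbuf receives one block per source.  MPI is used here purely
c     for illustration; nprocs must be a power of two.
      include 'mpif.h'
      integer nblk, nprocs, m, partner, rank, i, ierr
      integer status(MPI_STATUS_SIZE)
      complex*16 sbuf(nblk, 0:nprocs-1), rbuf(nblk, 0:nprocs-1)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
c     step m = 0: memory-to-memory copy of the block destined for self
      do i = 1, nblk
         rbuf(i, rank) = sbuf(i, rank)
      end do
      do m = 1, nprocs-1
         partner = ieor(rank, m)
         call MPI_Sendrecv(sbuf(1, partner), nblk, MPI_DOUBLE_COMPLEX,
     &        partner, m, rbuf(1, partner), nblk, MPI_DOUBLE_COMPLEX,
     &        partner, m, MPI_COMM_WORLD, status, ierr)
      end do
      end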
References

[1] C.R.A. Catlow, S.C. Parker and M.P. Allen, eds., Computer Modelling of Fluids, Polymers and Solids, NATO Advanced Studies Institute, Series B: Physics, Vol. 293 (Kluwer, Dordrecht, 1990).
[2] N.W. Ashcroft and N.D. Mermin, Solid State Physics (Holt-Saunders, New York, 1976) p. 425.
[3] M.J.L. Sangster and M. Dixon, Adv. Phys. 25 (1976) 247.
[4] I. Štich, R. Car and M. Parrinello, Phys. Rev. B 44 (1991) 4262.
[5] S. Lundqvist and N.H. March, eds., Theory of the Inhomogeneous Electron Gas (Plenum, New York, 1983).
[6] R. Car and M. Parrinello, Phys. Rev. Lett. 55 (1985) 2471.
[7] G.B. Bachelet, D.R. Hamann and M. Schlüter, Phys. Rev. B 26 (1982) 4199.
[8] W. Kohn and L.J. Sham, Phys. Rev. 140 (1965) A1133.
[9] M.J. Gillan, J. Phys.: Condens. Matter 1 (1989) 689.
[10] I. Štich, R. Car, M. Parrinello and S. Baroni, Phys. Rev. B 39 (1989) 4997.
[11] M.P. Teter, M.C. Payne and D.C. Allan, Phys. Rev. B 39 (1989) 12255.
[12] G. Galli and M. Parrinello, Bull. Am. Phys. Soc. 36 (1991) 835; W. Yang, Phys. Rev. Lett. 66 (1991) 1438; S. Baroni and P. Giannozzi, Europhys. Lett. 17 (1992) 547.
[13] L. Kleinman and D.M. Bylander, Phys. Rev. Lett. 48 (1982) 1425.
[14] R.D. King-Smith, M.C. Payne and J.-S. Lin, Phys. Rev. B 44 (1991) 13063.
[15] Kuck and Associates, Basic Math Library, User's Guide, Champaign, IL 61820 (1990).
[16] G.P. Kerker, J. Phys. C 13 (1980) L189.
[17] A.D. Simpson, Ph.D. thesis, University of Edinburgh (1992), unpublished.
[18] I. Štich, M.C. Payne, R.D. King-Smith, J.-S. Lin and L.J. Clarke, Phys. Rev. Lett. 68 (1992) 1351.
[19] D.K. Remler and P.A. Madden, Mol. Phys. 70 (1990) 691.
[20] T.A. Arias, M.C. Payne and J.D. Joannopoulos, Phys. Rev. B 45 (1992) 1538.
[21] H.J. Monkhorst and J.D. Pack, Phys. Rev. B 13 (1976) 5188.
[22] L. Verlet, Phys. Rev. 159 (1967) 98.