Computer Physics Communications 56 (1989) 63-67
North-Holland

PARALLEL COMPUTING: A CASE STUDY
Jan F.W. SLAETS and Gonzalo TRAVIESO
Instituto de Física e Química de São Carlos, Universidade de São Paulo, C.P. 369, 13560 São Carlos, Brazil
A simple molecular dynamics simulation is used to analyze some speed optimization techniques. The efficiencies of sequential and parallel algorithms are discussed. An implementation on a T800 transputer array is proposed and the estimated performance is compared with that obtained on a supercomputer.
1. Introduction

The need for high speed computer systems becomes important for several simulation techniques in physics. From the viewpoint of the hardware, two approaches can be used to achieve higher speeds:
- using faster components to obtain shorter instruction execution times;
- introducing parallelism.
In the first case, speed is limited mainly by the state of the art of component technology. For the second, an often used technique is the well known pipeline structure, nowadays heavily used not only in hardware designs such as vector supercomputers, but also implemented at chip level in microprocessors and number crunching chips. A degree of parallelism higher than that obtained with the pipelines used in the vector processors of supercomputers can be introduced by spreading the algorithm over many computing elements. In this case a computational throughput much higher than the actual processing speed of supercomputers can be achieved.

The main problem with parallel architectures is the lack of software tools to exploit the parallel characteristics of the architecture. At present, there are no compilers capable of distributing the workload over a network of parallel processors. If we want to obtain a good performance/cost ratio with parallel hardware, this distribution has to be done manually.
To show the actual complexity of implementing an algorithm on a parallel architecture we shall study the implementation of a bidimensional molecular dynamics simulation. The target hardware will be a transputer network.
2. Description of the problem

Let us quickly define the problem and review how a simple molecular dynamics problem can be calculated. The main purpose of the simulation is to follow the evolution in time of a set of identical interacting particles. Using a Lennard-Jones type of interaction, a first simplification can be introduced by defining a cutoff radius: only within this radius does the Lennard-Jones potential give a significant contribution to the calculation of the force. Since the distances between particles are known, we do not need to calculate the interacting force for all pairs of particles (that would represent N² − N forces per frame, where N is the total number of particles), but only for those with a significant contribution as defined by the cutoff radius. Consider, as shown in fig. 1, the case of a bidimensional space with periodic boundaries.

In order to perform an isothermal simulation, a renormalization of the velocities is required after the calculation of a certain number of frames.
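As an illustration of the cutoff, the following minimal C sketch (ours, not from the paper) evaluates the pair force only when the squared distance lies below the squared cutoff radius; it assumes reduced units (epsilon = sigma = 1), the minimum-image convention for the periodic boundaries, and an illustrative cutoff value.

    /* Pair force for the 12-6 Lennard-Jones potential with a cutoff,
       in reduced units (epsilon = sigma = 1). Names and the cutoff
       value are ours, for illustration only. */
    #define RCUT   2.5                 /* cutoff radius (assumed value) */
    #define RCUT2  (RCUT * RCUT)

    /* Minimum-image convention for a periodic box of side L. */
    static double min_image(double d, double L)
    {
        if (d >  0.5 * L) d -= L;
        if (d < -0.5 * L) d += L;
        return d;
    }

    /* Adds the force of particle j on particle i to (*fx, *fy);
       returns 1 if the pair lies inside the cutoff, 0 otherwise. */
    static int lj_force(double xi, double yi, double xj, double yj,
                        double L, double *fx, double *fy)
    {
        double dx = min_image(xi - xj, L);
        double dy = min_image(yi - yj, L);
        double r2 = dx * dx + dy * dy;

        if (r2 >= RCUT2)
            return 0;                  /* no significant contribution */

        double inv2 = 1.0 / r2;
        double inv6 = inv2 * inv2 * inv2;
        /* F(r)/r = 24 (2 r^-12 - r^-6) / r^2 for the 12-6 potential */
        double fr = 24.0 * inv2 * inv6 * (2.0 * inv6 - 1.0);

        *fx += fr * dx;
        *fy += fr * dy;
        return 1;
    }

Testing r² against the squared cutoff avoids computing a square root for every rejected pair.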
Fig. 1. Space of particles with periodic boundaries (bidimensional case). (The figure shows particle i, the repetition of the space, and the image of particle i.)
To do this, the actual temperature of the system at that instant is derived from the velocities of all the particles. If needed, all velocities are renormalized to maintain the simulation at a fixed temperature.

In this study we will use the central difference method to calculate the motion of the particles, as proposed by Verlet [1]. In this method, the new position of a particle is determined from the total force acting on it and from the positions of that particle in the two previous frames. Based on the above information, the following basic steps are needed to calculate the position of a particle:
1. find with which particles it interacts;
2. calculate the total force;
3. find the new position.
After each frame, a renormalization of the velocities is performed when needed.
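The step itself is compact. Below is a minimal sketch of one central-difference step [1], with our own variable names; the velocity appears only because it is needed to measure the temperature.

    /* One central-difference (Verlet) step for a single particle:
       r(n+1) follows from r(n), r(n-1) and the total force F(n).
       Variable names are ours, not from the paper. */
    typedef struct {
        double x, y;      /* position at frame n     */
        double xp, yp;    /* position at frame n - 1 */
        double fx, fy;    /* total force at frame n  */
    } particle;

    static void verlet_step(particle *p, double dt, double mass,
                            double *vx, double *vy)
    {
        double dt2 = dt * dt;

        /* r(n+1) = 2 r(n) - r(n-1) + F(n) dt^2 / m */
        double xn = 2.0 * p->x - p->xp + p->fx * dt2 / mass;
        double yn = 2.0 * p->y - p->yp + p->fy * dt2 / mass;

        /* central-difference velocity, v(n) = (r(n+1) - r(n-1)) / (2 dt),
           needed only to measure the temperature */
        *vx = (xn - p->xp) / (2.0 * dt);
        *vy = (yn - p->yp) / (2.0 * dt);

        p->xp = p->x;  p->yp = p->y;   /* shift the two-frame history */
        p->x  = xn;    p->y  = yn;
    }

For the isothermal renormalization, each velocity would then be scaled by sqrt(T_target / T_measured), with T_measured obtained from the mean kinetic energy of all particles.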
3. Sequential algorithms

As mentioned before, one of the key points in reducing the calculation time is the introduction of the cutoff radius. An efficient method can be implemented using infrequently updated neighbor tables, as suggested initially by Verlet in 1967 [1]. In this algorithm, only the forces between particles listed in the neighbor table are calculated.
Frequent reevaluation of the neighbor table can be avoided by choosing a table radius larger than the cutoff radius. In this case, particles not listed in the table cannot move into the cutoff radius between updates of the neighbor table. This represents a great decrease in the computing time, since the time-consuming task of finding the interacting pairs of particles (which increases with N²) is done only in some of the frames.

Another approach is the implementation of a coarse-grain structure, as suggested by Hockney and Eastwood in 1981 (cited in ref. [2]). Here the simulation space is divided by a coarse-grained mesh. If we make the size of the grain equal to or slightly larger than the cutoff radius, all the interacting forces on a particle in a grain can be derived from the particles in that grain and in the eight surrounding grains.

Regarding the efficiencies of the sequential algorithms, the following can be observed:
- The implementation of a neighbor table, in spite of diminishing the time spent searching for interacting particles, results in an execution time which is proportional to the square of the number of particles. This comes from the fact that all particle pairs have to be considered when the neighbor table is revised.
- The coarse-grain algorithm reduces the execution time, making it increase almost linearly with the total number of particles of the system, provided the system is sufficiently large. This is due to the fact that, in this case, the inclusion of new particles implies the creation of additional grains (considering that the particle density is kept constant).
Table 1
Relative comparison of evaluated times of sequential algorithms (for 1000 frames)

Number of    Basic alg.    Neighbor      Coarse-grain
particles    (h)           table (h)     (h)
   100          0.06         0.0070        0.01243
   200          0.25         0.0148        0.02487
   500          1.54         0.0438        0.06217
  1000          6.17         0.1097        0.12434
  2000         24.67         0.3084        0.24867
  5000        154.17         1.4376        0.62167
 10000        616.68         5.0974        1.24333
 20000       2466.69        19.0837        2.48667
 50000      15416.72       114.3758        6.21667
100000      61666.77       450.9739       12.43333
Performance evaluations carried out at our laboratory showed that the coarse-grain algorithm is faster than the neighbor table method when more than 1000 particles are used in the proposed molecular dynamics calculation. This important conclusion shows that, even in the case of sequential programming, a careful analysis can result in a significant optimization of the calculation speed. Results of our evaluations are shown in table 1.
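To make the coarse-grain bookkeeping concrete, the following sketch (our illustration, with assumed names and sizes) divides the box into M × M grains of side at least the cutoff radius and keeps a linked list of the particles in each grain; the force loop then only needs to visit a particle's own grain and the eight surrounding ones.

    /* Grain bookkeeping for the coarse-grain method. MAXPART and M are
       assumed values; M must be chosen so that L / M >= cutoff radius. */
    #define MAXPART 100000   /* assumed maximum number of particles */
    #define M       16       /* grains per box side (assumed) */

    static int head[M][M];   /* first particle of each grain (-1 = empty) */
    static int nxt[MAXPART]; /* next particle in the same grain */

    /* Distributes the n particles over the grains; cost is linear in n.
       Coordinates are assumed to lie in [0, L). */
    static void build_grains(int n, const double *x, const double *y,
                             double L)
    {
        double side = L / M;           /* grain side, >= cutoff */

        for (int gx = 0; gx < M; gx++)
            for (int gy = 0; gy < M; gy++)
                head[gx][gy] = -1;

        for (int i = 0; i < n; i++) {
            int gx = (int)(x[i] / side) % M;   /* grain indices of i */
            int gy = (int)(y[i] / side) % M;
            nxt[i] = head[gx][gy];     /* push i onto that grain's list */
            head[gx][gy] = i;
        }
    }

Since rebuilding the lists costs time linear in N, and each particle only examines the nine surrounding grains (with periodic wraparound), the total time grows almost linearly with N, as observed in table 1.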
4. Parallel implementation

As we shall see, much more attention is needed for a parallel implementation of this algorithm. Let us analyze the behavior of some of the above-mentioned optimizations on a parallel architecture.
4.1. Using neighbor tables

Suppose we designate a dedicated processor for each particle to keep track of its specific neighbor table. This implies that:
1. A lot of redundant information has to be stored, since each processor needs the coordinates of all the neighbor particles.
2. When the position of a particle is changed, all the neighbor tables need to be updated, and the information has to be broadcast to all processors. This would result in a serious communication problem.
A solution to these problems would be the implementation of a unique table with the particle coordinates, shared by all the processors. The neighbor lookup tables could then be constructed as pointers into this unique table. This, however, would result in a serious bottleneck due to the very frequent accesses to this unique table by all the processors. As we can see, the use of neighbor tables is not very suitable for an efficient parallel implementation.
4.2. Using coarse-grain

In this case, each grain could be allocated to a specific processor. Now all data needed for the calculation will be at the specific processor or at its surrounding neighbors. Coarse-graining seems to be much more suitable for a parallel architecture, due to lower inter-processor data communication, and presents a better defined structure regarding the spread of program and data over the processor network. Even in the case where we do not have a processor for each grain, clusters of grains can be treated by the same processor without changing the basic program and data structures.

Based on this analysis we can conclude that the coarse-grain algorithm is best suited for a parallel implementation, so we will take it as the base for our proposed implementation.
5. Proposed implementation

Our final goal is to find an algorithm suitable for implementation on a T800 transputer array. The T800 transputer is a single-chip architecture including a 32-bit CPU, a floating point arithmetic unit, 4 Kbyte of internal memory and 4 communication channels (links).
5.1. Geometrical division

A geometrical division is obtained when, at the same time, the same type of processing is being done by various processors on different data. "Geometrical division" is generally used to indicate that the division takes place in the data space. For our case we propose a geometrical division of the data over a number of processors forming a ring structure, as shown in fig. 2. A duplication of the data at the borders of the geometrical division (indicated by the dashed lines and corresponding arrows) was implemented to avoid the need for data access at neighbor processors during calculation. This duplication also makes the updates of neighbor data easier, since they can be performed smoothly as new data become available.
Fig. 2. Ring of processing elements and the corresponding grains (division of the particles' space).

Fig. 3. Proposed network of processors. Clusters A, B, C and D: geometrical division; processors 1 and 2 in each cluster: algorithmic division.
Notice that each processor needs, besides its own data, only data coming from its two nearest neighbors. This data exchange can be easily established through the transputer links. A big advantage of this implementation is that the proposed structure can easily be expanded by introducing more processors into the existing ring.
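A sketch of the border duplication in the ring follows. The actual implementation would use the transputer links directly; here, message-passing calls in the style of the later MPI standard stand in for the link communication, and BORDER_DOUBLES is an assumed strip size, so this is an illustration rather than the paper's code.

    #include <mpi.h>

    #define BORDER_DOUBLES 512   /* coordinates per border strip (assumed) */

    /* Each processor sends its two border strips to its ring neighbors
       and receives their strips in return, so that all data needed for
       the force calculation is locally available. */
    static void exchange_borders(double *left_out, double *right_out,
                                 double *left_in,  double *right_in)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank - 1 + size) % size;   /* ring neighbors */
        int right = (rank + 1) % size;

        /* send our right border to the right neighbor while receiving
           the left neighbor's right border as our left copy */
        MPI_Sendrecv(right_out, BORDER_DOUBLES, MPI_DOUBLE, right, 0,
                     left_in,   BORDER_DOUBLES, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* and symmetrically in the other direction */
        MPI_Sendrecv(left_out,  BORDER_DOUBLES, MPI_DOUBLE, left,  1,
                     right_in,  BORDER_DOUBLES, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

Because every processor talks only to its two ring neighbors, the communication per step stays constant as more processors are added.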
5.2. Algorithmic division

The amount of processing required in each processor in the above-proposed geometrical division is quite extensive, and it is impossible to include all the necessary code in the transputer's internal 4 Kbyte of memory. The use of external memory for program storage would be inefficient, because external memory access times are about three times higher than on-chip access times. To solve this problem an algorithmic division of the code is proposed. A careful analysis is required to obtain a good load balance between the transputers. It can be shown that a subdivision of the code over two processors is possible requiring only the internal transputer memory, and that the calculation load can in this case be balanced to about 50% for each transputer.
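The actual code split is given in ref. [3]; as one plausible reading (our assumption, not the paper's published division), the two transputers of a cluster could form a two-stage pipeline, one evaluating forces and the other integrating, with message passing again standing in for the links:

    #include <mpi.h>

    /* One time step inside a cluster: role 0 evaluates the forces,
       role 1 performs the Verlet integration, and the two exchange
       results over the link. Positions and forces are stored as n
       (x, y) pairs. This 50/50 pipeline is illustrative only. */
    static void cluster_step(int role, int peer,
                             double *pos, double *frc, int n)
    {
        if (role == 0) {                  /* force processor */
            /* ... evaluate frc[] from pos[] as in section 2 ... */
            MPI_Send(frc, 2 * n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(pos, 2 * n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {                          /* integration processor */
            MPI_Recv(frc, 2 * n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* ... Verlet update of pos[] as in section 2 ... */
            MPI_Send(pos, 2 * n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD);
        }
    }

Splitting the code rather than the data keeps each half small enough for the 4 Kbyte of internal memory while both processors stay busy.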
Table 2
Comparison of evaluated performance with real Cray simulations (for 1000 frames)

Processor architecture                       Time (in hours)
                                             N = 2064    N = 8064
Cray X-MP                                      0.031       0.367
proposed architecture with 8 transputers       0.077       0.230
proposed architecture with 16 transputers      0.052       0.129
6. The final proposed architecture

Several details regarding the final implementation are omitted in this description and can be found in ref. [3]. The final interconnection of the processors, as proposed in fig. 3, bears no similarity to traditional hardware architectures. But the evaluated performance of this architecture is quite impressive and, as shown in table 2, it compares well with, and can even surpass, the performance of today's supercomputers, with a much better price/performance ratio.
7. Conclusions

A detailed analysis of the implementation of algorithms is highly desirable if high processing speed is to be obtained.
Even in the case of a sequential computer architecture some optimization can often be implemented. The classical parallelization techniques are very useful but, to obtain good results, a combination of various techniques must be considered.
Acknowledgement

To Dr. Rajiv Kalia of Argonne National Laboratory for proposing and explaining to us the molecular dynamics problem, and for carrying out some simulations on a Cray X-MP.

References

[1] L. Verlet, Phys. Rev. 159 (1967) 98.
[2] F.F. Abraham, Adv. Phys. 35 (1986) 1.
[3] G. Travieso, Estudo de processamento paralelo para dinâmica molecular, Master thesis, Instituto de Física e Química de São Carlos, March 1989.