Computer Physics Communications 47 (1987) 173-180
North-Holland, Amsterdam
CELLULAR AUTOMATA MODELING ON IBM 3090/VF

S. SUCCI
IBM/ECSEC European Center for Scientific and Engineering Computing, Via del Giorgione 159, 00147 Rome, Italy

Received 8 June 1987; in final form 17 August 1987
The basic features concerning the implementation of the Hardy-Pomeau-De Pazzis (HPP) cellular automaton on a general purpose computer are illustrated, with explicit reference to the IBM 3090 vector processor. Special attention is paid to the choice of a proper data organization, allowing one to take full advantage of the high processing rates offered by a high-speed vector processor like the IBM 3090.
1. Introduction

The concept of cellular automata (CA) was introduced in the early fifties by von Neumann and Ulam [1] to study the behaviour and the organisation of complex systems. A cellular automaton is a discrete dynamical system consisting of a set of finite-state variables defined on a regular array of grid points (sites) uniformly distributed in space and evolving at discrete times. Owing to the full discreteness of their phase-space and time, cellular automata may also be viewed as a collection of "Boolean molecules" whose overall dynamics is specified by a finite rule by which at every time each site computes its future state from the current state of a few of its neighbours. Surprisingly enough, in spite of their discrete nature, cellular automata seem to be able to model continuum systems such as fluids or gases, whose description traditionally relies upon the continuum formalism of Partial Differential Equations (PDE). As opposed to PDE's, CA models enjoy a remarkable property: they can be solved exactly on a digital computer, because the systems they treat are intrinsically discrete and consequently do not require any approximation procedure. In fact, the dynamic variables they deal with are quantized and take values only in a very small set of integers (typically just zero or one). As a result, they can be represented exactly with a correspondingly small number of bits. On the contrary, the "conventional" modeling based on the discretization of PDE's deals with real variables which suffer from truncation errors when represented in the computer. In fact, the floating point representation of real numbers hierarchically favours bits in the most significant positions, and this can be a major cause of round-off and cancellation errors which sometimes drive the numerical solution out of any physical regime. It is therefore clear that CA, being free from these problems, provide quite an attractive potential alternative to PDE's in simulating physics [2,3]. Following Vichniac [4], we can say that cellular automata "provide a third alternative to the classical dichotomy between models which are solvable exactly (by analytical means) but are very stylized, and models that are more realistic but can be solved approximately (by numerical means)". Although, at least in our opinion, the capability of the CA approach of modelling physics still needs a definite assessment, the alternative they offer with respect to PDE's is very appealing and deserves full consideration. In particular, the success of a special class of "Lattice Gas Automata", currently referred to as FHP (Frisch-Hasslacher-Pomeau) automata [5], in reproducing a number of complicated hydrodynamical phenomena [6] is very encouraging and stimulating. This paper makes no contribution to this issue; instead it shows that on a general purpose computer like the IBM 3090 with Vector Facility, simulations based on the simplest cellular
automaton, namely the Hardy-Pomeau-De Pazzis [7] automaton, can be performed at a very high rate of processing. So, although CA modeling is best supported by special purpose machines [8], general purpose vector processors can also play a valuable role in the advancement of this field, since they provide a powerful and flexible computational framework in which new CA models can be developed and tested prior to the implementation of a special purpose machine.
2. Basic concepts

Let us consider a D-dimensional hypercube Λ (Lattice) consisting of N^D sites uniformly distributed in space. On each site we place a vector field φ_i (Lattice Field, LF hereafter) with the prescription that each component φ_i^s, s = 1, ..., S, i = 1, ..., N^D, takes values in the discrete set B = {0, 1}. Finally, suppose that the LF evolves discretely in time according to a first order deterministic rule:

    φ_i(t+1) = F(φ_j(t)),   (1)

where j runs over a neighbourhood of J sites around i. In other words, the value of the LF at time t + 1 is determined by the state of a small number of neighbours at time t; this defines a first order D-dimensional deterministic cellular automaton. As first shown by Wolfram [9], even with D and J as small as 1 or 2, the overall behaviour of the automaton affords remarkable examples of complexity, like the formation of self-similar spatial patterns, fractal structures and other features typical of self-organized systems.

Leaving aside the general definition given above, let us now specify the automaton we are interested in, namely the Hardy-De Pazzis-Pomeau (HPP) automaton. In the HPP model [7,10] one considers a square lattice and a set of N^2 grid points (sites) uniformly distributed along the directions X and Y. The automaton consists of a small set of simple prescriptions which govern the evolution of a large system of dynamical variables (particles) which constitute the "HPP gas". Each site can hold up to S = 4 particles endowed with a unit mass and a unit velocity pointing towards one of the four directions defined in the lattice. We conventionally label them as follows:

• UP (increasing values of Y)
• DOWN (decreasing values of Y)
• RIGHT (increasing values of X)
• LEFT (decreasing values of X)

The particles obey an exclusion principle which forbids the presence of two or more of them in the same site with the same velocity. An allowed set of positions and velocities on the lattice is called a configuration and can be put in a one-to-one correspondence with a 4N^2-dimensional vector field Φ:

    Φ = (U_i, L_i, D_i, R_i),   i = 1, ..., N^2.   (2)

Obviously, each component of Φ takes the value 0 or 1 according to whether the site i is occupied by a particle pointing in the corresponding direction or not. Thus (1111) means that the ith site is fully occupied, while (0000) means that it is empty (hole). Therefore, on each site we have 2^4 = 16 possible configurations, which are exhaustively and exactly represented with 4 bits.

The evolution law of the HPP automaton is very simple and consists of two steps:

1. Move step
2. Collision step

The Move step (M-step hereafter) is just a free translation in which each particle makes one step forward in the direction of its velocity. The Collision step (C-step hereafter) consists in turning at right angles the particles with opposite velocities (head-on collision), provided that the other directions are free. It is easy to verify that the only transitions allowed by the conservation laws (number of particles and linear momentum) and the aforementioned exclusion principle are the following two:

    0101 → 1010,
    1010 → 0101.
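A quick way to check that these are indeed the only allowed transitions is to enumerate the 16 site configurations together with their invariants. The following self-contained FORTRAN fragment (purely illustrative, not part of any code discussed below) prints, for each state (U, L, D, R), the particle number NP and the momentum (PX, PY); inspection of its output shows that 0101 and 1010 form the only pair of distinct states sharing all three invariants:

      PROGRAM INVAR
C     Enumerate the 16 site configurations (U,L,D,R) and print their
C     invariants.  Only 0101 and 1010 share NP = 2, PX = PY = 0,
C     whence the unique head-on collision of the HPP gas.
      INTEGER U, L, D, R, NP, PX, PY
      DO 10 U = 0, 1
      DO 10 L = 0, 1
      DO 10 D = 0, 1
      DO 10 R = 0, 1
         NP = U + L + D + R
         PX = R - L
         PY = U - D
         WRITE(*, '(4I2, 5X, 3I3)') U, L, D, R, NP, PX, PY
   10 CONTINUE
      END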
Note also that the collisions are "ultralocal", in that they take place between particles lying on the same site. With these simple prescriptions the microdynamics of the HPP automaton is entirely specified. They are in fact sufficient to allow a precise mathematical definition of two microdynamic operators (one for the move step, one for the collision step) whose product yields the Liouville operator governing the evolution of the dynamical system. The hydrodynamics of the HPP automaton can be derived from its microdynamics with the procedures of Statistical Mechanics described in ref. [7]. Referring the reader to the original papers of Hardy, Pomeau and De Pazzis, here we only mention that the "coarse grained" hydrodynamic field (density n and macroscopic flow V) is defined in terms of the "fine grained" lattice field by taking the appropriate ensemble average. In practice, by postulating a (yet unproven) ergodic theorem, one replaces the ensemble average with spatial averages taken over some macroscopic region Λ:
n’= ~
(U+L~+D,+R~)/lA1I
iEA1
v~=~
(R,-L1)/1A11
“
‘
(4)
i~A~
~
(U,—D1)/~Aj
,
~
i~A1
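To make the averaging procedure concrete, the following FORTRAN sketch accumulates the coarse-grained fields of eqs. (3) and (4) over a square region of side LA. It assumes, for simplicity of illustration, one site per array element (the packed representation is introduced in the next section), and all names are ours rather than taken from the code discussed later:

      SUBROUTINE COARSE(U, L, D, R, N, I0, J0, LA, DENS, VX, VY)
C     Coarse-grained density and flow, eqs. (3)-(4), averaged over the
C     LA x LA region with lower-left corner (I0, J0).  One site per
C     array element; hypothetical names, purely illustrative.
      INTEGER N, I0, J0, LA, I, J
      INTEGER U(N,N), L(N,N), D(N,N), R(N,N)
      REAL DENS, VX, VY
      DENS = 0.0
      VX = 0.0
      VY = 0.0
      DO 20 J = J0, J0 + LA - 1
         DO 10 I = I0, I0 + LA - 1
            DENS = DENS + (U(I,J) + L(I,J) + D(I,J) + R(I,J))
            VX = VX + (R(I,J) - L(I,J))
            VY = VY + (U(I,J) - D(I,J))
   10    CONTINUE
   20 CONTINUE
      DENS = DENS / (LA * LA)
      VX = VX / (LA * LA)
      VY = VY / (LA * LA)
      END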
As usual, the averaging procedure annihilates all the shorter wavelengths of the system, in such a way that the hydrodynamic field can only exhibit smooth variations over space scales longer than those associated with the averaging regions. That is why the behaviour of this field can obey fluid-like equations like the continuity and the Navier-Stokes equations. In spite of its simplicity, the HPP automaton possesses several interesting properties, above all the existence of thermodynamical equilibria characterized by three free parameters, namely the total density and momentum, which are global invariants of the motion. As pointed out by Frisch et al. [5], a deeper insight into these properties shows that the HPP automaton also suffers from some drawbacks, primarily a lack of isotropy of the pressure tensor which prevents a complete equivalence with the Navier-Stokes equation. However, as pointed out in the introduction, the main concern of the present work is not to address physical issues, but rather to investigate the technical feasibility of the CA approach on a general purpose computer. To this end, we restrict our attention to the simple HPP automaton.
3. Implementing the HPP model on a vector processor

In this section we describe the main lines of the coding technique adopted to implement the HPP model on a general purpose vector processor, namely the IBM 3090 with Vector Facility (VF). All performance data refer to a single CPU/VF of a 3090/200/VF. To date, most of the numerical simulations with cellular automata have been performed on special purpose machines [12], explicitly designed to best exploit the main characteristics of cellular automata, namely the locality of the interactions and their intrinsic parallelism. However, general purpose vector processors can be used effectively, provided one is willing to design algorithms which take full advantage of their high-speed and large-memory capabilities. This is precisely the point addressed by the present investigation.

To run CA successfully on a general purpose computer one must devise strategies which process as many sites as possible on the same footing in a structured and organized way, i.e. one uses the computer basically as an array processor. This is precisely the idea behind the Multispin coding technique (MSC) [11], first applied in the domain of Lattice Quantum Chromodynamics and currently used in the simulation of various Ising-like physical systems. In MSC one starts from the consideration that it is a waste of memory to use a whole computer word, namely W bits (W = 64, 32), to store variables like spins which by definition can take only the values zero or one and consequently need just a single bit to be exactly represented! Instead, one can pack a whole array of W spins in a single word, thus allocating just one bit/spin. This automatically yields a factor 1/W
memory saving or, reciprocally, a factor W speed-up, provided that all the bits of the word can be operated upon simultaneously. Thus, a fully Multispin code can treat a W times larger problem than the corresponding "conventional" version at the same CPU cost. It follows that the MSC idea applies also to cellular automata, provided that the spins are replaced by the quantized velocities (although, as pointed out in ref. [4], when put on quantitative grounds this correspondence is much less straightforward than one would qualitatively expect). However, unlike spin systems, which are usually in thermodynamic equilibrium, CA describe evolutionary problems and consequently require the displacement of the single variables along the Lattice in the course of time. This implies that in order to construct an efficient CA algorithm, one has first to make some judicious choice on the structure of the data. In this respect it is straightforward to see that the collision step and the move step call for different data organizations.

The optimal way of handling the M-step consists in choosing the Multispin direction orthogonal to the direction of the free motion. In fact, by doing so one automatically gets rid of the internal boundaries otherwise generated by the finiteness of the word length W. More specifically, this means choosing the data organization in such a way that the arrays which move along the X direction are packed along the Y dimension and vice versa. Thus, we are led to the following data organization:

    Move Up/Down:    U(IY, IX), D(IY, IX),   1 ≤ IY ≤ N,   1 ≤ IX ≤ N/32,   (5)

    Move Left/Right: L(JX, JY), R(JX, JY),   1 ≤ JX ≤ N,   1 ≤ JY ≤ N/32.   (6)

With such a structure the MOVE step can be realized with a simple pair of vectorizable DO loops which are optimal in the sense that in both cases they run along the largest dimension of the arrays and over contiguous memory locations. Unfortunately, such a data organization conflicts with an efficient execution of the C-step. This is immediately realized if one recalls that in the HPP automaton collisions take place on single sites. This means that any efficient execution of the C-step rests on a data representation such that the four velocities attached to a given site are assigned to homologous bits in the corresponding computer words. It is simple to realize that the data structure outlined above can match this requirement only if a transposition of the bit matrices associated either with the arrays U and D or with the arrays L and R is performed. Since this operation is too expensive, we must abandon the data configuration described above in favour of some intermediate solution which works more efficiently for the C-step. In this respect, we believe that a good choice is to keep four distinct arrays U, L, D, R whose rows and columns correspond to lines X = const. and Y = const. in the two-dimensional Lattice. More explicitly, U, L, D, R are all rank-two arrays with dimension N along X and N/32 along Y:

    U(JX, JY), L(JX, JY), D(JX, JY), R(JX, JY),   1 ≤ JX ≤ N,   1 ≤ JY ≤ N/32.   (7)

With such a choice the C-step can be performed very efficiently with a 100% degree of "multispin" as follows. For each word we define a collision mask (just an INTEGER*4) whose bits are ON if the associated sites correspond to a collisional configuration and zero otherwise. The construction of this mask is straightforwardly achieved by means of the elementary Boolean functions AND (∧), OR (∨), XOR (⊕) and NOT (¬), available as FORTRAN bit functions [14]. One proceeds as follows:

    For I = 1, N·N/32:
        CMASK_I = ((L ∧ R) ∧ (¬U ∧ ¬D))_I ∨ ((U ∧ D) ∧ (¬L ∧ ¬R))_I,   (8)
        (U, L, D, R)_I ← (U, L, D, R)_I ⊕ CMASK_I.   (9)
    End

Actually, one can do better and use the following collision mask:

    CMASK = (U ⊕ L) ∧ (L ⊕ D) ∧ (D ⊕ R),   (10)

which, while being logically equivalent to the previous one, only needs five Boolean operations instead of eleven.
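For concreteness, a FORTRAN rendering of the whole C-step built on mask (10) might read as follows. This is a sketch under naming conventions of ours (the actual AUTOMA source is not reproduced here), with the four populations treated as flat vectors of NW = N·N/32 packed words and IAND, IEOR denoting the AND and XOR bit functions of ref. [14]:

      SUBROUTINE COLLID(U, L, D, R, NW)
C     One collision step over NW = N*N/32 packed words, using the
C     five-operation mask of eq. (10).  XORing all four populations
C     with the mask turns 0101 into 1010 and vice versa, i.e. it
C     rotates every head-on pair by 90 degrees, 32 sites at a time.
      INTEGER NW, I, CMASK
      INTEGER U(NW), L(NW), D(NW), R(NW)
      DO 10 I = 1, NW
         CMASK = IAND(IEOR(U(I), L(I)),
     &           IAND(IEOR(L(I), D(I)), IEOR(D(I), R(I))))
         U(I) = IEOR(U(I), CMASK)
         L(I) = IEOR(L(I), CMASK)
         D(I) = IEOR(D(I), CMASK)
         R(I) = IEOR(R(I), CMASK)
   10 CONTINUE
      END

Nine word-wide operations (five for the mask, four for the updates) thus process 32 sites, a fully "multispin" collision step.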
Coming now to the MOVE step, we must distinguish between horizontal (L, R) and vertical (U, D) displacements. The former are executed quite simply by just copying the old state into the new one with the spatial index increased/decreased by one according to whether we are moving right or left (of course, the last element in a row must go into the first one for periodic boundary conditions). Again, each copy operation on a single integer updates 32 velocities simultaneously and we have a fully "multispin" phase.

Unfortunately, as anticipated, this is not the case for the vertical displacements. In this case, the relevant operation to be performed is a shift of all the 32 bits belonging to a given subscript of the arrays U and D. This shift operation can be performed by means of the ISHFT FORTRAN function, which has the following syntax: KS = ISHFT(K, m) means that all the bits of the integer K are shifted by m positions (-31 ≤ m ≤ 31). Accordingly, the "MOVE UP" of a packed word reads:

    U(IX, IY, t+1) ← ISHFT(U(IX, IY, t), 1).   (11)

This operation is efficient, but not entirely Multispin, because the last bit (the rightmost in the present case), rather than becoming the first bit of the next subscript U(IX, IY + 1), is simply lost: thus the Multispin efficiency is 31/32. Unfortunately, it turns out that handling the remaining bit is rather involved and expensive. To date, the best procedure we have found is the following:

1. Pick up the boundary bit (m = 0) of U(IX, IY) with a mask.
2. Shift it to the leftmost position of the mask (m = 31).
3. Shift the subscript U(IX, IY).
4. Put the 31st bit of the mask into the 31st bit of U(IX, IY + 1).

We see that this procedure requires three very inefficient (in terms of multispin) operations, namely 1, 2 and 4, which are all required to treat a single bit, with a resulting global efficiency of 32/4 bits/operation, i.e., 25% multispin efficiency. Note, however, that it could be simplified by performing the shift operation in scalar mode. In fact, the 370 architecture offers a SHIFT instruction in which the operands are treated as 64-bit signed binary integers residing in an even-odd pair of general registers [13]. In this way, when performing, say, the "MOVE RIGHT", the rightmost bit of the even register, instead of being lost, becomes the leftmost bit of the odd register and could therefore be used without defining any auxiliary mask. However, since such an opportunity can only be exploited in scalar mode, we concluded that the four-step procedure outlined above, being completely vectorizable, is to be preferred even if it requires more operations.

In conclusion, our algorithm can be logically split into three sections:

1. Collision
2. Move L/R
3. Move U/D

whose Multispin efficiency is 1, 1 and 0.25, respectively. As already mentioned, all of them are well vectorized.
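As an illustration of the trickiest of the three sections, Move U/D, here is a vectorizable sketch of "MOVE UP" under conventions of ours: bit k of word IY is taken to represent the Y-site 32(IY - 1) + k + 1, so that moving up is a one-bit shift whose expelled top bit re-enters as the bottom bit of the word above (periodic boundaries); all names are hypothetical:

      SUBROUTINE MOVEUP(UOLD, UNEW, N, NWY)
C     Vertical move of the U population: every particle advances one
C     site in +Y.  With 32 Y-sites packed per word this is a one-bit
C     shift whose expelled bit -- the "boundary bit" of the text --
C     becomes the bottom bit of the next word.  NWY = N/32.
      INTEGER N, NWY, IX, IY, IYM
      INTEGER UOLD(N, NWY), UNEW(N, NWY)
      DO 20 IY = 1, NWY
C        Word below the current one (periodic wrap in Y).
         IYM = IY - 1
         IF (IY .EQ. 1) IYM = NWY
         DO 10 IX = 1, N
C           Shift up by one site and OR in the carry bit from below.
            UNEW(IX, IY) = IOR(ISHFT(UOLD(IX, IY), 1),
     &                         IAND(ISHFT(UOLD(IX, IYM), -31), 1))
   10    CONTINUE
   20 CONTINUE
      END

Written this way the boundary bit is gathered rather than scattered, which keeps the inner loop free of dependences; the operation count per word is the same as in the four-step procedure above.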
4. Performance data

In order to test the performance of the HPP automaton on the IBM 3090 with Vector Facility (VF), we have written a code (AUTOMA) which implements the ideas described in the previous section. AUTOMA is written in FORTRAN, although an effort has been made to check whether the VF assembler might have outperformed the compiler in the strategic sections of the code. This was not the case, as will be discussed below. From the point of view of programming, the evolution of the automaton consists of three logical steps:

1. COLLIDE
2. MOVE
3. ADVANCE
Since the HPP automaton is of first order, we
need to work with two time levels, which we conventionally label as OLD for time t and NEW for time t + 1. Since the collisions are instantaneous, they work on the same time level:

    OLD ← COLLIDE(OLD).

On the contrary, the MOVE step evolves the state at time t to a new state at time t + 1 and thus works with both OLD and NEW arrays:

    NEW ← MOVE(OLD).

Finally, ADVANCE refreshes the values of OLD for the execution of the next step:

    OLD(t+1) ← NEW(t).

The cost of these three steps, in terms of CPU s/step on a single processor, is reported below for a 1024 × 1024 grid:

              Scalar    Vector    (s/step)
    COLLIDE   0.025     0.014
    MOVE      0.046     0.023
    ADVANCE   0.021     0.009

As expected, we see that the MOVE step is the most expensive one. Also, we note a vector/scalar speed-up factor of about two. The considerable amount of time required by the ADVANCE step was somewhat surprising. The implication is that with such a fast computational kernel (Collide + Move), one is not entitled to neglect the price of just loading data from the memory and storing them back! This situation is somewhat peculiar to CA modeling, and contrasts with what is commonly experienced in PDE discretization, where the cost of data transfer is negligible compared to that of arithmetic elaboration. However, it is easy to recognize that one can lump the MOVE and ADVANCE steps and perform them "in loco" (as a physical system would do!) within the same DO loop, thus saving 8 Vector Load operations (4 OLD and 4 NEW arrays). This is a delicate operation, in the sense that one must make sure that only those data which will not be used any longer in the course of the same time step undergo the ADVANCE operation. Again, one has to stick to a simple and yet efficient rule, namely: scan the elements in the loop counterwise with respect to the direction of the movement.

One can easily convince oneself that by doing so, once an element (we mean a subscript) has been "MOVED" it can also be "ADVANCED". Boundary elements require extra attention. To date, the best technique we have found consists in saving the boundary bit into a temporary location ("ghost" element) prior to the MOVE operation and copying the content of this ghost element back at the completion of the overall MOVE + ADVANCE operation. For the sake of clarity, we report here below a symbolic scheme for the simplest case of "MOVE RIGHT + ADVANCE" (a FORTRAN rendering is sketched after the scheme):

    Save ROLD(N) into GHOST
    For I = N-1, N-2, ..., 1
        MOVE:    RNEW(I+1) = ROLD(I)
        ADVANCE: ROLD(I+1) = RNEW(I+1)
    End
    Boundary: RNEW(1) = GHOST
              ROLD(1) = RNEW(1)
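In FORTRAN the scheme might look as follows, for a single row of packed words; a minimal sketch with hypothetical names, in which the horizontal move is a pure word copy (all bit-level details being confined to the vertical moves):

      SUBROUTINE MOVRAD(ROLD, RNEW, N)
C     Merged "MOVE RIGHT + ADVANCE" for one row of N packed words,
C     following the symbolic scheme above.  Scanning I downwards
C     guarantees that an element is refreshed (ADVANCEd) only after
C     its old value has been consumed (MOVEd).
      INTEGER N, I, GHOST
      INTEGER ROLD(N), RNEW(N)
C     Save the boundary element before it is overwritten.
      GHOST = ROLD(N)
      DO 10 I = N - 1, 1, -1
         RNEW(I + 1) = ROLD(I)
         ROLD(I + 1) = RNEW(I + 1)
   10 CONTINUE
C     The saved element wraps around to the first position.
      RNEW(1) = GHOST
      ROLD(1) = RNEW(1)
      END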
By merging the MOVE and ADVANCE steps and implementing a few other minor tricks inspired by the same criterion of doing as many operations as possible within the same DO loop, we finally ended up with the results reported in table 1. Thus, we see that the total cost of the computational kernel is about 0.03 s for a 1024 × 1024 grid or, in current terminology, about 30 Msites/s. Practically speaking, this means that one can run a 10000-step dynamical evolution in something like 5-6 min of CPU time.

To date, we have not addressed physical questions, but just concentrated on technical aspects. Nonetheless, a few physical features, like the propagation of sound waves (figs. 1, 2) in the HPP "gas" and the formation of macroscopic flow patterns behind a plate (fig. 3), have been observed in the numerical simulations. The implementation of the obstacle (the plate) is straightforwardly achieved by processing the corresponding bits with a collision mask which reflects back the incoming particles.
Table 1

    N²             COLLIDE (s/step)    MOVE + ADVANCE (s/step)
    256 × 256      9.5 × 10⁻⁴          1.4 × 10⁻³
    512 × 512      3.5 × 10⁻³          5.3 × 10⁻³
    1024 × 1024    1.4 × 10⁻²          2.1 × 10⁻²
The bits associated with the plate are processed by both collision masks in order not to break the DO loop along the X direction. Of course, such a choice depends on the position of the plate within the Lattice. Future work will focus on the physical validation of these quantitative features.

Fig. 2. The formation of the HPP sound wave after 100 steps.

As anticipated, we started this work with the bold intent of possibly doing better than the FORTRAN compiler in generating an efficient set of vector assembler instructions. However, it turned out that our boldness was "frustrated", in the sense that by examining the assembler code generated by the FORTRAN compiler one hardly sees how to outperform it. In fact, the only intervention we could imagine was to remove a number of scalar "LOAD ADDRESS" (LA) instructions from the vector DO loops and place them beforehand by using a few registers rather than just one, as done by the compiler. However, given the fact that the cost of the LA instruction is small compared to that of the vector instructions, we concluded that the vectorizing compiler does an excellent job and there is no point in modifying its operation.
Fig. 1. The HPP automaton at t = 0: the central spot contains four particles/site, while the rest of the gas is at density 0.5.

Fig. 3. A macroscopic flow behind a plate.
We found nonetheless some room for a VF assembler intervention. This was required in the diagnostic part of the code, where one needs to count bits which are ON in order to make check tests or compute average quantities. In FORTRAN, the most natural way of doing this is to use the bit function BTEST, which tests the single bits within a word. According to the general philosophy outlined in section 3, this is exactly the type of operation one should try to avoid! In fact, we verified that such an operation is very inefficient and can even overwhelm the CPU cost of the computational core if it happens to be performed too frequently. By programming our own bit counting routine, which makes use of the appropriate Vector Facility assembler instructions [15], this problem has been overcome.
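Even without assembler, one can count bits a word at a time rather than a bit at a time. The following portable FORTRAN function (our own sketch of the classical parallel-count algorithm, not the VF routine actually used) returns the number of ON bits of a 32-bit word with a handful of word-wide operations instead of 32 single-bit tests:

      INTEGER FUNCTION NBITS(IW)
C     Parallel bit count of a 32-bit word: sum adjacent bit pairs,
C     then nibbles, then bytes.  The masks are the decimal forms of
C     hex 55555555, 33333333 and 0F0F0F0F.
      INTEGER IW, K, M1, M2, M4
      PARAMETER (M1 = 1431655765, M2 = 858993459, M4 = 252645135)
      K = IW - IAND(ISHFT(IW, -1), M1)
      K = IAND(K, M2) + IAND(ISHFT(K, -2), M2)
      K = IAND(K + ISHFT(K, -4), M4)
      K = K + ISHFT(K, -8)
      K = K + ISHFT(K, -16)
      NBITS = IAND(K, 63)
      END

Applied to a packed word, such a routine yields the number of particles carried by 32 sites at once, so that density sums like those of eq. (3) can be accumulated with a Multispin efficiency close to one.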
5. Conclusion

We have discussed the basic features concerning the implementation of the HPP model on the IBM 3090/VF, with special regard to the choice of an optimal structure of the data in order for the code to run as fast as possible. Our study shows that processing rates up to 30 Msites/s can be achieved, thus proving that on a high-speed, large-memory general purpose computer like the IBM 3090/VF Cellular Automata constitute a very appealing computational tool. As anticipated in the Introduction, we do not claim there is any point in contesting the superiority of special purpose machines, in terms of computing power versus cost, in the modelling with cellular automata. On the contrary, with the present paper we would like to emphasize the complementary role that general purpose vector processors can play in supporting the development of this promising research field.
Acknowledgements

I am heavily indebted to Dr. P. Carnevali, who patiently taught me all that I presently know about the VF Assembler language, and for a number of valuable hints he gave me all along this work. Dr. P. Santangelo is also kindly acknowledged for providing me with his nice graphic routines. Finally, I wish to acknowledge Dr. P.
Sguazzero for proposing this interesting research subject.

Notes added in proof

1) As kindly pointed out to the author by Dr. P. Di Chio, the first two steps of the procedure required to handle the boundary bits can be performed by means of a single shift operation:

    MASK = ISHFT(U(IX, IY), 31).

This is due to the fact that the bits brought into the vector register by the shift operation are all set to zero. However, this has a fairly negligible effect on the overall performance of the code.

2) It should be appreciated that the procedure introduced to merge the MOVE and the ADVANCE steps relaxes the need for two distinct sets of arrays for the times t (OLD) and t + 1 (NEW), thereby yielding a net factor-of-two saving in the memory requirements.

References

[1] J. von Neumann, Theory of Self-Reproducing Automata (Univ. of Illinois Press, 1966).
[2] T. Toffoli, Physica 10D (1984) 127.
[3] T. Toffoli, N. Margolus and G. Vichniac, Phys. Rev. Lett. 27 (1985) 863.
[4] G. Vichniac, Physica 10D (1984) 96.
[5] U. Frisch, B. Hasslacher and Y. Pomeau, Phys. Rev. Lett. 56 (1986) 1505.
[6] D. d'Humieres, P. Lallemand and U. Frisch, Europhys. Lett. 2 (1986) 291.
[7] J. Hardy, O. de Pazzis and Y. Pomeau, Phys. Rev. A 13 (1976) 1949.
[8] T. Toffoli, Physica 10D (1984) 195.
[9] S. Wolfram, Nature 311 (1984) 419.
[10] J. Hardy, O. de Pazzis and Y. Pomeau, J. Math. Phys. 14 (1973) 470.
[11] M. Creutz, L. Jacobs and C. Rebbi, Phys. Rev. Lett. 42 (1979) 1390.
[12] A. Cloqueur and D. d'Humieres, RAP1, a Cellular Automaton Machine for Fluid Dynamics, in: Proc. of the Workshop on Modern Approaches to Large Nonlinear Systems, Santa Fe (1986).
[13] IBM System/370 Extended Architecture: Principles of Operation, SA22-7085-0 (1986).
[14] VS FORTRAN Version 2: Language and Library Reference, SC26-4221-1 (1986).
[15] IBM System/370: Vector Operations, SA22-7125-0 (1986).