Computers chem. Engng, Vol. 19, No. 6/7, pp. 719-742, 1995. Copyright © 1995 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0098-1354/95 $9.50 + 0.00

Pergamon
COMPARISON OF MIMD AND SIMD STRATEGIES FOR MONTE CARLO MODELLING OF KINETICALLY COUPLED REACTIONS

S. M. STARK, M. NEUROCK and M. T. KLEIN

Center for Catalytic Science and Technology, Department of Chemical Engineering, University of Delaware, Newark, DE 19716, U.S.A.

(Received 5 August 1993; accepted in revised form 22 February 1994; received for publication 24 June 1994)
Abstract-The atomic-level description of a complex heavy hydrocarbon feedstock and its complete temporal product behavior is a challenging and computationally intensive problem which requires robust reaction simulations to accurately account for kinetic interactions and efficient utilization of available computer hardware for solution. Herein we discuss the parallel implementation of different Monte Carlo reaction algorithms for two prototype kinetic problems and examine their efficiency on MIMD and SIMD parallel architectures. The concepts developed are subsequently applied to a complex resid pyrolysis model. Both fixed- and variable-time step Monte Carlo approaches to the stochastic simulation algorithm were investigated. The prototype kinetic problems consisted of A_i ⇌ B_i, with Langmuir-Hinshelwood-Hougen-Watson kinetics for i = 1-3, and a hypothetical 1-5 ring aromatic hydrogenation simulation. The two parallel architectures utilized were BBN's TC2000 (MIMD) and Thinking Machines' CM-2 (SIMD). A new modelling approach that stochastically samples the reaction environment to provide an estimate of species' interactions is introduced. This "partial system" approach exhibited excellent parallel efficiency (>90%) on the TC2000. The simple prototype A_i ⇌ B_i kinetic problem was also successfully implemented on the SIMD CM-2. While the 1-5 ring hydrogenation problem efficiently took advantage of the MIMD architecture, its performance on the CM-2 was quite poor due to the lack of a match between the data focus of complex model simulations and the CM-2 data parallel paradigm. The resid pyrolysis model performed extremely well on the MIMD machine.
INTRODUCTION

The decade of the 1980s brought many remarkable developments in science and policy that affect modern petroleum refineries dramatically. Scientific advances in spectroscopy, for example, now permit the direct or at least indirect measurement of the molecular structures in even the most complex hydrocarbon feedstocks. Refinery-related developments in public policy have paralleled this scientific progress. The Clean Air Act Amendments of 1990, for example, make these spectroscopic capabilities a nearly required part of refinery-related research, management, and optimization. For example, meeting the specifications of reformulated gasoline in an optimal, competitive refinery will be assisted by measuring, tracking, and managing the flow of aromatics in refinery streams. These forces help motivate the need for advanced, molecule-based modeling capabilities for the modern petroleum refinery. Molecule-based modeling offers the compelling opportunity for the most rigorous development of structure/property relationships. Reactivity is an especially significant property that can be discerned given a molecule's (and its reaction environment's) structure. Other properties fall into performance and environmental classes. Performance properties include octane number, smoke point, cetane number, and the like, whereas environmental properties include RVP, aromatics content, etc. It is perhaps obvious that the molecular composition of the mixture is the best starting point for the estimation of its properties. Less obvious, however, is that the development of the parallel computing paradigm and associated hardware has provided what may be unique access to the GFlop CPU performance that will be required for integrated refinery models. This CPU demand stems from the staggering complexity of petroleum hydrocarbon mixtures. Modern analytical methods indicate the existence of at least O(10^4) different molecules in these mixtures; heavy-end feeds contain oligomeric species of unique molecular connectivities. The sheer size of the thus-implied modeling problem engenders a conflict between the need for molecular detail and the formulation and solution of the model. Traditional deterministic models would comprise an
extremely large O(10^4) number of species and therefore differential equations. Thus the formulation, let alone the solution, of the model would be formidable. For very complex problems, in which the enormous molecular detail of the mixture is to be maintained, much of this difficulty can be avoided by formulating and solving the model using a statistical approach. In particular, Monte Carlo simulation techniques can be used in kinetic models to provide molecular species' yields vs reaction time. In previous communications (Neurock et al., 1992a; Petti et al., 1993), we discussed the application of Monte Carlo techniques for modelling the upgrading chemistry of complex heavy hydrocarbon feedstocks. The kinetics were described in terms of pseudo-first order rate constants for an extensive set of unimolecular paths. Bimolecular interactions, i.e. kinetic coupling, were approximated in terms of the Markovian framework of single-event reaction trajectories. Recently we extended the approach to handle intermolecular interactions accurately by allowing all N species to define the Markovian system state (Neurock et al., 1992b). In principle the approach can handle any kinetically coupled system; in practice, however, the severe random access memory requirements prohibit the description of complex feedstocks. The overall goal of this work is the development and implementation of a robust kinetic simulation technique which can (1) accurately account for intermolecular interactions, (2) retain complete atomic detail, which is crucial in assessing molecular composition and product properties, and (3) take full advantage of parallel computing architectures. We first present basic Monte Carlo concepts (techniques), and then analyze in detail the corresponding computational algorithms used to implement kinetic interactions for simple model reaction systems. We focus on the accuracy of each algorithm in describing kinetic coupling and evaluate their potential to take advantage of both SIMD (single instruction stream, multiple data sets) and MIMD (multiple instruction streams, multiple data sets) parallel computer architectures. Lastly, we employ the most flexible technique to describe the pyrolysis of an actual resid feedstock.

1. MONTE CARLO CONCEPTS
The Monte Carlo approach to molecular modelling has been described elsewhere and includes two key steps: (1) the transformation of basic analytical chemistry into a model of the molecules in a complex feedstock; and (2) the use of stochastic kinetics to predict the reaction trajectory. The first step
involves the construction of a molecular representation of the complex feedstock. This is accomplished via a select set of analytical characterization techniques (e.g. NMR, GC/MS, elemental analysis, etc.) which provide relevant structural data within reasonable characterization times. These techniques provide a summary of the structural attributes of the molecules of the complex feedstock in terms of cumulative probability distribution functions (PDFs) for each attribute. A stochastic construction technique via Monte Carlo methods is used to assemble an ensemble of different molecules from these PDFs. Each of the PDFs is sampled 10,000+ times to construct 10,000 different reactant molecules that represent the generic complex feedstock. Having the structure of each molecule allows for modeling its reaction path. The Monte Carlo reaction method chronicles the fate of each molecule in the feed in terms of a Markov chain reaction trajectory, such as that shown in Fig. 1 for the simple irreversible reaction of A to B.

Fig. 1. An example of Monte Carlo simulation of the irreversible reaction of A to B. The concentration of species A at a given time step is the average of the Markov chains at the time step.

This is accomplished by comparing a random number (RN) with a transition (reaction) probability. For example, the exact form of the probability that a molecule in state A will undergo a transition to state B in a time interval Δt for a first-order irreversible reaction is given in equation (1):

P_ij = 1 - exp(-k_ij Δt).   (1)
If RN < P_ij, the molecule being tested for reaction undergoes a transition from state i to state j. If RN > P_ij, the molecule does not react. Thus, simulation events are discrete, but averaging the results of many (10^3-10^4) Markov chains provides the predicted value of the feed composition vs time. A key advantage of the Monte Carlo reaction algorithm is that it appears intuitively parallel. For the example of the irreversible reaction of A to B shown in Fig. 1, each Markov chain simulates the
independent reaction process of a single molecule. Because of the independent nature of the Markov chains, the number of chains that can be simulated concurrently is limited only by the parallel architecture on which the simulation is performed. The Markov chains of Fig. 1 are only independent when the reaction kinetics are first order. Under these circumstances, nearly ideal speedup can be achieved: the single-processor run time M is reduced to M/N, where N is the number of processors used. The challenge in applying the Monte Carlo method is that reactions of components in mixtures are not, in general, independent. A more appropriate view is that kinetic coupling causes the rate of component i to depend on the presence of component j. The implication for the Monte Carlo simulation is that the Markov chains are no longer independent; that is, chains representing reaction of species i depend on chains representing species j. This interaction can reduce the amount of parallelism in the simulation and complicate the implementation of the parallel algorithm. In this paper we present methods that begin to resolve this conflict. The presentation is in the form of examples of parallel implementations of simple, kinetically coupled reaction systems. We first discuss a representative set of model systems with coupled kinetics. Parallel implementations of the algorithms are considered next. This is followed by performance results from simulations run on the University of Delaware's BBN TC2000 and the Pittsburgh Supercomputing Center's CM-2.

2. MODEL KINETICS

A potential obstacle for the parallel implementation of Monte Carlo reaction models is that the kinetics will, in general, be coupled. Kinetically coupled reaction systems are those in which the rate of species i is influenced by the presence of species j. In a more specific Monte Carlo sense, kinetically coupled systems are those in which the transition probability is a function of the reaction time as a result of the changing mixture composition. Traditional examples of rate laws encompassing kinetic coupling include Langmuir-Hinshelwood-Hougen-Watson (LHHW) and Rice-Herzfeld kinetic models. Herein we explore LHHW types of systems. LHHW rate laws provide an example relevant to catalytic reactions. For the parallel surface reactions given in equations (2)-(4),

A_1 ⇌ B_1   (2)
A_2 ⇌ B_2   (3)
A_3 ⇌ B_3   (4)

the rate law for the component A_i is given in equation (5):

r_Ai = k_i K_Ai C_Ai / (1 + Σ_j (K_Aj C_Aj + K_Bj C_Bj)).   (5)

The rate expression is a function of the concentration of all system species.

A larger model system is the hydrogenation of a feed of aromatics with one through five rings. The properties of this example aromatic feed in terms of the number and types of aromatic rings are given by the two discrete probability distribution functions (PDFs) of Fig. 2. The quantitative values of these model distributions are arbitrary. The top distribution in Fig. 2 gives the probability of a given ring size. One- and two-ring aromatics have only one configuration. For the case of three-ring aromatics, the two possible types of rings were equiprobable. Thus, roughly 46% of the aromatics were species with three rings, and the breakdown of the three-ring species was 50% anthracene and 50% phenanthrene. Of the multitude of five-ring aromatics, only the indicated ring structure was allowed. The breakdown among the various four-ring aromatics was determined by the bottom distribution of Fig. 2. When the ring size distribution indicated that a four-ring should be constructed, the second distribution was consulted to determine which of the six possible structures should be assembled. The algorithm for the construction of a molecule was:

(1) Draw a uniformly distributed random number R1 in the range [0-1].
(2) Locate R1 on the ordinate of the cumulative probability distribution (CPD) corresponding to the ring size distribution (i.e. Σ_i RingSize_i) of Fig. 2. Move horizontally to locate the intersection point on the CPD. The abscissa value of the intersection is the ring size for the molecule.
(3) If the ring size is equal to three, draw a second random number R2 to determine which three-ring species to build. If R2 is less than 0.5, the species is anthracene; otherwise it is phenanthrene.
(4) If the ring size is equal to four, draw a second random number and use the CPD corresponding to the four-ring type distribution to determine the species identity.

A more detailed discussion of the stochastic construction method in the context of a petroleum resid model is available (Neurock, 1992; Trauth et al., 1993).

Stochastic simulation of the hydrogenation model thus consisted of the construction of an initial feed
by repetitive sampling of the two probability distribution functions, followed by reaction simulation. The hydrogenation chemistry was modeled with LHHW rate expressions and the intrinsic reactions depicted in Fig. 3. The reaction simulation distinguished four hydrogenation pathways using the number of peripheral carbons as a reactivity index. Using the chemistry depicted in Fig. 3, the reaction simulation proceeded by checking each aromatic compound for reaction at one of the four types of reactive sites. Figure 4 illustrates the anthracene pathways and typical reaction profiles.
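The construction algorithm (steps 1-4) can be sketched as follows. This is a minimal Python illustration, not the paper's implementation; the probability values are hypothetical stand-ins for the Fig. 2 distributions (only the ~46% three-ring fraction is taken from the text):

```python
import bisect
import random

# Hypothetical ring-size and four-ring-type probabilities (the quantitative
# values of the paper's Fig. 2 distributions are not reproduced here).
RING_SIZE_PDF = {1: 0.10, 2: 0.20, 3: 0.46, 4: 0.14, 5: 0.10}
FOUR_RING_TYPE_PDF = {1: 0.30, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.06, 6: 0.04}

def sample_cpd(pdf, rng):
    """Invert the cumulative distribution: find the first abscissa whose
    cumulative probability exceeds a uniform random ordinate (step 2)."""
    keys = sorted(pdf)
    cumulative, total = [], 0.0
    for k in keys:
        total += pdf[k]
        cumulative.append(total)
    return keys[bisect.bisect_left(cumulative, rng.random())]

def build_molecule(rng):
    """Steps (1)-(4) of the construction algorithm."""
    rings = sample_cpd(RING_SIZE_PDF, rng)
    if rings == 3:
        # Step (3): anthracene and phenanthrene are equiprobable.
        species = "anthracene" if rng.random() < 0.5 else "phenanthrene"
    elif rings == 4:
        # Step (4): consult the four-ring type distribution.
        species = f"four-ring type {sample_cpd(FOUR_RING_TYPE_PDF, rng)}"
    else:
        species = f"{rings}-ring aromatic"
    return rings, species

rng = random.Random(42)
feed = [build_molecule(rng) for _ in range(10000)]
frac_three_ring = sum(1 for rings, _ in feed if rings == 3) / len(feed)
print(f"three-ring fraction ~ {frac_three_ring:.2f}")
```

Repeated sampling of the two distributions assembles the 10,000-molecule representative feed described in Section 1.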
The motivation for this computational model was its degree of realism with respect to full-scale, complex-system models in terms of both code structure and simulation run-time complexity. The scaled-down model problem was used to facilitate the development of multiple implementations. In general, Monte Carlo techniques are characterized by the system state and the method of state evolution. As discussed, the system state can be represented by a single-component system, by N/M partial systems of M components each, or by the full N-component state. System evolution follows either
Fig. 2. PDF distributions for the aromatic hydrogenation example problem.

Fig. 3. Model aromatic hydrogenation and dehydrogenation reactions and reaction profiles for the aromatic species.
fixed time intervals (fixed-time step approach) or event-based intervals whereby time is updated implicitly at each step (variable-time step approach). Each of these techniques presents certain advantages and challenges for different serial and parallel computer architectures. Herein we evaluate each of the techniques for the model reaction system of A to B, and compare their performance on two different parallel systems (MIMD vs SIMD). This 3 × 2 × 3 problem is nicely illustrated in the three-dimensional scheme depicted in Fig. 5. This is then extended to the more complicated and realistic system of the hydrogenation of model aromatic feeds. Our ultimate aim is to use the developed algorithms as modules in parallel/distributed Monte Carlo libraries for the modelling of heavy oils, resids, coals and other real problems involving complex mixtures of structures like that shown in Fig. 6, and coupled reaction kinetics. Section 5.3 illustrates one such application for the pyrolysis of a complex resid feedstock.
3. MONTE CARLO SIMULATION OF KINETICALLY COUPLED SYSTEMS
Monte Carlo reaction simulations chronicle the discrete transformations of an N-component system through time. In the simplest case, all N components are reacted together and are used to compute the compositions found in the controlling rate laws (e.g. equation 5), and thus define transition probabilities at each reaction time. The transition probability, P_ij, of a first-order reaction as given by equation (6) is exact for true first-order kinetics (Neurock et al., 1992b; Hillis, 1989):

P_ij = 1 - exp(-k_ij Δt).   (6)
The extension of the transition probability to nonsimple kinetics, including coupled systems, can be achieved by replacing the first-order rate constant of equation (6) with a pseudo-first order rate parameter. This rate parameter, k_i, is the rate of reaction of component i divided by its concentration, r_i/C_i.
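The pseudo-first-order substitution can be sketched in a few lines. This is an illustrative Python fragment, not the paper's code; the LHHW rate and adsorption constants are hypothetical:

```python
import math

def pseudo_first_order_k(rate_i, conc_i):
    """k_i = r_i / C_i: the pseudo-first-order rate parameter for component i."""
    return rate_i / conc_i

def transition_probability(k, dt):
    """Equation (6) with the pseudo-first-order parameter: P = 1 - exp(-k*dt)."""
    return 1.0 - math.exp(-k * dt)

# Hypothetical LHHW rate for A1 <=> B1 (illustrative constants, not the paper's).
k1, K = 0.5, 2.0                       # intrinsic rate and adsorption constants
C = {"A1": 1.0, "B1": 0.0, "A2": 0.5, "B2": 0.0}
denominator = 1.0 + K * sum(C.values())
r_A1 = k1 * K * C["A1"] / denominator  # rate depends on ALL species (coupling)
k_pseudo = pseudo_first_order_k(r_A1, C["A1"])
p = transition_probability(k_pseudo, dt=0.1)
print(f"P(A1 -> B1 within dt) = {p:.4f}")
```

Because the denominator contains every species' concentration, the resulting transition probability changes as the mixture composition evolves, which is exactly the coupling the text describes.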
Fig. 4. Model aromatic hydrogenation and dehydrogenation reactions and rate parameters.
This provides for composition-dependent transition probabilities. The Monte Carlo simulation can be performed using either a fixed-time or a variable-time discretization of the reaction time coordinate. In the fixed-time approach, molecules in the system are tested for reaction and the system state is updated at equally spaced time intervals. Multiple events (reactions; e.g. A → B → C) are allowed within each time interval, and the probabilities of these compound events are determined using conditional probabilities
Fig. 5. General overview of the available parallel Monte Carlo simulation methods for kinetic coupling.
(McDermott et al., 1990). The variable-time simulation proceeds event by event. A single reaction event is selected based on the possible reactions, their corresponding probabilities, and a random number. A second random number, in conjunction with the current system state, determines the time step required for the reaction (Gillespie, 1976; Neurock et al., 1992a, b). Both techniques are shown schematically in Fig. 7.

3.1. Full-system fixed-time step approach (FSFT)
Fig. 6. An example moiety found in an asphaltene reaction simulation.

The FSFT approach is the most straightforward method for describing kinetically coupled systems. The complete reaction mixture composition is stored in computer memory, and concentration-dependent transition probabilities, such as those shown in Table 1 for the LHHW reaction of A to B, are calculated prior to each time step by dividing the reaction rate of component i by its total concentration. The transition probabilities required to simulate the reversible reaction of A to B are given in Table 1. Equations (7) and (8) refer to the simple first-order transitions A → B and B → A. McDermott et al. (1990) have shown that the fixed-time approach is facilitated if the chance that a molecule will undergo more than one reaction within a given Δt is taken into account. Equations (9) and (10) in Table 1 were derived from conditional probabilities which allow for up to two transitions per time step. That is, both A → B → A and B → A → B are allowed. Their resulting transition probabilities are given as P_ABA and P_BAB in equations (9) and (10). A more in-depth discussion of the derivation of these equations is given in the McDermott paper (1990). Higher-order transitions, e.g. P_ABAB, can also be used, but for most rate parameters these higher-order expressions are smaller than the smallest number output by a random number generator. This means that such high-order transitions effectively have zero probability of occurring.

The key to accounting for the kinetic coupling is that the entire mixture composition is retained in memory. Thus, "full system" techniques refer to the maintenance of all N components in random access memory (RAM) throughout the simulation. This algorithm is shown schematically in Fig. 7A. In the initial step (t = t_0) of the simulation of the reaction A ⇌ B, N molecules of A are constructed. A series of N random numbers is then drawn and compared to the reaction probabilities to determine the fate of
Fig. 7. Schematics of the fixed-time (A) and variable-time (B) stochastic reaction algorithms.
Table 1. Transition probabilities for the fixed-time step Monte Carlo simulation of A reacting reversibly to B with LHHW kinetics
each component within the fixed time interval. The mixture composition is updated, and the reaction probabilities (equations 7-10) are recomputed based on the current composition. The newly calculated reaction probabilities are then used to determine the changes of state at the next time, t = t_1. The simulation moves completely across each row (t_i) before proceeding to the next time, t_{i+1}. This is depicted by the gray shading in Fig. 7A. The overall simulation thus reacts the entire state of N molecules through T transitions, where T = t_max/Δt is the number of time steps.

3.2. Full-system variable-time step approach (FSVT)

The combination of the "full system" and variable-time approaches (FSVT) is similar to the FSFT method. This is because the transition probabilities are equivalent to those in the fixed-time step method in the limit of small Δt. Note that the variable-time step approach is truly event-by-event, and therefore only the primary transitions A → B and B → A are required to model the deterministic solution accurately (Gillespie, 1976). The methodology of this FSVT algorithm is shown in Fig. 7B. The system state is defined by N components, and all N components are constructed and used as the basis for computing compositions and reaction probabilities. In practice, every conceivable event is determined and represented by its rate constant k_i. The selectivity (relative probability) of each event is computed as:

s_i = n_i k_i / Σ_{l=1..M} n_l k_l   (11)

where n_i is the number of molecules capable of undergoing reaction i with rate constant k_i, and M is the number of reaction event types. The selectivities in turn are used to generate a cumulative reaction probability distribution function (CRPDF). A hypothetical CRPDF is illustrated in Fig. 8.

Fig. 8. An illustration of a hypothetical CRPDF. The distance between any two ordinates gives the probability of the corresponding abscissa event. For example, the shaded distance gives the probability of event 3.

The simulation proceeds by drawing two random numbers. The first random number corresponds to the ordinate of the CRPDF and is used to compute the associated event or reaction. A second random variable is generated and used to compute the time of this next transition (reaction) by substituting into the rearranged form of the state transition probability, given in equation (12) (Neurock et al., 1992b; Trauth et al., 1993):

P_ij = 1 - exp(-Σ_{l=1..M} n_l k_l Δt).   (12)

Rearrangement of equation (12) to equation (13) provides the time required for the transition:

Δt = -ln(1 - RN) / Σ_{l=1..M} n_l k_l.   (13)

The FSVT simulation proceeds event-by-event with the cumulative time, composition, and transition probabilities being recomputed after each event occurrence.

3.3. Memory issues

The small number of components in the systems considered here is deceptively simple in some respects. For example, the computer memory requirements for maintaining large sample sizes and composition information for A and B were easily accommodated on the TC2000 and CM-2 machines. However, our ultimate application of the method to complex mixtures having on the order of 10^4 components with explicit atomic connectivity will have memory requirements exceeding those available on most current computers. The Monte Carlo simulation of petroleum asphaltenes (cf. Fig. 6) provides a representative example of the scope of this issue. The structural information alone requires on the order of 0.01 megabytes (Mb) per molecule in the asphaltene model (Neurock et al., 1992b). This corresponds to 100 Mb of RAM for the structural description of a 10^4-component mixture. The introduction of reaction information increases the memory requirement by roughly a factor of 10. Together, the structural and reaction information would exceed the RAM limits of most currently available architectures.

This motivated the formulation of an alternate approach to the modelling of kinetically coupled systems. The goal of this approach was to reduce the number of molecules used to describe the system state in order to reduce memory usage. The cost of this approach is a loss of precision in a single simulation, as the number of molecules used to describe the system is less than the 10^4 possible species. However, this precision can be recovered by averaging several simulation results based on the reduced systems. Thus, the original N-component system is divided into P systems, each composed of M molecules, according to equation (14):

P pseudo-systems = (N molecules) / (M molecules per pseudo-system).   (14)

A schematic of the system reduction is given in Fig. 9. As an example of the savings this method affords, consider the two systems presented in Fig. 9. The 20-component system of Fig. 9 requires a 20 × 20 matrix to hold the structure and interaction information. The four 5-component systems each require only a 5 × 5 matrix, and demand 25% of the memory required by the 20-component system. As the system size increases, the relative savings will increase as well. This breakdown of the system into pseudo-systems suggests a natural method of parallelization. The assignment of the P independent pseudo-systems to P parallel processors would capitalize on the intrinsic parallel nature of this approach to the Monte Carlo simulation. Thus, the Monte Carlo algorithm maintains its parallel nature even when the reaction kinetics are coupled. This is best illustrated by consideration of the parallel implementation details.

Fig. 9. Decomposition of the N-component system into P M-component systems.
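The FSVT event loop of Section 3.2 (equations 11-13) can be sketched as follows. This is an illustrative Python version for the two-event reversible system A ⇌ B with assumed rate constants, not the paper's implementation:

```python
import math
import random

# Variable-time (FSVT) sketch for reversible A <=> B with assumed
# (illustrative) rate constants; two event channels: A->B and B->A.
K_AB, K_BA = 0.8, 0.3

def fsvt_step(n_a, n_b, rng):
    """Select one reaction event from the CRPDF (eq. 11) and the time
    increment to that event (eq. 13); return updated counts and dt."""
    rates = [n_a * K_AB, n_b * K_BA]        # n_i * k_i for each event type
    total = sum(rates)
    if total == 0.0:
        return n_a, n_b, float("inf")       # nothing left to react
    # First random number: pick the event from the cumulative selectivities.
    r1, cumulative = rng.random(), 0.0
    for event, rate in enumerate(rates):
        cumulative += rate / total           # s_i of equation (11)
        if r1 < cumulative:
            break
    # Second random number: time to this event, equation (13).
    dt = -math.log(1.0 - rng.random()) / total
    if event == 0:
        n_a, n_b = n_a - 1, n_b + 1          # A -> B
    else:
        n_a, n_b = n_a + 1, n_b - 1          # B -> A
    return n_a, n_b, dt

rng = random.Random(1)
n_a, n_b, t = 10000, 0, 0.0
while t < 5.0:
    n_a, n_b, dt = fsvt_step(n_a, n_b, rng)
    t += dt
print(f"A fraction at t ~ 5: {n_a / (n_a + n_b):.2f}")
```

After each event the cumulative time and the event rates are recomputed from the updated counts, which is the event-by-event bookkeeping the text describes.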
4. PARALLEL IMPLEMENTATIONS

The opportunity provided by parallel computers is the ability to perform a given algorithm faster than can be done on a serial machine. Hillis (1989) presents a useful overview of several classes of parallel computers. Different parallel architectures exist because the optimal method by which an algorithm can be decomposed into several smaller, "parallel" tasks depends on the nature of the algorithm. This work focuses on the two most common high-level classifications of parallel computers: MIMD and SIMD. The only characteristic feature used to differentiate parallel architectures was the primary form of parallelism. If parallelism is based at the instruction level, the classification of the architecture is MIMD (multiple instruction, multiple data). Alternatively, if the parallelism exists primarily at the data level, the architecture is SIMD (single instruction, multiple data) (Flynn, 1966). Figure 10 gives a schematic representation of the MIMD computer used in this investigation, a BBN TC2000. Referring to Fig. 10, the MIMD system consists of n processors and n memory units, labeled P_1 . . . P_n and M_1 . . . M_n, respectively. The solid bold paths are data channels. Each processor is capable of reading from and writing to the memory of any other processor through the "Butterfly Switch" communications channel. This capability is the means by which communication between processors occurs. Each processor in the system is a completely autonomous, moderate-performance UNIX-based computer. The dotted paths represent the processor instruction streams and are labeled IS_1 . . . IS_n in Fig. 10. The instruction streams represent the program instructions that define the tasks each processor performs. For example, if the problem to be
solved consisted of summing the elements in a vector of length 100, and 10 processors were used, IS_1 might be the instructions that added elements 1 . . . 10 to a variable common to all processors, IS_2 might add elements 81 . . . 90, and IS_3 might add 11 . . . 20. Each instruction stream is responsible for summing 1/n of the vector elements. Note that there need not be any correlation between task and processor numbers. The idea behind the MIMD paradigm is to decompose the problem into a number of small independent tasks which can be distributed to the processors of the system. These tasks become the instruction streams to the processors. The instruction sequences consist of the data manipulation and synchronization commands required to perform the program tasks in parallel. Monte Carlo based problems are an example of ideal MIMD problems. This is because the Markov chains of the Monte Carlo simulation naturally map to the independent tasks distributed across the processors. Nonideal problems for MIMD machines are those whose tasks require a great deal of synchronization between the processors. An example of this type of problem is Gaussian elimination. Each step of Gaussian elimination consists of choosing a pivot row and then eliminating the elements below the lead element of the pivot row. Because of this, processors must synchronize to ensure that the pivot row used during the elimination phase is the correct one. Figure 11 is a schematic representation of the SIMD computer used in this investigation, a Thinking Machines CM-2. Figure 11 represents the CM-2 as being composed of two distinct hardware units, a moderate-performance UNIX front-end computer (Sun-4 SPARC) and a 32768-processor parallel computer. In Fig. 11, the single instruction stream, labeled "Instruction Bus", coming from the front-end computer branches out to all the processors P_i. This is the major difference between the SIMD and MIMD paradigms: program flow is based
et al.
a single sequence of instructions on a SIMD machine rather than the n instruction sequences on the n processor MIMD machine. Inter-processor data transfer is performed by the hardware unit labeled as “Router/NEWS Communication” in Fig. 11. The data stream labeled “Scalar Data Bus” allows scalar data to be efficiently propagated to each processor in the parallel computer. The “Global Result Bus” allows parallel data to be mapped to a single scalar value using parallel operations such as MAX, AVERAGE, and SUM. Another characteristic difference between the two architectures is the number of processors. The CM-2 employs 3276l3 processors, each with 128 kb of memory. The processors are clustered in groups of 32, and each cluster shares a floating point unit and router interface. However, the striking difference in the number of processors between the CM-2 and TC2000 diminishes somewhat when a comparison of the respective power of the processors on the two machines is made. The CM-2 uses 1 bit processors operating at 7 MHz while the TC2000 uses 32 bit processors operating at 16 MHz. Thus, each processor on the CM-2 has roughly 1/70ththe power of a processor on the TCXKKl, but the CM-2 has 1000 times as many processors. Comparison of the features of the fixed-time step reaction simulation code highlights the major differences between the two parallel architectures. On the SIMD machine, the program was written with a sequential instruction flow, and use was made of both front end and parallel variables. The front end variables were either constants or scalars that were used to hold the result of some data reduction step applied to the paraHe1 variables associated with the CM-2 processors. Examples of such reductions include sum, average, minimum and maximum operations. More specifically, in the present case of reaction simulation, the parallel variables represented the various molecular species and each CM-2 processor was assigned a molecular species state
Fig. 10. Schematic of the BBN TC2000 parallel MIMD computer (processors, memory units, butterfly switch).
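The instruction-stream decomposition of the vector-sum example above (a 100-element vector summed by 10 processors, with no required correlation between task and processor numbers) can be illustrated with a minimal sketch; the thread-pool mechanics and names here are ours, not the paper's:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(data, start, count):
    """One instruction stream IS_k: sum a contiguous slice of the vector."""
    return sum(data[start:start + count])

vector = list(range(1, 101))      # 100-element vector
n_procs = 10
chunk = len(vector) // n_procs    # each stream is responsible for 1/n of the elements

# Tasks may be handed to processors in any order: task k need not run on processor k.
with ThreadPoolExecutor(max_workers=n_procs) as pool:
    futures = [pool.submit(partial_sum, vector, k * chunk, chunk) for k in range(n_procs)]
    total = sum(f.result() for f in futures)

print(total)  # 5050
```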
Fig. 11. Schematic of the Thinking Machines CM-2 parallel SIMD computer.

variable. The value of this state variable signified which molecular species a CM-2 processor represented. The SIMD reaction simulation proceeded as follows. The first step was to initialize the parallel species state variables to the initial species distribution. Initialization was completed by the calculation of the initial reaction probabilities. The reaction simulation then proceeded by having each processor generate a random number and store it in its local memory. In the simplest implementation, tests for reaction within the current Δt were then made by having the front end computer go through and test the various reaction pathways for each type of molecular species. As a more concrete illustration, consider the tests that need to be made for the six component LHHW model given in equations (2-4). The front end computer would first send out an instruction asking every processor that represented an A1 species to compare its local random number against the current reaction probabilities P_A1A1 and P_A1B1. If a processor's random number met the condition P_A1A1 < RN < P_A1B1, the processor would change its state to B1. Similar instructions would be sent out for B1 → A1, A2 → B2, B2 → A2, A3 → B3, and B3 → A3. In all, six sets of instructions would be issued by the front end computer to test for the various reaction possibilities. After all reaction sequences were tested, the amount of each species was computed and stored on the front end. These were then used to update the reaction transition probabilities for the
next Δt. The simulation continued in this fashion until the final reaction time was reached. On the MIMD machine, the program had a task-based instruction sequence, and use was made of both variables that were local to each processor and variables that were shared among all processors. Local processor variables were used for constants, intermediate results, and 1/P of the problem data, where P is the number of processors. The shared variables were used to store the cumulative simulation results. In terms of the working code, the MIMD simulation began by assigning each processor 1/P of the initial species counts. The simulation proceeded by having each processor independently react its local environment. After completion of a reaction time step, each processor added its local species counts to the corresponding global variable. Each processor would then update the transition probabilities in one of two ways, depending on the degree of synchronization employed. If the program used complete synchronization, each processor subsequently waited for all other processors to finish their reaction step, and then updated the transition probabilities based on the shared species counts. If no synchronization was used, each processor updated the transition probabilities based on its local environment and proceeded. The simulation continued in this fashion until the final reaction time was reached. Table 2 gives a coarse summary of SIMD and MIMD features (Hord, 1990) and further emphasizes the major differences between the two reaction
simulation implementations. The first row of Table 2 emphasizes that processors on the SIMD machine were each given the same instruction, whereas the MIMD processors were given unique sequences of instructions, namely the algorithm for reacting their local environment. The second row of Table 2 contrasts the primary focus of the reaction simulation programs on the two machines. On the SIMD machine, the program is centered about testing the states of the parallel variables (CM-2 processor states). On the MIMD machine, the program is based on the independent simulations of processors on their partial systems. The contrasting viewpoints are data oriented versus task oriented. The third point in Table 2 addresses the potential inefficiencies of the programs on the two machines. When a given species is being tested for reaction on the SIMD machine, only those processors whose state variable indicates that they represent the species being tested are active. The processors that have a different state must sit inactive. Thus, only a fraction of the processors are active at any one time. On the MIMD machine, the point at which a processor may sit idle is when it must wait for other processors to complete a given reaction step. During such synchronization periods, the idle processors perform no useful work. The final point in Table 2 concerns the communication of data among the parallel processors. For the SIMD code, the front end computer generated all the instructions required to transfer data among the CM-2 processors. For the MIMD code, each processor was capable of reading and writing to the memory of any other processor that had been declared as shared. The following sections look at the parallel implementations of the full and partial system fixed-time step approaches to modelling kinetically coupled systems. The implementation strategies on both the TC2000 and CM-2 are discussed. Consider the TC2000 implementation first.
4.1. FSFT and PSFT on the TC2000

Table 2. SIMD versus MIMD parallel computers

  SIMD                                                MIMD
  All processors given the same instruction           Each processor has its own instruction sequence
  Each processor operates on different data           Each processor works on a different part of the problem
  Processors may sit out a sequence of instructions   A processor may have to wait for other processors or data
  A single processor coordinates data communication   Each processor may communicate data

Figure 12 illustrates the key steps of the FSFT Monte Carlo simulation. In the FSFT algorithm, all
N components are capable of independent reaction for only a single Δt. This is because the reaction probabilities are composition and, therefore, time dependent. The first step of the parallel FSFT algorithm consisted of assigning a fraction of the N components to each processor. At this point, the parallel simulation loop of steps 2 through 5 began. The first step of the reaction loop is the calculation of reaction transition probabilities using the current global reaction environment. After this, each processor performed independent reaction of the fraction of molecules assigned to it for one Δt. After reaction of its molecules, each processor atomically† updated the shared global environment variables. This was followed by synchronization of the processors to ensure that all molecules had been reacted through the time step. This sequence of events was repeated for the number of Δts required to reach the final reaction time, tfinal. The essential difference between the FSFT algorithm with and without coupled kinetics is the need to synchronize processors after each Δt in order to recompute the composition-dependent transition probabilities. This requirement constituted a large restriction, in terms of parallelization, since a processor could only execute a fraction of the reaction tasks assigned to it before it was required to wait for the other processors to finish the equivalent subtasks. Only after this synchronization could the simulation proceed. The ideally parallel alternative was to assign each processor a single pseudo-system corresponding to the PSFT method. Figure 13 illustrates the strategy used to perform the PSFT Monte Carlo simulation on the BBN TC2000. As with the FSFT approach, the first step is to assign a fraction of the initial species count to each processor. However, the collection of molecules assigned to each processor now corresponds to its pseudo-system environment rather than simply a fraction of the global environment.
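The barrier-synchronized FSFT loop described above can be sketched minimally as follows; a hypothetical composition-independent A → B probability stands in for the paper's LHHW kinetics, and all names and constants are illustrative:

```python
import random
import threading

K_AB = 0.01       # hypothetical per-molecule reaction probability per time step
N_STEPS = 100
N_PROCS = 4

global_counts = {"A": 8192, "B": 0}    # shared global environment
lock = threading.Lock()
barrier = threading.Barrier(N_PROCS)   # synchronization after each time step

def fsft_worker(n_local):
    local = {"A": n_local, "B": 0}     # this processor's fraction of the molecules
    for _ in range(N_STEPS):
        # Step 2: transition probability (composition-independent in this sketch)
        p_ab = K_AB
        # Step 3: react the local fraction of molecules through one time step
        reacted = sum(random.random() < p_ab for _ in range(local["A"]))
        local["A"] -= reacted
        local["B"] += reacted
        # Step 4: atomically update the shared global environment
        with lock:
            global_counts["A"] -= reacted
            global_counts["B"] += reacted
        # Step 5: wait for all processors before the next time step
        barrier.wait()

threads = [threading.Thread(target=fsft_worker, args=(8192 // N_PROCS,))
           for _ in range(N_PROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(global_counts["A"] + global_counts["B"])  # 8192 (molecules conserved)
```

With composition-dependent kinetics, step 2 would read the shared counts after the barrier, which is exactly the synchronization cost discussed above.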
After assignment of the pseudo-systems, each processor executed an FSFT simulation on its local pseudo-system independent of other processors. This entailed computing the concentration-dependent transition probabilities using only local information and the subsequent reaction of the pseudo-system molecules through a single Δt. Repetition of this sequence (steps 2-3) through tfinal completed the pseudo-system reaction simulation of a processor. As each processor completed its simulation, it atomically updated a shared array with the pseudo-system species profiles. The simulation thus

† An atomic update is one that ensures that only a single process modifies a given memory location. This form of access is required when shared memory is accessed by multiple processes.
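The atomic update of the footnote corresponds to a lock-protected read-modify-write; a minimal sketch, with Python's threading primitives standing in for the TC2000's shared-memory operations:

```python
import threading

counter = 0
lock = threading.Lock()

def add_species_counts(n):
    """Atomically add a processor's local species count to the shared total."""
    global counter
    for _ in range(n):
        with lock:        # without the lock, concurrent read-modify-write
            counter += 1  # sequences could lose updates

threads = [threading.Thread(target=add_species_counts, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 with the lock; possibly less without it
```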
Fig. 12. Illustration of the steps of the FSFT Monte Carlo algorithm for the A⇌B architecture on the TC2000: (1) map reaction species to processors; (2) each processor reacts its local environment; (3) atomically add processor species sums to the global environment species counts; (4) synchronize processors; (5) each processor recomputes transition probabilities based on the global environment; repeat while (rxnTime < Tfinal).
occurred without synchronization, and overall system results were the combined result of each processor's pseudo-system simulation. Figure 14 provides a comparison between the deterministic and the PSFT and FSFT stochastic solutions for the six component LHHW model. The solutions are parametric in the number of TC2000 processors used, varying from N = 1 to N = 16. For each run the total number of molecules was constant at 8192. Thus, the number of molecules in the partial systems was 8192/N, with N being the number of processors. Note that the environment of the PSFT simulation with N = 1 is identical to that used in the FSFT simulation. Both of the stochastic models are seen to be in accord with the deterministic solution. At the scale of Fig. 14, it is difficult to distinguish the relative error between the PSFT and FSFT solutions. Figure 15 provides a better comparison of the two simulation approaches by plotting the error ratio between the stochastic and the deterministic solutions.
The full system algorithm is an exact algorithm with precision dependent on the overall number of molecules used. For a 10,000 component system, the error between the full system simulation and the analytical solution was found to be less than O(10^-10). The relative error between the partial system solution and the analytical solution for systems composed of 2, 10, 20, 50, and 100 components was 0.6, 0.19, 0.09, and 0.01 [M/s]. This corresponded to variances of 3.6 × 10^-1, 4.0 × 10^-4, 2.0 × 10^-6, and 8.0 × 10^-8 [M/s]^2. Since the size of a processor's local environment decreases as the number of nodes increases, the partial system approach would be expected to diverge as the number of nodes increases. That is, a decrease in its size would cause a processor's local environment to be an increasingly poorer approximation of the global environment maintained by the FSFT approach. However, the lack of divergence in Fig. 15 shows the partial system environment to be a good approximation to the full system environment
for the conditions studied here. Of course, there is a limit to the subdivision one can impose on the simulation environment. Beyond a certain point, the partial system environment simply does not contain enough species to simulate all possible reactions adequately. Neurock has shown that, for complex reaction simulations, the minimum system size is on the order of 20 molecules (Neurock, 1992).

4.2. FSFT and PSFT on the CM-2

Optimal implementation of the FSFT approach on the CM-2 machine was sought in terms of the data parallelism of the simulation. The data parallelism of the FSFT problem exists in the form of the molecules which define the system, the transformations of which can be achieved by a single sequence of instructions. The FSFT problem mapped well to the CM-2 when each CM-2 processor represented a single molecule in the reaction system. Figure 16 illustrates the FSFT program flow for the A⇌B system on the CM-2. Following the foregoing data mapping, the first step was the mapping of molecules to CM-2 processors. For the simple kinetic models, this simply consisted of setting a state variable on each CM-2 processor to the value of the species it was to represent. The first step of the reaction simulation loop was the calculation of reaction transition probabilities using the global environment. The reaction test phase (step 3) begins by having each CM-2 processor generate a random number in the [0,1] interval. The front end computer then sends the test-for-reaction instructions to processors currently representing A molecules. During the A → B test, processors representing B molecules are inactive. This is represented by the gray shading of step 3a. The A → B test is followed by the test for B → A reactions. Processors that were inactive during the A → B test are now active, while those that represented A molecules at the start of the A → B test are inactive. Note that processors that have changed from state A to state B are still deactivated during the B → A test. In general, the front end tests for all possible reaction transitions. On completion of reaction tests, the new global environment was computed by summing over the current processor states. Repetition of this sequence (steps 2-4) defined the fixed-time reaction simulation. The processor synchronization problem of the MIMD FSFT implementation does not exist on the CM-2 since the program has only a single instruction stream that is controlled by the front end computer. Figure 17 shows the good agreement
Fig. 13. Illustration of the steps of the PSFT Monte Carlo algorithm for the A⇌B architecture on the TC2000: (1) map reaction species to processors; (2) each processor reacts its local environment; (3) each processor recomputes transition probabilities based on its local environment; (4) atomically add processor species sums to the global environment species counts.
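The PSFT strategy of Fig. 13 (independent pseudo-systems, no inter-processor synchronization, results merged at the end) might be sketched as follows; the kinetics and constants are hypothetical:

```python
import random

def react_pseudo_system(n_a, n_steps, p_ab, seed):
    """Run an independent FSFT simulation on one pseudo-system.

    Transition probabilities are computed from *local* information only,
    so no synchronization with other pseudo-systems is needed.
    """
    rng = random.Random(seed)
    a, b = n_a, 0
    for _ in range(n_steps):
        reacted = sum(rng.random() < p_ab for _ in range(a))
        a, b = a - reacted, b + reacted
    return a, b

# 4 pseudo-systems, each holding 1/P of the 8192 molecules (hypothetical A -> B kinetics)
results = [react_pseudo_system(8192 // 4, 50, 0.01, seed=k) for k in range(4)]

# Overall system results are the combination of the independent pseudo-system profiles.
global_a = sum(a for a, _ in results)
global_b = sum(b for _, b in results)
print(global_a + global_b)  # 8192
```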
Fig. 14. Deterministic and stochastic solutions for the B2 component of the A1⇌B1; A2⇌B2; A3⇌B3 system on the TC2000. (A) PSFT algorithm, (B) FSFT algorithm.
Fig. 15. The ratio of the error between the PSFT and deterministic solutions to the error between the FSFT and deterministic solutions for the A1 component of the A1⇌B1; A2⇌B2; A3⇌B3 system on the TC2000. Error ratio = dA1/dA1 = (A1,PSFT − A1,Det)/(A1,FSFT − A1,Det). For the N = 1 case, PSFT = FSFT, but the ratio differs from unity due to the stochastic nature of the simulation.
Fig. 16. Illustration of the steps of the FSFT Monte Carlo algorithm for the A⇌B architecture on the SIMD CM-2: (1) map reaction species to processors; (2) the front end computer recomputes transition probabilities based on the global environment; (3) the front end computer tests CM-2 processors for reaction (a: test for A → B, b: test for B → A); (4) atomically add processor species sums to the global environment species counts; repeat while (rxnTime < Tfinal). Shading represents inactive processors.
between the deterministic and the FSFT stochastic solutions for the B2 species of the six component LHHW model. There is a degree of inefficiency with the algorithm given in Fig. 16. This is due to the method by which the molecules are tested for reaction in step 3 of the FSFT algorithm. While any given molecule type is being tested for reaction, all processors representing other molecule types are deactivated for the sequence of test instructions. One would like to perform the reaction tests in a manner that allows every molecule type to be tested by a single instruction. In this way all the CM-2 processors are active and maximum utilization occurs. This can be achieved by restructuring the reaction transition probability parallel array on the CM-2 (Stark, 1992). Using this approach we have derived an implementation that attains 100% processor utilization during the reaction test phase for the simple model systems.
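The restructured reaction test can be approximated in a data-parallel style: indexing a per-state threshold array lets a single vectorized comparison test every molecule type at once, so no processor is idle during the test. This is only our approximation of the Stark (1992) restructuring; the states, probabilities, and product states below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# States 0..5 encode A1, B1, A2, B2, A3, B3; p_react[s] is a hypothetical
# transition probability out of state s, and partner[s] is its product state.
p_react = np.array([0.02, 0.01, 0.03, 0.015, 0.01, 0.005])
partner = np.array([1, 0, 3, 2, 5, 4])

states = rng.integers(0, 6, size=8192)  # one "processor" per molecule

# A single vectorized instruction covers every molecule type: each processor
# compares its own random number against the threshold for its own state.
rn = rng.random(states.size)
reacts = rn < p_react[states]
states = np.where(reacts, partner[states], states)

print(states.shape)  # (8192,)
```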
The question that remains is how well this approach scales to complex models. Unfortunately, we have not been able to develop an efficient implementation for large systems. The problem is that molecules in complex simulations are considered as large collections of reactive sites rather than individual reactive molecules. The reason for making this distinction is that it greatly reduces the number of reaction transitions one must keep track of. As an illustration of this point, consider the anthracene hydrogenation network of Fig. 4. The roman numerals classify the reactive site reactions according to the intrinsic hydrogenation pathways proposed for the hydrogenation model. When the system reactions are tracked via the pathways of Fig. 3, there are three sets of kinetics. If instead one were to consider the unique molecules as the reactive species, there are six sets of kinetics. In general, the molecules arising in complex systems have on
the order of ten reactive sites per molecule. An example of a typical asphaltene moiety is given in Fig. 6. For molecules such as this, the difference between basing kinetics on molecules and reactive sites is not a simple factor of two as in the anthracene example, but greater than an order of magnitude. A consequence of concentrating on reactive sites rather than molecules for the CM-2 FSFT implementation is that the program structure must be based on reactive sites. This means that variables such as the reaction transition probability array must represent the transitions of sites, not molecules. Because of this, we could not formulate a reaction test that was able to keep all processors busy. The reason for this was that the problem data that were mapped to the parallel processors were still the simulation molecules. We were unable to find a good mapping of reaction sites to processors that still maintained the identity of the individual molecules. Reaction tests of the reactive sites were performed by looping over the types of reactive sites and the number of each type of site. The molecules that are mapped to the CM-2 processors have a distribution of sites, both in number and type. In the most straightforward implementation this means that the front end computer will have to loop from one to the maximum number of sites among the molecules for each type of site. This introduces an additional point for idle processors since a processor
can be idle even if it has the type of reactive site currently being tested. This occurs because some molecules, and therefore processors, may only have one of a given type of reactive site while other molecules have several. In order to evaluate how these issues impact the performance of the CM-2 FSFT algorithm for a system of sufficient complexity, the 1-5 member ring hydrogenation problem was examined. The run time of the FSFT CM-2 implementation is compared to the PSFT implementation on the TC2000 in Fig. 18. Figure 18 shows that the performance of the CM-2 implementation using 8192 processors is slightly better than the 2 processor run time on the TC2000. A quick comparison of the processing power of the two architectures based on 1024 × 1024 single precision matrix multiplication yields:
8192 CM-2 procs → 800 MFLOPS
32 TC2000 procs → 160 MFLOPS
8192 CM-2 procs = 120 TC2000 procs

This comparison is biased toward the CM-2 since matrix multiplication is an example of an ideal data intensive problem. Nonetheless, the comparison demonstrates that CM-2 performance on the hydrogenation model is a fraction of what the hardware is capable of. The very poor result implies that there is a major problem with the data mapping. Since the reaction kinetics are centered on reactive sites, what should really be mapped to the CM-2 processors are
Fig. 17. Comparison of the deterministic and stochastic solutions for the B2 component of the A1⇌B1; A2⇌B2; A3⇌B3 system on the CM-2 (N = 8192). This approach uses a fixed time step and molecules of the system are mapped to the processors of the CM-2.
Fig. 18. Comparison of the CM-2 FSFT and TC2000 PSFT implementations of the aromatic hydrogenation model problem (TC2000 runs for N = 1, 2, 4, 8, 16 nodes).
the molecule sites. One cannot lose track of the molecules, however, and so there is a problem with the basic formulation of the implementation. We currently have little motivation for attempting to improve the code for any reason other than furthering an understanding of the CM-2. The reason for this view is that the CM-2 is not a general purpose parallel computer; rather, it is a specialized architecture designed for a specific class of data oriented problems. Any effective implementation would have to differ dramatically from the MIMD model, and we do not see sufficient potential to warrant the effort. There was also little point in trying to implement the PSFT approach on the CM-2 because the fundamental idea of the PSFT approach is contrary to the parallel nature of the CM-2. The PSFT approach is based on a collection of autonomous partial systems executing the FSFT algorithm. The CM-2 simply is not designed to handle this type of approach.

5. PARALLEL FSVT AND PSVT MONTE CARLO ALGORITHMS
The FSVT approach to the Monte Carlo problem was difficult to fashion in parallel form. Indeed, the algorithm is highly serial. From the MIMD point of view, there was no set of independent tasks to be carried out, and from the SIMD point of view, there was no data structure on which a sequence of instructions could be performed for every member.
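The serial character of the FSVT method resembles Gillespie-style variable-time-step sampling: all molecules contribute to the event selectivities and the sampled time interval, yet each event changes only a single molecule. A minimal sketch with hypothetical first-order A⇌B kinetics (rate constants and names are illustrative):

```python
import math
import random

def fsvt_step(counts, rates, rng):
    """One variable-time-step event: sample tau from the total rate, then
    select a single reaction channel; only one molecule changes state."""
    # Channel propensities for hypothetical first-order A <-> B kinetics.
    a = [rates["AB"] * counts["A"], rates["BA"] * counts["B"]]
    a_total = sum(a)
    if a_total == 0.0:
        return None
    # Exponentially distributed time interval (1 - U avoids log(0)).
    tau = -math.log(1.0 - rng.random()) / a_total
    if rng.random() * a_total < a[0]:
        counts["A"] -= 1
        counts["B"] += 1
    else:
        counts["B"] -= 1
        counts["A"] += 1
    return tau

rng = random.Random(42)
counts = {"A": 1000, "B": 0}
t = 0.0
while t < 1.0:
    tau = fsvt_step(counts, {"AB": 1.0, "BA": 0.5}, rng)
    if tau is None:
        break
    t += tau

print(counts["A"] + counts["B"])  # 1000
```

The whole-population bookkeeping for a single-molecule event is exactly what makes the method hard to decompose into independent tasks.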
The limitation of the algorithm is that, although all molecules are used to compute the event selectivities and the reaction time interval, only a single molecule out of the N possible actually changes state. This limitation motivated the PSVT approach. The PSVT approach cast the variable time step method in a form so as to realize a significant degree of parallelism. As in the PSFT approach, parallelism existed simply because the method defines a number of independent pseudo-systems. This dictated the implementation approach for the two architectures.

5.1. PSVT on the TC2000

For the TC2000, the parallel tasks were the reactions of the pseudo-systems. Each processor implemented the FSVT approach using its local information for the reaction environment, and computed the concentration-dependent transition probabilities using only this information. As in the PSFT implementation, each processor atomically updated a global array using the local pseudo-system results on completion of its simulation. Figure 19 compares the deterministic and the PSVT stochastic solution for the six component LHHW model. The match is excellent. Systematic error is probed in Fig. 20 as a plot of the ratio of the error between the PSVT and deterministic solutions and the FSFT and deterministic solutions for the six-component LHHW model. The solutions are parametric in the number of TC2000 processors used, varying from N = 1 to N = 32. For
each run the total number of molecules was constant at 8192. Thus, the number of molecules in the partial systems was 8192/N, with N being the number of processors. An overall view of the graph shows a lack of systematic error. The lack of any systematic error indicates that the local environment maintained by the processors is a good approximation to the true global environment maintained in the FSFT approach.

5.2. PSVT on the CM-2
Implementation of the PSVT approach on the CM-2 mapped the pseudo-systems to the parallel processors. This is different from the FSFT approach on the CM-2, where each processor represented a single molecule. Because the PSVT approach required each processor to represent a pseudo-system, each CM-2 processor was asked, essentially, to mimic the behavior of the TC2000 processors. This use of the CM-2 was largely an artificial mapping, and two major difficulties arose upon consideration of scaleup to complex systems. The first problem was the speed and memory characteristics of the CM-2 processors. The CM-2 processors are slower and have less memory than those of the TC2000. Thus, the allowable size of the pseudo-system would be considerably smaller than that permitted on a TC2000 processor. The second problem was that each processor on the CM-2 was not capable of independently executing a program to perform the reaction simulation of its pseudo-system. These issues suggested that the efficiency of the PSVT approach would be far from optimal on the CM-2, and as such, extension to systems other than the three simplest model systems was not pursued.
6. PARALLEL EFFICIENCY

6.1. Parallel performance evaluation
In this paper, parallel performance is evaluated on two criteria, parallel speedup and parallel efficiency. Parallel speedup (S_N) is defined to be the ratio of the single processor run time to the run time on N processors, i.e.

S_N = T_1 / T_N.    (15)

Parallel efficiency (E_N) is defined to be the ratio of S_N to N, i.e.

E_N = S_N / N.    (16)
Fig. 19. Comparison of the deterministic and stochastic solutions for the B2 component of the A1⇌B1; A2⇌B2; A3⇌B3 system on the TC2000. (A) PSVT algorithm, (B) FSFT algorithm.
Fig. 20. The ratio of the error between the PSVT and deterministic solutions to the error between the FSFT and deterministic solutions for the A1 component of the A1⇌B1; A2⇌B2; A3⇌B3 system on the TC2000. Error ratio = dA1/dA1 = (A1,PSVT − A1,Det)/(A1,FSFT − A1,Det). For the N = 1 case, PSVT = FSFT, but the ratio differs from unity due to the stochastic nature of the simulation.
We are aware that there is a great deal of discussion and even controversy concerning what the "correct" parallel performance evaluation measure should be. Several other performance measures exist and include those proposed by Eager et al. (1989), Hack (1989), Gustafson (1988), and Hockney and Jesshope (1985). Most simply do not provide a means of practical evaluation of every parallel algorithm because they require code development or hardware parameter measurements that cannot be made due to either time or resource constraints. Although an algorithm with a good performance rating as defined by either equation (15) or (16) will not necessarily have a high performance rating via one of the alternate methods, it might be argued that any good parallel algorithm must at least have high S_N and E_N ratings. This is because S_N and E_N measure how well the algorithm scales with N for a problem of fixed size. One instance where equations (15) or (16) can be misleading is when the problem size is too large to be accommodated by a single processor. For example, if the memory requirements are such that the single processor run requires paging to disk because virtual memory exceeds physical memory, the average memory access time of the single processor result will be much higher than the N processor run in which the memory associated with the parallel tasks does fit within a processor's physical memory. In this situation an alternate speedup criterion is (Skjellum, 1990):

S'_N = N_min T_Nmin / T_N,    (17)

where N_min is the minimum number of processors required to execute the given problem, and T_Nmin is the corresponding time required to solve the problem using N_min processors. An equivalent efficiency measure is:

E'_N = S'_N / N.    (18)

Equation (18) defines the efficiency such that E'_Nmin = 1.
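The four performance measures translate directly into code (a sketch; the N_min-based forms follow Skjellum (1990), and the sample timings are invented for illustration):

```python
def speedup(t1, tn):
    """Parallel speedup S_N = T_1 / T_N, equation (15)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Parallel efficiency E_N = S_N / N, equation (16)."""
    return speedup(t1, tn) / n

def speedup_min(t_nmin, n_min, tn):
    """Alternate speedup S'_N = N_min * T_Nmin / T_N, equation (17), for
    problems too large to fit on a single processor."""
    return n_min * t_nmin / tn

def efficiency_min(t_nmin, n_min, tn, n):
    """E'_N = S'_N / N, equation (18); E'_Nmin = 1 by construction."""
    return speedup_min(t_nmin, n_min, tn) / n

# Sanity check: running on N = N_min processors gives an efficiency of exactly 1.
print(efficiency_min(t_nmin=120.0, n_min=4, tn=120.0, n=4))  # 1.0
```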
6.2. MIMD parallel efficiency
The parallel efficiency of the FSFT, PSFT and PSVT approaches on the TC2000 was probed via a series of program runs. The simplest model system incorporated a do loop delay of 1000 integer increment instructions after every successful reaction test. Addition of this standard delay was necessary to bring the computational requirements of this small reaction model to a reasonable level. The choice of 1000 was made simply because this was the minimum value for which the simulation run time was nontrivial. Figure 21 summarizes results for the TC2000. Clearly, the PSFT and PSVT approaches maintained very high processor utilization (>90%) up through 32 nodes, whereas the FSFT was below that of the partial system implementations for each model. Note, however, that the relative performance of the FSFT code improved as the model problem size increased. Both the FSFT's poorer efficiency and improvement with problem complexity can be traced to the synchronization required in the FSFT
approach. The FSFT approach requires every processor to communicate with every other processor after each fixed-time step. The net effect was that as the number of processors increased, each processor spent increased time determining the status of the other processors. This is the cause of the poorer parallel efficiency. The waiting time due to synchronization after each time step is a weak function of the number of nodes. This means that as the problem complexity increases, the fraction of the run time a processor spends waiting decreases. This is manifested as better relative performance of the FSFT approach for the more complex problems. Neither the PSFT nor the PSVT approach required synchronization during the course of the reaction simulation, and the nearly ideal observed efficiency reflects the ideally parallel nature of the independent system approach.

6.3. Application of the PSVT algorithm to a complex resid

The preceding problems have been simple or relatively simple systems. The forte of the Monte Carlo approach is its facility for handling complex reaction systems. The simple systems were studied in order to develop the mechanisms for dealing with kinetically coupled systems and their subsequent parallelization. To show that the algorithms are applicable to complex systems, we have implemented a resid pyrolysis model using the parallel
PSVT approach on the TC2000. The resid model is discussed in detail elsewhere (Petti et al., 1993; Trauth et al., 1993); a brief summary of the reaction model follows.

The resid structure was a statistical representation of experimentally determined attributes such as molecular weight, hydrogen-to-carbon ratio, NMR, SARA fractions, and simulated distillation (Trauth et al., 1993). The resid reaction chemistry was based on the pyrolysis chemistry of four model compounds: pentadecyl benzene, tridecyl cyclohexane, decyl tetralin, and octane. The pyrolysis chemistry exhibited by this set of compounds defined the types of reactive sites that were incorporated into the resid model. The types of reactions included dealkylation, dehydrogenation of naphthenic rings, and paraffin and olefin cracking. Mapping of the chemical sites represented by the four model compounds to the resid species was based on a series of rules and reactivity-to-structure correlations. The rules for the most basic site recognition were:

- a substituted aromatic ring was considered a pentadecyl benzene moiety;
- a substituted naphthenic ring was considered a decyl tetralin moiety if there were any aromatic rings; otherwise, it was considered a tridecyl cyclohexane moiety;
- a paraffin or olefin molecule was considered an octane moiety.
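The three site-recognition rules above amount to a small decision procedure. The following sketch is illustrative only: the `Molecule` representation and function names are hypothetical, and the per-site classification of the original model is collapsed to a per-molecule check for brevity.

```python
# Illustrative sketch of the three site-recognition rules listed above.
# The Molecule representation is hypothetical; the original resid model
# stored structural attributes sampled from probability distributions.
from dataclasses import dataclass

@dataclass
class Molecule:
    aromatic_rings: int
    naphthenic_rings: int
    is_paraffin_or_olefin: bool = False

def base_moiety(mol: Molecule) -> str:
    """Map a molecule's substituted ring (or chain) to a model-compound moiety."""
    if mol.is_paraffin_or_olefin:
        return "octane"                      # rule 3: paraffin/olefin -> octane moiety
    if mol.naphthenic_rings > 0:
        # rule 2: naphthenic ring -> decyl tetralin if any aromatic rings present,
        # otherwise tridecyl cyclohexane
        return "decyl tetralin" if mol.aromatic_rings > 0 else "tridecyl cyclohexane"
    if mol.aromatic_rings > 0:
        return "pentadecyl benzene"          # rule 1: substituted aromatic ring
    return "unreactive"
```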
[Bar chart: parallel efficiency (80% to 105%) of the FSFT, PSFT and PSVT approaches for the A1 ↔ B1, A2 ↔ B2, A3 ↔ B3 LHHW problem and the 1-5 member ring hydrogenation problem.]

Fig. 21. Comparison of the parallel speedup for the various Monte Carlo approaches on the TC2000.
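The speedup and efficiency compared in Fig. 21 follow the usual definitions, S_N = T_1/T_N and E_N = S_N/N (see the Nomenclature). A minimal helper, with an illustrative (not measured) timing in the comment:

```python
# Parallel speedup and efficiency as used in Figs 21 and 23:
#   S_N = T_1 / T_N  (single-processor run time over N-processor run time)
#   E_N = S_N / N    (fraction of ideal linear speedup achieved)

def speedup(t1: float, tn: float) -> float:
    """S_N for a run taking t1 seconds on 1 node and tn seconds on N nodes."""
    return t1 / tn

def efficiency(t1: float, tn: float, n: int) -> float:
    """E_N: speedup normalized by the number of processors."""
    return speedup(t1, tn) / n

# Hypothetical example: 100 s on 1 node, 3.5 s on 32 nodes gives
# a speedup of about 28.6 and an efficiency of about 89%.
```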
[Plot: distributions against carbon number (100 to 1000); ordinate ticks 0.01 to 0.08.]

Fig. 22. The change in the resid molecular weight as a function of the reaction time. This particular result is for a Hondo resid at 425°C.
Relationships that accounted for the presence of aromatic rings and alkyl sidechains attached to the sites identified by the preceding rules were used to modify the base-case reactivities.

The simulation constructed 10,000 molecules by sampling the structural probability distributions used to represent the resid feed. Reaction of the resulting sample was based on the reactive sites described above. Because of the molecularly explicit nature of the stochastic simulation, many choices exist for representing the output of the resid model. As an example, Fig. 22 gives the molecular weight distribution of a Hondo resid due to thermal pyrolysis at 425°C.

Turning to the parallel resid implementation, Fig. 23 shows the parallel speedup of the PSVT model with 500 pseudo-systems of 20 molecules as a function of the number of processors. Available random-access memory limited the partial system size to 50 molecules or fewer; results for 20- and 50-molecule subsystems were equivalent, so the 20-molecule subsystem was chosen. The speedup observed for this real problem is very good through the maximum of 32 nodes. The limited fall-off at higher numbers of nodes was attributed to contention: as the number of processors is increased, the likelihood that two will write back to the parent node at the same time increases. The success of this application indicates that the algorithms developed for the much simpler chemical models described earlier are readily scalable to the target full-scale complex simulations.

7. CONCLUSION
The Monte Carlo simulation of kinetically coupled reaction systems is not only feasible but in fact motivates the use of independent subsystems and hence of parallel architectures. For the simple kinetic models examined here, implementation of the partial system fixed-time step and partial system variable-time step methods was very successful on the BBN TC2000: excellent speedup was obtained for both partial system methods, a result that emphasizes the intrinsic parallel nature of the independent partial systems. The full system fixed-time step method was also efficiently implemented on the CM-2 for the simple kinetic models; parallel implementations of the FSFT are best suited to the SIMD parallel architecture. The mapping of the molecules to the CM-2 processors and the single instruction sequence
[Plot: speedup versus 1, 2, 4, 8, 16 and 32 processors.]

Fig. 23. The parallel speedup of the PSVT model with 500 pseudo-systems of 20 molecules as a function of the number of processors. The individual data labels are the percent speedup with respect to the single-processor run.
nature of the algorithm are a good match to the SIMD paradigm. Unfortunately, this result is seen to hold only for simple kinetic systems. The attempt to extend the FSFT approach to complex model systems failed because the focus of the reaction tests in complex model systems is the reactive molecular sites, while the data mapped to the parallel processors are the system molecules. This data mismatch requires a reaction test implementation that achieves poor utilization of the CM-2 processors.

A comparison of the performance of the two parallel architectures could only be made based on the simple hydrogenation model problem. The run time of the hydrogenation problem on the CM-2 was too short (a few seconds) in the absence of any kinetic delay. Comparison of the results indicates that the SIMD parallel architecture is less appropriate for the stochastic structure of the complex reaction simulation.
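The variable-time step at the heart of the FSVT/PSVT methods follows Gillespie's stochastic simulation algorithm (Gillespie, 1976). The following single-reaction sketch is illustrative, not the production implementation; the full model draws a reaction event and waiting time for each (partial) system of molecules:

```python
# Minimal variable-time-step (Gillespie, 1976) kernel for one reversible
# reaction A <-> B with pseudo-first-order constants kf and kr.
import math
import random

def ssa_step(nA, nB, kf, kr, rng=random):
    """Advance one reaction event; returns (dt, new_nA, new_nB)."""
    a1, a2 = kf * nA, kr * nB          # propensities of A -> B and B -> A
    a0 = a1 + a2
    if a0 == 0.0:
        return math.inf, nA, nB        # nothing left to react
    dt = -math.log(rng.random()) / a0  # exponentially distributed waiting time
    if rng.random() * a0 < a1:         # pick the event in proportion to its propensity
        return dt, nA - 1, nB + 1      # A -> B fired
    return dt, nA + 1, nB - 1          # B -> A fired
```

The fixed-time-step (FSFT/PSFT) variants instead test each molecule against a reaction probability over a constant Δt, which is what forces the per-step synchronization discussed above.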
ACKNOWLEDGEMENTS

We would like to thank the UNIDEL Foundation for the support that enabled the purchase of the TC2000. We also would like to acknowledge the support of Professor R. B. Pipes during the initial phase of our research into parallel computing. We are also grateful to Argonne National Laboratory for allowing us use of their TC2000. Lastly, we would like to acknowledge the support received from the Pittsburgh Supercomputing Center in the form of a startup grant, which enabled our use of their CM-2.
NOMENCLATURE

LHHW rate expression symbols:
  k_s,i = surface reaction rate constant for reaction i
  K_X,i = adsorption equilibrium constant for component X_i
  C_X,i = concentration of component X_i
  K_i = overall equilibrium constant for reaction i

E_N = parallel efficiency
FSFT = full system fixed-time algorithm
FSVT = full system variable-time algorithm
k' = pseudo-first-order rate constant
M = number of molecules per partial system
MIMD = multiple instruction, multiple data parallel architecture
N = number of processors
N_t = total number of molecules for a given simulation
P = number of partial systems used in a PS(FT/VT) simulation
P_ij = probability of going from state i to state j
PSFT = partial system fixed-time algorithm
PSVT = partial system variable-time algorithm
RN = uniformly distributed random number on the [0, 1] interval
SIMD = single instruction, multiple data parallel architecture
S_N = parallel speedup
Δt = fixed-time step size
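The LHHW rate expression symbols in the Nomenclature combine, in the standard single-site form, into the rate used for the A_i ↔ B_i prototype problem. The sketch below assumes that standard form; it is consistent with the symbols defined here but may differ in detail from the exact expression used in the original simulations. The shared adsorption denominator is what couples the three reactions kinetically:

```python
# Standard single-site LHHW net rate for the coupled reversible reactions
# A_i <-> B_i (i = 0..2). A sketch assuming the textbook form:
#   r_i = k_s,i (K_A,i C_A,i - K_B,i C_B,i / K_i) / (1 + sum_j (K_A,j C_A,j + K_B,j C_B,j))
# The shared denominator couples all three reactions through site competition.

def lhhw_rate(i, ks, KA, KB, Keq, CA, CB):
    """Net rate of A_i -> B_i; ks, KA, KB, Keq, CA, CB are length-3 sequences."""
    denom = 1.0 + sum(KA[j] * CA[j] + KB[j] * CB[j] for j in range(len(CA)))
    return ks[i] * (KA[i] * CA[i] - KB[i] * CB[i] / Keq[i]) / denom
```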
REFERENCES
Eager D. L., J. Zahorjan and E. D. Lazowska, Speedup versus efficiency in parallel systems. IEEE Trans. Computers 38, 408-423 (1989).
Flynn M. J., Very high speed computing systems. Proc. IEEE 54, 1901-1909 (1966).
Gillespie D. T., A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comput. Phys. 22, 403-434 (1976).
Gustafson J. L., Reevaluating Amdahl's law. Commun. ACM 31, 532-533 (1988).
Hack J. J., On the promise of general-purpose parallel computing. Parallel Comput. 10, 261-275 (1989).
Hillis W. D., The Connection Machine. MIT Press, Cambridge, Mass. (1989).
Hockney R. W. and C. R. Jesshope, (r∞, n1/2) measurements on the 2-CPU Cray X-MP. Parallel Comput. 2, 1-14 (1985).
Hord M. R., Parallel Supercomputing in SIMD Architectures. CRC Press, New York (1990).
McDermott J. B., C. Libanati, C. LaMarca and M. T. Klein, Quantitative use of model compound information: Monte Carlo simulation of the reactions of complex macromolecules. I&EC Res. 29, 22-29 (1990).
Neurock M., A computational chemical reaction engineering analysis of complex heavy hydrocarbon reaction systems. PhD thesis, University of Delaware (August 1992).
Neurock M., S. M. Stark and M. T. Klein, Molecular simulation of kinetic interactions in complex mixtures. In E. R. Becker and C. J. Pereira (Eds): Computer Aided Design of Catalysts. Marcel Dekker, New York (1992a).
Neurock M., S. M. Stark and M. T. Klein, Molecular simulation of kinetic interactions in complex mixtures. In E. R. Becker and C. J. Pereira (Eds): Computer Aided Design of Catalysts. Marcel Dekker, New York (1992b).
Petti T. F. et al., CPU issues in the representation of the molecular structure of petroleum resid through characterization, reaction, and Monte Carlo modelling. In Symposium on Resid Upgrading, March 28-April 2, Denver, ACS Meeting (1993).
Skjellum A., Concurrent dynamic simulation: multicomputer algorithms research applied to ordinary differential-algebraic process systems in chemical engineering. PhD thesis, California Institute of Technology (May 1990).
Stark S. M., An investigation of the applicability of parallel computation to demanding chemical engineering problems. PhD thesis, University of Delaware (August 1992).
Stark S. M., M. Neurock and M. T. Klein, Strategies for modelling kinetic interactions in complex mixtures: Monte Carlo algorithms for MIMD parallel architectures. Chem. Engng Sci. (1993).
Trauth D. M. et al., Representation of the molecular structure of petroleum resid through characterization and Monte Carlo modelling. In Symposium on Resid Upgrading, March 28-April 2, Denver, ACS Meeting (1993).