Analysis of communication costs for domain decomposed Monte Carlo methods in nuclear reactor analysis


A. Siegel a,*, K. Smith b,1, P. Fischer a, V. Mahadevan a

a Argonne National Laboratory, Theory and Computing Sciences, 9700 S Cass Ave., Argonne, IL 60439, United States
b Studsvik Scandpower Inc., 1087 Beacon St. #301, Newton Center, MA 02459, United States


Article history: Received 4 July 2011; Received in revised form 1 December 2011; Accepted 17 December 2011; Available online 27 December 2011

Keywords: Monte Carlo; Neutron transport; Reactor analysis; Performance modeling

Abstract: A domain decomposed Monte Carlo communication kernel is used to carry out performance tests to establish the feasibility of using Monte Carlo techniques for practical Light Water Reactor (LWR) core analyses. The results of the prototype code are interpreted in the context of simplified performance models which elucidate key scaling regimes of the parallel algorithm.

1. Introduction

Neutron particle transport methods have broad applicability across a range of important application areas, including, for example, nuclear fusion [1], cloud physics [2], medical dosimetry [3], weapons design, and nuclear reactor analysis [4]. The governing equations and underlying theory of transport are the same in each case, but computational details can vary significantly, yielding different numerical approaches and implementation strategies for each domain. In nuclear reactor analysis, for example, deterministic rather than Monte Carlo transport methods have traditionally been preferred for most practical applications. Reactor analysts are concerned with detailed spatial distributions of a number of key quantities, e.g. neutron fluxes, reaction rates, and detailed isotopics. With Monte Carlo methods these quantities converge slowly with increasing particle histories, requiring tremendous computing resources to reach a level of statistical convergence deemed adequate for most design and safety analyses [5,6]. Furthermore, transient effects, multi-physics coupling, depletion calculations, and uncertainty quantification techniques are relatively well understood for deterministic formulations, but to varying degrees they remain subjects of research for Monte Carlo approaches [7]. Monte Carlo methods have thus been reserved for a smaller subset of applications, e.g. shielding calculations, the analysis of smaller reactor cores, and "gold-standard" benchmarks that serve as verification cases for deterministic methods.

Looking to the future, however, Monte Carlo methods have a number of potential advantages that make them an attractive option for a much broader range of reactor calculations, and make their continued performance enhancement an important topic of research. The avoidance of complex meshing for complicated geometries, the simplification of the cumbersome


multi-group cross section generation process, the elimination of numerical artifacts (e.g. ray effects), and, perhaps most importantly, the potentially more natural mapping to beyond-petascale architectures are a few potential advantages that make Monte Carlo a promising path to pursue.

In this paper we carry out a feasibility study of using Monte Carlo transport for robust reactor core analysis. The key question is: how far are current leadership-class machines from carrying out practical Monte Carlo reactor physics analyses, and what machine characteristics and algorithmic approaches both limit and enable their performance? Performance models are derived that allow us to interpret a set of simplified benchmark calculations in the context of key machine characteristics, giving important practical answers for present machines as well as a hint of the issues facing pending beyond-exascale architectures.

2. Background

A good point of reference for the present analysis is the set of recent parallel Monte Carlo reactor core calculations carried out by Kelly et al. [8] at Knolls Atomic Power Laboratory (KAPL) and presented at PHYSOR 2010. Their largest calculation used a total of 5.2 × 10^10 neutron histories to compute power distributions on the simplified Hoogenboom and Martin [9] LWR core model. Approximately 70,000 pins × 100 axial meshes × 70 isotopes × 2 reaction types ≈ 10^9 tallies were calculated, corresponding to approximately 13 GB of overall memory use. The calculation consumed approximately 1700 node-hours: 18 total hours of simulation time using 400 cpu-cores on a cluster of Intel Xeon E5530 quad-core processors with a 2.4 GHz clock. While not necessarily large by HPC standards for resource consumption, this calculation still stands as the largest and most credible practical LWR Monte Carlo calculation ever reported. However, as the authors acknowledge, it still falls short of what is desired for robust LWR analysis. There are two key aspects to this that we touch on here.

First, the number of tallies is roughly two orders of magnitude less than what is desired. A more realistic target for LWR analysis would include approximately 10 radial rings per fuel pin to resolve Gd depletion and temperature distributions, 350 instead of 70 isotopes, and both value and variance tallies: a factor of approximately 100 more tallies, or roughly 1 TB of memory. Second, carrying out pseudo-steady-state reactor depletion studies would require hundreds of such calculations, and the corresponding transient reactor dynamics calculations would require thousands more. Thus, even if one assumes perfect scalability, the above time-to-solution estimate grows by at least five orders of magnitude, to something near 10^8 CPU-hours to solve one actual problem of interest. Finally, we point out that Smith [5] proposed a 1% two-sigma confidence in the flux predictions (vs. 2% reported in [8]), and in general approximately 50 temperature intervals are desired in the cross section library representations, requiring significantly more storage.

The full-scale LWR Monte Carlo problem is thus a significant HPC challenge. While memory requirements are not extreme, they still far exceed what will be available on a single computing node for the foreseeable future. Furthermore, to meet the 100 million CPU-hour time-to-solution estimate, much larger processor counts will be required to make Monte Carlo solutions practical.
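The tally and memory arithmetic above is easy to reproduce. The following sketch walks through it; the 8-bytes-per-tally figure (one double-precision value per tally) is our assumption, not a number from [8].

```python
# Back-of-envelope reproduction of the tally estimates discussed above.
pins, axial, isotopes, reactions = 70_000, 100, 70, 2
kapl_tallies = pins * axial * isotopes * reactions
print(f"KAPL-scale tallies: {kapl_tallies:.1e}")        # ~1e9

# More realistic LWR target: 10 radial rings per pin, 350 isotopes
# instead of 70, and both value and variance tallies -> ~100x more.
target_tallies = kapl_tallies * 10 * (350 / 70) * 2
print(f"Target tallies:     {target_tallies:.1e}")      # ~1e11

bytes_per_tally = 8  # assumed: one double-precision value per tally
memory_tb = target_tallies * bytes_per_tally / 1e12
print(f"Target memory:      ~{memory_tb:.1f} TB")       # ~0.8 TB,
# consistent with the "roughly 1 TB" estimate quoted in the text.
```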
Unfortunately, there is no clear, simple path to scaling up existing reactor Monte Carlo codes. This is largely because the vast majority of reported reactor analysis production codes are parallelized by simple replication, where the main data structures on each processor grow in proportion to the global problem size. Beyond a relatively small number of processors, global tallies and particle data structures easily exhaust available memory. Both larger processor counts and a more sophisticated decomposition strategy are needed to make such calculations practical.

Implementing physical-space domain decomposition is an attractive potential alternative, where each processing element owns a subset of the physical domain and computes tallies only for the neutrons that interact within its boundaries. A number of researchers have in fact explored various forms of domain decomposition for general Monte Carlo transport codes (see e.g. [10-12]). However, these codes have not specifically targeted reactor physics applications, and whether such an approach would be successful for reactor analysis remains an open question. On the surface, the prospects for performance do not appear good: local neutron leakage rates are large, and fast neutrons tend to visit a significant fraction of the global domain before absorption. This leads to significant non-locality in the computation and potentially large communication costs. In this paper we attempt to quantify this cost, estimating the performance requirements of full-scale LWR applications using a combination of performance modeling and simplified numerical experiments.

3. Analysis

We aim to roughly estimate the scale of the communication costs of domain decomposition as described above. Here we study the idealized case of a perfectly load-balanced solution: an initially equal distribution of particles across partitions and statistically equivalent material properties. While load balancing may represent a practical barrier to scalability for LWR analysis (although LWRs are relatively homogeneous in a statistical sense, so load balancing is likely to be significantly less challenging than in other Monte Carlo application areas), its impact should be considered as a departure from this idealized case. The tests are based on a parallel Monte Carlo communication kernel, i.e. with the local physics removed and the application only carrying


out the communication phase of the algorithm. To understand and successfully extrapolate these results, we first carry out a detailed analysis of the characteristics of the communication costs.

3.1. Derivation of performance model

To quantify the ideas presented above, we begin with a domain $\Omega$ segmented into $n$ non-overlapping partitions $\Omega = \bigcup_{i=1}^{n} \omega_i$. $P_0$ particles are initially present on each $\omega_i$, so that there are $P_0^g = nP_0$ particles globally. For simplicity, assume that each partition is then mapped to a single processing element on a three-dimensional virtual Cartesian topology of $n$ processors. In each partition, the particles move until they are absorbed or reach a processor boundary. Particles that reach the boundary are buffered locally until all local particle trajectories are computed, and are subsequently exchanged with neighbor processors. The particle exchange phase of this process is referred to as a stage, and a complete set of stages (i.e. until all particles are absorbed) is referred to as a cycle. At the end of each cycle, the fission source is updated and the next cycle begins. Define the particle leakage fraction $\lambda_i$ on each $\omega_i$ for a given stage as

$$\lambda_i = \frac{\text{average number of particles leaving } \omega_i}{\text{total number of particles starting in } \omega_i}.$$

If we make the estimate that $\lambda_i = \lambda = \text{constant}$, then, using reflective boundary conditions, after $k$ stages the number of particles remaining globally is roughly

$$P_k^g = \lambda^k P_0^g, \qquad (1)$$

and the number of stages $k_c$ for a cycle to complete is then estimated by

$$k_c = -\frac{\log P_0^g}{\log \lambda}. \qquad (2)$$
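Eq. (2) is cheap to evaluate. As an illustrative sketch (using leakage fractions that appear later in Table 1):

```python
import math

def stages_per_cycle(P_global: float, lam: float) -> float:
    """Estimated stages per cycle, Eq. (2): k_c = -log(P_0^g)/log(lambda)."""
    return -math.log(P_global) / math.log(lam)

P_global = 1e10  # target global particle count used throughout the paper
for lam in (0.33, 0.50, 0.67):
    print(f"lambda = {lam:.2f}: k_c ~ {stages_per_cycle(P_global, lam):.0f}")
# Gives k_c ~ 21, 33, and 57, in reasonable agreement with the measured
# stage counts in Table 1 (22, 37, and 56).
```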

Since particles are communicated only to each processor's six neighbors on a standard 3-D Cartesian grid, the total cycle communication time for nearest-neighbor particle exchanges can be approximately modeled as

$$t_c = 6\alpha k_c + \beta P_0 \sum_{k=1}^{k_c} \lambda^k, \qquad (3)$$

where $\alpha$ is a measure of the application-level latency and $\beta$ the effective inverse bandwidth for nearest-neighbor exchanges, in units of $\left[\frac{\text{time}}{\text{particles}}\right]$. Using (2), (3) can be written as

$$t_c = -6\alpha\frac{\log P_0^g}{\log\lambda} + \beta P_0 \sum_{k=1}^{k_c}\lambda^k. \qquad (4)$$

For typical values of $\lambda$ the series $\sum_{k=1}^{k_c}\lambda^k$ approaches its asymptotic value $\frac{\lambda}{1-\lambda}$, and thus (4) can be approximated as

$$t_c = -6\alpha\frac{\log P_0^g}{\log\lambda} + \beta P_0\frac{\lambda}{1-\lambda}. \qquad (5)$$
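That the truncated series is already at its asymptote for typical $\lambda$ is easy to verify numerically; a quick sketch (the $\lambda$ values are illustrative):

```python
# Verify that the truncated series in (4) has reached its asymptote
# lambda/(1 - lambda) for representative leakage fractions.
import math

P_global = 1e10
for lam in (0.33, 0.50, 0.67):
    k_c = int(-math.log(P_global) / math.log(lam))
    partial = sum(lam**k for k in range(1, k_c + 1))
    asymptote = lam / (1.0 - lam)
    print(f"lambda = {lam:.2f}: partial sum = {partial:.10f}, "
          f"asymptote = {asymptote:.10f}")
# The truncation error is O(lambda^{k_c}) ~ 1/P_0^g, i.e. negligible.
```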

For practical reactor analysis our main interest is the strong scaling case. That is, for a fixed global problem size defined by $P_0^g$, we seek to understand the communication performance as a function of partition size $\ell$. For fixed $P_0^g$ and a given global domain size $L$, the initial number of particles on a partition is

$$P_0 = P_0^g\left(\frac{\ell}{L}\right)^3 \equiv P_0^g\,\zeta^3, \qquad (6)$$

where the non-dimensional partition size $\zeta = \ell/L$ has been introduced. Thus, (5) becomes

$$t_c = -6\alpha\frac{\log P_0^g}{\log\lambda} + \beta P_0^g\,\zeta^3\,\frac{\lambda}{1-\lambda}. \qquad (7)$$

Finally, we seek a relationship between the leakage rate $\lambda$ and partition size $\ell$. For both large and small values of $\Sigma_t\ell$, a reasonable estimate (e.g. [13,14]) of this dependence for a uniform source is given by Wigner's rational approximation (see, e.g., Eq. 5.6.2.1 in [15] and Eq. 5.1.1.3 in [16]), which for a cubic geometry can be expressed as



$$\lambda = \frac{1}{1 + \Sigma_t(E)\,\ell}, \qquad (8)$$

where $\Sigma_t$ is the total macroscopic cross section for the partition, and $E$ denotes the neutron energy. Substituting (8) into (7) gives an expression for the total communication time as a function of $\zeta$:


$$t_c = \frac{6\alpha\log P_0^g}{\log(\Delta^{-1}\zeta + 1)} + \beta P_0^g\,\Delta\,\zeta^2, \qquad (9)$$

where $\Delta \equiv \mu/L$ is the ratio of the neutron mean free path $\mu = \Sigma_t^{-1}$ to the global domain size. To evaluate (9) we obviously need an estimate for $\Delta$. This can be done in one of two ways: (1) by carrying out Monte Carlo LWR analysis and fitting leakage rate measurements to (8), or (2) with an energy-group averaged approximation of the mean free path based on a specified core composition. We emphasize that both approaches are approximations, but within reasonable bounds their precise value does not impact the main conclusions of the paper. Using technique (1), based on simplified LWR calculations carried out with the PRIDE code, we found an excellent fit to the Wigner approximation with a resulting value of $\Delta = 0.05$. We consider this reasonably consistent with the second approach based on our LWR analysis experience, and have elected to use this value for the remainder of the analysis. For different scenarios, other calibrations of the Wigner approximation can be plugged in.

Finally, since we are interested in small values of $\zeta = \ell/L$, a good analytical approximation to the location of the minimum is obtained by expanding $\log(\Delta^{-1}\zeta + 1)$ to first order around $\zeta = 0$:

$$t_c \approx \frac{6\alpha\Delta\log P_0^g}{\zeta} + \beta P_0^g\,\Delta\,\zeta^2, \qquad (10)$$

whose global minimum yields the following simple expression for the optimal partition width:

$$\zeta_{\text{opt}} \approx \left(\frac{3\alpha\beta^{-1}\log P_0^g}{P_0^g}\right)^{1/3}. \qquad (11)$$
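Equations (9) and (11) are simple to evaluate numerically. The sketch below does so for illustrative machine parameters $\alpha$ and $\beta$ (assumed values, not measurements of any particular machine) and confirms that the analytical optimum (11) coincides with a brute-force minimization of (9).

```python
import math

def t_c(zeta, alpha, beta, P_global=1e10, Delta=0.05):
    """Cycle communication time, Eq. (9)."""
    latency = 6 * alpha * math.log(P_global) / math.log(zeta / Delta + 1)
    bandwidth = beta * P_global * Delta * zeta**2
    return latency + bandwidth

def zeta_opt(alpha, beta, P_global=1e10):
    """Optimal partition width, Eq. (11)."""
    return (3 * (alpha / beta) * math.log(P_global) / P_global) ** (1 / 3)

# Assumed application-level parameters: 10 us latency, and an inverse
# bandwidth of 0.1 us per particle. These are illustrative only.
alpha, beta = 1e-5, 1e-7

grid = [k * 1e-4 for k in range(1, 1000)]          # zeta in (0, 0.1)
brute = min(grid, key=lambda z: t_c(z, alpha, beta))
print(f"Eq. (11): zeta_opt ~ {zeta_opt(alpha, beta):.4f}")   # ~0.0088
print(f"brute force:        {brute:.4f}")                    # ~0.0088
# Consistent with Section 3.2: for alpha/beta ratios of this order the
# minimum falls below 0.01, i.e. beyond ~10^6 processors when scaling
# strongly at fixed global problem size.
```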

3.2. Analysis of performance model

Eq. (11) reveals that for a fixed global problem size $P_0^g$, the optimum partition size depends only on the ratio of latency to inverse bandwidth, a surprising result that has been observed for other models (see e.g. the definition of $m_2$ in [17]). Decompositions with partition size larger than $\zeta_{\text{opt}}$ are bandwidth limited, and those with partition size smaller than $\zeta_{\text{opt}}$ are latency dominated. Fig. 1 shows this behavior together with the local minimum predicted by (11) for a reasonable range of values of $\alpha$ and $\beta$, using our target value of $P_0^g = 10^{10}$ and $\Delta = 0.05$. We view $\alpha$ and $\beta$ as application-level parameters that can be measured for a particular machine, related to but more complex than the reported theoretical peak values. A good estimate for this communication pattern can be obtained, for example, from reported Halo benchmark [18] results. Here we do not attempt to precisely characterize specific machines, but rather note that such a characterization is feasible in a practical sense. In the parameter range of interest, (11) does a very good job of predicting the local minimum, which is clearly seen to be identical for common values of $\alpha/\beta$. Furthermore, these local minima are all less than 0.01, which corresponds to a strong scaling processor limit of $n = 1/\zeta^3 = 10^6$, i.e. $10^{10}/10^6 = 10^4$ particles per processor.

Focusing on the bandwidth-dominated regime, the dependence of the leakage factor $\lambda$ on partition size gives rise to a number of interesting effects that are apparent from (9). To understand scaling behavior in this regime, it is convenient to rewrite (9) using the processor count $n$ in place of the partition size $\zeta$. Assuming one physical-space partition is assigned to each processor, then $\zeta = n^{-1/3}$ and

Fig. 1. Evaluation of the performance model (Eq. (9)) for values of $\alpha = [10^{-5}, 10^{-4}, 10^{-3}]$ from bottom to top and a range of $\beta$. Dashed and solid lines indicate identical values of $\alpha/\beta$, and the dots indicate the local minimum value predicted by Eq. (11).


$$t_c = \text{latency term} + \beta P_0^g\,\Delta\,n^{-2/3}. \qquad (12)$$

As expected, in the strong scaling case ($\partial P_0^g/\partial\zeta = 0$), for fixed values of $\alpha$ and $\beta$ the per-processor bandwidth term decreases with increasing processor count (decreasing $\zeta$). However, the communication time decreases as $n^{-2/3}$ rather than $n^{-1}$, which would be the case if leakage rates did not vary with partition size. Since local work scales as $t_l \sim P_0 = P_0^g\zeta^3 = P_0^g n^{-1}$, the ratio of local work to communication time in the bandwidth-dominated regime is given by

$$\frac{t_l}{t_c} \sim n^{-1/3} \quad \text{(strong scaling)}. \qquad (13)$$

That is, for the strong scaling case we ultimately expect an erosion of parallel performance in the large processor limit. Precisely where this transition occurs is discussed further in the next section.

For weak scaling problems ($\partial P_0/\partial\zeta = 0$), which may be of interest for certain applications, the bandwidth term is not constant but rather scales as $\beta P_0\Delta/\zeta$. That is, as the number of processors is increased, $\zeta$ decreases and, since $\lambda$ increases, the total amount of data sent by a processor increases as well. Thus, weak scaling is given by

$$\frac{t_l}{t_c} \sim n^{-2/3} \quad \text{(weak scaling)}. \qquad (14)$$

3.3. Numerical experiments

To test these results on an existing machine, a very simple domain-decomposed Monte Carlo communication kernel was developed to gather communication performance results. The kernel carries out a simple three-dimensional decomposition on a rectangular global geometry, assigning one partition element to each processor. No physics is represented in the code. Instead, each processor begins by looping over its $P_0$ particles, randomly determining whether each particle is absorbed or reaches one of the six processor boundaries. The absorption probability is assigned using an input leakage fraction $\lambda$, which is chosen to match results from full-physics Monte Carlo codes, i.e. as described by (8). Each particle occupies 64 bytes of memory, designed to mimic a basic MC code carrying three components each of position and velocity, a scalar energy, and a particle ID. Particles that reach processor boundaries are buffered until the end of each stage, at which point a single nearest-neighbor data exchange is carried out with all six neighbors. As described in Section 3.1, approximately $k_c = -\log P_0^g/\log\lambda$ stages are executed for a given cycle. Since it is not our goal to iterate over a true fission source, the simulation terminates after a single cycle.

The kernel is designed to mimic our model problem with specified $\lambda$, and gives a rough estimate of the expected scale of message-passing time for realistic particle counts at the high processor counts of interest. Non-blocking MPI communication was used to carry out the message passing. A number of additional global meta-data communication calls are required as well to comply with MPI semantics; these, however, only trivially affect the overall communication time and are not discussed in detail here. A minimal sketch of the kernel's stage loop is given below.
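The kernel itself is not reproduced in this paper; the following mpi4py sketch is our reconstruction of one stage of the algorithm just described, under stated assumptions (a uniform leak probability $\lambda$ per stage, uniformly random exit faces, and 8-double particle records). For brevity it uses a simple blocking pairwise exchange in place of the non-blocking communication described above.

```python
# Reconstructed sketch of the communication kernel's stage/cycle loop.
# Assumptions (ours, not the authors'): each particle leaks with
# probability lam per stage, exits a uniformly random face, and is an
# 8-double record (position x3, velocity x3, energy, particle ID).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 3)        # 3-D processor grid
cart = comm.Create_cart(dims, periods=[True] * 3)  # periodic wraparound as
                                                   # a stand-in for physics

def run_stage(particles, lam, rng):
    """Leak or absorb each particle, then exchange leaked particles
    with the six Cartesian neighbors; return the received particles."""
    leaked = rng.random(len(particles)) < lam
    face = rng.integers(0, 6, size=len(particles))
    received = []
    for direction in range(6):
        axis, sign = divmod(direction, 2)
        src, dst = cart.Shift(axis, 1 if sign else -1)
        outgoing = particles[leaked & (face == direction)]
        received.append(cart.sendrecv(outgoing, dest=dst, source=src))
    return np.concatenate(received)

rng = np.random.default_rng(seed=cart.Get_rank())
particles = np.zeros((10_000, 8))  # scaled-down stand-in for P_0
stages = 0
while comm.allreduce(len(particles)) > 0:  # one cycle: drain all particles
    particles = run_stage(particles, lam=0.5, rng=rng)
    stages += 1
if cart.Get_rank() == 0:
    print(f"cycle completed in {stages} stages")
```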



The numerical experiments reported here follow our target problem of $P_0^g = 10^{10}$ for a range of processor counts and partition sizes. Referring back to (8), we reiterate that our numerical experiments indicate $\Delta = \mu/L \approx 0.05$, so that substituting this value into (8) yields

$$\lambda = \frac{1}{20\zeta + 1}, \qquad 0 < \zeta < 0.05. \qquad (15)$$

Thus, for $\zeta = 0.05$ the processor grid is 20 × 20 × 20, the total number of processors is n = 8000, and the number of particles per processor is $P_0 = 1.25 \times 10^6$. Problems for decreasing $\zeta$ can be understood relative to this reference case. The numerical experiments were carried out on Argonne's Blue Gene/P computer. To ensure robustness, the results are averaged over multiple independent executions and tested for several communication strategies.

As discussed previously, the key parameter is the ratio of the measured communication time to local work. The latter can be estimated with reference to Kelly et al. [8], whose results indicate that the MC21 code calculates global neutron trajectories at a rate of approximately 2000 neutrons/s. This allows a direct comparison with the communication costs for an equivalent domain decomposed problem.

Table 1. Total communication times $t_c$ for numerical experiments run on BG/P using the parallel Monte Carlo communication kernel described in the main narrative. All experiments use $P_0^g = 10^{10}$, with each particle occupying 64 bytes of storage.

P0           ζ       n       λ     k_c   t_c (s)   est. t_l (s)   est. t_c/t_l
10,000,000   .10     1000    .33   22    30.41     5000           .006
1,250,000    .05     8000    .50   37    7.58      625            .012
640,000      .04     15625   .55   43    4.63      320            .014
370,370      .0333   27000   .60   47    3.18      185            .017
156,250      .025    64000   .67   56    2.75      78             .035
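The last two columns of Table 1 are derived rather than measured: est. $t_l$ assumes the approximately 2000 neutrons/s tracking rate quoted above from Kelly et al. [8]. A short sketch reproduces them from the measured $t_c$ values:

```python
# Reproduce the derived columns of Table 1 from the measured t_c values,
# using the ~2000 neutrons/s/core tracking rate reported for MC21 [8].
rows = [  # (P0, n, measured t_c in seconds)
    (10_000_000,  1_000, 30.41),
    ( 1_250_000,  8_000,  7.58),
    (   640_000, 15_625,  4.63),
    (   370_370, 27_000,  3.18),
    (   156_250, 64_000,  2.75),
]
RATE = 2000.0  # neutrons tracked per second per core
for P0, n, tc in rows:
    tl = P0 / RATE
    print(f"n = {n:>6}: est. t_l = {tl:7.1f} s, est. t_c/t_l = {tc/tl:.3f}")
# Matches Table 1: ratios of .006, .012, .014, .017, .035.
```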


Fig. 2. Relative communication time $t_c$ as a function of processor count $n$. The ordinate is multiplied by $n^{2/3}$ so that deviations from the 2/3 scaling law of Eq. (12) are apparent.

The key results are shown in Table 1. In addition to our target global problem, a range of $\zeta$ values was explored over $0.025 \le \zeta \le 0.1$. This sample space was adequate to roughly validate the key aspects of the model and, more importantly, to calibrate the results for BG/P. Most importantly, we note that (12) predicts a 2/3 power-law scaling of the communication time with increasing processor count. To more easily interpret the results in Table 1, the communication time is plotted separately versus processor count in Fig. 2, with the ordinate multiplied by $n^{2/3}$ so that deviations from the scaling law are readily apparent as departures from a horizontal line.

Fig. 2 shows qualitatively good correspondence with the model at low processor counts, but a nontrivial under-prediction at 64,000 processors. There are two possible causes for this behavior: (1) with only O(100K) particles per processor (see Table 1), the latency term in (12) adds a non-trivial penalty to the total communication time, and departures from the 2/3 scaling are expected; (2) a reduction in scalability of the network itself at very high processor counts. More tests at larger processor counts are required to make a final determination of the relative contributions of each effect.

Finally, our principal goal is to estimate the relative cost of communication to local work for our target problem size. Table 1 shows that for the current set of experiments the communication time is, perhaps surprisingly, at most only a few percent of the local physics calculation time, indicating that for the set of parameters studied here good parallel performance can in principle be achieved for such an algorithm. These estimates can easily be extrapolated to other machines and parameter regimes using the model equations.

4. Conclusion

A simple set of numerical experiments was carried out to gauge the feasibility of domain decomposed Monte Carlo methods specifically for reactor analysis. These experiments showed that, for the target problem sizes and parameter ranges of interest, communication costs are a very small fraction of the total execution time. These results are valid for the BG/P platform, but can easily be extended via a simple performance model that predicts the results in terms of the key problem and machine parameters.

Of course, there are gaps between the idealized analysis presented here and real-world reactor Monte Carlo codes. First, we have assumed perfect load balancing in terms of the initial distribution of particles. For practical applications, fission source sites vary widely across the reactor core, and managing load balance may introduce a non-trivial correction to the estimates presented here. Second, depending on the specific application, real-world reactor Monte Carlo codes may for convenience define particle types with significantly greater storage requirements (up to several hundred bytes), resulting in $t_c/t_l$ estimates in the bandwidth-dominated regime up to an order of magnitude greater than what is shown in Table 1. While these ratios are still relatively small, they could no longer be dismissed as trivial. Finally, a full-scale production code on more complicated geometry will involve some additional bookkeeping absent from the Cartesian domain used here.
Notwithstanding these limitations, we believe that the results presented here point to a surprising level of feasibility for efficiently implementing domain-decomposed Monte Carlo algorithms in reactor analysis, and, more importantly, provide a quantitative basis and theoretical foundation for estimating their performance across a range of computing platforms and problem scenarios.

Acknowledgments

The authors deeply appreciate the valuable perspective and detailed suggestions on this work provided by Ben Forget and Paul Romano of the MIT Department of Nuclear Science and Engineering.


References

[1] D. Heifetz, D. Post, M. Petravic, J. Weisheit, G. Bateman, A Monte-Carlo model of neutral-particle transport in diverted plasmas, J. Comput. Phys. 46 (2) (1982) 309–327.
[2] William O'Hirok, Catherine Gautier, A three-dimensional radiative transfer model to investigate the solar radiation within a cloudy atmosphere, Part I: spatial effects, J. Atmos. Sci. 55 (12) (1998) 2162–2179.
[3] D.W.O. Rogers, Fifty years of Monte Carlo simulations for medical physics, Phys. Med. Biol. 51 (13) (2006) R287.
[4] Yousry Azmy, Enrico Sartori, Jerome Spanier, Monte Carlo methods, in: Nuclear Computational Science, Springer, Netherlands, 2010, pp. 117–165.
[5] Kord Smith, Reactor core methods, Invited lecture at the M&C 2003 International Workshop, April 2003.
[6] J.C. Wagner, D.E. Peplow, S.W. Mosher, T.M. Evans, Review of hybrid (deterministic/Monte Carlo) radiation transport methods, codes, and applications at Oak Ridge National Laboratory, in: Joint International Conference on Supercomputing in Nuclear Applications and Monte Carlo, Tokyo, 2010.
[7] Jonghwa Chang, Nam Zin Cho, Some outstanding problems in neutron transport computation, J. Korean Nucl. Soc. 41 (4) (2009) 381–390.
[8] Daniel J. Kelly, Thomas M. Sutton, Timothy Trumbull, Peter S. Dobreff, MC21 Monte Carlo analysis of the Hoogenboom–Martin full-core PWR benchmark problem, in: PHYSOR 2010 – Advances in Reactor Physics to Power the Nuclear Renaissance, 2010.
[9] J. Eduard Hoogenboom, William R. Martin, A proposal for a benchmark to monitor the performance of detailed Monte Carlo calculation of power densities in a full size reactor core, in: Proceedings of the 2009 International Conference on Mathematics, Computational Methods and Reactor Physics, Saratoga Springs, NY, American Nuclear Society, 2009.
[10] Thomas A. Brunner, Todd J. Urbatsch, Thomas M. Evans, Nicholas A. Gentile, Comparison of four parallel algorithms for domain decomposed implicit Monte Carlo, J. Comput. Phys. 212 (2006) 527–539.
[11] Thomas A. Brunner, Patrick S. Brantley, An efficient, robust, domain-decomposition algorithm for particle Monte Carlo, J. Comput. Phys. 228 (2009) 3882–3890.
[12] R.J. Procassini, M.J. O'Brien, J.M. Taylor, Load balancing of parallel Monte Carlo transport calculations, in: International Topical Meeting on Mathematics and Computations, 2005.
[13] N.F. Landers, L.M. Petrie, J.A. Bucholz, The Material Information Processor for SCALE, Sect. M7, in: SCALE: A Modular Code System for Performing Standardized Computer Analyses for Licensing Evaluation, NUREG/CR-200, Rev. 4 (ORNL/NUREG/CSD-2/R4), 1995.
[14] Yu.G. Pashkin, Accuracy of the Wigner approximation, Atom. Energy 28 (1970) 184–185, doi:10.1007/BF01162626.
[15] Allan F. Henry, Nuclear Reactor Analysis, MIT Press, Cambridge, MA, 1975.
[16] Yousry Azmy, Enrico Sartori, R.N. Hwang, Resonance theory in reactor applications, in: Nuclear Computational Science, Springer, Netherlands, 2010, p. 260.
[17] M.O. Deville, P.F. Fischer, E.H. Mund, High-Order Methods for Incompressible Fluid Flow, Cambridge University Press, Cambridge, 2002.
[18] Alan J. Wallcraft, SPMD OpenMP versus MPI for ocean models, Concurrency: Practice and Experience 12 (12) (2000) 1155–1164.