Parallel linear congruential generators with Sophie–Germain moduli

Parallel linear congruential generators with Sophie–Germain moduli

Parallel Computing 30 (2004) 1217–1231 www.elsevier.com/locate/parco Parallel linear congruential generators with Sophie–Germain moduli Michael Masca...

252KB Sizes 0 Downloads 33 Views

Parallel Computing 30 (2004) 1217–1231 www.elsevier.com/locate/parco

Parallel linear congruential generators with Sophie–Germain moduli Michael Mascagni *, Hongmei Chi Department of Computer Science, School of Computational Science, Florida State University, Tallahassee, FL 32306-4530, USA Received 1 April 2003; revised 12 August 2004; accepted 16 August 2004

Abstract Monte Carlo simulations are thought to be very easy to parallelize; however, the quality of these parallel Monte Carlo computations depends greatly on the quality of the parallel random number generators used. Linear congruential generators (LCGs), the most common number-theoretic pseudorandom number generators, with both power-of-two and prime moduli are used in many popular implementations of pseudorandom number generators. Recently, one of the authors of this paper [M. Mascagni, Parallel linear congruential generators with prime moduli, Parallel Comput. 24 (1998) 923–936] developed an explicit parameterization of prime modulus LCGs for use in parallel computations. This approach was based on an explicit enumeration of all the primitive roots modulo the prime modulus for use as unique multipliers in each parallel LCG. In that paper, only Mersenne prime moduli were considered because of the existence of a fast modular multiplication algorithm for primes close to powersof-two. In the current paper, we investigate the nature of the trade-off implicitly made in the choice of Mersenne primes by comparing them to parameterized Sophie–Germain prime modulus LCGs. While the choice of Mersenne primes trades off initialization time for generation time, the choice of Sophie–Germain primes not only largely reduces initialization time but also provides competitive generation time when an appropriately chosen Sophie–Germain primes are used. The resulting Sophie–Germain prime modulus LCGs have been tested, and

*

Corresponding author. E-mail addresses: [email protected] (M. Mascagni), [email protected] (H. Chi). URL: http://www.cs.fsu.edu/~mascagni.

0167-8191/$ - see front matter  2004 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2004.08.002

1218

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

incorporated into the Scalable Parallel Random Number Generators SPRNG library [SPRNG. Scalable parallel random number generators, http://sprng.fsu.edu], a widely used random number generation suite for parallel, distributed, and grid-based Monte Carlo computations [M. Mascagni, A. Srinivasan, Computational infrastructure for parallel, distributed, and grid-based Monte Carlo computations, Lect. Notes Comput. Sci. 2907 (2004) 39–52].  2004 Elsevier B.V. All rights reserved. Keywords: Parallel random number generation; Prime modulus linear congruential generators; Sophie– Germain prime; Mersenne prime

1. Introduction Monte Carlo (MC) simulations are thought of as easy to parallelize because a large MC problem can often be easily broken up into many small, essentially independent, subproblems. For this reason MC simulations usually have extremely high degrees of parallelism, can tolerate large latencies, and yet require considerable computational effort, making them extremely well-suited to parallel, distributed, and grid-based computational environments. Since the fidelity of MC simulations depend on the nature of the pseudorandom numbers used, the need for high-quality parallel random number generators in scientific applications is great and increasing, as increasingly larger Monte Carlo simulations are being planned and executed. One of the most common parallelizations of MC is to use the same MC algorithm across processors while using an ‘‘independent’’ pseudorandom number stream on each processor [37]. Obtaining different streams that are sufficiently ‘‘independent’’ can be tricky. One family of parallelizations of pseudorandom number generators takes a single, long-period, pseudorandom number and splits the full period into subsequences in some manner [23] and the subsequences become the streams for the individual MC processes. However, splitting a single sequence inevitably leads to correlations among the streams that negatively impact ‘‘independence’’ [4–6]. As a solution to this, one of the authors suggested the use of full-period pseudorandom number generators related to each other through parameterization as an alternative [26]. Linear congruential generators (LCGs) are the best known of the pseudorandom number generators, and have the best developed theory. While there are many viable alternatives, many users have used and prefer to continue to use LCGs in their parallel and distributed computations. This motivated one of the authors of this paper [24] to study a method for parameterizing complete and distinct full-period LCG sequences for use as independent pseudorandom number streams as described above. In that work, also published in this Journal, the consequences of parameterizing fullperiod LCG sequences were analyzed when the modulus, m, of the LCG is prime, and of special form close to a power-of-two. Specifically when the modulus is a Mersenne prime, namely when m = 2p  1 is a prime, and the Mersenne exponent, p, is also prime.

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

1219

The distribution of Mersenne primes is sparse. In particular, there are only 41 Mersenne primes known [12] at the time of the writing of this manuscript. 1 This fact severely limits our choice for Mersenne prime moduli. Also, the parameterization in [24] is based on computing all the primitive roots modulo the Mersenne prime modulus. This is very time-consuming when the modulus is Mersenne prime. The choice of using Sophie–Germain primes as the modulus, m, e.g., m = 2p + 1, with m and p prime renders this computation trivial and hence eliminates many of the performance disadvantages of Mersenne primes while keeping many important advantages. In this paper, we will study the consequences of parameterizing full-period LCG sequences when the modulus is a properly chosen Sophie–Germain prime. We will take the same approach as in [24] for the rest of this paper. The organization of this paper is as follows. In Section 2, an overview of LCGs with prime moduli is described, and a brief introduction to the theory of LCGs is given. In Section 3, we use a full-period exponential sum as the measure of interstream and intrastream ‘‘independence’’. This allows us to justify some of our choices for which LCG parameter to systematically vary. The consequences of choosing a Sophie–Germain prime and the resulting computational issues are illustrated in Section 4. Parallelizations and implementations of LCGs with Sophie–Germain prime moduli are presented in Section 5, and conclusions are presented in Section 6.

2. LCGs with prime moduli An LCG is defined by the following relation: xn ¼ axn1 þ bðmodmÞ;

ð1Þ

where m is called the modulus, b the additive constant, and a is called the multiplier. In practice, the modulus is chosen to be a power-of-two or a prime. However, the disadvantage in terms of quality by using m = 2k is illustrated in [24]. When the modulus is prime, a special form of prime can be chosen so that the costliest part of the generation, the modular multiplication, can be minimized. This occurs when m = 2k ± ‘ where ‘ is as small as possible. When ‘ = +1 then m is called a Fermat prime, and when ‘ = 1, a Mersenne prime. In the rest of this section, several theorems about LCGs with prime moduli are presented. Parallelizations and implementations of LCGs are based on these theorems. Details and proofs of these theorems can be found in [15] and [31]. In this paper, we consider LCGs with b = 0 (the reason is explained and justified in Section 3), namely, 1

Since the Lucas–Lehmer algorithm makes finding primes of the Mersenne type relatively easy, the largest known prime since 1952 has been a Mersenne prime. Thus, of the 41 currently known Mersenne primes, only the 9 smallest are less than 264 i.e., they can be represented as a single integer on the most modern, 64-bit, architecture.

1220

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

xn ¼ axn1 ðmodmÞ:

ð2Þ

This is commonly called a multiplicative congruential generator (MCG). If the multiplier and modulus are properly chosen, then the period will be of maximal length, i.e., of length m  1. 2 The following theorem tells us how to choose a multiplier so that we can produce this maximal period. Theorem 1. The MCG determined by xn = axn1 (mod m) has period m  1 if and only if m is prime, a is a primitive root modulo m, and x0 5 0 (mod m). Of course, some moduli do not have a primitive root. For example, there is no primitive root modulo 15. However, every prime m has a primitive root, and the proof this fact is well-known [15]. One natural question to ask is that for a prime, m, how many primitive roots does it have. Theorem 2 shows us how one can count the number of primitive roots modulo m. Theorem 2. Let m be an odd prime, then an integer a with 2 6 a 6 m is a primitive root modulo m if and only if a(m  1)/q 5 1 (mod m) for any prime factor q of m  1. The number of primitive roots modulo m depends on the number of divisors of m  1. The smaller the number of divisors of m  1 has, the more primitive roots m has. If m = 2p + 1 is a Sophie–Germain prime, then m  1 = 2p has two divisors, a minimum, and m has p = (m  1)/2 primitive roots, a relative maximum. For our purposes, the multiplier, a, is thus chosen to be a primitive root modulo m. This gives us both the maximal period and a generator of acceptable quality in a statistical sense based on the spectral test [21]. In practice, m is always properly chosen among some special form of primes so that modular multiplication is efficiently implemented. Taking advantage of binary operations, the Mersenne prime is a favorable choice, which was extensively studied [24]. After careful study, we believe that if we shrewdly choose Sophie–Germain primes, they will have most of the same advantages as Mersenne primes, but with several extra improvements. There are many ways to provide parallel pseudorandom number generators (PRNGs). Those based on splitting a single PRNG int substrings include blocking, leap-frogging, and recursive halving leap-ahead [23]. As previously mentioned, all methods based on splitting a single sequence have generic internal correlations the necessarily lead to a lack of ‘‘independence’’ among the streams. This, in this paper, we will only consider parameterization of MCGs. However, more details on these splitting methods can be found in Section 5. There are only two choices to parameterize a MCG: parameterization via the multiplier, a, and via the modulus, m. It is unacceptable to vary the modulus for a variety of theoretical and practical reasons [24]. So the only parameter left is a. In order to parameterize an MCG via a, we have to be able to calculate all possible as that give the full period. This means that we have to compute all primitive roots modulo m. In comparison to Mersenne primes, the choice of these MCGs with Sophie– Germain prime moduli can save time when computing new multipliers. During 2

We obtain m  1 and not m because the full period omits zero.

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

1221

initialization, if a single primitive root modulo m is known, we can obtain the rest of the primitive roots from the first. The following theorem tells us the relation among all primitive roots modulo a given prime. Theorem 3. If a is a primitive root modulo the prime, m, and the integer, i, is relatively prime to m  1, then a = ai (mod m) is a primitive root modulo m. Thus, to compute primitive roots we need an algorithm to quickly compute all of the integers relatively prime to m  1. Since we need only one of these integers at a time, we really need an algorithm to compute the kth integer relatively prime to m  1. Let p1 relatively prime to m  1. When m is a m1 ðkÞ denote this kth integer p1 p1 Mersenne prime, m  1 ¼ 2p  2; ¼ 2ð2 2  1Þð2 2 þ 1Þ [24], which is usually a very smooth number. 3 This fact makes it time consuming to compute p1 m1 ðkÞ when m is Mersenne, as all the algorithms known are essentially variants of inclusion–exclusion, and hence have running times that are exponential in the number of factors of m  1. In fact, the majority of the initialization time for parameterized LCGs with Mersenne moduli in the SPRNG (Scalable Parallel Random Number Generators) library is spent in computing p1 m1 ðkÞ [24]. One of the main properties for a desirable parallel PRNG is to be able to produce random numbers extremely fast, so that the generation time is only a small fraction of the time required for a complete MC calculation. In highly branched Monte Carlo simulations one often uses only a few thousand random numbers before branching, and in this case the initialization time of a parallel PRNG stream should not be costly. Thus, we feel that the implementation of faster LCGs with prime moduli on parallel machines is required. When m is a Sophie–Germain prime, m  1 = 2p, where p is prime, and so m  1 has only two factors. We can write down the kth primitive root of m explicitly in this case. Given a single primitive root, a, then the kth primitive root modulo m, in a particularly convenient enumeration, can be written as ak ¼ ak ðmod mÞ;

ð3Þ p1 2

where k = 2j + 1, and j runs from 0 to  1. Thus, for Sophie–Germain primes computing primitive roots is explicit, this is the reason we can reduce the initialization time of multiplier parameterized LCGs with Sophie–Germain moduli. In addition, with a Sophie–Germain prime, m, there are (m  1)/2 = p primitive roots, a relative maximum. When m is a Mersenne prime, we get only approximately m/(ln m) primitive roots [34]. Thus, the choice of a Sophie–Germain prime also gives us more random number streams via parameterization than with a Mersenne prime for a given size modulus. However, we still need one primitive root to start this process. Theorem 2 tells us how to find it. For a Sophie–Germain prime, m = 2p + 1, and the theorem says we need only find an a > 1 such that both ap 5 1 (mod m) and a2 5 1 (mod m). Since

3

An integer, whose prime factors are relatively small, is called a smooth number. Similarly, a smooth integer has a relatively large number of factors.

1222

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

about half of the residues modulo a Sophie–Germain prime are primitive roots, we expect to test only two integers in the above manner to find a candidate primitive root. 3. Exponential sum cross-correlation Exponential sums are used to check the uniformity of sequences, and are used in Weyls theory of uniform distribution: Theorem 4. A sequence, {xn}, for n = 0, 1, 2, 3, . . . , is uniformly distributed modulo one if and only if for all integers h 5 0 lim

N !1

N 1 X e2pihxn ¼ 0: N n¼1

ð4Þ

This result is a very powerful tool  for analyzing the uniformity of sequences. TriviPN ally one notices that  n¼1 e2pihxn  is O(N), as each term in the P sum is a complex number on the unit circle. Thus the theorem above states that  Nn¼1 e2pihxn  for all integers h is O(N) when {xn} is uniformly distributed. Let us denote the exponential sum of a sequence of residues modulo m, x0, x1, . . . , xk1 as CðkÞ ¼

k1 X

xn

e2pi m :

ð5Þ

n¼0

If {xn} is periodic with period k, then the above sum is called a full-period exponential sum. If the pffiffiffi sequence {xn} is uniformly distributed in the integers modulo m, then jCðkÞj ¼ Oð k Þ, [16]. This fact can be used to test the correlation within, and between random number sequences. Consider two random number sequences {xn} and {yn}. Their exponential sum cross-correlation is given by Cði; j; kÞ ¼

k1 X

e2pi

ðxiþn y jþn Þ m

:

ð6Þ

n¼0

Here the sum has k terms and we start with xi and yj. Now we consider the full-period exponential sum cross-correlations for two fullperiod MCGs, one with multiplier a, the other with multiplier al modulo m with gcd(l, m  1) = 1, and a a primitive root modulo m. Without loss of generality let x0 = y0, then we have xn = anx0, and yn = anlx0 (mod m). Since a is a primitive root modulo m, then an will assume the value of all the integers 1 to m  1 as n goes from 0 to m  2. This allows us rewrite Eq. (6) as Cð ; m  1Þ ¼

m2 X n¼0

f ðnÞ

e2pi m ;

ð7Þ

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

1223

where f(n) = x0(n  nl). Finally, we get a bound for C(., m  1), [35] pffiffiffiffi jCð ; m  1Þj 6 ðl  1Þ m:

ð8Þ

This means that the sequence {xn  yn} is uniformly distributed in the integers modulo m. This is a necessary condition if the sequences {xn} and {yn} are not highly correlated. When we consider all pairwise cross-correlations in a large family of MCGs parameterized in this way, the above inequality is minimized if l is the kth integer relatively prime to m  1. Previously, we decide upon MCGs because an LCG with prime modulus and nonzero additive constant, b, was claimed to provide no quality advantage over an MCG. Here we justify that statement, as the argument is similar to the previous result. Consider two LCG sequences modulo m with the same multiplier: xn = axn1 + bx (mod m) and yn = ayn1 + by (mod m). Then we have xn  y n ¼ an ðx0  y 0 Þ þ ðbx  by Þðan  1Þða  1Þ1 ¼ an ðx0  y 0 Þ þ bðan  1Þ;

ð9Þ

where b = (bx  by)(a  1)1, and (a  1)1(a  1) 1 (mod m). When x0 = y0, then the full-period exponential sum cross-correlation between xn and yn is Cð ; m  1Þ ¼

m2 X

2pi

n 1Þ

e m bða

ð10Þ

;

n¼0

Since a is a primitive root modulo m, then an will go through all integers from 1 to m  2 except am1. So Eq. (10) can be rewritten as Cð ; m  1Þ ¼

m1 X

2pi

2pi

e m bn  e m ba

m1

2pi

m1

¼ e m ba

ð11Þ

:

n¼0

pffiffiffiffi So the exponential sum (11) has magnitude 1 Oð mÞ, and so xn and yn are highly correlated in this case!! Since bx 5 by for different generators, then b 5 0. Now consider the case when x0 5 y0. We can compute the full-period exponential sum cross-correlation, as long as there is some n such that xn = yn (mod m), then jC(Æ, m  1)j is the same as in the case of x0 = y0, namely 1. Now we prove that there exists an n such that xn  yn 0 (mod m). Using Eq. (9), we seek a solution to xn  yn 0 (mod m) which implies an ðx0  y 0 Þ þ bðan  1Þ 0 ðmodmÞ:

ð12Þ

This Eq. (12) can be written as an bðx0  y 0 þ bÞ1

ðmodmÞ:

ð13Þ 1

Since the right-hand side of Eq. (13), b(x0  y0 + b) , is well-defined and non-zero, then there exists an n such that an = b(x0  y0 + b)1 (mod m). This follows from the fact that a is a primitive root modulo m, and solving the above equation for n reduces

1224

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

to the well-known discrete logarithm problem over GF(m) [13]. Thus, the value of n that we find above is the index (discrete logarithm) where xn = yn (mod m). This shows that any family of prime modulus LCGs with primitive root multipliers created by varying the additive constant are very correlated. The difference between any two of them yields a sequence that takes on all but one value modulo m over the full period. A random sequence would instead have many pffiffiffiffi values repeated many times and yield a full-period cross-correlation that is Oð mÞ. Therefore, we prefer parameterizations via p the ffiffiffiffi multiplier of a full-period MCG where the fullperiod cross-correlation is Oð mÞ. 4. Computational issues If the modulus of an LCGs is a properly chosen Sophie–Germain prime, the LCG can achieve roughly the same generation speed as LCGs with Mersenne prime moduli. Modular multiplication within the generator PMLCG (Prime Modulus LCG) in SPRNG [36] is replaced by integer addition. We will use the same trick for LCGs with Sophie–Germain prime moduli. 4.1. Modular multiplication In SPRNG [36], modular multiplication in the generator PMLCG (LCGs with Mersenne Prime moduli) is implemented by performing the full integer multiplication with only the use of bitwise shifting and integer addition required to accomplish the modular reduction. It is accomplished as follows. Let m = 2p  1 be our Mersenne prime modulus, and the let low denote the p least-significant bits in the binary representative of an integer and high denote the most significant bits. Any integer, x, can be expressed as x = high * 2p + low. For example if m = 7 = 23  1, and x = 30 = (11110)2, then x = high * 23 + low, with high = (11)2, and low = (110)2. If we wish to reduce 30 modulo 23  1, we proceed as follows: 30ðmod 23  1Þ ¼ 11 23 þ 110ðmod 23  1Þ ¼ 11 ðmod 23  1Þ þ ð11 þ 110Þðmod 23  1Þ ¼ 11 þ 110ðmod 23  1Þ ¼ 1001ðmod 23  1Þ ¼ 1 23 þ 1ðmod 23  1Þ ¼ ð1 23  1Þ þ 10ðmod 23  1Þ ¼ ð10Þ2 ¼ 2: Since every Mersenne prime can be written as 2p  1, the right-hand side of the equation xn = axn1 (mod m) can be written as axn1 = high * (2p  1) + (high + low). So the modular multiplication can be written as high ð2p  1Þ þ ðhigh þ lowÞðmod 2p  1Þ ¼ ðhigh þ lowÞðmod 2p  1Þ:

ð14Þ

It is easy to see that in this case the modular multiplication can be accomplished by only using integer addition, which provides us a competitive speed advantage for

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

1225

implementing LCGs with Mersenne moduli. If we wish to use Sophie–Germain primes instead, perhaps we can find such primes close p toffiffiffiffiffipowers of two, i.e., m = 2p + 1 = 2q ± k, where k is small, in particular k < 2q . However, we only consider m = 2p + 1 = 2q  k in practice, because in that case we can limit the implementation to binary arithmetic in q bits and not q + 1 bits. This modified form of Sophie–Germain prime offers us the same approach for implementing modular multiplication in LCGs as is currently done in SPRNG. Inspired by Eq. (14), we carefully searched for Sophie–Germain primes of the pffiffiffiffiffi form 2q  k, where k can be as large as 2q . In this case, the modular multiplication in an LCG with a Sophie–Germain prime modulus can be written as high ð2q  kÞ þ ðhigh k þ lowÞðmod 2q  kÞ ¼ ðhigh k þ lowÞðmod 2q  kÞ:

ð15Þ

This speeds up the generation of each random number, and it can almost match ffi pffiffiffiffi the speed of a LCG with Mersenne prime modulus. We note that since k < 2q that high * k < 23q/2, and also high * k + low 6 23q/2. So, if high * k + low P 2q, we can take high * k + low modulo (2q  k) by splitting this into pnew ffiffiffiffiffi high and low q-bit words. In this case, the high word will be only as large as 2q and so k * high will be only as large as 2q and so the second pass will complete the modular reduction. For a Mersenne prime modulus, there is no p need ffiffiffiffiffi to recurse, but in the case when 2q  k is a Sophie–Germain prime with k < 2q , at most one recursive use of Eq. (15) would be required. Thus, at most, we expect the cost of modular reduction will be twice that of a similar sized Mersenne modulus. We note that all of the bit operations required to do these reductions are linear in the length of the operands in bits. Thus, we expect that the cost, per bit, of multiplication and modular reduction will differ in these two cases by approximately the fraction of cases where the Sophie– Germain prime reduction requires two passes. 4.2. Choice of modulus Can we find Sophie–Germain primes close to powers-of-two as in the case previously described? The prime number theorem [34] approximates the number of primes not exceeding x, p(x) as Theorem 5. As x ! 1 x pðxÞ  : ð16Þ ln x Of course, the number of Sophie–Germain primes will be less than this number. By the prime number theorem, we can try to estimate the number of Sophie– q Germain primes in the interval ð2q  22 ; 2q Þ. First we have to find the number of Sophie–Germain primes less than or equal to x. If the probability of a random integer x and the integer 2x + 1 being prime were statistically independent events, then it x would follow from the prime number theorem that there are about ln x lnð2xþ1Þ such primes less than or equal to x. Ribenboim [34] conjectured on the number of Sophie–Germain primes, and his result can be expressed as

1226

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

Conjecture 1. The number of Sophie–Germain primes less than N is asymptotic to Z N dx 2CN  ; ð17Þ 2C 2 ln x lnð2x þ 1Þ ðln N Þ 2 where C  0.660162 is constant. From this formula, we can estimate the number of Sophie–Germain primes in q the interval ð2q  22 ; 2q Þ. For example, with q = 32, the formula gives us approximately 160 Sophie–Germain primes in the interval (232  216, 232). This gives us q plenty ffi choices for k to find Sophie–Germain primes of the form 2  k with pffiffiffiffiof q k< 2. After a brief search, we have found suitable Sophie–Germain primes to use as the moduli for our implementations. There seem to always be many Sophie–Germain primes close to 2q. In Table 1, we list the Sophie–Germain primes that are just below the power-of-two for exponents 15–64. There may be ‘‘closer’’ Sophie–Germain primes slightly larger than these powers-of-two, but they would require one more bit in their binary representation. We do not think that using such primes will gain benefit in practice.

5. Parallelization and implementations Many different parallel PRNGs have been proposed [1,2,19,33,28]. Basically, they use one of the following methods to generate parallel random numbers: • Leap-frog [2]: a single random number sequence is partitioned in turn among the processors as are cards around a card table. If each processor leap-frogs by L in the random number sequence of {xn}, processor Pi will generate a random sequence with numbers xi ; xiþL ; xiþ2L ; . . . • Blocking [2]: a single random number sequence is partitioned by splitting it into non-overlapping but contiguous sections. If the length of each section is L in the random number sequence {xn}, processor Pi will generate a random sequence with numbers xiL ; xiLþ1 ; xiLþ2 ; . . . • Parameterization [28]: the initial seed or another parameter in the PRNG can be carefully chosen in such a way to produce long period independent sequences on each processor by varying the parameter across processors. Parallel computations using MC require a source of random number sequences, which are distributed among the individual processing units. The first two methods, leap-frog and blocking, require a fast way of advancing the generator a few steps, and a single PRNG with a very long period. The consequences of choosing some

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

1227

Table 1 The Sophie–Germain (S–G) primes closest to but less than 2q 2q 64

2 262 260 258 256 254 252 250 248 246 244 242 240 238 236 234 232 230 228 226 224 222 220 218 216

S–G primes (2q  k) 64

2  1469 262  10565 260  3677 258  137 256  2249 254  4805 252  473 250  161 248  5297 246  857 244  1493 242  2201 240  437 238  401 236  137 234  641 232  209 230  1385 228  437 226  677 224  317 222  17 220  233 218  17 216  269

2q 63

2 261 259 257 255 253 251 249 247 245 243 241 239 237 235 233 231 229 227 225 223 221 219 217 215

S–G primes (2q  k) 263  4569 261  2373 259  18009 257  3993 255  789 253  1269 251  465 249  2709 247  1485 245  573 243  741 241  1965 239  381 237  45 235  849 233  9 231  69 229  189 227  405 225  633 223  321 221  9 219  45 217  285 215  165

L can lead to unknown correlations [4–6] between streams, and these correlations could seriously affect computational results. Thus, the parameterization method, an alternative for generating parallel random numbers, is recommended, and only seriously considered here. We previously stated the desire to utilize the same PRNG algorithm on every processor in parallel MC simulations. However, we cannot choose to parameterize the modulus and optimize modular reduction based on the modulus. In addition, if we consider modular parameterization, the period of the sequences will be different, and the theoretical measure of interprocessor correlation via exponential sums is analytically intractable. There are two ways to parameterize the LCG in Eq. (1), one is via the judicious choice of different bs to form different sequences with powerof-two moduli [33], and the other is to choose different as with prime moduli [24]. To understand the computational efficiency of the choice of Sophie–Germain primes in parameterized LCGs, we implemented a SGMLCG (linear congruential generator with Sophie–Germain modulus) using the same structure as the PMLCG library in SPRNG. The performance comparisons are presented in Table 2. There we tabulate the initialization and generation times for PMLCG and SGMLCG generators.

1228

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

Table 2 Comparisons between PMLCG and SGMLCG

Modulus Initialize time (seconds) MRS MRS/bits # of independent streams

Mersenne prime (PMLCG)

Sophie–Germain (SGMLCG)

261  1  2.306 · 1018 15.030 2.4390 0.0400 2.88 · 1016  1/80 of m

264  21017  1.845 · 1019 0.04 2.6316 0.0411 9.22 · 1018  1/2 of m

(MRS = million random numbers per second).

The two generators we specifically used for comparison were • PMLCG The generator is defined by the following relation: xn ¼ axn1 ðmod 261  1Þ:

ð18Þ

where the multiplier, a, differs for each process. The multiplier is chosen to be certain powers of 37, a primitive root modulo 261  1, that give maximal period sequences of acceptable quality. The period of this generator is 261  2 and the number of distinct streams available is roughly 258. • SGMLCG The generator is defined by the following relation: xn ¼ axn1 ðmod 264  21017Þ:

ð19Þ

where the modulus, m = 264  21017 is a 64-bit Sophie–Germain prime, and the multiplier, a, differs for each stream. 4 The multiplier is chosen to be certain powers of 13, a primitive root modulo 264  21017, that give maximal period sequences of acceptable quality. The period of this generator is (264  21017)  1, and the number of distinct streams available is exactly 263  10509. This generator passed both the DIEHARD [22] statistical tests of randomness and the extensive randomness test suite in SPRNG [37]. We compared one SGMLCG generator with one SPRNG generator—PMLCG, and ran our codes on a workstation running Red Hat Linux 6.2 with dual processor 500 MHz Pentium III processors. The initialization time reported is the CPU time (in seconds) from when the initialization routine is called until the first random number is produced. Since the initialization time in PMLCG will vary depending on the stream number requested, we publish the average time among 106 PMLCG initializations. The initialization time with SGMLCG is also the average of 106 initializations, but here the initialization times are essentially constant. 5 The initialization 4

The larger Sophie–Germain modulus 264  1469 lead to an LCG that did not pass the DIEHARD [22] statistical tests of randomness. 5 The initialization consists primarily in computing the kth integer relatively prime to m  1. This computations complexity grows with k in PMLCG and is constant in SGMLCG.

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

1229

time of SGMLCG is 400 times faster than for PMLCG. We measured the generation speed by publishing the generation rate in millions of random numbers generated per second (MRS). The procedure is to record the time for producing one million random numbers for each generator after initialization is complete. Since 261  1 is the closest Mersenne prime to 264, the moduli for PMLCG and SGMLCG have slightly different lengths in bits, thus we also publish the MRS value per bit as a normalization. SGMLCG is only 3% slower, per bit, in generation than PMLCG. This must be due to the fact that only about 3% of the multiplications and modular reductions require a second reduction using Eq. (15).

6. Conclusions In [24], the author mentioned as an open question: ‘‘Is it more efficient overall to choose m to be amenable to fast modular multiplication or fast calculation of the kth number relatively prime to m  1?’’ LCGs with Mersenne primes as moduli are the first case, and LCGs with Sophie–Germain prime moduli the second case. In this paper, we provide a quantitative answer to this question. If we carefully choose a Sophie–Germain prime as modulus, m, in the form 2q  k, we can implement an LCG with both a fast modular multiplication and the fast calculation of the kth integer relatively prime to m  1. With a Mersenne prime modulus, modular multiplication can be sped up, but it is time-consuming to compute the kth integer relatively prime to m  1. With a Sophie–Germain prime as modulus, the kth integer relatively prime to m  1 can be written down explicitly. By choosing a Sophie–Germain prime of the form of 2q  k, the method of optimizing the modular multiplication for Mersenne moduli can be applied in the Sophie–Germain case as well. Besides fast modular multiplication and fast calculation of the kth number relatively prime to m  1, we have many more choices of acceptable moduli with Sophie–Germain primes. We expect there to be approximately 1.23 · 1016 Sophie– Germain primes in the interval (215, 264). In contrast, there are only four Mersenne primes this same interval: 217  1, 219  1, 231  1, and 261  1, and the next largest Mersenne prime is 289  1. In practice, the modulus, m = 261  1, is often used in implementation of LCGs with Mersenne moduli. When one uses and needs more random streams, we have to choose the much larger Mersenne prime, m = 289  1. Since SGMLCG gives us both a smaller initialization time and an almost identical generation speed, LCGs with Sophie–Germain prime moduli are clearly the better choice. To that end, SGMLCG is currently being incorporated into SPRNG, and is expected to become the LCG of choice with SPRNG users.

Acknowledgments This work was partially supported by research contract DAAD19-01-1-0675 from the Army Research Office. In addition, the authors were partially supported by the

1230

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

North Atlantic Treaty Organization, the National Science Foundation, and the Department of Energy. Specifically, contracts from Lawrence Livermore and Los Alamos National Laboratories under the Accelerated Strategic Computing Initiative (ASCI) supported further work on SPRNG.

References [1] S.L. Anderson, Random number generators on vector supercomputers and other advanced architectures, SIAM Rev. 32 (1990) 221–251. [2] P.D. Coddington, Random number generators for parallel computers. http://www.npac.syr. edu/users/paulc/papers/NHSEreviewl.1/PRNGreview.ps, 1997. [4] A. DeMatteis, S. Pagnutti, Parallelization of random number generators and long-range correlations, Numerische Math. 53 (1988) 595–608. [5] A. DeMatteis, S. Pagnutti, Parallelization of random number generators and long-range correlations, Parallel Comput. 15 (1990) 155–164. [6] A. DeMatteis, S. Pagnutti, Controlling correlations in parallel Monte Carlo, Parallel Comput. 21 (1995) 73–84. [12] GIMPS. Great Internet Mersenne Prime Search, http://www.mersenne.org, 2004. [13] D. Gordon, Discrete logarithms in GF(p) using the number field sieve, SIAM J. Discrete Math. 6 (1) (1993) 124–138. [15] D.E. Knuth, The Art of Computer Programming Seminumerical Algorithms, vol. 2, 2nd ed., Addison-Wesley, Reading, Massachusetts, 1997. [16] L. Kuipers, H. Niederreiter, Uniform Distribution of Sequences, John Wiley and Sons, New York, 1974. [19] P. LEcuyer, T.H. Andres, A random number generator based on the combination of four legs, Math. Comput. Simul. 44 (1997) 99–107. [21] H. Leeb, Asymptotic properties of the spectral test, diaphony, and related quantities, Math. Comput. 71 (237) (2002) 297–309. [22] M. Marsaglia, The diehard battery of tests of randomness. http://stat.fsu.edu/pub/ diehard, 1987. [23] M. Mascagni, Some methods of parallel pseudorandom number generation, In: Proceedings of the IMA Workshop on Algorithms for Parallel Processing, 1997. [24] M. Mascagni, Parallel linear congruential generators with prime moduli, Parallel Comput. 24 (1998) 923–936. [26] M. Mascagni, S.A. Cuccaro, D.V. Pryor, M.L. Robinson, A fast, high quality, and reproducible parallel lagged Fibonacci pseudorandom number generator, J. Comput. Phys. 119 (1995) 211–219. [28] M. Mascagni, A. Srinivasan, Algorithm 806: SPRNG: a scalable library for pseudorandom number generation, ACM Trans. Math. Software 26 (2000) 436–461. [31] H. Niederreiter, Random Number Generations and Quasi-Monte Carlo Methods, SIAM, Philadephia, 1992. [33] O.E. Percus, M.H. Kalo, Random number generators for mimd parallel processors, J. Parallel Distr. Comput. 6 (1989) 477–497. [34] P. Ribenboim, The New Book of Prime Number Records, Springer-Verlag, New York, 1996. [35] W.M. Schmidt, Equations over Finite Field: An Elementary Approach, Berlin, Springer, Lecture Notes in Mathematics, vol. 536, 1976. [36] SPRNG. Scalable parallel random number generators, http://sprng.fsu.edu. [37] A. Srinivasan, M. Mascagni, D. Ceperley, Testing parallel random number generators, Parallel Comput. 29 (2003) 69–94.

M. Mascagni, H. Chi / Parallel Computing 30 (2004) 1217–1231

1231

Further reading The following references may also be of interest to the reader: [3] R. Crandall, Prime Numbers: A Computational Perspective, Springer-Verlag, New York, 2001. [7] W.F. Eddy, Random number generators for parallel processors, J. Comp. Appl. Math. 31 (1990) 63– 71. [8] J. Eichenauer-Herrmann, H. Grothe, A remark on long-range correlations in multiplicative congruential pseudo-random number generators, Nuerische Math. 56 (1989) 609–611. [9] K. Entacher, On the cray-system random number generator, Simulations 72 (1999) 163–169. [10] K. Entacher, A. Uhl, S. Wegenkittl, Linear and inversive pseudorandom numbers for parallel and distributed simulation. In: 12th Workshop on Parallel and Distributed Simulation PADS98, 26–29th May, Banff, Alberta, Canada, 1998, IEEE Computer Society, Los Alamitos, California, pp. 90–97. [11] G.A. Fishman, L.R. Moore, An exhaustive analysis of multiplicative congruential random number generators with modulus 231  1, SIAM J. Sci. Stat. Comput. 7 (1986) 24–45. [14] P. Hellekalek, Good random number generators are (not so) easy to find, Math. Comput. Simul. 46 (1998) 485–505. [17] A. Lastovetsky, Parallel Computing on Heterogeneous Networks, John Wiley and Sons, New York, 2003. [18] P. LEcuyer, Tables of linear congruential generators of different sizes and good lattice structure, Math. Comput. 68 (225) (1998) 249–260. [20] P. LEcuyer, S. Cote, Implementing a random number package with splitting facilities, ACM Trans. Math. Software 17 (1991) 98–111. [25] M. Mascagni, S. Cuccaro, D. Pryor, M.L. Robinson, Recent developments in parallel pseudorandom number generation, In: Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, 1993, pp. 524–529. [27] M. Mascagni, Y. Li, Computational infrastructure for parallel, distributed, and grid-based Monte Carlo computations, Lect. Notes Comput. Sci. 2907 (2004) 39–52. [29] M. Mascagni, A. Srinivasan, Parameterizing parallel multiplicative lagged Fibonacci generators, Parallel Computing 30 (2004) 899–916. [30] M. Nathanson, Elementary Methods in Number Theory, Springer, New York, 2000. [32] S.L. Park, K.W. Miller, Random number generators: good ones are hard to find, Comm. ACM 31 (10) (1988) 1192–1201. [38] P.C. Wu, Multiplicative, congruential random number generators with multiplier 2kl + 2k2 and modulus 2P  1, ACM Trans. Math. Software 23 (1997) 255–265.