Non-standard pseudo random number generators revisited for GPUs

Christoph Riesinger a,∗, Tobias Neckel a, Florian Rupp b

a Department of Informatics, Technical University of Munich, Munich, Germany
b Department of Mathematics and Science, German University of Technology in Oman, Muscat, Oman

∗ Corresponding author. E-mail addresses: [email protected] (C. Riesinger), [email protected] (T. Neckel), [email protected] (F. Rupp).
http://dx.doi.org/10.1016/j.future.2016.12.018
Highlights

• Three methods to generate normally distributed pseudo random numbers which have properties making them interesting for an implementation on the GPU.
• Established GPU random number libraries are outperformed by a factor of up to 4.53, CPU libraries by a factor of up to 2.61.
• One of the three methods has never been considered for GPUs before but delivers best performance on many benchmarked GPU architectures.
Article info

Article history:
Received 26 February 2016
Received in revised form 5 October 2016
Accepted 17 December 2016
Available online xxxx

Keywords:
Fine-grain parallelism and architectures
GPU
Pseudo random number generation
Ziggurat method
Rational polynomials
Wallace method
Abstract

Pseudo random number generators are intensively used in many computational applications, e.g. the treatment of uncertainty quantification problems. For this reason, the right selection of such generators and their optimization for various hardware architectures is of great interest. In this paper, we analyze three different pseudo random number generators for normally distributed random numbers: the Ziggurat method, rational polynomials to approximate the inverse cumulative distribution function of the normal distribution, and the Wallace method. These uncommon generators are typically not the first choice when it comes to the generation of normally distributed random numbers. We investigate the properties of these three generators and show how their properties can be used for an efficient high-performance implementation on GPUs, making these generators a good alternative on this type of hardware architecture. Various benchmark results show that our implementations outperform well established normal pseudo random number generators on GPUs by factors of up to 4.5, depending on the utilized GPU architecture. We achieve generation rates of up to 4.4 billion normally distributed random numbers per second per GPU. In addition, we show that our GPU implementations are competitive against state-of-the-art normal pseudo random number generators on CPUs by being up to 2.6 times faster than an OpenMP parallelized and vectorized code.

© 2016 Elsevier B.V. All rights reserved.
1. Introduction

Random numbers play a vital role in numerous areas of Computational Science and Engineering (CSE). They are frequently used in methods for uncertainty quantification (such as Monte Carlo sampling or the realization and approximation of stochastic processes), in performance modeling, but also for other applications such as cryptography. One popular way to generate random numbers on computers is via pseudo random number generators (PRNGs) [1]. In contrast to real random number generators, which rely on some
physical process (e.g. radioactive decay), PRNGs follow a deterministic rule to generate random numbers [2,3]. Thus, such random numbers are not entirely random but fulfill certain statistical criteria [4]. There are many PRNGs for different statistical distributions (uniform, normal, exponential, etc.) with different capabilities, properties, and characteristics [5–7]. In general, PRNGs produce uniformly distributed random numbers (in the following called uniform random numbers and, analogously, normal, exponential, etc. random numbers). If a different distribution is required, uniform random numbers have to be transformed to the desired target distribution by special transformation operations. Most such transformation operations approximate the inverse cumulative distribution function (CDF) of the target distribution: if F is the target CDF and u is uniform in [0, 1), then F^(−1)(u) has the target distribution; for the exponential distribution, for example, F^(−1)(u) = −ln(1 − u). Such combinations of a uniform PRNG and a transformation function are also
called PRNGs, even though the generation is then a two-stage process; exceptions exist where a PRNG directly produces random numbers of, e.g., normal distribution, avoiding the generation of uniform random numbers.

On CPUs, there is already much experience with PRNGs of different distributions available in the literature, in particular with respect to performance. In recent years, new processor architectures such as GPUs, the Intel Xeon Phi, or the PEZY accelerator (http://www.pezy.co.jp/en/index.html) have gained huge relevance in HPC. Such accelerators often have very different performance characteristics compared to CPUs, such as degree of parallelism, bytes-per-flop ratio, memory bandwidth, instruction throughput, context switching overhead, and many more. Thus, experience gained on CPUs cannot be directly generalized to accelerators, in particular not to GPUs.

In this paper, we focus on three PRNGs for non-uniform distributions which are not the first choice on CPUs but which have interesting properties for GPUs. Reasons why such non-standard generators are not attractive for CPUs can be difficult implementation, high complexity, or simply bad performance due to high computational intensity. Such properties do not have to be disadvantages on GPUs and can even be advantageous. The three PRNGs discussed in this paper are the Ziggurat method, rational polynomials approximating the inverse CDF (in the following just called rational polynomials), and the Wallace method. We investigate the characteristics of these three methods with respect to their suitability for GPUs. Since we need normal random numbers for our application (solving random ordinary differential equations (RODEs) [8,9]), we focus on this particular distribution. Nevertheless, the three non-standard PRNGs which are the subject of this paper are not limited to the normal distribution but can also generate random numbers of alternative distributions. When the non-standard methods are introduced in detail, we also briefly depict how different distributions can be achieved.

Statistical properties of PRNGs can be experimentally checked by test batteries like Diehard [10] or TestU01 [11,12]. Investigating them is, however, not within the scope of this paper. Instead, we concentrate on implementation aspects, optimization for GPUs, and the exploitation of the PRNGs' properties for better performance. This paper uses CUDA terminology (thread, block, grid, warp, occupancy, throughput) when it comes to GPU aspects. In addition, only CUDA-capable GPUs are used for performance measurements. However, all ideas presented in this paper also work with OpenCL and with GPUs from vendors other than NVIDIA without any limitations.

The remainder of this paper is structured as follows: Section 2 lists relevant literature in the context of the non-standard PRNGs and implementation approaches on different hardware architectures. An introduction to the non-standard PRNGs, together with ways to exploit their properties on GPUs, is given in Section 3. In Section 4, benchmark and profiling results are presented as well as a comparison with state-of-the-art CPU and GPU libraries for random numbers. Section 5 concludes this paper by summarizing the main achievements of our GPU implementations.

2. Related work

In this section, we briefly present literature which introduces, discusses, and adapts (e.g. for special hardware architectures) the three non-standard PRNGs used in this paper. A high-performance implementation of uniform generators, especially for GPUs, can be found in [13].

The Ziggurat method was first introduced in [14].
Over time, it was improved in terms of simplicity and performance, which led to the most recent version in [15]. This is also the version that we
use for our GPU implementation. Detailed discussions concerning the statistical properties of the Ziggurat method can be found in [16,17]. There are several attempts to implement the Ziggurat method on special purpose hardware, mainly on Field Programmable Gate Arrays (FPGAs); examples can be found in [18–20]. All of these attempts realize a straightforward implementation, neglecting the special properties of the Ziggurat method. An extensive survey covering several architectures, including massively parallel ones, is given by Thomas et al. [21], where the Ziggurat method turns out to be the best choice on CPUs but not on GPUs. This result is not in accordance with the results presented in this paper. The idea of a runtime/memory trade-off as suggested in Section 3.1 is also seized by Buchmann et al. [22], who use it for their cryptosystem application; their implementation is limited to a normal distribution over the integers.

Numerous examples exist to directly approximate the inverse normal CDF, e.g. in [23,24]. Wichura [25] suggests a set of coefficients for an approximating rational polynomial which we also use in this paper. An alternative coefficient set with different properties in terms of error bounds for specific regions of R is proposed by Beasley et al. [26].

The Wallace method was first presented in [27], with some very useful comments in [28]. The basis for our GPU implementation originates from the vectorized version of Brent [29] provided in the library rannw [30]. There is also an FPGA implementation of the Wallace method, developed by Lee et al. [31].

In a previous paper, we discussed and compared PRNG implementations on GPUs [32], where we focused on PRNGs popular on CPUs, such as the Mersenne Twister [33] or the Box/Muller method [34], which are not necessarily the best choice for GPUs. The Ziggurat method for GPUs was already discussed intensively by us in [35]. This paper augments our previous work by extended explanations of how the Ziggurat is set up and by additional non-standard PRNGs with interesting properties especially for GPUs.

3. Methods

In this section, we present three non-standard PRNGs. The Ziggurat method in Section 3.1 and rational polynomials in Section 3.2 are transformation functions which approximate the inverse normal CDF; hence, they require input from a uniform PRNG. The Wallace method in Section 3.3 directly generates normal random numbers and does not depend on any other PRNG.

3.1. Ziggurat method

The Ziggurat method is a rejection method which realizes the transformation from uniform to normal distribution, in the best case, with only one table lookup and one multiplication. It approximates the area under the normal probability density
function (PDF) f(x) = e^(−x²/2) for x ≥ 0 using N vertically stacked strips, where N can be an arbitrary number ≥ 2. An approximation for N = 8 is depicted in Fig. 1. N − 1 of the strips have rectangular shape. These rectangles R_i, i = 0, …, N − 2, have upper-left corners (0, y_i) and lower-right corners (x_{i+1}, y_{i+1}), where f(x_i) = y_i and 0 = x_0 < x_1 < ⋯ < x_{N−1} = r. The last strip R_{N−1} = R_B is the base strip and is hatched from bottom left to top right in Fig. 1. It does not have rectangular shape but is bounded at x = 0 from the left, by f(x) from the right, at y = 0 from the bottom, and at y = y_{N−1} from the top. Each strip (the rectangles and the base strip) has the same area v. Every rectangle R_i, i = 0, …, N − 2, is further subdivided into three subregions: a central region, a tail region, and a cap region. While central regions are not hatched, tail regions are hatched with diagonal crosses, and cap regions are hatched from bottom right to top left in Fig. 1.
Fig. 1. A Ziggurat with eight strips (seven rectangles and one base strip). Central regions are not hatched, tail regions are hatched with diagonal crosses, and cap regions are hatched from bottom right to top left. The base strip is hatched from bottom left to top right. A similar version of this figure can be found in [35].
Central regions lie completely below the PDF and have rectangular shape, with upper-left corners (0, y_i) and lower-right corners (x_i, y_{i+1}). Tail regions also lie completely below the PDF and are bordered by x = x_i from the left, y = y_{i+1} from the bottom, and f from the right and top. Cap regions lie completely above the PDF and are the remainder of every rectangle R_i, i = 0, …, N − 2, which is neither a central nor a tail region. The stacked strips look like a Ziggurat pyramid (a massive terraced step pyramid of successively receding levels), which gives this method its name.

To perform the transformation from uniform to normal distribution, a uniform random number u ∈ [0, 1) is generated. Using u, one of the strips is randomly selected. This can be done especially efficiently if the number of strips N is chosen as a power of 2, i.e., N = 2^n: in such a case, only the last n bits of u are considered for the strip selection. Let the selected strip have index k. Now, one of the following four cases can occur:

(a) If k ≠ N − 1 and u ≤ x_k/x_{k+1}, a central region is hit.
(b) If k ≠ N − 1, (a) is not satisfied, and u · (f(x_k) − f(x_{k+1})) < f(u · x_{k+1}) − f(x_{k+1}), a tail region is hit.
(c) If k ≠ N − 1 and neither (a) nor (b) is satisfied, a cap region is hit.
(d) If k = N − 1, the base strip is hit.

If a central or tail region is hit, the transformation can be done very quickly because only a multiplication of u with x_{k+1} is necessary. For the cap region case (c), the Ziggurat method has to be restarted with a new u. If the base strip is hit (case (d)), two different subcases have to be considered, depending on the value x = (v · u)/f(r):
• x < r: x is used as the transformed normal random number.
• x ≥ r: A much more complex treatment is required, which is described in detail in [36]. It requires the generation of additional uniform random numbers and the evaluation of ln().

Before the Ziggurat can be used for transformation, it has to be set up. This means resolving all x_i, i = 0, …, N − 1, such that all strips have identical area v. It is sufficient to determine the correct r = x_{N−1} for a given N because v = r · f(r) + ∫_r^∞ f(x) dx. With r and v, the Ziggurat can be set up from the bottom to the top (i.e., starting from the higher indices i) step by step: new x_i are computed by the formula x_i = f^(−1)(v/x_{i+1} + f(x_{i+1})), where x_{i+1} is already known. r itself is determined by a binary search: first, r is chosen arbitrarily and the Ziggurat is set up for this r as described above. If r is chosen too large, y_0 < 1 and, thus, the Ziggurat does not completely cover the area under f (see Fig. 2(a)); in this case, a smaller r has to be used. If r is chosen too small, the Ziggurat becomes too big, with y_0 > 1 (see Fig. 2(b)); in this case, a larger r has to be used. A table which lists r for various N together with the resulting v is given in [35].
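To make this setup concrete, the following host-side sketch performs the binary search for r. For the PDF used here, f^(−1)(y) = √(−2 ln y), and the tail integral can be expressed via the complementary error function; the search bracket, the iteration count, and all function names are our own illustrative choices, not taken from the paper's code.

```cuda
// Host-side sketch of the Ziggurat setup, assuming f(x) = exp(-x^2/2) as
// above, hence f^{-1}(y) = sqrt(-2 ln y). Bracket, iteration count, and
// names are illustrative.
#include <math.h>

static double f(double x)    { return exp(-0.5 * x * x); }
static double finv(double y) { return sqrt(-2.0 * log(y)); }

// Builds the Ziggurat bottom-up for a trial r and returns y0 of the
// topmost rectangle; the correct r yields y0 = 1 = f(0).
static double setup_trial(double r, int N) {
    // v = r*f(r) + integral_r^inf f(x) dx = r*f(r) + sqrt(pi/2)*erfc(r/sqrt(2))
    double v = r * f(r) + 1.2533141373155003 * erfc(r / sqrt(2.0));
    double x = r;                      // x_{N-1} = r
    for (int i = N - 2; i >= 1; --i)   // bottom to top: x_i from x_{i+1}
        x = finv(v / x + f(x));        // argument > 1 (r too small) yields NaN
    return f(x) + v / x;               // y0 = f(x_1) + v/x_1
}

// Binary search for r: y0 < 1 means r was chosen too large (cf. Fig. 2).
static double find_r(int N) {
    double lo = 0.1, hi = 10.0;        // assumed initial bracket
    for (int it = 0; it < 64; ++it) {
        double r = 0.5 * (lo + hi);
        if (setup_trial(r, N) < 1.0) hi = r;  // too large: decrease r
        else                         lo = r;  // too small (NaN falls here): increase r
    }
    return 0.5 * (lo + hi);
}
```

The resulting table x_0, …, x_{N−1} is then copied to the GPU once.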
The Ziggurat is stored in a lookup table which contains all values x_i. The size of this lookup table depends linearly on N. Since the values x_i are constant for a given N, the lookup table only has to be computed once when the PRNG is started, and only one lookup table is required even if multiple instances of a PRNG are used in parallel. This makes shared memory the best choice to store the lookup table on the GPU.

For reasons of performance, it is desirable that cases (a) and (b) occur much more frequently, because they are computationally much cheaper than cases (c) and (d). By varying N, it is possible to influence the share central regions have of the total area of all strips, hence influencing the likelihood of hitting a central region. The more strips are used, the larger the area of the central regions gets in comparison to the complete area of the Ziggurat, as depicted in Fig. 3 for N = 4, 16, 32, and 64. Furthermore, it is of special interest for GPUs to increase the likelihood of hitting one specific region: on GPUs, 32 threads are executed in parallel as a warp. Conditional statements can lead to warp divergence if some threads take a different execution path than others, resulting in a drop in performance since different code cannot be executed in parallel within one warp. Therefore, it is important that all threads of a warp hit the same region, preferably a central region. Table 1 lists, for different N, the share that the central regions have of the total area under the normal PDF and the likelihood that a warp diverges because at least one thread does not hit a central region. This property leads to a runtime/memory trade-off: the more strips are used, the larger the lookup table gets and the more memory is consumed. But more strips lead to a better ratio of central regions to the area under f and, thus, to better performance, especially on GPUs.

Since the Ziggurat method requires uniform random numbers as input, it always depends on a uniform PRNG. Such uniform PRNGs usually introduce some memory footprint, e.g. to store the state of the generator. Basically, there are two candidates of memory where to store this data on the GPU: local memory and shared memory. Local memory is actually slow off-chip global memory which gets a performance boost by being cached in the L2 cache. Shared memory is fast low-latency, high-bandwidth on-chip memory; its performance can be reduced by bank conflicts, and its usage can lower occupancy. In Section 4.1, benchmarks are presented for a local and a shared memory implementation.

In general, the Ziggurat method can be used for every strictly decreasing (antitone) PDF. Marsaglia shows in [14] how the Ziggurat method can be used for the exponential distribution.
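To make the case distinction concrete, the following device-side sketch shows the fast path of one transformation attempt. It assumes the lookup table x[0..N−1] (with x[N−1] = r) has been staged in shared memory and that the raw 32-bit output of the uniform generator is available, so the low n bits select the strip; all names are illustrative, and the slow paths for cases (c) and (d) are only stubbed out.

```cuda
// Device-side sketch of one Ziggurat transformation attempt for N = 2^n
// strips; a random sign for the generated number is applied separately
// and omitted here.
#define ZIG_N 1024                                // N = 2^10 strips

__device__ __forceinline__ float f_pdf(float x) { return expf(-0.5f * x * x); }

// 'bits' is one raw 32-bit number from the uniform PRNG (e.g. XORWOW).
// Returns true and writes *out in the cheap cases (a) and (b); returns
// false for case (c) (cap: restart with new bits) and case (d) (base
// strip: tail treatment following [36]).
__device__ bool ziggurat_try(const float* __restrict__ x,  // shared memory
                             unsigned bits, float* out) {
    unsigned k = bits & (ZIG_N - 1);              // strip index: last n bits
    float u = bits * 2.3283064365386963e-10f;     // u = bits * 2^-32 in [0, 1)
    if (k == ZIG_N - 1) return false;             // case (d): base strip
    if (u <= x[k] / x[k + 1]) {                   // case (a): central region
        *out = u * x[k + 1];
        return true;
    }
    float cand = u * x[k + 1];
    if (u * (f_pdf(x[k]) - f_pdf(x[k + 1])) <
        f_pdf(cand) - f_pdf(x[k + 1])) {          // case (b): tail region
        *out = cand;
        return true;
    }
    return false;                                 // case (c): cap region
}
```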
Fig. 2. Two examples of a Ziggurat using eight strips with a wrong r. In Fig. 2(a), r is chosen too large, resulting in a small v, so eight strips are not enough to approximate the area under f correctly (space filled with dots). In Fig. 2(b), r is chosen too small, so N · v is too large for the approximation.
Fig. 3. Striped approximation of the Gaussian according to the Ziggurat method. Fig. 3(a)–(d) show a discretization with 4, 16, 32, and 64 strips, respectively. One can observe that the ratio of the central regions for case (a) to the area under the Gaussian becomes larger as more strips are used. A similar version of this figure can be found in [35].
Table 1
Depending on the number of used Ziggurat strips, the second column gives the ratio of the area covered by central regions to the total area under the normal PDF (a larger ratio is better). The last column gives the likelihood that at least one thread of a warp takes another branch than the one for a central region (a smaller likelihood is better). Source: This table and caption are taken from [35].

Number of strips   % of covered area   Likelihood of warp divergence
   2 = 2^1            0.0%             1 − 0.0^32    = 100.0%
   4 = 2^2           ~42.98%           1 − 0.4298^32 ≈ 99.9%
   8 = 2^3           ~69.75%           1 − 0.6975^32 ≈ 99.9%
  16 = 2^4           ~84.13%           1 − 0.8413^32 ≈ 99.9%
  32 = 2^5           ~91.71%           1 − 0.9171^32 ≈ 93.7%
  64 = 2^6           ~95.64%           1 − 0.9564^32 ≈ 76.0%
 128 = 2^7           ~97.71%           1 − 0.9771^32 ≈ 52.3%
 256 = 2^8           ~98.80%           1 − 0.9880^32 ≈ 32.0%
 512 = 2^9           ~99.37%           1 − 0.9937^32 ≈ 18.3%
1024 = 2^10          ~99.67%           1 − 0.9967^32 ≈ 10.0%
2048 = 2^11          ~99.83%           1 − 0.9983^32 ≈ 5.3%
4096 = 2^12          ~99.91%           1 − 0.9991^32 ≈ 2.8%
8192 = 2^13          ~99.95%           1 − 0.9995^32 ≈ 1.6%
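The warp-divergence likelihoods in the last column follow directly from the per-thread probability p of hitting a central region: a warp of 32 threads stays convergent only if all 32 threads hit a central region, so the divergence likelihood is 1 − p^32. As a check of the N = 1024 row: 1 − 0.9967^32 = 1 − e^(32 · ln 0.9967) ≈ 1 − e^(−0.106) ≈ 0.100, i.e., 10.0%.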
3.2. Rational polynomials

Explicit functions, e.g. rational polynomials, can be used to directly approximate the inverse normal CDF. Thus, rational polynomials can serve as transformation functions to transform a uniform random number u ∈ [0, 1) to normal distribution. There is no unique set of coefficients for such a rational polynomial; the coefficients vary depending on the polynomial degree and the intervals where the rational polynomial is defined. To increase the precision of the approximation and to obtain an approximation on all of R, piecewise defined rational polynomials are used, e.g. one rational polynomial for the inner region around 0 and one rational polynomial for the tail regions of f. In general, rational polynomials can be used to approximate every kind of CDF, making them very flexible and widely applicable.

In this paper, we use the coefficient set suggested by Wichura [25]. It uses one rational polynomial for the inner region [−0.425, 0.425] and one for the tail region. To further enhance accuracy, there is an auxiliary variable s = √(−log(min(u, 1 − u))) which selects (s > 5 or s ≤ 5) one of two possible rational polynomials for the tail region. Numerator and denominator degrees are 7 for all rational polynomials. The alternative coefficient set given in [26] offers higher precision in [−3.5, 3.5] but less precision in the tails of the distribution. Accuracies and error bounds for the coefficient sets are given in [25,26].

Depending on u, branching can occur in this transformation method: either the rational polynomial for the inner or the tail region has to be used, and, depending on s, two rational polynomials are possible for the tail region. On GPUs, this can lead to warp divergence, resulting in a reduction of performance. To overcome this issue, blending can be used: independent of u, all branches of this method are computed. Afterwards, the result of every branch is multiplied by the result of the conditional statement of the corresponding branch (0 for false, 1 for true). Finally, all multiplied results are summed up, resulting in the transformed value, because the correct branch is weighted by 1 and all wrong branches are weighted by 0.
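The following device-side sketch illustrates this blending scheme. The coefficient arrays and the exact variable transforms of Wichura's algorithm are deliberately left as placeholders; only the branch-free structure is the point here. It assumes u ∈ (0, 1).

```cuda
// Sketch of blending for the piecewise rational polynomial approximation.
// All three branches are always evaluated; 0/1 weights derived from the
// branch conditions pick the correct result without divergent branches.
// Coefficient arrays and variable transforms are placeholders, not the
// actual values from [25].
__device__ __forceinline__ float horner7(const float* c, float t) {
    float p = c[7];                       // degree-7 Horner evaluation
    #pragma unroll
    for (int i = 6; i >= 0; --i) p = p * t + c[i];
    return p;
}

__device__ float inv_cdf_blend(float u,
                               const float* nI, const float* dI,   // inner
                               const float* n1, const float* d1,   // tail, s <= 5
                               const float* n2, const float* d2) { // tail, s >  5
    float q = u - 0.5f;
    float s = sqrtf(-logf(fminf(u, 1.0f - u)));
    // evaluate every branch unconditionally
    float rI = q * horner7(nI, q * q) / horner7(dI, q * q);
    float r1 = horner7(n1, s) / horner7(d1, s);
    float r2 = horner7(n2, s) / horner7(d2, s);
    float sign = (q < 0.0f) ? -1.0f : 1.0f;   // tail result carries u's side
    // 0/1 weights from the branch conditions; exactly one weight is 1
    float wI = (fabsf(q) <= 0.425f) ? 1.0f : 0.0f;
    float w1 = (1.0f - wI) * ((s <= 5.0f) ? 1.0f : 0.0f);
    float w2 = (1.0f - wI) * ((s >  5.0f) ? 1.0f : 0.0f);
    return wI * rI + sign * (w1 * r1 + w2 * r2);
}
```

Since the 0/1 weights are obtained with predicated selections rather than branches, all threads of a warp execute the same instruction stream.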
This approach leads to many superfluous computations, but GPUs possess high computational power, so additional computations can make sense to avoid warp divergence. Furthermore, the superfluous computations mainly consist of additions and multiplications because the rational polynomials are evaluated using the Horner scheme, and additions and multiplications have high throughput on GPUs. In Section 4.1, Fig. 6 gives performance results of an implementation which suffers from warp divergence and of an implementation which uses blending to eliminate warp divergence.

3.3. Wallace method

An alternative approach to transformation functions are PRNGs which directly generate normal random numbers. Such a generator is the Wallace method, which uses a chunk of previously computed normal random numbers to transform them into a new chunk of normal random numbers. To do so, it does not need any transcendental functions such as log() or sin() but just a linear operator. This is possible by using the maximum entropy property [37], which is E(x²) = 1 for the normal distribution, where E is the expected value of a random number x.

The Wallace method takes a vector X of K previously generated normal random numbers and forms a new vector Y of normal random numbers by Y = AX. A is an orthogonal K × K matrix; it satisfies the maximum entropy property of the normal distribution because it preserves the sum of squares. L of such vectors of size K form a pool; thus, a pool consists of ν = KL normal random numbers. In L transformation steps, a new pool of normal random numbers is generated by applying A to every K elements of the current pool. The transition from an old to an updated pool is called a pass. We use K = 4; thus, A is a 4 × 4 orthogonal matrix. Other values for K are also possible: an example for K = 2 is given in [38,28], where A simply becomes a rotation matrix in the plane. K should not become too large because the computational effort depends quadratically on K, leading to high computational load even on GPUs.

Exploiting the maximum entropy property is not limited to the normal distribution. For the uniform distribution, the maximum entropy property is 0 ≤ x ≤ 1. This leads to the class of generalized Fibonacci generators [39], where (u1 + u2) mod 1 is a uniform random number if u1 and u2 are uniform random numbers. For the exponential distribution, the maximum entropy property is E(x) = 1, x ≥ 0.

To achieve good statistical results, some modifications of this basic Wallace method are necessary. First, it is desirable that every entry of a pool contributes to all entries of future pools formed after some passes. One approach to satisfy this property is to interpret the K × L pool, which is stored linearly in memory, in column-major order in every even pass instead of row-major order in every odd pass, thus performing an implicit transposition. For large L > 256, a transposition alone is not sufficient for adequate mixing. Thus, a random odd stride other than one is used for the row index, which is also started from a random value different from zero. This makes it necessary to take the adapted row index modulo L. Second, the quality of future pools can be further improved by using several different orthogonal matrices instead of only one. Wallace, e.g., suggests using multiple matrices A1, …, A4, randomly chosen in every pass, instead of only one A.
Each Ai is an orthogonal 4 × 4 matrix with all entries ±1/2; the four matrices differ only in their sign patterns.
Table 2
Properties of the three GPU architectures used for the benchmarks in this paper. SP stands for single precision (float), DP stands for double precision (double).

GPU architecture                  Fermi      Kepler       Maxwell
Model                             M2090      Tesla K40m   GTX 750 Ti
Compute capability                2.0        3.5          5.0
#Processing elements      SP      16 × 32    15 × 192     5 × 128
                          DP      –(a)       15 × 64      5 × 4
Shared memory (kB)                48         48           64
(Base) clock rate (MHz)           1300       667          1280
Peak performance (TFLOPS) SP      1.3312     3.84192      1.6384
                          DP      0.6656     1.28064      0.0512
Peak memory bandwidth (GB/s)      177.4      288.384      96.128

(a) In contrast to the SMX and SMM microarchitectures of Kepler and Maxwell, respectively, the SM microarchitecture of Fermi does not have dedicated double precision units. Instead, two single precision units are combined to perform double precision operations.
A1, …, A4 have the same properties in terms of required arithmetical operations: seven additions and one multiplication with 1/2. Third, an obvious defect of the Wallace method is that the sum of squares of the numbers of one pool is constant for all pools since A is orthogonal and, thus, ∥Y∥² = ∥AX∥². Instead, the sum of squares should have chi-squared distribution χ²_ν. There are several solutions to overcome this issue. A straightforward measure is already given in the original paper [27]: one element of every pool is used to approximate a variate V from χ²_ν. This element is used to set up a scaling factor √(V/(KL)) which is applied to the remaining KL − 1 elements of the pool, which are returned to the user. Further details on how to deal with such statistical defects and alternative solutions are given in [27,29,28,31].

Brent suggests vectorizing the Wallace method along the L direction [29], which can also be applied to GPUs: every thread transforms K elements of the pool, resulting in L threads working in parallel to perform a pass. Hence, one thread block consists of L threads, and different thread blocks work on different pools.
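The following device-side sketch illustrates such a pass for K = 4 under the assumptions stated in its comments. The operator used is one orthogonal matrix of the required form (all entries ±1/2), not necessarily one of the four sign patterns used in the paper, and the random index mixing between passes is omitted.

```cuda
// Device-side sketch of one Wallace pass for K = 4 in the thread-parallel
// layout described above: each of the L threads of a block applies an
// orthogonal 4 x 4 operator to "its" four pool values held in shared
// memory. The operator below (y_i = s - x_i with s = (x1+x2+x3+x4)/2,
// i.e. the matrix (1/2)J - I with entries +/-1/2) is one valid choice;
// the paper's implementation randomly picks among four such sign patterns
// and mixes indices with random strides, both omitted here. Pool
// initialization and the chi-squared correction are omitted as well.
#define K_VEC 4
#define L_THREADS 256                   // launch with L_THREADS threads/block

__global__ void wallace_pass(float* __restrict__ out, int passes) {
    __shared__ float pool[K_VEC * L_THREADS];   // current pool, kept on-chip
    const int t = threadIdx.x;
    // ... fill pool[] with an initial chunk of normal random numbers ...
    for (int p = 0; p < passes; ++p) {
        float a = pool[K_VEC * t + 0], b = pool[K_VEC * t + 1];
        float c = pool[K_VEC * t + 2], d = pool[K_VEC * t + 3];
        // orthogonal transform: 7 additions and 1 multiplication by 1/2
        float s = 0.5f * (a + b + c + d);
        float a2 = s - a, b2 = s - b, c2 = s - c, d2 = s - d;
        __syncthreads();  // required once index mixing between passes is added
        pool[K_VEC * t + 0] = a2; pool[K_VEC * t + 1] = b2;
        pool[K_VEC * t + 2] = c2; pool[K_VEC * t + 3] = d2;
        __syncthreads();
    }
    // write the final pool back; each block works on its own pool
    for (int i = t; i < K_VEC * L_THREADS; i += blockDim.x)
        out[(size_t)blockIdx.x * K_VEC * L_THREADS + i] = pool[i];
}
```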
To deal with the statistical defects of the Wallace method mentioned above, a uniform PRNG is required to randomly determine a stride and a start index and to randomly select Ai, i = 1, …, 4, even if the Wallace method itself does not require uniform random numbers as input. As already mentioned for the Ziggurat method in Section 3.1, there are two options where to store the states of this uniform PRNG: local memory or shared memory; the advantages and disadvantages are the same as discussed for the Ziggurat method. To reduce memory latency when doing passes, the current pool is kept in shared memory. In Section 4.1.3, benchmark results are given for a local and a shared memory implementation.

4. Results

In this section, we first evaluate the performance of our GPU-optimized implementations of the PRNGs presented in Section 3 (Section 4.1). Afterwards, we correlate these results and compare them with state-of-the-art CPU and GPU library functions (Section 4.2). We run benchmarks and profilings on three different GPUs representing three generations of NVIDIA GPU architectures: Fermi, Kepler, and Maxwell. Architecture-specific properties are given in Table 2. While the Tesla M2090 and the Tesla K40m represent high-end products of their generations, the GTX 750 Ti is a mid-range consumer product. On the software side, CUDA 6.5 is used on all GPUs.
Table 3
Achieved SP FLOPS efficiency of different normal PRNG implementations on the three benchmarked GPU architectures. For all runs, parameters have been used which lead to the best performance of the specific implementation (see peaks in Figs. 4, 6, and 7).

Method                Implementation   Fermi    Kepler   Maxwell
Ziggurat              Local            0.02%    0.02%     0.38%
                      Shared           0.04%    0.02%     0.58%
Rational polynomial   Branching        0.29%    0.08%     1.65%
                      Blending         0.88%    0.26%    17.87%
Wallace               No treatment     0.50%    0.13%     1.44%
                      Local            0.06%    0.01%     0.42%
                      Shared           0.23%    0.04%     0.70%
Since all three PRNGs in this paper require uniform random numbers (either as input to be transformed, in the case of the Ziggurat method and rational polynomials, or to deal with the statistical defects of the Wallace method), we use cuRAND's [40,41] XORWOW shift-register generator [42] to produce them. On the one hand, it offers good performance and a small memory footprint; on the other hand, it has good statistical properties such as long periods, a uniform distribution, and a hard-to-predict sequence, as discussed in [42]: it passes all tests of Diehard [10] except the binary rank test. Single precision (SP) is used, and 1 GB of random data (corresponding to 2^28 = 268 435 456 float numbers) is generated for all runs. (Since the output of XORWOW has just 32 bits, only 2^32 different normal random numbers can be generated from it at maximum. Alternatively, two successive numbers of XORWOW could be concatenated to get an input of 64 bits.) Every thread produces 2^12 random numbers.

Benchmarks and profilings are run for different grid configurations in terms of ''threads per block'' and ''blocks per grid'', i.e., the configuration of the parallel setup during runtime in the CUDA programming model [43]. We denote it by numberOfThreadsPerBlock × numberOfBlocksPerGrid, varying from 2^5 × 2^11 to 2^9 × 2^7. Thus, 2^16 threads × 2^12 random numbers per thread = 2^28 random numbers. We measure performance in giga pseudo random numbers per second (GPRNs/s). All measured runtimes include the times for uniform random number generation and for saving the results to global memory; the times to set up the PRNGs are given in Table 4. There are versions of the source code optimized for particular GPU architectures: the Ziggurat and the Wallace method each have implementations using either local or shared memory to store the states of the PRNGs, and for rational polynomials, there is one version using branching and one version using blending. Section 4.1 indicates which implementation performs best on a particular GPU architecture.
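As an illustration of this measurement setup, the following sketch shows such a fused generation kernel using cuRAND's device API. transform() is a stand-in for the methods of Section 3; the CUDA built-in normcdfinvf() is used here only to keep the sketch self-contained, and all names and the launch examples are ours.

```cuda
// Sketch of the fused benchmark kernel: every thread owns one XORWOW
// state and produces 2^12 normal random numbers.
#include <curand_kernel.h>

__device__ __forceinline__ float transform(float u) {
    return normcdfinvf(u);               // placeholder transformation
}

__global__ void generate_normal(float* __restrict__ out,
                                unsigned long long seed) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;                   // XORWOW by default
    curand_init(seed, tid, 0, &state);   // per-thread state
    const int perThread = 1 << 12;       // 2^12 numbers per thread
    for (int i = 0; i < perThread; ++i) {
        float u = curand_uniform(&state);            // uniform in (0, 1]
        out[(size_t)tid * perThread + i] = transform(u);
    }
}

// Launched with the grid configurations benchmarked in this section, e.g.:
//   generate_normal<<<1 << 11, 1 << 5>>>(d_out, seed);  // 2^5 x 2^11
//   generate_normal<<<1 << 7,  1 << 9>>>(d_out, seed);  // 2^9 x 2^7
```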
4.1. Performance of non-standard PRNGs on GPUs

First, we summarize the results of the Ziggurat method in Section 4.1.1, which have already been published by us in [35]. Then, results of our implementations of rational polynomials and the Wallace method are given in Sections 4.1.2 and 4.1.3, respectively.

4.1.1. Ziggurat method

Benchmark results for the Ziggurat method in dependence of the number of used strips N are given in Fig. 4. Different lines represent different grid configurations (numberOfThreadsPerBlock × numberOfBlocksPerGrid); the local and shared memory implementations are colored in red and blue, respectively. We are interested in how the performance in terms of GPRNs/s depends on the number of used Ziggurat strips N, so that is the value used for the abscissae in Fig. 4.

On Fermi and Kepler, we see a single peak of best performance for almost all implementations and configurations. While warp divergence becomes the bottleneck when using fewer strips than
Table 4
Setup times to initialize the states of different normal PRNGs on the three benchmarked GPU architectures. For all runs, parameters have been used which lead to the best performance of the specific implementation (see peaks in Figs. 4, 6, and 7).

Method                Implementation   Fermi (ms)   Kepler (ms)   Maxwell (ms)
Ziggurat              Local              24.063       61.889         52.176
                      Shared             24.042       26.801         34.417
Rational polynomial   –                  22.089       18.332         31.631
Wallace               Local            3617.30      3193.11        1311.23
                      Shared           3619.79      4015.01        2841.78
the peak performance configuration (that is exactly what we expect from the runtime/memory trade-off), low occupancy limits performance for large strip numbers. This behavior can also be observed on Maxwell, but less noticeably than on Fermi and Kepler. In general, the shared memory implementation performs better than the local memory implementation. This is more significant for low strip numbers and diminishes for high strip numbers, where the shared memory implementation is more sensitive to low occupancy. Best performance on Fermi is achieved with the local memory implementation: 1024 strips and a grid configuration of 2^6 × 2^10 give 1.23 GPRNs/s. For Kepler and Maxwell, the shared memory implementation shows the best performance: 1024 strips and the grid configuration 2^6 × 2^10 lead to 2.17 GPRNs/s on Kepler, and 256 strips and the grid configuration 2^5 × 2^11 lead to 1.91 GPRNs/s on Maxwell.

At first glance, it is surprising that the best achieved performance on the three GPUs does not differ much, although there are big differences in terms of number of processing elements, clock rate, memory bandwidth, and SP peak performance (see Table 2): the best achieved performance for the Ziggurat on Kepler is 1.76 times higher than on Fermi, and the best achieved performance on Maxwell is 1.13 times lower than on Kepler. The reason is that both Ziggurat implementations are neither compute-bound (the transformation itself only requires one multiplication in most cases) nor memory-bound (neither to off- nor to on-chip memory). Instead, they are latency-bound due to low instruction level parallelism and a bad ratio of operations to memory accesses. This leads to similar characteristics on all GPU architectures.

Fig. 5 shows the achieved occupancy of the Ziggurat method for different implementations and grid configurations. The color coding and axis labeling are the same as for Fig. 4. Since the local memory implementation uses less shared memory, its occupancy is higher than that of the shared memory implementation, which can be observed on all three GPU architectures when using identical settings in terms of N and the grid configuration. All implementations and grid configurations on all GPU architectures have in common that the occupancy stays constant until a certain value of N and then decays. That is the point where shared memory usage becomes a limiting factor for occupancy, also influencing the overall performance as discussed in the previous paragraph.

4.1.2. Rational polynomials

Benchmark results for rational polynomials using the coefficient set of Wichura are plotted in Fig. 6. GPRNs/s are again plotted on the ordinates, but the grid configuration is assigned to the abscissae. The results of the implementation containing conditional statements are colored in red and the results of the implementation avoiding them with blending are colored in blue. Although a dependency on the grid configuration is recognizable in Fig. 6, rational polynomials are not as sensitive to it as the Ziggurat method. Occupancy is only slightly affected by the grid configuration and is not correlated to performance for rational polynomials.
Fig. 4. Performance in GPRNs/s over the number of strips N for different implementations of the Ziggurat method using different grid configurations on three different GPU architectures. The local memory implementation is colored in red and given in the first row of plots, the shared memory implementation is colored in blue and given in the second row of plots. Five different grid configurations are benchmarked, ranging from 2^5 × 2^11 to 2^9 × 2^7 (numberOfThreadsPerBlock × numberOfBlocksPerGrid). Due to the limited size of the shared memory, the results for certain implementations and grid configurations are missing: N > 2^11 strips on Fermi, N = 2^13 strips for the shared memory implementation with 2^9 × 2^7 on Kepler, and N > 2^11 strips for the shared memory implementation with 2^9 × 2^7 on Maxwell. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Source: This caption and the GPU results are taken from [35].
Fig. 5. Achieved occupancy over the number of strips N for different implementations of the Ziggurat method. The profiled grid configurations, implementations, GPU architectures, restrictions, and color coding are the same as for Fig. 4. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
The implementation using blending always performs better than the implementation containing conditional statements for the same grid configuration. On Fermi and Kepler, there is no big difference observable between the two implementations; on Fermi, it is even imperceptible. This does not hold for Maxwell: there, the best achieved blending performance is 3.49 times faster than the best achieved performance of the default implementation. Maxwell seems to suffer much more from the overhead of warp divergence (re-synchronizing all threads of a warp) than
previous NVIDIA GPU generations, where this overhead is small or even negligible. Best achieved performance is 0.97 GPRNs/s on Fermi, 1.70 GPRNs/s on Kepler, and 2.77 GPRNs/s on Maxwell, each using grid configuration 2^5 × 2^11. As for the Ziggurat method, the approximation with rational polynomials is neither compute-bound (see Table 3) nor memory-bound but latency-bound.
Fig. 6. Performance in GPRNs/s over grid configuration (numberOfThreadsPerBlock × numberOfBlocksPerGrid) for different implementations of rational polynomials on three different GPU architectures. The implementation using conditional statements is colored in red, the implementation using blending is colored in blue. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 7. Performance in GPRNs/s over grid configuration (numberOfThreadsPerBlock × numberOfBlocksPerGrid) for different implementations of the Wallace method on three different GPU architectures. The local memory implementation is colored in red, the shared memory implementation is colored in blue. Results of the implementation forgoing any additional statistical treatment are colored in purple. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
That is the reason why peak performance values do not correlate with the values given in Table 2. The SMM microarchitecture of Maxwell [44] deals much better with latency-bound problems than the SMX architecture [45] of Kepler because of improved instruction scheduling, latencies, and throughput.

4.1.3. Wallace method

The results of benchmarking our implementations of the Wallace method are given in Fig. 7. Axis assignments are the same as in Fig. 6 for rational polynomials. The local and shared memory implementations using special treatment to handle statistical defects are colored in red and blue, respectively. In addition, the results of an implementation without any uniform PRNG are added in purple. Note that the quality of normal random numbers generated this way is much worse, and they cannot be used for real applications. Nonetheless, this is an interesting result because it shows how fast the Wallace method, with its idea of doing passes using only a linear orthogonal operator, actually is.

On all three GPU architectures, the shared memory implementation performs better than the local memory implementation. Similar to the Ziggurat method, the shared memory implementation always shows a single optimum of performance. Since the current pool is kept in shared memory and the pool size depends on the number of threads per block, shared memory consumption grows with the number of threads per block and thus lowers occupancy. This reduces performance when using more threads per
block than the peak performance configuration. On Fermi and Kepler, the local memory implementation performs better the fewer threads per block are used, while on Maxwell, it performs better the more threads per block are used. Best achieved performance is 4.46 GPRNs/s on Fermi, 4.18 GPRNs/s on Kepler, and 2.21 GPRNs/s on Maxwell, using grid configuration 2^7 × 2^9 on Fermi and Maxwell and 2^6 × 2^10 on Kepler. Comparing these results with the pure Wallace method implementation shows how big the influence of the necessary statistical treatment is: a factor of 2.14 on Fermi, a factor of 2.44 on Kepler, and a factor of 1.93 on Maxwell.

Like the Ziggurat method and rational polynomials, the Wallace method is latency-bound. At first glance, the orthogonal linear operator to update the pools looks expensive, but it actually only causes seven additions and one multiplication per K-vector. This property is validated by the measurements of the FLOPS efficiency given in Table 3 (see the ''Wallace'' rows). Since the current pool is kept in shared memory, there are also no memory bandwidth issues.

Table 3 lists the achieved SP FLOPS efficiencies of our implementations of the PRNGs discussed in this paper on the three GPU architectures. The efficiencies refer to the configuration which achieves the best performance in the benchmarks discussed above. FLOPS efficiency is the ratio of the achieved FLOPS rate to the peak FLOPS rate (see Table 2). The FLOPS efficiencies are in general relatively low: except for three cases, they are below one percent, which is an indicator of the low computational intensity of the kernels. This is acceptable since none of the non-standard PRNGs is computationally expensive, and still the generation rates are very high.
Fig. 8. Performance of different methods for random number generation on three different GPU architectures. The three methods explained in Section 3 are colored in red, where the straight lines refer to the Ziggurat method (local memory implementation on Fermi, shared memory implementation on all other GPUs), the dotted lines refer to the rational polynomial (implementation using blending), and the dashed lines refer to the Wallace method (shared memory implementation). In green, the performance of libraries provided by CUDA is given, normcdfinvf() as straight lines and cuRAND Box/Muller as dotted lines. Pure XORWOW, just generating uniform random numbers without any further modification, is colored in purple. Five different grid configurations are benchmarked, ranging from 2^5 × 2^11 to 2^9 × 2^7. For the sake of comparability, the performance of two Intel Xeon E5-2680 v2 using the MKL is added in blue. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Finally, Table 4 lists the setup times of all three normal PRNGs presented in this paper. Setup times are the times to initialize the states of the PRNGs, e.g. the setup of the Ziggurat. These times refer to the configuration which achieves the best performance in the benchmarks discussed above. For rational polynomials, the state initialization is identical for both implementations (branching and blending); hence, only one value is given. The same holds for two of the three implementations of the Wallace method: the shared memory implementation and the implementation without special treatment for statistical defects initialize their states in the same way. While the setup times of the Ziggurat method and rational polynomials are negligible, they take a considerable amount of time for the Wallace method, caused by the generation of ν = KL random numbers to form the first pool.

4.2. Comparison of PRNGs

In this section, we compare the benchmark results of our implementations with each other and with other well-established random number libraries for CPU and GPU. For the Ziggurat method (local memory on Fermi, shared memory on Kepler and Maxwell), rational polynomials (blending), and the Wallace method (shared memory), we use those implementations and parameters (number of strips N, grid configuration) which achieve the best performance for the corresponding PRNG on the three GPU architectures. These parameters were determined in Section 4.1 and are also used for Table 3.

To compare our GPU implementations with results from a CPU, we benchmark the generation rate of two Intel Xeon E5-2680 v2, which represent a high-end CPU system. Since there are no absolute values in terms of GPRNs/s for CPUs given in the literature (sources such as [46] only provide relative speedup values, e.g. in comparison with C rand()), we wrote our own, highly optimized CPU code. It uses AVX vectorization and OpenMP multithreading with 20 threads. The Math Kernel Library (MKL) 11.3 and the Intel C++ Compiler (ICC) 16.0 are applied. For uniform random number generation, the 59-bit multiplicative congruential generator of the MKL is utilized. The transformation to normal distribution is done by the MKL implementation of the inverse CDF of the Gaussian function. This combination delivers the best performance on the named CPUs, leading to 1.71 GPRNs/s.
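A minimal host-side sketch of this MKL combination follows; the per-thread stream handling of the OpenMP version and all error checking are omitted.

```cuda
// Host-side sketch of the MKL reference: MCG59 uniform generator plus the
// ICDF-based Gaussian transformation.
#include <mkl_vsl.h>

void generate_normal_mkl(float* r, int n, unsigned int seed) {
    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_MCG59, seed);  // 59-bit multiplicative CG
    vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_ICDF,   // inverse-CDF transformation
                  stream, n, r, 0.0f, 1.0f);      // mean 0, sigma 1
    vslDeleteStream(&stream);
}
```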
As a widely used random number library for GPUs, we use cuRAND [40,41]. Besides the mentioned XORWOW uniform generator, it implements the Box/Muller method [34] for the transformation to normal distribution. CUDA offers the normcdfinvf() function, a built-in approximation of the inverse normal CDF similar to our rational polynomials. We also benchmark the pure XORWOW uniform generator without any transformation to get a feeling for its share of the total execution time. XORWOW also produces the uniform random numbers transformed by cuRAND's Box/Muller implementation and by normcdfinvf(). Performance results of all these PRNGs are summarized in Fig. 8.

On Fermi and Kepler, the Wallace method shows the best performance, with speed-ups of 3.62 and 1.93, respectively, in comparison to the second best method, the Ziggurat method. Rational polynomials approximating the inverse normal CDF show the worst performance of the three non-standard methods. Performance values of normcdfinvf() and cuRAND (XORWOW and Box/Muller) do not differ much, which also holds for rational polynomials on Fermi; they seem to be limited by the performance of XORWOW. All non-standard PRNGs show at least similar performance to the GPU libraries, and the Wallace method outperforms them by factors of 4.53 and 2.55 on Fermi and Kepler, respectively. On Maxwell, a totally different behavior can be observed: there, the best performing option is cuRAND, which is slightly more efficient than rational polynomials (speed-up factor 1.08). In contrast to Fermi and Kepler, rational polynomials show the best performance of the methods introduced in Section 3 (a factor of 1.25 better than the Wallace method and a factor of 1.44 better than the Ziggurat method). If the right PRNG is chosen, the generation rate on all applied GPUs outperforms the CPU: by a factor of 2.61 on Fermi, a factor of 2.45 on Kepler, and a factor of 1.74 on Maxwell. Such factors are common when comparing highly-optimized GPU code with highly-optimized CPU code.

Our implementations of the Ziggurat method and rational polynomials often outperform the XORWOW generator. This is surprising at first sight since every transformation of these two methods requires at least one uniform random number generated by the XORWOW generator. The same phenomenon can be observed for the normal PRNGs of the cuRAND library. One possible explanation is the higher degree of instruction level parallelism when executing the uniform random number generation plus the transformation to normal distribution in one single, fused kernel, as we do.
5. Conclusion

In this paper, we presented three non-standard normal PRNGs and their performance on GPUs. All these methods have special properties which can be used for high-performance implementations: The Ziggurat method offers a runtime/memory trade-off which can be used to increase the likelihood of doing the transformation from uniform to normal distribution at very cheap cost and to reduce warp divergence on GPUs. Rational polynomials are compute-intensive candidates to approximate the inverse normal CDF, which is not a drawback on GPUs; the negative effects of warp divergence can be resolved by blending. Finally, the Wallace method offers very high generation rates on GPUs due to its simple orthogonal transformation to generate a new pool of random numbers; this behavior is intensified if no uniform PRNG is required to handle statistical defects.

We showed that our implementations on all three benchmarked GPU generations outperform a highly optimized CPU version using the MKL by a factor of up to 2.61, using the Wallace method on Fermi. On the same GPU architecture, we outperform cuRAND by a factor of 4.53, leading to a generation rate of 4.46 GPRNs/s. The performance of our non-standard PRNGs depends strongly on the number of Ziggurat strips N, the grid configuration, the specific implementation, and the hardware architecture, but all of them are very well suited for GPUs, offering top generation rates. GPU implementations of various CSE applications relying on PRNGs may profit considerably from the presented implementations of non-standard generators.

Acknowledgment

This work has been supported by the Open Research Grant ORG/CBS/14/008 of the Research Council of Oman.

References

[1] J.E. Gentle, Random Number Generation and Monte Carlo Methods, Springer Science & Business Media, 1998.
[2] P. Bratley, B.L. Fox, L.E. Schrage, A Guide to Simulation, Springer-Verlag, New York, 1983.
[3] P. L'Écuyer, Random numbers for simulation, Commun. ACM 33 (10) (1990) 85–97.
[4] M.G. Luby, Pseudorandomness and Cryptographic Applications, Vol. 31, Princeton University Press, 1996.
[5] P. L'Écuyer, Uniform random number generation, Ann. Oper. Res. 53 (1) (1994) 77–120.
[6] L. Devroye, Nonuniform random variate generation, in: Handbooks in Operations Research and Management Science, Vol. 13, Springer, 2006, pp. 83–121.
[7] D.B. Thomas, W. Luk, P.H.W. Leong, J.D. Villasenor, Gaussian random number generators, ACM Comput. Surv. (CSUR) 39 (4) (2007) 11.
[8] T. Neckel, F. Rupp, Random Differential Equations in Scientific Computing, Versita, De Gruyter Publishing Group, Warsaw, 2013.
[9] C. Riesinger, T. Neckel, F. Rupp, Solving random ordinary differential equations on GPU clusters using multiple levels of parallelism, SIAM J. Sci. Comput. 38 (4) (2016) C372–C402.
[10] G. Marsaglia, The Marsaglia Random Number CDROM including the Diehard Battery of Tests of Randomness, 1995. http://stat.fsu.edu/pub/diehard/.
[11] P. L'Écuyer, R. Simard, A Software Library in ANSI C for Empirical Testing of Random Number Generators, Technical Report, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, 2002.
[12] P. L'Écuyer, R. Simard, TestU01, ACM Trans. Math. Softw. 33 (4) (2007).
[13] L.Y. Barash, L.N. Shchur, PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs, Comput. Phys. Comm. 185 (4) (2014) 1343–1353.
[14] G. Marsaglia, W.W. Tsang, A fast, easily implemented method for sampling from decreasing or symmetric unimodal density functions, SIAM J. Sci. Stat. Comput. 5 (2) (1984) 349–359.
[15] A.J. Walker, An efficient method for generating discrete random variables with general distributions, ACM Trans. Math. Software 3 (3) (1977) 253–256.
[16] P.H.W. Leong, G. Zhang, D.-U. Lee, W. Luk, J.D. Villasenor, A comment on the implementation of the ziggurat method, J. Stat. Softw. 12 (7) (2005) 1–4.
[17] J.A. Doornik, An Improved Ziggurat Method to Generate Normal Random Samples, University of Oxford, 2005.
[18] G. Zhang, P.H.W. Leong, D.-U. Lee, J.D. Villasenor, R.C.-C. Cheung, W. Luk, Ziggurat-based hardware Gaussian random number generator, in: International Conference on Field Programmable Logic and Applications, 2005, pp. 275–280.
[19] H.M. Edrees, B. Cheung, M. Sandora, D. Nummey, S. Deian, Hardware-optimized ziggurat algorithm for high-speed Gaussian random number generators, in: International Conference on Engineering of Reconfigurable Systems & Algorithms, ERSA, 2009, pp. 254–260.
[20] C. De Schryver, D. Schmidt, N. Wehn, E. Korn, H. Marxen, R. Korn, A new hardware efficient inversion based random number generator for non-uniform distributions, in: Proceedings of the 2010 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2010, 2010, pp. 190–195.
[21] D.B. Thomas, L. Howes, W. Luk, A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation, in: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA'09, 2009, pp. 63–72.
[22] J. Buchmann, D. Cabarcas, F. Göpfert, A. Hülsing, P. Weiden, Discrete Ziggurat: A time-memory trade-off for sampling from a Gaussian distribution over the integers, in: Lecture Notes in Computer Science, LNCS, vol. 8282, 2014, pp. 402–417.
[23] A.L. Brophy, Approximation of the inverse normal distribution function, Behav. Res. Methods Instrum. Comput. 17 (3) (1985) 415–417.
[24] W.A. Richards, R. Antoine, A. Sahai, M.R. Acharya, An efficient polynomial approximation to the normal distribution function and its inverse function, J. Math. Res. 2 (4) (2010).
[25] M.J. Wichura, Algorithm AS 241: The percentage points of the normal distribution, Appl. Stat. 37 (1988) 477–484.
[26] J.D. Beasley, S.G. Springer, Algorithm AS 111: The percentage points of the normal distribution, Appl. Stat. 26 (1) (1977) 118.
[27] C.S. Wallace, Fast pseudorandom generators for normal and exponential variates, ACM Trans. Math. Softw. 22 (1) (1996) 119–127.
[28] R.P. Brent, Some comments on C. S. Wallace's random number generators, Comput. J. 51 (5) (2008) 579–584.
[29] R.P. Brent, A Fast Vectorised Implementation of Wallace's Normal Random Number Generator, Technical Report, Australian National University, 1997.
[30] R.P. Brent, Uniform and Normal Random Number Generators, 2016. http://maths-people.anu.edu.au/~brent/random.html.
[31] D.-U. Lee, W. Luk, J.D. Villasenor, G. Zhang, P.H.W. Leong, A hardware Gaussian noise generator using the Wallace method, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 13 (8) (2005) 911–920.
[32] C. Riesinger, T. Neckel, F. Rupp, A. Hinojosa Parra, H.-J. Bungartz, GPU optimization of pseudo random number generators for random ordinary differential equations, in: Procedia Computer Science, Vol. 29, Elsevier, 2014, pp. 172–183.
[33] M. Matsumoto, T. Nishimura, Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator, ACM Trans. Model. Comput. Simul. 8 (1) (1998) 3–30.
[34] G.E.P. Box, M.E. Muller, A note on the generation of random normal deviates, Ann. Math. Statist. 29 (2) (1958) 610–611.
[35] C. Riesinger, T. Neckel, A runtime/memory trade-off of the continuous Ziggurat method on GPUs, in: 2015 International Conference on High Performance Computing & Simulation (HPCS), IEEE, 2015, pp. 27–34.
[36] G. Marsaglia, Generating a variable from the tail of the normal distribution, Technometrics 6 (1) (1964) 101–102.
[37] E.T. Jaynes, Bayesian methods: General background, in: J.H. Justice (Ed.), Maximum Entropy and Bayesian Methods in Applied Statistics, Cambridge University Press, Cambridge, 1986, pp. 1–25.
[38] R.P. Brent, Random number generation and simulation on vector and parallel computers, in: Euro-Par'98 Parallel Processing, Springer, 1998, pp. 1–20.
[39] D.E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, third ed., Addison-Wesley, 1997.
[40] NVIDIA Corporation, cuRAND Library, 7.5 edition, Sept. 2015.
[41] NVIDIA Corporation, cuRAND, Feb. 2016. https://developer.nvidia.com/cuRAND.
[42] G. Marsaglia, Xorshift RNGs, J. Stat. Softw. 8 (2003) 1–6.
[43] NVIDIA Corporation, CUDA C Programming Guide, 7.5 edition, Sept. 2015.
[44] NVIDIA Corporation, Tuning CUDA Applications for Maxwell, 7.5 edition, Sept. 2015.
[45] NVIDIA Corporation, Tuning CUDA Applications for Kepler, 7.5 edition, Sept. 2015.
[46] Intel Corporation, Benchmarks for Intel Math Kernel Library, Jan. 2016. https://software.intel.com/en-us/intel-mkl/benchmarks.
Christoph Riesinger holds a Diploma in Computer Science from the Universität Passau, Germany. He is currently a doctoral candidate at the Chair of Scientific Computing in Computer Science of Prof. Bungartz at the Technische Universität München, Germany. His research interests comprise high performance computing (especially heterogeneous large-scale systems using GPUs), the numerical solution of random differential equations, and the Lattice-Boltzmann method.
Tobias Neckel studied Technomathematics at the Technische Universität München, Germany, and holds a Ph.D. in Computer Science from the TU München. He is currently a research fellow at the Chair of Scientific Computing in Computer Science of Prof. Bungartz at the TU München. His research interests in Computational Science and Engineering comprise the numerical solution of partial differential equations, uncertainty quantification, hierarchic and adaptive methods, and various aspects of high performance computing.
Florian Rupp studied Mathematics at the Technische Universität München, Germany, and holds a Ph.D. in Mathematics from the TU München. He is currently a Professor at the Department of Mathematics and Science at the German University of Technology in Oman. His research interests comprise stochastic dynamical systems and their stability, Hamiltonian systems and (classical) molecular dynamics, applications in Chemistry, Biology, Engineering and the Social Sciences, as well as the simulation of partial differential equations with stochastic/random forcing or controls.