Algorithmic and language-based optimization of Marsa-LFIB4 pseudorandom number generator using OpenMP, OpenACC and CUDA

Przemysław Stpiczyński
Maria Curie-Skłodowska University, Institute of Computer Science, Akademicka 9/519, 20-033 Lublin, Poland
E-mail address: [email protected]

Article history: Received 2 July 2019; Accepted 1 December 2019; Available online 10 December 2019.

Keywords: Pseudorandom numbers; Recursive generators; Vectorization; Algorithmic approach; OpenMP, OpenACC and CUDA.

Abstract

The aim of this paper is to present new high-performance implementations of Marsa-LFIB4, which is an example of a high-quality multiple recursive pseudorandom number generator. We propose an algorithmic approach that combines language-based vectorization techniques with a new divide-and-conquer parallel method that exploits the special sparse structure of the matrix obtained from the recursive formula that defines the generator. Our portable OpenACC implementation achieves performance comparable to that of our CUDA-based and OpenMP-based implementations on GPUs and multicore CPUs, respectively.

© 2019 The Author. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Pseudorandom numbers are very important, and pseudorandom number generators are often central parts of scientific applications such as simulations of physical systems using Monte Carlo methods. Many generators with different properties are available [12]. Recursion-based generators have good statistical properties and are commonly used [5,13,15,19,20]. Marsa-LFIB4 [18] is a prime example of such recursive generators: it is simple, it passed all empirical tests from the TestU01 library [16], and it has been used in practical applications [14]. However, no high-performance parallel implementation of it is known to us.

The problem of implementing pseudorandom number generators efficiently is very important from a practical point of view [1,3,21,22]. It is clear that an efficient implementation should utilize not only the multiple cores of modern processor architectures but also their vector extensions; only then can really high performance be expected. The SPRNG library [19] has been developed using cycle division and other parameterizing techniques such as block splitting and leapfrogging [2,4]. Our approach to developing not only parallel but also fully vectorized pseudorandom number generators is quite different. Instead of using rather complicated parallelization techniques [4], we rewrite recurrence relations as systems of linear equations and optimize these for multicore processors. The statistical properties of such parallel generators are then exactly the same as those of the corresponding sequential ones, so there is no need to perform special (and rather expensive) statistical tests. Such systems can be solved using vectorized parallel algorithms with more efficient data layouts.

This algorithmic approach was successfully applied to develop new parallel versions of the Linear Congruential Generator and the Lagged Fibonacci Generator (LFG) [25,27,28]. Unfortunately, in the case of the LFG, the number of operations required by the algorithm increases as the lag parameter increases, which is why the approach cannot be applied directly to Marsa-LFIB4, where the lag is 256. In order to design high-performance implementations of the generator, we have proposed [26] an algorithmic approach that combines language-based vectorization techniques with a new divide-and-conquer parallel algorithm that exploits the special sparse structure of the matrix obtained from the recursive formula. This OpenMP-based implementation of Marsa-LFIB4 achieves good performance and speedup on several multicore architectures, and it is more energy-efficient than the Simple (i.e. non-optimized) and SIMD (i.e. vectorized but non-parallel) implementations [26]. Data collected on a server with two Intel Xeon E5-2670 v3 processors using Intel's Running Average Power Limit (RAPL) [11] show that the power consumption of the high-performance implementation of Marsa-LFIB4 is 22% and 13% of the power consumption of SIMD and Simple, respectively. We have also shown that our intrinsic-based SIMD implementation is faster than the implementation based on the OpenMP simd construct on both Intel MIC architectures [9,10]: up to 5% on KNL and up to 6% on KNC. On the Xeon E5-2670 both implementations achieve almost the same performance (the intrinsic-based one is slightly faster). Unfortunately, the disadvantage of intrinsics is the lack of code portability between different versions of vector extensions.

The aim of this paper is to present two new high-performance implementations of Marsa-LFIB4. The first one uses OpenACC [6,8] and is fully portable: it can be run on both CPU-based systems and GPUs. The second one, dedicated to NVIDIA GPUs, uses CUDA [7].



Fig. 1. Two SIMD-optimized sequential versions of Marsa-LFIB4 using the OpenMP simd directive (left) and AVX512 intrinsics (right).

Fig. 2. Vectorization of Marsa-LFIB4 using SIMD extensions.

Fig. 3. Simple SIMD-optimized version of Marsa-LFIB4 using OpenACC.

2. SIMD Optimization of Marsa-LFIB4

A multiple recursive generator (MRG) of order k is defined by a linear recurrence of the form

$$x_i = (a_1 x_{i-1} + \cdots + a_k x_{i-k}) \bmod m.$$

It produces numbers from $\mathbb{Z}_m = \{0, 1, \ldots, m-1\}$. Usually m is a power of two, so the modulus operation can be computed by merely truncating all but the rightmost 32 bits; when the C/C++ "unsigned int" data type is used, the "mod m" can be neglected entirely. A simple example of an MRG is the Lagged Fibonacci generator $x_i = (x_{i-p_1} + x_{i-p_2}) \bmod m$, $0 < p_1 < p_2$. Another important high-quality recursive generator is Marsa-LFIB4 [18], based on the recurrence

$$x_i = (x_{i-p_1} + x_{i-p_2} + x_{i-p_3} + x_{i-p_4}) \bmod 2^{32}, \tag{1}$$

where p1 = 55, p2 = 119, p3 = 179, and p4 = 256. A simple implementation of (1) requires 3n arithmetic operations (additions) to generate a sequence of n numbers.

Fig. 1 (left) shows a simple SIMD-optimized version of the generator. It utilizes the OpenMP simd directive, which asks the compiler to make every possible effort to vectorize the loop [10]. The safelen clause indicates the maximum number of iterations per chunk. It is clear that this value should be less than p1, but the best performance is achieved when it is a power of two, so 32 is a good choice. It should be noticed that, due to obvious data dependencies, the loop in lines 7–10 of Fig. 1 cannot be vectorized automatically by the compiler, even at the highest optimization level.
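Since Fig. 1 is not reproduced here, a minimal sketch of this idea follows; the exact loop bounds and argument list are our assumptions, although the text refers to the routine as LFIB4():

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the OpenMP simd version of Fig. 1 (left); assumes the
   256-word seed occupies x[0..255]. Unsigned 32-bit arithmetic wraps
   around, so the "mod 2^32" of (1) is implicit. */
void LFIB4(uint32_t *x, size_t n)
{
    /* safelen(32): chunks of at most 32 iterations carry no loop
       dependency, because the smallest lag is p1 = 55 > 32 */
    #pragma omp simd safelen(32)
    for (size_t k = 256; k < n; k++)
        x[k] = x[k - 55] + x[k - 119] + x[k - 179] + x[k - 256];
}
```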

Fig. 1 (right) shows another SIMD-optimized version that uses intrinsics for Intel AVX512 instructions to take full advantage of Intel's 512-bit SIMD extensions. Intrinsics allow programmers to write constructs that look like C/C++ function calls corresponding to actual SIMD instructions; such calls are replaced with assembly code inlined directly into the program. The general idea is presented in Fig. 2: the output is produced as a sequence of vectors of length vl, and the necessary previously computed numbers are loaded into vector registers and added using simple vector-add operations. Note that each iteration performs one load of aligned data, three load operations on unaligned vectors (which are less efficient) and one store to an aligned memory area. Intrinsic-based implementations of Marsa-LFIB4 for Intel AVX2 256-bit extensions and the older KNC 512-bit extensions are presented in [26].

OpenACC is a standard for accelerated computing [6,8]. It offers compiler directives for offloading C/C++ and Fortran programs from a host to attached accelerator devices. Such simple directives allow regions of source code to be marked for automatic acceleration in a vendor-independent manner. OpenACC provides the parallel construct, which launches gangs of threads that execute in parallel; gangs may support multiple workers that execute in vector (i.e. SIMD) mode. The standard also provides several constructs for specifying the scope of data in accelerated parallel regions. Fig. 3 shows a simple OpenACC version of SIMD-optimized Marsa-LFIB4, an analogue of the OpenMP version from Fig. 1. However, it is sometimes necessary to apply high-level source-code transformations to improve performance [24]. Fig. 4 presents a more sophisticated cache-aware version of the generator: the output vector is divided into chunks of length p4 = 256; initially, the seed is loaded into the first half of a table of length 2p4 allocated in cache memory; each chunk is then constructed in the second half of the table and stored to main memory.
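As a rough illustration of the simple variant of Fig. 3 (the text calls it accLFIB4_v1), a hedged sketch follows; the clause values mirror the safelen(32) reasoning above, and all details are assumptions rather than the published code:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a simple OpenACC version in the spirit of Fig. 3. */
void accLFIB4_v1(uint32_t *restrict x, size_t n)
{
    /* one gang, sequential outer loop over chunks of 32 values */
    #pragma acc parallel loop seq copy(x[0:n]) num_gangs(1) vector_length(32)
    for (size_t i = 256; i < n; i += 32) {
        size_t end = (i + 32 < n) ? i + 32 : n;
        /* 32 consecutive values are independent since 32 < p1 = 55 */
        #pragma acc loop vector
        for (size_t k = i; k < end; k++)
            x[k] = x[k - 55] + x[k - 119] + x[k - 179] + x[k - 256];
    }
}
```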

Fig. 4. Cache-aware SIMD-optimized version of Marsa-LFIB4 using OpenACC.

Note that on the NVIDIA K40m the function accLFIB4_v2() is 2.5× faster than accLFIB4_v1(). The same approach can be applied to develop a CUDA version of the generator: the source code presented in Fig. 5 can be derived easily from accLFIB4_v2(). Note that the kernel cudaLFIB4() should be executed by a block of 256 threads.

Fig. 5. CUDA version of Marsa-LFIB4.
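Since Fig. 5 is not reproduced here, the following is a hedged sketch of what such a kernel could look like, following the cache-aware scheme of Fig. 4; the kernel name matches the text, but the wave width of 32 and all other details are our assumptions:

```c
#include <stdint.h>

/* Sketch of a cache-aware CUDA kernel in the spirit of Fig. 5: a shared
   table of length 2*p4 holds the previous chunk (history) in its first
   half and the chunk being built in its second half. Launch with a
   single block of 256 threads, e.g. cudaLFIB4<<<1, 256>>>(d_x, n),
   where d_x[0..255] holds the seed. */
__global__ void cudaLFIB4(uint32_t *x, size_t n)
{
    __shared__ uint32_t t[512];
    const int tid = threadIdx.x;

    t[tid] = x[tid];                    /* load the 256-word seed */
    __syncthreads();

    for (size_t i = 256; i < n; i += 256) {
        /* build the new chunk in waves of 32; since 32 < p1 = 55,
           all lagged operands of a wave were produced earlier */
        for (int w = 0; w < 256; w += 32) {
            if (tid < 32) {
                int k = 256 + w + tid;
                t[k] = t[k - 55] + t[k - 119] + t[k - 179] + t[k - 256];
            }
            __syncthreads();
        }
        if (i + tid < n)
            x[i + tid] = t[256 + tid];  /* store the finished chunk */
        __syncthreads();
        t[tid] = t[256 + tid];          /* chunk becomes the history */
        __syncthreads();
    }
}
```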

3. New algorithmic approach

Recently, we have developed a parallel approach that can be used to implement multiple recursive generators [25,27,28]. Unfortunately, in the case of the Lagged Fibonacci generator, the number of operations required by the algorithm increases as the value of p2 increases, which is why that approach cannot be applied directly to Marsa-LFIB4, where p4 = 256. In order to design a high-performance implementation of the generator, we propose a new approach that combines the techniques presented in Section 2 with a more efficient divide-and-conquer approach for solving linear recurrence systems (see Algorithm 1 in [23]).

Let n = rs and s > 2p4. To find a sequence of pseudorandom numbers defined by (1) for a given seed $d_0, \ldots, d_{p_4-1}$, we have to solve the following system of linear equations:

$$
\begin{bmatrix}
A_0 & & & \\
B   & A & & \\
    & \ddots & \ddots & \\
    & & B & A
\end{bmatrix}
\begin{bmatrix}
x_0 \\ x_1 \\ \vdots \\ x_{r-1}
\end{bmatrix}
=
\begin{bmatrix}
f \\ 0 \\ \vdots \\ 0
\end{bmatrix},
\tag{2}
$$

where $f = (d_0, \ldots, d_{p_4-1}, 0, \ldots, 0)^T \in \mathbb{Z}_m^s$, $x_i = (x_{is}, \ldots, x_{(i+1)s-1})^T \in \mathbb{Z}_m^s$, and the matrices $A, A_0, B \in \mathbb{Z}_m^{s \times s}$ are as shown in Fig. 6. The block system of linear equations (2) can be rewritten as

$$
\begin{cases}
A_0 x_0 = f, \\
B x_{i-1} + A x_i = 0, \quad i = 1, \ldots, r-1.
\end{cases}
\tag{3}
$$

Let $e_k$ denote the kth unit vector from $\mathbb{Z}_m^s$, i.e. $e_k = (0, \ldots, 0, 1, 0, \ldots, 0)^T$. It can be observed that the non-zero columns $B_i$ of B satisfy the equation given in Box I.

From (3) we have

$$
x_i = -A^{-1} B x_{i-1} = \sum_{k=0}^{p_4-1} x_{is-k-1}\, A^{-1}(-B_{s-1-k}).
$$

It is clear that $A^{-1}(e_i + e_j) = A^{-1}e_i + A^{-1}e_j$, thus

$$
\begin{aligned}
a_0 &= A^{-1}(-B_{s-p_4}), &\text{(4)}\\
b_0 &= A^{-1}(-B_{s-p_3}) = a_0 + a_{p_4-p_3}, &\text{(5)}\\
c_0 &= A^{-1}(-B_{s-p_2}) = a_0 + a_{p_4-p_2} + a_{p_3-p_2}, &\text{(6)}\\
d_0 &= A^{-1}(-B_{s-p_1}) = a_0 + a_{p_4-p_1} + a_{p_3-p_1} + a_{p_2-p_1}. &\text{(7)}
\end{aligned}
$$

Moreover, the vectors $a_k = A^{-1}(-B_{s-p_4+k})$, $k = 1, \ldots, p_4-p_3-1$, satisfy

$$
a_k = (\underbrace{0, \ldots, 0}_{k}, a_0, \ldots, a_{s-1-k})^T, \tag{8}
$$

where $a_0 = (a_0, \ldots, a_{s-1})^T$. Similarly, $b_k = A^{-1}(-B_{s-p_3+k})$, $k = 1, \ldots, p_3-p_2-1$, $c_k = A^{-1}(-B_{s-p_2+k})$, $k = 1, \ldots, p_2-p_1-1$, and $d_k = A^{-1}(-B_{s-p_1+k})$, $k = 1, \ldots, p_1-1$, can be derived easily from (5)–(7) using the simple shift operation (8).


Fig. 6. Shapes of A0, B and A. Blue dots on the main diagonals of A0 and A: 1; green dots: −1; otherwise: 0. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
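For readers without access to the figure, the following reading of it, reconstructed here from recurrence (1) and the caption rather than taken from the figure itself, may help:

$$
(A)_{jj} = 1, \qquad (A)_{j,\,j-p_l} = -1 \quad (l = 1, \ldots, 4,\ j \ge p_l),
$$

with all other entries equal to zero. $A_0$ agrees with $A$ except that its first $p_4$ rows are pure identity rows, so that $A_0 x_0 = f$ reproduces the seed $d_0, \ldots, d_{p_4-1}$ while its remaining rows enforce (1). $B$ carries the couplings of the first $p_4$ rows of each block to the last $p_4$ entries of the preceding block; its non-zero columns are exactly those listed in Box I.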

$$
-B_i =
\begin{cases}
e_{i-s+p_4}, & i = s-p_4, \ldots, s-1-p_3,\\
e_{i-s+p_3} + e_{i-s+p_4}, & i = s-p_3, \ldots, s-1-p_2,\\
e_{i-s+p_2} + e_{i-s+p_3} + e_{i-s+p_4}, & i = s-p_2, \ldots, s-1-p_1,\\
e_{i-s+p_1} + e_{i-s+p_2} + e_{i-s+p_3} + e_{i-s+p_4}, & i = s-p_1, \ldots, s-1.
\end{cases}
$$

Box I.

Finally, we get the following:

$$
x_i = \sum_{k=0}^{n_4-1} x_{is-p_4+k}\, a_k + \sum_{k=0}^{n_3-1} x_{is-p_3+k}\, b_k + \sum_{k=0}^{n_2-1} x_{is-p_2+k}\, c_k + \sum_{k=0}^{n_1-1} x_{is-p_1+k}\, d_k, \tag{9}
$$

where n4 = p4 − p3, n3 = p3 − p2, n2 = p2 − p1, and n1 = p1. Note that n1 + n2 + n3 + n4 = p4, thus each x_i is the sum of p4 vectors. Direct application of (9) allows us to develop a new, fully vectorized parallel algorithm. Unfortunately, the required number of operations is huge:

$$
N_1(s, r) = 2p_4(s-p_4)(r-1) = 512n - 2p_4^2 r - 2p_4 s + 2p_4^2.
$$

Algorithm 1: Parallel Vectorized Marsa-LFIB4

Data: s, r, x_0, ..., x_{p4−1} – seed (s should be a multiple of p4)
Result: x_{p4}, ..., x_{rs−1} – generated numbers
1: apply (1) to find x_0                                  ▷ using a SIMD-optimized method
2: apply (1) to find the 2p4 last entries of a_0
3: apply (5)–(7) to find the 2p4 last entries of b_0, c_0, d_0    ▷ vectorized
4: for i = 1, ..., r − 1 do
5:     apply (9) to find the p4 last entries of x_i               ▷ vectorized
6: end
7: parallel for i = 1, ..., r − 1 do
8:     apply (1) to find the s − p4 first entries of x_i
9: end

In order to propose a more efficient method, let us observe that to find each vector x_i, i = 1, ..., r − 1, we only need the p4 last entries of x_{i−1}. Thus, we can apply (9) to find the p4 last entries of each x_i and then find, in parallel, the s − p4 first entries of each vector using the SIMD-optimized version from Section 2 (see Algorithm 1 for details). It can be verified easily that the total number of arithmetic operations required by the new algorithm is only

$$
N_2(s, r) = 12p_4 r + 2p_4^2(r-1) + 3(s-p_4)(r+1) = 3n + 3s + (9p_4 + 2p_4^2)\, r - 2p_4^2 - 3p_4. \tag{10}
$$

The question is how to choose the values of the parameters r and s. Clearly, the total number of operations grows as r grows; however, the potential parallelism of the algorithm then also grows.
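As a rough sanity check of these counts (our own arithmetic, not taken from the paper), take $n = 2^{24}$, $r = 32$ and $s = 2^{19}$:

$$
N_1(s,r) = 512\,(2^{19}-256)\cdot 31 \approx 8.3\times 10^9 \approx 496\,n, \qquad
N_2(s,r) \approx 5.6\times 10^7 \approx 3.3\,n,
$$

so the method behind Algorithm 1 needs roughly 150× fewer operations than the direct application of (9) and stays close to the 3n additions of the sequential generator.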

4. Parallel implementations

Algorithm 1 can be implemented easily using OpenMP or OpenACC. Lines 1–2 simply apply LFIB4(). Lines 3–6 are straight loops that can be vectorized using appropriate SIMD constructs. The parallel part of the algorithm (lines 7–9) can be expressed as a "parallel for" construct; a minimal OpenMP sketch is given at the end of this section. Our OpenACC-based implementation uses accLFIB4_v2() to find x_0, and the 2p4 last entries of a_0 can be found in the same way. The loops in lines 3–6 can be vectorized using the "parallel" construct with "vector_length(256)" and "num_gangs(1)". Parallelization of the loop in lines 7–9 is a little more complicated; the details can be found in Fig. 7. The iterations are assigned to workers within gangs (line 8). Each gang consists of eight workers (line 3). Each column is computed by one worker in SIMD mode (lines 11–25) using the table cx allocated in cache memory. Note that the two-dimensional table cx is shared by all workers within a gang.

Our CUDA-based implementation is similar to the one using OpenACC; however, instead of using high-level programming constructs, we have to assign tasks to threads within blocks manually (see Fig. 8).
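The promised sketch of lines 7–9 of Algorithm 1 in OpenMP follows; the function name is ours, and steps 1–6 are assumed to have already produced the whole of x_0 and the p4 = 256 last entries of every block x_i (i ≥ 1) via (9):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the parallel part of Algorithm 1 (lines 7-9). */
void LFIB4_parallel_part(uint32_t *x, size_t s, size_t r)
{
    /* block i occupies x[i*s .. (i+1)*s-1]; its first s - p4 entries
       depend only on the p4 last entries of block i-1, which are
       already in place, so the blocks are fully independent */
    #pragma omp parallel for schedule(static)
    for (size_t i = 1; i < r; i++) {
        #pragma omp simd safelen(32)
        for (size_t k = i * s; k < (i + 1) * s - 256; k++)
            x[k] = x[k - 55] + x[k - 119] + x[k - 179] + x[k - 256];
    }
}
```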

Fig. 7. Parallel part of Marsa-LFIB4 using OpenACC.

Fig. 8. Parallel part of Marsa-LFIB4 using CUDA.

5. Results of experiments

All experiments were carried out on four different target architectures, all modern accelerated systems supporting the OpenMP, OpenACC and CUDA programming models:

E5-2670: a server with two Intel Xeon E5-2670 v3 processors (24 cores in total, with hyperthreading, 2.3 GHz) and 128 GB RAM, running Linux with Intel Parallel Studio version 2017;

KNC: as E5-2670, additionally with an Intel Xeon Phi 7120P coprocessor (61 cores with multithreading, 1.238 GHz, 16 GB RAM); all experiments were carried out on the Xeon Phi working in native mode [9];

KNL: a server with an Intel Xeon Phi 7210F (KNL, 64 cores, 1.3 GHz, AVX512) and 128 GB RAM, running Linux with Intel Parallel Studio version 2017;

K40m: as E5-2670, additionally with an NVIDIA Tesla K40m GPU [17] (2880 cores, 12 GB RAM), CUDA 9.0 and Portland Group PGI compilers and tools version 19.4 with OpenMP and OpenACC support.

On the first three servers we tested the Simple (non-optimized) implementation, two SIMD-optimized (non-parallel) implementations of Marsa-LFIB4 using the OpenMP simd construct and intrinsics, respectively, and the parallel algorithm using OpenMP. Examples of the results are presented in Fig. 9.


Fig. 9. Performance of SIMD-optimized and OpenMP implementations of Marsa-LFIB4: timing (left) and speedup (right).

Our OpenACC and CUDA implementations have been tested on the K40m; exemplary results can be found in Fig. 10.

As expected, the Simple (non-optimized) implementation achieved the worst performance. The intrinsic-based SIMD implementation is faster than the implementation based on the OpenMP simd construct on both Intel MIC architectures (up to 5% on KNL and up to 6% on KNC); on the Xeon E5-2670 both implementations achieve the same performance. Our parallel OpenMP implementation therefore uses intrinsics. It should be noticed that on all platforms the best performance is achieved when only one thread per core is used and the value of r is equal to the number of cores. The use of SIMD extensions improves the performance of Marsa-LFIB4 by 5–6× on KNL and by about 1.8× on the CPU with AVX2; on KNC, the SIMD-optimized implementation is only 18% faster than Simple. The use of multiple cores results in a significant increase in performance.

On the E5-2670 the highest speedup relative to SIMD is up to 12, so the efficiency is about 0.5. On KNC and KNL the efficiency of using multiple cores is worse, especially on KNL. However, on KNL we observe the best speedup relative to Simple (about 31); on the E5-2670 this speedup is about 21, which means that the efficiency of our parallel implementation relative to Simple reaches 88% on that platform. The low efficiency of using multiple cores on KNC and KNL is probably due to the overheads of fork-join operations and the synchronization of multiple threads.

Fig. 10 compares the performance of our OpenACC-based (OpenACC) and CUDA-based (CUDA) implementations on the K40m, and the performance of OpenACC (built for multicore architectures using PGI) and OpenMP (built using both the Intel and PGI compilers) on the E5-2670. The GPU implementations (OpenACC and CUDA) achieve their best performance when s/r ≈ 2048. The right-hand plots show the speedup of the implementations measured against SIMD built using the Intel icc compiler, which gives the fastest executable code.


Fig. 10. Performance of OpenACC, CUDA and OpenMP implementations of Marsa-LFIB4: timing (left) and speedup (right) results.

It can be observed that OpenACC and CUDA have similar performance, although OpenACC is a bit slower for smaller sizes. On the E5-2670, OpenACC is faster than OpenMP compiled with PGI but slower than OpenMP built with Intel.

6. Conclusions

We have shown that Marsa-LFIB4, a fine example of linear recurrence computations, can be implemented efficiently on modern multicore processors with vector extensions and on GPUs using language-based tools together with the algorithmic approach. Using intrinsics instead of the simple simd construct increases performance slightly but also limits code portability. Our parallel SIMD-optimized implementations using OpenMP, OpenACC and CUDA achieve good performance. Moreover, our portable OpenACC implementation achieves performance comparable to that of our CUDA-based and OpenMP-based implementations on GPUs and multicore CPUs, respectively.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.jpdc.2019.12.004.

References

[1] S. Aluru, Lagged Fibonacci random number generators for distributed memory parallel computers, J. Parallel Distrib. Comput. 45 (1) (1997) 1–12, http://dx.doi.org/10.1006/jpdc.1997.1363.

[2] H. Bauke, S. Mertens, Random numbers for large-scale distributed Monte Carlo simulations, Phys. Rev. E 75 (2007) 066701, http://dx.doi.org/10.1103/PhysRevE.75.066701.
[3] R.H. Bisseling, Parallel Scientific Computation. A Structured Approach using BSP and MPI, Oxford University Press, Oxford, 2004.
[4] T. Bradley, J. du Toit, R. Tong, M. Giles, P. Woodhams, Parallelization techniques for random number generators, in: GPU Computing Gems, Emerald ed., 2011, pp. 231–246.
[5] R.P. Brent, Uniform random number generators for supercomputers, in: Proc. Fifth Australian Supercomputer Conference, 1992, pp. 95–104.
[6] S. Chandrasekaran, G. Juckeland (Eds.), OpenACC for Programmers: Concepts and Strategies, Addison-Wesley, 2018.
[7] J. Cheng, M. Grossman, T. McKercher, Professional CUDA C Programming, Wiley and Sons, 2014.
[8] R. Farber (Ed.), Parallel Programming with OpenACC, Morgan Kaufmann, 2017.
[9] J. Jeffers, J. Reinders, Intel Xeon Phi Coprocessor High-Performance Programming, Morgan Kaufmann, Waltham, MA, USA, 2013.
[10] J. Jeffers, J. Reinders, A. Sodani, Intel Xeon Phi Processor High-Performance Programming, Knights Landing ed., Morgan Kaufmann, Cambridge, MA, USA, 2016.
[11] K.N. Khan, M. Hirki, T. Niemi, J.K. Nurminen, Z. Ou, RAPL in action: Experiences in using RAPL for power measurements, ACM Trans. Model. Perform. Eval. Comput. Syst. 3 (2) (2018) 9:1–9:26, http://dx.doi.org/10.1145/3177754.
[12] D.E. Knuth, Seminumerical Algorithms, second ed., in: The Art of Computer Programming, vol. II, Addison-Wesley, 1981.
[13] D.E. Knuth, MMIXware, A RISC computer for the third millennium, in: Lecture Notes in Computer Science, vol. 1750, Springer, 1999.
[14] K. Lapa, K. Cpalka, A. Przybyl, K. Grzanek, Negative space-based population initialization algorithm (NSPIA), in: Artificial Intelligence and Soft Computing – 17th International Conference, ICAISC 2018, Zakopane, Poland, June 3–7, 2018, Proceedings, Part I, in: Lecture Notes in Computer Science, vol. 10841, Springer, 2018, pp. 449–461, http://dx.doi.org/10.1007/978-3-319-91253-0_42.
[15] P. L'Ecuyer, Good parameters and implementations for combined multiple recursive random number generators, Oper. Res. 47 (1) (1999) 159–164, http://dx.doi.org/10.1287/opre.47.1.159.

[16] P. L'Ecuyer, R.J. Simard, TestU01: A C library for empirical testing of random number generators, ACM Trans. Math. Software 33 (4) (2007) 22:1–22:40, http://dx.doi.org/10.1145/1268776.1268777.
[17] Y. Li, L. Schwiebert, E. Hailat, J.R. Mick, J.J. Potoff, Improving performance of GPU code using novel features of the NVIDIA Kepler architecture, Concurr. Comput.: Pract. Exper. 28 (13) (2016) 3586–3605, http://dx.doi.org/10.1002/cpe.3744.
[18] G. Marsaglia, Random numbers for C: The END? Posted to the electronic billboard sci.crypt.random-numbers, 1999.
[19] M. Mascagni, A. Srinivasan, Algorithm 806: SPRNG: a scalable library for pseudorandom number generation, ACM Trans. Math. Software 26 (3) (2000) 436–461, http://dx.doi.org/10.1145/358407.358427.
[20] M. Mascagni, A. Srinivasan, Parameterizing parallel multiplicative lagged-Fibonacci generators, Parallel Comput. 30 (5–6) (2004) 899–916, http://dx.doi.org/10.1016/j.parco.2004.06.001.
[21] G. Ökten, M. Willyard, Parameterization based on randomized quasi-Monte Carlo methods, Parallel Comput. 36 (7) (2010) 415–422, http://dx.doi.org/10.1016/j.parco.2010.03.003.
[22] O.E. Percus, M.H. Kalos, Random number generators for MIMD parallel processors, J. Parallel Distrib. Comput. 6 (3) (1989) 477–497, http://dx.doi.org/10.1016/0743-7315(89)90002-6.
[23] P. Stpiczyński, Parallel algorithms for solving linear recurrence systems, in: Parallel Processing: CONPAR 92 – VAPP V, Second Joint International Conference on Vector and Parallel Processing, Lyon, France, September 1–4, 1992, Proceedings, in: Lecture Notes in Computer Science, vol. 634, Springer, 1992, pp. 343–348, http://dx.doi.org/10.1007/3-540-55895-0_428.
[24] P. Stpiczyński, Semiautomatic acceleration of sparse matrix–vector product using OpenACC, in: Parallel Processing and Applied Mathematics, 11th International Conference, PPAM 2015, Krakow, Poland, September 6–9, 2015, Revised Selected Papers, Part II, in: Lecture Notes in Computer Science, vol. 9574, Springer, 2015, pp. 143–152, http://dx.doi.org/10.1007/978-3-319-32152-3_14.
[25] P. Stpiczyński, Vectorized algorithm for multidimensional Monte Carlo integration on modern GPU, CPU and MIC architectures, J. Supercomput. 74 (2) (2018) 936–952, http://dx.doi.org/10.1007/s11227-017-2172-x.
[26] P. Stpiczyński, Parallel fully vectorized Marsa-LFIB4: Algorithmic and language-based optimization of recursive computations, in: Parallel Processing and Applied Mathematics, 13th International Conference, PPAM 2019, Białystok, Poland, September 8–11, 2019, in: Lecture Notes in Computer Science, Springer, 2020, in press.
[27] P. Stpiczyński, D. Szałkowski, J. Potiopa, Parallel GPU-accelerated recursion-based generators of pseudorandom numbers, in: Proceedings of the Federated Conference on Computer Science and Information Systems, September 9–12, 2012, IEEE Computer Society Press, Wroclaw, Poland, 2012, pp. 571–578, http://fedcsis.org/proceedings/2012/pliks/380pdf.
[28] D. Szałkowski, P. Stpiczyński, Using distributed memory parallel computers and GPU clusters for multidimensional Monte Carlo integration, Concurr. Comput.: Pract. Exper. 27 (4) (2015) 923–936, http://dx.doi.org/10.1002/cpe.3365.

Przemysław Stpiczyński is an associate professor at the Institute of Computer Science, Maria Curie-Skłodowska University in Lublin, Poland. He received his Ph.D. degree in Computational Mathematics in 1995 and his D.Sc. degree in Computer Science in 2010. He has published over 68 research papers. His main interests are parallel and distributed computing, parallel programming, and numerical analysis. He is a program committee member of several scientific conferences, such as Parallel Processing and Applied Mathematics, Computer Aspects of Numerical Algorithms, and High Performance Computing and Simulations.