J. Parallel Distrib. Comput. 111 (2018) 251–259
FPGA optimized cellular automaton random number generator Lucian Petrica * University Politehnica of Bucharest, 313 Splaiul Independentei Street, 060042, Bucharest, Romania National Institute for Research and Development in Microtechnologies, 126A Erou Iancu Nicolae Street, 077190, Bucharest, Romania
Highlights
• We present SLC-SPCA, an FPGA-optimized pseudo-random number generator architecture.
• The architecture expands on the concept of self-programmable cellular automatons.
• SLC-SPCA generates 3 random bits per utilized look-up table in Xilinx FPGAs.
• PRNG quality is evaluated with the NIST Statistical Test Suite.
• Power and energy evaluations indicate a 3x improvement compared to previous work.
Article info
Article history: Received 28 June 2016; Received in revised form 30 May 2017; Accepted 31 May 2017; Available online 9 June 2017
Keywords: FPGA; Cellular automaton; Random number generator
Abstract
Pseudo-random number generators (PRNGs) are important to applications ranging from cryptography to Monte-Carlo methods. Consequently, many PRNG architectures have been proposed, including some optimized for FPGA, e.g., the LUT-SR family of PRNGs, which utilize embedded FPGA shift registers, and self-programmable cellular automaton (SPCA) PRNGs. However, LUT-SR and other PRNGs do not utilize key features of modern Xilinx FPGAs: embedded carry chains and splittable Look-Up Tables (LUTs), i.e., 6-input LUTs which can operate as two 5-input LUTs which share inputs. In this paper we explore the SPCA structure and derive a set of parameter constraints which allow an SPCA PRNG to produce 2 random bits per LUT in every clock cycle on modern Xilinx FPGAs. We determine this to be the maximum logic density achievable for SPCA, and propose an architectural improvement of SPCA to enable further density increase by making use of FPGA embedded carry chains as a method to compute an additional random bit per LUT in each clock cycle. The resulting Split-LUT-Carry SPCA (SLC-SPCA) PRNG achieves a 6x improvement in logic density compared to LUT-SR, and a 1.5x density increase compared to SPCA. We evaluate the randomness of SLC-SPCA utilizing the NIST Statistical Test Suite, and we provide a power and energy comparison of LUT-SR and SLC-SPCA on a Xilinx Zynq 7020 FPGA device. Our results indicate that SLC-SPCA generates 3x more bits per clock at approximately the same power dissipation as LUT-SR, and consequently requires 3x less energy to generate 1 gigabit of random data. SLC-SPCA is also 1.5x more energy-efficient than an SPCA PRNG.
© 2017 Elsevier Inc. All rights reserved.
* Correspondence to: University Politehnica of Bucharest, 313 Splaiul Independentei Street, 060042, Bucharest, Romania. E-mail address: [email protected]. http://dx.doi.org/10.1016/j.jpdc.2017.05.022 0743-7315/© 2017 Elsevier Inc. All rights reserved.
1. Introduction
Pseudo-random number generators (PRNGs) are essential building blocks of many applications, the most evident of which are cryptography, where encryption key randomness is essential, on-chip generation of random bit-streams for VLSI testing [5], and electronic gambling equipment, including on-line gambling. The importance of random number generation has led to the development of a large variety of PRNG architectures, most of which have been designed for software or VLSI implementation. Field Programmable Gate Array (FPGA) based PRNGs have received less attention because FPGAs have to date been utilized mostly as glue logic or prototyping equipment. However, the emerging popularity of FPGAs as energy-efficient accelerators in supercomputing and the data-center is leading towards the complete FPGA implementation of randomness-based applications such as stream ciphers and highly parallel Monte-Carlo simulations for e.g. real-time stock option pricing [17,18].
The success of Monte-Carlo and other randomness-based FPGA applications depends on the quality, speed, and resource efficiency of FPGA-implemented PRNGs. The quality of a PRNG is defined as the randomness of its outputs, whereby high-quality, cryptographically strong PRNGs are practically indistinguishable from true random number generators. PRNG resource efficiency is expressed
as the number of random bits generated per utilized LUT in each clock cycle, and the speed of an FPGA-implemented PRNG is its maximum operating frequency. While high-quality PRNGs have been implemented in FPGA [6], most often these were not designed for FPGA and their resource efficiency and speed are sub-optimal.
Most FPGA-implemented PRNGs are variations on linear feedback shift registers (LFSRs) due to the simplicity of the LFSR concept and its straightforward implementation in FPGA look-up tables (LUTs) and Flip-Flops (FFs). Despite some notable results in the scientific literature, such as the LUT-SR PRNG family [16] which makes intelligent use of FPGA embedded shift registers, LFSR-based FPGA PRNGs are still not optimal with regard to FPGA resource efficiency.
Our work focuses not on LFSRs but on cellular automaton (CA) based random number generation. CA-based PRNGs have been evaluated in the scientific literature and were found to perform well [19] on statistical randomness benchmarks. Compared to LFSRs, the quality of CA-based PRNGs is more difficult to analyze theoretically, but cellular automatons remain attractive for FPGA implementation due to their simple, repetitive structure. There is, however, little recent work with regard to FPGA implementation of CAs. Existing implementations target outdated FPGA architectures, and do not include randomness, performance or energy comparisons to state-of-the-art LFSR-based PRNG architectures such as the LUT-SR.
This article presents our successful attempt at achieving maximum PRNG resource efficiency on the Xilinx Virtex 6, 7 Series, and UltraScale/UltraScale+ FPGA families, which share the same underlying LUT and carry chain structure. We describe the Split-LUT-Carry Self Programmable Cellular Automatons (SLC-SPCA), an FPGA-tailored CA PRNG family.
SLC-SPCA is designed such that it utilizes each 6-input LUT as two independent 5-input LUTs, and further utilizes the embedded FPGA carry chain logic to generate a total of three PRNG bits per LUT in each clock cycle, an improvement of six times over LUT-SR and 1.5x more than is achievable with SPCA. We utilize the NIST Statistical Test Suite (STS) [8] to evaluate the quality of SLC-SPCA, and comparatively evaluate the energy consumption of SLC-SPCA versus LUT-SR on a Xilinx Zynq 7020 FPGA device. Our results indicate that SLC-SPCA produces random numbers of comparable quality to the state of the art in FPGA PRNGs, and is at least 50% and up to 3 times more energy-efficient than existing FPGA-optimized PRNGs, due to its reduced resource utilization.
2. Overview of FPGA PRNGs
In this section we refer to a number of PRNGs designed for, or implemented in, FPGA, specifically Xilinx FPGA architectures, which consist of 4- or 6-input look-up tables. Xilinx FPGAs have been popular in research because approximately one-third of the FPGA LUTs may be configured to operate as shift registers (SRL), which as we shall see is an opportunity for significant resource optimization of LFSRs.
2.1. LFSR-based FPGA PRNGs
As the name implies and Fig. 1 illustrates, the LFSR inputs are generated by a linear combination of various bits at the output of the shift register. This combination characterizes the LFSR, may be described as a polynomial over GF(2), and determines the LFSR period, which is the number of cycles before the LFSR starts repeating previous output values. The capability to configure a LUT as a shift register makes the LFSR a natural fit to the Xilinx FPGA architecture. Fig. 1 indicates that the LFSR may be implemented utilizing three LUTs and one Flip-Flop, and two of the LUTs are to be configured as SRLs to implement a chain of two and three Flip-Flops respectively. The SRL-based implementation replaces 5 FFs with two LUTs but more importantly reduces the number of wires required to connect the LFSR components, reducing power dissipation and increasing the maximum operating frequency.
Fig. 1. Example 6-bit LFSR and its FPGA implementation.
Harnessing the FPGA architectural features for random number generation with LFSRs has been the topic of several papers. In [13], the authors design the LUT-OPT family of PRNGs by imposing a set of constraints on the LFSR polynomial which ensure that all LFSR logic functions are small enough to be implemented in the available LUT, e.g., for a 6-input LUT FPGA architecture, the LUT-OPT LFSR is constructed from 6-input or smaller logic functions. Consequently, a k-bit LUT-OPT LFSR always maps to k LUT–FF pairs, and the top operating frequency is constant regardless of k. The disadvantages of LUT-OPT, as stated by the authors in [16], are the poor quality of the random sequence when k is small, due to the simple linear dependency between consecutive generated random numbers, the short period, and the difficulty of identifying LFSR polynomials which conform to the LUT-OPT constraints.
An improved FPGA LFSR structure is presented in [14], called LUT-FIFO, which implements longer LFSRs utilizing a Block RAM FIFO memory instead of LUTs and FFs. While this approach reduces the LUT requirement of the LFSR and enables much longer LFSR periods, RAM is a much scarcer resource than LUTs in FPGA, rendering the benefits of the trade-off questionable. A better compromise is the LUT-SR family of LFSRs, described in [15,16], whereby the FIFO memories required for LUT-FIFO are implemented utilizing LUTs configured as shift registers. Thus, the dependency on Block RAM is eliminated while retaining the desirable long period characteristic of a quality PRNG. By optimally partitioning the LUT-FIFO LFSR and mixing functions, LUT-SR achieves a resource utilization of two 6-input LUTs per output bit, one of which operates as a 32-stage (maximal depth) SRL and the other as a 6-input mixing function.
2.2. Cellular automaton PRNGs
A simple one-dimensional elementary CA is illustrated in Fig. 2, with each 1-bit cell connected to its immediate neighbors to the left and right. The next state $S_i^{t+1}$ of a cell is computed in each clock cycle according to Eq. (1). A total of 256 possible $R(S_{i+1}^t, S_i^t, S_{i-1}^t)$ functions exist, called CA rules. Each rule is named by the decimal conversion of the binary values in its truth table, as suggested in Fig. 2 for rule 90. At the edges of the CA, boundary conditions must be specified, e.g., null (constant zero) or periodic (wrapping) boundary.
$$S_i^{t+1} = R(S_{i+1}^t, S_i^t, S_{i-1}^t). \tag{1}$$
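The update of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `ca_step` is hypothetical, and the truth-table index is assumed to follow the common convention of reading the neighborhood as the 3-bit number (left, center, right). For the symmetric rules discussed in this paper (90, 150, 105, 165) the left/right orientation does not affect the result.

```python
def ca_step(state, rule, periodic=False):
    """Advance a 1-D elementary CA by one cycle.

    state    : list of 0/1 cell values
    rule     : rule number, 0..255
    periodic : wrap at the edges if True, else null (constant zero) boundary
    """
    k = len(state)
    nxt = []
    for i in range(k):
        if periodic:
            left, right = state[(i - 1) % k], state[(i + 1) % k]
        else:
            left = state[i - 1] if i > 0 else 0
            right = state[i + 1] if i < k - 1 else 0
        # Assumed convention: neighborhood bits form the truth-table index
        idx = (left << 2) | (state[i] << 1) | right
        nxt.append((rule >> idx) & 1)
    return nxt
```

With rule 90 each cell becomes the XOR of its two neighbors, so a single set cell in a 4-cell null-boundary CA spreads to both sides in the next cycle.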
By selecting an appropriate CA rule, e.g., rule 30 [19], the sequence of values of the center CA cell exhibits strong pseudorandom behavior. While appearing random to the human eye, the multi-bit number formed by aggregating the states of all CA cells does not exhibit satisfactory statistical randomness to the standards required for e.g. cryptography because of the correlation between the states of adjacent CA cells. Consequently, more complex CA structures have been explored in scientific literature which can generate better quality multi-bit random numbers by replacing the static, homogeneous rule of elementary CA with dynamic rules, i.e., rules which change in time, heterogeneous rules, i.e., cells have different rules each, or both [2,3,9].
Fig. 2. Example 4-bit rule 90 elementary cellular automaton.
Fig. 3. Example 4-bit SPCA.
Fig. 4. FPGA slice, logic group, and carry logic structure.
Self programmable CAs (SPCAs) are a class of complex cellular automatons whereby the CA rule of each cell is dynamically selected in each clock cycle from a set of two rules $R_1$ and $R_2$ based on the value of a meta-state bit, as expressed in Eq. (2). The meta-state itself is computed in each clock cycle according to Eq. (3) from the values of the state bits of two cells, which form the meta-state neighborhood.
$$S_i^{t+1} = \begin{cases} R_1(S_{i+1}^t, S_i^t, S_{i-1}^t) & \text{if } M_i^t = 0 \\ R_2(S_{i+1}^t, S_i^t, S_{i-1}^t) & \text{if } M_i^t = 1 \end{cases} \tag{2}$$
$$M_i^{t+1} = S_{i+m}^t + S_{i+n}^t. \tag{3}$$
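Eqs. (2) and (3) translate directly into a software model. The sketch below is illustrative only: `rule_bit` and `spca_step` are hypothetical names, a null (constant zero) boundary is assumed for out-of-range cells, the addition in Eq. (3) is taken as GF(2) addition (XOR), and the truth-table index convention (left, center, right) is an assumption that does not matter for the symmetric rules used here.

```python
def rule_bit(rule, left, center, right):
    """Look up one output bit of an elementary CA rule (assumed index order)."""
    return (rule >> ((left << 2) | (center << 1) | right)) & 1


def spca_step(state, meta, r1, r2, m, n):
    """One clock cycle of a k-cell SPCA with null boundary.

    state, meta : lists of 0/1 values (S_i and M_i)
    r1, r2      : rule numbers selected by M_i = 0 and M_i = 1
    m, n        : meta-state neighborhood offsets of Eq. (3)
    """
    k = len(state)

    def s(j):
        # Null boundary: cells outside the automaton read as zero
        return state[j] if 0 <= j < k else 0

    next_state = [
        rule_bit(r2 if meta[i] else r1, s(i - 1), s(i), s(i + 1))  # Eq. (2)
        for i in range(k)
    ]
    next_meta = [s(i + m) ^ s(i + n) for i in range(k)]            # Eq. (3)
    return next_state, next_meta
```

Each call returns the 2k bits (state and meta-state) that a k-cell SPCA PRNG makes available per cycle.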
Any SPCA is characterized by the tuple $(k, R_1, R_2, m, n)$, where k is the number of cells, while m and n are positive or negative integers. Fig. 3 illustrates a $(k = 4, R_1, R_2, 1, 0)$ SPCA with null boundary conditions. A k-cell SPCA PRNG generates two random bits per CA cell (the state and meta-state of the cell) which may be utilized to form a random number of up to 2k bits in every clock cycle.
Research in [9] indicates that the only cryptographic strength CA rules are the linear rules 150, 105, 90, and 165. These are pairwise-complementary, i.e., rule 105 is obtained by negating rule 150, and 165 by negating 90. Perhaps because of these findings, work on SPCA has focused on the particular case where the rule set $(R_1, R_2)$ consists of a pair of complementary linear rules, e.g. 150 and 105, or 90 and 165. Theoretical analysis in [3] proves linear rule SPCAs do not exhibit graveyard (stuck) states, while statistical analysis with the Diehard test suite [7] of $(k < 24, 150, 105, m < 3, n < 3)$ and $(k < 24, 90, 165, m < 3, n < 3)$ SPCAs reveals good randomness for a variety of meta-state neighborhoods and CA sizes. The authors of [4] analyze a 22-cell SPCA and hybrid LFSR-SPCA PRNGs with regard to FPGA resource utilization on a Spartan-3 FPGA, and indicate resource utilization of 2 to 3 LUTs per PRNG output bit, dissipating 103 mW at 115 MHz.
3. Maximizing logic density of CA PRNG cells
An FPGA-optimized CA PRNG cell exploits the logic structure of the target FPGA architecture to reduce resource utilization per PRNG output bit. Fig. 4 presents a Xilinx 7-Series FPGA slice, the elementary FPGA building block. Please note that some structures have been left out for clarity, and the complete structure is described in [20]. A slice consists of four logic groups, each containing one six-input LUT, two Flip-Flops (FFs), and carry-chain logic, which enables a slice to operate as a 4-bit adder. Each LUT may implement the following functionalities:
• Any 6-input logic function.
• Any combination of two 5-input logic functions of the same inputs.
• Any combination of two 3-input logic functions of different inputs.
• A 32-stage shift register (SRL).
If the LUT is configured to implement a combination of two 5- or 3-input logic functions, it produces the results of the two functions on O5 and O6 simultaneously, otherwise it produces data on O6 only. The O5 and O6 LUT outputs connect to the carry chain generate (G) and propagate (P) inputs respectively. The SUM output of the carry chain logic is the XOR of CI and P, while CO is connected to CI when P is one or to G when P is zero. The carry-chains of adjacent slices may be cascaded, enabling the construction of larger adders or custom carry-chain based logic. The F5 flip-flop connects to O5, while the F6 flip-flop connects to either O5, O6, or SUM. Each logic group has three outputs, named O, OQ and OMUX. Outputs O and OQ are connected to O6 and the output of F6 respectively, while OMUX is connected to the internal slice signals by a multiplexer. Except the carry chain multiplexer, all multiplexer select inputs are connected to the FPGA configuration memory and are therefore not user-accessible.
Fig. 5. SL-SPCA cell.
3.1. Constraining SPCA to increase density
Let us characterize the FPGA resource utilization as a function of the SPCA configuration, and identify the sub-set of configurations which enable an SPCA to utilize the FPGA 6-input LUT as a combination of two smaller LUTs. In a generic SPCA implementation, the 4-input state update function is implemented in a LUT with inputs connected to $S_{i+1}^t$, $S_i^t$, $S_{i-1}^t$, and $M_i^t$. The meta-state update function is implemented in a separate LUT with inputs connected to $S_{i+m}^t$ and $S_{i+n}^t$. Therefore a k-cell SPCA implementation utilizes 2k LUTs if it places no restriction on the values of m and n, and the rule set $(R_1, R_2)$.
Let us now consider under which circumstances we may combine both state and meta-state update functions into a single 6-input LUT, implementing a Split-LUT SPCA (SL-SPCA). As stated in the LUT description, the functions may be combined if either (i) they each have 3 inputs or less or (ii) the functions share inputs. The 3-or-less inputs constraint affects only the state transition function, since the meta-state transition function only has two inputs. The constraint translates to tighter selection criteria for the SPCA rule set. Specifically, the rules must be selected from the 48 CA reduced neighborhood rules, i.e., rules which ignore the values of either $S_{i+1}^t$, $S_i^t$, or $S_{i-1}^t$.
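Whether a particular rule is reduced-neighborhood can be checked mechanically from its truth table: a rule ignores an input if flipping that input never changes the output. The following sketch is an illustration under assumptions, with hypothetical function names and the truth-table index assumed to be the 3-bit number (left, center, right); it tests the four linear rules discussed in this paper.

```python
def depends_on(rule, position):
    """True if the rule output changes when the input at `position` flips.

    position: 2 = left neighbor, 1 = center cell, 0 = right neighbor
    (assumed bit positions within the truth-table index).
    """
    for idx in range(8):
        flipped = idx ^ (1 << position)
        if ((rule >> idx) & 1) != ((rule >> flipped) & 1):
            return True
    return False


def is_reduced_neighborhood(rule):
    """True if the rule ignores at least one of its three inputs."""
    return not all(depends_on(rule, p) for p in range(3))


for r in (90, 165, 150, 105):
    print(r, "reduced" if is_reduced_neighborhood(r) else "full neighborhood")
```

Rules 90 and 165 depend only on the two neighbors and not on the cell itself, whereas rules 150 and 105 depend on all three inputs.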
This is disadvantageous because, of the four cryptographically interesting linear rules, 90 and 165 are reduced-neighborhood, while 150 and 105 are not. Alternatively, with any SPCA rule set, the functions may be implemented in a single 6-input LUT if n is 1, 0, or −1, reducing the combined set of inputs of the two functions to $S_{i+1}^t$, $S_i^t$, $S_{i-1}^t$, $M_i^t$, and $S_{i+m}^t$. In such SL-SPCAs, $S_i^{t+1}$ and $M_i^{t+1}$ are generated at the O5 and O6 outputs of the LUT respectively, and loaded into the F5 and F6 flip-flops, as illustrated in Fig. 5. The SL-SPCA cell generates two bits at the OQ and OMUX outputs, and a k-cell SL-SPCA utilizes k 6-input LUTs, half as many as the generic SPCA. While the SL-SPCA constraints enable optimal packing of SPCA in the Xilinx FPGA fabric, they still leave the slice carry logic unutilized, and therefore an opportunity for PRNG logic density increases through architectural development beyond SPCA.
Fig. 6. SLC-SPCA cell.
3.2. Split-LUT-Carry SPCA cell
The Split-LUT-Carry SPCA (SLC-SPCA) cell design in Fig. 6 extends the SPCA concept, making use of the O output and carry chain logic of the slice logic group, which the SL-SPCA cell leaves unutilized, to further increase the logic density of the cell. The next state function of the SLC-SPCA is identical to that of SPCA described in Eq. (2). An intermediate mixing function $I_i^t$ of state bits is expressed in Eq. (4), with n constrained according to SL-SPCA guidelines to enable $S_i^{t+1}$ and $I_i^t$ to be generated from a single 6-input LUT at the O5 and O6 LUT outputs respectively. The next meta-state $M_i^{t+1}$, expressed in Eq. (5), is obtained by GF(2) addition (logic XOR) of $I_i^t$ and the carry input of the logic group, and is therefore equivalent to the SUM output of the carry chain logic. The SLC-SPCA cell carry output is generated according to Eq. (6) and is equivalent to the CO output of the carry chain logic.
$$I_i^t = S_{i+m}^t + S_{i+n}^t, \quad n \in \{-1, 0, 1\} \tag{4}$$
$$M_i^{t+1} = C_{i-1}^t + I_i^t \tag{5}$$
$$C_i^t = S_i^{t+1} \cdot \bar{I}_i^t + C_{i-1}^t \cdot I_i^t. \tag{6}$$
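Eqs. (4) to (6) describe one logic group: the LUT produces the next state on O5 (carry generate) and the mixing bit on O6 (carry propagate), and the carry logic derives the next meta-state (SUM) and the carry output (CO). A behavioural sketch of a single cell is given below. The function names are hypothetical, the truth-table index convention is an assumption, and the operand `s_n` is passed separately for clarity even though in hardware it is one of the three state-neighborhood inputs.

```python
def rule_bit(rule, left, center, right):
    """Look up one output bit of an elementary CA rule (assumed index order)."""
    return (rule >> ((left << 2) | (center << 1) | right)) & 1


def slc_cell(left, center, right, meta, s_m, s_n, carry_in, r1=150, r2=105):
    """Model of one SLC-SPCA logic group for a single clock cycle.

    Returns (s_next, i_mix, m_next, carry_out).
    """
    s_next = rule_bit(r2 if meta else r1, left, center, right)  # O5 / G, Eq. (2)
    i_mix = s_m ^ s_n                                           # O6 / P, Eq. (4)
    m_next = carry_in ^ i_mix                                   # SUM,    Eq. (5)
    carry_out = carry_in if i_mix else s_next                   # CO,     Eq. (6)
    return s_next, i_mix, m_next, carry_out
```

The multiplexer form of `carry_out` is equivalent to the sum-of-products in Eq. (6): when the propagate bit is one the carry input passes through, otherwise the generate bit (the next state) is selected.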
In this configuration, flip-flops F5 and F6 store the state and meta-state of the SLC-SPCA cell, which exit the logic group through outputs OMUX and OQ, while $I_i^t$ exits the logic group through O. The cell therefore produces three bits per cycle, which are user-accessible through the FPGA interconnection network and may be utilized for random number generation. The SLC-SPCA carry output can only connect to the next logic group carry input or to an adjacent FPGA slice.
By design, the SLC-SPCA has maximum logic density for the targeted FPGA fabric, utilizing all three outputs of the logic group in each cycle to generate random bits, more than either SPCA or LFSR on current Xilinx FPGA fabric. It was demonstrated in the previous section that the maximum density achievable for SPCA is 2 output random bits per LUT, whereas SLC-SPCA produces 3 random bits per LUT. LFSR designs generate output bits at LFSR taps, and each tap must be the output of either an FF or an SRL within a logic group, as in Fig. 1. If an SRL is utilized, F5 and OMUX are unusable. Therefore a maximum of 2 taps (output bits) may be implemented per LUT. It follows that SLC-SPCA is also more resource-efficient than any SPCA–LFSR hybrid, such as [4].
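The density figures quoted so far can be turned into a quick LUT budget. The sketch below is only arithmetic on the bits-per-LUT values stated in the text (two LUTs per output bit for LUT-SR, 2k LUTs for 2k bits in a generic SPCA, 2 bits per LUT for SL-SPCA, 3 bits per LUT for SLC-SPCA); the names are hypothetical, and supplementary logic such as boundary or initialization circuitry is not counted.

```python
from fractions import Fraction
import math

# Random bits produced per LUT in each clock cycle, as stated in the text
BITS_PER_LUT = {
    "LUT-SR": Fraction(1, 2),     # two 6-input LUTs per output bit
    "generic SPCA": Fraction(1),  # 2k LUTs for 2k output bits
    "SL-SPCA": Fraction(2),
    "SLC-SPCA": Fraction(3),
}


def luts_needed(output_bits, architecture):
    """Number of LUTs required to produce `output_bits` per clock cycle."""
    return math.ceil(Fraction(output_bits) / BITS_PER_LUT[architecture])


for arch in BITS_PER_LUT:
    print(f"{arch}: {luts_needed(96, arch)} LUTs for 96 bits per cycle")
```

For a 96-bit output this gives the 6x ratio between LUT-SR and SLC-SPCA and the 1.5x ratio between SL-SPCA and SLC-SPCA quoted in the abstract.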
4. SLC-SPCA PRNG
Random numbers of arbitrary length may be generated by connecting multiple SLC-SPCA cells together. This section discusses practical aspects of utilizing a multi-cell SLC-SPCA as a PRNG. Design decisions such as the CA rule set, boundary conditions, initialization, and output mixing affect the quality of the PRNG and will be analyzed in turn.
4.1. Rule set
SLC-SPCA does not impose any restriction on the rule set, but we take note of the theoretical analysis presented in [9] which indicates that linear rules 150, 105, 90, and 165 produce better quality CA PRNGs than non-linear rules. Analysis in [3] of SPCAs with rule sets $(R_1 = 150, R_2 = 105)$ and $(R_1 = 90, R_2 = 165)$ suggests these rule combinations produce statistically strong SPCA PRNGs, therefore we opt for the same rule sets in the SLC-SPCA PRNG, and will attempt to identify through statistical analysis which rule set, if any, produces random numbers of better quality.
4.2. PRNG output mixing
Each SLC-SPCA cell produces three bits per cycle, which may be arranged arbitrarily in the 3k-bit output of the PRNG. A simple concatenated output arrangement $O_C$ is described in Eq. (7). However, the structure of the SLC-SPCA and the restrictions placed on the rule neighborhood introduce an unwanted correlation between adjacent bits. Initial experiments with the SLC-SPCA confirmed this effect and indicated that $M_i^t$ produces frequent runs of ones and zeros, reducing the quality of the PRNG concatenated output. We therefore utilize the output arrangement $O_M$ in Eq. (8), which mixes the bits of $M_i^t$ to reduce the number of runs. It must be noted that mixing the output bits requires no additional LUT resources, as it is implemented entirely within the FPGA signal routing network.
$$O_C^t = (S_0^t, S_1^t, \ldots, S_{k-1}^t, I_0^t, I_1^t, \ldots, I_{k-1}^t, M_0^t, M_1^t, \ldots, M_{k-1}^t) \tag{7}$$
$$O_M^t = (S_0^t, \ldots, S_{k-1}^t, I_0^t, \ldots, I_{k-1}^t, M_0^t, M_{k-1}^t, M_1^t, M_{k-2}^t, \ldots, M_{k/2-1}^t, M_{k/2}^t). \tag{8}$$
4.3. SLC-SPCA boundary conditions
At the edges of the SLC-SPCA (cells 0 and k−1) the next state neighborhood is not fully defined. The undefined cell inputs are $S_{-1}$ for cell 0 and $S_k$ for cell k−1. Depending on the value of m, $S_{i+m}$ is an undefined member of the meta-state neighborhood for one or more SLC-SPCA cells, e.g., if m = −2, $S_{i-2}$ is undefined for cells 0 and 1. Finally, the carry input of cell 0, denoted $C_{-1}$, is undefined as well. The SLC-SPCA boundary conditions are the values assigned to the undefined inputs of each cell. Analysis in [9] indicates that periodic (wrapping) boundary cellular automatons provide better statistical randomness. We therefore utilize a periodic boundary for $S_i$, as described in Eq. (9).
$$S_j^t = \begin{cases} S_{k+j}^t & j \in [-k, 0) \\ S_{j-k}^t & j \in [k, 2k). \end{cases} \tag{9}$$
To our knowledge, no carry-chain based PRNG has been proposed in the scientific literature, therefore no guidance for the $C_{-1}$ boundary exists. We analyze this boundary with a view to eliminate PRNG bias, i.e., a statistical preference for either ones or zeros in each bit of the PRNG output.
Theorem 1. The SLC-PRNG output is biased if $C_{-1}$ is constant.
Proof of Theorem 1. Let us assume the SLC-PRNG is unbiased, therefore the values of $S_i$, $M_i$, and $I_i$ are uniformly distributed for all values of i. Therefore, the zero-probability, i.e. the probability of an output bit being zero at any one time, is:
$$P_0^S = \frac{1}{2}, \quad P_0^M = \frac{1}{2}, \quad P_0^I = \frac{1}{2}.$$
From Eq. (6) we extrapolate the zero-probability of an arbitrary $C_i$ as a function of $S_i$, $I_i$, and $C_{-1}$:
$$P_0^{C_0} = P_0^S P_0^I + P_0^{C_{-1}} P_0^I = \frac{1}{4} + \frac{1}{2} P_0^{C_{-1}}$$
$$P_0^{C_1} = P_0^S P_0^I + P_0^{C_0} P_0^I = \frac{1}{4} + \frac{1}{8} + \frac{1}{4} P_0^{C_{-1}}$$
$$P_0^{C_n} = P_0^S P_0^I + P_0^{C_{n-1}} P_0^I = \sum_{i=1}^{n} \frac{1}{2^{i+1}} + \frac{1}{2^n} P_0^{C_{-1}}.$$
Replacing the sum with its analytic formula:
$$P_0^{C_n} = \frac{2^n - 1}{2^{(n+1)}} + \frac{1}{2^n} P_0^{C_{-1}} = \frac{1}{2} - \frac{1}{2^{(n+1)}} + \frac{1}{2^n} P_0^{C_{-1}}.$$
From Eq. (5) we extrapolate
$$C_{i-1}^t = M_i^{t+1} + I_i^t.$$
And therefore, under the assumption of uniformly distributed $M_i$ and $I_i$, $C_i$ must also be uniformly distributed. It follows that:
$$P_0^{C_n} = \frac{1}{2} = \frac{1}{2} - \frac{1}{2^{(n+1)}} + \frac{1}{2^n} P_0^{C_{-1}}$$
$$P_0^{C_{-1}} = \frac{1}{2}$$
which states that the values of $C_{-1}$ are uniformly distributed, contradicting our initial assumption of constant $C_{-1}$. We cannot therefore connect $C_{-1}$ to logic zero or one without biasing the PRNG.
Conversely, the values of $C_{-1}$ must be uniformly distributed to minimize PRNG bias. This may be achieved precisely by connecting a toggle Flip-Flop to $C_{-1}$, with the possible disadvantage of a predictable (low entropy) input carry sequence. An alternative boundary solution is to feed back the carry output of the SLC-SPCA PRNG through a Flip-Flop to the carry input, but this does not guarantee a uniform distribution of $C_{-1}$. The feedback will propagate and possibly amplify any statistical deviation (bias) of the carry output back into the SLC-SPCA. However, any bias in the carry output should be small, especially for large values of k. In any case, the best type of $C_{-1}$ boundary is difficult to determine analytically and we will instead rely on statistical evaluation of all the possible boundary conditions to pick the most suitable.
Fig. 7. Example SLC-SPCA PRNG with carry boundary and initialization logic.
4.4. Initialization and entropy injection
An important aspect of PRNG functionality is its initialization. The PRNG must be capable of loading an initial state, or seed, which thereafter determines the random number sequence to be generated. Similarly, it is advantageous for a PRNG to be capable of adding entropy from external sources during operation in order to improve the randomness of its output. This entropy injection process is often employed in cryptographic PRNGs, which operate continuously and are frequently injected with true random data captured from e.g. network traffic.
The dense SLC-SPCA PRNG structure leaves little room for extra inputs for initialization and entropy injection. As all the available LUT inputs are utilized for PRNG operation, external control of the PRNG may be achieved only by additional LUTs, reducing the resource efficiency of the PRNG. To minimize this effect, we implement a serial initialization mechanism which injects the seed or additional random data through the carry input $C_{-1}$. The updated carry boundary is therefore described by Eq. (10), where $V^t$ is the initialization and external entropy injection data stream.
$$C_{-1}^t = V^t + \begin{cases} 0 & \text{or} \\ C_{k-1}^{t-1} & \text{or} \\ \bar{C}_{-1}^{t-1}. \end{cases} \tag{10}$$
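Taken together, the cell equations, the periodic state boundary of Eq. (9), the carry boundary of Eq. (10), and the mixed output of Eq. (8) define a complete generator. The following behavioural model is a sketch under stated assumptions rather than the author's implementation: class, method, and attribute names are hypothetical, k is assumed even, state and meta-state are loaded directly instead of serially through the carry input, the boundary flip-flops are assumed to start at zero, and the truth-table index convention (left, center, right) is assumed.

```python
def rule_bit(rule, left, center, right):
    """Look up one output bit of an elementary CA rule (assumed index order)."""
    return (rule >> ((left << 2) | (center << 1) | right)) & 1


def mix_meta(meta):
    """Interleave meta-state bits as M0, M(k-1), M1, M(k-2), ... per Eq. (8)."""
    k = len(meta)
    mixed = []
    for j in range(k // 2):
        mixed += [meta[j], meta[k - 1 - j]]
    return mixed


class SlcSpca:
    """Behavioural model of a k-cell SLC-SPCA PRNG (k assumed even)."""

    def __init__(self, state, meta, r1=150, r2=105, m=-4, n=0, boundary="toggle"):
        assert len(state) == len(meta) and n in (-1, 0, 1)
        assert boundary in ("zero", "feedback", "toggle")
        self.state, self.meta = list(state), list(meta)
        self.r1, self.r2, self.m, self.n = r1, r2, m, n
        self.boundary = boundary
        self.prev_carry_in = 0   # C_{-1} of the previous cycle (toggle boundary)
        self.prev_carry_out = 0  # C_{k-1} of the previous cycle (feedback boundary)

    def mix_bits(self):
        """Intermediate bits I_i of Eq. (4), using the periodic boundary of Eq. (9)."""
        s, k = self.state, len(self.state)
        return [s[(i + self.m) % k] ^ s[(i + self.n) % k] for i in range(k)]

    def output(self):
        """Current 3k-bit mixed output O_M of Eq. (8)."""
        return self.state + self.mix_bits() + mix_meta(self.meta)

    def step(self, v=0):
        """Advance one clock cycle; v is the injected bit V^t of Eq. (10)."""
        s, k = self.state, len(self.state)
        base = {"zero": 0,
                "feedback": self.prev_carry_out,
                "toggle": 1 - self.prev_carry_in}[self.boundary]
        carry = v ^ base                                   # Eq. (10)
        self.prev_carry_in = carry
        i_bits = self.mix_bits()
        next_state, next_meta = [], []
        for i in range(k):
            rule = self.r2 if self.meta[i] else self.r1
            s_next = rule_bit(rule, s[(i - 1) % k], s[i], s[(i + 1) % k])  # Eq. (2)
            next_state.append(s_next)
            next_meta.append(carry ^ i_bits[i])            # Eq. (5)
            carry = carry if i_bits[i] else s_next         # Eq. (6)
        self.prev_carry_out = carry
        self.state, self.meta = next_state, next_meta
```

A typical use would construct the model with a seed, call `step()` once per clock cycle, optionally passing an entropy bit, and read 3k bits from `output()` after each step.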
A complete 4-cell SLC-SPCA PRNG design is exemplified in Fig. 7, including the carry boundary and initialization logic. The three carry boundary methods are illustrated together, although in practice an SLC-SPCA PRNG designer would choose and implement only one. The entire initialization and carry boundary logic may be implemented in a single supplementary logic group. Except for the carry, all cell output signals are routed through the FPGA interconnection network to cell inputs and to the PRNG output.
5. Evaluation
Statistical testing is the standard method of evaluating the quality of a random number generator. The Diehard [7] test suite in particular has been extensively utilized in previous work on CA PRNGs. We perform all statistical analysis with the NIST [8] test suite, which supersedes Diehard and is more stringent. We selected the NIST Statistical Test Suite (STS) version 2.1.2 for our evaluation. STS analyzes a random number sequence generated by the PRNG, by applying several tests designed to detect bias, long runs of zeros, patterns, and other PRNG defects. STS determines a p-value for each test run over a single PRNG output sequence. The p-value is the probability that a perfect random number generator would have produced a sequence less random than the sequence under test. If the p-value is below a significance level α, the test is considered a fail, but even true random numbers can sometimes produce seemingly non-random sequences which fail. However, over multiple runs of a test on several different sequences from the same PRNG, each test expects a certain distribution of p-values, which STS analyzes to decide with greater certainty whether the PRNG under test is truly random.
Utilizing the STS, we evaluated a range of SLC-SPCA configurations, in order to identify which perform best with regard to random number quality. Table 1 summarizes the parameter values evaluated. We focused on 32-bit configurations (k = 32) because most applications operate with random numbers which are multiples of 32 bits. The three carry boundary conditions under analysis are constant zero, alternating ones and zeros (toggle), and feedback from the SLC-SPCA carry output.
Table 1
SLC-SPCA evaluation parameters.
Parameter        Values
k                32
$(R_1, R_2)$     (150, 105) and (90, 165)
m                [−5, −2] and [2, 5]
n                [−1, 1]
$C_{-1}$         Zero, Toggle, Feedback
For each SLC-SPCA configuration, 100 sequences of 1 Mbits each were generated with random seeds and evaluated by NIST STS at a significance level α = 0.01. STS performed 188 tests on the PRNG output data. We extracted the number of failed tests and the proportion of failed tests from NIST STS test reports. An identical 100-sequence evaluation was performed on the LUT-SR PRNG configuration (n = 1024, r = 32, t = 5, k = 32, s = 1c48), which is available as VHDL code on the authors' website at [12], and serves as a quality reference. We also extract NIST evaluation results of the Chasm [11] and the Blum–Blum–Shub [1] cryptographic PRNGs from [10].
Table 2 lists the number of failed NIST tests for the SLC-SPCA configurations with rule set $(R_1 = 150, R_2 = 105)$. The table also lists the average number of failed tests for m < 0 and m > 0.
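The per-sequence decision described earlier in this section (compute a p-value, compare it with α, then look at the proportion of passing sequences) can be illustrated with the simplest STS test, the frequency (monobit) test, whose p-value is erfc(|S_n| / sqrt(2n)) with S_n the sum of the bits mapped to ±1. The sketch below is a stand-alone illustration with hypothetical function names; it is not the STS code and covers only this one test out of the suite.

```python
import math


def monobit_p_value(bits):
    """p-value of the frequency (monobit) test for a sequence of 0/1 values."""
    n = len(bits)
    s_n = sum(1 if b else -1 for b in bits)  # map 0 -> -1, 1 -> +1 and add up
    return math.erfc(abs(s_n) / math.sqrt(2 * n))


def pass_proportion(sequences, alpha=0.01):
    """Fraction of sequences whose p-value is at or above the significance level."""
    passed = sum(monobit_p_value(seq) >= alpha for seq in sequences)
    return passed / len(sequences)
```

A perfectly balanced sequence yields a p-value of 1, while a heavily biased one yields a p-value far below α = 0.01 and counts as a failure in the pass proportion.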
We observe that configurations with m < 0 perform significantly better on average, which we attribute to the asymmetry of the SLC-SPCA, which propagates the carry signal from cell 0 to cell k−1. Over a third of SLC-SPCA configurations with this rule set passed all NIST tests.
Table 2
Number of failed NIST tests for rule set $(R_1 = 150, R_2 = 105)$ in various SLC-SPCA configurations.
$C_{-1}$ boundary   n     m = −5   −4   −3   −2    2    3    4    5
Constant zero      −1        0     0    3    0    1    1    0    0
                    0        0     4    0    0    1    2    1    1
                    1        0     0    0    1    6    0    3    1
Feedback flop      −1        1     1    0    0    1    0    0    1
                    0        2     0    1    1    3    2    0    1
                    1        1     0    0    0    5    3    2    3
Toggle flop        −1        0     0    1    2    1    1    1    0
                    0        1     0    0    0    2    1    1    1
                    1        1     2    1    0    3    3    3    1
Average                      0.64 (m < 0)         1.55 (m > 0)
Fig. 8. NIST failures versus metarule neighborhood size for rule set $(R_1 = 150, R_2 = 105)$.
Table 3 presents quite a different picture of SLC-SPCA configurations with rule set $(R_1 = 90, R_2 = 165)$. None of the configurations pass all NIST tests, which is quite unexpected given the results in [3], where SPCA rule sets $(R_1 = 90, R_2 = 165)$ and $(R_1 = 150, R_2 = 105)$ perform similarly. The explanation for the poor performance may be the fact that rules 90 and 165 are reduced neighborhood, and ignore $S_i$. With the added complexity of SLC-SPCA, the reduced neighborhood rules do not provide sufficient state mixing and therefore perform badly.
Table 3
Number of failed NIST tests for rule set $(R_1 = 90, R_2 = 165)$ in various SLC-SPCA configurations.
$C_{-1}$ boundary   n     m = −5   −4   −3   −2    2    3    4    5
Constant zero      −1        1     6    3    5    4    4    4    2
                    0        7     7    4   11    3    6    4    5
                    1        6     6    5    3   10    4    2    4
Feedback flop      −1        1     5    7    7    6    4    8    4
                    0        7     5    3    7    3    9    3    8
                    1        7     2    5    4    1    5    3    6
Toggle flop        −1        4     6    5    8    2    2    8    3
                    0        3     5    6    8    1    5    3    4
                    1        4     8    5    2    6    5    6    8
Average                      5.22 (m < 0)         4.86 (m > 0)
Fig. 9. NIST failures versus metarule neighborhood size for rule set $(R_1 = 90, R_2 = 165)$.
Figs. 8 and 9 illustrate the relationship between the size of the metarule neighborhood, defined as the absolute difference between m and n, and the number of failures for each rule set. Since the choice of n is limited to [−1, 1], these findings indicate that good performance is to be achieved by picking a large absolute value of m.
Table 4
Effect of $C_{-1}$ boundary on NIST results.
                    Rule set (90, 165)         Rule set (150, 105)
$C_{-1}$ boundary   Avg. fails   Max. fails    Avg. fails   Max. fails
Constant zero       4.83         11            1.04         6
Feedback flop       5.41         11            1.16         5
Toggle flop         4.87         8             1.08         3
Table 4 condenses the NIST results to highlight the effect of the carry boundary on the quality of the PRNG. For both rule sets, the feedback boundary performs worst, in both average and maximum number of NIST failed tests. The toggle and constant boundaries perform roughly the same. The minimum number of NIST failed tests is the same for all boundary types. The overall effect of the carry boundary on the quality of NIST results is minimized by the relatively large value of k in our evaluation but it does appear that the Toggle Flop boundary is most advantageous, with a close to minimal average number of failed tests and the smallest number of maximum failed tests. Evaluation results therefore support our theoretical analysis in Section 4.3.
Finally, average NIST pass proportion of the best SLC-SPCA configuration is presented in Table 5 with LUT-SR, Chasm, and Blum–Blum–Shub as reference. The 96-bit, 32-cell SLC-SPCA has similar
NIST results to the other three generators, indicating that SLC-SPCA produces a high-quality output random sequence. While Chasm and Blum–Blum–Shub are cryptographic-strength PRNGs, we make no claim of such properties for SLC-SPCA because (i) the number of sequences evaluated for SLC-SPCA is smaller than for either of the cryptographic PRNGs and (ii) we present no theoretical analysis of the forward or backward security of SLC-SPCA. However, we conjecture that the Ii output bits of the SLC-SPCA may be utilized to construct a cryptographic PRNG because, unlike Si and Mi, they do not expose the state bits of the underlying CA.

We also include the results of a 36-bit, 12-cell SLC-SPCA PRNG to demonstrate that the quality of the SLC-SPCA output sequence, like that of many CA-based PRNGs in previous work, is strongly dependent on the number of cells. The selected 12-cell SLC-SPCA output repeats every 2556 cycles, and while other 12-cell configurations may exhibit longer cycles and better statistical randomness, this result is an indication that SLC-SPCAs are most useful in large-k configurations for generating large numbers of random bits simultaneously and with good resource efficiency. In our evaluation, the (k = 32, R1 = 150, R2 = 105, m = −4, n = 0) SLC-SPCA does not repeat any output value even after 2^26 clock cycles, regardless of the C−1 boundary. A lower bound on the sequence length is thus established at 3 × 2^26 32-bit values.

Resource utilization, power, and energy evaluations were performed on a target Zynq 7020 FPGA. Four FPGA designs containing 20 PRNGs of type LUT-SR, SL-SPCA, and SLC-SPCA were synthesized with Xilinx ISE 14.7, and post-place-and-route simulations were performed at a clock frequency of 100 MHz in order to generate simulation activity files (SAIF). From each synthesized design and
Table 5
Comparative NIST performance of SLC-SPCA.

PRNG                                                                        Sequences   Avg. NIST Proportion
96-bit SLC-SPCA (k = 32, R1 = 150, R2 = 105, m = −4, n = 0, C−1 Toggle)     100         0.991
36-bit SLC-SPCA (k = 12, R1 = 150, R2 = 105, m = −4, n = 0, C−1 Toggle)     100         0.450
32-bit LUT-SR (n = 1024, r = 32, t = 5, k = 32, s = 1c48)                   100         0.990
Chasm                                                                       11400       0.990 [10]
Blum–Blum–Shub                                                              1000        0.990 [10]
Table 6
PRNG power and energy.

PRNG                                                                        Total power [mW]   Logic power [mW]   Signal power [mW]   Energy per Gb [mJ]
96-bit SLC-SPCA (k = 32, R1 = 150, R2 = 105, m = −4, n = 0, C−1 Toggle)     2.56               1.48               1.08                0.26
36-bit SLC-SPCA (k = 12, R1 = 150, R2 = 105, m = −4, n = 0, C−1 Toggle)     0.93               0.52               0.41                0.29
64-bit SL-SPCA (k = 32, R1 = 150, R2 = 105, m = −4, n = 0)                  2.49               1.43               1.06                0.39
  (architecturally same as [4] and [3])
32-bit LUT-SR (n = 1024, r = 32, t = 5, k = 32, s = 1c48)                   2.46               1.52               0.94                0.76
SAIF file, we produced a power estimation with the Xilinx Power Analyzer (XPA) tool. From the XPA report we eliminated power dissipated in components other than the PRNGs, and averaged out the PRNG power to obtain the dissipation of one PRNG. We utilized this information to estimate the energy required to generate 1 Gbit of random data with each of the PRNGs under analysis.

Table 6 summarizes the power and energy evaluation. We provide the total power dissipated by each PRNG, as well as logic and signal power. LUT-SR and SLC-SPCA dissipate approximately the same total power, even though LUT-SR utilizes twice as many slices. This is explained by the fact that SLC-SPCA utilizes one additional FF per LUT and the carry logic within the slice, which LUT-SR does not. LUT-SR dissipates more power in logic components, specifically in the SRL components, which dissipate 2–3 times more power than a LUT in normal configuration. Conversely, SLC-SPCA has three times more wires than LUT-SR and dissipates more power in signals. A 64-bit SL-SPCA was included to represent previous implementations of SPCA in [4] and [3]. As expected, it dissipates slightly less power than SLC-SPCA as it does not utilize the carry logic, but more than LUT-SR because of signal power. Energy results favor the SLC-SPCA, which generates 96 bits per clock as opposed to 32 for LUT-SR and 64 for SL-SPCA, so it finishes the task sooner and consumes less energy.

6. Conclusion and future work

The SLC-SPCA PRNG proposed in our work generates three random bits per utilized LUT, the maximum bit density possible on the target Xilinx FPGA architecture. As an added benefit, it does not utilize LUTs configured as shift registers, which are a more limited resource and which dissipate more power than normal LUT operation. The density advantage of SLC-SPCA over LUT-SR is achieved without compromising the quality of the generated random number sequence, as is evident in the NIST STS results of 32-cell SLC-SPCA PRNGs.
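As a rough cross-check of the energy column in Table 6, the sketch below recomputes the energy per gigabit from the total power and the output width of each generator. It assumes that the energy is simply the total power multiplied by the time needed to emit 10^9 bits at the 100 MHz simulation clock. The decimal definition of a gigabit and the helper name `energy_per_gbit_mj` are assumptions made for illustration and are not taken from the evaluation flow.

```python
# Sketch: energy needed to emit 1 Gbit, from per-PRNG power and throughput.
# Assumption: 1 Gbit = 1e9 bits; the paper does not state the convention.

CLOCK_HZ = 100e6   # simulation clock frequency used in the evaluation
GIGABIT = 1e9      # assumed decimal gigabit


def energy_per_gbit_mj(total_power_mw: float, bits_per_clock: int,
                       clock_hz: float = CLOCK_HZ) -> float:
    """Return the energy in mJ needed to generate GIGABIT bits."""
    seconds = GIGABIT / (bits_per_clock * clock_hz)
    return total_power_mw * seconds  # mW * s = mJ


# (total power in mW from Table 6, output bits per clock cycle)
generators = {
    "96-bit SLC-SPCA": (2.56, 96),
    "64-bit SL-SPCA": (2.49, 64),
    "32-bit LUT-SR": (2.46, 32),
}

energies = {name: energy_per_gbit_mj(power, bits)
            for name, (power, bits) in generators.items()}

for name, energy in energies.items():
    print(f"{name}: {energy:.3f} mJ per Gbit")
```

Under these assumptions the three results land within about 0.01 mJ of the corresponding Table 6 entries, and the ratio between LUT-SR and SLC-SPCA comes out a little under 3.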
Our results with the SLC-SPCA add to the large body of literature on FPGA PRNGs and enable greater freedom of choice for designers balancing resource utilization, energy consumption, power dissipation, and functionality. Further exploration of the SLC-SPCA design space is an avenue of future work, as is a deeper theoretical analysis of the properties of SLC-SPCA with regard to linearity and cryptographic strength. In this paper we explored only two CA linear rule combinations of SLC-SPCA, out of a multitude of possibilities. Even though previous work indicates that the evaluated linear rules are the strongest, it is nevertheless possible that other rules might provide better randomness in conjunction with the more complex structure of SLC-SPCA. With regard to cryptographic strength, we conjecture that
the I output of the SLC-SPCA has potential cryptographic strength as it hides the CA state, but such claims will have to be supported in future work.

Acknowledgment

The work was funded by the Sectoral Operational Programme Human Resources Development 2007–2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397.

References

[1] L. Blum, M. Blum, M. Shub, A simple unpredictable pseudo-random number generator, SIAM J. Comput. 15 (2) (1986) 364–383.
[2] A. Gheolbanoiu, D. Mocanu, R. Hobincu, L. Petrica, Global feedback self-programmable cellular automaton random number generator, Revista Tecnica De La Facultad De Ingenieria Universidad Del Zulia 39 (1) (2016) 1–9.
[3] S.-U. Guan, S.K. Tan, Pseudorandom number generation with self-programmable cellular automata, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 23 (7) (2004) 1095–1101.
[4] D.H. Hoe, J.M. Comer, J.C. Cerda, C.D. Martinez, M.V. Shirvaikar, Cellular automata-based parallel random number generators using FPGAs, Int. J. Reconfigurable Comput. 2012 (2012) 4.
[5] P.D. Hortensius, R.D. McLeod, H.C. Card, Parallel random number generation for VLSI systems using cellular automata, IEEE Trans. Comput. 38 (10) (1989) 1466–1473.
[6] S. Konuma, S. Ichikawa, Design and evaluation of hardware pseudo-random number generator MT19937, IEICE Trans. Inf. Syst. 88 (12) (2005) 2876–2879.
[7] G. Marsaglia, DIEHARD test suite, 1998. Online: http://www.stat.fsu.edu/pub/diehard/.
[8] NIST, Special Publication 800-22, A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, 2010.
[9] S.-H. Shin, K.-Y. Yoo, Analysis of 2-state, 3-neighborhood cellular automata rules for cryptographic pseudorandom number generation, in: Computational Science and Engineering, 2009, CSE’09, International Conference on, vol. 1, IEEE, 2009, pp. 399–404.
[10] J. Spencer, Cellular automata in cryptographic random generators, 2013. CoRR abs/1306.3546. URL http://arxiv.org/abs/1306.3546.
[11] J. Spencer, Pseudorandom bit generators from enhanced cellular automata, J. Cell. Autom. 10 (3) (2015) 295–317.
[12] D.B. Thomas, LUT-SR example code, 2015. Online: http://cas.ee.ic.ac.uk/people/dt10/research/rngsfpgalut_sr.html.
[13] D.B. Thomas, W. Luk, High quality uniform random number generation using LUT optimised state-transition matrices, J. VLSI Signal Process. Syst. Signal, Image, Video Technol. 47 (1) (2007) 77–92.
[14] D.B. Thomas, W. Luk, FPGA-optimised high-quality uniform random number generators, in: Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, ACM, 2008, pp. 235–244.
[15] D.B. Thomas, W. Luk, FPGA-optimised uniform random number generators using LUTs and shift registers, in: International Conference on Field Programmable Logic and Applications, FPL, IEEE, 2010, pp. 77–82.
[16] D.B. Thomas, W. Luk, The LUT-SR family of uniform random number generators for FPGA architectures, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21 (4) (2013) 761–770.
[17] X. Tian, K. Benkrid, X. Gu, High performance Monte-Carlo based option pricing on FPGAs, Eng. Lett. 16 (3) (2008) 434–442.
[18] A.H. Tse, D.B. Thomas, K.H. Tsoi, W. Luk, Reconfigurable control variate Monte-Carlo designs for pricing exotic options, in: International Conference on Field Programmable Logic and Applications, FPL, IEEE, 2010, pp. 364–367.
[19] S. Wolfram, Random sequence generation by cellular automata, Adv. in Appl. Math. 7 (2) (1986) 123–169.
[20] Xilinx, UG474 7-Series FPGA CLB User Guide, 2014. Online: http://www.xilinx.com/support/documentation/user_guides/ug474_7series_CLB.pdf.
Lucian Petrica is a Lecturer with the Electronics and Telecommunication Faculty of the Politehnica University of Bucharest, Romania. He obtained his Ph.D. in 2012 and has authored more than 20 journal and conference papers. His research interests cover a broad area of computer-related technologies, from reconfigurable computing with FPGAs to computer architecture and parallel computing applications, with a focus on energy efficiency.