RTL implementations and FPGA benchmarking of selected CAESAR Round Two authenticated ciphers

RTL implementations and FPGA benchmarking of selected CAESAR Round Two authenticated ciphers

Microprocessors and Microsystems 52 (2017) 202–218 Contents lists available at ScienceDirect Microprocessors and Microsystems journal homepage: www...

3MB Sizes 37 Downloads 84 Views

Microprocessors and Microsystems 52 (2017) 202–218

Contents lists available at ScienceDirect

Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

RTL implementations and FPGA benchmarking of selected CAESAR Round Two authenticated ciphers William Diehl∗, Kris Gaj George Mason University, Fairfax, VA 22033, USA

a r t i c l e

i n f o

Article history: Received 10 January 2017 Revised 27 March 2017 Accepted 4 June 2017 Available online 6 June 2017 Keywords: Authentication Cipher Cryptography Encryption Field programmable gate array

a b s t r a c t Authenticated ciphers are cryptographic transformations which combine the functionality of confidentiality, integrity, and authentication. This research uses register transfer-level (RTL) design to describe selected authenticated ciphers using a hardware description language (HDL), verifies their proper operation through functional simulation, and implements them on target FPGAs. The authenticated ciphers chosen for this research are the CAESAR Round Two variants of SCREAM, POET, Minalpher, and OMD. Ciphers are discussed from an engineering standpoint, and are compared and contrasted in terms of design features. To ensure conformity and standardization in evaluation, all four candidates are implemented with an identical version of the CAESAR Hardware API for authenticated ciphers. Functionally correct implementations of all four ciphers are realized, and results are compared against each other and previous results in terms of throughput, area, and throughput-to-area (TP/A) ratio. SCREAM is found to have the highest TP/A ratio of these four ciphers in the Virtex-6 FPGA, while Minalpher has the highest TP/A ratio in the Virtex-7 FPGA. © 2017 Elsevier B.V. All rights reserved.

1. Introduction Authenticated Ciphers combine the functions of confidentiality, integrity, and authentication. Input to authenticated ciphers consists of a plaintext message M, associated data AD (which may include, for example, a header or trailer of a packet used in communication protocols), a secret key, a public message number npub, and an optional secret message number nsec. The resulting ciphertext C and optional encrypted nsec are computed as a function of the inputs. This transformation ensures the confidentiality of the transaction. In most authenticated ciphers, a Tag, which is function of all blocks of the AD and plaintext, as well as npub, nsec, and key, is produced at the conclusion of plaintext encryption. The Tag is appended to the end of the ciphertext to assure and verify the integrity and authenticity of the transaction. Decryption of the ciphertext and optional encrypted nsec is conducted in a similar fashion. The ciphertext, as well as identical inputs, for AD, key, and message numbers, are required for validation. Tag’ is then computed as above, and checked against the concatenated Tag. If Tag = Tag then authentication and integrity of the transaction are assured; otherwise the decrypted ciphertext is not released. If authenticity and integrity are verified, the outputs of the transaction

are the AD, plaintext, and optional decrypted nsec. A notional authenticated cipher is depicted in Fig. 1. In July 2015, the Committee of CAESAR, the Competition for Authenticated Encryption: Security, Applicability, and Robustness, announced the candidates selected for advancement to Round Two, including SCREAM, POET, Minalpher, and OMD [1]. Specification updates (so called “tweaks”) were permitted before the beginning of Round Two, and a hardware implementation was additionally required before the conclusion of the round. This research implements SCREAM, POET, Minalpher, and OMD using Round Two specifications and compares them in terms of throughput, area, and throughput-to-area (TP/A) ratio on two hardware platforms – the Virtex-6 and Virtex-7 FPGAs. Three of the ciphers – SCREAM, POET and Minalpher – leverage Liskov, Rivest and Wagner’s Tweakable Block Cipher (TBC) as a common building block, providing a strong basis for comparison among the implementations of these ciphers [2]. The fourth cipher, OMD, is different in that it leverages a keyed compression mode of authenticated encryption using a hash function. 2. Previous work 2.1. Hardware evaluations in earlier competitions



Corresponding author. E-mail addresses: [email protected] (W. Diehl), [email protected] (K. Gaj).

http://dx.doi.org/10.1016/j.micpro.2017.06.003 0141-9331/© 2017 Elsevier B.V. All rights reserved.

Hardware efficiency has been an evaluation criteria in several cryptographic competitions in the last two decades, including

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

203

Fig. 1. Notional authenticated encryption.

Advanced Encryption Standard (AES), eSTREAM, and the new secure hash function (SHA-3). Hardware implementations are often examined upon down-selection to later rounds, as the amount of work to evaluate all early round candidates for hardware efficiency can be daunting. Yet hardware evaluations are valuable, as algorithms that perform well in software sometimes perform poorly, or require excessive area or complexity, in hardware. Research in which third parties design and implement a significant set of ciphers from published specifications and compare them using a common hardware interface tends to emerge during the latter rounds of cryptographic competitions, as the number of candidates dwindles. For example, several large-scale hardware comparisons of SHA-3 candidates were presented at the Second SHA-3 Candidate Conference in 2010 [3]. Examples of comparisons of all 14 second round candidates were conducted for FPGA [4–10] and for ASIC [11–13], and benchmarking comparisons of the five SHA-3 finalists were conducted in [14–19]. 2.2. Hardware evaluations in CAESAR Round One In CAESAR Round One, which began in 2013 and concluded in 2015, there was no requirement to submit any hardware implementation of candidates. Therefore, only a small subset of candidates were implemented in FPGA or ASIC – mostly by the submission teams themselves. In these cases, the submitters evaluated their own candidate against the existing standard, AES-GCM. For example, the SCREAM and Minalpher Round One submissions contain evaluations of this sort [20,21]. Several single-candidate evaluations of hardware implementations by CAESAR submitters were presented at Directions in Authenticated Ciphers (DIAC) workshops in 2014 and 2015. Additionally, there have been several third-party implementations of single CAESAR authenticated ciphers in hardware, such as POET, as described in [22]. More recently, a comparison of four CAESAR Round One candidates (ICEPOLE, Prøst, Tiaoxin-346, and Silver) in FPGAs and ASICs was conducted in [23]. Large-scale (i.e., 10 or more ciphers) comparisons of CAESAR Round One candidates were also conducted as part of the foundational research of [24,25]. 2.3. Hardware evaluations in CAESAR Round Two In CAESAR Round Two, which took place between 2015 and 2016, a hardware submission was required for all candidates. This requirement precipitated the biggest and earliest hardware benchmarking effort in the history of cryptographic competitions. For Round Two, a hardware Applications Programming Interface (API), the CAESAR Hardware Applications Programming Interface (API) for authenticated ciphers [26–28] was introduced. This API provides flexible and comprehensive input-output (I/O) functionality, services common to multiple authenticated ciphers, and a fair and common basis for evaluation of the ciphers against one another in terms of area and performance.

Fourteen design teams completed a total of 43 hardware design packages (most of which were compliant with the CAESAR Hardware API), which included 28 out of the 29 Round Two candidate families. A comprehensive benchmarking of high-speed FPGA implementations of CAESAR candidates, including a comparison to a baseline implementation of AES-GCM, was completed and documented in [29,30]. This research implements Round Two specifications of SCREAM, POET, Minalpher, and OMD. Previous Round Two FPGA implementations of SCREAM Version 3, POET Version 2.01, and Minalpher Version 1.1 are available at [31,32], and [33,34], respectively. In general, the main contribution of this research beyond what is accomplished in previous implementations is that ciphers are embedded in a universal hardware interface defined by the CAESAR Hardware API. The implementations in [31–34] will be further compared and contrasted to our implementations in subsequent sections, after necessary background has been made available to the reader. In contrast to the other three ciphers, our implementation is the only-known FPGA implementation of OMD. As such, there is no basis for comparison with other implementations. 3. Hardware API for authenticated ciphers Authenticated cipher implementations in this research use the CAESAR Hardware API for Authenticated Ciphers [26–28]. Based on the George Mason University (GMU) Hardware Applications Programming Interface (API) for Authenticated Ciphers, the CAESAR Hardware API was adopted by the CAESAR committee as the interface standard for hardware implementations of CAESAR authenticated cipher candidates in May 2016. This API provides a common, flexible, and practical I/O standard and is especially applicable for the fair evaluation of large numbers of hardware design candidates, as the employment of custom interface solutions causes wide variations in throughput and area calculations, and makes it more difficult for evaluators to judge efficiencies based on a standard baseline. The Development Package for the CAESAR Hardware API provides input and output processors (called PreProcessor and PostProcessor), which receive data, parse data into applicable segments to be used by ciphers, and format output [28]. Serialinput/parallel-output (SIPO) and parallel-input/serial-output (PISO) allow for fewer I/O connections (e.g. 32 bits) for public and secret data interfaces, which expands the available FPGA platforms on which designs can be implemented. For example, all four designs in this research use a 32-bit bus width for public data interface (PDI), secret data interface (SDI), and data output (DO). The total I/O pin requirement during our implementations is 107 – well within the pin limits of low-end FPGAs. The Development Package also provides services common to many authenticated ciphers, which would otherwise be time consuming to implement, and if realized differently, could lead to a biased comparison. One example is 10∗ padding, where in each non-

204

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

Fig. 2. External interface in CAESAR Hardware API for authenticated ciphers (denoted by the dashed line in left of figure), and interface between CipherCore and I/O Processors (right of figure). Further details in [26–28].

full n-bit block (i.e., empty or partially full blocks consisting of less than n bits of input), a ‘1’ is placed in the left-most non-occupied position, and the remainder of the block is filled with ‘0’. All four ciphers in this research use 10∗ padding for associated data. In the case of Minalpher, 10∗ padding is used for plaintext. In the cases of POET (associated data) and Minalpher (plaintext), final blocks must terminate with 10∗ , which sometimes results in the creation of an extra block. The padding and ciphertext expansion modes of the CAESAR Hardware API efficiently create extra blocks when necessary which reduces both datapath and controller complexity. A second example is truncation of plaintext or ciphertext output during final block processing. SCREAM, POET, and OMD require variable truncation operations. Truncation can be costly to realize in hardware when the truncation length is not known at synthesis time. However, it is efficiently computed using the bdi_valid_bytes input from the CAESAR Hardware API at minimal additional cost in CipherCore. The interface to the authenticated cipher core with associated data, AEAD, and the proposed division of a high-speed AEAD core into basic building blocks is shown in Fig. 2. The Development Package is “universal,” in that it allows for specific functionality required in most CAESAR Round Two candidates. Designers of high-speed implementations can insert their own CipherCore Datapath and CipherCore Controller into CipherCore in order to use the corresponding interface. Note that this requires a robust controller design, which can implement all the features required in authenticated encryption and decryption, and interpret and interact with interface control signals, including those driving external I/O. Specifications for signals and additional features of this hardware API are available in [28].

4. Design methodology All four implementations in this study are developed using the CAESAR Hardware API compliant code development process, depicted in Fig. 3. This process uses the Development Package (available at [35]) and is performed using the following steps: 1. The developer begins by using the published specification and the reference C code, conducts register transfer level (RTL) design, and encapsulates the design in the CipherCore module (shown in Fig. 2) or as required using the Development Package.

2. The developer uses the reference C code and the automatic test vector generator “aeadtvgen” to produce test vectors which conform to the established CAESAR protocol. 3. The developer’s design is instantiated inside the test bench (from the Development Package “AEAD_TB”) which is used to conduct functional verification of the design. 4. Xilinx synthesis and implementation tools (including Xilinx 14.7 ISE and Vivado Design Suite) are used to synthesize and implement results. In this research, the target platforms are the Xilinx Virtex-6 (xc6vlx240tff1156-3) and Virtex-7 (xc7vx485tffg17613) FPGAs. Results are implemented using synthesis options that prohibit Block RAM (BRAM) or Digital Signal Processor (DSP) unit generation, in order to ensure uniform results for fair evaluation. 5. The Automated Tool for Hardware Evaluation (ATHENa) is used to optimize TP/A ratio for Virtex-6 implementations, and Minerva is used to optimize TP/A ratio for Virtex-7 implementations [36–38]. It should be emphasized that a “design compliant with the CAESAR Hardware API for authenticated ciphers” refers to an author’s implementation that complies with the signaling and protocol requirements of the CAESAR Hardware API in references [26,.27]. The use of the Development Package at [35] and the CAESAR Hardware API Implementer’s Guide [28] are not required but are provided to facilitate compliant implementations. 5. Designs of selected authenticated ciphers The authenticated ciphers which are the subject of this paper, the CAESAR Round Two versions of SCREAM, POET, Minalpher, and OMD, are formally defined in [39–42], respectively. The design methodology for all four authenticated ciphers is register transferlevel (RTL) design. In each case, the top-level of the cipher core (shown in Fig. 2) consists of two modules: Datapath and Controller. Only the datapaths for respective designs are illustrated in this paper. The controllers are not depicted, but can be derived from a knowledge of the required sequence of operations of the authenticated cipher and the CAESAR Hardware API. However, it is emphasized that the development of controllers for authenticated ciphers is not trivial and occupies a significant portion of the design space. The design process is influenced by three factors, the formal specification of the underlying cryptographic primitive, the optimization target (e.g., lightweight, high-speed, etc.), and the

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

205

Fig. 3. Compliant Code Development Process using the CAESAR Hardware API for Authenticated Ciphers [30].

Table 1 Terminology and notation common to ciphers in this research. Symbol

Definition

Mi Ci Ai |X i | n a Trunc

Block i of plaintext Block i of ciphertext Block i of associated data Length of corresponding block in bits Number of message blocks Number of associated data blocks An operation clearing a specific number of rightmost bytes in a block of data Comparator Concatenation Exclusive OR Multiplication by the constant 2 in Galois Field GF(2n ), as defined in the specification Right shift by variable number of bits Left shift by variable number of bits

= | 

×2  

CAESAR Hardware API for authenticated ciphers. The underlying cryptographic primitive tends to define the number, size, and sequence of required input and outputs. The optimization target defines the architecture (e.g., basic-iterative, unrolled, folded, parallel, etc.), which in turn defines the number of clock cycles required to process a block, and the critical path of the design. The CAESAR Hardware API defines the availability of message, associated data, and key; defines input-output (I/O) sequencing, and includes features which are necessary to conduct certain operations in authenticated ciphers. The design and operating considerations for the four authenticated ciphers are described below. Table 1 summarizes some of the terminology and notation which is common to all ciphers in this research. Additional terminology and notation which are specific to individual ciphers are described in their respective sections. 5.1. SCREAM SCREAM (Side-Channel Resistant Authenticated Encryption with Masking) is an authenticated cipher built around Tweakable Authenticated Encryption (TAE) [39]. The SCREAM top-level datapath is shown in the simplified block diagram for SCREAM in Fig. 4. SCREAM processes 128-bit blocks of associated data, 128-bit input blocks of plaintext (authenticated encryption) or ciphertext (authenticated decryption), and produces an identical number of 128-bit output blocks of ciphertext (encryption) or plaintext (de-

Fig. 4. SCREAM block diagram. All boldface type signals are described in the Implementer’s Guide to the CAESAR Hardware API [28]. Bus width of thick wires is 128 bits unless indicated. Bus width of thin wires is one bit.

cryption), as well as a 128-bit tag. In SCREAM (as in all implementations of authenticated ciphers using the Development Package for the CAESAR Hardware API [35,28]), the blocks of message and associated data arrive through the bdi port, and blocks of ciphertext and tag exit through the bdo port (shown in Fig. 4). Incomplete blocks of associated data are padded with a ‘1’ in the left-most vacant bit, followed by ‘0’ in the remaining bits (this padding scheme is referred to as “10∗ ” for the remainder of this article). The padding scheme is implemented by the module PreProcessor, as shown in Fig. 2 and according to the protocol described in [28]. Authenticated encryption of all message blocks Mi and tag generation are conducted according to Algorithm 1: In Algorithm 1, Trunc() indicates that final ciphertext block is truncated to match the length of the final plaintext block, i.e., the

206

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

Fig. 5. SCREAM Tweakable Block Cipher (TBC) showing encryption and decryption for a basic-iterative architecture (i.e., one round per clock cycle) corresponding to module EK /EK -1 in Fig. 4.

Algorithm 1 Pseudocode of the SCREAM authenticated encryption, assuming that both AD and message contain incomplete blocks. 1. Auth ← 2. 3. 4. 5. 6.

a −2 i=0

EK (T i , Ai )

Auth ← Auth  EK (T ∗ , Aa−1 ) Ci ← EK (Ti , Mi ) for i = 0..n − 2 (all full blocks) Cn−1 ← T runc[EK (T∗ , |Mn−1 | )]  Mn−1 (partial block)  −1) Sum ← ((ni=0 ) Mi Tag ← EK (T , Sum) Auth

least significant 128 - |Mn−1 | bits are set to zero using the interface signal bdi_valid_bytes. Additionally, |Mn−1 | is derived from the interface signal bdi_size. T’i is the tweak value generated by the Tweak module for blocks of AD, except for the last block whose tweak is T’∗ . Ti is the tweak value for blocks of plaintext or ciphertext. T∗ and T are special tweak values generated for a final message block and tag, respectively. All other operations of the pseudocode in Algorithm 1 can be tracked back to the corresponding portions of the block diagram. The registers Auth and Sum are assumed to be initialized to 0 at the start of computations, using local synchronous resets provided by the control unit. The block encryption of TAE occurs in the Tweakable Block Cipher (TBC), which is denoted by EK /EK −1 in Fig. 4, and is detailed in Fig. 5. Within EK /EK −1 , a block encryption or decryption consists of 10 steps (i.e., NS = 10) of two rounds each (i.e., NR = 2 ). In Tweakable Authenticated Encryption (TAE), a unique tweakey is produced for each step of a block encryption. The tweakey is a function of the secret key key, and an initial tweak value T, which is a function of a public message number npub (which is registered after arriving in the bdi port), a counter (such as a block number), and mode (such as encryption or tag generation). The value of T, which could be T’i , T’∗ , Ti , T∗ , or T depending on the stage of authenticated encryption or decryption, is generated by the Tweak module and is supplied to EK /EK −1 . Each block encryption begins with the computation of a tweakey TK(0), which occurs in the EK /EK −1 module. Within each round, the 128-bit status word goes through a substitution (S-Box) and a permutation (L-Box). In between, a round constant RC(ρ , σ ) is XORed to the status word. The ρ parameter is the round number, where ρ ∈ {0, 1}, and the σ parameter is the step number, where σ ∈ {0, 1, . . . , NS − 1}. The round portion of the block cipher is circumscribed by the dashed line in Fig. 5. After

Algorithm 2 Pseudocode of the SCREAM authenticated decryption, assuming that both AD and ciphertext contain incomplete blocks. 1. Auth ←

a −2 i=0

EK (T i , Ai )

2. Auth ← Auth  EK (T ∗ , Aa−1 ) 3. Mi ← DK (Ti , Ci ) for all full blocks 4. Mn−1 ← T runc[EK (T ∗ , |Cn−1 | )]  Cn−1 (partial block) n −1 Mi 5. Sum ← i=0

6. Tag ← EK (T , Sum)Auth

the second round, the next tweakey TK(σ ) is XORed to the state variable prior to commencing the next step. The tweaks may either be pre-computed or computed “on-the-fly.” We follow the reference implementation by computing tweaks “on-the-fly.” Tweaks computed on the fly are simpler to implement, but incur a slight performance penalty. There are two substantive updates to SCREAM in Round Two. The first is an update to the S-Box. The S-Box retains the “bitslice” construction used in Versions 1 and 2, but uses a slightly expanded set of new equations, in order to improve differential properties and algebraic degree [39]. Additionally, the Version 3 SBoxes are “nearly involutive.” An involutive S-Box would achieve the condition x ← S(S(x)), i.e., S−1 (x ) would be equivalent to S(x), and would therefore save hardware by not having to instantiate an inverse S-Box. The SCREAM Version 3 S-Boxes achieve near involution, which means that the majority of Boolean equations describing S and S−1 are identical. The difference amounts to two out of 16 Boolean equations. As a result, the hardware implementations of S and S−1 in Version 3 of SCREAM can still substantially overlap. This saves area, even if these two functions are not perfectly involutive, due to security concerns. SCREAM decryption, shown in Fig. 5, is identical to encryption, except that we are forced to instantiate an inverse L-Box, and the order of computation is reversed. Authenticated decryption of all ciphertext blocks Ci and tag generation is conducted according to Algorithm 2: The second update is to the round constant. The round constant, previously XORed with the most significant byte of the status variable as RC (ρ , σ ) = 27 (2σ + ρ ) mod 28 is now XORed with the most significant half-word (16 bits) of the status variable, as RC (ρ , σ ) = 2199 (2σ + ρ ) mod 216 . In Versions 1 and 2, the multiplication by 27 was efficiently computed using shifts and additions, incurring a path delay of

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

207

Fig. 6. POET block diagram. Bus width of thick wires is 128 bits unless indicated. Bus width of thin wires is one bit. All boldface type signals are described in the Implementer’s Guide to the CAESAR Hardware API [28].

four shifts and three additions. In Version 3, multiplication by the constant 2199 incurs a path delay of six shifts and five additions. Therefore, the authors decided to instantiate precomputed look-up tables (LUT) for all possible values of 2199 (2σ + ρ ) for all values of σ up to NS = 12, the maximum number of steps considered in this implementation. Tabulated round constants are additionally shared between encryption and decryption chains to save area. Several choices of architecture are possible, including singleround basic iterative architecture (with one round per clock cycle, and thus 20 clock cycles per block), and unrolled architecture (with two rounds per clock cycle, and thus 10 clock cycles per block). We provide results for the single-round basic iterative architecture, since this design produces a better throughput-to-area ratio. Finally, although the title of the cipher SCREAM stands for “Side-Channel Resistant Authenticated Encryption with Masking,” a masking mechanism (e.g., Boolean masking) is not included in the cipher specification. Rather, the authors have designed the cipher, particularly the bitsliced nature of the SCREAM S-Box, to facilitate the addition of a masking scheme directed against side channel attack [20,39]. An example of a Boolean masking scheme for the SCREAM cipher is available at [43].

5.2. POET POET (Pipelineable On-line Encryption with Authentication Tag) is an authenticated cipher designed to be independent of a specific block cipher and hash function, i.e., the user is free to choose any block cipher or hash function which meets criteria identified in [40]. The cipher consists of a middle layer which uses the Electronic Code Book (ECB) block cipher mode sandwiched between upper and lower layers, which generate secret values that are added to the ECB layer. The upper (“top”) and lower (“bottom”)

layer should contain є − AXU (i.e., “Almost (XOR) Universal) hash functions keyed by pairwise independent subkeys. Fig. 6 shows the POET simplified block diagram. The POET authors introduced significant changes in their Round Two tweaks, beginning with POET Version 2.0. The most significant changes include reducing the number of required subkeys from five to three and elimination of field multiplications by factors of 3 or 5 for the final header block. The authors also provide optional support for intermediate tags, which is a potentially useful feature for high-speed networks, but is not required by the CAESAR competition. Specifically, this research implements POET using the authors’ recommended configuration of AES-128 for main encryption, AES4 for the є − AXU hash function, and no intermediate tags, i.e. ls = 0, lt = 0. The AES-128 encryption/decryption and AES4 encryption cores are depicted as EK /EK −1 and FK , respectively, in Fig. 6. The source code for the AES-128 implementation used in this research is available at [44]. As previously mentioned, in their Round Two specification update, the POET authors eliminated the separate subkeys for AESTop and AES-Bottom (LTF OP and LBOT , respectively), and introduced F a common subkey KF . This opened the possibility of using a total of two AES cores, one 10-round AES main (consisting of encryption and decryption) and one “AES4” core, consisting of encryption only. Since AES cores consume significant resources in the design, elimination of one core can produce significant area savings. For example, a single AES4 core requires 1439 LUTs on the Virtex-6. However, there is increased cost in switching multiplexers in the datapath, and additional states and control signals required in the controller, since multiple functions must now time-share the AES hash core. POET uses 128-bit blocks of message text, associated data and secret key, and produces 128-bit blocks of ciphertext and a 128-bit tag. Each time the user updates the secret key (provided by the port key in CAESAR interface), and prior to commencing operation, POET produces three subkeys, L, K, and KF , by encrypting the key constants const0 − const2 using the module EK /EK −1 in Fig. 6,

208

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

Fig. 7. POET authenticated encryption, decryption, and a part of the tag generation sequence. Encryption of non-final blocks Mi is shown at top-left. Encryption of final block Mn∗ is shown at top-center. Decryption of non-final blocks Ci is shown at bottom-left. Encryption of final blocks Cn∗ is shown at bottom-center. A part of tag generation is shown at right. All bus widths are 128 bits.

Algorithm 3 Computation of keyed-hash τ of associated data in POET. 1.  ←   EK (Ai 2. τ ← EK ( )

 2(i−1) L ) f or i

= 1..a

and stores these subkeys in the respective registers with the same names. The first step in POET authenticated encryption is to compute a running hash on 128-bit blocks of associated data (A). Blocks of A are first masked by multiples of subkey L. All multiplications are Galois Field multiplications by a constant of 2 and are defined on GF (2128 ) using the Galois Counter Mode (GCM) polynomial: x128 + x7 + x2 + x + 1. The multiplication unit is depicted as × 2 in Fig. 6. The A block is then encrypted using EK and summed to a running hash value, which is shown as  in Fig. 6. The final block of associated data is padded using 10∗ padding, which is performed on the input bdi by the PreProcessor module. The hash itself is then encrypted using EK to form the authentication value τ , which is registered for later use. The computation of τ is summarized below in Algorithm 3: The next step is encryption or decryption, which is performed on 128-bit blocks of message. The K subkey is used for all AESmain operations. AES-top and AES-bottom registers X and Y are initialized with τ and τ  1, respectively. The AES-top and AESbottom functions timeshare the single AES4 module, denoted in Fig. 6 as FK . The four-round AES4 hash is keyed with a single subkey, KF . Subsequently, the output of AES-top (FK ) is added to blocks of plaintext Mi as the input to AES-main (EK ). The output of AESmain is combined with the output of AES-bottom (FK ) to produce the resulting ciphertext block Ci . The input Xi to AES-main and the output Yi from AES-main are used as the subsequent inputs for AES-top and AES-bottom, respectively. The authenticated encryption and decryption chains are shown in Fig. 7, and are described in Algorithms 4 and 5, respectively. The inputs to the computation of final blocks for both encryption and decryption must be full blocks. In encryption, the least

Algorithm 4 Authenticated encryption in POET. 1. X0 ← τ ; Y0 ← τ  1 2. for i = 1..n − 1 3. Xi ← FKF (Xi−1 )  Mi ← FKF (Xi−1 )  bdi 4. Yi ← EK (Xi ) 5. bdo ← Ci ← FKF (Yi−1 )  Yi 6. S ← EK (|M|) 7. Xn ← FKF (Xn−1 )  Mn ∗  S 8. Yn ← EK (Xn ) 9. Cn ∗ ← FKF (Yn−1 )  Yn  S

Algorithm 5 Authenticated decryption in POET. 1. X0 ← τ ; Y0 ← τ  1 2. for i = 1..n − 1 3. Yi ← FKF (Yi−1 )  Ci ← FKF (Yi−1 )  bdi 4. Xi ← EK −1 (Yi ) 5. bdo ← Mi ← FKF (Xi−1 )  Xi 6. S ← EK (|C|) 7. Yn ← FKF (Yn−1 )  Cn ∗  S 8. Xn ← EK −1 (Yn ) 9. Mn ∗ ← FKF (Xn−1 )  Xn  S

significant 128 − |Mn | bits of the plaintext are filled with the most significant 128 − |Mn | bits of τ to form Mn ∗ (shown in Fig. 8). In decryption, the least significant 128 − |Cn | bits of the plaintext are filled with the most significant 128 − |Cn | bits of T to form Cn ∗ (shown in Fig. 9). This implementation accomplishes this merge by using the variable shifter and the bitwise OR, shown in the topright corner of Fig. 6. Note that this requires run-time determination of the number of bits of τ or T to be included in Mn ∗ or Cn ∗ , respectively, which is why the shifts must be variable. There are a total of six signals which require variable shifters (three right-shifts and three left-shifts), as shown in Figs. 8 and 9. This implementation multiplexes all inputs into either the right or left shifter, as appropriate. This provides significant area savings, as these 16-byte variable shifters consume about 200 LUTs each.

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

209

Fig. 8. POET computation of last block and tag generation sequence for authenticated encryption, including the details of apportionment of final block of ciphertext and tag.

Fig. 9. POET computation of last block and tag verification sequence for authenticated decryption, including the details of apportionment of final block of plaintext and portions used in tag verification.

The encryption output is always a full 128-bit block of ciphertext. This final block must be apportioned between ciphertext Cn and tag T. The part of the final ciphertext block Cn ∗ that will be used as “tag” is equal to the amount of bits that were borrowed from τ to fill the final block of plaintext, and is denoted Tα . In other words, the output Cn ∗ of the final block computation chain is Cn |Tα , where Cn is the last block of ciphertext, and Tα is used for the tag. The final step is the computation of the remaining bits of Tag. First, Cn+1 ∗ is computed using the sequence of operations shown on the right side of Figs. 7 and 8. Then, the |Mn | most significant bits of Cn+1 ∗ , denoted Tβ , are extracted to form the least significant |Mn | bits of the tag. Therefore, the “tag” consists of Tα |Tβ . The remainder of Cn+1 ∗ is discarded. The tag generation sequence is shown in detail in Fig. 8. All computed ciphertext blocks, as well as the tag, exit the cipher core through the bdo port. The final output of plaintext and tag verification sequence in authenticated decryption is shown in Fig. 9. The full block Cn ∗

is formed by the concatenation Cn |Tα , where Tα is the 128 − |Cn | most significant bits of the expected tag T (supplied through bdi). The remaining |Cn | bits of the expected tag are shifted to the left and registered as Tβ |0. After decryption of the final block Mn ∗ , the least significant 128 − |Cn | bits of Mn ∗ are shifted to the left and registered as τ  |0. The intermediate variable τ , obtained by processing all blocks of associated data (according to Algorithm 3) is used as an input to the calculations shown in the bottom-right portion of Fig. 9. First, τ is truncated to obtain τ α , which is registered. Then, τ is processed using ENCn+1 (shown in the right side of Fig. 7) to obtain Cn ∗ . This value is truncated to the |Cn | most significant bits, leading to T |0. The tag verification is then computed as tag_valid ← (τ α = τ  ) ∧ (T β = T  ) ← (τ α |0 = τ |0 ) ∧ (T β |0 = T |0). Please note that the direct comparisons, such as (τ α = τ  ) are not possible as the lengths of the compared values are variable and not known in advance.

210

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

Additionally, since the Trunc circuit is a relatively complex circuit, and that a total of four truncations are required during encryption and decryption (as shown in Figs. 8 and 9), the Trunc circuit is shared among all these operations, as shown in the bottomright portion of Fig. 6. In contrast to our implementation, the implementation at [32] uses three AES cores, all of which are 10-round AES-128. The implementation in [32] has an AES S-Box using combinational logic to perform inversions based on composite field arithmetic in subfields of GF(28 ). While such an implementation is desirable for ASIC, it is less efficient than an AES using LUTs in FPGA. For example, an AES using composite field inversion is 20% larger in terms of LUTs and has a critical path 80% longer than the equivalent AES version using LUTs, according to our measurements. The most significant difference between [32] and our implementation is that the implementation in [32] provides the value τ to an output, but does not attempt to perform tag generation or verification. As seen above, the main complexity in POET is tag generation and verification, which requires variable shifts and truncations for all but the simplest cases. Given this large delta in functionality, a direct comparison between our results and [32] is difficult.

5.3. Minalpher Minalpher, in contrast to the above ciphers, inputs 256-bit blocks of associated data and message, and outputs 256-bit blocks of ciphertext and a 128-bit tag. The number of blocks of ciphertext does not always equal the number of blocks of plaintext; this peculiarity is explained subsequently. Minalpher leverages the concept of a Tweakable Block Cipher (TBC). Specifically, Minalpher defines a cryptographic core called the Tweakable Even-Mansour (TEM). The core of the TEM is the Minalpher-P permutation [41]. The Minalpher-P consists of three transformations. They include “S” SubNibbles; “T” which is ShuffleRows and SwapMatrices; and “M” which is MixColumns and XorMatrix, and an XOR with the round constant, RC. The differences between “forward” (encryption) and “backward” (decryption) operations are a reversal of the orders of the MixColumns and XorMatrix, and the use of different round constants. A TEM computation consists of 17.5 Minalpher-P rounds; the “0.5” round is a final additional S and T transformation. The combined implementation of the Minalpher-P and Minalpher-P−1 permutation is shown in Fig. 10. The input X0 and output X17.5 are 256-bit blocks. However, the input is separated into two 128-bit blocks A and B, according to the procedure listed in [41]. All internal computations are then conducted on the 128-bit A and B. For example, each SN unit located within S (SubNibbles) consists of 32 4-bit S-Boxes. Within T (ShuffleRows and SwapMatrices) there is an SR unit which computes ShuffleRows on A, and an SR−1 unit, which computes the inverse of ShuffleRows on B. “SwapMatrices” swaps the blocks A and B. Within MFWD and MBACK are MixColumns (MC), which conduct permutations on 128-bit words, and XorMatrix, where B ← A  B. At the conclusion of the 17-round permutation, the result undergoes one more S and T transformation (i.e., the half round). The A and B values are then merged into the 256-bit output X17.5 . This Minalpher implementation uses two Tweakable EvenMansour (TEM) cores to enable parallel processing of plaintext/ciphertext, and tag generation. One core TEM supports both forward and backward Minalpher-P functions. The second TEM core, TEM aux, performs only forward transformations. TEM and TEM aux are shown in Figs. 11 and 12, respectively. Like other tweakable ciphers, a unique tweak is produced for each call to TEM. Each tweak is of the form yi (y + 1 ) j Lin , where

Lin = L or L , y is a root of a composite GF polynomial, and i and j are variable exponents which increase by either 0, 1, or 2 with each new block. The tweaks are calculated on the fly by the TEM and TEM aux units, using circuits shown on the left side of the diagrams in Figs. 11 and 12, respectively. The select signals of multiplexers shown in these circuits are provided by the control unit. Multiplications in Minalpher occur on GF(2256 ), which is represented using a tower of field extensions as follows: GF (28 ) = GF (2 )[x]/ f (x ), where f (x ) = x8 + x7 + x5 + x + 1, and GF (2256 ) = GF (28 )[y]/g(y ), where g(y)=y32 + y3 + y2 + x. The simplified block diagram of Minalpher is shown in Fig. 13. First, the associated data (AD) initial tweak L’ is computed using the TEM aux module. Next, each block of AD is processed through the M port in the TEM aux module. The result, available at the C port, is added to a running hash (denoted by T in Fig. 13), which is registered at the conclusion of AD processing and later used to compute the tag. Prior to the start of message processing, a new tweak (valid for both plaintext and tag processing) is computed by the TEM aux module, made available at the Lout port, and registered for subsequent use. During authenticated encryption, TEM processes a plaintext block Mi into a corresponding ciphertext block Ci . The result Ci is then sent through TEM aux, and the result is added to the running hash computed during AD processing. While the second TEM core TEM aux is computing the tag hash, the first TEM core TEM is freed to compute the next ciphertext block Ci+1 . After the computation of the final plaintext block Mn , the running hash (registered in T) is then added to the final plaintext block and is used as the input to the final TEM calculation. The most significant 128 bits of the 256-bit TEM output are used as the tag (the least significant 128 bits are discarded). The authenticated decryption flow is slightly simpler, in that each block of ciphertext Ci can be simultaneously entered into the two TEM cores. Therefore, the tag is produced in parallel with the decryption of the final block of ciphertext. Minalpher uses 10∗ padding for both associated data and for plaintext processing. However, a Minalpher plaintext block must always have 10∗ padding, even if it is a full block. Therefore, if the last plaintext block is a full block, and there are n plaintext blocks, there will be n + 1 ciphertext blocks. In Minalpher, all ciphertext blocks are full blocks (i.e., 32 bytes long). Therefore, the decryption processor does not know the length of the final plaintext block Mn until after decryption is complete. One way to ensure the proper length of Mn is to use a parallel “truncator,” which checks all of the 32 possible locations of the original 10∗ padding, and eliminates it from the output. This method executes in a single clock cycle, but is area-intense, and could increase the critical path if not pipelined. A second method is to cycle through all possible combinations of Mn until the 10∗ padding is located, one byte at a time, using a 256-bit shift register. This method has a low-area requirement and minimal effect on the critical path, but requires up to 32 clock cycles to process the final block of plaintext Mn . We have chosen the latter method for this implementation, which occurs in LengthCounter in Fig. 13. The length of Ci is forwarded to the PostProcessor via the port bdo_size. In contrast to our design, the Minalpher implementation described in [33] uses both straightforward and four-state pipelined implementations of the Minalpher-P transformation. While both the straightforward and pipelined implementations achieve very high throughput (i.e., 2 – 3 times that of AES-GCM), it is not clear to what extent the full authenticated encryption is implemented, as there is no discussion of top-level block diagram, interface, or test cases. The Minalpher design in [34] also significantly differs in architecture, in that it uses only one TEM core whereas our design uses two. The design in [34] therefore has a lower throughput, but

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

211

Fig. 10. Block diagram of the combined Minalpher-P and Minalpher P−1 permutations. Forward branches correspond to Minalpher-P, and backward branches correspond to Minalpher-P−1 .

Fig. 11. Tweakable Even-Mansour (TEM) Block, with Minalpher P/P−1 , and limited tweak generation.

saves in terms of resources and control complexity. The results will be subsequently discussed. 5.4. OMD Offset Merkle-Damgård (OMD) is a keyed compression function operation mode for nonce-based authenticated encryption, and is described in [42]. In contrast to the three ciphers described above, this authenticated cipher uses a hash function, the SHA-2 (SHA256), as its underlying cryptographic primitive. The authors’ ratio-

Fig. 12. Auxiliary Tweakable Even-Mansour (TEM aux) Block, with Minalpher P only, and full tweak generation and initial tweak key generation using K, flag,  and N’ = npub or 0104 , where γi = yi L; ya  = ya−1 (y + 1 )L ; ψi = y2i L; and ψm = y2m−1 (y + 1 )L.

nale is to leverage the decades of research of proven security of this compression function, and to take advantage of an instruction set extension in Intel microprocessors that includes SHA-2 instructions.

212

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

Fig. 13. Minalpher block diagram. All signals in boldface type are described in the Implementer’s Guide to the CAESAR Hardware API [28]. Bus width of thick wires is 256 bits, except where indicated. Bus width of thin wires is one bit.

Algorithm 6 Computation of OMD masking for message blocks. 1. L∗ = FK (0256 , < τ >256 ) 2. L[0] = 4L∗ 3. L[i] = 2L[i − 1] for i = 1.. log2 max(i )

4. i = i−1  L[ntz(i )] Note: i is the mask for block i; ntz(i) is the number of trailing zeroes of the binary representation of i; max(i) is the maximum block number = [2G__MAX __LEN − 1/25 ] = 2G__MAX __LEN −5 , where G_MAX_LEN = 32 for single-pass authenticated ciphers [27], 2G__MAX __LEN − 1 is the maximum message size in bytes, and 25 is the size of the OMD message block in bytes.

We implement the authors’ primary recommendation which is OMDsha256 (i.e., based on SHA-256; OMDsha512 is the secondary recommendation). The key size is 80 ≤ |k| ≤ 256; we implement a 128-bit key. The nonce and tag size are also variable; we use a 96-bit nonce and 128-bit tag. The L array is initialized through a series of multiplications in GF(2256 ) by the constant 2 and using the polynomial x256 + x10 + x5 + x2 + 1. The SHA-256 is shown as FK , and the multiplier is denoted × 2 in simplified block diagram of OMD shown in Fig. 14. The initialization is required for first-use of any secret key. The formal procedure is documented in Algorithm 6, and the challenges of precomputation of L values are discussed below. After the initialization, the running hash function of 512-bit blocks of associated data is computed using the output of the SHA2 compression function FK , and is registered in taga until needed for tag computation. Next, 256-blocks of message Mi are processed by FK into ciphertext Ci and output through the bdo port. During the encryption of each message block Mi , the state input Hi is calculated using the updated mask value i , the previous hash state Hi−1 , and Mi as Hi ← FK (Hi−1  i , Mi ) which produces a feedback effect. Finally, the tag is generated by encrypting the last block with the updated hash state to produce tage , and combining the result with the registered value taga . The most significant 128 bits of taga tage are retained as the tag; the least significant 128 bits are discarded.

Table 2 Number of trailing zeroes for a four-bit binary representation of indices 1 to 8. i10

1

2

3

4

5

6

7

8

i2 ntz(i)

0 0 01 0

0010 1

0011 0

0100 2

0101 0

0110 1

0111 0

10 0 0 3

There are several significant complexities associated with this hardware design. The first is that the blocks of associated data are 512-bit blocks, whereas the blocks of message (i.e., plaintext or ciphertext) are 256-bit. Since the current version of the CAESAR Hardware API for Authenticated Ciphers was intended for a maximum block width of 256 bits, a control sequence is required to ingest a variable combination of 256-bits (including applying padding to the final 256-bit block) prior to computing a block of associated data. The second significant complexity is the use of the “number of trailing zeroes” (ntz) function. The ntz(i) function is used as an index to calculate the appropriate masking value. Since L(ntz) varies non-linearly for sequentially increasing i (as shown in Table 2), we are precomputing all possible values of L[i] during the initialization. For default values of parameters, this precomputation requires an additional memory of the size of 32 × 256 bits, and an initialization latency of 29 clock cycles, but maximizes the throughput for long messages. The CAESAR committee specified that single-pass algorithms must be able to process 232 –1 bytes, or 227 blocks in the case of OMD. Two-pass algorithms (such as AEZ) are only required to process 211 –1 bytes, or 26 blocks in the case of OMD. OMD contains generics to allow the user to select single-pass or double-pass limits for an OMD instantiation. Whereas for the majority of CAESAR candidates there is very little impact on throughput-to-area (TP/A) ratio, there is a significant difference in Distributed RAM (DRAM) generation in OMD, depending on the user’s choice. For example, in the case of the double-pass limit, 26 blocks must be potentially processed. Since there are seven bit positions in this number, the maximum number of trailing zeroes can be six

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

213

Fig. 14. OMD block diagram. All signals in boldface type are described in the Implementer’s Guide to the CAESAR Hardware API [28]. Bus width of thick wires is 256 bits, except where indicated. Bus width of thin wires is one bit.

(i.e., a one followed by six zeroes). Given max[ntz(i)] = 6, 2[log2 6] = 8, and eight memory locations are required. For the double-pass limit, 2[log2 27] = 32 memory locations are required. Since each memory location holds a 256-bit (i.e., 32 byte) value, there is a significant delta in area requirements between the single-pass and double-pass instantiations. The third complexity is with the SHA-2 function. SHA-2 relies extensively on modulo 232 additions. In the worst case, a state register is updated by six additions, which can be shaped by the synthesis tools (in the best case) to three levels of additions using carry propagate adders. Such an arrangement has a ruinous effect on the critical path. The SHA-2 implementation in this research is a custom design which leverages [3, 45] to pipeline two stages of addition, and uses carry-save adders, which individually reduce the cost of two levels of addition to one level of addition and some Boolean logic delay. 6. Results 6.1. Direct comparison of the four ciphers The results for the four RTL implementations are described below. These RTL implementations meet all the minimum compliance criteria for hardware implementations of CAESAR candidates [26,27]. Namely, they are required to check for “boundary conditions,” such as null data (i.e., empty associated data or message), support multiple key activations, continuously process messages (i.e., complete one message and immediately proceed to the next), process full and partial final blocks, support message lengths up to 232 −1 bytes, and to support conditions which “stall” the processor, such as input or output “not ready” conditions [26,27]. Adherence to minimum compliance criteria means that Algorithmic State Ma-

chine (ASM) controllers can be significantly larger and more complex than controllers supporting simple encryption or decryption. The VHDL source codes for these implementations are available at [44]. Implementation statistics in the Virtex-6 and Virtex-7 FPGAs are shown together in Table 4. These results are compared using throughput-to-area (TP/A) ratio as the primary metric. Throughput is defined in terms of 106 bits/second (Mbps), and area is defined in terms of LUTs (although area can also be defined in terms of slices). The formulas used to calculate throughput for long messages are shown in Table 3. By this metric, SCREAM has the highest TP/A ratio on the Virtex-6 at 0.506, followed by Minalpher at 0.478, POET at 0.384, and OMD at 0.264. On the Virtex-7, Minalpher has the highest TP/A at 0.427, followed by SCREAM and POET at 0.423, and OMD at 0.228. SCREAM is the smallest in terms of area, with 26%, 34%, and 58% of the Virtex-6 LUTs of POET, Minalpher, and OMD, respectively. Conversely, POET and Minalpher both have nearly three times the throughput of SCREAM and OMD, and are nearly on par with each other. The results used in this research are recorded in the “Database of FPGA Results for Authenticated Ciphers” at the link in [46]. 6.2. Comparison with Round Two FPGA implementations of the same ciphers Direct comparison with previous results is generally difficult, since other implementations either have less functionality, have significantly different interface standards and assumptions, use different architectural strategies, or lack published source codes. In the case of SCREAM, we were able to synthesize and implement the source codes available at [31] on Virtex-6 hardware as

214

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218 Table 3 Formulas for execution time and throughput for long messages.

Key Setup Encryption Decryption Throughput

SCREAM

POET

Minalpher

OMD

0 21a + 21n + 21 21a + 21c + 21 (128 / 21) ∗ fCLK

50 11a + 10n + 59 11a + 10c + 59 (128 / 10) ∗ fCLK

38 19a + 19n + 19 19a + 19c + |Mn | (256 / 19) ∗ fCLK

92 66 2a + 66n + 196 66 2a + 66n + 196 (256 / 66) ∗ fCLK

Explanation of Table 3 calculations: a is number of AD blocks; n is number of plaintext blocks, and c is the number ciphertext blocks. Key setup time includes AES round key initialization and subkey generation. The constant in encryption and decryption fields includes tag generation and other factors which are non-recurring. |Mn | refers to the number of bytes of the final decrypted plaintext block. Throughput is calculated for “long messages,” for which the dominant component is assumed to be number of plaintext or ciphertext blocks. Table 4 Results of implementation in Virtex-6 (V-6) and Virtex-7 (V-7). SCREAM FPGA Platform LUTs fCLK (MHz) Throughput(Mbps) Throughput/Area

V-6 2052 170 1039 0.506

POET V-7 2594 180 1097 0.423

V-6 7695 231 2957 0.384

part of the benchmarking effort of [29]. The results are recorded in the database at [46]. This implementation used 1762 LUTs, with a maximum clock frequency of 196 MHz. This is slightly smaller and faster than our own implementation of SCREAM using an analogous architecture (i.e., basic iterative architecture with one round per clock cycle) as shown in Table 4. Analysis of the source codes at [31] show that the constructions are remarkably similar; the differences are explained by our addition of the CAESAR Hardware API for Authenticated Ciphers, the registers required to store intermediate results, and the more complex controller in our implementation. We were also able to synthesize the POET implementation at [32]. This implementation requires 6252 LUTs in the Virtex-6, and has a maximum clock frequency of 159 MHz. However, 12 cycles are required between computations of blocks, vice the 10 cycles in our implementation, which leads to a throughput of 1696 Mbps. The area of 6252 LUTs is less than our implementation, which requires 7695 LUTs. However, our implementation is significantly more functional, in that it includes the CAESAR Hardware API for authenticated ciphers, and performs full tag generation and verification, which require several hardware-intense variable shifts. The results of the above POET implementation are recorded at [46]. In contrast to our implementations, several previous CAESAR implementations use their own custom-designed interfaces, with no standardization and with varying degrees of complexity. The top-level POET interface used in [32], and SCREAM interface in [31], are shown in Fig. 15a and b, respectively. For example, the interface for POET used in [32] provides significantly less functionality than our implementation. It provides only one output port, meaning that the details of tag generation and verification are left to a higher level of protocol. Likewise, it does not provide the core information on the length of the final block. This is a critical parameter in POET, which determines the required variable shifts in tag generation and verification. The SCREAM interface described in [31] is more complex, uses I/O protocol designed to mirror AXI4-stream protocol, includes a 5-bit “length” field, which enables an embedded controller to process partial blocks, and is arguably adequate for the full range of SCREAM functionality. However, the interface does not include a mechanism for tag verification in authenticated decryption. We are not attempting to show the adequacy or inadequacy of individual cipher interfaces, but rather to show that fair compari-

V-7 7466 247 3162 0.423

Minalpher

OMD

V-6 5953 211 2843 0.478

V-6 3562 242 939 0.264

V-7 7381 234 3153 0.427

V-7 4701 276 1071 0.228

Fig. 15. a (POET) and 15b (SCREAM) top-level interfaces, based on analysis of source codes in [32] and [31], respectively.

son of authenticated ciphers with interfaces that allow varying degrees of functionality is not practical. Our implementation of all candidates in a standardized, robust, and flexible interface levels the playing field and is essential for wide-ranging hardware evaluations in competitions such as CAESAR. The Minalpher implementation at [34] is fully compliant with the CAESAR Hardware API for Authenticated Ciphers. This implementation has a maximum clock frequency of 281 MHz and uses 2879 LUTs on the Virtex-6. However, it employs a different architecture, namely, it uses only one TEM core (instead of two in our design). Since encryption and decryption require two passes in Minalpher, the design at [34] is required to use the single TEM core twice, and requires 39 clock cycles to process a 256-bit block. Therefore, the throughput of [34] is 1845 Mbps, and the TP/A ratio is 0.641. Thus, the implementation of [34] outperforms our design in terms of TP/A ratio by 34%. In contrast, the Minalpher implementation at [33] does not use the CAESAR Hardware API for Authenticated Ciphers, so direct comparison is not possible. In [33] the authors include comparisons of dual architectures (straightforward versus four-state pipelined) against AES-GCM on multiple FPGAs. Its stated frequencies for the Virtex-6 of 479 MHz (straightforward) and 599 MHz (pipelined) outperform our design. However, it is not clear that full authenticated encryption and decryption are performed on test vectors conforming to the CAESAR minimal compliance standards,

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

215

Fig. 16. Comparison of TP/A ratios of CAESAR round two hardware implementations relative to AES-GCM in the Virtex-6 FPGA [30].

as only the implementation of the Minalpher-P transformation is discussed. Finally, there is no reference to the top-level interface, and source codes are not available for examination. As previously noted, we are not aware of any other hardware implementation of OMD. 6.3. Comparison with Round Two FPGA implementations of all ciphers Benchmarking of all CAESAR Round Two hardware implementations was completed in July 2016. Fig. 16 shows the comparison of TP/A ratios of CAESAR candidates relative to AES-GCM in the Virtex-6 (i.e, the ratio of AES-GCM is 1.0). These rankings, excerpted from [30], include the Minalpher results from [34], since the implementation with the highest TP/A is considered in the rankings. However, the results from SCREAM, POET, and OMD are from this research. In some cases, the TP/A ratio would change if one computed throughput based on processing of associated data (and not messages). This is indicated with a ratio followed by “(A)” where applicable. The ciphers in this research all had a TP/A ratio less than AESGCM. However, an analysis of the overall rankings is beyond the scope of this research. The reader is referred to [30] for more information. 6.4. Analysis Our SCREAM implementation is very similar to the implementation in [31], with the main exception being our addition of the CAESAR HW API for Authenticated Ciphers. SCREAM is nearly a lightweight cipher in terms of area, and is simple to implement. Its only drawback is the high number of rounds required to process a single block, which ultimately limits its throughput. In a comparison of an implementation of SCREAM with an identical architecture and interface but using High-Level Synthesis (HLS), the ratio of TP/A of our RTL design to the HLS design is 1.24, indicating that our SCREAM design is plausibly efficient [47]. However, it should be possible to improve TP/A ratio through use of alternative architectures. For example, a study of two-stage pipelined architectures of CAESAR authenticated ciphers achieved a 70% improvement in TP/A ratio over a non-pipelined version of SCREAM based on an unrolled architecture [48]. POET is very complex, since it requires the use of two or three AES cores. Our design using two AES cores is more area-efficient than a design using three AES cores, but comes at the price of a more complex controller. Likewise, the presence of two variable

shifters and one set of truncation logic adds to the required resources, which pushes POET to the low-end of TP/A ratios for authenticated ciphers [30]. In the same HLS versus RTL comparison mentioned above, the RTL to HLS ratio of TP/A ratios of POET is 0.89. This suggests that our POET RTL design is suboptimal, and that we likely pay too high a price in controller complexity and numbers of added states in our attempt to reduce from three to two AES cores. We assess that it would be difficult to produce a pipelined implementation of POET, because of the XEX structure consisting of top, middle, and bottom computation layers, as outlined in [40]. Although the Minalpher implementation at [34] has a higher TP/A ratio than our design, there may be situations where higher throughput is valued over higher TP/A. The design at [34] uses significantly fewer resources (2879 versus 5953 LUTs) compared to our design. This is partly explained because [34] uses only one TEM core, whereas our design uses two. However, the use of two TEM cores in our design to conduct both passes in parallel comes with additional requirements in terms of registers, control logic, and control states. The designers of [34] were able to save resources by employing a simpler (but lengthier) computation chain. It is not clear that a pipelined Minalpher could improve the overall TP/A ratio. The two stage pipelining study referenced in [48] did not achieve an improved TP/A ratio for Minalpher. In [33], the authors achieve a higher frequency using a four-stage pipelined architecture, however, the lack of commonality of a design using the CAESAR Hardware Interface makes a direct comparison suspect. Unfortunately, OMD ends up near the bottom of TP/A ratios for all CAESAR Round Two candidates [30]. Although the authors’ rationale for employing SHA-2 seems credible for highperformance CPU implementations with instruction set extensions, we are forced to pay a heavy price in terms of throughput in FPGA, since the SHA-2 takes 64 rounds to process a 256-bit message digest (calculated in our circuit using 66 clock cycles). Likewise, the use of the ntz() function and the intermediate variables L[ntz(i)] is disadvantageous in hardware and practically forces the instantiation of additional memory resources. On a positive note, the throughput based solely on the processing of associated data (AD) in OMD would be twice the throughput based on the processing of plaintext or ciphertext blocks, since OMD processes 512-bit blocks of AD and only 256-bit blocks of message/ciphertext during the same 66 clock cycles. In all four ciphers, the influence of control signals on the critical path has been nearly eliminated by computing control inputs and registering them before they are used. In all cases, this increases the number of required registers, the number of control

216

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218

signals, the number of control states, and the complexity and size of the controller. Controllers for authenticated ciphers are inherently complex – much more so than the controllers for symmetric block ciphers or secure hash functions. Finally, it is notable that the maximum clock frequency on the Virtex-7 is higher than the Virtex-6 for all four ciphers, yet the TP/A ratio is worse on the Virtex-7 than the Virtex-6 for all ciphers except one. The explanation for this phenomenon is not certain. The Virtex-7 uses a 28 nm technology with newer features than the Virtex-6, which uses 40 nm technology. However, the number of used resources (e.g., LUTs) generally increases in the Virtex-7. Since the synthesis strategies for the Virtex-6 (Xilinx ISE 14.7) and Virtex-7 (Vivado Default Synthesis 2015) are completely different, one possibility is that Vivado chooses to employ more liberal use of LUTs for routing or duplication of resources to avoid bottlenecks. Another possibility is that Vivado is more aggressive in its duplication of resources to reduce loading and critical path, and exerts maximum effort toward meeting user-defined clock constraints prior to attempting to reduce area. Additionally, the Minerva optimization tool in [38] is a new construct and will likely require some fine-tuning in order to reach the most-optimal TP/A.

7. Conclusion These four CAESAR Round Two candidates, SCREAM, POET, Minalpher, and OMD were successfully implemented using RTL design in VHDL on the Virtex-6 and Virtex-7 FPGAs. SCREAM was found to have the highest TP/A ratio on the Virtex-6 FPGA, followed by Minalpher, POET, and OMD. In contrast, Minalpher had the highest TP/A ratio on the Virtex-7 FPGA, followed by SCREAM, POET, and OMD. All four ciphers were evaluated using an identical version of the CAESAR Hardware API for authenticated ciphers. The use of a standardized interface is critical to ensuring the fair and efficient evaluation of large numbers of hardware implementations of authenticated cipher candidates. In comparison with designs from other teams, only one design – Minalpher – was compliant with the CAESAR HW API for Authenticated Ciphers, and thus available for direct comparison. This design employed a different architecture which resulted in a better TP/A ratio than our Minalpher design. Analysis of results of identical architectures implemented with alternative methodology (i.e., HLS versus RTL) suggests that our version of SCREAM is plausibly approaching optimality, however, that our POET design could theoretically be improved. Comparison with results of pipelining studies on similar architectures suggests that TP/A ratios for some authenticated ciphers could be improved using pipelining techniques or other architectural innovations.

8. Areas for further research In all of these implementations, design choices were made involving trade-offs between logic and routing delay, critical path, the number of control signals and controller states, the number of registers, and pipelining techniques, and area. It is possible that improved performance with reduced area could be achieved by a more optimal combination of the above factors. Such techniques could be explored by the authors of the ciphers or by third parties to produce more efficient implementations. The reason for the reduction of TP/A ratios for ciphers optimized using Minerva on the Virtex-7 versus those optimized using ATHENa on the Virtex-6 is an open issue and warrants further investigation.

Acknowledgment This research is based on work supported by the National Science Foundation under Grant No. 1314540. References [1] CAESAR: Competition for Authenticated Encryption: Security, Applicability, and Robustness.” Internet: http://competitions.cr.yp.to/caesar.html, Jun. 16, 2014 [Mar. 18, 2017]. [2] M. Liskov, R. Rivest, D. Wagner, Tweakable block ciphers, J. Cryptology 24 (3) (Jul. 2011) 588–613. [3] The Second SHA-3 Candidate Conference, Papers and Presentations, Santa Barbara, CA, Aug. 23-24, 2010, Internet: http://csrc.nist.gov/groups/ST/hash/sha-3/ Round2/Aug2010/documents/Program_SHA3_Aug2010.pdf [Mar. 18, 2017]. [4] E. Homsirikamol, M. Rogawski, K. Gaj, “Comparing Hardware Performance of fourteen round two SHA-3 candidates using FPGAs,” Dec. 21, 2010, Internet: https://eprint.iacr.org/2010/445.pdf [Mar. 18, 2017]. [5] K. Gaj, E. Homsirikamol, M. Rogawski, Fair and comprehensive methodology for comparing hardware performance of fourteen round two SHA-3 candidates using FPGAs, in: LNCS 6225, Cryptographic Hardware and Embedded Systems - CHES 2010, Santa Barbara, CA, USA, Aug. 2010, pp. 264–278. [6] K. Gaj, E. Homsirikamol, M. Rogawski, Comprehensive comparison of hardware performance of fourteen round 2 SHA-3 candidates with 512-bit outputs using field programmable gate arrays, The Second SHA-3 Candidate Conference, Aug. 23-24, 2010 http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/ documents/Program_SHA3_Aug2010.pdf [Mar. 18, 2017]. [7] M. Knezevic, K. Kobayashi, J. Ikegami, S. Matsuo, A. Satoh, U. Kocabas, F. Junfeng, T. Katashita, T. Sugawara, L. Sakiyama, I. Verbauwhede, K. Ohta, N. Homma, T. Aoki, Fair and consistent hardware evaluation of fourteen round two SHA-3 candidates, Very Large Scale Integration (VLSI) Syst., IEEE Trans. 20 (5) (Apr. 29, 2011) 827–840. [8] B. Baldwin, N. Hanley, M. Hamilton, L. Lu, A. Byrne, M. O’Neill, W.P. Marnane, FPGA implementations of the round two SHA-3 candidates, The Second SHA3 Candidate Conference, Aug. 23-24, 2010 http://csrc.nist.gov/groups/ST/hash/ sha-3/Round2/Aug2010/documents/Program_SHA3_Aug2010.pdf. [9] B. Baldwin, A. Byrne, L. Lu, M. Hamilton, N. Hanley, M. O’Neill, W.P. Marnane, FPGA implementations of the round two SHA-3 candidates, in: 20th International Conference on Field Programmable Logic and Applications – FPL 2010, Aug. 31-Sep. 2, 2010, pp. 400–407. [10] R. Shahid, U. Sharif, M. Rogawski, K. Gaj, Use of embedded FPGA resources in implementations of 14 round 2 SHA-3 candidates, in: 2011 International Conference on Field Programmable Technology - FPT 2011, New Delhi, India , Dec. 2011, pp. 1–9. [11] S. Tillich, M. Feldhofer, M. Kirschbaum, T. Plos, J.-M. Schmidt, A. Szekely, Uniform evaluation of hardware implementations of the round-two SHA-3 candidates, The Second SHA-3 Candidate Conference, Aug. 23-24, 2010 http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/ Program_SHA3_Aug2010.pdf. [12] X. Guo, S. Huang, L. Nazhandali, P. Schaumont, Fair and comprehensive performance evaluation of 14 second round SHA-3 ASIC implementations, The Second SHA-3 Candidate Conference, Aug. 23-24, 2010 http://csrc.nist.gov/groups/ ST/hash/sha-3/Round2/Aug2010/documents/Program_SHA3_Aug2010.pdf. [13] S. Matsuo, M. Knezevic, P. Schaumont, I. Verbauwhede, A. Satoh, K. Sakiyama, K. Ota, How can we conduct "fair and consistent" hardware evaluation for SHA-3 Candidates? The Second SHA-3 Candidate Conference, Aug. 23-24, 2010 http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/ documents/Program_SHA3_Aug2010.pdf. [14] Z. Chen, X. Guo, A. Sinha, P. Schaumont, data-oriented performance analysis of SHA-3 candidates on FPGA accelerated computers, in: Design, Automation and Test in Europe – DATE 2011, Mar. 14-18, Grenoble, France, 2011, pp. 1–6. [15] X. Guo, Meeta Srivistav, S. Huang, D. Ganta, M. Henry, L. Nazhandali, P. Schaumont, Pre-silicon characterization of NIST SHA-3 final round candidates, in: 14th Euromicro Conference on Digital System Design – DSD 2011, Oulu, Finland, Aug. 31-Sep. 2, 2011, pp. 535–542. [16] E. Homsirikamol, M. Rogawski, K. Gaj, Throughput vs. area trade-offs in high-speed architectures of five round 3 SHA-3 candidates implemented using xilinx and altera FPGAs, in: LNCS 6917, Cryptographic Hardware and Embedded Systems – CHES, Nara, Japan, Sep. 28-Oct. 1, 2011, pp. 491–506. [17] M. Srivastav, X. Guo, S. Huang, D. Ganta, M.B. Henry, L. Nazhandali, P. Schaumont, Design and benchmarking of an ASIC with five SHA-3 finalist candidates, Elsevier Microprocessors and Microsystems – Embedded Hardware Design (Special Issue on "Digital System Security and Safety”), 2012. [18] X. Guo, M. Srivistav, S. Huang, D. Ganta, M.B. Henry, L. Nazhandali, P. Schaumont, ASIC implementations of five SHA-3 finalists, in: Design, Automation and Test in Europe – DATE 2012, Dresden, Germany , Mar. 12-16, 2012, pp. 1006–1011. [19] K. Gaj, E. Homsirikamol, M. Rogawski, R. Shahid, and M.U. Sharif, "Comprehensive evaluation of high-speed and medium-speed implementations of five SHA-3 finalists using xilinx and altera FPGAs," Cryptology ePrint Archive: Report 2012/368. [20] V. Grosso, G. Leurent, F. Standaert, K. Varici, F. Durvaux, L. Gaspar, S. Kerckhof, SCREAM & iSCREAM, side-channel resistant authenticated encryption with masking, Presented at Directions in Authenticated Ciphers (DIAC 2014), 2014.

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218 [21] Y. Sasaki, Y. Todo, K. Aoki, Y. Naito, T. Sugawara, Y. Murakami, M. Matsui, S. Hirose, Minalpher, Presented at Directions in Authenticated Ciphers (DIAC 2014), 2014. [22] Moradi, A Hardware Implementation of POET, University of Bochum, Germany, Feb. 2014. [23] Arnould, Towards Developing ASIC and FPGA Architectures of High-Throughput CAESAR Candidates MS Thesis , Swiss Federal Institute of Technology in Zurich (ETHZ), Mar. 2015. [24] E. Homsirikamol, W. Diehl, F. Farahmand, A. Ferozpuri, K. Gaj, C vs. VHDL: benchmarking CAESAR candidates using high-level synthesis and registertransfer level methodologies, Presented at Directions in Authenticated Ciphers (DIAC) 2015, Sep. 27, 2015 Internet: https://cryptography.gmu.edu/athena/ presentations/DIAC_2015_gaj_HLS.pdf [Mar. 18, 2017]. [25] E. Homsirikamol, W. Diehl, A. Ferozpuri, F. Farahmand, M.U. Sharif, K. Gaj, A universal hardware API for authenticated ciphers, 2015 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2015, Dec. 7-9, 2015. [26] E. Homsirikamol, W. Diehl, A. Ferozpuri, F. Farahmand, P. Yalla, J.P. Kaps, K. Gaj, “CAESAR hardware API,” Cryptology ePrint Archive, Report 2016/626, Internet: http://eprint.iacr.org/2016/626.pdf [Mar. 18, 2017]. [27] E. Homsirikamol, W. Diehl, A. Ferozpuri, F. Farahmand, P. Yalla, J.P. Kaps, K. Gaj, “Addendum to the CAESAR hardware API v1.0,” Internet: https://cryptography. gmu.edu/athena/CAESAR_HW_API/CAESAR_HW_API_v1.0_Addendum.pdf [Mar. 18, 2017]. [28] E. Homsirikamol, W. Diehl, A. Ferozpuri, F. Farahmand, K. Gaj “Implementer’s guide to the CAESAR hardware API v1.1” available at https://cryptography.gmu. edu/athena/index.php?id=CAESAR. [29] E. Homsirikamol, P. Yalla, A. Ferozpuri, W. Diehl, F. Farahmand, M. Lyons, K. Gaj, “Benchmarking of round 2 CAESAR candidates in hardware: methodology, designs & results,” Aug. 2016, Internet: https://cryptography.gmu.edu/athena/ presentations/CAESAR_R2_HW_Benchmarking_v1.1.pdf [Mar. 18, 2017]. [30] E. Homsirikamol, W. Diehl, A. Ferozpuri, F. Farahmand, M.X. Lyons, P. Yalla, K. Gaj, Toward fair and comprehensive benchmarking of CAESAR candidates in hardware: a standard API, high-speed implementations in VHDL/Verilog, and benchmarking using FPGAs, DIAC 2016: Directions in Authenticated Ciphers, Sep. 25-27, 2016 http://www.nuee.nagoya-u.ac.jp/labs/tiwata/diac2016/ slides/diac2016_08_Kris.pdf. [31] S. Kerckhof, L. Gaspar, SCREAM Version 3, Université Catholique de Louvain, Dec. 13, 2015, available at http://www.uclouvain.be/crypto/static/SCREAM/ 2015_12_13_Scream_TAE_HW_reference.zip [Mar. 18, 2017]. [32] Moradi, “A hardware implementation of POET 2,” Ruhr-Universität Bochum, Germany, https://www.uni-weimar.de/de/medien/professuren/ mediensicherheit/research/poet/ [Mar. 18, 2017]. [33] M. Kosug, M. Yasuda, A. Satoh, FPGA implementation of authenticated encryption algorithm minalpher, in: 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE), Osaka, Japan, 20-30 Oct. 2015, pp. 572–576. [34] T. Sugawara, “Minalpher v1.1 (Minalpher Team) Hardware Implementation,” Mitsubishi Electronics Corporation, Jun. 2016, Internet: info.isl.ntt.co.jp/crypt/ minalpher [Mar. 18, 2017].

217

[35] “Development package for the CAESAR hardware API, v1.0-3,” https:// cryptography.gmu.edu/athena/index.php?id=CAESAR [Mar. 18, 2017]. [36] Cryptographic Engineering Research Group (CERG) at GMU, “Automated tool for hardware evaluation (ATHENa)”, Internet: https://cryptography.gmu.edu/ athena/ [Mar. 18, 2017]. [37] K. Gaj, J.P. Kaps, V. Amirineni, M. Rogawski, E. Homsirikamol, B.Y. Brewster, ATHENa – automated tool for hardware evaluation: toward fair and comprehensive benchmarking of cryptographic hardware using FPGAs, 20th International Conference on Field Programmable Logic and Applications, Aug. 31st-Sep. 2nd, 2010. [38] F. Farahmand, Tools and Experimental Setup for Efficient Hardware Benchmarking of Candidates in Cryptographic Contests MS Thesis, ECE Department, George Mason University, Fairfax, U.S.A., Nov. 2016. [39] V. Grosso, G. Leurent, F. Standaert, K. Varici, A. Journault, F. Durvaux, L. Gaspar, S. Kerckhof, “SCREAM, Side-Channel Resistant Authenticated Encryption with Masking,” Version 3 (Second Round Specifications), Internet: http: //competitions.cr.yp.to/round2/screamv3.pdf Aug. 2015 [Mar. 18, 2017]. [40] F. Abed, S. Fluhrer, J. Foley, C. Forler, E. List, S. Lucks, D. McGrew, J. Wenzel, “The POET family of on-line authenticated encryption schemes,” Version 2.01, Internet: https://www.uni-weimar.de/de/medien/professuren/ mediensicherheit/research/poet/ Sep. 15, 2015 [Mar. 18, 2017]. [41] Y. Sasaki, Y. Todo, K. Aoki, Y. Naito, T. Sugawara, Y. Murakami, M. Matsui, S. Hirose, “Minalpher v1.1,” Internet: http://competitions.cr.yp.to/ caesar-submissions.html , Aug. 29, 2015 [Mar. 18, 2017]. [42] S. Cogliani, D. Maimut, D. Naccache, R. Portella, R. Reyhanitabar, S. Vaudenay, D. Vizar, “Offset Merkle Damgård (OMD) version 2.0, A CAESAR proposal,” May 11, 2016, Internet: https://competitions.cr.yp.to/round2/omdv20c.pdf [Mar. 18, 2017]. [43] W. Diehl, K. Gaj, Implementation of a boolean masking scheme for the SCREAM cipher, in: 2016 Euromicro Conference on Digital System Design (DSD), Limassol, 2016, pp. 723–726. [44] GMU Source Code of Round 2 CAESAR Candidates, AES-GCM, AES, AES-HLS, and Keccak Permutation F," Aug. 11, 2016, Internet: https://cryptography.gmu. edu/athena/index.php?id=CAESAR [Mar. 18, 2017]. [45] R. Chaves, G. Kuzmanov, L. Sousa, S. Vassiliadis, Improving SHA-2 hardware implementations, in: Cryptographic Hardware and Embedded Systems – CHES 2006, vol. 4249 of LNCS, Springer, Oct. 2006, pp. 298–310. [46] “Database of FPGA Results for Authenticated Ciphers,” George Mason University, Fairfax, U.S.A., available at https://cryptography.gmu.edu/athenadb/fpga_ auth_cipher/table_view. [47] E. Homsirikamol, K. Gaj, “An alternative approach to hardware benchmarking of CAESAR candidates based on the use of high-level synthesis tools,” Directions in Authenticated Ciphers (DIAC 2016), Presentation to, Sep. 2016, Internet: https://cryptography.gmu.edu/athena/presentations/GMU_DIAC_2016_HLS. pdf [Mar. 18, 2017]. [48] S. Deshpande, Analysis and Inner-Round Pipelined Implementation of Selected Parallelizable CAESAR Competition Candidates MS Thesis, ECE Department, George Mason University, Fairfax, U.S.A., Nov. 2016.

218

W. Diehl, K. Gaj / Microprocessors and Microsystems 52 (2017) 202–218 William Diehl is a Doctoral Candidate in the Department of Electrical and Computer Engineering (ECE) at George Mason University, U.S.A. His interests include secure and efficient implementations of cryptography in hardware and software. His recent research topics include power analysis side channel attacks of lightweight authenticated ciphers, design and implementation of authenticated ciphers in Field Programmable Gate Arrays (FPGA), and implementation of lightweight cryptography in very lightweight reconfigurable microprocessors. William holds a B.A. degree in Computer Science from Duke University, and a M.S. degree in Electrical Engineering from the U.S. Naval Postgraduate School.

Kris Gaj received the M.Sc. and Ph.D. degrees in Electrical Engineering from Warsaw University of Technology in Warsaw, Poland. He is currently an Associate Professor at George Mason University, doing research and teaching courses in the area of cryptographic engineering and reconfigurable computing. His research projects focus on benchmarking cryptographic algorithms competing in cryptographic contests, as well as on the development of new hardware architectures and embedded software for public key cryptosystems, authenticated ciphers, secret key ciphers, hash functions, post-quantum cryptography, physical unclonable functions, and codebreaking. He is a co-director of the Cryptographic Engineering Research Group at George Mason University.