Resource efficient implementation of T-Boxes in AES on Virtex-5 FPGA


Information Processing Letters 110 (2010) 373–377


Information Processing Letters www.elsevier.com/locate/ipl

Dur-e-Shahwar Kundi ∗, Arshad Aziz, Nasar Ikram
Electrical Engineering Department, National University of Science and Technology (NUST), Habib Rehmatullah Road, Karachi-75350, Pakistan

Article info

Article history: Received 15 December 2009; received in revised form 3 March 2010; accepted 3 March 2010; available online 10 March 2010. Communicated by A. Tarlecki.

Keywords: Cryptography; AES; FPGA; T-Box; Virtex-5

Abstract

This work presents a resource-efficient implementation of the T-Box module of the Advanced Encryption Standard (AES) on a Xilinx Virtex-5 Field Programmable Gate Array (FPGA). The proposed architecture uses 100% of the capacity of the FPGA's dedicated Block RAM (BRAM), whereas conventional techniques use only 25% to 50% of it. The results show that the module fits into 4 BRAMs, thus reducing on-device resource usage by 50%. © 2010 Elsevier B.V. All rights reserved.

∗ Corresponding author: D.-S. Kundi. 0020-0190/$ – see front matter. doi:10.1016/j.ipl.2010.03.004

1. Introduction

In October 2000 the National Institute of Standards and Technology (NIST) selected the Rijndael algorithm as the new AES standard, replacing the Data Encryption Standard (DES), which was vulnerable to key-exhaustion attacks [1]. The important properties of AES are its security, its hardware/software efficiency and its large block and key sizes. Modern FPGAs offer special features that can be used effectively to speed up digital systems. These features include on-chip BRAM memories, Digital Clock Managers (DCMs), multipliers, serial transceivers and PowerPC cores.

2. Overview of AES

AES is a symmetric block cipher with a block length of 128 bits and key sizes of 128, 192 or 256 bits. The data is arranged in a 4 × 4 array of bytes called the State, with four rows and four columns, 16 bytes in total. The AES algorithm uses a round function that is executed 10 times for a 128-bit key. Each round is composed of four byte-oriented transformations: SubByte, ShiftRow, MixColumn and AddRoundKey, except for the last round, in which the MixColumn transformation is not performed. In addition, there is an initial round at the start that comprises only the AddRoundKey transformation [2].

3. T-Box approach

An approach for 32-bit processors proposed in [3] combines the Look-Up-Tables (LUTs) for the MixColumn and SubByte operations into a single table of 256 4-byte columns. This is generally known as the T-Box approach. By directly mapping the SubByte transformation onto MixColumn, we get the following single expression:



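\[
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
=
\begin{bmatrix} 02 \\ 01 \\ 01 \\ 03 \end{bmatrix} \bullet S[b_{0,c}]
\oplus
\begin{bmatrix} 03 \\ 02 \\ 01 \\ 01 \end{bmatrix} \bullet S[b_{1,c}]
\oplus
\begin{bmatrix} 01 \\ 03 \\ 02 \\ 01 \end{bmatrix} \bullet S[b_{2,c}]
\oplus
\begin{bmatrix} 01 \\ 01 \\ 03 \\ 02 \end{bmatrix} \bullet S[b_{3,c}]
\]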


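Defining the tables T_0 to T_3:

\[
T_0[a] = \begin{bmatrix} S[a] \bullet 02 \\ S[a] \\ S[a] \\ S[a] \bullet 03 \end{bmatrix},
\qquad
T_1[a] = \begin{bmatrix} S[a] \bullet 03 \\ S[a] \bullet 02 \\ S[a] \\ S[a] \end{bmatrix},
\]
\[
T_2[a] = \begin{bmatrix} S[a] \\ S[a] \bullet 03 \\ S[a] \bullet 02 \\ S[a] \end{bmatrix},
\qquad
T_3[a] = \begin{bmatrix} S[a] \\ S[a] \\ S[a] \bullet 03 \\ S[a] \bullet 02 \end{bmatrix}
\]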

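The final result is obtained by XORing the outputs of the T-Box tables, each addressed by its own input byte, as given by the following expression:

\[
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
= T_0[b_{0,c}] \oplus T_1[b_{1,c}] \oplus T_2[b_{2,c}] \oplus T_3[b_{3,c}]
\]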
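The table definitions and the XOR expression above can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: it derives the S-Box from its FIPS-197 definition (field inverse followed by the affine map), builds T_0 to T_3 as lists of 4-byte tuples, and compares four table lookups XORed together with SubByte followed by the MixColumn matrix product. All function and variable names (`gmul`, `build_tboxes`, `column_via_tboxes`, and so on) are illustrative assumptions.

```python
def xtime(x):
    """Multiply by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    x <<= 1
    return (x ^ 0x11B) & 0xFF if x & 0x100 else x


def gmul(a, b):
    """GF(2^8) multiplication by shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def build_sbox():
    """AES S-Box: multiplicative inverse, then the FIPS-197 affine map."""
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gmul(a, x) == 1), 0)
        s = inv
        for shift in range(1, 5):  # XOR with the four left rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


# Coefficient columns of T_0..T_3, top to bottom, as in the definitions above.
COEFFS = [
    (0x02, 0x01, 0x01, 0x03),  # T_0
    (0x03, 0x02, 0x01, 0x01),  # T_1
    (0x01, 0x03, 0x02, 0x01),  # T_2
    (0x01, 0x01, 0x03, 0x02),  # T_3
]


def build_tboxes(sbox):
    """Return four tables of 256 entries; each entry is a 4-byte column."""
    return [[tuple(gmul(sbox[a], c) for c in col) for a in range(256)]
            for col in COEFFS]


def column_via_tboxes(tboxes, b):
    """XOR of T_0[b0], T_1[b1], T_2[b2], T_3[b3], byte lane by byte lane."""
    out = [0, 0, 0, 0]
    for i in range(4):
        out = [o ^ e for o, e in zip(out, tboxes[i][b[i]])]
    return out


def column_direct(sbox, b):
    """Reference: SubByte on each byte, then the MixColumn matrix product."""
    s = [sbox[x] for x in b]
    rows = [(2, 3, 1, 1), (1, 2, 3, 1), (1, 1, 2, 3), (3, 1, 1, 2)]
    return [gmul(s[0], r[0]) ^ gmul(s[1], r[1]) ^ gmul(s[2], r[2]) ^ gmul(s[3], r[3])
            for r in rows]


SBOX = build_sbox()
TBOXES = build_tboxes(SBOX)
```

Because every table carries S[a] unscaled in two of its byte lanes (for example the second lane of T_0), the plain SubByte value can be read back from a table entry as `TBOXES[0][a][1]`.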

In the last round the MixColumn transformation is excluded, while the SubByte operation still has to be performed, so the S-Box must be extracted from the T-Box. For this we use the same approach as [4]: each T-Box table contains the plain SubByte value, i.e. S[a] multiplied by '01'.

4. Related work

Many FPGA architectures for AES have been presented in the open literature since 2001, targeting applications that range from low cost [5] to high speed [6]. Several later AES architectures implement the S-Box [7] and the T-Box [8,9] using the dedicated embedded memories of the FPGA. However, most of them use these resources poorly. The more recent implementations of AES on Virtex-5 FPGAs are [5,6,10–14]. The implementations [6,11–13] use the same embedded feature, the BRAMs of the Xilinx Virtex-5 FPGA, for the Look-Up-Tables. Drimer et al. [6] presented a T-Box based AES implementation on Virtex-5 that exploits embedded features such as DSP and BRAM blocks, while Bulens et al. [10] presented an efficient implementation of the AES substitution boxes (S-Boxes) using the new Virtex-5 LUT architecture. The design in [5] realizes AES using Slice LUTs, and the design in [14] implements AES with an S-Box based on the reduced residue of prime numbers, but neither explores the special embedded features available within the Virtex-5 FPGA. Algotronix [12], Alma [11] and Jet Stream Media [13] have also presented commercial AES cores for Virtex-5 FPGAs. Most of these AES implementations do not exploit the BRAM blocks effectively, which results in inefficient device utilization and the use of large memory blocks.

5. Efficient T-Box module

The conventional technique for implementing a T-Box is not area efficient, because its Look-Up-Table requires 4 times more memory, i.e. 8 Kb ((16 ∗ 16) ∗ 32), than the S-Box technique [4]. However, the dedicated memory blocks in Xilinx FPGAs are ideal for implementing T-Boxes if they are used efficiently. Embedded dual- or single-port RAM modules are easily implemented using the Xilinx CORE Generator™ block memory modules [15]. One T-Box uses 8 Kb of memory, so for 100% utilization four T-Boxes can be stored in one Virtex-5 BRAM. A T-Box implementation of AES requires 16 T-Box instances for 16 parallel lookup operations, and the conventional approach uses a total of 16 BRAMs. This is very inefficient because it uses only 25% of the memory, while the remaining 24 Kb (32 − 8) within each BRAM is wasted. For example, the design of Aziz and Ikram [16] accommodates two T-Boxes per BRAM by using the dual-port functionality, but it still uses only 50% of the available BRAM space.

Our proposed T-Box module raises the BRAM utilization to 100%. This enables us to fit four T-Boxes into one BRAM by using quad-port memories [17]. In a quad-port memory four ports are available for accessing the storage. It is implemented from the existing dual-port RAM by adding a doubled clock and some extra logic. The details of our proposed T-Box architecture are shown in Fig. 1. The BRAM is configured as a dual-port ROM (read-only memory) to access the 32-bit lookup values corresponding to the 8-bit input addresses. A read operation uses one clock edge, and the data in the memory location selected by the address appears on the output port after the BRAM access time [15]. The 8 Kb T-Box Look-Up-Table is initialized into the BRAM using a coefficient (COE) file. The input bytes are delivered to the address pins ADDRA[7:0] and ADDRB[7:0] of the BRAM, and the corresponding lookup values are taken from the output pins DOUTA[31:0] and DOUTB[31:0].
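The quad-port idea described above, two physical ports read twice per system clock cycle, can be illustrated with a small behavioural model. This is a rough Python sketch under stated assumptions, not the authors' HDL: it assumes the four 256-word tables sit back to back in one ROM image and that upper address bits select the table (the text only mentions 8-bit addresses), and the ROM holds placeholder words rather than real T-Box values. The names `make_placeholder_rom`, `dual_port_read` and `four_lookups_per_cycle` are hypothetical.

```python
TBOX_WORDS = 256   # 2^8 addresses per T-Box
WORD_BITS = 32     # width of each lookup value


def make_placeholder_rom():
    """Four tables stored back to back (assumed layout).

    The word for table t at address a is (t << 8) | a, a placeholder that
    makes the origin of every read visible; it is not an AES value.
    """
    return [(t << 8) | a for t in range(4) for a in range(TBOX_WORDS)]


def dual_port_read(rom, addr_a, addr_b):
    """One memory clock edge: ports A and B each return one word."""
    return rom[addr_a], rom[addr_b]


def four_lookups_per_cycle(rom, in_bytes):
    """Serve four lookups in one slow-clock cycle with two fast-clock reads."""
    b0, b1, b2, b3 = in_bytes
    # First fast-clock edge: the two ports serve input bytes 0 and 1.
    first = dual_port_read(rom, 0 * TBOX_WORDS + b0, 1 * TBOX_WORDS + b1)
    # Second fast-clock edge: the same two ports serve input bytes 2 and 3.
    second = dual_port_read(rom, 2 * TBOX_WORDS + b2, 3 * TBOX_WORDS + b3)
    return first + second


ROM = make_placeholder_rom()
```

With this layout the ROM image holds 4 × 256 × 32 bits, i.e. the full 32 Kb that the text attributes to one BRAM, and the four returned words would be XORed to form one output column.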
A common clock signal is applied to the clock input pins CLKA and CLKB, so that both ports (PORT A and PORT B) of the BRAM operate at twice the speed of the rest of the circuitry. For this we use a DCM, which generates two clocks from the input clock source CLKIN: CLK0 (same frequency as the input clock) and CLK2X (twice the input frequency). CLK2X enables the BRAM to produce outputs twice during one complete cycle of CLK0. At the input side, two 2 × 1 multiplexers M0 and M1 select the corresponding input data bytes Din0,0, Din1,0, Din2,0 and Din3,0. The control signal of the multiplexers is CLK0: when it is high, the input data at Byte0 of M0 and Byte1 of M1 are selected, and when it is low, the input data at Byte2 of M0 and Byte3 of M1 are selected. Four 32-bit registers R0–R3 store the BRAM outputs at both the rising and the falling edge of CLK0; R0 and R1 are positive-edge triggered, while R2 and R3 are negative-edge triggered. The outputs of the four 32-bit registers (Tout0,0, Tout1,0, Tout2,0, Tout3,0) are arranged in a column of four bytes and XORed to get the final output, as shown in Fig. 1. The timing diagram in Fig. 2 illustrates the overall operation of our efficient T-Box module.

Fig. 1. 32-bit T-Box module.

Fig. 2. Timing diagram.

Table 1
Comparison results of T-Box module.

Implementation               Device                 Data path (bits)  BRAMs  Frequency (MHz)
Fischer and Drutarovsky [4]  ACEX1K5-1              32                10     –
Rouvroy et al. [9]           XC2V40-6               32                3      123
Algotronix [12]              Virtex-5 (XC5VLX30)    32                2      250
Drimer et al. [6]            Virtex-5               32                2      550
Our T-Box Module only        Virtex-5 (XC5VLX110)   32                1      251.42
McLoone and McCanny [8]      XCV812E-8              128               24     93.9
Alma [11]                    Virtex-5 (XC5VLX30)    128               10     198
Jet Stream Media [13]        Virtex-5 (XC5VLX30)    128               10     289
Bulens et al. [10]           Virtex-4               128               8      250
Aziz and Ikram [16]          XC3S200-5              128               8      321.67
Drimer et al. [6]            Virtex-5               128               8      485
Our T-Box Module only        Virtex-5 (XC5VLX110)   128               4      251.42

The input data to the multiplexers M0 and M1 changes at every
rising edge of CLK0 (points 0 and 2 in Fig. 2). During the positive half-cycle of CLK0, M0 selects Din0,0 while M1 selects Din1,0, and this data is delivered through the multiplexers to the address ports ADDRA[7:0] and ADDRB[7:0] of the BRAM. As the BRAM operates on the double-rate clock, this data is registered into the memory at the falling edge of CLK0, i.e. the next rising edge of CLK2X (point 1 in Fig. 2). The data in the memory location selected by the address appears on both the DOUTA[31:0] and DOUTB[31:0] ports after the clock-to-out time of the BRAM. As both R0 and R1 are positive-edge triggered, they re-register the output data from the BRAM at the rising edge of CLK0, and this data remains valid for the complete CLK0 cycle (point 2 in Fig. 2). In the meantime, M0 and M1 change state at point 1, so Din2,0 at Byte2 and Din3,0 at Byte3 are delivered to the address ports of the BRAM in the second half of the clock cycle. The BRAM registers them in the normal way on the rising edge of CLK2X (point 2 in Fig. 2). The output data from ports DOUTA[31:0] and DOUTB[31:0] are available after the clock-to-out time of the BRAM, so that they are valid when needed (point 3 in Fig. 2). We therefore get valid 32-bit data Tout0,0, Tout1,0, Tout2,0 and Tout3,0 from all four T-Boxes at point 4. In this way each BRAM accommodates four T-Boxes.

6. Performance results

The design was implemented on a Xilinx Virtex-5 XC5VLX110-3FF676 FPGA device. It occupies only 4 of 128 BRAMs (3%) and 199 Slices (1%). Our design operates at a clock frequency of 251.421 MHz, which is still high enough for real-time cryptographic applications, and offers a throughput of 32.128 Gbps. Table 1 details the comparison with previous FPGA implementations that use the T-Box approach. The results clearly show that our proposed implementation saves 50% of the BRAM blocks compared with the others. Drimer et al. [6] recently reported a T-Box based AES implementation on Virtex-5 that uses 2 BRAM blocks in addition to 4 DSP Slices for a single 32-bit basic module. This basic module is replicated four times for a full AES round with a 128-bit data path, thereby using 8 BRAM and 16 DSP blocks. Their design achieves a high frequency at the expense of more memory resources than our design, and their BRAM utilization is inefficient. Bulens et al. [10] present a T-Box implementation of AES on a Virtex-4 FPGA that also uses twice as many embedded BRAMs. We also considered some commercial AES cores for the new Xilinx Virtex-5 FPGA by Algotronix [12], Alma [11] and Jet Stream Media [13], as they exploit the same special feature, the BRAMs of Virtex-5, but their cores use more BRAM blocks for the Look-Up-Tables. Our design clearly shows effective utilization of BRAMs on new FPGAs such as Virtex-5, results in a 50% area reduction, and is applicable to Virtex-6 as well.
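The resource and throughput figures quoted in this section follow from simple arithmetic, sketched below in Python. The BRAM capacity (32 Kb), T-Box size (8 Kb) and the count of 16 T-Boxes come from the text. That one 128-bit block is produced per clock cycle is an assumption made here to relate frequency to throughput; under it, the reported 32.128 Gbps appears to correspond to a rounded clock of 251 MHz. The function names are illustrative.

```python
BRAM_KBIT = 32                   # usable BRAM capacity as stated in the text
TBOX_KBIT = 256 * 32 // 1024     # 256 entries x 32 bits = 8 Kb
TBOXES_NEEDED = 16               # parallel lookups for a 128-bit data path


def brams_required(tboxes_per_bram):
    """BRAM count for 16 T-Boxes at a given packing density (ceiling division)."""
    return -(-TBOXES_NEEDED // tboxes_per_bram)


def utilization(tboxes_per_bram):
    """Fraction of one BRAM that actually holds table data."""
    return tboxes_per_bram * TBOX_KBIT / BRAM_KBIT


def throughput_gbps(bits_per_cycle, f_mhz):
    """Throughput in Gbps, assuming bits_per_cycle bits are produced every cycle."""
    return bits_per_cycle * f_mhz / 1000.0


for density in (1, 2, 4):
    print(density, brams_required(density), f"{utilization(density):.0%}")
print(throughput_gbps(128, 251))
```

Going from two T-Boxes per BRAM to four halves the BRAM count from 8 to 4, which is the 50% saving reported against the 128-bit designs in Table 1.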

7. Conclusions

In this work we have presented an efficient T-Box module that not only halves the overall memory requirements but also turns out to be a more attractive approach than S-Boxes, as a 32-bit T-Box fits into a single BRAM, similar to a 32-bit S-Box. The potential application areas are systems that require gigabit encryptors and decryptors, such as high-performance e-commerce IPsec servers, gigabit link encryptors and secure medical imaging systems. Future work includes the implementation of compact and fast AES encryptor/decryptor cores using the above T-Box module, targeting 3 Gbps to 32 Gbps applications.

References

[1] W.E. Burr, Selecting the advanced encryption standard, IEEE Security & Privacy 1 (2) (Mar.–Apr. 2003) 43–52.
[2] FIPS-197, Federal Information Processing Standards Publication FIPS-197, Advanced Encryption Standard (AES), http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
[3] J. Daemen, V. Rijmen, AES proposal: The Rijndael Block Cipher, Version 2 (Sept. 1999) pp. 1–45.
[4] V. Fischer, M. Drutarovsky, Two methods of Rijndael implementation in reconfigurable hardware, in: CHES 01: Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems, Springer-Verlag, 2001, pp. 77–92.
[5] M.H. Rais, S.M. Qasim, Efficient hardware realization of advanced encryption standard algorithm using Virtex-5 FPGA, IJCSNS International Journal of Computer Science and Network Security 09 (2009) 59–63.
[6] S. Drimer, T. Guneysu, C. Paar, DSPs, BRAMs and a Pinch of Logic: New Recipes for AES on FPGAs, in: 16th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM, 2008, pp. 99–108.
[7] D.-S. Kundi, S. Zaka, Q.U. Ain, A. Aziz, A compact AES encryption core on Xilinx FPGA, in: 2nd International Conference on Computer, Control and Communication (IC-4), 2009, pp. 1–4.
[8] M. McLoone, J. McCanny, Rijndael FPGA implementations utilizing look-up-tables, Journal of VLSI Signal Processing Systems 34 (2003) 261–275.
[9] G. Rouvroy, F.-X. Standaert, J.-J. Quisquater, J.-D. Legat, Compact and efficient encryption/decryption module for FPGA implementation of the AES Rijndael very well suited for small embedded applications, in: Proceedings International Conference on Information Technology: Coding and Computing (ITCC'04), vol. 2, IEEE Computer Society, 2004, p. 583.
[10] P. Bulens, F. Standaert, J. Quisquater, P. Pellegrin, G. Rouvroy, Implementation of the AES-128 on Virtex-5 FPGAs, in: AFRICACRYPT 2008, in: Lecture Notes in Computer Science, vol. 5023, Springer, 2008, pp. 16–26.
[11] Alma Technology Ltd., CAST's Xilinx AES Optimized Encrypt/Decrypt Core, Dec. 2008; http://www.cast-inc.com/cores/aes-c/cast_aes-c-x.
[12] Algotronix Technology Ltd., AES G3 data sheet, Xilinx edition, Oct. 2007; http://www.algotronix-store.com/kb_results.asp?ID=7.
[13] Jet Stream Media Technology Ltd., Fast High Speed AES Core, Oct. 2006; http://www.security-cores.com/us4s/JetAES_4F_1675317.pdf.
[14] M.H. Rais, S.M. Qasim, A novel FPGA implementation of AES-128 using reduced residue of prime numbers based S-Box, IJCSNS International Journal of Computer Science and Network Security 09 (2009) 305–309.
[15] Xilinx, Virtex-5 FPGA User Guide, available at http://www.xilinx.com/support/documentation/user_guides/ug190.pdf.
[16] A. Aziz, N. Ikram, A look-up-table implementation of AES, International Conference on High Performance Computing, Networking and Communication Systems (HPCNCS-07), 2007, pp. 187–191.
[17] Xilinx, Quad-port Memories in Virtex Devices, 2002, available at http://www.xilinx.com/support/documentation/application_notes/xapp228.pdf.