Microelectronics Journal 40 (2009) 1032–1040
IDEA and AES, two cryptographic algorithms implemented using partial and dynamic reconfiguration

José M. Granado, Miguel A. Vega-Rodríguez, Juan M. Sánchez-Pérez, Juan A. Gómez-Pulido
Department of Technologies of Computers and Communications, University of Extremadura, Spain
Article history: Received 12 July 2007; received in revised form 5 November 2008; accepted 12 November 2008; available online 9 January 2009.

Abstract
In this work, we present our experience in implementing two different cryptographic algorithms in an FPGA: IDEA and AES. Both implementations have been done by means of mixing Handel-C and VHDL and using partial and dynamic reconfiguration in order to reach a very high performance. In both cases, we have obtained very satisfactory results, achieving 27.948 Gb/s in the IDEA algorithm and 24.922 Gb/s in the AES algorithm. © 2008 Elsevier Ltd. All rights reserved.
Keywords: IDEA; AES; Cryptography; FPGA; VHDL; Handel-C
1. Introduction

We have all heard the saying ‘‘information is power’’ more than once and, when we speak about digital information, this statement takes on special importance. We live in a digital world and a lot of important information travels through insecure networks. Wireless networks are especially insecure because unauthorised access to them is very easy; it is enough to scan the air near a company to capture important information about it. To solve this problem we can employ cryptographic algorithms, which transform the raw information into an unintelligible sequence of bytes. But the new network standards and new technologies (like private TV over the Internet) demand that these algorithms be very fast. One solution to reach a high performance is to implement the algorithms in FPGAs. Besides, these devices allow us to take advantage of the parallelism these algorithms exhibit. In this work, we have implemented two different algorithms: the international data encryption algorithm (IDEA), one of the most secure cryptographic algorithms, and the advanced encryption standard (AES), the one used in wireless networks. In both cases, we have employed pipelining, and dynamic and partial reconfiguration.
Corresponding author. Tel.: +34 616 557 935; fax: +34 927 257 202. E-mail address: [email protected] (J.M. Granado). doi:10.1016/j.mejo.2008.11.044
We can find other implementations of the IDEA algorithm, like [1], where a pipelined implementation using partial and dynamic reconfiguration reaches a very high performance (8.3 Gb/s). In our case, we have developed a new reconfigurable element, namely the constant coefficient adder (KCA), and we use Handel-C [2] and VHDL [3] to make the implementation (VHDL to implement the reconfigurable elements and Handel-C to implement the non-reconfigurable ones). Besides, we duplicate the data path, encrypting two data blocks at the same time (or encrypting and decrypting at the same time). These improvements give us a better performance (27.948 Gb/s). In addition, the Handel-C language allows us to decrease the development time because it is a high-level language closer to the traditional programmer. Pipelining is also a well-known technique and we can find implementations which use it, like [4], where a seven-stage pipeline is implemented, unlike our implementation, which has 182 stages. Finally, as we will see, the multipliers are the critical elements of this algorithm, so optimizing them is a very important task. In [4], partial product generation and a three-stage diminished-one adder are used to calculate the multiplication modulo 2^16+1. We can find another multiplier implementation technique in [5,6], where parallel-serial multipliers are used. However, our dynamically and partially reconfigurable implementation uses constant coefficient multipliers (KCMs) [7], which are very fast because they employ look-up tables (LUTs) to store part of the result of the multiplication. Taking all this into account, we manage to improve on the results obtained by these authors, among whom the best result is 8.3 Gb/s [1], whereas our result is 27.948 Gb/s.
On the other hand, we have the AES algorithm. We have implemented a 128-bit version of this algorithm using pipelining, and partial and dynamic reconfiguration. The most complex element of this algorithm is the multiplication modulo the irreducible polynomial m(x) = x^8+x^4+x^3+x+1. We find one possible implementation of this element in [8], where the xtime function is used to perform the multiplication. In our case, we have implemented this element by simply calculating the XOR operations needed to perform the multiplication. Another important element is the KeyExpansion phase. It is very usual to implement this phase and we can find several implementations of it, like [9], where a hierarchical simultaneous key generation methodology is used. In our case, we use dynamic and partial reconfiguration to modify the LUTs which contain the sub-keys; therefore, we neither recalculate each sub-key nor spend resources implementing this method. Besides, we employ Handel-C and VHDL in the same way as in the IDEA algorithm, i.e., VHDL to implement the reconfigurable elements and Handel-C to implement the non-reconfigurable elements, obtaining the same advantages. Taking all this into account, we achieve a very good result (24.922 Gb/s), only surpassed by [9] (29.8 Gb/s), but we obtain better performance/area, occupation and latency values. In conclusion, there are many implementations of cryptographic algorithms using FPGAs. All the FPGA-based implementations found in the literature employ some of the characteristics we mix in this work, but there are no papers that employ our methodology completely (pipelining, replication, partial and dynamic reconfiguration, and the mix of three hardware languages: Handel-C, VHDL and JBits). We will see how, using our methodology, we improve on the results obtained by other FPGA-based implementations of IDEA and AES. In fact, our results surpass the best results published in the literature. This paper is structured as follows.
Section 2 reviews the different techniques and approaches used in all the FPGA-based implementations of both cryptographic algorithms that we have found in the literature. In Section 3, we briefly describe the IDEA and AES algorithms. We explain the methodology used in Section 4, focusing on the union between VHDL and Handel-C and on how the partial and dynamic reconfiguration is performed. Later, in Section 5, we describe the exact implementation of both algorithms. Finally, in Section 6, we analyze the results, and we give the conclusions in the last section.
2. Related work

This section gives a background review, detailing the different techniques and approaches used by other FPGA-based implementations. This allows distinguishing our work from others. Table 1 shows all the papers we have found in the literature about FPGA-based implementations of the IDEA algorithm. For

Table 1
Summary of techniques used in the FPGA-based implementations of IDEA.

Ref.      | Year | PDR | Pi | R/Pa | KCM | KCA | Mult. | VHDL | H-C
This work | 2008 | X   | X  | X    | X   | X   | PDR   | X    | X
[1]       | 2003 | X   | X  | ??   | X   |     | PDR   | ??   |
[4]       | 2002 |     | X  | ??   |     |     | TAT   | X    |
[6]       | 2001 |     | ?? | ??   |     |     | MLPM  | X    |
[10]      | 2002 |     | ?? | ??   | ??  |     | ??    | ??   |
[11]      | 2004 |     | X  | ??   | ??  |     | ??    | X    |
[5]       | 2000 |     | ?? | ??   |     |     | MLPM  | X    |
[12]      | 2002 | X   | X  | ??   | X   |     | PDR   | ??   |
Table 2
Summary of techniques used in the FPGA-based implementations of AES.

Ref.      | Year | PDR | Pi | BRAM | MCP   | KEP  | VHDL | H-C
This work | 2008 | X   | X  |      | xtime | PDR  | X    | X
[13]      | 2001 |     | X  | ??   | LUTs  | ??   | ??   |
[14]      | 2003 |     | X  | ??   | xtime | ??   | ??   |
[15]      | 2004 |     | X  | ??   | ??    | ??   | ??   |
[9]       | 2005 |     | X  | ??   | ??    | HSKG | ??   |
[16]      | 2005 |     | X  | ??   | TBOX  | SCD  | X    |
[17]      | 2003 |     | X  | ??   | ??    | OKG  | X    |
[18]      | 2004 |     | X  | ??   | ??    | OKG  | X    |
[19]      | 2004 |     | X  | ??   | xtime | OKG  | ??   |
[20]      | 2003 |     | X  | ??   | GF24  | GF24 | X    |
[21]      | 2001 |     | X  | ??   | ??    | ??   | X    |
every work, we detail the reference, the publication year, and whether it uses the following techniques and approaches: partial and dynamic reconfiguration (PDR), pipelining (Pi), replication/parallelism (R/Pa), constant coefficient multipliers (KCMs) and constant coefficient adders (KCAs). In this table, column ‘‘Mult.’’ indicates the technique used to implement the multipliers modulo 2^16+1: PDR (implemented by means of partial and dynamic reconfiguration, i.e., by using KCMs), three-stage adder tree (TAT) and modified Lyon's parallel-serial multiplier (MLPM). Finally, we also state the hardware description language used in each work (VHDL or Handel-C, H-C). In this table, ‘‘??’’ means that those data were not found in the corresponding reference. Observing Table 1, we can conclude that our work is the only one that combines all the techniques shown in this table. For example, no other work uses the Handel-C language (or its combination with VHDL) or KCAs. In the same way, although KCMs are interesting, few works use them. Table 2 lists all the papers we have found in the literature about FPGA-based implementations of the AES algorithm. For every work we detail the reference, the publication year, and whether it uses PDR, pipelining (Pi) and/or BRAM. In this table, column MCP indicates the MixColumns phase implementation technique used: xtime (implemented by means of the xtime function), LUTs (precalculated tables which store all the possible multiplication results), TBOX (combining the MixColumns and SubByte operations) and GF24 (implemented by using GF(2^4) instead of GF(2^8)). In the same way, column KEP shows the KeyExpansion phase implementation technique used: PDR (implemented by means of partial and dynamic reconfiguration), hierarchical simultaneous key generation (HSKG), separated clock domain implementation (SCD), online key generation (OKG) and GF24 (implemented by using GF(2^4) instead of GF(2^8)).
Finally, we also provide the hardware description language used in each work (VHDL or Handel-C, H-C). In this table, ‘‘??’’ means that those data were not found in the corresponding reference. From Table 2 we obtain conclusions similar to those indicated for Table 1 (IDEA algorithm): our work is the only one that combines all the techniques shown in this table (except the use of BRAM; in fact, half of the references do not use BRAM). We have to highlight that no other work uses the Handel-C language (or its combination with VHDL) or PDR (we use this important technique in order to implement the KeyExpansion phase).
3. The cryptographic algorithms used

In this section, we give a brief overview of the two cryptographic algorithms implemented in this work: IDEA and AES. Among the possible secure modes, for both algorithms we
Fig. 1. Description of the IDEA cryptographic algorithm phases: (a) one of the eight identical phases, which uses the six sub-keys Z1–Z6, and (b) the final transformation phase, which uses four sub-keys. The 64-bit block is processed as four 16-bit sub-blocks through modulo-2^16 adders, XORs and modulo-(2^16+1) multipliers; Z_j^i denotes sub-key j of phase i.
will use the electronic codebook (ECB) mode. This is the secure mode used in all the FPGA-based implementations that we have found in the literature (all the works referenced in this paper).
3.1. The IDEA algorithm

The IDEA [22] is a 64-bit block cryptographic algorithm which uses a 128-bit key. This key is the same for both encryption and decryption (in other words, it is a symmetric algorithm) and it is used to generate 52 16-bit sub-keys. The algorithm consists of nine phases: eight identical phases (Fig. 1(a)) and a final transformation phase (Fig. 1(b)). The encryption takes place when the 64-bit block is propagated through each of the first eight phases in a serial way, where the block, divided into four 16-bit sub-blocks, is modified using the six sub-keys of each phase (elements Z_j^i of Fig. 1: six sub-keys per phase and four sub-keys for the last phase). When the output of the eighth phase is obtained, the block goes through a last phase, the transformation one, which uses the last four sub-keys.
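As a software sketch (not the authors' Handel-C/VHDL), the three 16-bit operations that each IDEA phase combines can be written as below; the function names are ours, and the convention that the all-zero word encodes 2^16 in the multiplication is the algorithm's standard one:

```python
# The three IDEA primitives: XOR (native), addition modulo 2^16, and
# multiplication modulo 2^16 + 1, where the operand 0 represents 2^16.

MOD = 0x10001  # 2^16 + 1, a prime, so every non-zero residue is invertible

def add16(a, b):
    """Addition modulo 2^16."""
    return (a + b) & 0xFFFF

def mul16(a, b):
    """Multiplication modulo 2^16 + 1; the 16-bit word 0 encodes 2^16."""
    a = a or 0x10000          # decode 0 -> 2^16
    b = b or 0x10000
    return ((a * b) % MOD) & 0xFFFF  # re-encode 2^16 -> 0
```

Since 2^16 ≡ −1 (mod 2^16+1), mul16(0, 0) yields 1, a classic sanity check for this operation.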
3.2. The 128-bit AES algorithm

The AES [23] is a symmetric block cipher algorithm that can process data blocks of 128 bits using cipher keys with lengths of 128, 192 and 256 bits. This algorithm is based on the Rijndael algorithm [24], but Rijndael can be specified with key and block sizes in any multiple of 32 bits, with a minimum of 128 bits and a maximum of 256 bits. In our case, we have used a key with a length of 128 bits.

Fig. 2. Description of the AES cryptographic algorithm: the plain text passes through an initial round (AddRoundKey), nine equal rounds (SubByte, ShiftRow, MixColumns and AddRoundKey) and a final round (SubByte, ShiftRow and AddRoundKey); the KeyExpansion block derives the round keys from the master key K.

The AES algorithm is divided into four different phases, which are executed in a sequential way forming rounds. The encryption is achieved by passing the plain text through an initial round, nine equal rounds and a final round. In all the phases of each round, the algorithm operates on a 4×4 array of bytes (called the State). In Fig. 2, we can see the structure of this algorithm. FIPS 197 [23] gives a complete mathematical explanation of the AES algorithm. In this paper, we only explain the MixColumns phase, because it is the critical part of the algorithm and, as we will see in a later section, we have improved its implementation. The MixColumns transformation operates on the State column by column, treating each column as a four-term polynomial. The columns are considered as polynomials over GF(2^8) and multiplied modulo x^4+1 by a fixed polynomial a(x):

a(x) = {03}x^3 + {01}x^2 + {01}x + {02}    (1)

This can be written as the matrix multiplication S'(x) = A(x)·S(x), as we can see in Eq. (2):

| s'0,c |   | 02 03 01 01 |   | s0,c |
| s'1,c | = | 01 02 03 01 | · | s1,c |    for 0 <= c < 4    (2)
| s'2,c |   | 01 01 02 03 |   | s2,c |
| s'3,c |   | 03 01 01 02 |   | s3,c |
As a result of this multiplication, the four bytes in a column are replaced as follows:

S'0,c = ({02}·S0,c) ⊕ ({03}·S1,c) ⊕ S2,c ⊕ S3,c
S'1,c = S0,c ⊕ ({02}·S1,c) ⊕ ({03}·S2,c) ⊕ S3,c
S'2,c = S0,c ⊕ S1,c ⊕ ({02}·S2,c) ⊕ ({03}·S3,c)
S'3,c = ({03}·S0,c) ⊕ S1,c ⊕ S2,c ⊕ ({02}·S3,c)    (3)
we only show the process for a 4×20 KCM LUT, but it is similar for the other components). In the second step, we employ the synthesized VHDL elements in our Handel-C code by means of interfaces. This utilization is made in two parts:
Fig. 3. The xtime function.
where ⊕ is the XOR operation and · is a multiplication modulo the irreducible polynomial m(x) = x^8+x^4+x^3+x+1. Fig. 3 shows the implementation of the function B = xtime(A), which is used to multiply a number by 2 modulo m(x). With it, we only need binary operations, as we can see in Eq. (4):

{02}·Sx,c = xtime(Sx,c)
{03}·Sx,c = xtime(Sx,c) ⊕ Sx,c    (4)
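A minimal software sketch of Eq. (4) (the hardware version in Fig. 3 is pure wiring and XOR gates): xtime multiplies by {02}, reducing by m(x) when bit 7 overflows, and {03}·a is then just xtime(a) XOR a. The constant 0x11B encodes m(x) = x^8+x^4+x^3+x+1.

```python
def xtime(a: int) -> int:
    """Multiply a GF(2^8) element by {02} modulo m(x) = x^8+x^4+x^3+x+1."""
    a <<= 1
    if a & 0x100:      # bit 7 overflowed: reduce modulo m(x)
        a ^= 0x11B
    return a & 0xFF

def mul02(a: int) -> int:
    return xtime(a)

def mul03(a: int) -> int:
    return xtime(a) ^ a   # {03} = {02} XOR {01}
```

For example, xtime(0x57) gives 0xAE (no reduction) and xtime(0xAE) gives 0x47 (one reduction), matching the worked multiplication example in FIPS 197.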
4. Methodology

In order to implement both algorithms we have employed the same methodology. Our approach is based on the mix of two different hardware description languages. On the one hand, we have VHDL [3], a hardware description language which allows us to implement the components that will be reconfigured at runtime by means of JBits [25], a Java library developed for this task. In order to do this, we have to place the components manually and synthesize them. On the other hand, the final implementation of both algorithms has been made by means of Handel-C [2] (the other language used). This language is very close to the traditional programmer (it is similar to the C language) and allows us to reduce the final development time thanks to its ability to implement and manage several elements of our design, like pipelining registers, memories, the host-FPGA communication system, etc. Finally, as we use two different languages, we have to join both codes. To do this, we synthesize the VHDL components by means of the XST tool of Xilinx ISE 8.1i and we reference them in Handel-C by means of interfaces. Besides the languages used, we have followed a pipelined and parallel methodology. In both algorithms, we have pipelined as much as possible. This pipelining has been made in two ways: an operation pipelining and an inner pipelining. In the operation pipelining, all the operations of both algorithms are executed in a pipelined way, i.e., when the first operation finishes with the first block to encrypt, the second block goes into that operation while the first goes into the next operation. In the inner pipelining, the most complex operations are themselves pipelined in order to increase the clock frequency. This second pipelining level is only used in the IDEA algorithm, since the AES algorithm has no complex operations.
Finally, we have used parallelism as much as possible, calculating, for example, the result of all the bytes of the State in a parallel way in all the phases of the AES algorithm. Later we will see the pipelining and the parallelism in detail.

4.1. The union between VHDL and Handel-C

We have mixed VHDL and Handel-C in our design. This mix is made in two phases: a first step in which all the VHDL components must be synthesized, and a second step in which these components will be used in Handel-C by means of interfaces. The first step will be explained later (in Section 4.2.1,
1. Interface definition: In this first part, the interface of each element must be set up. Besides, we establish the inputs to and the outputs from the interfaces. The name of each element interface must be the same as the synthesized VHDL entity name of that element.
2. Interface access: After the interface is defined, we can access it by means of its instance name. It is important to emphasize that two instances of the same interface are different and will be implemented separately. In order to access the output data of an instance, we only have to write the name of the instance followed by the output port, the two separated by a dot.

4.2. Performing partial and dynamic reconfigurations

We have used JBits [25], which is a Java library designed by Xilinx, and the Xilinx constraints [26] to make dynamic and partial reconfigurations. Let us see how to make the dynamic and partial reconfigurations using the 4×20 LUTs (i.e., LUTs with a 4-bit input and a 20-bit output) of a KCM as an example.

4.2.1. Synthesizing the 4×20 LUTs

In order to use VHDL components in a Handel-C design, we have to synthesize those components, but before the synthesis can be made, we have to constrain the elements into the FPGA. To do this, we have used the following constraints within the VHDL code. All the constraints used in the design are on logic placement; no routing constraints are needed.
- LUT_MAP constraint: indicates that an element must be synthesized using LUTs. In the KCM, we employ this constraint with each LUT of the 4×20 LUTs.
- LOC constraint: allows us to place a 4×1 LUT in a concrete Slice of the FPGA.
- BEL constraint: allows us to indicate whether LUT F or LUT G is used (let us remember that the Virtex-II Slices have two LUTs, LUT F and LUT G).
- LOCK_PINS constraint: indicates that the order of the bits of a component must be respected. This constraint is very important in the 4×20 LUTs. Let us see why: suppose that we have a 4×1 LUT with the inputs L3 (MSB), L2, L1, L0 (LSB), that a 4-bit bus B(3:0) is connected to the LUT, and that the bus signal is equal to 1011. If we do not use the LOCK_PINS constraint, the following situation could happen: B1→L3, B3→L2, B0→L1 and B2→L0. In this situation, with the bus value 1011 we will not address the 1011 entry; instead, we will address the 1110 entry. Nevertheless, when we use the LOCK_PINS constraint, the connection between the bus and the LUT will be B3→L3, B2→L2, B1→L1 and B0→L0, and we will reach the 1011 LUT entry with the 1011 bus value.
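The pin-permutation problem described above can be sketched in software (the wiring dictionaries below are our illustration, mirroring the example mappings in the text):

```python
# wiring maps each LUT input index (L0..L3) to the bus bit (B0..B3) driving it.
def lut_entry(bus: int, wiring: dict) -> int:
    """Return the LUT entry actually addressed for a 4-bit bus value."""
    return sum(((bus >> wiring[l]) & 1) << l for l in range(4))

locked   = {3: 3, 2: 2, 1: 1, 0: 0}   # LOCK_PINS: B3->L3, B2->L2, B1->L1, B0->L0
shuffled = {3: 1, 2: 3, 1: 0, 0: 2}   # a possible routing: B1->L3, B3->L2, B0->L1, B2->L0
```

With the locked wiring, the bus value 1011 addresses entry 1011; with the shuffled wiring it addresses entry 1110, exactly the mismatch the LOCK_PINS constraint prevents.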
With all these constraints, we have defined a 4×1 LUT, placed this LUT in a concrete position inside a specific Slice and restricted the input order of this LUT. If we repeat this process 20 times, we will define a 4×20 LUT. When we have our 20 constrained 4×1 LUTs, we have to synthesize all of them together into a 4×20 LUT by means of the Xilinx XST utility (we can do that in the Xilinx ISE Project
Navigator). We will repeat this process for each 4×20 LUT in our design. It is important to highlight that each 4×1 LUT in our design has a different (LOC, BEL) value. If the (LOC, BEL) pair of two different 4×1 LUTs were equal, we would be placing two LUTs in the same place and the final implementation would not work correctly (one of them would be placed randomly).

4.2.2. Runtime reconfiguration of the 4×20 LUTs

In [27] we can see a detailed description of the KCM. One of the components of the KCM is the Super-LUT (4×20 LUT), which is formed by 20 4×1 LUTs. These Super-LUTs will be reconfigured at runtime in order to store the sub-keys of the IDEA algorithm. In this section, we describe how to implement the reconfiguration. The first step is to generate the values which will be stored in the Super-LUTs, i.e., the values resulting from multiplying the constant (16 bits) by the 4-bit input datum. To calculate these values we define a matrix with 20 columns and 16 rows. When we have calculated the data, we only have to perform the reconfiguration by means of the setCLBBits JBits function [25]. Finally, we need to know how and where the KCM Super-LUTs are located. In our design, each Super-LUT is located in a 10×2 Slice matrix, alternating between the LUTs F and G. So, the four Super-LUTs of a KCM are disposed in a 10×4 Slice matrix using both LUTs F and G. It is important to take into account that the LOC constraint makes reference to a concrete Slice, but the setCLBBits function references a concrete CLB and the 4 Slices of this CLB in the form of a 2×2 matrix.
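The 20-column by 16-row matrix mentioned above can be sketched as follows (a software model of the data written by setCLBBits, with our own function name): entry d of the Super-LUT holds the 20-bit partial product k·d for the 16-bit constant k, and each of the 20 columns is the 16-bit contents of one 4×1 LUT.

```python
def super_lut_contents(k: int):
    """Model the Super-LUT of a KCM for a 16-bit constant k.

    Returns (entries, columns): entries[d] is the 20-bit product k*d for
    each 4-bit input d; columns[j] lists bit j of every entry, i.e. the
    16 configuration bits of the j-th 4x1 LUT.
    """
    assert 0 <= k <= 0xFFFF
    entries = [(k * d) & 0xFFFFF for d in range(16)]               # 16 rows
    columns = [[(e >> j) & 1 for e in entries] for j in range(20)]  # 20 columns
    return entries, columns
```

Note that the largest product, 0xFFFF × 15 = 983,025, fits in 20 bits, which is why each Super-LUT needs exactly 20 4×1 LUTs.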
5. The algorithms implementation

The final implementations of both algorithms have been made in a Xilinx Virtex-II 6000 FPGA [28] included in a Celoxica [29] ADMXRC2 board. We have used this FPGA because it has a very large number of resources, particularly a total of 33,792 Slices, and it also allows us to make partial and dynamic reconfigurations.

5.1. The IDEA implementation

To implement the IDEA algorithm we have used constant coefficient multipliers and constant coefficient adders [27]. These elements operate on a variable datum and a constant datum and, as we can see in Fig. 1, all the multipliers and several adders of the IDEA algorithm operate with a constant. Thanks to these elements, we reach a high performance, above all using the KCMs, which are much faster than traditional multipliers (KCAs are only a little faster than normal adders, but we need them to do the key runtime reconfiguration). On the other hand, in order to establish the constant for each of these elements, we have employed partial and dynamic reconfiguration. We do this task using JBits and following the next process:

1. Obtain the key.
2. Calculate the sub-keys by software.
3. Reconfigure the KCMs and KCAs using JBits.
4. Execute the reconfigured hardware.
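Step 2 above can be sketched in software using IDEA's standard key schedule (this is the published schedule of the algorithm, not the authors' code): eight 16-bit words are taken from the 128-bit key, the key is rotated left by 25 bits, and the process repeats until the 52 sub-keys are produced.

```python
def idea_subkeys(key: int):
    """Derive the 52 16-bit IDEA sub-keys from a 128-bit key (step 2)."""
    mask = (1 << 128) - 1
    key &= mask
    subkeys = []
    while len(subkeys) < 52:
        for i in range(8):                      # eight 16-bit words per pass
            if len(subkeys) == 52:
                break
            subkeys.append((key >> (112 - 16 * i)) & 0xFFFF)
        key = ((key << 25) | (key >> 103)) & mask  # rotate left by 25 bits
    return subkeys
```

These 52 values are exactly the constants that steps 3 and 4 burn into the KCM and KCA LUTs before executing the reconfigured hardware.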
5.1.1. KCM and KCA pipelining As we introduced in a previous section, we have employed an inner pipelining in the IDEA algorithm. This pipelining is done in the KCM and KCA elements since they are the slowest elements. In the KCM, as we can see in Fig. 4 (a detailed explanation of KCM implementation is given in [27]), we have used a total of
6 stages and several pipelining registers (one 32-bit register, six 20-bit registers, eight 16-bit registers, four 4-bit registers and five 1-bit registers) to execute the KCM completely. In the KCA, on the other hand, we only use four stages to execute it completely and the number of pipelining registers is lower too (one 19-bit register, three 5-bit registers and five 4-bit registers), as we can see in Fig. 5 ([27] also includes a detailed explanation of the KCA implementation). As we said previously, this component is not much faster than a normal adder, but it allows us to increase the clock frequency (because it uses small adders) and to make the key reconfiguration.

5.1.2. The operation pipelining

Besides the inner pipelining, we use an operation pipelining. This pipelining is done among all the simple operations (like adders or XORs) of all the phases of the IDEA algorithm, as we can see in Fig. 6. In this figure, we can observe how several pipelining registers are included to wait for the result of the KCMs, because these elements have the largest number of pipelining stages of the algorithm. We can see that the pipelining done is a very fine-grained pipelining (the total number of pipelining stages is 182), which greatly reduces the clock period (to 4.58 ns). This gives us a high latency (182 cycles), but if we are speaking about encrypting a great number of blocks (the usual case), the benefit achieved by the high clock frequency clearly overcomes the disadvantage of the latency.

5.1.3. The dual path

Finally, we have implemented two separated and parallel data paths, which is why we encrypt two blocks at the same time, i.e., two blocks per cycle. Besides, these data paths allow us both to duplicate the performance of the encryption/decryption by means of reconfiguring the KCMs and KCAs with the same sub-keys, and to use one path for encryption and the other for decryption.
This last option is the best one in case the FPGA is installed into a server which has the responsibility of encrypting the data coming from an extranet (like the Internet) to, for example, an insecure wireless intranet, and vice versa.

5.2. The AES implementation

As opposed to the IDEA algorithm, AES does not have complex operations but, on the other hand, it has a high memory cost. If we want to transform all the bytes of the State in the SubByte phase in a parallel way, we must implement 16 SBoxes per phase. Besides, if we want to implement a pipelined version of the algorithm, we must implement 10 SubByte phases, each of which has 16 SBox tables. Furthermore, as we want to implement a pipelined version, we must define 11 4×4 1-byte sub-keys, because we have 11 AddRoundKey phases. All these storage elements give us a total of 329,088 bits (= 10 SubBytePhases × 16 SBoxTablesPerSubBytePhase × 256 ElementsPerSBox × 8 BitsPerElement + 11 AddRoundKeyPhases × 16 ElementsPerAddRoundKeyPhase × 8 BitsPerElement). On the other hand, we only have 41 128-bit pipelining registers (Fig. 7), fewer than in the IDEA algorithm.

5.2.1. MixColumns phase implementation

As we have said, the AES algorithm has no complex phases and the implementation of all of them is simple. However, the MixColumns phase implementation deserves special consideration. In Section 3.2, we saw a brief mathematical description of this phase and now we will see how to implement it. To explain the
Fig. 4. KCM pipelining: the KCM LUTs, 20-bit and 24-bit adders, subtractors, zero comparator and selector circuit, separated by the 32-, 20-, 16-, 4- and 1-bit pipelining registers.
Fig. 5. KCA pipelining: the KCA LUTs and the small 5-bit and 4-bit adders, separated by the 19-, 5-, 4- and 16-bit pipelining registers.
Fig. 6. Operation pipelining of one phase of the IDEA algorithm.
Table 4
Comparison between different implementations of the KeyExpansion phase.

Implementation                                                   | Clock cycle (ns) | Performance (Gb/s) | Occupation (Slices)
With KeyExpansion                                                | 14.657           | 8.133              | 17,455
Without KeyExpansion (using partial and dynamic reconfiguration) | 5.136            | 24.922             | 13,816
But why do we use this technique? It is easy to think that we do not need it since, at first sight, we do not achieve a performance improvement as happened in the IDEA algorithm (there are no complex operations), but this question is automatically answered by the data in Table 4. In this table we can see the result of implementing the AES algorithm with the KeyExpansion phase and without it (using partial and dynamic reconfiguration instead). As we can observe, the occupation is reduced by 20.85%, concretely by 3639 Slices, and the clock cycle goes from 14.657 to 5.136 ns, resulting in an important improvement in performance (from 8.133 to 24.922 Gb/s). This performance improvement is easy to explain if we consider that the KeyExpansion phase is strictly sequential: to calculate one column of one sub-key we need the previous column, and to calculate one sub-key we need the previous sub-key.
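The figures quoted from Table 4 can be cross-checked with simple arithmetic (this is only a consistency sketch of the paper's numbers, not synthesis data):

```python
# Slice saving from removing the KeyExpansion phase via reconfiguration.
saved_slices = 17_455 - 13_816          # occupation with vs. without KeyExpansion
saving_pct = 100 * saved_slices / 17_455

# Throughput of the reconfigured version: one 128-bit block per 5.136 ns cycle.
aes_gbps = 128 / 5.136
```

The saving comes out at 3639 Slices, i.e. 20.85%, and the 5.136 ns cycle yields the quoted 24.922 Gb/s.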
Fig. 7. The phase level pipelining of the AES algorithm.
Table 3
Calculation of the State's S'0,0 byte in the MixColumns phase (the resulting bit n is the XOR of the four column entries in row n).

S'0,0 bit | xtime(S0,0)         | xtime(S1,0) ⊕ S1,0            | S2,0    | S3,0
7         | S0,0[6]             | S1,0[7] ⊕ S1,0[6]             | S2,0[7] | S3,0[7]
6         | S0,0[5]             | S1,0[6] ⊕ S1,0[5]             | S2,0[6] | S3,0[6]
5         | S0,0[4]             | S1,0[5] ⊕ S1,0[4]             | S2,0[5] | S3,0[5]
4         | S0,0[3] ⊕ S0,0[7]   | S1,0[7] ⊕ S1,0[4] ⊕ S1,0[3]   | S2,0[4] | S3,0[4]
3         | S0,0[2] ⊕ S0,0[7]   | S1,0[7] ⊕ S1,0[3] ⊕ S1,0[2]   | S2,0[3] | S3,0[3]
2         | S0,0[1]             | S1,0[2] ⊕ S1,0[1]             | S2,0[2] | S3,0[2]
1         | S0,0[0] ⊕ S0,0[7]   | S1,0[7] ⊕ S1,0[1] ⊕ S1,0[0]   | S2,0[1] | S3,0[1]
0         | S0,0[7]             | S1,0[7] ⊕ S1,0[0]             | S2,0[0] | S3,0[0]
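The XOR-only MixColumns transform of Eqs. (3)–(6) and Table 3 can be sketched in software as follows (function names are ours; the hardware version unrolls these XORs into gates):

```python
def xtime(a: int) -> int:
    """Multiply by {02} modulo m(x) = x^8+x^4+x^3+x+1 (0x1B below)."""
    return ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF

def mix_column(s0: int, s1: int, s2: int, s3: int):
    """Apply the MixColumns matrix of Eq. (2) to one State column."""
    return (
        xtime(s0) ^ xtime(s1) ^ s1 ^ s2 ^ s3,   # {02}s0 + {03}s1 + s2 + s3
        s0 ^ xtime(s1) ^ xtime(s2) ^ s2 ^ s3,   # s0 + {02}s1 + {03}s2 + s3
        s0 ^ s1 ^ xtime(s2) ^ xtime(s3) ^ s3,   # s0 + s1 + {02}s2 + {03}s3
        xtime(s0) ^ s0 ^ s1 ^ s2 ^ xtime(s3),   # {03}s0 + s1 + s2 + {02}s3
    )
```

For the FIPS 197 example column (d4, bf, 5d, 30), this produces (04, 66, 81, e5), confirming that only shifts and XORs are needed.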
implementation, we take the calculation of the element S'0,0 of the S'(x) matrix (Eq. (2)). The equation to solve this element is

S'0,0 = ({02}·S0,0) ⊕ ({03}·S1,0) ⊕ S2,0 ⊕ S3,0    (5)
6. Results

In this section, we analyze the results obtained for both cryptographic algorithms, comparing them with the results reached by other authors.
Let us remember that the · operation is done by means of the xtime function (Eq. (4)). So, Eq. (5) changes to

S'0,0 = xtime(S0,0) ⊕ (xtime(S1,0) ⊕ S1,0) ⊕ S2,0 ⊕ S3,0    (6)
5.2.3. The AES pipelining

As we did in the IDEA algorithm, we use an operation-level pipelining. However, as in the AES algorithm all the operations of a phase are done at the same time (remember that in the AES algorithm a round has several phases; see Fig. 2), it is actually a phase-level pipelining. This pipelining can be seen in Fig. 7.
And, finally, taking into account the representation of the xtime function in Fig. 3, we obtain the final result described in Table 3 (the result of bit n is the XOR among the four columns of row n). In conclusion, in order to implement the MixColumns phase we only need XOR gates.

5.2.2. The KeyExpansion phase

If we observe Fig. 2, we can see that the KeyExpansion phase calculates the sub-keys which will be used by the different AddRoundKey phases. However, we do not implement this computation; instead, we use partial and dynamic reconfiguration. The process is similar to the IDEA runtime reconfiguration but, in this case, we reconfigure the LUTs which will be used in the AddRoundKey phases, i.e., the LUTs which store the sub-keys, instead of the components which use these LUTs (as happened in the IDEA algorithm).
6.1. The IDEA results

After the synthesis with Xilinx ISE 8.1i, our implementation only takes 45% of the FPGA Slices (exactly, 15,016 Slices) and it has a clock period of 4.58 ns. In conclusion, after the initial latency, it encrypts two 64-bit blocks every 4.58 ns. This gives us a total performance of 27.948 Gb/s, surpassing the best results achieved by other authors, as Table 5 shows. Besides, our results are better than the results of other authors in both the performance/area ratio and the latency. It is important to emphasize that although we have a high latency (182 cycles), which is bad for small data sizes, applications usually encrypt data of a great size, particularly when we speak about an encryption server. On the other hand, we have a double data path and, therefore, our occupation and latency (in Table 5) correspond to encrypting two data blocks (the rest of the authors only encrypt one block). Another interesting datum is the reconfiguration time. This includes the time needed to rewrite the configuration file (around 141 ms) and the time of the FPGA configuration (18.77 ms, because our implementation needs 15,016 Slices, where one CLB includes 4 Slices, and the time to configure one CLB for Virtex-II FPGAs is 5 μs
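The headline IDEA figures can be cross-checked with simple arithmetic (a consistency sketch of the paper's numbers; the 5 μs-per-CLB configuration time is the value the paper takes from [28,30]):

```python
# Throughput: two 64-bit blocks leave the dual data path every 4.58 ns cycle.
idea_gbps = 2 * 64 / 4.58

# Latency: 182 pipeline stages at 4.58 ns each.
idea_latency_ns = 182 * 4.58

# FPGA configuration time: 15,016 Slices = 3,754 CLBs (4 Slices per CLB),
# at 5 us per CLB, expressed in milliseconds.
clb_config_ms = (15_016 / 4) * 5e-3
```

These reproduce the quoted 27.948 Gb/s, the ~833 ns latency of Table 5 and the 18.77 ms configuration time.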
ARTICLE IN PRESS J.M. Granado et al. / Microelectronics Journal 40 (2009) 1032–1040
1039
Table 5 Comparative results table of the IDEA algorithm. Device
Perform. (Gb/s)
Occupation (Slices)
Perf./Area (Mb/s per slice)
Latency (Cycles–ns)
Reference
XC2V6000-6 XCV600-6 XCV2000E-6 XCV1000E-6 XCV1000E-4 XC2V1000-5 XCV1000E-6 XCV800-6
27.948 8.3 6.8 5.1 4.3 3.008 2.3 1.28
15,016 6,078 8,640 11,602 ?? 11,700 11,204 6,312
1.86 1.39 0.66 0.45 ?? 0.26 0.21 0.21
182–833 158–1,205 132–1,250 ??–2,134 ?? ??–190 ??–7,372 25–1,250
This work [1] [4] [6] [10] [11] [5] [12]
Table 6 Comparative results table of the AES algorithm. Device
Perform. (Gb/s)
Occupation (Slices)
Perf./Area (Mb/s per Slice)
Latency (Cycles–ns)
Reference
XC2V6000-6 XCV812E XCV3200E-8 XC2VP20-7 XC2VP70-7 XCV3200E-8 XCV2000E-8 XCV2000E-8 XC2V4000 XCV1000E-8 XCV1000E-8 XCV812E-8 XCV812E-8
24.922 12.02 11.78 21.54 29.77 18.56 23.65 20.30 23.57 21.56 17.80 11.97 12.20
3,720 2,000 2,784 5,177 7,761 5,016 6,597 5,810 6,842 11,022 10,750 9,406 12,600
6.70 6.01 4.23 4.16 3.84 3.70 3.58 3.49 3.44 1.96 1.66 1.27 0.97
41–210.5 ??–106 ?? 46–292.8 60–253.8 ??–496 ??–379 ??–323 ??–162.9 ??–422 ??–318 ??–322 ??–1,768.8
This work [13] [14] [15] [9] [14] [16] [17] [18] [19] [20] [19] [21]
[28,30]). This time may seem too high, but the reconfiguration is done when the user introduces the key and so s/he does not realize about this time (when the user pushes the ‘‘configure’’ button, the reconfiguration is already done).
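The IDEA figures quoted above follow from simple arithmetic. The sketch below is our own check against the values stated in the text (4.58 ns clock, double 64-bit data path, 15,016 Slices, 4 Slices per CLB, 5 μs per CLB), not part of the original implementation:

```python
# Reproduce the IDEA performance and reconfiguration figures (a sketch).

CLOCK_NS = 4.58          # clock period after synthesis
BITS_PER_CYCLE = 2 * 64  # two 64-bit blocks leave the double data path per cycle

# Throughput in Gb/s: bits per cycle divided by the period in ns
throughput_gbps = BITS_PER_CYCLE / CLOCK_NS
print(f"{throughput_gbps:.3f} Gb/s")   # ~27.948 Gb/s

SLICES = 15016
SLICES_PER_CLB = 4       # Virtex-II: 4 Slices per CLB
US_PER_CLB = 5           # configuration time per CLB in microseconds [28,30]

# FPGA configuration time in ms: CLB count times per-CLB time
reconfig_ms = (SLICES / SLICES_PER_CLB) * US_PER_CLB / 1000
print(f"{reconfig_ms:.2f} ms")         # ~18.77 ms
```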
6.2. The AES results

Let us now examine the AES results. Our implementation uses a total of 3,720 Slices, only 11% of the FPGA resources. This result is one of the best in this field, as Table 6 shows (note that Table 6 does not include the Slices or BRAMs used for SBox storage, because these figures are very similar for all the implementations). On the other hand, thanks to the pipelined implementation, we reach a clock cycle of 5.136 ns, i.e., one 128-bit block per cycle and a performance of 24.922 Gb/s. This figure is not the best in Table 6 (it is the second), but we improve on [9] in every other column of the table (we highlight in particular the performance/area ratio), so our result is comparable with the implementation of [9] and better than the remaining implementations by other authors. The performance/area ratio gives us an important advantage: as Table 6 shows, our result notably surpasses those obtained by the other authors. Finally, the Latency column is also favourable to us, where we again have the best result. Furthermore, we must point out that Table 6 only compares our implementation with the best AES implementations found in the literature. Other FPGA-based implementations with worse results have not been included, for example, [17] (8.90 Gb/s), [31] (7.68 Gb/s), [32] (7 Gb/s), [33] (6.95 Gb/s), [34] (3.65 Gb/s), [35] (2.96 Gb/s), [36] (1.94 Gb/s), [8] (1.94 Gb/s), [37] (1.91 Gb/s), [38] (1.88 Gb/s), [39] (1.75 Gb/s), [8] (1.60 Gb/s), [38] (0.98 Gb/s), [40] (0.90 Gb/s), [39] (0.69 Gb/s), [37] (0.39 Gb/s) or [41] (0.35 Gb/s).

7. Conclusions

In this work, we have presented the implementation of two cryptographic algorithms (IDEA and AES) using partial and dynamic reconfiguration. To implement these algorithms, we have employed two different languages, Handel-C and VHDL, combining both to achieve a very high-performance implementation. In both cases the implementations are pipelined, and we use JBits to reconfigure the elements involved with the keys (the KCMs and KCAs in IDEA and the LUTs which store the sub-keys in AES). The results achieved in both cases are very good: we surpass the best result in the literature in the case of IDEA and obtain the second best result in the case of AES, while having the best results in the occupation, performance/area and latency fields. In conclusion, using FPGAs together with partial and dynamic reconfiguration is a very good approach to implementing cryptographic algorithms, since we can obtain a performance of around 25 Gb/s. This performance is in line with future communication standards for wired and wireless networks.
Acknowledgements

This work has been partially funded by the Spanish Ministry of Education and Science and FEDER under contracts TIN2005-08818-C04-03 (the OPLINK project) and TIN2008-06491-C04-04/TIN (the MSTAR project).
References

[1] I. González, S. Lopez-Buedo, F.J. Gómez, J. Martínez, Using partial reconfiguration in cryptographic applications: an implementation of the IDEA algorithm, in: 13th International Conference on Field Programmable Logic and Applications, 2003, pp. 194–203.
[2] Celoxica, Handel-C Language Reference Manual, version 3.1, 2005.
[3] V.A. Pedroni, Circuit Design with VHDL, The MIT Press, 2004.
[4] A. Hämäläinen, M. Tommiska, J. Skyttä, 6.78 gigabits per second implementation of the IDEA cryptographic algorithm, in: 12th International Conference on Field Programmable Logic and Applications, 2002, pp. 760–769.
[5] M.P. Leong, O.Y.H. Cheung, K.H. Tsoi, P.H.W. Leong, A bit-serial implementation of the international data encryption algorithm IDEA, in: IEEE Symposium on Field-Programmable Custom Computing Machines, 2000, pp. 122–131.
[6] O.Y.H. Cheung, K.H. Tsoi, P.H.W. Leong, M.P. Leong, Tradeoffs in parallel and serial implementations of the international data encryption algorithm IDEA, in: Workshop on Cryptographic Hardware and Embedded Systems, vol. 2162, 2001, pp. 333–347.
[7] R. Vaidyanathan, J.L. Trahan, Dynamic Reconfiguration: Architectures and Algorithms, first ed., Kluwer Academic/Plenum Publishers, 2003.
[8] S.-S. Wang, W.-S. Ni, An efficient FPGA implementation of advanced encryption standard algorithm, IEEE International Symposium on Circuits and Systems 9 (2004) 597–600.
[9] S.-M. Yoo, D. Kotturi, D.W. Pan, J. Blizzard, An AES crypto chip using a high-speed parallel pipelined architecture, Microprocessors and Microsystems 29 (7) (2005) 317–326.
[10] J.L. Beuchat, J.O. Haenni, C. Teuscher, F.J. Gómez, H.F. Restrepo, E. Sánchez, Approches matérielles et logicielles de l'algorithme IDEA, Technique et Science Informatiques 21 (2) (2002) 203–204.
[11] P. Kitsos, N. Sklavos, M.D. Galanis, O. Koufopavlou, 64-bit block ciphers: hardware implementations and comparison analysis, Computers and Electrical Engineering 30 (2004) 593–604.
[12] I. González, Codiseño en Sistemas Reconfigurables basado en Java, Internal Technical Report, UAM, Spain, 2002.
[13] M. McLoone, J.V. McCanny, Rijndael FPGA implementation utilizing look-up tables, in: IEEE Workshop on Signal Processing Systems, 2001, pp. 349–360.
[14] F.X. Standaert, G. Rouvroy, J.J. Quisquater, J.D. Legat, Efficient implementation of Rijndael encryption in reconfigurable hardware: improvements and design tradeoffs, CHES 2003, LNCS 2779, 2003, pp. 334–350.
[15] A. Hodjat, I. Verbauwhede, A 21.54 Gbits/s fully pipelined AES processor on FPGA, in: 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2004.
[16] T. Good, M. Benaissa, AES on FPGA from the fastest to the smallest, in: Seventh Cryptographic Hardware and Embedded Systems (CHES), 2005, pp. 427–440.
[17] G.P. Saggese, A. Mazzeo, N. Mazzocca, A.G.M. Strollo, An FPGA-based performance analysis of the unrolling, tiling, and pipelining of the AES algorithm, FPL 2003, LNCS 2778, 2003, pp. 292–302.
[18] J. Zambreno, D. Nguyen, A. Choudhary, Exploring area/delay tradeoffs in an AES FPGA implementation, in: 14th Field Programmable Logic and Applications (FPL), 2004, pp. 575–585.
[19] X. Zhang, K.K. Parhi, High-speed VLSI architectures for the AES algorithm, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12 (9) (2004) 957–967.
[20] K.U. Järvinen, M.T. Tommiska, J.O. Skyttä, A fully pipelined memoryless 17.8 Gbps AES-128 encryptor, in: Proceedings of the 11th International Symposium on Field Programmable Gate Arrays, 2003, pp. 207–215.
[21] K. Gaj, P. Chodowiec, Fast implementation and fair comparison of the final candidates for advanced encryption standard using field programmable gate arrays, CT-RSA 2001, LNCS 2020, 2001, pp. 84–99.
[22] B. Schneier, Applied Cryptography, second ed., Wiley, New York, 1996.
[23] Federal Information Processing Standards Publication 197 (FIPS 197), <http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf>, 2001.
[24] J. Daemen, V. Rijmen, The Block Cipher Rijndael, Smart Card Research and Applications, LNCS 1820, Springer, Berlin, 2000, pp. 288–296.
[25] Sun Microsystems, JBits User Guide, 2004.
[26] Xilinx, Constraints Guide 8.1i, 2005.
[27] J.M. Granado, M.A. Vega-Rodríguez, J.M. Sánchez-Pérez, J.A. Gómez-Pulido, A dynamically and partially reconfigurable implementation of the IDEA algorithm using FPGAs and Handel-C, Journal of Universal Computer Science (JUCS) 13 (3) (2007) 407–418.
[28] Xilinx, Virtex-II Platform FPGAs: Complete Data Sheet, 2005.
[29] Celoxica: <http://www.celoxica.com>, 2008.
[30] Xilinx, Virtex-II Platform User Guide, 2005.
[31] M. Alam, W. Badawy, G. Jullien, A novel pipelined threads architecture for AES encryption algorithm, in: IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2002, pp. 296–302.
[32] M. McLoone, J.V. McCanny, Single-chip FPGA implementation of the advanced encryption standard algorithm, in: 11th Field Programmable Logic and Applications (FPL), 2001, pp. 152–161.
[33] M. McLoone, J.V. McCanny, High performance single-chip FPGA Rijndael algorithm implementations, in: Third Cryptographic Hardware and Embedded Systems (CHES), 2001, pp. 65–76.
[34] N. Sklavos, O. Koufopavlou, Architectures and VLSI implementations of the AES-proposal Rijndael, IEEE Transactions on Computers 51 (2002) 1454–1459.
[35] I. Papaefstathiou, V. Papaefstathiou, C. Sotiriou, Design-space exploration of the most widely used cryptography algorithms, Microelectronics and Microsystems 28 (2004) 561–571.
[36] A.J. Elbirt, W. Yip, B. Chetwynd, C. Paar, An FPGA implementation and performance evaluation of the AES block cipher candidate algorithm finalists, in: The Third Advanced Encryption Standard (AES3) Candidate Conference, 2000, pp. 13–27.
[37] A. Labbé, A. Pérez, AES implementations on FPGA: time-flexibility tradeoff, in: 12th Field Programmable Logic and Applications (FPL), 2002, pp. 836–844.
[38] A.J. Elbirt, W. Yip, B. Chetwynd, C. Paar, An FPGA-based performance evaluation of the AES block cipher candidate algorithm finalists, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 9 (4) (2001) 545–557.
[39] N. Weaver, J. Wawrzynek, Very High Performance, Compact AES Implementations in Xilinx FPGAs, <http://www.cs.berkeley.edu/~nweaver/sfra/rijndael.pdf>, 2002.
[40] S. McMillan, C. Patterson, JBits implementation of the advanced encryption standard (Rijndael), in: 11th Field Programmable Logic and Applications (FPL), 2001, pp. 162–171.
[41] A. Dandalis, V.K. Prasanna, J.D.P. Rolim, A comparative study of performance of AES candidates using FPGAs, in: The Third Advanced Encryption Standard (AES3) Candidate Conference, 2000, pp. 125–134.