On efficient implementation of FPGA-based hyperelliptic curve cryptosystems




Computers and Electrical Engineering 33 (2007) 349–366 www.elsevier.com/locate/compeleceng

On efficient implementation of FPGA-based hyperelliptic curve cryptosystems Grace Elias, Ali Miri *, Tet-Hin Yeap School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada Available online 27 July 2007

Abstract

In this age, where new technological devices such as PDAs and mobile phones are becoming part of our daily lives, providing efficient implementations of suitable cryptographic algorithms for devices built on embedded systems is becoming increasingly important. This paper presents an efficient design of a high-performance hyperelliptic curve cryptosystem for a field programmable gate array (FPGA), which is well suited for embedded systems having limited resources such as memory, space and processing power. We investigate two architectures, one using a projective coordinate representation for hyperelliptic systems and the second using a mixed coordinate representation that eliminates the need for field inversions in the point arithmetic, which have been shown to be expensive in both time and space. In addition, both architectures are based on an explicit formula which allows one to compute the point arithmetic directly in the finite field, thereby eliminating a level of arithmetic. The operation time of the HECC system is also improved by considering simplifications of the hyperelliptic curve which are accomplished through simple transformations of variables. As a result, these implementations offer significantly faster operation time and smaller area consumption than other HECC hardware implementations done to date. © 2007 Published by Elsevier Ltd.

Keywords: Reconfigurable hardware; Public key cryptosystems; Algorithms implemented in hardware

1. Introduction

Security plays an important role in many facets of modern communications and computer networks by providing a means to protect information that would otherwise be vulnerable to tampering or eavesdropping during transmission. With the advent of mobile communication systems such as PDAs and mobile phones, security considerations in embedded systems are becoming increasingly important [33]. Within these constrained environments one must consider the limitations on memory, processing power and space; as a result, careful thought must be given to choosing the cryptographic system that will be the most efficient and effective in this sort of environment.

Corresponding author. E-mail address: [email protected] (A. Miri).

0045-7906/$ - see front matter © 2007 Published by Elsevier Ltd. doi:10.1016/j.compeleceng.2007.05.006


G. Elias et al. / Computers and Electrical Engineering 33 (2007) 349–366

One such cryptographic system, introduced by Koblitz in 1988 [1], is the hyperelliptic curve cryptosystem (HECC). Unlike the well-known RSA cryptographic system, whose security is built on the difficulty of factoring large composite numbers, HECC derives its security from the difficulty of solving the discrete logarithm problem, and it offers many advantages over RSA. One of the major advantages is that HECC can use much shorter key lengths than RSA for an equivalent level of security, because a sub-exponential time attack against HECC does not yet exist. Hyperelliptic curve cryptosystems are a generalization of the popular elliptic curve cryptosystems (ECC), on which much research has already been done. However, HECC can offer the same level of security as ECC with, again, a smaller operand length. It is these shorter operand lengths that make HECC ideal for such constrained environments, because they can result in faster implementations, less power consumption and less space. Over the last few years, there have been implementations of HECC in software [2–5] and in hardware devices such as FPGAs [6–9]. These implementations, however, were found to be significantly slower than ECC due to the extra level of polynomial arithmetic that must be performed when using Cantor's algorithm [10]. In 1994, Spallek [11] attempted to speed up the HECC operations by defining the required coefficients from Cantor's algorithm solely in the finite field, thereby removing an extra level of arithmetic. Since then, other work on optimizing the explicit formula for even or odd characteristic and specific genus was done in [12–22]. Implementations of these various versions of the explicit formula were done in software [13–16] and were found to be much faster than implementations based on Cantor's algorithm. The first implementation of the explicit formula on an embedded microprocessor (ARM7) was done by Pelzl [23], and improvements were made in later work [24,25].
In 2002, Lange [17] investigated the use of projective coordinates in the explicit formula for hyperelliptic curves of genus 2 to eliminate the need for field inversions altogether, which was later implemented by Nguyen [26] on a smartcard. In [34,35], Hodjat et al. use a hardware/software co-design approach on 8-bit microprocessors to achieve a significant speed-up compared to that of a software-only solution. In this paper, we present an efficient and complete design of a high-performance, FPGA-based HECC processor aimed at embedded systems with constrained environments. The processor performs a key operation required in both encryption and decryption within HECC systems, namely the scalar divisor multiplication on the Jacobian of a hyperelliptic curve. The first implementation is based on an explicit formula for projective coordinates, while the second is based on an explicit formula for mixed coordinates, operating over a curve of genus 2 and the field F_2^113. The design was described using the Verilog hardware description language, simulations were performed using the Modelsim simulator, and the Xilinx Integrated Software Environment was used to synthesize and implement the design for a Xilinx Virtex II FPGA. This is, to our knowledge, the fastest implementation of HECC in hardware. The remainder of this paper is organized as follows. Section 2 summarizes previous implementations of HECC in hardware and on embedded microprocessors. Section 3 provides the theoretical background needed to understand HECC and introduces the algorithms implemented at each level. Section 4 describes the architecture of the HECC processor and its main components. Section 5 describes the implementation platform and conditions, and lists the final implementation results. Conclusions are given in Section 6.

2. Previous work

As previously mentioned, there have been a few documented software and hardware implementations of HECC. In this section we focus solely on those implementations done in hardware and on embedded microprocessors [6–9,23–26]. In 2001, Wollinger [6] proposed the first hardware architectures for the implementation of hyperelliptic cryptosystems, based on Cantor's algorithm on an FPGA. Wollinger did not, however, design for optimal area usage, and did not place and route the entire design to determine the exact speed and number of logic units used on the FPGA. Further work was done in [7]. On the other hand, Clancy's paper [9] gave the first complete hardware implementation of a hyperelliptic curve processor, including scalar multiplication (binary method and window NAF), with exact timing and area values. In that paper, various results were used to simplify some of the algorithms; for instance, instead of using the extended Euclidean algorithm to perform polynomial GCD calculations, a simplified


GCD calculation block was developed that could perform much faster. This implementation, which was based on the hyperelliptic curve C: v^2 + uv = u^5 + u^2 + 1 of genus 2 over several base fields, was designed using Verilog HDL and the Xilinx Integrated Software Environment for synthesis and implementation. Previous work [8] explained the point arithmetic architecture in further detail. In his thesis, Pelzl [23] presents a complete implementation of the group operations on the Jacobian of a hyperelliptic curve for embedded processors such as the ARM7 and the PowerPC. The implementation was done over curves of genus 2 and genus 3, and performance comparisons were given between ECC and HECC. Pelzl generalized and optimized explicit formulas for genus 3 and made use of three basic algorithms for addition and doubling on the Jacobian, namely Lange's explicit formula for genus 2, his own explicit formula for genus 3, and Cantor's algorithm. The algorithm was selected based on the following rules:

• Use his own explicit formula for genus 3 and Lange's explicit formula for genus 2 only for the frequent cases of doubling and adding.
• Use Cantor's algorithm (with slower polynomial arithmetic) for all other cases, which occur with low probability.

Pelzl included Cantor's algorithm for completeness; however, he mentioned that due to the low probability of the non-frequent cases, the use of Cantor's algorithm could be avoided. Improvements were made in [24,25]. In 2002, Nguyen [26] implemented Lange's explicit formula for mixed coordinates on a smartcard with a FameXE coprocessor. This is similar to the formula based on projective coordinates, except that it allows the addition algorithm to take one point in affine coordinates and the other in projective coordinates, thereby saving some multiplications. With the mixed-coordinates version of the explicit formula he was able to eliminate all field inversions in the point arithmetic.
The hyperelliptic processor proposed in this paper uses some of the features of the implementations mentioned above (early results of this work were presented in [27]). The work done in [6–9] is based on Cantor's algorithm, which, as explained earlier, does not work well in constrained environments because of the extra polynomial arithmetic that is required; this ultimately reduces the time performance and increases the area consumption on the FPGA. These papers do, however, provide efficient finite field algorithms that are suitable for hardware devices, and they provide a good basis for performance comparisons. In addition, the above papers [6–9,23–25] implemented HECC using affine coordinates and thus must include a field inversion for every point doubling and addition operation performed, which can significantly increase the time to perform a scalar multiplication. Although the work done by Nguyen [26] is based on projective coordinates on a smartcard, the FPGA implementation proposed in this paper outperforms this work as well as the other previous hardware implementations.

3. Background

The hyperelliptic curve version of the discrete logarithm problem translates into finding the value of the scalar integer k when given Q = k · P, where P is a group element and k is an arbitrary integer in the range 1 ≤ k < ord(P). The computation of Q, which involves the scalar multiplication of k with P, is easy; however, going back to find the value of k given the curve, the point P, and Q is very difficult. The strength of the hyperelliptic cryptosystem lies in the difficulty of this problem, and the HECC processor in this paper is concerned with the implementation of the scalar multiplication of k with P such that Q = k · P.

3.1. General theory of hyperelliptic curves

An elliptic curve is a curve of genus 1, while a hyperelliptic curve is a curve with genus greater than or equal to 2.
Working with higher genus curves allows one to use smaller and smaller base fields for the same level of security, which should theoretically reduce the computational complexity. However, over the past few years the discrete logarithm problem has received much attention, and numerous algorithms using various techniques have been developed to solve it. One of the well-known approaches is the Pollard ρ-method, which works over arbitrary groups. In 2000, Gaudry [28] showed that the discrete


log problem for curves of genus 5 or higher could be solved faster than with the existing Pollard ρ-method, and thus only curves with 2 ≤ g ≤ 4 should be used. In this paper we implement over a curve of genus 2. A hyperelliptic curve of genus 2 in Weierstraß form is given by

C: y^2 + (h2·x^2 + h1·x + h0)·y = x^5 + f4·x^4 + f3·x^3 + f2·x^2 + f1·x + f0

which may be written in short form as C: y^2 + h(x)·y = f(x). In general, for arbitrary genus g, f(x) and h(x) are elements of the ring F_2^n[x], where h(x) is of degree at most g and f(x) is a monic polynomial of degree 2g + 1. In a hyperelliptic curve cryptosystem, group operations are performed on an ideal class group that is isomorphic to the Jacobian of the hyperelliptic curve, where each element can be represented uniquely by a reduced divisor. According to Mumford's representation, each divisor class may be represented by a pair of polynomials div[u, v] over the field F_2^n with the following properties: (1) u is monic, (2) deg v < deg u ≤ g, (3) u | v^2 + vh − f. Divisors of hyperelliptic curves may also be defined as a finite formal sum of points on the hyperelliptic curve, written as D = Σ_{P∈C} g_P·P with g_P ∈ Z; however, for our purposes we will use the first definition throughout this paper, as it is more practical.

3.2. Hyperelliptic curve cryptosystem hierarchy

As mentioned earlier, in a discrete log hyperelliptic curve cryptosystem the main operation that needs to be performed is the scalar multiplication of a group element P, which is a reduced divisor div[u, v], by an integer k (level 1). This in turn requires point addition and point doubling on the Jacobian of the hyperelliptic curve (level 2), which in turn depends on the performance of the polynomial arithmetic (level 3) and finally on the performance of the finite field arithmetic (level 4). The relationship between all four levels of the HECC is illustrated in Fig. 1.
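Mumford's conditions are easy to check mechanically. The following is a minimal Python sketch (our own illustration, not part of the paper's hardware design): polynomials over F_2 are encoded as integers with bit i holding the coefficient of x^i, and we verify the conditions for the divisor [x, 1] on the genus-2 curve v^2 + uv = u^5 + u^2 + 1 used in [9], for which h(x) = x and f(x) = x^5 + x^2 + 1.

```python
# Sketch: checking Mumford's conditions for div[u, v] over the base
# field F_2. Helper names (pmul, pmod) are our own.

def pmul(a, b):
    """Carry-less multiplication in F_2[x] (integers as bit vectors)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def pmod(a, m):
    """Remainder of a divided by m in F_2[x]."""
    dm = m.bit_length() - 1
    while a and a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a

# Curve from [9]: v^2 + uv = u^5 + u^2 + 1, i.e. h(x) = x, f = x^5 + x^2 + 1.
h = 0b10                     # x
f = 0b100101                 # x^5 + x^2 + 1

# Candidate divisor [u, v] = [x, 1]: the point (0, 1) lies on the curve.
u, v = 0b10, 0b1

g = 2
du, dv = u.bit_length() - 1, v.bit_length() - 1
assert du <= g and dv < du   # deg v < deg u <= g (u is monic by encoding)
# u | v^2 + vh - f; in characteristic 2, subtraction is the same XOR.
assert pmod(pmul(v, v) ^ pmul(v, h) ^ f, u) == 0
```

The same divisibility test applies verbatim over F_2^n once the coefficient arithmetic is replaced by field arithmetic.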

Fig. 1. Hyperelliptic curve hierarchy.


The group operations of point addition and point doubling on the Jacobian of the hyperelliptic curve (level 2), as defined by Cantor, are given in Algorithm 1.

Algorithm 1 Cantor's algorithm
INPUT: reduced D1 = div(u1, v1), reduced D2 = div(u2, v2), curve polynomials f, h
OUTPUT: reduced div(u3, v3) = D1 + D2
1. Perform two extended GCDs to compute: d = gcd(u1, u2, v1 + v2 + h) = s1·u1 + s2·u2 + s3·(v1 + v2 + h)
2. u3 ← u1·u2/d^2
3. v3 ← (s1·u1·v2 + s2·u2·v1 + s3·(v1·v2 + f))/d (mod u3)
4. while deg(u3) > g
5. u3 ← (f − h·v3 − v3^2)/u3
6. v3 ← −h − v3 (mod u3)
7. return div(u3, v3)

Cantor's algorithm consists of two steps, namely the composition step followed by the reduction step. The composition step adds the two reduced input divisors D1 = div(u1, v1) and D2 = div(u2, v2) and outputs a semi-reduced divisor D1 + D2. The reduction step then finds the unique reduced divisor D = div(u3, v3) corresponding to the output of the composition step. The explicit formula is now obtained by defining Cantor's algorithm solely in terms of the finite field arithmetic, thereby eliminating the need for the extra level of polynomial arithmetic (level 3) entirely. The 4-level hierarchy in Fig. 1 can then be reduced to three levels via the explicit formula, namely scalar multiplication (level 1), point arithmetic (level 2), and finally the field arithmetic (level 4). These three levels, which make up this HECC system, are discussed further in the following sections.

3.3. Level 1 – scalar multiplication

There are many methods for multiplying a divisor P by a scalar integer k. In this paper we choose the basic double-and-add method, since in a hyperelliptic cryptosystem the calculation of k · P involves a fixed base P and a variable integer k, a situation where the double-and-add method is optimal. In general, the double-and-add method can scan the bits of the integer k from right to left (RL) or left to right (LR).
The RL method has the advantage of being able to run the point doubling and addition operations in parallel, while the LR method must run them serially. The first implementation uses the RL double-and-add method, which relies on the binary expansion of an ℓ-bit integer k. This method requires ℓ point doublings and W − 1 point additions, where W is the weight of the integer k and point additions with the identity 0 are not counted. This method, which we refer to as the right-to-left binary expansion method, is given in Algorithm 2.

Algorithm 2 Right-to-left binary expansion method
INPUT: divisor P = div(u, v) and an ℓ-bit binary vector k = Σ_{j=0}^{ℓ−1} k_j·2^j, k_j ∈ {0, 1}, representing an integer
OUTPUT: R = k · P
1. B ← P; R ← 0
2. For i from 0 to ℓ − 1 do
3. if k_i = 1 then R ← R + B
4. B ← 2·B
5. return R
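Algorithm 2 can be sketched generically, with the group operations passed in as functions. For a quick self-check we use the additive group of integers, where divisor addition and doubling become ordinary addition and multiplication by 2; the actual processor would substitute Jacobian point addition and doubling. Names are our own, not the paper's.

```python
def scalar_mul_rl(k, P, add, double, zero):
    """Right-to-left double-and-add (Algorithm 2), group ops injected."""
    B, R = P, zero
    while k:
        if k & 1:              # if k_i = 1 then R <- R + B
            R = add(R, B)
        B = double(B)          # B <- 2*B; independent of the add above,
        k >>= 1                # which is why hardware can run both in parallel
    return R

# Demo group: integers under addition, so k*P is ordinary multiplication.
assert scalar_mul_rl(13, 7, lambda a, b: a + b, lambda a: 2 * a, 0) == 91
```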

In the second implementation we consider a signed-digit representation of the integer k, written as k = Σ_{j=0}^{ℓ} s_j·2^j where s_j ∈ {−1, 0, 1}. This representation is referred to as the non-adjacent form (NAF) because


it results in a representation that has no adjacent non-zero digits. The NAF method should ultimately reduce the overall processing time, as the expected weight of the converted ℓ-bit integer is ℓ/3. For example, the integer 7, which is 111 in binary, can be computed as 4 + 2 + 1 or as 8 − 1. Here we use the left-to-right binary signed-digit recoding algorithm given in the work of Joye and Yen [29]. The algorithm for converting an integer k to NAF form is given as Algorithm 3.

Algorithm 3 Left-to-right conversion to NAF
INPUT: An integer k = Σ_{j=0}^{ℓ−1} k_j·2^j, k_j ∈ {0, 1}
OUTPUT: NAF k = Σ_{j=0}^{ℓ} s_j·2^j, s_j ∈ {−1, 0, 1}
1. c_ℓ ← 0; k_ℓ ← 0; k_{−1} ← 0; k_{−2} ← 0
2. For i from ℓ down to 0 do
3. c_{i−1} ← ⌊(c_i + k_{i−1} + k_{i−2})/2⌋
4. s_i ← −2·c_i + k_i + c_{i−1}
5. return (s_ℓ s_{ℓ−1} . . . s_0)

For the second implementation we use the NAF form of the integer k along with the LR double-and-add method. We refer to this as the left-to-right NAF method, given in Algorithm 4.

Algorithm 4 Left-to-right NAF method
INPUT: divisor P and a t-digit NAF vector s = Σ_{j=0}^{t−1} s_j·2^j, s_j ∈ {−1, 0, 1}, representing an integer
OUTPUT: R = k · P
1. R ← 0
2. For i from t − 1 down to 0 do
3. R ← 2·R
4. if s_i = 1 then R ← R + P
5. if s_i = −1 then R ← R − P
6. return R
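The recurrence of Algorithm 3 can be checked with a few lines of Python (an illustration of the recoding recurrence, not the hardware module). Bits and carries outside the range of k are treated as 0, and the returned list is indexed so that s[j] multiplies 2^j.

```python
def to_signed_digits(k):
    """Left-to-right signed-digit recoding of k (Algorithm 3, after [29])."""
    l = k.bit_length()
    kb = lambda i: (k >> i) & 1 if i >= 0 else 0   # k_i, with k_-1 = k_-2 = 0
    s = [0] * (l + 1)                              # digits s_l .. s_0
    c = 0                                          # c_l <- 0 (and k_l = 0)
    for i in range(l, -1, -1):
        c_next = (c + kb(i - 1) + kb(i - 2)) // 2  # c_{i-1}
        s[i] = -2 * c + kb(i) + c_next             # s_i
        c = c_next
    return s                                       # s[j] multiplies 2^j

# 7 = 111_2 is recoded as 8 - 1: (s_3, s_2, s_1, s_0) = (1, 0, 0, -1)
assert to_signed_digits(7) == [-1, 0, 0, 1]
```

Summing the telescoping terms −2c_i·2^i + c_{i−1}·2^i shows that the digits always reconstruct k, which is the property tested here.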

As stated earlier, the LR method cannot run the doubling and addition operations in parallel; however, it does have advantages over the RL method when using the NAF form of the integer k, for reasons we now make clear. Recall that the NAF form of the integer has digits s_j in the set {−1, 0, 1}. The RL double-and-add method in Algorithm 2 is adapted to the NAF representation of the integer k by performing a subtraction in place of an addition whenever a negative digit s_j is encountered. This subtraction translates into adding the inverse of a value. In the case of the RL method, the inverse of an intermediate divisor B must be calculated whenever a negative digit is encountered, because its value changes on every loop of the algorithm (refer to Algorithm 2). This can be a computationally expensive operation, as inversions are usually achieved via the extended Euclidean algorithm, and as a result the advantages gained from the NAF form may be absorbed when using the RL method. On the other hand, the LR method only requires the inverse of the input point P, as its structure is kept on each loop of the algorithm (refer to Algorithm 4). Since the base P is fixed, the inverse of P is also fixed and hence can be pre-computed beforehand, thereby eliminating the need for any divisor inversions in the algorithm. For this reason, we chose to implement the LR method when using the NAF representation of the integer k. In summary, we use the right-to-left binary expansion method for scalar multiplication in the first implementation, while the left-to-right NAF method is used for the second implementation. Note that in the usual case the HECC processor will take a divisor input, P, in affine coordinates and will output the result, k · P, in affine coordinates. Through conversions, the internal arithmetic required by the scalar multiplication algorithms described above may be performed in another coordinate system to achieve better results.
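Algorithm 4 with a precomputed −P can be sketched the same way as the RL method, again demonstrated on the additive group of integers (where negation is trivial); in the HECC processor, add/double would be divisor addition and doubling on the Jacobian. Names are illustrative only.

```python
def scalar_mul_lr(digits, P, negP, add, double, zero):
    """Left-to-right signed-digit double-and-add (Algorithm 4).
    digits[j] is the coefficient of 2^j; -P is supplied precomputed."""
    R = zero
    for d in reversed(digits):   # most significant digit first
        R = double(R)            # R <- 2*R
        if d == 1:
            R = add(R, P)        # R <- R + P
        elif d == -1:
            R = add(R, negP)     # R <- R - P, using the precomputed -P
    return R

# Demo group: integers under addition; 7 has signed digits [-1, 0, 0, 1].
add, dbl = (lambda a, b: a + b), (lambda a: 2 * a)
assert scalar_mul_lr([-1, 0, 0, 1], 5, -5, add, dbl, 0) == 35   # 7 * 5
```

Only the fixed negP ever enters a subtraction, mirroring the argument above for preferring LR over RL with NAF.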
The projective coordinate representation used in the point arithmetic for these implementations is described in the following section.


3.4. Level 2 – point arithmetic

As mentioned earlier, the point arithmetic required in the scalar multiplication algorithms of the previous section is defined by Cantor's algorithm, from which the explicit formula is derived. The explicit formula defines the point arithmetic solely in terms of the finite field arithmetic, thereby eliminating the need for the slower polynomial arithmetic level altogether. This ultimately speeds up operation time and makes HECC more comparable to ECC. In addition, we consider the use of projective coordinates for the point arithmetic, which eliminates the need for field inversions. This is advantageous, as field inversion has been shown to be quite expensive both in terms of speed and space. Recall that each element of the Jacobian can be represented by the pair of polynomials [u, v] (i.e. by a unique reduced divisor div[u, v]), which we will refer to as affine coordinates. As in the case of ECC, we can add an additional coordinate Z, where each element can now be represented as the triplet [U, V, Z], which we will refer to as projective coordinates. Conversion from affine coordinates into projective coordinates may be done as follows:

affine [u, v] ⇒ projective [U, V, 1]

By substituting this projective representation of the element into the explicit formula for affine coordinates, we are able to eliminate the denominators in the result of the point addition or doubling, such that the output can now be represented by the projective coordinates [U′, V′, Z′]. With this technique all field inversions can be completely eliminated in the explicit formula for point addition and doubling, at the cost of extra multiplications. In our first design we implement a projective-coordinate version of the explicit formula, where the point addition algorithm accepts two inputs in projective coordinates and outputs a result in projective coordinates [19].
In the second design we implement a mixed-coordinate version of the explicit formula, which takes one input in projective and the other in affine coordinates and outputs the result in projective coordinates [17], saving about seven multiplication operations over the purely projective version. Recall that in the second implementation we use the left-to-right NAF method, where the base point P, input in affine coordinates, is retained on each loop of the algorithm; therefore we need only add one point in projective and one in affine coordinates on each loop. Both versions use the same point doubling algorithm, where the input is almost always in projective representation [19]. It is important to note that the most frequent case occurs when each element is of weight two, where the result of adding or doubling is another weight-two divisor. In this case the pair of polynomials of each reduced divisor can be represented in affine coordinates as

[u, v] = [x^2 + u1·x + u0, v1·x + v0] : [u1, u0, v1, v0]

The corresponding representation in projective coordinates is given as [U1, U0, V1, V0, Z], which translates into the affine representation [x^2 + (U1/Z)x + U0/Z, (V1/Z)x + V0/Z]. If affine coordinates are required at the output of the HECC processor, then only one field inversion and four field multiplications are required at the end of the scalar multiplication computation. Since this work considers an implementation over the field F_2^n with n = 113, the probability that the frequent case (each element of weight two) occurs is approximately 1 − 1/q = 1 − 1/2^113. The probability that the non-frequent case will occur is therefore very low, and thus it is not considered for implementation in this paper. This is reasonable for embedded systems with constrained environments, because adding another algorithm for the sake of completeness may not be worth the extra space and processing power on the chip if it occurs with such low probability.
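The final conversion step, one inversion followed by four multiplications, can be illustrated in a few lines. For brevity this sketch works over a small prime field GF(p) of our own choosing rather than F_2^113; the operation count (one inversion, four multiplications) is identical in either field.

```python
# Sketch of the projective-to-affine conversion
# [U1, U0, V1, V0, Z] -> [U1/Z, U0/Z, V1/Z, V0/Z].

p = 2**13 - 1  # a small Mersenne prime, a stand-in for the real field

def to_affine(U1, U0, V1, V0, Z):
    zinv = pow(Z, -1, p)                                  # the single inversion
    return tuple(c * zinv % p for c in (U1, U0, V1, V0))  # four multiplications

U1, U0, V1, V0, Z = 11, 22, 33, 44, 5
aff = to_affine(U1, U0, V1, V0, Z)
# Consistency: scaling back by Z recovers the projective coordinates.
assert tuple(c * Z % p for c in aff) == (U1, U0, V1, V0)
```

Batching the division by Z through one shared inverse is exactly why the projective representation defers the expensive inversion to the very end of the scalar multiplication.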
If by chance this case occurs, then a change in protocol might be more efficient; for example, restarting the protocol when the non-frequent case has occurred, or avoiding values which lead to this case altogether.

3.5. Level 4 – field arithmetic over F_2^n

In this section we describe the binary arithmetic needed to perform operations over the finite field F_2^n, as well as list the algorithms used in these implementations. We require both finite field addition and


multiplication operations in the point arithmetic of the previous section. In addition, we require field inversion and field multiplication operations to convert the final result from projective coordinates to affine coordinates before it is output from the HECC system. Finite fields of characteristic 2 are a common choice for implementers for two reasons: the carry-free arithmetic, and the different representations associated with them. The most common representations used in implementations are the polynomial basis and normal basis representations. In both cases field elements are represented as binary vectors of dimension n with coefficients in F_2. For this implementation we use a polynomial basis, which takes the form (1, α, α^2, . . . , α^{n−1}), where α is a root of the irreducible polynomial F(x) of degree n with coefficients in F_2. The field is then realized as F_2[x]/F(x), where the arithmetic is that of polynomials of degree at most n − 1, modulo F(x). In less complicated terms, we are essentially working with polynomials of degree n − 1 or less with coefficients from the set {0, 1}. So in the field F_2^113 we are working with polynomials of the form

a0 + a1·x + a2·x^2 + · · · + a111·x^111 + a112·x^112

where the coefficients a_i are from the set {0, 1} and arithmetic is performed modulo F(x). The interested reader should refer to [30] for more detailed information about the various representations that can be used for elements of the finite field.

3.5.1. Field addition
The addition of two polynomials is simply the sum of the corresponding coefficients, which can be written as

Σ_{i=0}^{n−1} a_i·x^i + Σ_{i=0}^{n−1} b_i·x^i = Σ_{k=0}^{n−1} (a_k + b_k)·x^k

In a field of characteristic 2 all coefficients are reduced modulo 2, which translates into the bit-wise XOR of each pair of coefficients. In the case of F_2^113 this requires 113 XOR gates.

3.5.2. Field multiplication
The multiplication of two field polynomials is achieved using the standard grade-school method, modified to support the required modulo reduction. This is shown as Algorithm 5. Notice that the intermediate result is reduced in each loop of the algorithm so as to prevent it from becoming too large.

Algorithm 5 Bit-serial field multiplication
INPUT: a, b ∈ F_2^n, and reduction polynomial F
OUTPUT: c = a · b mod F(x)
1. c ← 0
2. for i from n − 1 down to 1
3. if (b_i = 1) then c ← (c + a) << 1 else c ← c << 1
4. if (shift carry = 1) then c ← c + f
5. if (b_0 = 1) then c ← c + a
6. return c
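Algorithm 5 can be mirrored in software and cross-checked against an independent multiply-then-reduce routine (a sketch with our own helper names; the XOR `c ^ a` is exactly the field addition described above). Elements of F_2^113 are encoded as integers with bit i holding the coefficient of x^i, and the shift-carry reduction is expressed by testing bit n after the shift.

```python
N = 113
F = (1 << 113) | (1 << 9) | 1          # F(x) = x^113 + x^9 + 1

def mul_bitserial(a, b):
    """MSB-first shift-and-add multiplication with interleaved reduction."""
    c = 0
    for i in range(N - 1, 0, -1):      # bits n-1 down to 1
        if (b >> i) & 1:
            c = (c ^ a) << 1           # c <- (c + a) << 1
        else:
            c <<= 1
        if (c >> N) & 1:               # "shift carry": degree-n term appeared
            c ^= F                     # reduce by F(x)
    if b & 1:                          # final bit b_0, no shift
        c ^= a
    return c

def mul_reference(a, b):
    """Plain carry-less multiply, then reduce; independent cross-check."""
    r = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            r ^= a << i
    while r.bit_length() > N:
        r ^= F << (r.bit_length() - 1 - N)
    return r

a = (1 << 100) | (1 << 9) | 1
b = (1 << 112) | (1 << 50) | (1 << 3)
assert mul_bitserial(a, b) == mul_reference(a, b)
assert mul_bitserial(a, 1) == a
```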

Another method, which can dramatically speed up the operation, is the use of a digit-serial multiplier, which can process the data in ⌈n/D⌉ clock cycles as opposed to the n clock cycles required in the purely bit-serial version of Algorithm 5. This method processes one digit at a time, where D represents the digit size in bits. In this implementation we use the LSD multiplier introduced in [31] with a digit size D of 4 bits, where the field multiplication of A(x) and B(x) can be expressed as

A(x)B(x) ≡ A(x)·(Σ_{i=0}^{k_D−1} B_i(x)·x^{Di}) mod F(x) ≡ (Σ_{i=0}^{k_D−1} B_i(x)·(A(x)·x^{Di} mod F(x))) mod F(x)


In the above equation B(x) is expressed in k_D digits (1 ≤ k_D ≤ ⌈n/D⌉) by B = Σ_{i=0}^{k_D−1} B_i·x^{Di}, where B_i = Σ_{j=0}^{D−1} b_{Di+j}·x^j. This is given as Algorithm 6.

Algorithm 6 Digit-serial/parallel multiplier (LSD)
INPUT: A = Σ_{i=0}^{n−1} a_i·x^i, B = Σ_{i=0}^{k_D−1} B_i·x^{Di}, where B_i = Σ_{j=0}^{D−1} b_{Di+j}·x^j, and reduction polynomial F
OUTPUT: C = A · B mod F(x)
1. C ← 0
2. For i from 1 to k_D
3. C ← (A · B_{i−1}) + C
4. A ← A · x^D mod F(x)
5. C ← C mod F(x)
6. return C
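A software sketch of Algorithm 6 with D = 4 (our own helper names, cross-checked against a plain multiply-then-reduce) makes the least-significant-digit-first structure concrete: each iteration consumes one 4-bit digit of B while A is advanced by x^D mod F(x).

```python
N, D = 113, 4
F = (1 << 113) | (1 << 9) | 1                  # F(x) = x^113 + x^9 + 1
KD = -(-N // D)                                # k_D = ceil(n/D) digits

def clmul(a, b):                               # carry-less multiplication
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = a << 1, b >> 1
    return r

def reduce(r):                                 # r mod F(x)
    while r.bit_length() > N:
        r ^= F << (r.bit_length() - 1 - N)
    return r

def mul_lsd(a, b):
    """LSD digit-serial multiplication (Algorithm 6), digit size D."""
    c = 0
    for i in range(KD):
        digit = (b >> (D * i)) & ((1 << D) - 1)  # B_i, least digit first
        c ^= clmul(a, digit)                     # C <- (A * B_i) + C
        a = reduce(a << D)                       # A <- A * x^D mod F(x)
    return reduce(c)                             # final C <- C mod F(x)

x = (1 << 100) | (1 << 9) | 1
y = (1 << 112) | (1 << 50) | (1 << 3)
assert mul_lsd(x, y) == reduce(clmul(x, y))
assert mul_lsd(x, 1) == x
```

Note that the accumulator C stays only slightly wider than n bits (degree at most n + D − 2 before the final reduction), which is what keeps the hardware datapath small.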

3.5.3. Field inversion
For field inversion we use a modified version of the extended Euclidean algorithm which uses bit shifting with XOR. This is given in Algorithm 7.

Algorithm 7 Field inversion
INPUT: a ∈ F_2^n, and reduction polynomial F
OUTPUT: b = a^−1 mod F(x)
1. b ← 1, c ← 0, u ← a, v ← F
2. While deg(u) ≠ 0
3. j ← deg(u) − deg(v)
4. if (j < 0) then swap u ↔ v and b ↔ c, and set j ← −j
5. u ← u + (v << j), b ← b + (c << j)
6. return b

Please note that for all the finite field arithmetic mentioned above we used the irreducible trinomial F(x) = x^113 + x^9 + 1 as our modular reduction polynomial. In fact, to speed up operations, the digit-serial/parallel multiplier in Algorithm 6 was optimized for this specific irreducible polynomial.

4. Processor architecture

In this section we introduce two architectures of a HECC processor corresponding to the two implementations discussed in earlier sections. To remind the reader, the first implementation uses the right-to-left binary expansion method (unsigned expansion {0, 1}) for scalar multiplication together with the projective-coordinate version of the explicit formula for the point arithmetic. In this implementation the reduced divisor P is converted from affine coordinates to projective coordinates, as all operations are performed internally in the latter representation. The second implementation uses the left-to-right NAF method (signed expansion {−1, 0, 1}) for scalar multiplication and the mixed-coordinate version of the explicit formula for the point arithmetic. In this implementation the base point P can be kept in affine coordinates for two reasons: the first is the use of a mixed-coordinate version of the explicit formula, which accepts one point in affine and one point in projective coordinates and outputs the result in projective coordinates, while the second is the serial nature of the point addition and point doubling operations, as opposed to the parallel nature of the binary expansion method.
Note that both implementations use the same field arithmetic modules discussed in the previous section.

4.1. Binary expansion method

Recall that the mathematical operation we are ultimately trying to achieve is the scalar multiplication of a group element P by an integer k. The HECC processor will compute k · P using three basic modules: the


Fig. 2. HECC processor architecture for the binary expansion method.

main controller (MC), the point arithmetic controller (PAC) and the field arithmetic unit (FAU). This is illustrated in Fig. 2. The MC is responsible for coordinating the scalar multiplication k · P according to the binary expansion method given as Algorithm 2. The PAC is responsible for coordinating the point arithmetic, namely point addition and point doubling, according to Lange's explicit formula for projective coordinates [19]. Lastly, the FAU is responsible for performing the finite field arithmetic, including field inversion, multiplication and addition. The sequence of operations on a typical call to this processor is described in the following paragraph. The processor will compute k · P as follows. The host sends a command to the MC to indicate that data has been loaded on the input parallel lines and is ready for processing. This data includes the integer k and the reduced divisor P in affine coordinates. Upon receiving the data, the MC converts the point P from affine coordinates [U1, U0, V1, V0] to projective coordinates [U1, U0, V1, V0, Z] by adding an extra coordinate Z. Following conversion, the MC coordinates the binary expansion algorithm and, as required, passes point arithmetic commands and intermediate data to the PAC. When the PAC receives a command from the MC it performs the necessary point arithmetic operations on the intermediate data and sends commands to the FAU to perform any needed field arithmetic, namely field multiplication or field addition. Whenever the FAU and the PAC have completed their respective operations, they send a status signal to the calling module to indicate that the operation has been performed and that the result is ready to be accessed.
After completing the scalar multiplication k · P, the MC will conclude by converting the final data from its projective coordinate representation [U1, U0, V1, V0, Z] back to affine coordinates [x^2 + (U1/Z)x + U0/Z, (V1/Z)x + V0/Z] by calling the FAU to perform one field inversion and four field multiplications. A status signal will then be sent from the MC to the host indicating successful completion of the scalar multiplication operation.

As mentioned earlier, the FAU is responsible for performing the finite field arithmetic, which forms the heart of the HECC processor. Since this first implementation uses the right-to-left binary expansion method, the point adder and doubler work in parallel, and hence separate field modules for multiplication and addition are required for each of them. As well, there is a separate field inversion module and four field multipliers to perform the final output conversion from projective coordinates to affine coordinates. Since the final conversion does not happen in parallel with the point arithmetic operations, we could potentially reuse the field modules from the point arithmetic; however, sharing modules could complicate the routing, and thus in this implementation we have dedicated field modules for conversion. The field modules are organized as shown in Fig. 3.

4.2. NAF method

In the second architecture the MC coordinates the scalar multiplication k · P, but this time according to the left-to-right NAF method which was given as Algorithm 4; the PAC coordinates the point arithmetic


Fig. 3. Organization of field modules in the FAU.

according to Lange’s explicit formula for mixed coordinates [17], and finally the FAU performs the finite field arithmetic. The sequence of operations on a typical call to the processor in this architecture is similar to that of the previous architecture, and thus only the differences will be discussed in the next paragraph.

Upon receiving the integer k and the group element P, the MC will call another module to perform the scalar conversion of the integer k to a signed-digit representation (NAF) according to Algorithm 3. Because both the NAF conversion algorithm (Algorithm 3) and the NAF scalar multiplication algorithm (Algorithm 4) work from left to right, we can run the MC and the NAF conversion module in parallel, each processing one bit at a time. For example, the NAF conversion module will process the first bit of the integer k, call it k_i, and will produce the corresponding signed digit s_i, which it then passes to the MC. While the MC processes the current signed digit s_i, the NAF conversion module will start processing the next bit k_j, producing the signed digit s_j. This is repeated until all bits of k have been processed. The remaining operations are performed as in the first architecture for the binary expansion method, except that the PAC uses the mixed coordinate version of the explicit formula, thereby maintaining the affine coordinate representation of the point P. Please refer to Fig. 4.

The organization of the field modules in the FAU is the same as that used in the first architecture, although a more efficient implementation may consider sharing some modules between the point additions

Fig. 4. HECC processor architecture for the NAF method.


and doubling operations, since they now work in a serial fashion as opposed to the parallel nature of the first implementation.

5. Implementation

The HECC processors were implemented on an FPGA. An FPGA is a device composed of a number of programmable logic blocks that are connected to programmable switching matrices. The device can be configured to perform a certain function by programming the switching matrices to route signals between the individual logic blocks. In the case of a Xilinx FPGA these logic blocks are referred to as CLBs (configurable logic blocks), where each CLB contains four slices and two 3-state buffers.

5.1. Design methodology

The HECC processor architectures were described using a popular hardware description language known as Verilog. The Xilinx ISE Environment 5.2i was then used to synthesize and implement the logic design for a Xilinx Virtex II FPGA (xc2v8000-5ff1152), with synthesis options set to HIGH optimization for area. In addition, the Modelsim simulator was used to ensure the correctness of the design.

5.2. Assumptions and considerations

Recall the equation for the hyperelliptic curve, which is given as

C: y^2 + h(x)y = f(x)

where, for a genus 2 curve,

h(x) = h2 x^2 + h1 x + h0

and

f(x) = x^5 + f4 x^4 + f3 x^3 + f2 x^2 + f1 x + f0

The point arithmetic is accomplished via the explicit formula, which depends on the coefficients of the polynomials h(x) and f(x) of the curve. As such, we consider simple transformations which can be done on the hyperelliptic curve so as to ease calculations and make the overall implementation more time efficient. Since we are dealing with the field F_2^113, we only consider manipulations for even characteristic. The transformations pertaining to our implementations are as follows:

• For cost estimation we assume f4 = 0, as this can be achieved for p ≠ 5 by the substitution x → x + f4/5. Hence multiplication by the coefficient f4 is not included.
• For cost estimation we assume h2 ∈ {0, 1}, as this can be achieved for non-zero h2 by substituting y → h2^5 y and x → h2^2 x and dividing the equation by h2^10, leading to h2 = 1.
• For cost estimation we assume f3 = 0, as this can be achieved by the transformation y → y + f3 x (which changes f2 to f2 + h1 f3 + f3^2) when deg h = 2, or by the transformation y → y + (f3/h1) x^2 when deg h = 1. Hence multiplication by the coefficient f3 is not included.

Please note that similar assumptions regarding cost estimates have been used in other works (see, for example, [23]).

5.3. Architectures for the field arithmetic

There are basically two types of implementation structures to consider here: bit-parallel and bit-serial structures. A bit-parallel structure refers to a purely combinatorial implementation which processes all bits of the input at the same time, thereby obtaining the result in ideally one clock cycle. A bit-serial structure, on the other hand, processes 1 bit of the input at each clock cycle, thereby taking multiple clock cycles to obtain the result. These two structures may be compared in terms of three measuring parameters, namely speed, area and the number of clock cycles to complete the operation. Speed refers to the maximum circuit delay, while area refers to the number of slices.
In terms of speed and clock cycles, the parallel structure is normally better than the serial structure; in terms of area, however, the serial structure is normally more space efficient than the parallel structure. Obviously there are inherent speed–area–clock-cycle tradeoffs between these two structures that must be considered. In light of these tradeoffs, we also consider a mix of the bit-parallel and bit-serial structures, which we refer to as a digit-serial/parallel structure. This structure processes multiple bits of the input, which we refer to as a digit, in one clock cycle (bit-parallel), where each digit of the input is taken in serially. This structure takes advantage of the space efficiency of the bit-serial structure and the speed of the bit-parallel structure.

For field addition we chose the bit-parallel structure, because addition is used often in the algorithm and because it only requires 113 XOR gates running in parallel over F_2^113, which does not consume a tremendous amount of space. Since the bottleneck of the system is the field multiplication operation, we chose the digit-serial/parallel structure (Algorithm 6) for a more time-efficient implementation. The bit-serial structure (Algorithm 5), on the other hand, proved to be very inefficient because of the time it took to complete the operation, while the bit-parallel structure would cause the system to explode in area. For this implementation we chose a digit size of 4 bits. The LSD multiplier architecture is shown in Fig. 5. In the LSD multiplier a reduction algorithm is used to compute a value in two places: (1) to compute the intermediate value A_(i-1) x^D mod F(x); (2) to compute the final output. Since the reduction algorithm is used in different clock cycles for these two cases, we could use a single reduction function; however, the results achieved with two reduction functions were better overall. The reason is that a shared circuit can complicate the routing done by the synthesis tool; this was tested and, as expected, the shared version gave worse results.

5.4. Results

In this section we list the performance results for the finite field arithmetic (level 4), the point arithmetic based on Lange’s explicit formula for projective coordinates and mixed coordinates (level 2), and finally the scalar multiplication operation using both the binary expansion method and the NAF method (level 1).

Fig. 5. LSD architecture [32].
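In software terms, the LSD (least-significant-digit-first) multiplier consumes one D-bit digit of operand a per clock cycle, accumulates the digit-times-b partial product bit-parallel, and updates b ← b·x^D mod F(x) between digits. The sketch below models this behaviour; it is illustrative only, and since the text does not state the reduction polynomial used for F_2^113, it is demonstrated on the small field F_2^8 with the polynomial x^8 + x^4 + x^3 + x + 1. Field elements are integers whose bits are polynomial coefficients, so field addition is plain XOR (the bit-parallel adder of Section 5.3 is exactly 113 such XORs in parallel).

```python
def gf2_mul_lsd(a, b, f, m, D=4):
    """Digit-serial (LSD-first) multiplication in GF(2^m) modulo the
    irreducible polynomial f (bit i of f = coefficient of x^i).
    Each loop iteration mirrors one clock cycle of the datapath:
    consume a D-bit digit of a, accumulate, then shift/reduce b."""
    def reduce(x):
        # polynomial reduction mod f, clearing bits from the top down
        for i in range(x.bit_length() - 1, m - 1, -1):
            if (x >> i) & 1:
                x ^= f << (i - m)
        return x

    acc = 0
    while a:
        digit = a & ((1 << D) - 1)      # next D coefficients of a (LSB first)
        for j in range(D):              # bit-parallel partial product digit * b
            if (digit >> j) & 1:
                acc ^= b << j           # field addition is XOR
        a >>= D
        b = reduce(b << D)              # b <- b * x^D mod f
    return reduce(acc)                  # final reduction of the accumulator
```

Note the sketch reuses one `reduce` helper for both the per-digit update and the final output, whereas the hardware instantiates two reduction circuits because, as noted above, the shared version routed worse.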


Table 1
Performance results for the finite field arithmetic over F_2^113 (level 4)

Field arithmetic     Clock cycles   Slices   Maximum frequency
Multiplier (D = 1)   114            387      100 MHz
Multiplier (D = 4)   31             645      73 MHz
Inversion            395 (avg)      1822     100 MHz
Addition             1              113      max combinatorial delay: 8.445 ns
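The inversion entry above is quoted as an average because inversion time is data dependent. The text does not spell out which inversion algorithm the module implements; as an illustration of why the cycle count varies, here is an extended Euclidean inversion over F_2^m (shown, as an assumption, on the small field F_2^8 with the polynomial x^8 + x^4 + x^3 + x + 1), whose iteration count depends on the operand:

```python
def gf2_inv(a, f, m):
    """Inversion in GF(2^m) via the extended Euclidean algorithm over
    GF(2)[x].  Invariants: a*g1 = u (mod f) and a*g2 = v (mod f),
    so when u reaches 1, g1 is the inverse of a.  The number of loop
    iterations is data dependent, which is why a hardware module of
    this kind reports an *average* cycle count."""
    u, v = a, f
    g1, g2 = 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:                 # keep deg(u) >= deg(v)
            u, v = v, u
            g1, g2 = g2, g1
            j = -j
        u ^= v << j               # cancel the leading term of u
        g1 ^= g2 << j             # mirror the update on the cofactor
    return g1
```

For example, in the F_2^8 test field this recovers the known inverse pair 0x53 and 0xCA.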

Table 2
Performance results for the point arithmetic (level 2)

Point arithmetic                            Clock cycles   Slices   Maximum frequency (MHz)   Time (µs)
Addition (D = 1) – projective coordinates   1612           9514     48                        35.8
Addition (D = 1) – mixed coordinates        1496           8437     48                        33.2
Doubling (D = 1)                            1382           9058     50.2                      30.7
Addition (D = 4) – projective coordinates   450            10 988   48                        10.0
Addition (D = 4) – mixed coordinates        417            9652     48                        9.27
Doubling (D = 4)                            386            10 087   50.3                      8.58

Table 3
Performance results for the scalar multiplication (level 1)

Arithmetic    Clock cycles   Slices   Maximum frequency (MHz)   Time (ms)
Bin (D = 1)   339 057        22 183   45.6                      7.53
NAF (D = 1)   332 913        21 550   45.6                      7.39
Bin (D = 4)   95 286         25 911   46.7                      2.12
NAF (D = 4)   91 606         25 271   45.3                      2.03

In Table 1 the performance results for the finite field arithmetic over F_2^113 are given. Note that both the bit-serial (D = 1) and the LSD digit-serial/parallel (D = 4) multipliers were implemented. In Table 2 we list the point arithmetic results for two cases: one based on Lange’s explicit formula for projective coordinates [19] and the second based on Lange’s explicit formula for mixed coordinates [17]. For each case we give the results using both the bit-serial multiplier (D = 1) and the LSD multiplier (D = 4). Examining the achievable frequencies of each module, we choose a common frequency of 45 MHz to calculate the total operation time. In Table 3 we list the results for scalar multiplication using the binary expansion method and the NAF method. For each case we again implemented both the D = 1 and D = 4 field multipliers. On average, for an l-bit integer k there will be l point doublings and l/2 point additions in the binary expansion method. For comparison purposes we use the same value of k in the NAF method as in the binary expansion method to achieve average results. From Table 3 we can see that the time to perform the scalar multiplication operation, for both the binary expansion and the NAF method, using the field multiplier with D = 4 is significantly faster than the implementation with D = 1, with the tradeoff being an increase in area. Since the increase in space requirements is reasonable for this implementation, we chose the field multiplier with D = 4 for our final implementation, providing a much faster HECC system. With regard to the operation time of the binary expansion versus the NAF method, on average the left-to-right NAF method performs only slightly faster than the right-to-left binary expansion. However, this really depends on the value of the integer k and its corresponding value under NAF conversion.
The average for the NAF method was found by calculating the operation time for various values of the integer k having l point doublings and l/2 point additions in the original binary expansion of the l-bit integer k.
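As a quick sanity check, the times in Table 3 follow directly from the cycle counts at the common 45 MHz figure chosen above (small discrepancies in the last digit suggest the table truncates rather than rounds):

```python
# Reproduce the Table 3 operation times: time = cycles / f_clk,
# using the common 45 MHz frequency chosen for the total operation time.
F_CLK = 45e6  # Hz

def op_time_ms(cycles):
    return 1e3 * cycles / F_CLK

for name, cycles in [("Bin (D=1)", 339_057), ("NAF (D=1)", 332_913),
                     ("Bin (D=4)", 95_286), ("NAF (D=4)", 91_606)]:
    print(f"{name}: {op_time_ms(cycles):.2f} ms")
# prints 7.53, 7.40, 2.12 and 2.04 ms, matching Table 3 to within rounding
```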


Table 4
Performance results for HECC implementations over genus 2

Reference      SW/HW   Platform                                   Fields               Scalar multiplication time (ms)
[2]            SW      Pentium @ 100 MHz                          F_2^64               520
[9]            HW      Xilinx Virtex II FPGA                      F_2^83               10
                                                                  F_2^97               14
                                                                  F_2^113              19
                                                                  F_2^131              26
                                                                  F_2^140              33
                                                                  F_2^163              40
[13]           SW      Pentium III @ 866 MHz                      186-bit OEF          1.98
[15]           SW      Pentium III @ 866 MHz                      186-bit OEF          1.69
[16]           SW      Pentium IV @ 1.5 GHz                       F_2^80               18.875
                                                                  F_2^95               25.215
                                                                  F_p (log2 p = 160)   5.663
                                                                  F_p (log2 p = 180)   8.162
[26]           HW      SmartCard FameXE coprocessor               F_2^90               30
[24]           HW      PowerPC @ 50 MHz embedded microprocessor   F_2^80               117
                                                                  F_2^85               121.2
                                                                  F_2^90               138.1
                                                                  F_2^95               141.7
[25]           HW      ARM7 @ 80 MHz embedded microprocessor      F_2^63               48.35
                                                                  F_2^81               69.06
                                                                  F_2^83               71.56
                                                                  F_2^88               77.19
                                                                  F_2^91               81.11
                                                                  F_2^95               85.74
HECC (D = 4)   HW      Xilinx Virtex II FPGA                      F_2^113              2.03

5.5. Performance comparisons

In this section we summarize the timing results for the scalar multiplication operation in various HECC systems working over a curve of genus 2 and various fields. Please note that the first two entries in Table 4 [2,9] implemented the point arithmetic using Cantor’s algorithm, while the remaining entries used the explicit formula. Remarks:
• Only the fastest implementation (PowerPC) for entry [24] is listed.
• In [9] the author implements scalar multiplication using field multipliers with D = 1 and D = 4. Only the results for D = 1 are listed in this table, as the results for D = 4, as stated by the author, required an unreasonable amount of area. For example, his implementation with the D = 4 field multiplier over F_2^113 explodes to 81 000 slices; the largest Virtex II FPGA has 46 592 slices, and thus this implementation would exceed the available resources.

Entry [9] is the first, and prior to this work the only, complete implementation of a HECC processor on an FPGA. It operates over the field F_2^113 and takes 19 ms to complete a scalar multiplication. Our implementation over the field F_2^113, based on the explicit formula using projective coordinates, takes 2.03 ms, which is approximately nine times faster. In addition, that implementation over F_2^113 consumes 29 000 slices, whereas ours takes only 25 911 slices. From this analysis we can see that the results achievable via the explicit formula are far better than those achievable via Cantor’s algorithm. As explained earlier, this


is due to the fact that the 4-level hierarchy shown in Fig. 1 is reduced to 3 levels via the explicit formula. We may also note that Clancy’s implementation works only for deg H(u) = 1, but for any F(u). In [26], the author implemented the HECC processor on a smartcard over the field F_2^90; to our knowledge, this is the only other implementation of Lange’s explicit formula for mixed coordinates. Our implementation over the larger field F_2^113 is 14 times faster than those results.

6. Conclusion

In this work we implemented a high-performance HECC processor ideal for embedded systems with limited resources. This processor, which is based on an explicit formula using projective coordinates, has yielded very fast results compared with previous work and is, to our knowledge, the fastest implementation of HECC in hardware. This implementation’s better overall performance is attributed to two main factors: first, the explicit formula allowed us to avoid one level of slower arithmetic; second, the projective representation of points allowed us to avoid any expensive field inversions in the point arithmetic. In addition, the use of a semi-customizable hardware chip as a platform allows one to run many operations in parallel without heavy access delay times. This inherent speed in FPGAs provides a fast and efficient platform for performing the operations needed in these types of systems. As mentioned earlier, some assumptions were made when calculating the computational costs of the implementation, namely f4 = f3 = 0 and h2 ∈ {0, 1}. This does not limit the system, however, as the curve can always be brought to this form by a simple transformation of variables, and as already seen this may be worth the reduction in operation time.
In addition, we also fixed the curve parameters in the design, as implementations in constrained environments are usually based on one cryptographic algorithm with fixed parameters, which can be essential for speed performance and power consumption. We are currently working on implementing the HECC processor in this paper over fields of various sizes for comparison, and investigating further avenues that could possibly speed up operations on the FPGA.

Acknowledgements

The authors specially thank Tanja Lange and Thomas Wollinger for proofreading and providing useful feedback on this work. This work was partially supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

[1] Koblitz N. Algebraic aspects of cryptography. 1st ed. Berlin, Germany: Springer-Verlag; 1998.
[2] Krieger U. ‘‘signature.c.’’ Master’s thesis, Mathematik und Informatik, Universitat Essen, Fachbereich 6, Essen, Germany; 1997.
[3] Sakai Y, Sakurai K. Design of hyperelliptic cryptosystems in small characteristic and a software implementation over F_2^n. In: Advances in cryptology (ASIACRYPT). Lecture notes in computer science, vol. 1514. Berlin, Germany: Springer-Verlag; 1998. p. 80–94.
[4] Sakai Y, Sakurai K, Ishizuka H. Secure hyperelliptic cryptosystems and their performance. In: Public key cryptography. Lecture notes in computer science, vol. 1431. Berlin, Germany: Springer-Verlag; 1998. p. 164–81.
[5] Sakai Y, Sakurai K. On the practical performance of hyperelliptic curve cryptosystems in software implementation. IEICE Trans Fundam Electron Commun Comput Sci 2000;E83-A(4):692–703.
[6] Wollinger T. Computer architectures for cryptosystems based on hyperelliptic curves. Master’s thesis, Worcester, MA, USA: ECE Department, Worcester Polytechnic Institute; 2001.
[7] Wollinger T, Paar C. Hardware architectures proposed for cryptosystems based on hyperelliptic curves. In: Proceedings of the ninth IEEE international conference on electronics, circuits and systems (ICECS), vol. III. 2002. p. 1159–63.
[8] Clancy T. Analysis of FPGA-based hyperelliptic curve cryptosystems. Master’s thesis, Urbana-Champaign, Illinois: University of Illinois; 2002.
[9] Clancy T. FPGA-based hyperelliptic curve cryptosystems. Coordinated Science Laboratory, University of Illinois, Urbana-Champaign, Illinois, Technical Report; 2003.
[10] Cantor D. Computing in the Jacobian of a hyperelliptic curve. Math Comput 1987;48(177):95–101.


[11] Spallek AM. Kurven vom Geschlecht 2 und ihre Anwendung in Public-Key-Kryptosystemen. Ph.D. dissertation, Universitat Gesamthochschule Essen; 1994.
[12] Harley R. Fast arithmetic on genus two curves; 2000.
[13] Matsuo K, Chao J, Tsujii S. Fast genus two hyperelliptic curve cryptosystems. IEICE Japan, Technical Report ISEC2001-31; July 2001.
[14] Kuroki J, Gonda M, Matsuo K, Chao J, Tsujii S. Fast genus three hyperelliptic curve cryptosystems. In: Proceedings of the 2002 symposium on cryptography and information security (SCIS), January 2002. p. 501–7.
[15] Miyamoto Y, Doi H, Matsuo K, Chao J, Tsujii S. A fast addition algorithm of genus two hyperelliptic curve. In: Proceedings of the 2002 symposium on cryptography and information security (SCIS), Japan, January 2002. p. 497–502 [in Japanese].
[16] Lange T. Efficient arithmetic on genus 2 hyperelliptic curves over finite fields via explicit formulae. Cryptology ePrint Archive, Technical Report 2002/121; 2002.
[17] Lange T. Inversion-free arithmetic on genus 2 hyperelliptic curves. Cryptology ePrint Archive, Technical Report 2002/147; 2002.
[18] Lange T. Weighted coordinates on genus 2 hyperelliptic curves. Cryptology ePrint Archive, Technical Report 2002/153; 2002.
[19] Lange T. Formulae for arithmetic on genus 2 hyperelliptic curves; 2003.
[20] Pelzl J, Wollinger T, Paar C. High performance arithmetic for hyperelliptic curve cryptosystems of genus two. Cryptology ePrint Archive, Technical Report 2003/212; 2003.
[21] Pelzl J, Wollinger T, Paar C. Low cost security: explicit formulae for genus-4 hyperelliptic curves. In: Matsui M, Zuccherato R, editors. Selected areas in cryptography (SAC). Lecture notes in computer science, vol. 3006. Springer-Verlag; 2003. p. 1–16.
[22] Pelzl J, Wollinger T, Guajardo J, Paar C. Hyperelliptic curve cryptosystems: closing the performance gap to elliptic curves. In: Walter CD, Koc CK, Paar C, editors. Cryptographic hardware and embedded systems (CHES). Lecture notes in computer science, vol. 2779. Springer-Verlag; 2003. p. 349–65.
[23] Pelzl J. Hyperelliptic cryptosystems on embedded microprocessors. Master’s thesis, Department of Electrical Engineering and Information Sciences, Ruhr-Universitaet Bochum, Bochum, Germany; 2002.
[24] Wollinger T, Pelzl J, Wittelsberger V, Paar C, Saldamli G, Koc CK. Elliptic & hyperelliptic curves on embedded μP. ACM Trans Embedded Comput Sys (TECS) 2004;3(3):509–33.
[25] Pelzl J, Wollinger T, Paar C. High performance arithmetic for special hyperelliptic curve cryptosystems of genus two. In: Proceedings of the international conference on information technology: coding and computing (ITCC), April 2004. IEEE Computer Society; 2004.
[26] Nguyen K. Curve based cryptography – the state of the art in smart card environments. Cryptology Competence Center, Business Unit Identification, Philips Semiconductors GmbH; 2002.
[27] Elias G, Miri A, Yeap T. High-performance, FPGA-based hyperelliptic curve cryptosystems. In: Proceedings of the 22nd biennial symposium on communications. Kingston, Ontario, Canada: Queens University; 2004.
[28] Gaudry P. An algorithm for solving the discrete log problem on hyperelliptic curves. In: Advances in cryptology. Lecture notes in computer science, vol. 1807. Berlin, Germany: Springer-Verlag; 2000. p. 19–34.
[29] Joye M, Yen S-M. Optimal left-to-right binary signed-digit recoding. IEEE Trans Comput 2000;49(7):740–8.
[30] Blake I, Seroussi G, Smart N. Elliptic curves in cryptography. 1st ed. Cambridge University Press; 1999.
[31] Song L, Parhi K. Low-energy digit-serial/parallel finite field multipliers. J VHDL Signal Process 1998;19:149–66.
[32] Orlando G, Paar C. A high performance reconfigurable elliptic curve processor for GF(2^m). In: Cryptographic hardware and embedded systems (CHES). Lecture notes in computer science, vol. 1965. Springer-Verlag; 2000. p. 41–56.
[33] Sklavos N, Koufopavlou O. Mobile communications world: security implementations aspects – a state of the art. CSJM J, Inst Math Comp Sci 2003;11(2(32)):168–87.
[34] Hodjat A, Batina L, Hwang D, Verbauwhede I. HW/SW co-design of a hyperelliptic curve cryptosystem using a microcode instruction set coprocessor. Integration, The VLSI Journal 2007;40(1):45–51.
[35] Batina L, Hwang D, Hodjat A, Preneel B, Verbauwhede I. Hardware/software co-design for hyperelliptic curve cryptography (HECC) on the 8051 microprocessor. In: Rao JR, Sunar B, editors. Cryptographic hardware and embedded systems – CHES 2005. Lecture notes in computer science, vol. 3659. Springer-Verlag; 2005. p. 106–18.

Grace Elias received her B.A.Sc. and M.A.Sc. in electrical engineering from the University of Ottawa, Ottawa, Canada, in 2001 and 2004, respectively. She is currently with Advanced Micro Devices (AMD). Her research interests include efficient arithmetic for cryptographic algorithms and their hardware implementations.


Ali Miri is an Associate Professor at the School of Information Technology and Engineering, and the director of the Computational Laboratory in Coding and Cryptography at the University of Ottawa, Ottawa, Canada. His research interests include applied cryptography, digital communication, distributed systems, and mobile computing.

Tet-Hin Yeap is an Associate Professor at the School of Information Technology and Engineering, and the director of the Bell Canada Advanced Research Laboratory at the University of Ottawa. His research interests include broadband access architecture, neural networks, multimedia, parallel architecture, and dynamics and control.