An efficient and high-speed VLSI implementation of optimal normal basis multiplication over GF(2^m)


INTEGRATION, the VLSI journal (journal homepage: www.elsevier.com/locate/vlsi)

Bahram Rashidi a,*, Sayed Masoud Sayedi a, Reza Rezaeian Farashahi b,c

a Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran
b Department of Mathematical Sciences, Isfahan University of Technology, Isfahan 84156-83111, Iran
c School of Mathematics, Institute for Research in Fundamental Sciences (IPM), P.O. Box 19395-5746, Tehran, Iran

Article info

Article history: Received 27 October 2015; received in revised form 1 May 2016; accepted 19 May 2016

Keywords: Cryptography; Logical effort; Finite fields; Optimal normal basis multiplication; VLSI design

Abstract

Finite field multiplication is one of the most important operations in finite field arithmetic and is the main building block that determines the overall speed and area of public key cryptosystems. In this work, efficient and high-speed VLSI implementations of bit-serial, digit-serial and bit-parallel optimal normal basis multipliers with parallel-input serial-output (PISO) and parallel-input parallel-output (PIPO) structures are presented. Two general multipliers, namely Massey–Omura (MO) and Reyhani-Masoleh–Hasan (RMH), are considered as case studies. These multipliers are constructed from AND, XOR–AND and XOR tree components. In the MO multiplier, the row of AND gates is implemented with inverters and NOR gates to obtain strong input signals. The XOR–AND component in the RMH structure is implemented using a new low-cost structure. The XOR tree in both multipliers consists of a large number of logic stages and many inputs; therefore, to decrease the delay and increase the drive ability of the circuit for different loads, the logical effort method is employed for sizing the transistors. The multipliers are first designed for different load capacitances using different structures and different numbers of stages. Then, using the logical effort method and a newly proposed 4-input XOR gate structure, the circuits are modified to achieve minimum delay. Using 0.18 μm CMOS technology, the bit-serial, digit-serial and bit-parallel structures with type-1 and type-2 optimal normal bases are implemented over the finite fields GF(2^226) and GF(2^233), respectively. The results show that the proposed structures have better delay and area characteristics than previous designs. © 2016 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail addresses: [email protected] (B. Rashidi), [email protected] (S.M. Sayedi), [email protected] (R.R. Farashahi).

http://dx.doi.org/10.1016/j.vlsi.2016.05.006
0167-9260/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: B. Rashidi, et al., An efficient and high-speed VLSI implementation of optimal normal basis multiplication over GF(2^m), INTEGRATION, the VLSI journal (2016), http://dx.doi.org/10.1016/j.vlsi.2016.05.006

1. Introduction

Finite field operations such as field multiplication, field squaring and field inversion are the main operations in public key cryptosystems such as Elliptic Curve Cryptography (ECC). Cryptographic applications involve many large and complex finite field circuits; therefore, efficient implementation of these circuit blocks is a key factor in building high-performance and high-speed cryptosystems. Operand sizes in cryptosystems are large; hence the critical path delays of the related finite field arithmetic circuits are also long. Different hardware design approaches that optimize the circuits for lower delay are presented in many works.

Several implementations of normal basis multipliers are presented in [1–14]. In [2] a novel sequential type-1 optimal normal basis multiplier in the binary finite field GF(2^m) with a highly regular, modular and expandable structure is presented. In [3] the Massey–Omura multiplier in the finite field GF(2^m) is realized by a pipeline structure. In [4] an improved and fast architecture of the Massey–Omura multiplier is presented; the area and power consumption of the circuit are reduced at circuit level, and the design is also more regular at architecture level. In [5,6] two multiplier structures over GF(2^m), called "AND efficient sequential multipliers with parallel output" (AESMPO) and "XOR efficient sequential multipliers with parallel output" (XESMPO), are proposed. These two multipliers work in sequential mode, and the final result is obtained after m clock cycles. Reyhani-Masoleh and Hasan [7] propose a new architecture for the normal basis parallel multiplier that is applicable to finite fields of arbitrary size. Compared to the original Massey–Omura parallel multiplier, it has lower circuit complexity. A multiplexer-based algorithm for normal basis multiplication is presented in [8]; here, multiplexers and XOR gates are employed instead of AND and XOR gates. In [9] a new modified Booth's algorithm for the normal basis multiplier is presented. The proposed architecture is simple and highly regular. In [10] a serial versatile multiplier in normal basis over GF(2^m) with low complexity is presented. The circuit requires some simple control signals and has regular local interconnections. In [11] two digit-serial architectures for normal basis multipliers are presented for two special cases of optimal normal bases. The gate counts and gate delays are the same in the two structures. In [14] some linear and nonlinear techniques for design exploration of the Sunar-Koç optimal normal basis type-2 multiplication algorithm are presented. Concepts from computational geometry are used to convert the algorithm into a convex hull in a multidimensional space determined by the algorithm indices. The paper explores several systolic arrays of the Sunar-Koç multiplication algorithm using a systematic technique that combines affine and nonlinear processing element (PE) scheduling and assignment of tasks to processors. Six systolic arrays are obtained. The nonlinear techniques allow control of processor workload and control of communication between processors. General formulas are provided for each design so that the operation of the system can be determined for a given GF(2^m).

Some architectures for Gaussian normal basis (GNB) multiplication are presented in [15–21]. For example, in [16] a modified digit-level GNB multiplier over GF(2^m) is proposed. A complexity reduction algorithm for GNBs of types greater than 2 is also proposed, which reduces the number of XOR gates without increasing the gate delay of the digit-level multiplier. In [17] three structures for the GNB multiplier are presented. The first is a low-complexity digit-level serial-input parallel-output (SIPO) GNB multiplier. The second is an improved digit-level parallel-input serial-output (PISO) multiplier, and the third is a hybrid architecture obtained by connecting the output of the digit-level PISO multiplier to the input of the digit-level SIPO multiplier.
The multiplier structure in [19] is constructed from regular modules for computing exponentiation by powers of 2 and from low-cost blocks for multiplication by normal elements of the binary field. In [21] a low-complexity super-serial architecture for dual basis (DB) multiplication over GF(2^m), which is suitable for lightweight cryptographic algorithms, is presented.

Field multiplication is the most important operation in the implementation of ECC systems. In the binary field GF(2^m), addition and squaring are low-complexity operations whose hardware cost is negligible. The high-complexity field inversion operation can be performed by several field multiplications. Therefore, an efficient, low-cost and high-speed implementation of field multiplication can significantly improve the overall cost and performance of binary elliptic curve cryptography systems. This is an important issue in many applications such as smart cards.

The design approach used in this work is applicable to all multipliers in which an XOR tree is used to sum the partial products. Examples are bit-parallel and digit-serial polynomial basis multipliers, normal basis multipliers with parallel-input parallel-output (PIPO) structure, and bit-serial normal basis multipliers with parallel-input serial-output (PISO) structure in binary finite fields. To present the method, the Massey–Omura (MO) and Reyhani-Masoleh–Hasan (RMH) optimal normal basis multipliers over GF(2^m) are selected as the case study. Efficient and general hardware implementations of bit-serial, digit-serial and bit-parallel MO and RMH multipliers are designed based on the logical effort technique. In both multipliers an XOR tree is used for summation of the partial products. A structural VLSI implementation of the XOR tree, designed based on the logical effort technique, is presented.

Also, an optimized 4-input XOR logic gate for use in the XOR trees of the two multipliers and an XOR–AND circuit used in the bit-serial RMH normal basis multiplier are presented. The proposed 4-input XOR gate and XOR–AND circuit are more compact than a cascade structure of 2-input gates.

The paper is organized as follows. Section 2 gives a brief introduction to the normal basis representation and to bit-serial, digit-serial and bit-parallel multipliers. Section 3 gives a brief discussion of logical effort. Section 4 describes the proposed implementation of the optimal normal basis MO and RMH multipliers over GF(2^m). Section 5 compares this work with previous related works. The paper is concluded in Section 6.

2. Normal basis representation and multiplication

A binary finite field of order 2^m, denoted by GF(2^m), is a vector space of dimension m over GF(2). The elements of GF(2^m) can be represented with respect to a basis. Two important types of representation in finite field arithmetic are the Polynomial Basis (PB) and the Normal Basis (NB). For an efficient hardware implementation, the normal basis is a suitable choice. An element $\beta$ in GF(2^m) is called a normal element if the set

\[ B = \left\{ \beta^{2^{m-1}}, \beta^{2^{m-2}}, \ldots, \beta^{2^{2}}, \beta^{2^{1}}, \beta^{2^{0}} \right\} \]

is a basis for GF(2^m) over GF(2). For every binary finite field such a normal element exists, and the corresponding set B is called a normal basis. Using B, every element A of GF(2^m) can be represented by

\[ A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = a_0 \beta^{2^0} + a_1 \beta^{2^1} + a_2 \beta^{2^2} + \cdots + a_{m-2} \beta^{2^{m-2}} + a_{m-1} \beta^{2^{m-1}} \]

where $a_i \in$ GF(2). For simplicity, the element A is represented by the m-bit number $[a_{m-1}, a_{m-2}, \ldots, a_2, a_1, a_0]$. The addition of elements of GF(2^m) can be done with bit-wise XOR logic gates; that is, if $A = [a_{m-1}, a_{m-2}, \ldots, a_2, a_1, a_0]$ and $B = [b_{m-1}, b_{m-2}, \ldots, b_2, b_1, b_0]$, then

\[ C = A + B = [a_{m-1} \oplus b_{m-1},\; a_{m-2} \oplus b_{m-2},\; \ldots,\; a_2 \oplus b_2,\; a_1 \oplus b_1,\; a_0 \oplus b_0]. \]

One important property of the normal basis representation is that squaring can be performed very efficiently, by a simple one-bit rotation to the left. Since $a_i^2 = a_i$ in GF(2), the cross terms vanish in characteristic 2, and $\beta^{2^m} = \beta$, the square of the element A is given by

\[ A^2 = \left( \sum_{i=0}^{m-1} a_i \beta^{2^i} \right)^{2} = \sum_{i=0}^{m-1} a_i^2 \left( \beta^{2^i} \right)^{2} = \sum_{i=0}^{m-1} a_i \beta^{2^{i+1}} = a_{m-1} \beta + \sum_{i=1}^{m-1} a_{i-1} \beta^{2^i}, \]

which is represented by $[a_{m-2}, a_{m-3}, \ldots, a_2, a_1, a_0, a_{m-1}]$. The multiplication operation in GF(2^m) with the normal basis representation can be expressed as

\[ C = AB = \left( \sum_{i=0}^{m-1} a_i \beta^{2^i} \right) \left( \sum_{i=0}^{m-1} b_i \beta^{2^i} \right) = \sum_{k=0}^{m-1} c_k \beta^{2^k} \]

where $c_k$, for $k = 0, 1, \ldots, m-1$, is given by

\[ c_k = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} \lambda_{ij}^{(k)} a_i b_j , \]

in which the $\lambda^{(k)}$ are $m \times m$ matrices with entries $\lambda_{ij}^{(k)}$ in GF(2). The complexity of the hardware implementation of a normal basis multiplication is related to the number of nonzero entries of the matrices $\lambda^{(k)}$. For an Optimal Normal Basis (ONB) this number is minimum and equals $2m-1$. Two types of ONB, type-1 and type-2, are defined [22]. In the binary finite field GF(2^m), an ONB of type-1 exists if $m+1$ is a prime number and 2 is a primitive element of the prime field GF(m+1). An ONB of type-2 exists in GF(2^m) if $2m+1$ is a prime number and either 2 is a primitive element in GF(2m+1), or $2m+1 \equiv 3 \pmod 4$ and the order of 2 in GF(2m+1) is m.

As previously mentioned, the proposed method is applicable to bit-serial multipliers with parallel-input serial-output (PISO) structure and to digit-serial and bit-parallel multipliers with parallel-input parallel-output (PIPO) structure. So, in this paper, type-1 and type-2 ONB multipliers with PISO and PIPO structures are considered for implementation. To that end, the bit-serial, digit-serial and bit-parallel structures are presented first. For example, the matrix $\lambda^{(0)}$ for a type-1 ONB is computed as follows [23]:

\[ \lambda_{ij}^{(0)} = \begin{cases} 1 & \text{if } 2^i + 2^j \equiv 0 \text{ or } 1 \pmod{m+1} \\ 0 & \text{otherwise.} \end{cases} \]

The matrices $\lambda^{(k)}$ can be obtained by a k-fold cyclic shift of the matrix $\lambda^{(0)}$. The Massey–Omura (MO) multiplier is a standard multiplier for normal basis multiplication. It is implemented by an F function constructed from a row of AND gates and an array of XOR gates. The AND gates generate the partial products corresponding to the '1' entries of the matrix $\lambda^{(0)}$. The XOR array sums the partial products. To compute $c_k$, for $k = 0, 1, \ldots, m-1$, the structure of the F function is fixed and only the inputs are cyclically shifted by one bit. For example, if $c_0 = F(a_0, a_1, a_2, \ldots, a_{m-1};\, b_0, b_1, b_2, \ldots, b_{m-1})$, then $c_1 = F(a_1, a_2, \ldots, a_{m-1}, a_0;\, b_1, b_2, \ldots, b_{m-1}, b_0)$. Also, for $k = 2, \ldots, m-1$, we have

\[ c_k = F(a_k, a_{k+1}, \ldots, a_{m-1}, a_0, \ldots, a_{k-2}, a_{k-1};\; b_k, b_{k+1}, \ldots, b_{m-1}, b_0, \ldots, b_{k-2}, b_{k-1}). \]

As an example consider the binary finite field GF(2^10). Let A and B be elements of GF(2^10) and $C = A \cdot B$. The matrix $\lambda^{(0)}$ is calculated as

\[ \lambda^{(0)} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \]

The bit $c_0$ is computed by summing the partial products corresponding to the '1' entries of the matrix $\lambda^{(0)}$ as follows:

\[ \begin{aligned} c_0 &= F(a_0, a_1, a_2, \ldots, a_{m-1};\, b_0, b_1, b_2, \ldots, b_{m-1}) \\ &= a_5 b_0 \oplus a_5 b_1 \oplus a_6 b_1 \oplus a_3 b_2 \oplus a_7 b_2 \oplus a_2 b_3 \oplus a_8 b_3 \oplus a_7 b_4 \oplus a_9 b_4 \oplus a_0 b_5 \oplus a_1 b_5 \\ &\quad \oplus a_1 b_6 \oplus a_8 b_6 \oplus a_2 b_7 \oplus a_4 b_7 \oplus a_3 b_8 \oplus a_6 b_8 \oplus a_4 b_9 \oplus a_9 b_9 . \end{aligned} \]
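The existence conditions and the type-1 rule for $\lambda^{(0)}$ stated above can be checked with a short script. The following is a minimal Python sketch, not the authors' tooling; the function names (`is_prime`, `order_of_2`, `has_type1_onb`, `has_type2_onb`, `lambda0_type1`) are hypothetical, and the list index `i` is assumed to stand for the coefficient $a_i$.

```python
def is_prime(n):
    """Trial-division primality test (adequate for the small moduli used here)."""
    if 2 > n:
        return False
    d = 2
    while n >= d * d:
        if n % d == 0:
            return False
        d += 1
    return True


def order_of_2(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    x, k = 2 % p, 1
    while x != 1:
        x = (x * 2) % p
        k += 1
    return k


def has_type1_onb(m):
    """Type-1 condition: m+1 is prime and 2 is primitive modulo m+1."""
    p = m + 1
    return p > 2 and is_prime(p) and order_of_2(p) == m


def has_type2_onb(m):
    """Type-2 condition: 2m+1 is prime and either 2 is primitive modulo 2m+1,
    or 2m+1 = 3 (mod 4) and the order of 2 modulo 2m+1 is m."""
    p = 2 * m + 1
    if not is_prime(p):
        return False
    k = order_of_2(p)
    return k == 2 * m or (p % 4 == 3 and k == m)


def lambda0_type1(m):
    """lambda^(0)[i][j] = 1 iff 2^i + 2^j = 0 or 1 (mod m+1); assumes a type-1 ONB exists."""
    p = m + 1
    pow2 = [pow(2, i, p) for i in range(m)]
    return [[1 if (pow2[i] + pow2[j]) % p in (0, 1) else 0 for j in range(m)]
            for i in range(m)]


lam = lambda0_type1(10)
# Partial products a_i b_j of c_0, listed column by column (grouped by b_j).
terms = [(i, j) for j in range(10) for i in range(10) if lam[i][j]]
print(len(terms))   # → 19, i.e. 2m - 1 AND gates in the F function
print(terms[:3])    # → [(5, 0), (5, 1), (6, 1)]
```

For m = 10 the generated matrix reproduces the GF(2^10) example, and the checks return true for the type-1 field with m = 226 and the type-2 field with m = 233 used later in the paper.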

The other $c_k$, for $k = 1, \ldots, 9$, are constructed similarly by cyclically shifting the inputs A and B. Reyhani-Masoleh and Hasan [11] proposed a bit-serial normal basis multiplier based on a different formulation obtained by rearranging the XOR and AND operations. The output terms are given by

\[ c_k = a_k b_k + \sum_{(r,s) \in \varphi_k} (a_r + a_s)(b_r + b_s), \qquad 0 \le k \le m-1, \]

where $\varphi_k$ is the set of coordinates of the '1' entries in the upper part (above the main diagonal) of the matrix $\lambda^{(k)}$. In the above example, $\varphi_0$ is computed as

\[ \varphi_0 = \{(0,5), (1,5), (1,6), (2,3), (2,7), (3,8), (4,7), (4,9), (6,8)\}. \]

Thus, the product bit $c_0$ can be calculated as

\[ \begin{aligned} c_0 = {} & a_0 b_0 + (a_0 + a_5)(b_0 + b_5) + (a_1 + a_5)(b_1 + b_5) + (a_1 + a_6)(b_1 + b_6) + (a_2 + a_3)(b_2 + b_3) \\ & + (a_2 + a_7)(b_2 + b_7) + (a_3 + a_8)(b_3 + b_8) + (a_4 + a_7)(b_4 + b_7) + (a_4 + a_9)(b_4 + b_9) + (a_6 + a_8)(b_6 + b_8). \end{aligned} \]

Fig. 1(a) and (b) show the bit-serial type-1 PISO optimal normal basis MO and RMH multipliers over GF(2^10), respectively. To increase the speed of the multiplier, m copies of the F function block of the bit-serial multiplier are used to realize the bit-parallel structure [11]. The inputs of the i-th F function are a one-bit left cyclic shift of the inputs of the (i−1)-th F function. The bit-parallel structure is fast but needs more logic gates. Another structure for the multiplier is digit-serial, which is constructed from n copies of the F function block, where $1 \le n \le m$. The output signals of the digit-serial multiplier are $c_i$, $0 \le i \le n-1$. Figs. 2 and 3 show the bit-parallel and digit-serial parallel-input parallel-output structures of the MO multiplier, respectively. In these figures, the F function is that of the bit-serial MO multiplier. To construct the bit-parallel and digit-serial RMH multipliers, only the F function is replaced by the F function of the bit-serial RMH multiplier. The circuits in Figs. 2 and 3 are PIPO structures, because the input signals are applied to the multiplier in parallel and the output signals are produced in parallel. In the bit-parallel and digit-serial multipliers the output bits are computed in one and d clock cycles, respectively, where $d = \lceil m/n \rceil$.
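The agreement between the MO and RMH formulations of the GF(2^10) example can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `lambda0_type1`, `mo_multiply` and `rmh_multiply` are hypothetical names, operands are Python lists with index `i` holding $a_i$, and $\varphi_k$ is obtained by shifting the indices of $\varphi_0$ by k modulo m, which is the same cyclic-shift rule the text uses for the F function.

```python
import random


def lambda0_type1(m):
    """lambda^(0) for a type-1 ONB: 1 iff 2^i + 2^j = 0 or 1 (mod m+1)."""
    p = m + 1
    pow2 = [pow(2, i, p) for i in range(m)]
    return [[1 if (pow2[i] + pow2[j]) % p in (0, 1) else 0 for j in range(m)]
            for i in range(m)]


def mo_multiply(a, b, lam):
    """Massey-Omura form: c_k = XOR of a_(i+k) b_(j+k) over the '1' entries of lambda^(0)."""
    m = len(a)
    ones = [(i, j) for i in range(m) for j in range(m) if lam[i][j]]
    return [sum(a[(i + k) % m] & b[(j + k) % m] for i, j in ones) % 2
            for k in range(m)]


def rmh_multiply(a, b, lam):
    """RMH form: c_k = a_k b_k + sum over phi_k of (a_r + a_s)(b_r + b_s) in GF(2)."""
    m = len(a)
    # phi_0: '1' entries strictly above the main diagonal of lambda^(0).
    phi0 = [(r, s) for r in range(m) for s in range(r + 1, m) if lam[r][s]]
    c = []
    for k in range(m):
        bit = a[k] & b[k]
        for r, s in phi0:
            rk, sk = (r + k) % m, (s + k) % m
            bit ^= (a[rk] ^ a[sk]) & (b[rk] ^ b[sk])
        c.append(bit)
    return c


m = 10
lam = lambda0_type1(m)
rng = random.Random(1)
a = [rng.randint(0, 1) for _ in range(m)]
b = [rng.randint(0, 1) for _ in range(m)]
print(mo_multiply(a, b, lam) == rmh_multiply(a, b, lam))  # → True
```

Two further sanity checks follow from the text: multiplying by the all-ones vector (the field element 1 in a normal basis) returns the operand unchanged, and multiplying an element by itself reproduces the one-bit rotation that implements squaring.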

Fig. 1. Structure of the bit-serial type-1 PISO optimal normal basis MO multiplier (a) and RMH multiplier (b) over GF(2^10).


Fig. 2. Structure of the bit-parallel PIPO MO multiplier.

Fig. 3. Structure of the digit-serial PIPO MO multiplier.
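The digit-serial schedule described in Section 2, with n copies of the F function producing the product in $d = \lceil m/n \rceil$ clock cycles, can be modelled behaviourally as follows. This is only a sketch under assumptions: the names `f_block` and `digit_serial_multiply` are hypothetical, the operand registers are assumed to rotate by n positions per clock, and the type-1 rule for $\lambda^{(0)}$ is used, so m must admit a type-1 ONB (m = 10 here).

```python
from math import ceil


def lambda0_type1(m):
    """lambda^(0) for a type-1 ONB: 1 iff 2^i + 2^j = 0 or 1 (mod m+1)."""
    p = m + 1
    pow2 = [pow(2, i, p) for i in range(m)]
    return [[1 if (pow2[i] + pow2[j]) % p in (0, 1) else 0 for j in range(m)]
            for i in range(m)]


def f_block(a, b, ones):
    """One F function: XOR of the partial products a_i b_j selected by lambda^(0)."""
    bit = 0
    for i, j in ones:
        bit ^= a[i] & b[j]
    return bit


def digit_serial_multiply(a, b, n):
    """Compute C = A*B with n F blocks; returns (c, d) with d = ceil(m / n) cycles."""
    m = len(a)
    lam = lambda0_type1(m)
    ones = [(i, j) for i in range(m) for j in range(m) if lam[i][j]]
    d = ceil(m / n)
    c = [0] * m
    ra, rb = list(a), list(b)          # input registers
    for cycle in range(d):
        for u in range(n):             # the n F blocks work in parallel
            k = cycle * n + u
            if k >= m:
                break
            # Block u sees the registers rotated left by u further positions.
            c[k] = f_block(ra[u:] + ra[:u], rb[u:] + rb[:u], ones)
        # Registers rotate left by n positions per clock cycle.
        ra, rb = ra[n:] + ra[:n], rb[n:] + rb[:n]
    return c, d


a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
b = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
c5, d5 = digit_serial_multiply(a, b, 5)     # the (d = 2, n = 5) case
c1, d1 = digit_serial_multiply(a, b, 1)     # bit-serial
c10, d10 = digit_serial_multiply(a, b, 10)  # bit-parallel
print(d1, d5, d10, c1 == c5 == c10)  # → 10 2 1 True
```

The three settings differ only in how many F blocks are instantiated and how many cycles are spent, which is the area/latency trade-off the structures in Figs. 2 and 3 expose.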

As seen in the bit-parallel and digit-serial structures, the input registers must drive many F functions. For example, in the MO multiplier, if the input capacitance of each AND gate is denoted by $C_{in\text{-}AND}$, then the capacitive load on each output bit of the input registers is equal to $2m \times C_{in\text{-}AND}$. For finite fields of practical interest m is large, so this capacitance is a heavy load that increases the delay of the circuit. To improve the speed, the output bits of the input registers are buffered in the implementation. The buffer is a chain of inverters [24]. Fig. 4 shows the digit-serial structure of the RMH multiplier over GF(2^10) (d = 2, n = 5) with buffered input registers.

Fig. 4. Digit-serial structure of the RMH multiplier over GF(2^10) with buffered input registers.

3. A brief discussion on logical effort technique

The logical effort technique is a procedure for achieving the least delay for a given load in a logic circuit [25]. In this technique, the normalized delay D of a logic gate is defined as $D = f + p$, where p is the parasitic delay and f is the effort delay, or stage effort, of the gate. The delay is expressed in the unit $\tau$, which is technology dependent and is the delay of the smallest CMOS inverter in the technology when it drives a similar inverter, under the assumption that the drain capacitances are negligible. The effort delay f is defined by the logical effort g and the fanout h of the logic gate as $f = gh$. The parasitic delay of the logic gate is computed from the diffusion capacitance of the output node. The logical effort g represents the complexity of the logic gate. It is expressed as the ratio of the gate input capacitance to the input capacitance of an inverter that can deliver the same output current. The fanout, or electrical effort, h is computed as

\[ h = \frac{C_{out}}{C_{in}} \]

where $C_{in}$ is the input capacitance of the gate and $C_{out}$ is the external load capacitance. In multi-stage logic circuits, the path logical effort, denoted by G, is computed by multiplying the logical efforts of all the stages along the path:

\[ G = \prod g_i \]

The path electrical effort H is defined as the ratio of the output capacitance of the path to the input capacitance of the path:

\[ H = \frac{C_{out(path)}}{C_{in(path)}} \]

The product of the stage efforts along the path is defined as the path effort F:

\[ F = \prod f_i = \prod g_i h_i \]

In a path with a branch, the branching effort is defined to account for the effect of the branch load. It is the ratio of the total load capacitance at the branch node to the on-path load capacitance of the node:

\[ b = \frac{C_{on\text{-}path} + C_{off\text{-}path}}{C_{on\text{-}path}} \]

The path branching effort B is defined as the product of the branching efforts:

\[ B = \prod b_i \]

The path effort is calculated by multiplying the path logical effort G, the path electrical effort H and the path branching effort B:

\[ F = GBH \]

It can be shown that the delay of a path is minimized when all stages in the path bear the same stage effort. In a path with N stages this common stage effort is $\hat{f} = g_i h_i = F^{1/N}$. Thus the minimum delay of the path is given by

\[ \hat{D} = N F^{1/N} + P \]

where P is the path parasitic delay, defined as $P = \sum p_i$. The input capacitance of each stage of the path, $C_{in\text{-}i}$, which depends on the output load capacitance of the stage, $C_{out\text{-}i}$, is calculated by

\[ C_{in\text{-}i} = \frac{C_{out\text{-}i}\, g_i}{\hat{f}} \qquad (1) \]

The values of $C_{in\text{-}i}$ are used to determine the sizes of the transistors.
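The sizing procedure of this section can be illustrated with a few lines of Python. The sketch below is a generic illustration, not the authors' design flow: the function name `size_path` and its arguments are hypothetical, capacitances are in arbitrary units, and the example is a three-inverter chain with g = 1 and an assumed normalized parasitic delay p = 1 per inverter. The product `b[i] * load` plays the role of the total stage load $C_{out\text{-}i}$ in Eq. (1).

```python
from math import prod


def size_path(g, b, p, c_in, c_load):
    """Logical-effort sizing of an N-stage path.

    g, b, p : per-stage logical effort, branching effort and parasitic delay
    c_in    : input capacitance of the first stage
    c_load  : load capacitance at the path output
    Returns (F, f_hat, d_min, caps), where caps[i] is the input capacitance of stage i.
    """
    n = len(g)
    G = prod(g)                      # path logical effort
    B = prod(b)                      # path branching effort
    H = c_load / c_in                # path electrical effort
    F = G * B * H                    # path effort
    f_hat = F ** (1.0 / n)           # equal stage effort that minimizes delay
    d_min = n * f_hat + sum(p)       # minimum path delay in units of tau
    caps = [0.0] * n
    load = c_load
    for i in range(n - 1, -1, -1):   # work backwards from the output
        # b[i] * load is the total (on-path plus off-path) load of stage i.
        caps[i] = g[i] * b[i] * load / f_hat
        load = caps[i]
    return F, f_hat, d_min, caps


# Assumed example: three inverters driving a load 64 times the input capacitance.
F, f_hat, d_min, caps = size_path([1, 1, 1], [1, 1, 1], [1, 1, 1], 1.0, 64.0)
print(round(f_hat, 3), round(d_min, 3), [round(c, 3) for c in caps])
# → 4.0 15.0 [1.0, 4.0, 16.0]
```

The first entry of `caps` reproduces the given path input capacitance, which is a quick consistency check on F = GBH.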

4. Proposed implementation of optimal normal basis MO and RMH multipliers over GF(2^m)

The two optimal normal basis multipliers are constructed from XOR arrays and AND gates, as seen in Figs. 1–4. The hardware utilization of the bit-serial MO and RMH multipliers is (2m–1 AND gates and 2m–2 XOR gates) and (m AND gates and 3m–3 XOR gates), respectively. For the bit-parallel and digit-serial structures these hardware resources are multiplied by m and n, respectively. The main element of all these circuits is the XOR gate. Different low-cost full-swing XOR circuits are presented in the literature [26–34]. Six different XOR circuits are shown in Fig. 5(a–f). In Fig. 5(a) [26] the XOR gate is constructed with eight transistors; an inverter at the output node provides a full swing. An XOR–XNOR cross-coupled circuit based on pass transistor logic, shown in Fig. 5(b), is presented in [30].


Fig. 5. Different XOR gate circuits from the related literature [26–34].

Table 1
Schematic simulation results of delay and area for the circuits of Figs. 5 and 6.

Circuit      Delay, CL = 10 fF   Delay, CL = 25 fF   Delay, CL = 50 fF   Delay, CL = 100 fF   Area (μm²)
Fig. 5(a)    219 ps              340 ps              517 ps              854 ps               43.8
Fig. 5(b)    205 ps              395 ps              709 ps              1.337 ns             42.2
Fig. 5(c)    157 ps              258 ps              427 ps              765 ps               54.6
Fig. 5(d)    145 ps              229 ps              396 ps              728 ps               49.6
Fig. 5(e)    143 ps              284 ps              524 ps              1.011 ns             44.6
Fig. 5(f)    161 ps              322 ps              594 ps              1.138 ns             32.4
Fig. 6       137 ps              232 ps              392 ps              714 ps               44.4

Fig. 6. Proposed high-speed modified structure of Fig. 5(f).

A Double Pass Transistor Logic (DPL) XOR gate is shown in Fig. 5 (c) [31]. This structure provides a full voltage swing. Limitation of the circuit is its large area. The circuit of XOR gate in [32] is shown in Fig. 5(d). This circuit is constructed by transmission gate, and the voltage level is restored by an inverter at the output node. In Fig. 5(e) circuit is composed of 2 transmission gates and 2 inverters [33]. High power consumption and large area are drawback of the circuit. The XOR circuit of Fig. 5(f), which consists of 6 transistors [34], the output signal and its complement are generated simultaneously. In this circuit for two states of A¼’1,’ B ¼’0’ and A ¼’0,’ B ¼’1’ the level of output voltage depends on voltage level of input signals. In this work a modified version of Fig. 5(f) in which this limitation is eliminated is used. Fig. 6 shows the proposed modified structure. In this figure to increase speed and improve output voltage level, two minimum size pull-up PMOS transistors are added to the circuit. For an identical delay, the sizes of input transistors are lower in the circuit with pull-up pMOS transistors. The delays and areas of the circuits in Fig. 5 and also Fig. 6 for different load capacitances are shown in Table 1. The circuits are implemented in 0.18 mm CMOS technology, with VDD ¼ 1.8 V. All transistors and inverters in the circuits are minimum size. The

areas are estimated by summation of transistors areas without considering the routing. As shown in the table, the modified XOR gate in Fig. 6 is the fastest among other XOR gates. XOR gates with 3 or 4 inputs can be more compact than a cascade of 2-input XOR gates [35]. Fig. 7 shows three 4-input XOR circuits presented in [35]. 4-input static CMOS XOR and Complementary Pass Transistor Logic (CPL) XOR/XNOR gates are shown in Fig. 7(a) and (b) respectively. In the figures the true and complementary trees share most of the transistors. Fig. 7(c) shows a 4-input Cascade Voltage Switch Logic (CVSL) XOR gate, in which both true and complementary inputs are applied to the circuit and a pair of nMOS pull-down transistors produce both true and complementary outputs. In pseudo-nMOS structure, the size of pMOS transistor is important. A small pMOS transistor is slow at pulling up the complementary output. In addition, the CVSL gate requires both the low- and high-going transitions that add more delay. All these 4-input gates have large area. In this work a new 4-input XOR gate constructed by two highspeed modified 2-input XOR gates, 4 pass transistors and one inverter at the output node is used. The number of transistors in the proposed structure is lower than those shown in Fig. 7. Voltage level and driving capability is restored by using an inverter at the output. Fig. 8(a) shows the proposed circuit. In the bit-serial RMH multiplier the structure is divided into two main sections of XOR–AND and XOR tree parts. In this

Please cite this article as: B. Rashidi, et al., An efficient and high-speed VLSI implementation of optimal normal basis multiplication over GF(2m), INTEGRATION, the VLSI journal (2016), http://dx.doi.org/10.1016/j.vlsi.2016.05.006i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

B. Rashidi et al. / INTEGRATION, the VLSI journal ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

7

Fig. 7. Three traditional 4-input XOR gates presented in [35].

A A

B B

X

A B

XA

XA

C D

C

C

D

D

Fig. 8. The proposed structure of the 4-input XOR (a) and structure of the XOR–AND part in the RMH multiplier (b).

Fig. 9. Monte Carlo result for the delay of the proposed 4-input XOR.

Please cite this article as: B. Rashidi, et al., An efficient and high-speed VLSI implementation of optimal normal basis multiplication over GF(2m), INTEGRATION, the VLSI journal (2016), http://dx.doi.org/10.1016/j.vlsi.2016.05.006i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

B. Rashidi et al. / INTEGRATION, the VLSI journal ∎ (∎∎∎∎) ∎∎∎–∎∎∎

8

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

Table 2 Schematic simulation results of delay and area of the proposed 4-input XOR gates and 4-input XOR circuits constructed by cascade of 2-input XOR circuit in Figs. 2 and 3. Circuits

Delay, CL ¼10 fF

Delay, CL ¼25 fF

Delay, CL ¼ 50 fF

Delay, CL ¼ 100 fF

Area (μm2)

4-input XOR by Fig. 2(a) 4-input XOR by Fig. 2(b) 4-input XOR by Fig. 2(c) 4-input XOR by Fig. 2(d) 4-input XOR by Fig. 2(e) 4-input XOR by Fig. 2(f) 4-input XOR by Fig. 3 Proposed 4-input XOR

370 ps 353 ps 236 ps 336 ps 318 ps 382 ps 341 ps 203 ps

450 ps 589 ps 337 ps 421 ps 357 ps 590 ps 530 ps 305 ps

593 ps 977 ps 505 ps 552 ps 596 ps 860 ps 802 ps 475 ps

892 ps 1.754 ns 838 ps 834 ps 1.074 ns 1.315 ns 1.28 ns 814 ps

131.4 126.6 163.8 148.8 133.8 97.2 133.2 121.4

Fig. 10. Monte Carlo result for the delay of the proposed XOR–AND circuit.

structure we have m–1 XOR–AND block over GF(2m). Fig. 8 (b) shows the proposed low-cost XOR–AND implementation. In this implementation by using both XNOR and XOR outputs in the modified XOR structure of Fig. 6, the proposed XOR–AND structure is implemented by 19 transistors. The effect of process variations and mismatch on the delay was evaluated through the Monte Carlo analysis for the proposed 4input XOR circuit. Fig. 9 shows the result for 500 iterations and for load capacitance of 10 fF. As the figure shows the mean value of the delay is 203.661 ps. Table 2 shows the delay and area of the proposed 4-input XOR gate and 4-input XOR circuits constructed by cascade of 2-input XOR circuits in Figs. 5 and 6. The results, which are obtained for minimum size transistors and inverters, show better delay and area characteristics for the proposed structure, compared to other structures. Result of the Monte Carlo analysis was also obtained for the proposed XOR–AND circuit. The result for N ¼500 iterations and for load capacitance 10fF is shown in Fig. 10. As the figure shows the mean value of delay is 213.939 ps. In the following, based on logical effort technique implementations of type-1 MO and RMH optimal normal basis multipliers over GF(2226) also type-2 MO and RMH optimal normal basis multipliers over GF(2233), which is recommended by NIST for elliptic curve cryptography [36], are presented. First implementation of the MO multiplier is described. In the proposed structure the row of AND gates are implemented by minimum size CMOS inverters and NOR gates. By using the inverters, sensitivity to the voltage level of previous blocks is reduced. Fig. 11 shows the proposed change which is applied on Fig. 1(a), the structure of bitserial MO multiplier over GF(210). For implementation of the

Fig. 11. Implementation of input row of the AND gates based on NOR gates and inverters in the MO multiplier over GF(210).

multiplier over GF(2^226) the number of input signals in the XOR tree is 2m − 1 = 2(226) − 1 = 451. In the general case, as shown in Fig. 12(a) and (b), the XOR tree is implemented in nine stages of 2-input XOR gates for both GF(2^226) and GF(2^233).
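The stage count quoted above can be checked by repeatedly grouping signals into gates until one output remains. The following Python sketch is illustrative only; the function name `xor_tree_levels` is hypothetical and the model assumes a plain balanced tree in which a leftover signal simply passes to the next level.

```python
def xor_tree_levels(num_inputs: int, fan_in: int) -> int:
    """Count the gate levels needed to reduce num_inputs signals to one output."""
    levels = 0
    signals = num_inputs
    while signals > 1:
        # Each level groups up to fan_in signals into one gate output
        # (ceiling division; a lone leftover signal passes through).
        signals = -(-signals // fan_in)
        levels += 1
    return levels


m = 226
inputs_mo = 2 * m - 1  # 451 inputs to the MO XOR tree, as stated in the text

# Levels with 2-input gates and with 4-input gates.
print(inputs_mo, xor_tree_levels(inputs_mo, 2), xor_tree_levels(inputs_mo, 4))
# prints: 451 9 5
```

Under this model, 451 inputs need nine levels of 2-input XOR gates, in agreement with the text, and five levels of 4-input gates. If each 4-input gate is counted as two logic stages, five levels correspond to a 10-stage tree.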


Fig. 12. General 9-stage structure of the XOR tree in the MO multiplier type-1 over GF(2^226) (a), and type-2 over GF(2^233) (b), based on 2-input XOR gates.

Fig. 13. Structure of the 12-stage XOR tree over GF(2^226) based on the 4-input XOR gate (a), and a typical path in this case (b).

Fig. 14. Structure of the 10-stage XOR tree over GF(2^226) (a), and GF(2^233) (b), based on 4-input XOR gates.

For implementation of the XOR tree, the proposed low-cost 4-input XOR gate is used in the MO and RMH multipliers. The bit-serial type-1 MO multiplier over GF(2^226) is implemented by two structures, one with 12 stages and the other with 10 stages. For the type-2 MO multiplier over GF(2^233) the implementation is based on the 10-stage structure. The transistor sizes are computed for different electrical efforts by using the logical effort technique. The process for the 12-stage structure of the bit-serial MO multiplier over GF(2^226) with H = 10 is described in detail in the following. For the 12-stage structure, we use six stages of the 4-input


Fig. 15. General 10-stage structure of the RMH multiplier over GF(2^226) (a), and over GF(2^233) (b), based on 2-input XOR gates.

Fig. 16. Proposed low-cost 9-stage structures of the bit-serial RMH multiplier over GF(2^226) (a), and GF(2^233) (b).

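XOR gate, because each 4-input XOR gate is constructed from two logic stages, comprising an inverter and a 2-input XOR gate with pass transistors. The path logical effort is the product of the logical efforts of six inverters and six 2-input XOR gates, calculated as

$$G = g_1 g_2 \cdots g_{12} = \left(\frac{1.4252}{1.22}\right)^{6} \times (1)^{6} = (1.1679)^{6} = 2.538.$$

The branching effort is B = 1, and the path effort is F = GBH = 25.38. Minimum delay can be realized if the transistor sizes in each stage are chosen properly. To that end, first the stage effort is computed as $\hat{f} = \sqrt[12]{10 \times 1.1679^{6}} = 1.31$, and since H is equal to 10, $C_L = 10C_{in} = 10C_x = 14.252$ fF, where $C_x$ is the input capacitance of the proposed 4-input XOR gate. The input capacitance of each gate is computed by using Eq. (1), starting with the load capacitance at the output node of the path. The method is as follows:

$$
\begin{aligned}
C_{\text{in-12}} &= C_{\text{out-12}}\,\frac{g_{12}}{\hat{f}} = 14.252\ \text{fF} \times \frac{1}{1.31} = 10.88\ \text{fF} &&\Rightarrow\ W_{\text{n-12}} = 1.96\ \mu\text{m},\ W_{\text{p-12}} = 3.91\ \mu\text{m},\\
C_{\text{in-11}} &= C_{\text{in-12}}\,\frac{g_{11}}{\hat{f}} = 10.88\ \text{fF} \times \frac{1.1679}{1.31} = 9.7\ \text{fF} &&\Rightarrow\ W_{\text{n1-11}} = W_{\text{n2-11}} = W_{\text{p1-11}} = W_{\text{p2-11}} = 1.5\ \mu\text{m},\\
C_{\text{in-10}} &= C_{\text{in-11}}\,\frac{g_{10}}{\hat{f}} = 9.7\ \text{fF} \times \frac{1}{1.31} = 7.4\ \text{fF} &&\Rightarrow\ W_{\text{n-10}} = 1.33\ \mu\text{m},\ W_{\text{p-10}} = 2.66\ \mu\text{m},\\
C_{\text{in-9}} &= C_{\text{in-10}}\,\frac{g_{9}}{\hat{f}} = 7.4\ \text{fF} \times \frac{1.1679}{1.31} = 6.6\ \text{fF} &&\Rightarrow\ W_{\text{n1-9}} = W_{\text{n2-9}} = W_{\text{p1-9}} = W_{\text{p2-9}} = 1.014\ \mu\text{m},\\
C_{\text{in-8}} &= C_{\text{in-9}}\,\frac{g_{8}}{\hat{f}} = 6.6\ \text{fF} \times \frac{1}{1.31} = 5.04\ \text{fF} &&\Rightarrow\ W_{\text{n-8}} = 0.9\ \mu\text{m},\ W_{\text{p-8}} = 1.8\ \mu\text{m},
\end{aligned}
$$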
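The path-level quantities used in this derivation (G, F and the stage effort) can be reproduced with a few lines of arithmetic. The following Python sketch is illustrative only: the function and constant names are hypothetical, the logical efforts 1.1679 (2-input XOR stage) and 1 (inverter) are the values quoted in the text, and the branching effort is taken as B = 1.

```python
G_XOR = 1.1679  # logical effort of the 2-input XOR stage (value quoted in the text)
G_INV = 1.0     # logical effort of an inverter


def path_and_stage_effort(n_xor: int, n_inv: int, h: float, b: float = 1.0):
    """Return (G, F, f_hat) for a path of n_xor XOR stages and n_inv inverters."""
    g_path = (G_XOR ** n_xor) * (G_INV ** n_inv)  # path logical effort G
    f_path = g_path * b * h                       # path effort F = G * B * H
    n_stages = n_xor + n_inv
    f_hat = f_path ** (1.0 / n_stages)            # optimal stage effort
    return g_path, f_path, f_hat


# 12-stage MO structure: six XOR stages and six inverters, H = 10.
g12, f12, fhat12 = path_and_stage_effort(6, 6, 10)
print(round(g12, 3), round(f12, 2), round(fhat12, 2))
# prints approximately: 2.538 25.38 1.31
```

The same helper applies to other stage counts; for example, five XOR stages and four inverters give a stage effort close to 1.408.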


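Fig. 17. Layout of the 4-input XOR gate used in the first stage of the XOR tree in the 9-stage structure of the RMH multiplier (a), and the layout of one XOR–AND block (b).

$$
\begin{aligned}
C_{\text{in-7}} &= C_{\text{in-8}}\,\frac{g_{7}}{\hat{f}} = 5.04\ \text{fF} \times \frac{1.1679}{1.31} = 4.5\ \text{fF} &&\Rightarrow\ W_{\text{n1-7}} = W_{\text{n2-7}} = W_{\text{p1-7}} = W_{\text{p2-7}} = 0.69\ \mu\text{m},\\
C_{\text{in-6}} &= C_{\text{in-7}}\,\frac{g_{6}}{\hat{f}} = 4.5\ \text{fF} \times \frac{1}{1.31} = 3.435\ \text{fF} &&\Rightarrow\ W_{\text{n-6}} = 0.6\ \mu\text{m},\ W_{\text{p-6}} = 1.2\ \mu\text{m},\\
C_{\text{in-5}} &= C_{\text{in-6}}\,\frac{g_{5}}{\hat{f}} = 3.435\ \text{fF} \times \frac{1.1679}{1.31} = 3.06\ \text{fF} &&\Rightarrow\ W_{\text{n1-5}} = W_{\text{n2-5}} = W_{\text{p1-5}} = W_{\text{p2-5}} = 0.47\ \mu\text{m},\\
C_{\text{in-4}} &= C_{\text{in-5}}\,\frac{g_{4}}{\hat{f}} = 3.06\ \text{fF} \times \frac{1}{1.31} = 2.336\ \text{fF} &&\Rightarrow\ W_{\text{n-4}} = 0.4\ \mu\text{m},\ W_{\text{p-4}} = 0.8\ \mu\text{m},
\end{aligned}
$$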


Fig. 18. Layout of the proposed 9-stage structure of the bit-serial RMH multiplier over GF(2^226) for the case of H = 10, CL = 14.252 fF. (Labelled regions: XOR–AND parts 1–4 with the first stage of 4-input XOR gates; second stage of 4-input XOR gates; third and fourth stages of 4-input XOR gates.)

Fig. 19. Result of Monte Carlo analysis for the delay of the proposed 9-stage structure of the bit-serial RMH multiplier over GF(2^226) for the case of H = 10, CL = 14.252 fF and for 500 iterations.

$$
\begin{aligned}
C_{\text{in-3}} &= C_{\text{in-4}}\,\frac{g_{3}}{\hat{f}} = 2.336\ \text{fF} \times \frac{1.1679}{1.31} = 2.083\ \text{fF} &&\Rightarrow\ W_{\text{n1-3}} = W_{\text{n2-3}} = W_{\text{p1-3}} = W_{\text{p2-3}} = 0.33\ \mu\text{m},\\
C_{\text{in-2}} &= C_{\text{in-3}}\,\frac{g_{2}}{\hat{f}} = 2.083\ \text{fF} \times \frac{1}{1.31} = 1.59\ \text{fF} &&\Rightarrow\ W_{\text{n-2}} = 0.28\ \mu\text{m},\ W_{\text{p-2}} = 0.56\ \mu\text{m},\\
C_{\text{in-1}} &= C_{\text{in-2}}\,\frac{g_{1}}{\hat{f}} = 1.59\ \text{fF} \times \frac{1.1679}{1.31} = 1.42\ \text{fF} = C_{in} &&\Rightarrow\ W_{\text{n1-1}} = W_{\text{n2-1}} = W_{\text{p1-1}} = W_{\text{p2-1}} = 0.22\ \mu\text{m}.
\end{aligned}
$$

As expected, the input capacitance obtained for the first stage is equal to the input capacitance of the 4-input XOR gate. $W_{\text{n1-}i}$, $W_{\text{n2-}i}$, $W_{\text{p1-}i}$ and $W_{\text{p2-}i}$ are the sizes of the input nMOS and pMOS transistors in the proposed 4-input XOR circuit, and $W_{\text{n-}i}$ and $W_{\text{p-}i}$ are the sizes of the nMOS and pMOS transistors of the inverter in the proposed circuit. Based on the above calculations, the transistors in the output stages are wider, which enables them to drive current into large output loads.

The proposed 12-stage structure of the XOR tree in the bit-serial MO multiplier over GF(2^226) is shown in Fig. 13(a). As previously mentioned, in this structure each 4-input XOR gate consists of two logic stages: one is an inverter and the other is the circuit shown in Fig. 8(a). A typical path in the circuit is shown in Fig. 13(b). Fig. 14(a) and (b) show the proposed 10-stage structure of the XOR tree in the bit-serial MO multiplier over GF(2^226) and GF(2^233), respectively. The main element in the bit-parallel and digit-serial structures of the MO multiplier is the F function. The implementation of this function for the bit-parallel and digit-serial structures is the same as for the bit-serial structure.

In the bit-serial RMH multiplier over the fields GF(2^226) and GF(2^233), the numbers of inputs to the XOR–AND components and to the XOR trees are (2m − 2 = 2(226) − 2 = 450 and m = 226) and (2m − 2 = 2(233) − 2 = 464 and m = 233), respectively. The XOR tree is implemented by eight stages of 2-input XOR gates for both fields, and the XOR–AND is implemented in two stages, so the multiplier is implemented in 10 stages. The general structures of the RMH multiplier over GF(2^226) and GF(2^233) based on the 10-stage structure


Table 3 Comparison of the proposed structures and other related works. Works/structures

# 2-input XOR

# 2-input AND

# FF

# 2-to-1 Mux

[1] Bit-serial ONB

2m–1

m

2m



[2] Bit-serial ONB [6] Digit-serial ONB SMPOI [6] Digit-serial ONB SMPOII [6] Digit-serial ONB SMPOI [6] Digit-serial ONB SMPOII [8] Bit-Parallel ONB

w-

2mþ 1 w(2m–1)

mþ 1 w(m/2þ 1) þm

4mþ3 3m

– –

w-

w(mþ m/2)

wm þ m

3m



h i ðw  1Þ r 2TA þ 3þ ⌈log 2 ⌉ TX

⌈m ⌉þ1 d

w-

3wm/2þ mþwþ 1

wm/2þ mþ wþ 1

3m



h i ðw  1Þ r 2TA þ 3þ ⌈log 2 ⌉ TX

⌈m ⌉þ1 d

w-

wm þwþ mþ 1

wm þ wþ mþ1

3m



h i ðw  1Þ r 2TA þ 3þ ⌈log 2 ⌉ TX

⌈m ⌉þ1 d

(3m2–mþ6)/4 2





m(m-1)/2

[9] Bit-Parallel ONB

m þ 2m–1

m



m(m-1)/2

[11] Digit-serial ONB AEDS [11] Digit-serial ONB XEDS [12] Bit-Parallel ONB

w(3m–w–2)

w(m–0.5w þ0.5)





w(2m–0.5w–1.5)

w(2m–w)





3/2m(m–1)

m2





[13] Digit-serial ONB

d(2m–1)

dm

3m



3m m 5mþ m(log 2 ) þ 8

9mþ1 – m 7mþm(log 2 ) þ 3 – m 7mþm(log 2 ) þ 3 –

[14] Bit-serial ONB #1_11 5m–3 [14] Bit-serial ONB #1_12 5m–4 [14] Bit-serial ONB #1_13 5m–4

m

5mþ m(log 2 ) þ 8 3m m 5mþ m(log 2 ) þ 8

[14] Bit-serial ONB #1_21 5m–3 [14] Bit-serial ONB #1_22 5m–4 [14] Bit-serial ONB #1_23 5m–4

m

[15] Digit-serial GNB DLGMs [15] Digit-serial GNB DLGMp [16] Digit-serial GNB

r d((m–

[17] Bit-Parallel GNB [17] Digit-serial GNB

ðd þ 1Þ ðd þ 1Þ 2 )T þ 2 )

2m



Latency (cycle)

 m  TA þ 1 þ ⌈log 2 ⌉ TX TA þ TX þTL h i ðw  1Þ r 2TA þ 3þ ⌈log 2 ⌉ TX

h

ðm  1Þ

i

h

ðmÞ

⌈m ⌉þ1 d

i

⌈m ⌉þ1 d

TA þ 1þ ⌈log 2 ⌉ TX h i ðmÞ TA þ 1þ ⌈log 2 ⌉ TX h i ðd þ 1Þ TA þ 1þ ⌈log 2 ⌉ TX

– ⌈m ⌉þ1 d

TA þ TX TA þ TX

mþ 2 mþ 2

TA þ TX

mþ 2

TA þ TX TA þ TX

mþ 2 mþ 2

TA þ TX h i T ðmÞ TA þ ⌈log 2 ⌉ þ ⌈log 2 ⌉ TX

mþ 2 ⌈m ⌉þ1 d

h i T ðd þ 1Þ ⌉ TX TA þ ⌈log 2 ⌉ þ ⌈log 2

3m



r dðm2 1Þ(T–1) þ dm

dm

3m



r T þ4 4(m2–m)

m2

2m

2m

2m

2m

pffiffiffi (1 þ3⌈ m ⌉Þm d



3m

2m





h i T ðd þ 1Þ TA þ ⌈log 2 ⌉ þ ⌈log 2 ⌉ TX h i T ðmÞ TA þ ⌈log 2 ⌉ þ ⌈log 2 ⌉ TX h i T ðd þ 1Þ TA þ ⌈log 2 ⌉ þ ⌈log 2 ⌉ TX h i T ðd þ 1Þ TA þ ⌈log 2 ⌉ þ ⌈log 2 ⌉ TX h i T ðd þ 1Þ TA þ ⌈log 2 ⌉ þ ⌈log 2 ⌉ TX  m  TA þ 6 þ 5⌈log 2 ⌉ TX

r

dm pffiffiffi m   ffiffiffi p ⌉d ð m  1 Þ d ðT  1Þ þ 1þ ⌈ m ⌉d m ⌈ d ⌉dm 2 d

pffiffimffi

r

[19] Digit-serial GNB

r (m–1)(T–1)d þ dm



36m

3 log 2

 28mþ2

dm 3m

3 log 2

40.88mlog 3  31mþ2.8

3.1mlog 3





 m  TA þ 6 þ 8⌈log 3 ⌉ TX

⌈2m3 5⌉ (4-input XOR)

2m–1(NOR), 2m (inverter) m

2m



TInv þ TN þ ⌈log 4

2m



TA þ TX þ ⌈log 4 ⌉TX4

n(2m–1)(NOR), 2nm (inverter) nm

2m

TInv þ TN þ TB þ ⌈log 4

(2m2–m)(NOR), 2m2 (inverter) m2

2m

2m, n (buffer) 2m, n (buffer) m (buffer)

2m

m (buffer)

TA þ TX þ TB þ ⌈log 4 ⌉TX4

6

2m–2, ⌈m 3 4⌉ (4-input XOR) n⌈2m3 5⌉ (4-input XOR) 2n(m–1), n⌈m 3 4⌉ (4-input XOR) 2

⌈2m

 5m ⌉ 3

(4-input XOR)

2(m2–m), ⌈

m2

 4m ⌉ 3

(4-input XOR)

6

2m

m(m þ1) ⌈m ⌉þ1 d



dm

(d(m–1)–dðd2 1Þ)(T–1) þdm

m

1

TM þ 2 þ log 2 TX h i ðm  1Þ TM þ 2 þ log 2 TX h i ðmÞ TA þ 1þ ⌈log 2 ⌉ TX

r 2d(Tm–T þ mþ1)

[18] Digit-serial GNB

[20] Bit-Parallel GNB, b ¼2 T ¼ 4 [20] Bit-Parallel GNB, b ¼3 T ¼ 4 Proposed structure (MO) Bit-serial Proposed structure (RMH) Bit-serial Proposed structure (MO) Digit-serial Proposed structure (RMH) Digit-serial Proposed structure (MO) Bit-Parallel Proposed structure (RMH) Bit-Parallel

5mþ m(log 2 ) þ 8 dm

9mþ1 – m 7mþm(log 2 ) þ 3 – m 7mþm(log 2 ) þ 3 –

Critical path delay

ð2m  1Þ

⌈m ⌉þ1 d 1 ⌈m ⌉þ1 d pffiffiffi r 2⌈ m ⌉ d ⌈m ⌉þ1 d – – m

⌉TX4

m

ðmÞ

ð2m  1Þ

⌉TX4

ð2m  1Þ

TInv þ TN þ TB þ ⌈log 4

⌈m n⌉ ⌈m n⌉

ðmÞ

TA þ TX þ TB þ ⌈log 4 ⌉TX4

ðmÞ

⌈m ⌉þ1 d

⌉TX4

1 1

Note: T is the type of the GNB; n: number of F functions; d: digit size; w = ⌈m/d⌉: number of words; TA, TX, TN, TL, TM, TB, TInv and TX4 denote the time delay of a 2-input AND gate, 2-input XOR gate, 2-input NOR gate, one-bit latch, 2-to-1 multiplexer, buffer, inverter and 4-input XOR gate, respectively; TN + TInv ≈ TA.

and by using 2-input XOR gates are shown in Fig. 15(a) and (b), respectively. The proposed low-cost 9-stage structures of the bit-serial RMH multiplier over GF(2^226) and GF(2^233), based on the proposed 4-input XOR gate and the proposed XOR–AND structure, are shown in Fig. 16(a) and (b). In the proposed 9-stage structure the path logical effort is

$$G = g_1 g_2 \cdots g_{9} = \left(\frac{1.4252}{1.22}\right)^{5} \times (1)^{4} = (1.1679)^{5} = 2.173.$$

For H = 10 the path effort is calculated as F = GBH = 21.73. To achieve minimum delay, the stage effort is calculated as $\hat{f} = \sqrt[9]{10 \times 1.1679^{5}} = 1.408$.

Starting with the output load $10C_x = 14.252$ fF, the input capacitances are computed as follows:

$$
\begin{aligned}
C_{\text{in-9}} &= C_{\text{out-9}}\,\frac{g_{9}}{\hat{f}} = 14.252\ \text{fF} \times \frac{1}{1.408} = 10.12\ \text{fF} &&\Rightarrow\ W_{\text{n-9}} = 1.823\ \mu\text{m},\ W_{\text{p-9}} = 3.64\ \mu\text{m},\\
C_{\text{in-8}} &= C_{\text{in-9}}\,\frac{g_{8}}{\hat{f}} = 10.12\ \text{fF} \times \frac{1.1679}{1.408} = 8.394\ \text{fF} &&\Rightarrow\ W_{\text{n1-8}} = W_{\text{n2-8}} = W_{\text{p1-8}} = W_{\text{p2-8}} = 0.88\ \mu\text{m},\\
C_{\text{in-7}} &= C_{\text{in-8}}\,\frac{g_{7}}{\hat{f}} = 8.394\ \text{fF} \times \frac{1}{1.408} = 5.96\ \text{fF} &&\Rightarrow\ W_{\text{n-7}} = 1.075\ \mu\text{m},\ W_{\text{p-7}} = 2.15\ \mu\text{m},
\end{aligned}
$$


Table 4. Critical path delay, time and area results for the proposed implementation of the bit-serial and bit-parallel Massey–Omura multiplier over GF(2^226) and GF(2^233).

| Methods | Field | CL/H | Critical path delay (ns) (BS, BP) | Time (ns) (BS, BP) | Area (μm²) (BS, BP) |
|---|---|---|---|---|---|
| 12-stage (without LE) | GF(2^226) | 14.252 fF/10 | (3.340, 3.627) | (754.84, 3.627) | (24206, 5555035) |
| 12-stage (without LE) | GF(2^226) | 71.26 fF/50 | (3.593, 3.880) | (812.018, 3.880) | (24206, 5555035) |
| 12-stage (without LE) | GF(2^226) | 356.3 fF/250 | (4.719, 5.020) | (1066.494, 5.020) | (24206, 5555035) |
| 12-stage (without LE) | GF(2^226) | 712.6 fF/500 | (6.101, 6.388) | (1378.826, 6.388) | (24206, 5555035) |
| 10-stage (without LE) | GF(2^226) | 14.252 fF/10 | (2.072, 2.359) | (468.272, 2.359) | (24206, 5555035) |
| 10-stage (without LE) | GF(2^226) | 71.26 fF/50 | (2.902, 3.189) | (655.852, 3.189) | (24206, 5555035) |
| 10-stage (without LE) | GF(2^226) | 356.3 fF/250 | (4.024, 4.312) | (909.424, 4.312) | (24206, 5555035) |
| 10-stage (without LE) | GF(2^226) | 712.6 fF/500 | (5.492, 5.779) | (1241.192, 5.779) | (24206, 5555035) |
| 10-stage (without LE) | GF(2^233) | 14.252 fF/10 | (2.072, 2.360) | (482.776, 2.360) | (24811, 5869736) |
| 10-stage (without LE) | GF(2^233) | 71.26 fF/50 | (2.902, 3.190) | (676.166, 3.190) | (24811, 5869736) |
| 10-stage (without LE) | GF(2^233) | 356.3 fF/250 | (4.024, 4.312) | (937.592, 4.312) | (24811, 5869736) |
| 10-stage (without LE) | GF(2^233) | 712.6 fF/500 | (5.492, 5.780) | (1279.636, 5.780) | (24811, 5869736) |
| 12-stage (with LE) | GF(2^226) | 14.252 fF/10 | (1.812, 2.099) | (409.512, 2.099) | (24510, 5623739) |
| 12-stage (with LE) | GF(2^226) | 71.26 fF/50 | (1.907, 2.194) | (430.982, 2.194) | (25664, 5884543) |
| 12-stage (with LE) | GF(2^226) | 356.3 fF/250 | (2.806, 3.093) | (634.156, 3.093) | (27393, 6275297) |
| 12-stage (with LE) | GF(2^226) | 712.6 fF/500 | (3.662, 3.948) | (827.612, 3.948) | (30122, 6892051) |
| 10-stage (with LE) | GF(2^226) | 14.252 fF/10 | (1.414, 1.701) | (319.564, 1.701) | (24613, 5647017) |
| 10-stage (with LE) | GF(2^226) | 71.26 fF/50 | (1.512, 1.799) | (341.712, 1.799) | (24865, 5703969) |
| 10-stage (with LE) | GF(2^226) | 356.3 fF/250 | (2.187, 2.474) | (494.262, 2.474) | (27714, 6347843) |
| 10-stage (with LE) | GF(2^226) | 712.6 fF/500 | (2.753, 3.040) | (622.178, 3.040) | (30563, 6991717) |
| 10-stage (with LE) | GF(2^233) | 14.252 fF/10 | (1.414, 1.702) | (329.462, 1.702) | (25237, 5968994) |
| 10-stage (with LE) | GF(2^233) | 71.26 fF/50 | (1.512, 1.780) | (352.296, 1.780) | (25499, 6030040) |
| 10-stage (with LE) | GF(2^233) | 356.3 fF/250 | (2.187, 2.475) | (509.571, 2.475) | (28358, 6696187) |
| 10-stage (with LE) | GF(2^233) | 712.6 fF/500 | (2.753, 3.041) | (641.449, 3.041) | (31217, 7362334) |

Note: BS: bit-serial, BP: bit-parallel, LE: logical effort.

Table 5. Critical path delay, time and area results for the proposed implementation of the digit-serial Massey–Omura multiplier for digit numbers (d1 = 4, d2 = 15, d3 = 57) and (d1 = 4, d2 = 15, d3 = 59) over GF(2^226) and GF(2^233), respectively.

| Methods | Field | CL/H | Critical path delay (ns) (d1, d2, d3) | Time (ns) (d1, d2, d3) | Area (μm²) (d1, d2, d3) |
|---|---|---|---|---|---|
| 12-stage (without LE) | GF(2^226) | 14.252 fF/10 | (3.550, 3.508, 3.426) | (14.2, 56.128, 195.282) | (1399116, 401944, 110484) |
| 12-stage (without LE) | GF(2^226) | 71.26 fF/50 | (3.804, 3.761, 3.6787) | (15.216, 60.176, 209.686) | (1399116, 401944, 110484) |
| 12-stage (without LE) | GF(2^226) | 356.3 fF/250 | (4.929, 4.887, 4.805) | (19.716, 78.192, 273.885) | (1399116, 401944, 110484) |
| 12-stage (without LE) | GF(2^226) | 712.6 fF/500 | (6.312, 6.269, 6.187) | (25.248, 37.614, 352.659) | (1399116, 401944, 110484) |
| 10-stage (without LE) | GF(2^226) | 14.252 fF/10 | (2.283, 2.240, 2.158) | (9.132, 35.84, 123.01) | (1399116, 401944, 110484) |
| 10-stage (without LE) | GF(2^226) | 71.26 fF/50 | (3.113, 3.070, 2.988) | (12.452, 49.12, 140.436) | (1399116, 401944, 110484) |
| 10-stage (without LE) | GF(2^226) | 356.3 fF/250 | (4.235, 4.192, 4.110) | (16.94, 67.072, 234.27) | (1399116, 401944, 110484) |
| 10-stage (without LE) | GF(2^226) | 712.6 fF/500 | (5.703, 5.660, 5.578) | (22.812, 90.56, 317.946) | (1399116, 401944, 110484) |
| 10-stage (without LE) | GF(2^233) | 14.252 fF/10 | (2.290, 2.244, 2.163) | (9.16, 35.904, 127.617) | (1484319, 412092, 113324) |
| 10-stage (without LE) | GF(2^233) | 71.26 fF/50 | (3.120, 3.074, 2.993) | (12.48, 49.184, 176.587) | (1484319, 412092, 113324) |
| 10-stage (without LE) | GF(2^233) | 356.3 fF/250 | (4.242, 4.196, 4.115) | (16.968, 67.136, 242.785) | (1484319, 412092, 113324) |
| 10-stage (without LE) | GF(2^233) | 712.6 fF/500 | (5.712, 5.664, 5.583) | (22.848, 90.624, 329.397) | (1484319, 412092, 113324) |
| 12-stage (with LE) | GF(2^226) | 14.252 fF/10 | (2.022, 1.980, 1.898) | (8.088, 31.68, 108.186) | (1416444, 406808, 111700) |
| 12-stage (with LE) | GF(2^226) | 71.26 fF/50 | (2.118, 2.075, 1.993) | (8.472, 33.2, 113.601) | (1482222, 425272, 116316) |
| 12-stage (with LE) | GF(2^226) | 356.3 fF/250 | (3.016, 2.974, 2.892) | (12.064, 47.584, 164.844) | (1580775, 452936, 123232) |
| 12-stage (with LE) | GF(2^226) | 712.6 fF/500 | (3.872, 3.830, 3.748) | (15.488, 61.28, 213.636) | (1736328, 496600, 134148) |
| 10-stage (with LE) | GF(2^226) | 14.252 fF/10 | (1.624, 1.582, 1.499) | (6.496, 25.312, 85.443) | (1422315, 408456, 112112) |
| 10-stage (with LE) | GF(2^226) | 71.26 fF/50 | (1.723, 1.680, 1.570) | (6.892, 26.88, 89.49) | (1436679, 412488, 113120) |
| 10-stage (with LE) | GF(2^226) | 356.3 fF/250 | (2.398, 2.355, 2.273) | (9.592, 37.68, 129.561) | (1599072, 458072, 124516) |
| 10-stage (with LE) | GF(2^226) | 712.6 fF/500 | (2.964, 2.921, 2.839) | (11.856, 46.736, 161.823) | (1761465, 503656, 135912) |
| 10-stage (with LE) | GF(2^233) | 14.252 fF/10 | (1.632, 1.586, 1.505) | (6.528, 25.376, 88.795) | (1509453, 418908, 115028) |
| 10-stage (with LE) | GF(2^233) | 71.26 fF/50 | (1.730, 1.684, 1.603) | (6.92, 26.944, 94.577) | (1524911, 423100, 116076) |
| 10-stage (with LE) | GF(2^233) | 356.3 fF/250 | (2.405, 2.359, 2.278) | (9.62, 37.744, 134.402) | (1693592, 468844, 127512) |
| 10-stage (with LE) | GF(2^233) | 712.6 fF/500 | (2.971, 2.925, 2.844) | (11.884, 46.8, 167.796) | (1862273, 514588, 138948) |

Note: BS: bit-serial, BP: bit-parallel, LE: logical effort.


Table 6. Critical path delay, time and area results for the proposed implementation of the bit-serial and bit-parallel RMH multiplier over GF(2^226) and GF(2^233).

| Methods | Field | CL/H | Critical path delay (ns) (BS, BP) | Time (ns) (BS, BP) | Area (μm²) (BS, BP) |
|---|---|---|---|---|---|
| 9-stage (without LE) | GF(2^226) | 14.252 fF/10 | (2.476, 2.768) | (559.576, 2.768) | (20172, 4793188) |
| 9-stage (without LE) | GF(2^226) | 71.26 fF/50 | (2.725, 3.017) | (615.85, 3.017) | (20172, 4793188) |
| 9-stage (without LE) | GF(2^226) | 356.3 fF/250 | (3.991, 4.283) | (901.966, 4.283) | (20172, 4793188) |
| 9-stage (without LE) | GF(2^226) | 712.6 fF/500 | (5.554, 5.846) | (1255.204, 5.846) | (20172, 4793188) |
| 9-stage (without LE) | GF(2^233) | 14.252 fF/10 | (2.476, 2.769) | (576.908, 2.769) | (21340, 5070313) |
| 9-stage (without LE) | GF(2^233) | 71.26 fF/50 | (2.725, 3.018) | (634.925, 3.018) | (21340, 5070313) |
| 9-stage (without LE) | GF(2^233) | 356.3 fF/250 | (3.991, 3.312) | (929.903, 3.312) | (21340, 5070313) |
| 9-stage (without LE) | GF(2^233) | 712.6 fF/500 | (5.554, 5.847) | (1294.082, 5.847) | (21340, 5070313) |
| 9-stage (with LE) | GF(2^226) | 14.252 fF/10 | (1.435, 1.7272) | (324.31, 1.7272) | (20320, 4685432) |
| 9-stage (with LE) | GF(2^226) | 71.26 fF/50 | (1.613, 1.9052) | (364.538, 1.9052) | (21453, 4941490) |
| 9-stage (with LE) | GF(2^226) | 356.3 fF/250 | (2.427, 2.7192) | (548.502, 2.7192) | (22545, 5188282) |
| 9-stage (with LE) | GF(2^226) | 712.6 fF/500 | (3.161, 3.4532) | (714.386, 3.4532) | (22962, 5282524) |
| 9-stage (with LE) | GF(2^233) | 14.252 fF/10 | (1.435, 1.7272) | (334.355, 1.7272) | (21565, 5122738) |
| 9-stage (with LE) | GF(2^233) | 71.26 fF/50 | (1.613, 1.9063) | (375.829, 1.9063) | (22728, 5393717) |
| 9-stage (with LE) | GF(2^233) | 356.3 fF/250 | (2.427, 2.7203) | (565.491, 2.7203) | (23850, 5655143) |
| 9-stage (with LE) | GF(2^233) | 712.6 fF/500 | (3.161, 3.4543) | (736.513, 3.4543) | (24297, 5759294) |
| 9-stage (with LE), Fig. 16(a) (a) | GF(2^226) | 14.252 fF/10 | 1.525 | 344.65 | 48076 |

Note: BS: bit-serial, BP: bit-parallel, LE: logical effort.
(a) For this case the results are from post-layout simulation.

Table 7. Critical path delay, time and area results for the proposed implementation of the digit-serial RMH multiplier for digit numbers (d1 = 4, d2 = 15, d3 = 57) and (d1 = 4, d2 = 15, d3 = 59) over GF(2^226) and GF(2^233), respectively.

| Methods | Field | CL/H | Critical path delay (ns) (d1, d2, d3) | Time (ns) (d1, d2, d3) | Area (μm²) (d1, d2, d3) |
|---|---|---|---|---|---|
| 9-stage (without LE) | GF(2^226) | 14.252 fF/10 | (2.687, 2.644, 2.562) | (10.748, 42.304, 146.034) | (1169178, 337400, 94348) |
| 9-stage (without LE) | GF(2^226) | 71.26 fF/50 | (2.935, 2.893, 2.812) | (11.74, 46.288, 160.284) | (1169178, 337400, 94348) |
| 9-stage (without LE) | GF(2^226) | 356.3 fF/250 | (4.202, 4.159, 4.077) | (16.808, 66.544, 232.389) | (1169178, 337400, 94348) |
| 9-stage (without LE) | GF(2^226) | 712.6 fF/500 | (5.765, 5.722, 5.640) | (23.06, 91.552, 321.48) | (1169178, 337400, 94348) |
| 9-stage (without LE) | GF(2^233) | 14.252 fF/10 | (2.695, 2.648, 2.567) | (10.78, 42.368, 151.453) | (1236174, 356508, 99440) |
| 9-stage (without LE) | GF(2^233) | 71.26 fF/50 | (2.943, 2.897, 2.816) | (11.772, 46.352, 166.144) | (1236174, 356508, 99440) |
| 9-stage (without LE) | GF(2^233) | 356.3 fF/250 | (4.209, 4.163, 4.082) | (16.836, 66.608, 240.838) | (1236174, 356508, 99440) |
| 9-stage (without LE) | GF(2^233) | 712.6 fF/500 | (5.772, 5.726, 5.645) | (23.088, 91.616, 333.055) | (1236174, 356508, 99440) |
| 9-stage (with LE) | GF(2^226) | 14.252 fF/10 | (1.646, 1.603, 1.521) | (6.584, 25.648, 86.697) | (1177614, 339768, 94940) |
| 9-stage (with LE) | GF(2^226) | 71.26 fF/50 | (1.831, 1.781, 1.699) | (7.324, 28.496, 96.843) | (1242195, 357896, 99472) |
| 9-stage (with LE) | GF(2^226) | 356.3 fF/250 | (2.638, 2.595, 2.513) | (10.552, 41.52, 143.241) | (1304439, 375368, 103840) |
| 9-stage (with LE) | GF(2^226) | 712.6 fF/500 | (3.372, 3.329, 3.247) | (13.488, 53.264, 185.079) | (1328208, 382040, 105508) |
| 9-stage (with LE) | GF(2^233) | 14.252 fF/10 | (1.653, 1.607, 1.533) | (6.612, 25.712, 90.447) | (1292805, 360156, 100340) |
| 9-stage (with LE) | GF(2^233) | 71.26 fF/50 | (1.831, 1.785, 1.704) | (7.324, 28.56, 100.536) | (1361422, 378764, 104992) |
| 9-stage (with LE) | GF(2^233) | 356.3 fF/250 | (2.645, 2.599, 2.518) | (10.58, 41.584, 148.562) | (1427620, 396716, 109480) |
| 9-stage (with LE) | GF(2^233) | 712.6 fF/500 | (3.379, 3.334, 3.252) | (13.516, 53.344, 191.868) | (1453993, 403868, 111268) |

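$$
\begin{aligned}
C_{\text{in-6}} &= C_{\text{in-7}}\,\frac{g_{6}}{\hat{f}} = 5.96\ \text{fF} \times \frac{1.1679}{1.408} = 4.947\ \text{fF} &&\Rightarrow\ W_{\text{n1-6}} = W_{\text{n2-6}} = W_{\text{p1-6}} = W_{\text{p2-6}} = 0.765\ \mu\text{m},\\
C_{\text{in-5}} &= C_{\text{in-6}}\,\frac{g_{5}}{\hat{f}} = 4.947\ \text{fF} \times \frac{1}{1.408} = 3.512\ \text{fF} &&\Rightarrow\ W_{\text{n-5}} = 0.63\ \mu\text{m},\ W_{\text{p-5}} = 1.26\ \mu\text{m},\\
C_{\text{in-4}} &= C_{\text{in-5}}\,\frac{g_{4}}{\hat{f}} = 3.512\ \text{fF} \times \frac{1.1679}{1.408} = 2.915\ \text{fF} &&\Rightarrow\ W_{\text{n1-4}} = W_{\text{n2-4}} = W_{\text{p1-4}} = W_{\text{p2-4}} = 0.45\ \mu\text{m},\\
C_{\text{in-3}} &= C_{\text{in-4}}\,\frac{g_{3}}{\hat{f}} = 2.915\ \text{fF} \times \frac{1}{1.408} = 2.07\ \text{fF} &&\Rightarrow\ W_{\text{n-3}} = 0.375\ \mu\text{m},\ W_{\text{p-3}} = 0.745\ \mu\text{m},\\
C_{\text{in-2}} &= C_{\text{in-3}}\,\frac{g_{2}}{\hat{f}} = 2.07\ \text{fF} \times \frac{1.1679}{1.408} = 1.71\ \text{fF} &&\Rightarrow\ W_{\text{n1-2}} = W_{\text{n2-2}} = W_{\text{p1-2}} = W_{\text{p2-2}} = 0.27\ \mu\text{m},\\
C_{\text{in-1}} &= C_{\text{in-2}}\,\frac{g_{1}}{\hat{f}} = 1.71\ \text{fF} \times \frac{1.1679}{1.408} = 1.425\ \text{fF} = C_{in} &&\Rightarrow\ W_{\text{n1-1}} = W_{\text{n2-1}} = W_{\text{p1-1}} = W_{\text{p2-1}} = 0.22\ \mu\text{m}.
\end{aligned}
$$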
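The backward sizing pass for the 9-stage path can be sketched as a short loop that starts from the load and multiplies by g/f for each stage in turn. The following Python sketch is illustrative only: the names are hypothetical, the stage ordering (inverters at stages 9, 7, 5 and 3, XOR-type stages elsewhere) is an assumption inferred from the reported capacitances, and the mapping from capacitance to transistor width is not modelled.

```python
G_XOR, G_INV = 1.1679, 1.0  # logical efforts quoted in the text

# Assumed logical effort of each stage, indexed by stage number (9 = output side).
STAGE_EFFORT = {9: G_INV, 8: G_XOR, 7: G_INV, 6: G_XOR, 5: G_INV,
                4: G_XOR, 3: G_INV, 2: G_XOR, 1: G_XOR}


def size_backward(c_load_ff: float, efforts: dict, f_hat: float) -> dict:
    """Return the input capacitance (fF) of every stage, working from the load back."""
    c_in = {}
    c_out = c_load_ff
    for stage in sorted(efforts, reverse=True):
        # Logical effort sizing rule: C_in = C_out * g / f_hat
        c_in[stage] = c_out * efforts[stage] / f_hat
        c_out = c_in[stage]  # this stage's input loads the previous stage
    return c_in


caps = size_backward(14.252, STAGE_EFFORT, 1.408)
for stage in sorted(caps, reverse=True):
    print(f"C_in-{stage} = {caps[stage]:.3f} fF")
```

With the rounded stage effort 1.408, the loop returns values within a few thousandths of a femtofarad of those listed in the derivation; the small differences come from rounding at intermediate steps.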

To evaluate the performance of the circuit, the layout of the 9-stage structure of the bit-serial RMH multiplier over GF(2^226) was implemented and post-layout simulation was applied. Fig. 17(a) shows the layout of the 4-input XOR gate used in the first stage of the XOR tree in the 9-stage structure of the RMH multiplier, and Fig. 17(b) shows the layout of one XOR–AND block. The layout areas of Fig. 17(a) and (b) are 15 μm × 10 μm = 150 μm² and 14.5 μm × 9 μm = 130.5 μm², respectively. The layout of the 9-stage structure of the bit-serial RMH multiplier over GF(2^226) for the case of H = 10, CL = 14.252 fF is shown in Fig. 18. The area of the layout is 238 μm × 202 μm. The result of the Monte Carlo analysis for the delay of the proposed 9-stage structure of


Table 8. Comparison of computation time and number of transistors for bit-serial multipliers.

| Methods | CL/H | Time (ns) | # Transistors | Field size m |
|---|---|---|---|---|
| [2] | – | 4448 | 5832 | 138 |
| [5] | – | 6072 | 5382 | 138 |
| [6] (b-SMPO I) | – | 5934 | 5796 | 138 |
| [38] | – | 4278 | 5790 | 138 |
| Proposed 10-stage structure (MO) | 14.252 fF/10 | 319.564 | 6916 | 226 |
| Proposed 10-stage structure (MO) | 71.26 fF/50 | 341.712 | 6916 | 226 |
| Proposed 10-stage structure (MO) | 356.3 fF/250 | 494.262 | 6916 | 226 |
| Proposed 10-stage structure (MO) | 712.6 fF/500 | 622.178 | 6916 | 226 |
| Proposed 9-stage structure (RMH) | 14.252 fF/10 | 324.31 | 5933 | 226 |
| Proposed 9-stage structure (RMH) | 71.26 fF/50 | 364.538 | 5933 | 226 |
| Proposed 9-stage structure (RMH) | 356.3 fF/250 | 548.502 | 5933 | 226 |
| Proposed 9-stage structure (RMH) | 712.6 fF/500 | 714.386 | 5933 | 226 |
| Proposed 10-stage structure (MO) | 14.252 fF/10 | 319.564 | 7026 | 233 |
| Proposed 10-stage structure (MO) | 71.26 fF/50 | 341.712 | 7026 | 233 |
| Proposed 10-stage structure (MO) | 356.3 fF/250 | 494.262 | 7026 | 233 |
| Proposed 10-stage structure (MO) | 712.6 fF/500 | 622.178 | 7026 | 233 |
| Proposed 9-stage structure (RMH) | 14.252 fF/10 | 324.31 | 6132 | 233 |
| Proposed 9-stage structure (RMH) | 71.26 fF/50 | 364.538 | 6132 | 233 |
| Proposed 9-stage structure (RMH) | 356.3 fF/250 | 548.502 | 6132 | 233 |
| Proposed 9-stage structure (RMH) | 712.6 fF/500 | 714.386 | 6132 | 233 |
| Proposed 9-stage structure (RMH), Fig. 16(a) (a) | 14.252 fF/10 | 344.65 | 5933 | 226 |

(a) For this case the results are from post-layout simulation.

the bit-serial RMH multiplier over GF(2^226) is shown in Fig. 19. In this case H = 10, CL = 14.252 fF and the number of iterations is N = 500. As the figure shows, the mean value of the delay is 1.52457 ns.

5. Results and comparison The proposed ONB multiplier structures were successfully implemented in 0.18 μm CMOS technology. In this section a comparison between these structures with applying logical effort technique and without applying this technique, and also other implementations of the ONB multipliers is presented. In Table 3, the values of hardware utilization, critical path delay and latency are provided for the proposed structures and several related works. As seen in this table the values of the hardware resources and timing parameters in the proposed implementations are reasonable compared to the other bit-serial, digit-serial and bitparallel designs. The comparison as shown in Tables 4–7 is based on parameters of critical path delay, time and area. The reported simulation results are based on schematic structure. The areas are estimated by summation of transistors area without considering the routing. Only for the case of 9-stage structure of the bit-serial RMH multiplier over GF(2226) for H ¼10 and CL ¼14.252 fF the results are obtained from post layout simulation. The implementations are presented for different electrical efforts, namely, H¼ 10, 50, 250 and 500. Table 4 shows the simulation results for the bit-serial and bit-parallel Massey–Omura multiplier over GF(2226) and GF(2233) with both applying the logical effort technique and without applying it. Also results of the digit-serial structure for digit numbers (d1 ¼4, d2 ¼15, d3 ¼57) and (d1 ¼4, d2 ¼15, d3 ¼59) over GF(2226) and GF(2233) are reported in Table 4, respectively. Here, we consider three different digit sizes, small, intermediate and large sizes for better comparison. As Tables 4 and 5 show the proposed 12-stage and 10-stage structures when applying the logical effort technique have better results compared to those of similar structures without applying the technique. Also the results of 10-stage structure are better than those of 12-stage structure over GF(2226). 
Table 6 shows the timing and hardware results for the proposed 9-stage implementation of the bit-serial and bit-parallel RMH multipliers over GF(2^226) and GF(2^233). Results of the proposed digit-serial structure for the RMH multiplier over GF(2^226) and GF(2^233), with digit sizes (d1 = 4, d2 = 15, d3 = 57) and (d1 = 4, d2 = 15, d3 = 59) respectively, are reported in Table 7. Table 8 compares the computation time and number of transistors of the proposed structures over GF(2^226) and GF(2^233) with previously reported bit-serial structures, which are designed over the smaller field GF(2^138). Even though the proposed structures operate over larger fields, their computation time is better than that of those works.
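As a rough check on the digit sizes used above, the following sketch assumes (this is not stated in this section) that a digit-serial multiplier over GF(2^m) with digit size d completes one multiplication in ceil(m/d) clock cycles, and that the computation time is the cycle count times the critical path delay. The helper names are hypothetical.

```python
def cycles(m, d):
    """Clock cycles for one multiplication, assuming ceil(m / d) iterations."""
    return -(-m // d)  # integer ceiling division


def computation_time_ns(m, d, critical_path_ns):
    """Computation time under the assumption: cycles * critical path delay."""
    return cycles(m, d) * critical_path_ns


if __name__ == "__main__":
    for m, digits in ((226, (4, 15, 57)), (233, (4, 15, 59))):
        for d in digits:
            print(f"m={m}, d={d}: {cycles(m, d)} cycles")
```

Under this assumption, the largest digit sizes d3 = 57 for m = 226 and d3 = 59 for m = 233 both correspond to four clock cycles, whereas d = 57 would need five cycles for m = 233.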

6. Conclusions

Efficient and general VLSI implementations of the bit-serial (parallel-input serial-output), digit-serial and bit-parallel (parallel-input parallel-output) optimal normal basis Massey–Omura and Reyhani-Masoleh–Hasan multipliers were presented. By using a proposed 4-input XOR gate, a proposed structure of the XOR–AND circuit, and the logical effort technique, the speed and area of the two multipliers in their bit-serial, digit-serial and bit-parallel structures over GF(2^226) and GF(2^233) were improved compared to structures designed without the logical effort technique and to previously reported designs. The proposed structures and methods are general and are suitable for high-speed hardware implementation of normal basis and polynomial basis binary finite field arithmetic operations.


References

[1] T. Beth, D. Gollman, Algorithm engineering for public key algorithms, IEEE J. Sel. Areas Comm. 7 (4) (1989) 458–465.
[2] C.W. Chiou, C.-Y. Lee, Y.-C. Yeh, Sequential type-1 optimal normal basis multiplier and multiplicative inverse in GF(2^m), Tamkang J. Sci. Eng. 13 (4) (2010) 423–432.

[3] C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J.K. Omura, I.S. Reed, VLSI architectures for computing multiplications and inverses in GF(2^m), IEEE Trans. Comput. 34 (8) (1985) 709–717.
[4] L. Gao, G.E. Sobelman, Improved VLSI designs for multiplication and inversion in GF(2^m) over normal bases, in: Proceedings of 13th Annual IEEE International ASIC/SOC Conference, 2000, pp. 97–101.
[5] A. Reyhani-Masoleh, M.A. Hasan, Low complexity sequential normal basis multipliers over GF(2^m), in: Proceedings of the 16th IEEE Symposium on Computer Arithmetic, vol. 16, 2003, pp. 188–195.
[6] A. Reyhani-Masoleh, M.A. Hasan, Low complexity word-level sequential normal basis multipliers, IEEE Trans. Comput. 54 (2005) 98–110.
[7] A. Reyhani-Masoleh, M.A. Hasan, A new construction of Massey–Omura parallel multiplier over GF(2^m), IEEE Trans. Comput. 51 (2002) 511–520.
[8] J.-S. Horng, I.-C. Jou, C.-Y. Lee, Low-complexity multiplexer-based normal basis multiplier over GF(2^m), J. Zhejiang Univ. Sci. 10 (6) (2009) 834–842.
[9] J.-S. Horng, I.-C. Jou, C.-Y. Lee, On complexity of normal basis multiplier using modified Booth's algorithm, in: Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, 24–26 August, 2007, pp. 12–17.
[10] H. Li, C.N. Zhang, Low-complexity versatile finite field multiplier in normal basis, EURASIP J. Appl. Signal Process. 9 (2002) 954–960.
[11] A. Reyhani-Masoleh, M.A. Hasan, Efficient digit-serial normal basis multipliers over GF(2^m), ACM Trans. Embed. Comput. Syst., Spec. Issue Embed. Syst. Secur. 52 (4) (2003) 428–439.
[12] C.K. Koc, B. Sunar, An efficient optimal normal basis type II multiplier over GF(2^m), IEEE Trans. Comput. 50 (1) (2001) 83–87.
[13] Y. Sukcho, J. Yeon Choi, A new word-parallel bit-serial normal basis multiplier over GF(2^m), Int. J. Control Autom. 6 (3) (2013) 209–216.
[14] Atef Ibrahim, Fayez Gebali, Turki F. Al-Somani, Systolic array architectures for Sunar–Koç optimal normal basis Type II multiplier, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23 (10) (2015) 2090–2102.
[15] A. Reyhani-Masoleh, Efficient algorithms and architectures for field multiplication using Gaussian normal bases, IEEE Trans. Comput. 55 (1) (2006) 34–47.
[16] R. Azarderakhsh, A. Reyhani-Masoleh, A modified low complexity digit-level Gaussian normal basis multiplier, in: Proceedings of the Third International Workshop Arith. Finite Fields (WAIFI), 2010, pp. 25–40.
[17] R. Azarderakhsh, A. Reyhani-Masoleh, Low-complexity multiplier architectures for single and hybrid-double multiplications in Gaussian normal bases, IEEE Trans. Comput. 62 (4) (2013) 744–757.
[18] R. Azarderakhsh, M. Mozaffari Kermani, S. Bayat-Sarmadi, C.Y. Lee, Systolic Gaussian normal basis multiplier architectures suitable for high-performance applications, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23 (9) (2014) 1969–1972.
[19] B. Rashidi, S.M. Sayedi, R.R. Farashahi, Efficient and low-complexity hardware architecture of Gaussian normal basis multiplication over GF(2^m) for elliptic curve cryptosystems, IET Circuits Devices Syst. 10 (2016) 1–10.
[20] J.S. Pan, C.Y. Lee, Y. Li, Subquadratic space complexity Gaussian normal basis multipliers over GF(2^m) based on Dickson-Karatsuba decomposition, IET Circuits Devices Syst. 9 (5) (2015) 336–342.

[21] S. Bayat-Sarmadi, M.M. Kermani, R. Azarderakhsh, C.-Y. Lee, Dual-basis superserial multipliers for secure applications and lightweight cryptographic architectures, IEEE Trans. Circuits Syst.-II: Express Briefs 61 (2) (2014) 125–129.
[22] R.C. Mullin, I.M. Onyszchuk, S.A. Vanstone, R.M. Wilson, Optimal normal bases in GF(p^m), Discret. Appl. Math. 22 (2) (1989) 149–161.
[23] H. Cohen, G. Frey, R. Avanzi, C. Doche, T. Lange, K. Nguyen, F. Vercauteren, Handbook of Elliptic and Hyperelliptic Curve Cryptography, first ed., CRC Press, Boca Raton, 2006.
[24] R. Jacob Baker, CMOS Circuit Design, Layout, and Simulation, IEEE Press Series on Microelectronic Systems, third ed., John Wiley & Sons, Inc, Hoboken, New Jersey, 2010.
[25] Ivan Sutherland, R.F. Sproull, Logical Effort: Designing for Speed on the Back of an Envelope, IEEE Advanced Research in VLSI, MIT Press, 1991.
[26] Shiv Shankar Mishra, Adarsh Kumar Agrawal, R.K. Nagaria, A comparative performance analysis of various CMOS design techniques for XOR and XNOR circuits, Int. J. Emerg. Technol. 1 (1) (2010) 1–10.
[27] S. Goel, M.A. Elgamel, Magdy A. Bayoumi, Yasser Hanafy, Design methodologies for high-performance noise-tolerant XOR-XNOR circuits, IEEE Trans. Circuits Syst.-I: Regul. Pap. 53 (4) (2006) 867–878.
[28] Mohamed Elgamel, Sumeer Goel, Magdy Bayoumi, Noise tolerant low voltage XOR-XNOR for fast arithmetic, in: Proceedings of the Great Lake Symposium VLSI, Washington DC, 28–29 April, 2003, pp. 285–288.
[29] Sumeer Goel, Mohammed A. Elgamel, M.A. Bayoumi, Novel design methodology for high-performance XOR-XNOR circuit design, in: Proceedings of the 16th Symposium Integration Circuits System Design, Brazil, September 8–11, 2003, pp. 71–76.
[30] S. Goel, S. Gollamudi, A. Kumar, M. Bayoumi, On the design of low-energy hybrid CMOS 1-bit full-adder cells, Midwest Symp. Circuits Syst. (2004) 209–212.
[31] H. Lee, G.E. Sobelman, New low-voltage circuits for XOR and XNOR, in: Proceedings of the IEEE Southeastcon, April 12–14, 1997, pp. 225–229.
[32] H. Lee, G.E. Sobelman, New XOR/XNOR and full adder circuits for low voltage, low power application, Microelectron. J. 29 (1998) 509–517.
[33] A.M. Shams, T.K. Darwish, M.A. Bayoumi, Performance analysis of low-power 1-bit CMOS full adder cells, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 10 (1) (2002) 20–29.
[34] D. Radhakrishnan, Low-voltage low-power CMOS full adder, Proc. IEE Circuits Devices Syst. 148 (2001) 19–24.
[35] Neil Weste, David Harris, CMOS VLSI Design: A Circuits and Systems Perspective, fourth ed., Addison-Wesley, 2010.
[36] Federal Information Processing Standards Publications (FIPS) 186-2, U.S. Department of Commerce/NIST: Digital Signature Standard (DSS), 2000.
[37] R. Lidl, H. Niederreiter, Introduction to Finite Fields and Their Applications, second ed., Cambridge University Press, 1997.
[38] G.B. Agnew, R.C. Mullin, I.M. Onyszchuk, S.A. Vanstone, An implementation for a fast public-key cryptosystem, J. Cryptol. 3 (1991) 63–79.

Please cite this article as: B. Rashidi, et al., An efficient and high-speed VLSI implementation of optimal normal basis multiplication over GF(2m), INTEGRATION, the VLSI journal (2016), http://dx.doi.org/10.1016/j.vlsi.2016.05.006i
