Two Normal Basis Multiplication Algorithms for GF(2^n)


TSINGHUA SCIENCE AND TECHNOLOGY, ISSN 1007-0214 02/16, pp 264-270, Volume 11, Number 3, June 2006

Two Normal Basis Multiplication Algorithms for GF(2^n)*

FAN Haining (樊海宁), LIU Duo (刘铎), DAI Yiqi (戴一奇)**

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract: For software implementations, word-level normal basis multiplication algorithms utilize the full data-path of the processor, and hence are more efficient than the bit-level multiplication algorithm presented in the IEEE standard P1363-2000. In this paper, two word-level normal basis multiplication algorithms are proposed for GF(2^n). The first algorithm is suitable for high complexity normal bases, while the second algorithm is fast for type-I optimal normal bases and low complexity normal bases. Theoretical analyses and experimental results both indicate that the presented algorithms are efficient in GF(2^233), GF(2^283), GF(2^409), and GF(2^571), which are four of the five binary fields recommended by the National Institute of Standards and Technology (NIST) for elliptic curve digital signature algorithm (ECDSA) applications.

Key words: finite field; normal basis; multiplication algorithm

Received: 2004-09-16; revised: 2005-03-31

* Supported by the National Natural Science Foundation of China (No. 90304014) and the National High-Tech Research and Development (863) Program of China (No. 2005AA114160)

** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-10-62789754

Introduction

Arithmetic operations in GF(2^n) play an important role in coding theory, computer algebra, and cryptosystems. Among the different types of field representations, the normal basis (NB) has received considerable attention on account of its efficient implementation. For portability, as well as for price reasons, it is often advantageous to realize cryptographic algorithms in software. While many constructions of very large scale integration (VLSI) NB multipliers have been proposed recently, few software-efficient NB algorithms can be found in the open literature. Fan[1] presented a software algorithm for type-I optimal normal bases (ONB) defined in Mullin et al.[2] The algorithm can be further improved if symmetry is employed[3,4]. The Reyhani-Masoleh and Hasan (RH) algorithm[4] is designed to work for all normal bases. More recently, these authors have presented a new NB multiplication algorithm for GF(2^n)[5]. This new algorithm is efficient for composite finite fields. Since XOR and AND instructions take the same number of clock cycles on most modern CPUs, the time complexity of the new algorithm is the same as that of the original RH algorithm in GF(2^n), where n is prime. All these previously mentioned algorithms are word-level and are more efficient than the bit-level NB multiplication algorithm presented in the IEEE standard P1363-2000[6]. Ning and Yin presented a generalized version (referred to hereafter as the NY algorithm)[7]. Although the NY algorithm is fast for ONBs, it is slow for nonoptimal normal bases.

In this paper, two NB multiplication algorithms for software implementation are proposed. The first algorithm (Algorithm 1) is an improvement on the RH algorithm. A theoretical analysis shows that it should be faster than the RH algorithm. Experimental results, however, show that this is not true for a few GF(2^n), e.g., GF(2^359) and GF(2^491). The reason is that Algorithm 1 uses larger lookup tables than those of the RH algorithm. Algorithm 2 is designed to keep the total number of cyclic shift operations constant. It is faster than the NY algorithm for most NBs, with the particular exception of type-II ONBs. Compared to Algorithm 2, Algorithm 1 is suitable for high complexity NBs. For example, our experimental results show that it is faster than Algorithm 2 in GF(2^283) (type 6 Gaussian NB (GNB)) and GF(2^571) (type 10 GNB). We also compare our new NB algorithms to the polynomial basis multiplication algorithm presented in Koc and Acar[8], i.e., the finite field analogue of the Montgomery multiplication for integers. Experimental results show that in some GF(2^n), where ONBs or low complexity normal bases exist, Algorithm 2 is faster than the Montgomery algorithm. Examples are type 4 GNBs in GF(2^577), GF(2^673), and GF(2^739).

1 Preliminaries

The field GF(2^n) is the n-dimensional extension of the field GF(2) and is often viewed as a vector space defined over GF(2). Let β be an element of GF(2^n). For simplicity, denote β^{2^i} by β_i. If the subset N = {β_0, β_1, β_2, ..., β_{n−1}} of GF(2^n) is linearly independent over GF(2), then N is called a normal basis of GF(2^n) over GF(2). A field element A can be represented by a binary vector (a_0, a_1, ..., a_{n−1}) with respect to this basis as A = ∑_{i=0}^{n−1} a_i β_i, where a_i ∈ GF(2) and i = 0, 1, ..., n−1.

For 1 ≤ i ≤ n−1, let the coordinate representation ∑_{j=0}^{n−1} φ_{i,j} β_j be the expansion of β_0 β_i with respect to N, where φ_{i,j} ∈ GF(2). Let S_i = {j | φ_{i,j} = 1} and h_i = |S_i|. We may rewrite S_i as S_i = {w_{i,1}, w_{i,2}, ..., w_{i,h_i}}, where

    0 ≤ w_{i,1} < w_{i,2} < ... < w_{i,h_i} ≤ n−1                                  (1)

Clearly, β_0 β_i = ∑_{k=1}^{h_i} β_{w_{i,k}}. Note that for a particular normal basis N, the representation of β_0 β_i is fixed, and so is w_{i,k}.

Reyhani-Masoleh and Hasan[4] presented a word-level normal basis multiplication algorithm over GF(2^n). The algorithm includes two similar versions: one is for odd n and the other is for even n. In this paper, we assume that n is odd, unless otherwise stated. Let v = (n−1)/2; throughout, an index x denotes its non-negative residue mod n. D = AB can be computed as[4]

    D = AB = ∑_{i=0}^{n−1} ∑_{j=0}^{n−1} a_i b_j β_i β_j
           = ∑_{i=0}^{n−1} a_i b_i β_{i+1} + ∑_{i=1}^{v} ∑_{j=0}^{n−1} (a_{i+j} b_j + b_{i+j} a_j) (β_i β_0)^{2^j}
           = ∑_{i=0}^{n−1} a_i b_i β_{i+1} + ∑_{i=1}^{v} ∑_{k=1}^{h_i} [ ∑_{j=0}^{n−1} (a_{i+j} b_j + b_{i+j} a_j) β_{j+w_{i,k}} ]      (2)

For notational simplicity, denote A^{2^i} by A_i. If we define B & A_{n−i} as B & A_{n−i} = (a_i b_0, a_{i+1} b_1, ..., a_{i+n−1} b_{n−1}) and treat it as the field element, then ∑_{j=0}^{n−1} a_{i+j} b_j β_{j+w_{i,k}} = (B & A_{n−i})_{w_{i,k}}. So, Eq. (2) may be rewritten as

    D = (B & A)_1 + ∑_{i=1}^{v} ∑_{k=1}^{h_i} [ (B & A_{n−i})_{w_{i,k}} + (A & B_{n−i})_{w_{i,k}} ]
      = (B & A)_1 + ∑_{i=1}^{v} ∑_{k∈S_i} [ (B & A_{n−i}) + (A & B_{n−i}) ]_k                              (3)

Furthermore, letting R_i = (B & A_{n−i}) + (A & B_{n−i}), we have

    D = (B & A)_1 + ∑_{i=1}^{v} ∑_{k∈S_i} (R_i)_k                                  (4)

Similarly, for even n, set v = n/2 and

    D = (B & A)_1 + ∑_{i=1}^{v} ∑_{k∈S_i} (R_i)_k,

where R_i is defined as

    R_i = (B & A_{n−i}) + (A & B_{n−i}),  1 ≤ i ≤ v−1;
    R_v = A & B_v                                                                  (5)

Based on Eq. (4), the following multiplication algorithm was first presented[4].

RH multiplication algorithm for odd n
INPUT: A, B, S_i, where 1 ≤ i ≤ v.
OUTPUT: D = AB.
S1: D := (A & B) >> 1;
S2: UA := A; UB := B;
S3: for i = 1 to v do {
S4:     UA := UA << 1; UB := UB << 1;
S5:     R := (A & UB) ⊕ (B & UA);
S6:     for each k ∈ S_i do D := D ⊕ (R >> k); }
S7: Output D.

Notes: (1) A & B = (a_0 b_0, a_1 b_1, ..., a_{n−1} b_{n−1}). (2) ⊕ denotes the addition in GF(2^n). (3) A << i (resp. A >> i) denotes the i-fold left (resp. right) cyclic shift of the coordinates of A.

Obviously, the total number of cyclic shift operations performed in S4 is n−1. Before introducing the first new algorithm, we show that this number can be further reduced for large values of n.
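To make the cost of a cyclic shift concrete, the following C sketch shows one way to realize the i-fold right cyclic shift A >> i on an n-bit coordinate vector held in 32-bit words. It is illustrative only: the helper names and the fixed example size n = 233 are assumptions, not the implementation used for the timings reported later. Each rotation takes two word-level shift passes and one OR pass over the vector, which is consistent with the estimate ρ ≈ 4 n-bit operations per cyclic shift used in Section 4.

    #include <stdint.h>
    #include <string.h>

    #define N  233                      /* example field size n              */
    #define Z  32                       /* word size z                       */
    #define W  ((N + Z - 1) / Z)        /* words per n-bit coordinate vector */

    /* Logical right shift of an n-bit vector by k bits (0 < k < N).
       Coordinate a_i is bit (i % 32) of word i/32.                          */
    static void shr_bits(uint32_t r[W], const uint32_t x[W], int k)
    {
        int wk = k / Z, bk = k % Z, i;
        for (i = 0; i < W; i++) {
            uint32_t lo = (i + wk < W)     ? x[i + wk]     : 0;
            uint32_t hi = (i + wk + 1 < W) ? x[i + wk + 1] : 0;
            r[i] = bk ? (lo >> bk) | (hi << (Z - bk)) : lo;
        }
    }

    /* Logical left shift of an n-bit vector by k bits (0 < k < N).          */
    static void shl_bits(uint32_t r[W], const uint32_t x[W], int k)
    {
        int wk = k / Z, bk = k % Z, i;
        for (i = W - 1; i >= 0; i--) {
            uint32_t hi = (i - wk >= 0)     ? x[i - wk]     : 0;
            uint32_t lo = (i - wk - 1 >= 0) ? x[i - wk - 1] : 0;
            r[i] = bk ? (hi << bk) | (lo >> (Z - bk)) : hi;
        }
        if (N % Z)                        /* clear the unused top bits        */
            r[W - 1] &= (1u << (N % Z)) - 1u;
    }

    /* k-fold right cyclic shift of the coordinates, i.e., A >> k above:
       result bit i = source bit (i - k) mod N.  The left cyclic shift
       A << k is obtained by calling rot_bits with N - k.                    */
    static void rot_bits(uint32_t r[W], const uint32_t x[W], int k)
    {
        uint32_t t1[W], t2[W];
        int i;
        if (k == 0) { memcpy(r, x, sizeof t1); return; }
        shl_bits(t1, x, k);        /* coordinates 0..n-1-k move up by k       */
        shr_bits(t2, x, N - k);    /* coordinates n-k..n-1 wrap to the bottom */
        for (i = 0; i < W; i++)
            r[i] = t1[i] | t2[i];
    }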


Let z be the full width of the data-path of a general-purpose processor, e.g., z = 32 for a Pentium CPU. We assume z < n. For 0 ≤ j < z, the array DA_j holds the coordinates of A, cyclically extended to n+v bits and pre-shifted by j bit positions, and DA_j is stored in ⌈(n+v)/z⌉ successive computer words. So for 1 ≤ i ≤ v, A_{n−i} is stored in ⌈n/z⌉ successive computer words starting from DA[s][t] and ending at DA[s][t+⌈n/z⌉−1], where t = ⌊i/z⌋, s = i && (z−1), and && denotes integer bit-wise AND. That is to say, the two indices of A_{n−i} can be computed at the cost of one binary shift (t = ⌊i/z⌋) and one bit-wise AND. Moreover, the starting address of A_{n−i} may be calculated in the precomputation procedure. The arrays DB_j are defined similarly. Clearly, the time complexity to compute DA_j and DB_j (0 ≤ j < z) is about 3z n-bit cyclic shift operations, and 3zn bits are needed to store these arrays. We call the modified RH algorithm using the precomputation tables DA and DB the improved RH algorithm.

The time complexity of the original RH algorithm was given by Reyhani-Masoleh and Hasan[4]. The algorithm requires n n-bit AND operations and (n−1)/2 + (C_N−1)/2 = (C_N+n−2)/2 n-bit XOR operations, where C_N is the number of nonzero φ_{i,j}, i.e., the complexity of the normal basis N (see Eq. (12)). The number of n-bit cyclic shift operations of the original RH algorithm is equal to (C_N+2n−1)/2. While the original and improved RH algorithms require the same number of XOR and AND operations, the improved algorithm requires only (C_N+1)/2 + 3z cyclic shift operations. Thus, it is faster than the original RH algorithm when n > 3z+1.
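The indexing just described can be sketched in C as follows; the table name DA, its layout, and the example size are assumptions for illustration, but the index computation is exactly the one binary shift and one bit-wise AND mentioned above.

    #include <stdint.h>

    #define N   233                       /* example field size n            */
    #define Z   32
    #define V   ((N - 1) / 2)
    #define WDA ((N + V + Z - 1) / Z)     /* words per (n+v)-bit table row   */

    /* DA[s], 0 <= s < z, holds the coordinates of A cyclically extended to
       n+v bits and pre-shifted by s bit positions (same layout for DB).     */
    extern uint32_t DA[Z][WDA];

    /* Return a pointer to the ceil(n/z) successive words holding A_{n-i},
       1 <= i <= v: one binary shift and one bit-wise AND, as in the text.   */
    static const uint32_t *locate_An_minus_i(int i)
    {
        int t = i >> 5;          /* t = floor(i / z)  (z = 32)               */
        int s = i & (Z - 1);     /* s = i && (z - 1)                         */
        return &DA[s][t];
    }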

2 Algorithm 1

The first algorithm is an improvement on the original RH algorithm. The idea is based on the following observation on Eq. (4). For some 1 ≤ i_1 < ... < i_e ≤ v, the set S_{i_1} ∩ ... ∩ S_{i_e} may not be empty. Thus, for each k ∈ S_{i_1} ∩ ... ∩ S_{i_e}, the corresponding terms of Eq. (4) can be computed by

    ∑_{i=i_1,...,i_e} (R_i)_k = ( ∑_{i=i_1,...,i_e} R_i )_k                        (6)

The saving of the k-fold cyclic shift operation is obvious: the left side of Eq. (6) needs e such operations, while the right side needs only 1. The correctness of this method is based on the fact that we can interchange the order of summation in the identity Eq. (4). Since 0 ≤ k ≤ n−1, we have

    D = (B & A)_1 + ∑_{i=1}^{v} ∑_{k∈S_i} (R_i)_k
      = (B & A)_1 + ∑_{k=0}^{n−1} ( ∑_{i: 1≤i≤v and k∈S_i} R_i )_k                 (7)

Similarly, for even n, we set v = n/2 and have

    D = (B & A)_1 + ∑_{i=1}^{v} ∑_{k∈S_i} (R_i)_k
      = (B & A)_1 + ∑_{k=0}^{n−1} ( ∑_{i: 1≤i≤v and k∈S_i} R_i )_k,

where R_i is defined in Eq. (5).

Based on Eq. (7) and the method to compute DA_i and DB_i for 0 ≤ i < z introduced above, we now present Algorithm 1 for odd values of n. For each 0 ≤ k ≤ n−1, the following precomputation procedure finds all i such that 1 ≤ i ≤ v and k ∈ S_i.

Precomputation
INPUT: n, S_i, where 1 ≤ i ≤ v.
OUTPUT: e_k and m[k][j], where 0 ≤ k ≤ n−1 and 0 ≤ j ≤ e_k − 1.
S1: for k = 0 to n−1 do e_k := 0;
S2: for i = 1 to v do {
S3:     for each k ∈ S_i do {
S4:         m[k][e_k] := i;
S5:         e_k := e_k + 1; } }

This procedure outputs e_k and m[k][j], where 0 ≤ k ≤ n−1 and 0 ≤ j ≤ e_k − 1. e_k is the total number of i such that 1 ≤ i ≤ v and k ∈ S_i, and m[k][0] to m[k][e_k−1] store these i's, i.e., k ∈ S_{m[k][j]} for 0 ≤ j ≤ e_k − 1.
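A direct C transcription of this precomputation might look as follows; the array names e[] and m[][] mirror e_k and m[k][j], and the bounds MAX_N and MAX_TERMS are assumptions.

    #define MAX_N     1024        /* assumed upper bound on n                 */
    #define MAX_TERMS 64          /* assumed upper bound on any e_k           */

    int e[MAX_N];                 /* e[k] = e_k                               */
    int m[MAX_N][MAX_TERMS];      /* m[k][j], 0 <= j <= e_k - 1               */

    /* S[i] lists the h[i] elements w_{i,1} < ... < w_{i,h_i} of S_i.         */
    void precompute(int n, int v, int *const S[], const int h[])
    {
        int i, j, k;
        for (k = 0; k < n; k++)               /* S1 */
            e[k] = 0;
        for (i = 1; i <= v; i++)              /* S2 */
            for (j = 0; j < h[i]; j++) {      /* S3: for each k in S_i        */
                k = S[i][j];
                m[k][e[k]] = i;               /* S4 */
                e[k] = e[k] + 1;              /* S5 */
            }
    }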

Multiplication algorithm 1 for odd n
INPUT: A, B, e_k, and m[k][j], 0 ≤ k ≤ n−1, 0 ≤ j ≤ e_k − 1.
OUTPUT: D = AB.
S1: Compute DA_i and DB_i for 0 ≤ i < z;
S2: D := A_1 & B_1;
S3: for i = 1 to v do R[i] := (B & A_{n−i}) ⊕ (A & B_{n−i});
S4: for k = 0 to n−1 do
S5:     if e_k > 0 then {
S6:         C := R[m[k][0]];
S7:         for j = 1 to e_k−1 do C := C ⊕ R[m[k][j]];
S8:         D := D ⊕ (C >> k); }
S9: Output D.

Algorithm 1 may also be implemented without computing the arrays DA and DB. In this implementation, each R[i] is computed using statements similar to S4 and S5 of the RH algorithm[4], and the total number of cyclic shift operations is at most 2n.
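For illustration, a C sketch of steps S2-S9 of Algorithm 1 is given below. The vector helpers vec_and, vec_xor, and rot_bits (the rotation sketched in Section 1), the fixed example size, and the array layout are assumptions; e[] and m[][] come from the precomputation above, and Ai[i]/Bi[i] are assumed to hold the coordinates of A_i and B_i.

    #include <stdint.h>
    #include <string.h>

    #define N         233                 /* example field size n (odd)       */
    #define V         ((N - 1) / 2)
    #define W         ((N + 31) / 32)     /* words per n-bit vector           */
    #define MAX_TERMS 64                  /* assumed bound on e_k             */

    /* assumed helpers over n-bit vectors of W words (see the earlier sketch) */
    void vec_and(uint32_t r[W], const uint32_t x[W], const uint32_t y[W]);
    void vec_xor(uint32_t r[W], const uint32_t x[W], const uint32_t y[W]);
    void rot_bits(uint32_t r[W], const uint32_t x[W], int k);  /* r = x >> k  */

    /* Ai[i] and Bi[i] hold the coordinates of A_i = A^(2^i) and B_i.         */
    void multiply_alg1(uint32_t D[W], uint32_t Ai[][W], uint32_t Bi[][W],
                       const int e[], int m[][MAX_TERMS])
    {
        static uint32_t R[V + 1][W];      /* R[i], 1 <= i <= v                */
        uint32_t C[W], T[W];
        int i, j, k;

        vec_and(D, Ai[1], Bi[1]);                     /* S2: D := A_1 & B_1   */
        for (i = 1; i <= V; i++) {                    /* S3                   */
            vec_and(R[i], Bi[0], Ai[N - i]);          /*   B & A_{n-i}        */
            vec_and(T,    Ai[0], Bi[N - i]);          /*   A & B_{n-i}        */
            vec_xor(R[i], R[i], T);
        }
        for (k = 0; k < N; k++)                       /* S4                   */
            if (e[k] > 0) {                           /* S5                   */
                memcpy(C, R[m[k][0]], sizeof C);      /* S6                   */
                for (j = 1; j < e[k]; j++)            /* S7                   */
                    vec_xor(C, C, R[m[k][j]]);
                rot_bits(T, C, k);                    /* S8: D := D ⊕ (C >> k) */
                vec_xor(D, D, T);
            }
    }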

3 Algorithm 2

From the definition of A & B, we know that

    (A & B)_i = (A & B)^{2^i} = (a_{n−i} b_{n−i}, a_{n−i+1} b_{n−i+1}, ..., a_{n−1} b_{n−1}, a_0 b_0, ..., a_{n−i−1} b_{n−i−1}) = A_i & B_i.

Thus, Eq. (3) can be rewritten as

    D = (B_1 & A_1) + ∑_{i=1}^{v} ∑_{k=1}^{h_i} [ (B_{w_{i,k}} & A_{w_{i,k}−i}) + (A_{w_{i,k}} & B_{w_{i,k}−i}) ]      (8)

Since ∑_{i=1}^{v} h_i = (C_N − 1)/2, D can be computed by Eq. (8) at the cost of approximately C_N AND (&) and C_N XOR (⊕) operations if the A_i and B_i are available, where 0 ≤ i ≤ n−1. Note that w_{i,s} ≠ w_{i,t} for a given i, where 1 ≤ i ≤ v and 1 ≤ s ≠ t ≤ h_i. Thus, Eq. (8) can be rewritten as

    D = (B_1 & A_1) + ∑_{w=0}^{n−1} ∑_{i: 1≤i≤v and w=w_{i,k} for some k, 1≤k≤h_i} [ (B_w & A_{w−i}) + (A_w & B_{w−i}) ]
      = (B_1 & A_1) + ∑_{w=0}^{n−1} [ ( A_w & ∑_{i: 1≤i≤v and w=w_{i,k} for some k, 1≤k≤h_i} B_{w−i} )
                                      + ( B_w & ∑_{i: 1≤i≤v and w=w_{i,k} for some k, 1≤k≤h_i} A_{w−i} ) ]      (9)

Obviously, the number of & operations in Eq. (9) is 2n+1. Thus, Eq. (9) is faster than Eq. (8) for nonoptimal normal bases. Similarly, for even n, we set v = n/2 and have

    D = (B_1 & A_1) + ∑_{i=1}^{v−1} ∑_{k=1}^{h_i} [ (A_{w_{i,k}} & B_{w_{i,k}−i}) + (B_{w_{i,k}} & A_{w_{i,k}−i}) ] + ∑_{k=1}^{h_v} (A_{w_{v,k}} & B_{w_{v,k}−v})
      = (B_1 & A_1) + ∑_{w=0}^{n−1} [ ( A_w & ∑_{i: 1≤i≤v and w=w_{i,k} for some k, 1≤k≤h_i} B_{w−i} )
                                      ⊕ ( B_w & ∑_{i: 1≤i≤v−1 and w=w_{i,k} for some k, 1≤k≤h_i} A_{w−i} ) ]      (10)

Especially, for type-I ONBs, it is well known that β_0 β_v = 1 ∈ GF(2) and β_0 β_i = β_j for some 0 ≤ j ≤ n−1, where 1 ≤ i ≤ v−1. Thus, Eq. (10) can be simplified if the Hamming weight method[1] is used:

    D = (B_1 & A_1) + ∑_{i=1}^{v−1} [ (B_{w_{i,1}} & A_{w_{i,1}−i}) + (A_{w_{i,1}} & B_{w_{i,1}−i}) ] + Hamming-weight(B & A_v)      (11)

A_w = A_{n−(n−w)} can be computed by the method introduced in Section 1, where 0 ≤ w ≤ n−1. Here, the array DA is defined as DA_0 = (a_0, a_1, ..., a_{n−1}, a_0, a_1, ..., a_{n−1}) and DA_j = (a_j, ..., a_{n−1}, a_0, a_1, ..., a_{n−1}, a_0, ..., a_{j−1}), 1 ≤ j < z, i.e., they are 2n-bit vectors. The array DB is defined in a similar way. The time complexity to compute DA_j and DB_j, 0 ≤ j < z, is therefore approximately 4z n-bit cyclic shift operations, and 4zn bits are needed to store these arrays.
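The 2n-bit tables can be pictured in C as follows. The construction shown is a naive bit-by-bit sketch for clarity (names and the example size are assumptions); an actual implementation would build DA_{j+1} from DA_j with one 2n-bit shift, which is what the 4z-shift count above reflects.

    #include <stdint.h>
    #include <string.h>

    #define N   233                       /* example field size n             */
    #define Z   32
    #define W2  ((2 * N + Z - 1) / Z)     /* words per 2n-bit table row       */

    static uint32_t DA[Z][W2];            /* DB would be built the same way   */

    static int  get_bit(const uint32_t *x, int i) { return (int)((x[i / Z] >> (i % Z)) & 1u); }
    static void set_bit(uint32_t *x, int i)       { x[i / Z] |= 1u << (i % Z); }

    /* Bit i of DA[j] is the coordinate a_{(i+j) mod n}, 0 <= i < 2n, i.e.,
       the 2n-bit vector (a_j,...,a_{n-1},a_0,...,a_{n-1},a_0,...,a_{j-1}).   */
    static void build_DA(const uint32_t A[(N + Z - 1) / Z])
    {
        int j, i;
        memset(DA, 0, sizeof DA);
        for (j = 0; j < Z; j++)
            for (i = 0; i < 2 * N; i++)
                if (get_bit(A, (i + j) % N))
                    set_bit(DA[j], i);
    }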

Based on Eq. (9), we now present Algorithm 2 for odd values of n. For each 0 ≤ w ≤ n−1, the following precomputation procedure is implemented to find all i such that 1 ≤ i ≤ v and w = w_{i,k} for some k, where 1 ≤ k ≤ h_i.

Precomputation
INPUT: n, S_i, where 1 ≤ i ≤ v.
OUTPUT: e_w and m[w][j], where 0 ≤ j ≤ e_w − 1.
S1: for w = 0 to n−1 do e_w := 0;
S2: for i = 1 to v do {
S3:     for each w ∈ S_i do {
S4:         m[w][e_w] := i;
S5:         e_w := e_w + 1; } }

This procedure outputs e_w and m[w][j], where 0 ≤ w ≤ n−1 and 0 ≤ j ≤ e_w − 1. e_w is the total number of i such that 1 ≤ i ≤ v and w = w_{i,k}, and m[w][0] to m[w][e_w−1] store these i's, i.e., w ∈ S_{m[w][j]} for 0 ≤ j ≤ e_w − 1.

Multiplication algorithm 2 for odd n
INPUT: A, B, e_w, and m[w][j], where 0 ≤ w ≤ n−1 and 0 ≤ j ≤ e_w − 1.
OUTPUT: D = AB.
S1: Compute DA_i and DB_i for 0 ≤ i < z;
S2: D := A_1 & B_1;
S3: for w = 0 to n−1 do
S4:     if e_w > 0 then {
S5:         UA := A_{w−m[w][0]}; UB := B_{w−m[w][0]};
S6:         for j = 1 to e_w−1 do { UA := UA ⊕ A_{w−m[w][j]}; UB := UB ⊕ B_{w−m[w][j]}; }
S7:         D := D ⊕ (B_w & UA) ⊕ (A_w & UB); }
S8: Output D.

For each 0 ≤ w ≤ n−1, A_w = A_{n−(n−w)} is stored in ⌈n/z⌉ successive computer words starting from DA[s][t] and ending at DA[s][t+⌈n/z⌉−1], where t = ⌊(n−w)/z⌋, s = (n−w) && (z−1), and && denotes integer bit-wise AND. In our implementation, these address computations are performed in the precomputation procedure, and the starting addresses of A_w, A, B_w, and B are stored sequentially in a one-dimensional array for 0 ≤ w ≤ n−1.
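Putting the pieces together, a C sketch of the main loop of Algorithm 2 is given below. The helper locate() performs exactly the address computation described above (t = ⌊(n−w)/z⌋, s = (n−w) && (z−1)); the table and array names, the fixed example size, and the final masking of the unused top bits are assumptions rather than the authors' code.

    #include <stdint.h>
    #include <string.h>

    #define N         233                  /* example field size n (odd)      */
    #define Z         32
    #define W         ((N + Z - 1) / Z)    /* words per n-bit vector          */
    #define W2        ((2 * N + Z - 1) / Z)
    #define MAX_TERMS 64                   /* assumed bound on e_w            */

    extern uint32_t DA[Z][W2], DB[Z][W2];  /* 2n-bit tables built as above    */
    extern int ew[N], mw[N][MAX_TERMS];    /* from the precomputation         */

    /* Pointer to the W words holding X_x inside table T: the bit offset is
       n - x, so t = floor((n-x)/z) and s = (n-x) && (z-1), as in the text.   */
    static const uint32_t *locate(uint32_t T[Z][W2], int x)
    {
        int off = N - x;                   /* 1 <= off <= n for 0 <= x <= n-1 */
        return &T[off & (Z - 1)][off >> 5];
    }

    void multiply_alg2(uint32_t D[W])      /* D := AB following Eq. (9)       */
    {
        uint32_t UA[W], UB[W];
        const uint32_t *a, *b;
        int w, i, j, k;

        a = locate(DA, 1); b = locate(DB, 1);
        for (k = 0; k < W; k++) D[k] = a[k] & b[k];          /* S2: A_1 & B_1 */

        for (w = 0; w < N; w++)                              /* S3            */
            if (ew[w] > 0) {                                 /* S4            */
                i = (w - mw[w][0] + N) % N;                  /* S5            */
                memcpy(UA, locate(DA, i), sizeof UA);
                memcpy(UB, locate(DB, i), sizeof UB);
                for (j = 1; j < ew[w]; j++) {                /* S6            */
                    i = (w - mw[w][j] + N) % N;
                    a = locate(DA, i); b = locate(DB, i);
                    for (k = 0; k < W; k++) { UA[k] ^= a[k]; UB[k] ^= b[k]; }
                }
                a = locate(DA, w); b = locate(DB, w);        /* S7            */
                for (k = 0; k < W; k++)
                    D[k] ^= (b[k] & UA[k]) ^ (a[k] & UB[k]);
            }
        if (N % Z)                          /* discard the bits above n-1     */
            D[W - 1] &= (1u << (N % Z)) - 1u;
    }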

4 Analysis and Comparison

We implement these algorithms in ANSI C using the Microsoft Visual C++ 6.0 compiler and test them on two computers: 1) an IBM ThinkPad 770X notebook with a 300-MHz Pentium II CPU running Windows NT 4.0; 2) a PC-compatible computer with a 450-MHz Pentium III CPU running Windows 2000. Both sets of experimental results serve to validate the previous conclusions, which were based primarily on theoretical considerations. Timings listed in this paper are obtained on the first computer.

We first determine the time complexity of Algorithm 1. Since

    ∑_{k=0}^{n−1} e_k = ∑_{i=1}^{v} h_i = (C_N − 1)/2,

the total number of XOR operations in lines S7 and S8 is (C_N − 1)/2. Thus, the total number of XOR operations in Algorithm 1 is (n−1)/2 + (C_N − 1)/2 = (C_N + n − 2)/2. Obviously, Algorithm 1 requires n AND operations. Thus, the two RH algorithms and Algorithm 1 require the same number of XOR and AND operations. It is well known that C_N ≥ 2n−1; thus the total number of cyclic shift operations in the improved RH algorithm is at least n + 3z. Since e_k may be zero for some k, one can see that the total number of cyclic shift operations in Algorithm 1 is at most n + 3z. Obviously, Algorithm 1 is faster than the improved RH algorithm for nonoptimal normal bases.

For 0 ≤ i ≤ n−1, let β_0 β_i = ∑_{j=0}^{n−1} φ_{i,j} β_j be the expansion of β_0 β_i with respect to the normal basis generated by β, where φ_{i,j} ∈ GF(2). The following matrix was defined by Mullin et al.[2]:

    T_0 = (φ_{i,j})_{0≤i≤n−1, 0≤j≤n−1}                                  (12)

For a type-II ONB, the matrix T_0 is symmetric. Consequently, the probability that e_k = 0 is 0.25 for type-II ONBs in Algorithm 1. Our experiments show that for 100

Since the computation of the starting addresses of A_w, A, B_w, and B in Algorithm 2 may be performed in the precomputation procedure, it is easy to determine the time complexity of Algorithm 2, namely 4z cyclic shift operations, 2n AND operations, and C_N XOR operations. Table 1 compares the time complexity of the NB algorithms described in this paper for nonoptimal normal bases in GF(2^n), where n is odd.

Table 1  Comparison of NB multiplication algorithms for nonoptimal normal bases

                          XOR             AND    << or >>
RH algorithm              (C_N+n−2)/2     n      (C_N+2n−1)/2
Improved RH algorithm     (C_N+n−2)/2     n      (C_N+1)/2 + 3z
Algorithm 1               (C_N+n−2)/2     n      < n+3z
Algorithm 2               C_N             2n     4z

We assume that the general-purpose processor can perform one n-bit XOR or AND using one n-bit operation. We also assume that one cyclic shift operation needs ρ n-bit operations[4]. Our experiments and Reyhani-Masoleh and Hasan[4] show that the value of ρ is typically 4 for the C programming language if only simple logical instructions, such as AND, SHIFT, and OR, are used to emulate a k-fold cyclic shift. When ρ = 4 and z = 32, we may deduce that Algorithm 1 is faster than Algorithm 2 if C_N > 7n−256. Thus, for high complexity NBs, Algorithm 1 is theoretically the fastest of these NB algorithms. The experimental results listed in Table 4 confirm this conclusion.

For type-I ONBs, Eq. (11) requires about n XOR, n AND, 4z cyclic shift operations, and one calculation of the Hamming weight. The Hamming weight of A can be computed by a lookup table. As an example, for a table with 2^8 entries on a 32-bit computer, our experimental results show that the cost to compute the Hamming weight of A is no more than 4 times that of a field addition operation for n = 162, 418, and 562.

The difference between Algorithm 2 and the NY algorithm is that a different multiplication matrix is used, i.e., Algorithm 2 uses the matrix T_0 defined in Eq. (12), whereas the NY algorithm uses the matrix M defined in Annex 6.3 of Ref. [6]. Since no description of the precomputation procedure was presented in Ref. [7] (part of the NY algorithm was described in a patent application), we assume that the method introduced in Section 1 is used to perform this precomputation procedure.
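The 2^8-entry lookup table mentioned above can be realized as in the following sketch (an assumed layout; it requires the unused high bits of the top word to be zero).

    #include <stdint.h>

    static unsigned char wt8[256];        /* wt8[b] = number of 1-bits in b   */

    static void init_wt8(void)
    {
        int b;
        for (b = 0; b < 256; b++)
            wt8[b] = (unsigned char)((b & 1) + wt8[b >> 1]);
    }

    /* Hamming weight of an n-bit vector stored in nw 32-bit words,
       processed one byte at a time through the table.  Eq. (11) only
       needs the weight mod 2.                                               */
    static unsigned hamming_weight(const uint32_t *x, int nw)
    {
        unsigned s = 0;
        int i;
        for (i = 0; i < nw; i++) {
            uint32_t t = x[i];
            s += wt8[t & 0xff] + wt8[(t >> 8) & 0xff]
               + wt8[(t >> 16) & 0xff] + wt8[t >> 24];
        }
        return s;
    }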

For the NY algorithm, DA_j and DB_j are defined as z⌈n/z⌉-bit vectors, and thus the total number of cyclic shift operations is about 2z. Based on this assumption, it is easy to see that the fastest NY algorithm, Algorithm 4 of Ning and Yin[7], requires about 2n n-bit XOR, n n-bit AND, and 2z n-bit cyclic shift operations. Therefore, the theoretical analysis shows that Eq. (11) is faster than the NY algorithm for n > 260, assuming that ρ = 4 and z = 32. Our experiments confirm this conclusion. Table 2 compares the time complexity of Eq. (11) and the fastest NY algorithm for type-I ONBs. Timings of some type-I ONBs are listed in Table 3.

Table 2  Comparison of Eq. (11) and the fastest NY algorithm for type-I ONBs

                XOR     AND    << or >>
Eq. (11)        n+4     n      4z
NY algorithm    2n      n      2z

Table 3  Timing for some type-I ONBs (µs)

                  GF(2^162)  GF(2^226)  GF(2^292)  GF(2^418)  GF(2^562)
Eq. (11)          31         49         67         122        195
NY algorithm 4    31         49         69         136        216

We now compare these NB algorithms to the polynomial basis multiplication algorithm[8], i.e., the finite field analogue of the Montgomery multiplication for integers. Since this method is faster than the standard polynomial basis multiplication algorithm of Lopez and Dahab[9], we only consider the Montgomery multiplication algorithm. For simplicity, the multiplication algorithm is implemented in GF(2^k) instead of GF(2^n), where k = w⌈n/w⌉. Koc and Acar[8] show that the case w = 8 results in the fastest implementation on modern 32-bit computers. Consequently, we also select w = 8 and employ the table lookup approach, which has been shown to be the best choice for performing word-level multiplications[8]. The experimental results are listed in Table 4.

The five binary fields recommended by the National Institute of Standards and Technology (NIST) for elliptic curve digital signature algorithm (ECDSA) applications are GF(2^163), GF(2^233), GF(2^283), GF(2^409), and GF(2^571)[10]. The experimental results indicate that the arrays DA and DB, which are defined at the end of Section 1, speed up Algorithm 1 by no more than 10% for the GF(2^n) listed in Table 4. For example, Algorithm 1 without computing the arrays DA and DB performs one multiplication operation in 1566 µs over GF(2^571). This result is somewhat better than that of Algorithm 2. Table 4 shows that for some GF(2^n) where type 4 Gaussian NBs exist, Algorithm 2 is faster than the Montgomery algorithm. Additionally, for GF(2^409), Algorithm 2 is only slightly slower than the Montgomery algorithm. Consequently, for applications where many squaring operations are needed, e.g., exponentiation, Algorithm 2 is a better choice (squaring is merely a cyclic shift in a normal basis representation).

Table 4  Timing for some GF(2^n) (µs)

n     Type   Original RH   Improved RH   Algorithm 1   Algorithm 2   NY algorithm   Montgomery algorithm
131   2      70            64            57            39            29             38
233   2      180           153           140           98            56             125
359   2      350           282           283           199           118            309
491   2      590           472           493           321           226            593
163   4      164           153           123           112           2500           57
277   4      373           333           260           236           11 970         180
409   4      671           609           506           441           95 840         421
577   4      1231          1070          993           825           278 900        844
673   4      1593          1382          1320          1071          438 900        1167
739   4      1872          1600          1578          1277          576 600        1390
283   6      516           479           318           339           12 490         190
503   6      1326          1212          914           865           177 200        622
751   6      2726          2483          1989          1841          616 100        1476
599   8      2241          2097          1444          1524          300 300        896
571   10     2454          2317          1481          1684          258 600        817
563   14     3308          3185          1782          2198          251 300        801

5 Conclusions

This paper presents two normal basis multiplication algorithms in GF(2^n). Algorithm 1 is suitable for high complexity normal bases, whereas Algorithm 2 is fast in GF(2^n) where type-I ONBs or low complexity normal bases exist. Theoretical analyses and experimental results both indicate that the presented algorithms are efficient in GF(2^233), GF(2^283), GF(2^409), and GF(2^571), which are four of the five binary fields recommended by NIST for ECDSA applications.

References

[1] Fan H N. Simple multiplication algorithm for a class of GF(2^n). IEE Electronics Letters, 1996, 32(7): 636-637.
[2] Mullin R C, Onyszchuk I M, Vanstone S A, Wilson R M. Optimal normal bases in GF(p^n). Discrete Applied Mathematics, 1988/89, 22: 149-161.
[3] Lu C C. A search of minimal key functions for normal basis multipliers. IEEE Trans. Computers, 1997, 46(5): 588-592.
[4] Reyhani-Masoleh A, Hasan M A. Fast normal basis multiplication using general purpose processors. IEEE Trans. Computers, 2003, 52(11): 1379-1390.
[5] Reyhani-Masoleh A, Hasan M A. Efficient multiplication beyond optimal normal bases. IEEE Trans. Computers, 2003, 52(4): 428-439.
[6] IEEE P1363-2000. Standard Specifications for Public Key Cryptography. August 2000.
[7] Ning P, Yin Y L. Efficient software implementation for finite field multiplication in normal basis. In: Proceedings of the 3rd International Conference on Information and Communications Security (ICICS). Springer-Verlag, LNCS 2229, 2001: 177-188.
[8] Koc C K, Acar T. Montgomery multiplication in GF(2^k). Designs, Codes and Cryptography, 1998, 14(1): 57-69.
[9] Lopez J, Dahab R. High-speed software multiplication in F(2^m). Technical report IC-00-09, May 2000. Available at http://www.dcc.unicamp.br/ic-main/publications-e.html.
[10] National Institute of Standards and Technology (NIST). Digital Signature Standard (DSS), Feb. 2000. Available at http://csrc.nist.gov/cryptval/dss/fr000215.html.