ARTICLE IN PRESS
INTEGRATION, the VLSI journal 41 (2008) 371–384 www.elsevier.com/locate/vlsi
Versatile multiplier architectures in $GF(2^k)$ fields using the Montgomery multiplication algorithm
Apostolos P. Fournaris, O. Koufopavlou
Electrical and Computer Engineering Department, University of Patras, Patras, Greece
Received 11 August 2006; received in revised form 19 June 2007; accepted 17 July 2007
Abstract

Many sequential multipliers for polynomial basis $GF(2^k)$ fields have been proposed using the LSbit and MSbit multiplication algorithms. However, all those designs are defined over fixed-size $GF(2^k)$ fields and sometimes over fixed special-form irreducible polynomials (AOL, trinomials, pentanomials). When such architectures are redesigned for arbitrary $GF(2^k)$ fields and generic irreducible polynomials, and are thereby made versatile, they result in designs with high space complexity (gate–latch number), low frequency (long critical path) and high latency. In this paper, a Montgomery multiplication element (MME) architecture specially designed for arbitrary $GF(2^k)$ fields defined over general irreducible polynomials is proposed, based on an optimized version of the Montgomery multiplication (MM) algorithm for $GF(2^k)$ fields. To evaluate the proposed MME and demonstrate the efficiency of the MM algorithm in versatile design, three distinct versatile Montgomery multiplier architectures are presented using this MME. They achieve a small gate–latch number and a high clock frequency compared to other sequential versatile designs.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Computations in finite fields; Computer arithmetic; Montgomery multiplication; Pipeline; Versatile design; VLSI
Corresponding author. Tel.: +30 2610997323; fax: +30 2610994798. E-mail address: [email protected] (A.P. Fournaris).
0167-9260/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2007.07.004

1. Introduction

Finite field arithmetic is rapidly becoming a very useful tool for many applications in error coding theory, computer algebra and elliptic curve cryptography [1,2]. Its main advantage lies in the simplicity of the finite field arithmetic operations, which give results with no loss of accuracy, quickly and at relatively little processing cost [3]. Finite fields are usually divided into prime fields, or $GF(p)$ fields, and binary extension fields, or $GF(2^k)$ fields. $GF(2^k)$ fields can be implemented very efficiently in hardware due to their "carry free" logic [4].

Multiplication in $GF(2^k)$ fields has been thoroughly analyzed and researched in recent years using many different field basis representations, such as polynomial (standard) basis, normal basis or dual basis [5]. $GF(2^k)$ multipliers in polynomial basis representation can be grouped in many ways but generally fit into two major categories [5], the sequential and the parallel multipliers. Sequential multipliers need many clock cycles, k or more, to produce the multiplication product, since iterative algorithms are used. In parallel multipliers all calculations are concluded in one clock cycle. Parallel multipliers have increased space complexity (gate–latch number) compared to sequential multipliers but a smaller multiplication time delay (the time needed to produce the multiplication product). To achieve this, most of them use special irreducible polynomials, like AOL, trinomials or pentanomials [5–7].

Well-known sequential multipliers are the MSbit-first (MSB) and LSbit-first (LSB) multipliers [8], which have been proposed by many researchers in bit-serial [9], digit-serial [10], systolic [11] and semisystolic [12] designs. Each of those multipliers, however, is designed for calculations in a specific, fixed $GF(2^k)$ field and cannot work in any other such field with a different k value. Also, multipliers for $GF(2^k)$ fields defined over special irreducible polynomials are restricted to calculations between $GF(2^k)$ elements defined only over those types of polynomials. The above constraints, although not in general problematic,
make the resulting $GF(2^k)$ multipliers very impractical, due to their reduced flexibility, for some $GF(2^k)$ field applications. Applications such as cryptography, which involve computations in different $GF(2^k)$ fields defined over general irreducible polynomials, cannot take advantage of the above multipliers. To solve this problem, specially designed multipliers that support arbitrary $GF(2^k)$ fields defined over general irreducible polynomials can be introduced.

We can define versatile $GF(2^k)$ multipliers as follows. Suppose that a $GF(2^k)$ multiplier operates in a specific $GF(2^k)$ field defined over a general irreducible polynomial. Such a multiplier is considered versatile if it can also perform multiplication in all underlying $GF(2^m)$ fields defined over any other irreducible polynomial, where $1 \le m \le k$.

Some researchers have suggested modifications of the MSB and LSB multiplication algorithms in order to propose versatile multipliers [12–14], but the resulting designs have increased space complexity (gate and flip-flop number) and multiplication time delay. The work of [15], which presents the most promising results among the above designs, proposes a modification of the LSB algorithm for the construction of versatile, scalable, digit-serial multipliers. However, this is achieved by introducing an additional reduction calculation at the end of the computations and by imposing a constraint on the structure of the irreducible polynomial defining the $GF(2^k)$ field.

The Montgomery multiplication (MM) algorithm [16] is very popular in standard arithmetic [17] because it can perform modular multiplication without trial division and it is ideal for scalable, reconfigurable designs. It is proved in [18] that this sequential algorithm is also functional in $GF(2^k)$ field arithmetic.
Few works have been proposed concerning the MM algorithm for $GF(2^k)$ fields, and those are software-oriented [18] or use special irreducible polynomials like trinomials in parallel [19] or systolic designs [20]. Although software implementations of the MM algorithm for $GF(2^k)$ fields give very promising results in terms of AND and XOR operation number (the software equivalent of gate number) and multiplication time delay, there are many open possibilities in designing the MM algorithm for $GF(2^k)$ in hardware, especially when versatile architectures are the goal. Versatile design of the MM algorithm has not been thoroughly analyzed yet. The works [21,22] propose versatile MM multipliers, but in those architectures versatile design is defined differently (as unified multipliers that can operate in fixed $GF(p)$, $GF(3^k)$ and $GF(2^k)$ fields).

In this paper, the MM algorithm for $GF(2^k)$ fields is examined for its hardware applicability in designing efficient versatile multipliers in terms of gate–latch number and multiplication time delay. The algorithm is analyzed at bit level and an optimized version of the MM algorithm (the mbMM algorithm) is proposed. The potential of this proposed algorithm for the design of versatile multipliers is discussed and a relevant methodology using this algorithm is devised. As a result of this study a $GF(2^k)$
field Montgomery multiplication element (MME), based on the optimized version of the MM algorithm for $GF(2^k)$ fields, is proposed that can be used for the construction of versatile sequential Montgomery multipliers. The efficiency of the proposed MME is evaluated in terms of time complexity (latency, critical path) and space complexity (gate–latch–MUX number). In order to demonstrate the efficiency of the mbMM algorithm and the proposed MME, three different versatile multiplier architectures that use this element are also proposed: the bit-serial Montgomery multiplier, the pipelined-semisystolic Montgomery multiplier and the partially pipelined Montgomery multiplier.

The paper is organized as follows. A brief mathematical analysis of $GF(2^k)$ field arithmetic is given in Section 2. In Section 3, the MM algorithm for $GF(2^k)$ fields is analyzed. The proposed optimized algorithm and the MME architecture are described in detail in Section 4. The resulting multiplier architectures are proposed in Section 5. In Section 6, measurements, results and comparisons with other known architectures are presented, and Section 7 concludes the paper.

2. Mathematical background

A finite field is a field that has a finite set of elements. Such a field is also called a Galois field, $GF(q)$, in honor of the mathematician who first introduced it. We define the order of a finite field, $\mathrm{Order}(GF(q))$, as the number of elements of the field. Finite fields only exist for $q = p^k$, where p is a prime number and k is a positive integer, and the order of such a field is q. When $p = 2$, finite fields are called binary extension fields or $GF(2^k)$ fields.

2.1. Polynomial basis representation in $GF(2^k)$ fields

$GF(2^k)$ fields, as stated in [3,23,24], are very attractive for implementation due to their "carry free" arithmetic.
Also, due to the availability of different equivalent $GF(2^k)$ field element representations, the field arithmetic can be adapted and optimized for the computational environment at hand. $GF(2^k)$ field elements are represented as binary vectors of dimension k over $GF(2)$ relative to a given basis $(\alpha_{k-1}, \alpha_{k-2}, \ldots, \alpha_1, \alpha_0)$. The $GF(2^k)$ field is isomorphic to $GF(2)[\alpha]/(F(\alpha))$, where $F(\alpha)$ is a monic irreducible polynomial of degree k with coefficients $f_i \in \{0, 1\}$ or, equivalently, $f_i \in GF(2)$. We define $F(\alpha)$ as

\[ F(\alpha) = \alpha^k + \sum_{i=0}^{k-1} f_i \alpha^i. \]
According to the polynomial basis representation, an element S of a $GF(2^k)$ field is a polynomial of degree less than or equal to $k-1$ defined over a basis $(\alpha^{k-1}, \ldots, \alpha^2, \alpha^1, 1)$ with coefficients $s_i \in \{0, 1\}$, where $\alpha$ is a root of the irreducible polynomial $F(\alpha)$. This can be written as

\[ S(\alpha) = \sum_{i=0}^{k-1} s_i \alpha^i = s_0 + s_1 \alpha + \cdots + s_{k-1} \alpha^{k-1}. \]
The above equation states that every element S of the $GF(2^k)$ field in polynomial basis is represented as a polynomial with coefficients $s_0$ to $s_{k-1}$ or, in vector format, as $(s_{k-1}, s_{k-2}, \ldots, s_1, s_0)$. The addition operation in $GF(2^k)$ is identical to subtraction and is "carry-free", so it can simply be defined as a bit-wise XOR operation.

2.2. Multiplication in $GF(2^k)$ fields

Assuming $A(\alpha), B(\alpha) \in GF(2^k)$, the product of $A(\alpha)$ and $B(\alpha)$ is

\[ A(\alpha) \cdot B(\alpha) = \left( \sum_{i=0}^{k-1} a_i \alpha^i \right) \left( \sum_{i=0}^{k-1} b_i \alpha^i \right) \bmod F(\alpha), \]

\[ A(\alpha) \cdot B(\alpha) = \left( \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i b_j \alpha^{i+j} \right) \bmod F(\alpha). \]
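The double sum above can be sketched directly in software. The following is a minimal illustration, not taken from the paper: it assumes that a field element is stored in a Python integer whose bit i holds the coefficient of $\alpha^i$, and the names `gf2_add` and `gf2k_mul` are hypothetical.

```python
def gf2_add(a: int, b: int) -> int:
    """Addition (and subtraction) in GF(2^k): carry-free bit-wise XOR."""
    return a ^ b


def gf2k_mul(a: int, b: int, f: int, k: int) -> int:
    """Polynomial-basis product of a and b modulo f.

    Assumption: bit i of each integer is the coefficient of alpha^i,
    and f has degree k (bit k set).
    """
    # Unreduced product: XOR-accumulate a_i * B(alpha) * alpha^i.
    prod = 0
    for i in range(k):
        if (a >> i) & 1:
            prod ^= b << i
    # Reduction: cancel the terms alpha^(2k-2) down to alpha^k using F.
    for deg in range(2 * k - 2, k - 1, -1):
        if (prod >> deg) & 1:
            prod ^= f << (deg - k)
    return prod


# Example in GF(2^4) with F = alpha^4 + alpha + 1:
F4 = 0b10011
print(bin(gf2k_mul(0b1000, 0b0010, F4, 4)))  # alpha^3 * alpha = alpha + 1
```

The reduction loop uses only the fact that $F(\alpha) = 0$, which is the same fact the sequential LSB and MSB algorithms exploit one bit at a time.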
Using the fact that $\alpha$ is a root of the irreducible polynomial ($F(\alpha) = 0$) [8], two sequential multiplication algorithms have been developed. Those are the least significant bit first multiplication algorithm (LSB), where the least significant bit of the second operand is processed first, and the most significant bit first multiplication algorithm (MSB), where the most significant bit of the second operand is processed first [10,12]. From this point of the paper on, assume that A, B, C, F are the polynomials $A(\alpha)$, $B(\alpha)$, $C(\alpha)$, $F(\alpha)$, respectively, in vector format, $A = (a_{k-1}, a_{k-2}, \ldots, a_1, a_0)$, $B = (b_{k-1}, b_{k-2}, \ldots, b_1, b_0)$, $C = (c_{k-1}, c_{k-2}, \ldots, c_1, c_0)$, $F = (f_k, f_{k-1}, f_{k-2}, \ldots, f_1, f_0)$, and that all superscripts indicate the algorithm's round number.

2.3. Design aspects of versatile $GF(2^k)$ multipliers

The LSB and MSB algorithms lead to architectures that can function for specific $GF(2^k)$ fields defined over general irreducible polynomials $F(\alpha)$. In other words, the resulting LSB and MSB multipliers are operational for any irreducible polynomial as long as the $GF(2^k)$ field has fixed order. If the order of the $GF(2^k)$ field varies (the field is called arbitrary in that case), then the LSB and MSB algorithms cannot be used. The basic problem lies in the fact that the m-dimensional vectors of the $GF(2^m)$ field elements, where $1 \le m \le k$, cannot be extended to k-dimensional $GF(2^k)$ field element vectors by padding the higher order bits with zeros [12] if they are to be inserted in the LSB or MSB algorithms. The use of fixed order bits that depend on the $GF(2^k)$ field in the LSB and MSB algorithms is the major reason for this problem. In order to design a versatile multiplier based on the LSB or MSB algorithm, two solutions have been offered. One solution is the
introduction of extra circuitry for detecting the fixed order bits, as proposed in [13,12]. Another solution is the redesign of the algorithms in order to eliminate this bit dependency [15]. Either solution leads to multiplier architectures that suffer from extensive hardware costs in gate–latch number and multiplication time delay.

3. MM for $GF(2^k)$ fields

The MM algorithm [16], taken from standard arithmetic, performs modular multiplication without trial division and leads to scalable, reconfigurable hardware designs. This well-known algorithm is also defined [18] in $GF(2^k)$ field arithmetic. According to the MM algorithm's logic, a special fixed element $R(\alpha) \in GF(2^k)$ is introduced and, instead of computing $A(\alpha) \cdot B(\alpha)$, it is proposed to compute $A(\alpha) \cdot B(\alpha) \cdot R^{-1}(\alpha)$. The MM algorithm imposes, however, the restriction that $\gcd(R(\alpha), F(\alpha)) = 1$, which in $GF(2^k)$ is always true because the polynomial $F(\alpha)$ is irreducible over $GF(2)$ by definition. Since $R(\alpha)$ and $F(\alpha)$ are relatively prime, there exist two polynomials $R^{-1}(\alpha)$ and $F'(\alpha)$ with the property

\[ R(\alpha) R^{-1}(\alpha) + F(\alpha) F'(\alpha) = 1, \]

where $R^{-1}(\alpha)$ is the inverse of $R(\alpha)$ modulo $F(\alpha)$. The polynomials $R^{-1}(\alpha)$ and $F'(\alpha)$ can be found using the extended Euclidean algorithm. The outcome of MM is $C(\alpha) = A(\alpha) B(\alpha) R^{-1}(\alpha) \bmod F(\alpha)$ and can be calculated using the following algorithm.

Montgomery Multiplication Algorithm (MM).
Input: $A(\alpha)$, $B(\alpha)$, $R(\alpha)$, $F(\alpha)$, $R^{-1}(\alpha)$, $F'(\alpha)$
Output: $C(\alpha) = A(\alpha) B(\alpha) R^{-1}(\alpha) \bmod F(\alpha)$
(1) $T(\alpha) = A(\alpha) \cdot B(\alpha)$
(2) $U(\alpha) = T(\alpha) \cdot F'(\alpha) \bmod R(\alpha)$
(3) $C(\alpha) = \dfrac{T(\alpha) + U(\alpha) F(\alpha)}{R(\alpha)}$

The calculation of $F'(\alpha)$ and $R^{-1}(\alpha)$ can be a time-consuming operation in hardware applications. If we use $R(\alpha) = \alpha^k$, then the remainder operation with modulus $R(\alpha)$ can be performed by simply ignoring the terms that have powers greater than or equal to k, and the division by $R(\alpha)$ is just a shift of k bits to the right.
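The three MM steps with $R(\alpha) = \alpha^k$ can be sketched as follows. This is only an illustrative software model under stated assumptions: elements are Python integers with bit i holding the coefficient of $\alpha^i$, $f_0 = 1$, and $F'(\alpha)$ is obtained here by a simple bit-by-bit inversion of $F(\alpha)$ modulo $\alpha^k$ rather than by the extended Euclidean algorithm mentioned in the text. The names `clmul`, `inverse_mod_alpha_k` and `mm` are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials stored as bit vectors."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def inverse_mod_alpha_k(f: int, k: int) -> int:
    """Return F' with F * F' = 1 (mod alpha^k); assumes f_0 = 1."""
    inv = 1
    for i in range(1, k):
        # If coefficient i of F * inv is still 1, cancel it with alpha^i.
        if (clmul(f, inv) >> i) & 1:
            inv |= 1 << i
    return inv


def mm(a: int, b: int, f: int, k: int) -> int:
    """A * B * alpha^(-k) mod F, following MM steps (1)-(3) with R = alpha^k."""
    mask = (1 << k) - 1                    # "mod alpha^k" keeps the low k bits
    f_prime = inverse_mod_alpha_k(f, k)
    t = clmul(a, b)                        # step (1)
    u = clmul(t & mask, f_prime) & mask    # step (2)
    return (t ^ clmul(u, f)) >> k          # step (3): exact right shift by k


# alpha * alpha^3 = alpha^4 in GF(2^4), so the Montgomery product is 1.
print(mm(0b0010, 0b1000, 0b10011, 4))
```

Because addition and subtraction coincide in characteristic 2, $T + U F$ has its k low-order coefficients equal to zero, which is why the final shift loses no information.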
It also turns out that the computation of $F'(\alpha)$ can be completely avoided if the coefficients of $A(\alpha)$ are scanned one bit at a time [18]. From the above remarks it is proven in [18] that the MM algorithm at bit level is as follows.

Bit-Level Montgomery Multiplication Algorithm (bMM).
Input: A, B, F
Output: $C(\alpha) = A(\alpha) B(\alpha) \alpha^{-k} \bmod F(\alpha)$
(1) $C^{(0)}(\alpha) = 0$
(2) For $n = 0$ to $k-1$ do begin
(3)     $D^{(n)}(\alpha) = C^{(n)}(\alpha) + a_n B(\alpha)$
(4)     $C^{(n+1)}(\alpha) = \dfrac{D^{(n)}(\alpha) + d_0^{(n)} F(\alpha)}{\alpha}$
    end
(5) Return $C^{(k)}(\alpha)$
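The bMM listing translates almost line by line into software. The sketch below is an illustration under the assumption that polynomials are stored as Python integers (bit i is the coefficient of $\alpha^i$, and F includes its $\alpha^k$ term); the function name `bmm` is hypothetical.

```python
def bmm(a: int, b: int, f: int, k: int) -> int:
    """Bit-level Montgomery product A * B * alpha^(-k) mod F."""
    c = 0                                   # step (1)
    for n in range(k):                      # step (2)
        a_n = (a >> n) & 1
        d = c ^ (b if a_n else 0)           # step (3): D = C + a_n * B
        d0 = d & 1
        c = (d ^ (f if d0 else 0)) >> 1     # step (4): (D + d_0 * F) / alpha
    return c                                # step (5)


# alpha * alpha^3 = alpha^4, so multiplying by alpha^(-4) gives 1.
print(bmm(0b0010, 0b1000, 0b10011, 4))
```

Adding $d_0 F(\alpha)$ always clears the constant term, because $f_0 = 1$ for an irreducible polynomial of degree greater than one, so the division by $\alpha$ is an exact one-bit right shift.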
The bMM algorithm needs k identical rounds to produce the correct result at its output. Each round consists of two steps (steps 3 and 4). In step 3, the polynomial $C(\alpha)$ of the previous algorithmic round is added to the polynomial $B(\alpha)$ if the corresponding coefficient $a_n$ of the polynomial $A(\alpha)$ is 1, and the output polynomial $D(\alpha)$ is passed to step 4. In step 4, if the coefficient $d_0$ of $D(\alpha)$ is 1, then the irreducible polynomial $F(\alpha)$ is added to $D(\alpha)$. The outcome is divided by $\alpha$ to give the new polynomial $C(\alpha)$. The overall operation count in each algorithmic round is two additions, two multiplications and one division by $\alpha$ in the worst case scenario, where both $a_n$ and $d_0$ are 1.

3.1. MM algorithm for versatile $GF(2^k)$ multipliers

Some characteristics of the MM algorithm for $GF(2^k)$ fields (MM and bMM) can be very useful for the design of versatile $GF(2^k)$ multipliers. The only fixed order bit used in the bMM algorithm is the least significant bit (LSbit) of $D(\alpha)$. When versatile $GF(2^m)$ field multipliers are to be designed, where $1 \le m \le k$, the zero padding of the $k - m$ higher order bits of the input values presents no problem because no high order bit is fixed in the algorithm. The least significant bit of $D(\alpha)$ required in the calculations will always have a valid value. If we assume that the $R(\alpha)$ polynomial follows the variation in the arbitrary $GF(2^m)$ field, thus becoming $R(\alpha) = \alpha^m$, then the main loop of the bMM algorithm will have to be repeated m times. On the other hand, if the $R(\alpha)$ polynomial is considered fixed ($R(\alpha) = \alpha^k$), the loop of bMM will be repeated k times, regardless of the field order, giving correct results. The above remarks are proved and analyzed at length in the rest of this subsection.

Assume that we have two elements $A(\alpha) = \sum_{i=0}^{m-1} a_i \alpha^i$ and $B(\alpha) = \sum_{i=0}^{m-1} b_i \alpha^i \in GF(2^m)$ defined over the irreducible polynomial $F(\alpha) = \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i$, where $1 \le m \le k$ and $a_i, b_i, f_i \in GF(2)$.
The m-dimensional vectors of the above field elements can be extended to k-dimensional vectors by adding zeros accordingly. In that case, the field elements and the irreducible polynomial will become

\[ A(\alpha) = \sum_{j=m}^{k-1} a_j \alpha^j + \sum_{i=0}^{m-1} a_i \alpha^i, \tag{1} \]

\[ B(\alpha) = \sum_{j=m}^{k-1} b_j \alpha^j + \sum_{i=0}^{m-1} b_i \alpha^i, \tag{2} \]

\[ F(\alpha) = \sum_{j=m+1}^{k-1} f_j \alpha^j + \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i, \tag{3} \]

where $a_j, b_j, f_j = 0$.

The bMM algorithm can be broken into two parts. Part 1 is responsible for the calculations until the mth algorithmic round, while part 2 is responsible for the calculations beyond the mth algorithmic round (rounds m to $k-1$). It must be noted that $a_i \alpha^i = 0$ for $m \le i \le k-1$, so the multiplication $a_i B(\alpha) = 0$. That simplifies part 2 of the bMM algorithm as shown below (vbMM algorithm):

Bit-Level Montgomery Multiplication Algorithm for Versatile Multipliers (vbMM).
Input: A, B, F
Output: $C(\alpha) = A(\alpha) B(\alpha) \alpha^{-k} \bmod F(\alpha)$
Part 1
    $C(\alpha) = 0$
(1) For $n = 0$ to $m-1$ do begin
(2)     $D^{(n)}(\alpha) = C^{(n)}(\alpha) + a_n B(\alpha)$
(3)     $C^{(n+1)}(\alpha) = \dfrac{D^{(n)}(\alpha) + d_0^{(n)} F(\alpha)}{\alpha}$
    end
Part 2
(1) For $n = m$ to $k-1$ do begin
(2)     $C^{(n+1)}(\alpha) = \dfrac{C^{(n)}(\alpha) + c_0^{(n)} F(\alpha)}{\alpha}$
    end
Return $C^{(k)}(\alpha)$

Replacing the polynomials of vbMM part 1, step 3 and of vbMM part 2, step 2 with their analytic form as presented in (1)–(3), and assuming that $C(\alpha) = \sum_{j=m}^{k-1} c_j \alpha^j + \sum_{i=0}^{m-1} c_i \alpha^i$ and $D(\alpha) = \sum_{j=m}^{k-1} d_j \alpha^j + \sum_{i=0}^{m-1} d_i \alpha^i$, these steps become, for $0 \le n \le m-1$ (vbMM part 1, step 3),

\[ C^{(n+1)}(\alpha) = \left( \sum_{j=m}^{k-1} d_j^{(n)} \alpha^j + \sum_{i=0}^{m-1} d_i^{(n)} \alpha^i + d_0^{(n)} \left( \sum_{j=m+1}^{k-1} f_j \alpha^j + \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i \right) \right) \Big/ \alpha \]

and, for $m \le n \le k-1$ (vbMM part 2, step 2),

\[ C^{(n+1)}(\alpha) = \left( \sum_{j=m}^{k-1} c_j^{(n)} \alpha^j + \sum_{i=0}^{m-1} c_i^{(n)} \alpha^i + c_0^{(n)} \left( \sum_{j=m+1}^{k-1} f_j \alpha^j + \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i \right) \right) \Big/ \alpha. \]

If we analyze the shifting operation further, the above equations become, for $0 \le n \le m-1$:

\[ C^{(n+1)}(\alpha) = \sum_{j=m-1}^{k-2} d_{j+1}^{(n)} \alpha^j + \sum_{i=0}^{m-2} d_{i+1}^{(n)} \alpha^i + d_0^{(n)} \left( \sum_{j=m}^{k-2} f_{j+1} \alpha^j + \alpha^{m-1} + \sum_{i=0}^{m-2} f_{i+1} \alpha^i \right) \]
\[ = \sum_{j=m}^{k-2} \left( d_{j+1}^{(n)} + d_0^{(n)} f_{j+1} \right) \alpha^j + \left( d_m^{(n)} + d_0^{(n)} \right) \alpha^{m-1} + \sum_{i=0}^{m-2} \left( d_{i+1}^{(n)} + d_0^{(n)} f_{i+1} \right) \alpha^i \]

and, for $m \le n \le k-1$:

\[ C^{(n+1)}(\alpha) = \sum_{j=m-1}^{k-2} c_{j+1}^{(n)} \alpha^j + \sum_{i=0}^{m-2} c_{i+1}^{(n)} \alpha^i + c_0^{(n)} \left( \sum_{j=m}^{k-2} f_{j+1} \alpha^j + \alpha^{m-1} + \sum_{i=0}^{m-2} f_{i+1} \alpha^i \right) \]
\[ = \sum_{j=m}^{k-2} \left( c_{j+1}^{(n)} + c_0^{(n)} f_{j+1} \right) \alpha^j + \left( c_m^{(n)} + c_0^{(n)} \right) \alpha^{m-1} + \sum_{i=0}^{m-2} \left( c_{i+1}^{(n)} + c_0^{(n)} f_{i+1} \right) \alpha^i. \]

Considering the fact that $f_j = 0$, the above equations become

\[ C^{(n+1)}(\alpha) = \sum_{j=m}^{k-2} d_{j+1}^{(n)} \alpha^j + \left( d_m^{(n)} + d_0^{(n)} \right) \alpha^{m-1} + \sum_{i=0}^{m-2} \left( d_{i+1}^{(n)} + d_0^{(n)} f_{i+1} \right) \alpha^i, \tag{4} \]

\[ C^{(n+1)}(\alpha) = \sum_{j=m}^{k-2} c_{j+1}^{(n)} \alpha^j + \left( c_m^{(n)} + c_0^{(n)} \right) \alpha^{m-1} + \sum_{i=0}^{m-2} \left( c_{i+1}^{(n)} + c_0^{(n)} f_{i+1} \right) \alpha^i \tag{5} \]

for $0 \le n \le m-1$ (4) and $m \le n \le k-1$ (5), respectively. A similar analysis can be done for vbMM part 1, step 2 using the fact that $a_j, b_j = 0$. The resulting polynomial $D(\alpha)$ of this step would be

\[ D^{(n)}(\alpha) = \sum_{j=m}^{k-1} c_j^{(n)} \alpha^j + \sum_{i=0}^{m-1} \left( c_i^{(n)} + a_n b_i \right) \alpha^i, \tag{6} \]

where $0 \le n \le m-1$ according to the round number. It can be seen from (4)–(6) that the coefficients of the polynomials $C(\alpha)$ and $D(\alpha)$ with degree m to $k-1$ are not altered in any part of the vbMM algorithm, nor do they affect the results of the vbMM algorithm in any way by interacting with any other polynomial. These coefficients derive from our initial assumption of "extending" the $GF(2^m)$ field elements to $GF(2^k)$ field elements by padding with zeros (this is not meant in a literal sense, since the irreducible polynomial is still of degree m and not k). Since they do not interfere with the computational flow of the algorithm in any way, it can also be concluded that the initial zero padding has no effect on the results of the vbMM algorithm.

The same conclusion can be reached when the $R(\alpha)$ polynomial is not fixed ($R(\alpha) = \alpha^m$). In that case, the vbMM algorithm is identical to bMM since only part 1 of vbMM is used. Eqs. (4)–(6) remain valid and lead to the same conclusion.

It must be noted that all the output values of the bMM algorithm are in Montgomery format (they are multiplied by the term $R(\alpha)^{-1} \bmod F(\alpha)$) and need to be transformed to ordinary $GF(2^k)$ field format. This can be done at the end of all calculations by running a variation of the bMM algorithm. This variation is a simplification of the original bMM algorithm that takes into account that the value to be transformed is multiplied by a fixed precomputed polynomial ($R^2(\alpha) \bmod F(\alpha)$) [18].

4. Optimized bit-level MM algorithm and proposed MME architecture

Using the remarks of the previous section we can propose an MME that performs one round of the MM algorithm. For this reason, we propose an optimized version of the bMM algorithm that can be easily implemented in hardware. The proposed MME based on this algorithm is designed to fit, without any modification, into versatile Montgomery multiplier architectures that employ hardware techniques like pipelining, folding, parallelism, unfolding, etc.

In the bMM algorithm, one round of the algorithm translates in binary logic into two XOR operations (addition equivalent), two AND operations (multiplication equivalent) and a 1-bit right shift (division by $\alpha$ equivalent). However, a close inspection of the bMM algorithm reveals that one bMM round consists of two very similar steps. Assuming that $T(\alpha) = \sum_{i=0}^{k-1} t_i \alpha^i$ and $Y(\alpha) = \sum_{i=0}^{k-1} y_i \alpha^i$ are elements of a $GF(2^k)$ field and that $g_n \in GF(2)$, both bMM steps 3 and 4, if the shifting operation is excluded, follow the equation presented below:

\[ H(\alpha) = T(\alpha) + g_n Y(\alpha) = \sum_{i=0}^{k-1} t_i \alpha^i + g_n \sum_{i=0}^{k-1} y_i \alpha^i = \sum_{i=0}^{k-1} (t_i + g_n y_i) \alpha^i. \tag{7} \]
The only change between steps 3 and 4 of bMM is in the values of $T(\alpha)$, $g_n$ and $Y(\alpha)$, since the type of calculation remains the same. Therefore, we can investigate the possibility of calculating both step output values with the same hardware module of (7) by inserting the correct values of $T(\alpha)$, $g_n$, $Y(\alpha)$ in it. Modifications to the bMM algorithm must be made to match this constraint. All the polynomials in such a module can be k-dimensional vectors. The irreducible polynomial $F(\alpha) = \alpha^k + \sum_{i=0}^{k-1} f_i \alpha^i$ can also be considered k-dimensional if the shifting operation is included in the calculation of bMM step 4. Note also that the most significant bit (MSbit) of $F(\alpha)$ in its vector representation is always 1. More precisely, blending the calculations of bMM step 4 with the shifting operation in this step, we have

\[ C(\alpha) = d_0 \alpha^{k-1} + \sum_{i=0}^{k-2} (d_{i+1} + d_0 f_{i+1}) \alpha^i. \tag{8} \]
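Eqs. (7) and (8) can be checked against each other in a few lines. The sketch below is only an illustration under the assumption that polynomials are stored as Python integers (bit i is the coefficient of $\alpha^i$); the helper names `mme_step`, `bmm_round_direct` and `bmm_round_fused` are hypothetical.

```python
def mme_step(t: int, g: int, y: int) -> int:
    """Eq. (7): H = T + g * Y over GF(2) (an AND row feeding an XOR row)."""
    return t ^ (y if g else 0)


def bmm_round_direct(c: int, a_n: int, b: int, f: int) -> int:
    """One bMM round as written: step 3, then step 4 with an explicit shift."""
    d = c ^ (b if a_n else 0)
    return (d ^ (f if d & 1 else 0)) >> 1


def bmm_round_fused(c: int, a_n: int, b: int, f: int) -> int:
    """The same round as two uses of Eq. (7); the shift is folded into the inputs."""
    d = mme_step(c, a_n, b)                  # bMM step 3
    return mme_step(d >> 1, d & 1, f >> 1)   # bMM step 4 in the form of Eq. (8)


F4 = 0b10011  # alpha^4 + alpha + 1, so F >> 1 is alpha^3 + 1 (f_0 is dropped)
same = all(
    bmm_round_direct(c, a_n, b, F4) == bmm_round_fused(c, a_n, b, F4)
    for c in range(16) for b in range(16) for a_n in (0, 1)
)
print(same)
```

The two forms agree because a right shift is linear over $GF(2)$ and the discarded constant term of $D(\alpha) + d_0 F(\alpha)$ is always zero.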
The polynomial $C(\alpha)$, as shown in (8), has degree $k-1$ and the coefficient of the $k-1$ term ($\alpha^{k-1}$) is $d_0$. Note also that the coefficient $f_0$ of $F(\alpha)$ is ignored. Therefore, the polynomial $C(\alpha)$ can be represented as a k-dimensional vector defined over $GF(2)$ and it can be stored or manipulated in hardware as a k-bit number. The fact that the most significant bit of this vector is always equal to the least significant bit of the previous step's output (step 3) can help us restrict the bit number of the output to k bits
and thus strengthen our initial observation of the computational similarity between bMM steps 3 and 4. Both algorithmic steps (bMM steps 3 and 4) express the same equation (7) with different inputs in odd or even clock cycles.

Summarizing the above remarks, we can propose a modified version of the bMM algorithm that uses only one equation in each multiplication round. Assume that $H^{(n)}(\alpha) = \sum_{i=0}^{k-1} h_i^{(n)} \alpha^i$, $\bar{H}^{(n)}(\alpha) = \sum_{i=0}^{k-2} h_{i+1}^{(n)} \alpha^i$ and that $\tilde{F}(\alpha) = \alpha^{k-1} + \sum_{i=0}^{k-2} f_{i+1} \alpha^i$. Our proposed design methodology for a versatile MME is based on the following algorithm.

Modified Bit-Level Montgomery Multiplication Algorithm (mbMM).
Input: A, B, F
Output: $C(\alpha) = A(\alpha) B(\alpha) \alpha^{-k} \bmod F(\alpha)$
(1) $H^{(0)}(\alpha) = 0$
(2) For $n = 0$ to $2k-2$, increasing n by 2, do begin
    (a) For $w = 0$ to 1 do begin
        (i) If $w = 0$ then
            (A) $T^{(n)}(\alpha) = H^{(n)}(\alpha)$
            (B) $g_n = a_v$, where $v = n/2$
            (C) $Y^{(n)}(\alpha) = B(\alpha)$
            end if
        (ii) If $w = 1$ then
            (A) $T^{(n)}(\alpha) = \bar{H}^{(n)}(\alpha) = \sum_{i=0}^{k-2} h_{i+1}^{(n)} \alpha^i$
            (B) $g_n = h_0^{(n)}$
            (C) $Y^{(n)}(\alpha) = \tilde{F}(\alpha) = \alpha^{k-1} + \sum_{i=0}^{k-2} f_{i+1} \alpha^i$
            end if
        (iii) $H^{(n+1)}(\alpha) = \sum_{i=0}^{k-1} \left( t_i^{(n)} + g_n y_i^{(n)} \right) \alpha^i$
        end
    end
(3) Return $H^{(2k-1)}(\alpha)$
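The mbMM listing can be modelled in software with a single running value for H. This is a sketch under stated assumptions, not the authors' implementation: polynomials are Python integers (bit i is the coefficient of $\alpha^i$), each pass of the inner loop stands for one clock cycle and overwrites H, and the name `mbmm` is hypothetical.

```python
def mbmm(a: int, b: int, f: int, k: int) -> int:
    """A * B * alpha^(-k) mod F using one shared step per clock cycle."""
    f_tilde = f >> 1                      # F~: F shifted right, f_0 dropped
    h = 0                                 # step (1)
    for n in range(0, 2 * k - 1, 2):      # n = 0, 2, ..., 2k-2
        v = n // 2
        for w in (0, 1):
            if w == 0:                    # even cycle: bMM step 3
                t, g, y = h, (a >> v) & 1, b
            else:                         # odd cycle: bMM step 4 with shift
                t, g, y = h >> 1, h & 1, f_tilde
            h = t ^ (y if g else 0)       # step 2(a)(iii)
    return h


# Same field, two datapath widths: GF(2^4) with F = alpha^4 + alpha + 1.
print(mbmm(0b0010, 0b1000, 0b10011, 4))   # k = 4
print(mbmm(0b0010, 0b1000, 0b10011, 8))   # k = 8, operands zero-padded
```

The second call runs eight rounds on zero-padded $GF(2^4)$ operands, in the spirit of the zero-padding argument of Section 3.1. The result is then scaled by $\alpha^{-8}$ instead of $\alpha^{-4}$, but it is still a reduced element of the smaller field.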
Fig. 1. A single Montgomery multiplication round (one bMM/mbMM round, shown as an even clock cycle followed by an odd clock cycle; the $f_0$ bit is ignored).

One round of the mbMM algorithm consists of two repetitions of the same step (step 2(a)(iii)) with different inputs. Assuming that each such repetition corresponds to one clock cycle of a resulting architecture, we can separate those repetitions into those occurring in even ($w = 0$) and odd ($w = 1$) clock cycles. Moreover, it can be noted that one round of the bMM and mbMM algorithms corresponds to two clock cycles described in the above fashion. This correspondence is evident in Fig. 1, where the computational flow for one algorithmic round of bMM and mbMM is presented. All the vectors are normalized to have the same bit length k. In even clock cycles, the calculations of bMM step 3 are performed. In odd clock cycles, the calculations of bMM step 4 are performed. It must be noted that the shifting operation of bMM step 4 is embedded in the calculations of mbMM. This is achieved by using the $\bar{H}(\alpha)$ and $\tilde{F}(\alpha)$ polynomials as inputs of step 2(a)(iii) in the odd clock cycle of mbMM. In that way, the MSbit of the step 2(a)(iii) output for $w = 1$ is always equal to the $h_0$ of the even clock cycle's step 2(a)(iii) output and the $f_0$ value is ignored.

The mbMM algorithm is fully functional for arbitrary $GF(2^m)$ fields where $1 \le m \le k$. In such fields, as described in Section 3.1, the m-dimensional vectors of polynomials of degree $m-1$ can be extended to k-dimensional vectors by adding zeros for the higher order bits that are not needed. In that case, the MSbit of the mbMM odd clock cycle's step 2(a)(iii) output acquires a zero value and not the $h_0$ value of the previous clock cycle. In the mbMM algorithm that case is covered by the employed
input structure of the odd clock cycles. The shifted $F(\alpha)$ when $m < k$ in $GF(2^m)$ fields derives from (3) and becomes $\tilde{F}(\alpha) = \sum_{j=m}^{k-2} f_{j+1} \alpha^j + \alpha^{m-1} + \sum_{i=0}^{m-2} f_{i+1} \alpha^i$. Since the coefficient of the $k-1$ term in the above polynomial and in $\bar{H}(\alpha)$ is zero, the resulting coefficient of the degree $k-1$ term of the mbMM step 2(a)(iii) output for odd clock cycles will also be zero. All the remaining $k-m-1$ bits of one round's output that are not needed are zero. Therefore, a resulting proposed MME can lead to versatile multiplier architectures with no redesign or extra hardware circuitry.

Using the mbMM algorithm, the design of two hardware modules, one for each bMM round step (bMM steps 3 and 4), is no longer needed. Both steps can be calculated by the same hardware module of (7), or equivalently mbMM step 2(a)(iii), by inserting the correct values of $T(\alpha)$, $g_n$, $Y(\alpha)$ as inputs to that module. The data graph [25] of such a hardware module for odd and even clock cycles is presented in Fig. 2. In that figure, the value l is the clock count beginning from zero. There are four switches that accept two values, according to even ($2l + 0$) or odd ($2l + 1$) clock cycles, and a register. On even clock cycles the $B(\alpha)$, $a_v$ and $H^{(2l+1)}(\alpha)$ values are inserted in the module, where $H^{(2l+1)}(\alpha)$ is the H vector coming from the previous clock cycle-round. On odd clock cycles the $\tilde{F}(\alpha)$, $h_0^{(2l+0)}$ and $\bar{H}^{(2l+0)}(\alpha)$ values are inserted in the module, where $\bar{H}^{(2l+0)}(\alpha)$ is the shifted output value of the previous even clock cycle and $h_0^{(2l+0)}$ is the LSbit of that clock cycle's output value.

Using the previous remarks, the mbMM algorithm and Fig. 2, an architecture of an MME can be proposed. As input, the bit vector $H_{in}$ is used. The value is taken from a previous MME's Intermediate Output H, where Intermediate Output H is the output value of the odd clock cycle in step 2(a)(iii) of the mbMM algorithm. Other inputs are the bit vector $\tilde{F}$ of the irreducible polynomial $F(\alpha)$ shifted one bit to the right, the multiplicand B and the bit $a_v$ of the multiplier A. Two-to-one multiplexers are used for the choice between the even clock cycle's data set and the odd clock cycle's data set. A k-bit register is also used for temporary storage of the intermediate value H. In the even clock cycles, the multiplexers enable the k-bit signals B, $H_{in}$ and the bit signal $a_v$. The signals B, $a_v$ are inserted into a series of AND gates and the output is pushed into a series of XOR gates that uses the signal $H_{in}$ to reach a result. That result is stored in the C register. In the odd clock cycles, the multiplexers enable the k-bit signal $\tilde{F}$, the register's output H transformed accordingly to match $\bar{H}(\alpha)$, and the register's least significant bit $h_0$. The signals $\tilde{F}$, $h_0$ are inserted into the same series of AND gates and the output is pushed into the same series of XOR gates that uses the signal $\bar{H}$ to reach a result. That result is the Intermediate Output H. The data set for each clock cycle is summarized in Table 1 and the proposed MME architecture is shown analytically in Fig. 3 along with a 12-bit MME example.

The basic advantage of the proposed MME using the mbMM algorithm is the fact that it can perform one algorithmic round of two steps (bMM steps 3 and 4) using the hardware for one step. This design approach was chosen over a direct realization of the bMM algorithm because in that way the critical path delay is reduced and 2k gates are exchanged for $2k + 1$ two-to-one multiplexers (MUX). A MUX can be implemented in several standard cell libraries, both in FPGA and ASIC devices, occupying less chip area and with a smaller time delay than a gate (through LUTs, pass transistors or transmission gates), as analyzed in [26]. The time delay needed for one algorithmic round is two clock cycles, since the right result is available only at the end of every odd clock cycle. However, that does not make the proposed MME slow, because the critical path delay is very small, resulting in a high clock frequency. A k-bit data set of the proposed MME uses k AND gates, k XOR gates, k latches and $2k + 1$ MUXs. The critical path delay is the delay of one XOR gate ($T_X$) and one AND gate ($T_A$), while the overall multiplication time delay for one round is $2(T_X + T_A)$. The above remarks can be verified by comparing an MME implementation with a one bMM round implementation.
In a Virtex 2 XC2V8000 FPGA technology, the MME implementation employs 326 LUTs and has a maximum frequency of 377 MHz, while the bMM round implementation employs 506 LUTs and has a maximum frequency of 303 MHz.

It must be noted that there is no $H_{in}$ polynomial vector for the first MME (FMME). It can be considered zero. Using this remark, special care can be taken for that element and an optimized MME architecture for the FMME can be proposed. Since the value $H_{in}$ is considered zero (no previous stage exists), the even clock cycle's output would be $H_{in}(\alpha) + a_0 B(\alpha) = a_0 B = \sum_{i=0}^{k-1} a_0 b_i \alpha^i$ and therefore the odd clock cycle's output would be $H(\alpha) = \sum_{i=0}^{k-2} a_0 b_{i+1} \alpha^i + a_0 b_0 \tilde{F}(\alpha)$.
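The closed form for the first element can be checked against a generic two-cycle round. The sketch below is an illustration only, assuming integer-encoded polynomials (bit i is the coefficient of $\alpha^i$); the names `mme_round` and `fmme_round` are hypothetical and do not model the gate-level structure.

```python
def mme_round(h_in: int, a_v: int, b: int, f_tilde: int) -> int:
    """One generic round: even cycle (add a_v * B), then odd cycle (shift, add h_0 * F~)."""
    h = h_in ^ (b if a_v else 0)
    return (h >> 1) ^ (f_tilde if h & 1 else 0)


def fmme_round(a0: int, b: int, f_tilde: int) -> int:
    """First round with H_in = 0: sum of a_0 b_(i+1) alpha^i plus a_0 b_0 F~."""
    shifted_b = (b >> 1) if a0 else 0            # the a_0 * b_(i+1) terms
    reduction = f_tilde if (a0 & b & 1) else 0   # the a_0 * b_0 * F~ term
    return shifted_b ^ reduction


F_TILDE = 0b10011 >> 1  # F~ for F = alpha^4 + alpha + 1
agree = all(
    mme_round(0, a0, b, F_TILDE) == fmme_round(a0, b, F_TILDE)
    for a0 in (0, 1) for b in range(16)
)
print(agree)
```

With $H_{in} = 0$ the even cycle needs no XOR at all, which is the observation the optimized first element builds on.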
Fig. 2. The data graph of one mbMM multiplication round and the choices of $T(\alpha)$, $g_n$ and $Y(\alpha)$.

Table 1 The data set for even and odd clock cycles

Architecture elements | Even clock cycles | Odd clock cycles
Multiplexer 1 | $B$ | $\tilde{F}$
Multiplexer 2 | $H_{in}$ | H register
Multiplexer 3 | $a_v$ | $h_0$
Register | Indifferent | H
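The even and odd data sets of Table 1 can be sketched in software. The following Python model is only an illustration under stated assumptions: polynomials are encoded as integers (bit i is the coefficient of $\alpha^i$), the signal $\tilde{F}$ is assumed to be $(F(\alpha) + 1)/\alpha$, and the names `mme_round`, `bit_serial_mm` and `poly_mulmod` are hypothetical helpers that do not come from the paper.

```python
def mme_round(h_in, a_v, b, f_tilde):
    """One mbMM round on a single MME, i.e. one even plus one odd clock cycle."""
    # Even cycle: the AND gates form a_v * B and the XOR gates add H_in;
    # the sum is latched in the C register.
    c = h_in ^ (b if a_v else 0)
    # Odd cycle: the same AND gates form c_0 * F~ and the same XOR gates add
    # the register content shifted down by one position (division by alpha).
    c0 = c & 1
    return (c >> 1) ^ (f_tilde if c0 else 0)


def bit_serial_mm(a, b, f, k):
    """k rounds (2k clock cycles) through one MME: A*B*alpha^(-k) mod F."""
    f_tilde = f >> 1  # ASSUMPTION: F~(alpha) = (F(alpha) + 1) / alpha
    h = 0
    for v in range(k):
        h = mme_round(h, (a >> v) & 1, b, f_tilde)
    return h


def poly_mulmod(x, y, f):
    """Reference shift-and-add product x*y mod f over GF(2); x, y are reduced."""
    k = f.bit_length() - 1
    result = 0
    while y:
        if y & 1:
            result ^= x
        y >>= 1
        x <<= 1
        if (x >> k) & 1:
            x ^= f
    return result


K, F = 4, 0b10011        # GF(2^4) with F(alpha) = alpha^4 + alpha + 1
ALPHA_K = F ^ (1 << K)   # alpha^K reduced modulo F
h = bit_serial_mm(0b0110, 0b1011, F, K)
# Undoing the Montgomery factor must give the ordinary field product.
assert poly_mulmod(h, ALPHA_K, F) == poly_mulmod(0b0110, 0b1011, F)
```

Under these assumptions each call to `mme_round` reuses one AND row and one XOR row twice, which mirrors the reason given above for one algorithmic round taking two clock cycles.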
Fig. 3. The proposed Montgomery multiplication element.
Therefore, the first round of the mbMM algorithm can be expressed by a single equation, $H(\alpha) = \sum_{i=0}^{k-2} a_0 b_{i+1} \alpha^i + a_0 b_0 \tilde{F}(\alpha)$. Instead of using the AND gate series for the bit to vector multiplication, the multiplexers can be appropriately utilized to accept as input the B or zero value and the $\tilde{F}$ or zero value, with the $a_0$ and $a_0 b_0$ bit values as control signals, respectively. Therefore, the proposed FMME has the architecture shown in Fig. 4 along with a 12-bit FMME example. Instead of k AND gates and $2k + 1$ MUXs, only one AND gate and $2k - 2$ MUXs are now needed: the proposed FMME uses $k - 1$ XOR gates, 1 AND gate and $2k - 2$ MUXs, while its critical path delay is $T_X$. The two proposed MMEs are summarized in Table 2. There exist no other similar works concerning the design of multiplier elements for $GF(2^k)$ fields as construction elements of different multiplier architectures. Research has only been done for whole multipliers. So, in order to prove the efficiency of the proposed MME for the design of an up-to-date Montgomery multiplier specially constructed for hardware, we propose three versatile versions of an MM architecture for $GF(2^k)$ fields using different hardware techniques. The MM architectures using the MME can be designed following the principles of folding and pipelining or a combination of the two. Therefore, three indicative versatile designs are proposed: the bit-serial Montgomery multiplier, the pipelined-semisystolic Montgomery multiplier and the partially pipelined Montgomery multiplier. Each architecture serves a different application need following criteria of space complexity, critical path delay (clock frequency) and latency.
Fig. 4. The proposed first Montgomery multiplication element.
Table 2 Gate–latch–MUX number and critical path delay for the two proposed MME (MME and FMME)

Prop. archit. | AND gates | XOR gates | Latches | MUXs | Crit. path delay
MME | $k$ | $k$ | $k$ | $2k + 1$ | $T_X + T_A$
FMME | 1 | $k - 1$ | None | $2k - 2$ | $T_X$
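The FMME simplification summarized in Table 2 can be checked with a small model. The sketch below is illustrative only: it assumes the integer bit encoding of polynomials (bit i is the coefficient of $\alpha^i$) and $\tilde{F}(\alpha) = (F(\alpha) + 1)/\alpha$, and the function names are hypothetical. It compares a multiplexer-based first round against the generic MME round with $H_{in} = 0$.

```python
def mme_round(h_in, a_v, b, f_tilde):
    """Generic MME round (even cycle, then odd cycle), used as the reference."""
    c = h_in ^ (b if a_v else 0)
    return (c >> 1) ^ (f_tilde if c & 1 else 0)


def fmme_round(a0, b, f_tilde):
    """First round with H_in = 0, built from multiplexers instead of AND gates."""
    # MUX group controlled by a_0: pass B/alpha (bits b_1 .. b_{k-1}) or zero.
    b_term = (b >> 1) if a0 else 0
    # The single AND gate produces a_0 * b_0, which controls the second MUX
    # group: pass F~ or zero.
    f_term = f_tilde if (a0 & b & 1) else 0
    # The XOR gates combine the two selected vectors.
    return b_term ^ f_term


K, F = 4, 0b10011   # GF(2^4) with F(alpha) = alpha^4 + alpha + 1
F_TILDE = F >> 1    # ASSUMPTION: F~(alpha) = (F(alpha) + 1) / alpha

# Exhaustive comparison over every first-round input for this small field.
for a0 in (0, 1):
    for b in range(1 << K):
        assert fmme_round(a0, b, F_TILDE) == mme_round(0, a0, b, F_TILDE)
```

In this model the two rounds agree for every input, which is consistent with replacing the k AND gates by a single AND gate that drives a multiplexer select line.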
5. Proposed Montgomery multiplier architectures

The proposed MME and FMME are used for the design of three Montgomery multiplier architectures. The proposed multiplier structures are indicative of how the proposed MMEs can be applied in the design of advantageous multiplier architectures that fit applications with various construction constraints, like low gate number or high throughput. The multipliers are designed to be versatile. We have already described in Section 3.1 how the MM algorithm should work for arbitrary $GF(2^m)$ fields. More specifically, it has been remarked that we can use either an arbitrary $R(\alpha) = \alpha^m$ polynomial or a fixed $R(\alpha) = \alpha^k$ polynomial. In the first case, the multiplier will be able to come up with a correct result after m rounds, while in the second case the correct result will always reach the multiplier output after k rounds regardless of the field order. From a design point of view, an arbitrary number of rounds means that the multiplier system should externally be able to know the required number of rounds with respect to the arbitrary $GF(2^m)$ field. On the other hand, the use of a fixed number of rounds does not pose the above restriction (one multiplication is always concluded after k rounds regardless of the arbitrary field order). The above two cases are studied for each proposed multiplier that uses the MME and FMME. In general, the structure of the multipliers remains the same for both cases, assuming that the system is aware of the field order and especially the m value.

The proposed bit-serial Montgomery multiplier, shown in Fig. 5(a), has an input and an output register to store the Intermediate Output H. Every two clock cycles the data are inserted in the MME through the Input Register. It must be noted that each algorithmic round corresponds to two clock cycles. The main advantage of this multiplier is that it uses only one MME. Therefore, the proposed bit-serial multiplier utilizes only k AND gates, k XOR gates, 3k latches and $2k + 1$ MUXs. The critical path delay is $(T_A + T_X)$ but the throughput remains low because 2k repetitions of the employed MME are needed. After 2k repetitions the multiplication result is loaded in the Output Register. When the field is arbitrary ($GF(2^m)$ field where
Fig. 5. The proposed versions of the Montgomery multiplier architecture. (a) Bit-serial versatile Montgomery multiplier. (b) Pipelined-semisystolic versatile Montgomery multiplier. (c) Partially pipelined versatile Montgomery multiplier.
$1 \le m \le k$), the Output Register is activated only after 2m clock cycles for arbitrary $R(\alpha)$ or after 2k clock cycles for fixed $R(\alpha)$.

In the proposed pipelined-semisystolic Montgomery multiplier, shown in Fig. 5(b), pipelining is used in order to increase the throughput of the Montgomery multiplier. In this architecture, k pipeline stages are employed, increasing the throughput approximately k times. However, the gate–flip flop number is increased due to the use of k multiplication elements. External storage memory is also needed in order to feed the $a_v$, $\tilde{F}$ and B values in each pipeline stage. Apart from the proposed MME, the proposed FMME is also used in this multiplier design in an effort to optimize the architecture. The proposed pipelined-semisystolic Montgomery multiplier utilizes $k^2 - k + 1$ AND gates, $k^2 - 1$ XOR gates, $2k^2 - k$ latches and $2k^2 + k - 3$ MUXs. It comes up with a multiplication result every algorithmic round, that is every $2(T_A + T_X)$, thus achieving very high throughput. In case of arbitrary $GF(2^m)$ fields, where $1 \le m \le k$, the outcome can be taken at the m-th pipeline stage of the proposed pipelined-semisystolic Montgomery multiplier architecture when arbitrary $R(\alpha)$ is used or at the k-th pipeline stage when $R(\alpha)$ is fixed.

The proposed partially pipelined Montgomery multiplier, shown in Fig. 5(c), employs both pipelining and folding in order to overcome some of the disadvantages of the previous two architectures. Therefore, it utilizes p pipeline stages, where $1 \le p \le k - 1$, and processes the multiplication product by using p-bit digits of the A input in serial fashion. Those stages are reused through a process similar to that of the proposed bit-serial Montgomery multiplier. Assume that the bit vector A of the $A(\alpha)$ polynomial is broken into $d = \lceil k/p \rceil$ digit vectors of p dimension. The partitioning of the $A(\alpha)$ polynomial would be $A(\alpha) = \sum_{j=0}^{d-1} A_j(\alpha)\,\alpha^{pj}$ where $A_j(\alpha) = \sum_{i=0}^{p-1} a_{pj+i}\,\alpha^i$. In vector format A would consist of d concatenated $A_j$ vectors, $A = (A_{d-1}, A_{d-2}, \ldots, A_1, A_0)$. The partially pipelined Montgomery multiplier consists of p pipelined stages and therefore it can process one $A_j$ digit before reinserting the p-th pipelined stage's output to the input, as is indicated in Fig. 5(c). Each MME processes one bit of the $A_j$ digit,
therefore, d repetitions of the partially pipelined architecture's main loop are needed to come up with a correct output result. The proposed partially pipelined Montgomery multiplier, as shown in Fig. 5(c), is in fact a generalization of the two previously proposed architectures, since the number of pipeline stages is customizable according to the application. For $p = k - 1$ this architecture is identical to the pipelined-semisystolic architecture, while for $p = 1$ the architecture is very similar to the bit-serial design. Note that the structure and logic of the partially pipelined Montgomery multiplier is similar to that of digit-serial multipliers. However, this multiplier is not digit serial since the bits are still processed in bit-serial fashion. Digit-serial multipliers process digits in bit parallel fashion and series of digits in bit-serial fashion [10,15]. The above description of the partially pipelined Montgomery multiplier fits well with arbitrary fields where $R(\alpha)$ is fixed. In that case, as already stated, the multiplier comes up with a result after $d = \lceil k/p \rceil$ digits are processed. When $R(\alpha)$ is arbitrary, using external control, the time for a correct result can be dictated to that multiplier. The result is output after $d = \lceil m/p \rceil$ digits are processed. The partially pipelined Montgomery multiplier uses both the proposed MME and the FMME architectures. Its gate number and throughput can be given parametrically according to the number p. Thus, the proposed partially pipelined Montgomery multiplier has $kp + 1$ AND gates, $kp + k - 1$ XOR gates, $2kp + k$ latches and $2kp + 2k + p - 2$ MUXs.

6. Proposed architecture's analysis and performance

The purpose of proposing the three versatile Montgomery multipliers in the previous section is to use them as a
tool for evaluating the performance of the proposed MME and for proving the efficiency of the MM algorithm for the design of versatile multipliers. So, comparisons of the proposed Montgomery multiplier architectures with other well-known sequential multiplier architectures are presented in terms of gate number, 1-bit latch number, two-to-one MUX number, critical path delay and number of clock cycles needed to come up with a result (latency). Additionally, FPGA hardware implementation results are given for the three proposed versatile architectures along with other versatile multiplier architectures, with measurements of chip covered area (gate number equivalent), clock frequency and throughput. The FPGA comparisons are used for proving the validity of the theoretical analysis concerning the efficiency of the proposed architectures. Summarizing the remarks for the three versatile Montgomery multipliers proposed earlier and adding the results from other sequential multipliers, Table 3 can be constructed. Table 3 is divided into three sections according to the type of compared sequential multipliers. The number of AND–XOR gates and MUXs is measured and the sum of those hardware components (HW comp.) is calculated. The number of 1-bit latches is measured separately along with the critical path delay and latency of each multiplier design. The latency is the number of clock cycles needed for the complete MM of two numbers. The superscript a in Table 3 indicates whether or not the design is versatile. Any additional gates that are used in some designs are added in the hardware component sum and appropriate indication is given. It must also be noted that, in order to have fair comparisons, designs using three or four input gates have been normalized to two input gates, counting a three-input gate as two two-input gates and a four-input gate as three: $\mathrm{Gate}_3 = 2\,\mathrm{Gate}_2$, $\mathrm{Gate}_4 = 3\,\mathrm{Gate}_2$.
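Table 3 Gate–latch–MUX number, latency and critical path delay comparisons

$GF(2^k)$ | AND | XOR | MUXs | HW comp. | Latches | Latency | Crit. path delay
Bit-serial$^a$ | $k$ | $k$ | $2k + 1$ | $4k + 1$ | $3k$ | $2k$ | $T_A + T_X$
LSBit [8] | $2k$ | $2k$ | 0 | $4k$ | $3k$ | $k$ | $T_A + T_X$
MSBit [8] | $2k$ | $2k$ | 0 | $4k$ | $3k$ | $k$ | $T_A + 2T_X$
Kitsos$^a$ [13] | $2k$ | $k$ | $k$ DeMUX | $5k - 1^b$ | $3k$ | $k$ | $2T_A + 2T_X + T_N + (k + 1)T_O$
Hasan$^a$ [14] | $3k$ | $2k$ | $k$ | $7k - 2^b$ | $k^2$ | $3k$ | $T_X + \lceil \log_2(k) \rceil (T_A + T_N + T_O)$
Semisyst.$^a$ | $k^2 - k + 1$ | $k^2 - 1$ | $2k^2 + k - 3$ | $4k^2 - 3$ | $2k^2 - k$ | $2k$ | $T_A + T_X$
Jain$^a$ [12] | $4k^2 + k$ | $2k^2 - 1$ | 0 | $7k^2 + 2k - 1^c$ | $3k^2 + k$ | $k + 1$ | $(k - 1)T_O$
Wang [27] | $2k^2$ | $2k^2$ | $k^2$ | $5k^2$ | $7k^2$ | $3k$ | $T_A + T_X$
Tsai [28] | $2k^2$ | $2k^2$ | $k^2$ | $5k^2$ | $8k^2$ | $2k$ | $T_A + T_X$
Par. pipel.$^a$ | $kp + 1$ | $kp + k - 1$ | $2kp + 2k + p - 2$ | $4kp + 3k + p - 2$ | $2kp + k$ | $2p\lceil k/p \rceil$ | $T_A + T_X$
Song [10] | $2kp$ | $2kp$ | $k$ | $4kp + k$ | $2k + p - 1$ | $\lceil k/p \rceil + 1$ | $p(T_A + T_X)$
Guo [29] | $2kp + k + p - 1$ | $2kp + p - 1$ | $2k$ | $4kp + 3k + 2p - 2$ | $10k + 5kp$ | $3\lceil k/p \rceil + p - 1$ | $T_A + 2T_X$

$^a$ Versatile design.
$^b$ $k - 1$ additional OR gates.
$^c$ $k^2 + k$ additional OR gates.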
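The closed-form entries of Table 3 for the proposed designs can be cross-checked against the per-element counts of Table 2. The Python sketch below is a consistency check only. The composition it uses (one FMME plus $k - 1$ MMEs for the pipelined-semisystolic design, one FMME plus p MMEs for the partially pipelined design) is an assumption inferred from the totals rather than a statement taken from the text. Latches are left out because the multiplier totals also include registers that Table 2 does not list. All function names are hypothetical.

```python
def mme(k):
    """Per-element combinational counts of the MME (Table 2)."""
    return {"AND": k, "XOR": k, "MUX": 2 * k + 1}


def fmme(k):
    """Per-element combinational counts of the FMME (Table 2)."""
    return {"AND": 1, "XOR": k - 1, "MUX": 2 * k - 2}


def total(k, n_mme, n_fmme):
    """Gate totals for n_mme MMEs plus n_fmme FMMEs, with their sum as HW."""
    m, f = mme(k), fmme(k)
    counts = {g: n_mme * m[g] + n_fmme * f[g] for g in m}
    counts["HW"] = sum(counts.values())
    return counts


def table3_semisystolic(k):
    """Closed forms listed in Table 3 for the pipelined-semisystolic design."""
    return {"AND": k * k - k + 1, "XOR": k * k - 1,
            "MUX": 2 * k * k + k - 3, "HW": 4 * k * k - 3}


def table3_partially_pipelined(k, p):
    """Closed forms listed in Table 3 for the partially pipelined design."""
    return {"AND": k * p + 1, "XOR": k * p + k - 1,
            "MUX": 2 * k * p + 2 * k + p - 2, "HW": 4 * k * p + 3 * k + p - 2}


k = 163
# ASSUMED composition: 1 FMME + (k - 1) MMEs, and 1 FMME + p MMEs.
assert total(k, k - 1, 1) == table3_semisystolic(k)
for p in (4, 8, 16, 32):
    assert total(k, p, 1) == table3_partially_pipelined(k, p)
# p = k - 1 reproduces the pipelined-semisystolic counts.
assert table3_partially_pipelined(k, k - 1) == table3_semisystolic(k)
```

The last assertion reflects the remark that the partially pipelined architecture coincides with the pipelined-semisystolic one for $p = k - 1$.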
The proposed versatile bit-serial Montgomery multiplier is compared with the well-known LSbit and MSbit multipliers [8], along with two versatile bit-serial designs [13,14]. The proposed bit-serial Montgomery multiplier achieves minimum critical path delay with half the gates and the same number of latches as the LSbit and MSbit multipliers, but with $2k + 1$ additional MUXs and more clock cycles to finish one multiplication. The overall space and time complexity of the bit-serial design therefore remains approximately the same as that of the LSbit and MSbit multipliers, as indicated by the overall hardware component number (HW comp.). However, the above comparison cannot be considered fair since the LSbit and MSbit multipliers are not versatile. When compared to the versatile designs of [13,14], the proposed bit-serial Montgomery multiplier is advantageous in terms of gate–latch number since it uses only k AND gates, k XOR gates and 3k latches. The extra MUXs of our design are counterbalanced by the additional AND, XOR and OR gates of [13,14]. As a result, the hardware component sum of our design is smaller than that of [13,14]. Considering the small number of utilized latches in our design, it can be remarked that the space complexity of the proposed bit-serial multiplier is the smallest among all the compared versatile designs. The critical path delay of the proposed design is significantly smaller than that of the versatile designs of [13,14], and the latency (number of clock cycles for one multiplication) is smaller than that of [14] but higher than that of [13]. The proposed versatile pipelined-semisystolic Montgomery multiplier, compared to the nonversatile semisystolic designs of [27,28], has $k^2 + 1$ fewer XOR gates, $k^2 + k - 1$ fewer AND gates and also fewer latches ($5k^2 + k$ and $6k^2 + k$ fewer, respectively), but more MUXs.
However, by summing up all the hardware components of the compared architectures, it can be remarked that the proposed pipelined-semisystolic design has the smallest such value. The critical path delay of both [27,28] is the same as that of the proposed pipelined-semisystolic Montgomery multiplier and the latency remains the same as [28] but higher than [27]. When compared to the versatile design of [12], the proposed versatile pipelined-semisystolic architecture has smaller critical path delay and considerably fewer gates and latches. The versatile design of [12] does not utilize MUXs, at the expense of many extra gates, and has lower latency than the proposed pipelined-semisystolic Montgomery multiplier. Although the proposed versatile partially pipelined Montgomery multiplier is not exactly a digit-serial architecture, it is still compared with digit-serial designs, since this seems the most relevant type of comparison. Using $1 \le p \le k - 1$ pipeline stages it still has fewer AND–XOR gates than the nonversatile, nonpipelined LSbit digit-serial multiplier of [10] but more latches and MUXs. However, the hardware component sum of our design when compared to [10] is of the same magnitude (2kp). When compared to the nonversatile pipelined (p pipeline
stages) architecture of [29], the proposed versatile partially pipelined Montgomery multiplier performs very well, with smaller gate–latch number and a critical path that is smaller by one XOR gate delay. However, the latency and the MUX number of the proposed partially pipelined design are higher than those in [29]. It must be noted, however, that [10,29] are not versatile, so their competitive space and time complexity is not a fair criterion of their efficiency in comparison to our versatile design. Unfortunately, no appropriate versatile digit-serial design was found in order to be added in Table 3. From the above remarks it can be noted that the proposed MME, and therefore the three proposed versatile Montgomery multipliers, give better results when compared to other well-known versatile designs based on the MSbit and LSbit algorithms in terms of gate–latch number and critical path delay. The proposed versatile Montgomery multiplier architectures achieve high throughput rates because of their high clock frequency (small critical path delay). Even against the nonversatile designs of [10,27–29], which are generally considered very efficient in terms of gate–latch–MUX number, critical path and latency, the three proposed versatile designs perform admirably, lacking only in latency against [10,12,29]. The extra complexity due to the MUX number is counterbalanced by the reduced AND–XOR number of the proposed versatile architectures, as indicated by the hardware component sums shown in Table 3. The three proposed multipliers were captured in VHDL and implemented in hardware using Xilinx Virtex 2 XC2V8000 FPGA technology with time delay synthesis constraints, for an arbitrary $GF(2^m)$ field, where $1 \le m \le 163$. Each such multiplier is functional for the $GF(2^{163})$ field and all underlying $GF(2^m)$ fields. Also, for comparison reasons, the most promising versatile designs of Table 3 were implemented in the same FPGA technology.
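As a rough cross-check of the throughput values listed in Table 4 for the proposed designs, the figures can be related to the clock frequency and the number of operand bits retired per clock cycle. The Python sketch below is an inferred model, not a formula given in the paper: it assumes 1/2 bit per cycle for the bit-serial design, k/2 bits per cycle for the pipelined-semisystolic design and p/2 bits per cycle for the partially pipelined design, with each algorithmic round lasting two clock cycles. The function name `throughput_mbps` is hypothetical.

```python
def throughput_mbps(f_mhz, bits_per_cycle):
    """Throughput in Mbps for a clock in MHz and an average bit rate per cycle."""
    return f_mhz * bits_per_cycle


K = 163  # field size used for the Table 4 implementations

# ASSUMPTION: bits retired per clock cycle, inferred from the reported numbers.
bit_serial = throughput_mbps(377.0, 1 / 2)     # one bit per two-cycle round
semisystolic = throughput_mbps(308.0, K / 2)   # one K-bit product per round
partial = {p: throughput_mbps(f, p / 2)        # p bits per round
           for p, f in {4: 249.3, 8: 249.3, 16: 249.6, 32: 249.6}.items()}

print(f"bit-serial:    {bit_serial:.1f} Mbps")
print(f"semisystolic:  {semisystolic / 1000:.1f} Gbps")
for p, t in partial.items():
    print(f"partial p={p:<2}: {t:.1f} Mbps")
```

Under this model the bit-serial, pipelined-semisystolic and p = 4, 8 values match Table 4, while the p = 16 and p = 32 values differ from the listed ones by roughly 0.1%, which may simply reflect rounding of the reported frequencies.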
The resulting implementation synthesis results are shown in Table 4.

Table 4 Area, clock frequency and throughput measurements for $GF(2^{163})$ multipliers

$GF(2^k)$ multiplier | Area (LUT/FF) | Max frequency (MHz) | Throughput
Bit-serial$^a$ | 326/489 | 377 | 188.5 Mbps
LSB [8] | 326/507 | 377 | 377 Mbps
Kitsos$^a$ [13] | 869/489 | 108 | 108 Mbps
Pip. semisystolic$^a$ | 53640/52479 | 308 | 25.1 Gbps
Jain$^a$ [12] | 89592/79620 | 110.8 | 18 Gbps
Par. pipelined$^a$, $p = 4$ | 2114/1011 | 249.3 | 498.6 Mbps
Par. pipelined$^a$, $p = 8$ | 3418/2383 | 249.3 | 997.2 Mbps
Par. pipelined$^a$, $p = 16$ | 6026/5127 | 249.6 | 1.994 Gbps
Par. pipelined$^a$, $p = 32$ | 11242/10615 | 249.6 | 3.988 Gbps

$^a$ Versatile design.

The proposed versatile bit-serial Montgomery multiplier implementation has the smallest chip covered area with only 326 LUTs and 489 flip flops (FF) against many
thousand LUTs-FF of the other proposed implementations. This proposed architecture is very advantageous when compared to the most promising bit-serial versatile design of Table 3 [13]. Even against the nonversatile LSB multiplier implementation, which does not include the extra space and time complexity of versatile designing, the proposed bit-serial multiplier implementation fares very well, lacking only in throughput. The proposed versatile pipelined-semisystolic Montgomery multiplier has very high chip covered area due to its semisystolic nature but can give one multiplication result every two clock cycles and therefore has a very high throughput value (25.1 Gbps). This implementation achieves far better results when compared to the versatile design of [12] in terms of chip covered area, maximum frequency and throughput values. Table 4 also presents several implementations of the proposed versatile partially pipelined Montgomery multiplier with different digit sizes. The clock frequency of the proposed versatile partially pipelined Montgomery multiplier implementations remains almost constant as the chip covered area increases. However, the throughput increases in proportion to the area increase. Summarizing all the above, throughput is very high for the proposed versatile pipelined-semisystolic Montgomery multiplier implementation, proportional to the number of pipeline stages p (digit size) for the proposed versatile partially pipelined Montgomery multiplier implementation and low for the proposed versatile bit-serial Montgomery multiplier implementation.

7. Conclusion

In this paper a Montgomery multiplication element (MME) was proposed for hardware applications, based on a proposed optimized version of the MM algorithm for $GF(2^k)$ fields. The proposed MME is designed for general type irreducible polynomials and arbitrary $GF(2^k)$ fields and leads to versatile multiplier architectures.
The hardware applicability of the MM algorithm in versatile designs and the efficiency of the proposed MME in terms of gate–latch–MUX number, critical path and latency, along with clock frequency and chip covered area, are examined by applying the MME to three different versatile Montgomery multiplier architectures. The proposed bit-serial Montgomery multiplier, pipelined-semisystolic Montgomery multiplier and partially pipelined Montgomery multiplier achieve very promising results in terms of space and time complexity when compared to other similar designs. The proposed versatile pipelined-semisystolic Montgomery multiplier is ideal for applications with high throughput needs, while the proposed versatile bit-serial Montgomery multiplier is ideal for applications with strict chip covered area constraints. The proposed versatile partially pipelined Montgomery multiplier, by controlling the number of pipeline stages, offers an adjustable architecture for applications that are constrained both in throughput and
chip covered area, covering needs that the other multipliers cannot satisfy.
References

[1] L.C. Washington, Elliptic Curves: Number Theory and Cryptography, Chapman and Hall, CRC, New York, 2003.
[2] M. Rosing, Implementing Elliptic Curve Cryptography, Manning Publications, Greenwich, 1999.
[3] I. Blake, G. Seroussi, N. Smart, Elliptic Curves in Cryptography, Cambridge University Press, Cambridge, UK, 1999.
[4] A. Menezes, P. van Oorschot, S. Vanstone, Handbook of Applied Cryptography, CRC Press, Boca Raton, FL, 1996.
[5] E.D. Mastrovito, VLSI architectures for computations in Galois fields, Ph.D. Thesis, Linköping University, Sweden, 1991.
[6] F. Rodriguez-Henriquez, C.K. Koc, Parallel multipliers based on special irreducible pentanomials, IEEE Trans. Comput. 52 (12) (2003) 1533–1542.
[7] B. Sunar, C.K. Koc, Mastrovito multiplier for all trinomials, IEEE Trans. Comput. 48 (5) (1999) 522–527.
[8] D. Hankerson, A. Menezes, S. Vanstone, Guide to Elliptic Curve Cryptography, Springer, New York, 2004.
[9] G. Orlando, C. Paar, A super-serial Galois field multiplier for FPGA and its application to public-key algorithms, in: Proceedings of the Seventh IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 99), Napa Valley, April 21–23, 1999.
[10] L. Song, K. Parhi, Low-energy digit-serial/parallel finite field multipliers, J. VLSI Signal Process. 19 (2) (1998) 149–166.
[11] C.S. Yeh, I.S. Reed, T.K. Truong, Systolic multipliers for finite fields $GF(2^m)$, IEEE Trans. Comput. C-33 (1984) 357–360.
[12] S.K. Jain, L. Song, K. Parhi, Efficient semisystolic architectures for finite-field arithmetic, IEEE Trans. Very Large Scale Integration Syst. 6 (1) (1998) 101–113.
[13] P. Kitsos, G. Theodoridis, O. Koufopavlou, An efficient reconfigurable multiplier architecture for Galois field $GF(2^m)$, Microelectron. J. 34 (10) (2003) 975–980.
[14] M.A. Hasan, M. Ebtedaei, Efficient architectures for computations over variable dimensional Galois field, IEEE Trans. Circuits Syst. I Fundam. Theory Appl. 45 (11) (1998).
[15] G. Bertoni, J. Guajardo, G. Orlando, Systolic and scalable architectures for digit-serial multiplication in fields $GF(p^m)$, in: INDOCRYPT 2003, Lecture Notes in Computer Science, vol. 2904, Springer, Berlin, 2003, pp. 349–362.
[16] P.L. Montgomery, Modular multiplication without trial division, Math. Comput. 44 (170) (1985) 519–521.
[17] A.P. Fournaris, O. Koufopavlou, Montgomery modular multiplier architectures and hardware implementations for an RSA cryptosystem, in: Proceedings of the 46th IEEE Midwest Symposium on Circuits and Systems '03, Cairo, Egypt, December 27–30, 2003.
[18] C.K. Koc, T. Acar, Montgomery multiplication in $GF(2^k)$, in: Proceedings of the Third Annual Workshop on Selected Areas in Cryptography, Ontario, Canada, August 1996, pp. 95–106.
[19] H. Wu, Montgomery multiplier and squarer for a class of finite fields, IEEE Trans. Comput. 51 (5) (2002) 521–529.
[20] C.-Y. Lee, J.-S. Horng, I-C. Jou, E.-H. Lu, Low-complexity bit-parallel systolic Montgomery multipliers for special classes of $GF(2^m)$, IEEE Trans. Comput. 54 (9) (2005) 1061–1070.
[21] E. Ozturk, B. Sunar, E. Savas, A versatile Montgomery multiplier architecture with characteristic three support, Pre-print, June 2005.
[22] G. Gaubatz, Versatile Montgomery multiplier architectures, Master Thesis, Worcester Polytechnic Institute, May 2002.
[23] A. Menezes, I. Blake, X. Gao, R. Mullin, S. Vanstone, T. Yaghoobian, Applications of Finite Fields, Kluwer Academic Publishers, Dordrecht, 1993.
[24] G. Orlando, Efficient Elliptic Curve Processor Architectures for Field Programmable Logic, Worcester Polytechnic Institute, March 2002.
[25] K. Parhi, VLSI Signal Processing Systems: Design and Implementation, Wiley, New York, 1999.
[26] J.M. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits: A Design Perspective, second ed., Prentice-Hall, Englewood Cliffs, NJ, 2003.
[27] C.L. Wang, J.L. Lin, Systolic array implementation of multipliers for finite fields $GF(2^m)$, IEEE Trans. Circuits Syst. 38 (7) (1991) 796–800.
[28] W.C. Tsai, S.-J. Wang, Two systolic architectures for multiplication in $GF(2^m)$, IEE Proc. Comput. Digit. Tech. 147 (6) (2000) 375–382.
[29] J.-H. Guo, C.-L. Wang, Digit-serial systolic multiplier for finite fields $GF(2^m)$, IEE Proc. Comput. Digit. Tech. 145 (2) (1998) 143–148.
Apostolos P. Fournaris received his Diploma and Ph.D. degree from the Electrical and Computer Engineering Department of the University of Patras, Greece, in 2001 and 2007, respectively. He also holds a visiting position in the Technical University of Patras. His research interests include public key cryptography, finite field arithmetic, wireless network security and VLSI design.
Odysseas Koufopavlou received the Diploma of Electrical Engineering in 1983 and the Ph.D. degree in Electrical Engineering in 1990, both from University of Patras, Greece. From 1990 to 1994 he was at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He is currently Professor with the Department of Electrical and Computer Engineering, University of Patras. His research interests include computer networks, high performance communication subsystems architecture and implementation, VLSI low power design, and VLSI crypto systems. Dr. Koufopavlou has published more than 150 technical papers and received patents and inventions in these areas. He has participated as coordinator or partner in many Greek and European R&D programmes. He served as general chairman for the IEEE ICECS’1999.