Fast, compact and symmetric modular exponentiation architecture by common-multiplicand Montgomery modular multiplications


INTEGRATION, the VLSI journal

Contents lists available at SciVerse ScienceDirect

INTEGRATION, the VLSI journal journal homepage: www.elsevier.com/locate/vlsi

Tao Wu a,*, Shuguo Li b,**, Litian Liu b

a Department of Microelectronics and Nanoelectronics, Tsinghua University, Beijing 100084, PR China
b Institute of Microelectronics, Tsinghua University, Beijing 100084, PR China

Article info

Article history: Received 1 April 2012; received in revised form 23 September 2012; accepted 24 September 2012

Keywords: Modular exponentiation architecture; Common-multiplicand Montgomery modular multiplication; Feedforwarding mechanism; Fault attack resistance; Simple power attack resistance

Abstract

In this paper, the primitive common-multiplicand Montgomery modular multiplication is developed for modular exponentiation. Together with the Montgomery powering ladder, a fast, compact and symmetric modular exponentiation architecture is proposed for hardware implementation. The architecture consists of one group of processing elements along the central line and two symmetric groups of accumulation units on the two sides. The central elements perform modular reductions, while the symmetric units on both sides accumulate the modular multiplication results. A feedforwarding architecture is employed to decrease the latency between processing elements, in parallel with the word-based accumulation units, which are also pipelined. Meanwhile, owing to the symmetric architecture and the Montgomery powering ladder, the modular exponentiation is resistant to fault attacks and simple power attacks. Implemented on an FPGA platform, the proposed design outperforms most results reported so far in the literature. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

Public-key cryptography, such as RSA cryptography [1], ElGamal cryptography [2], and the Diffie-Hellman key exchange protocol [3], leads to finite field arithmetic over long integers. These cryptographic applications rely on modular exponentiations. Efficient hardware architectures for modular exponentiation have been described in the literature, based on systolic arrays [4], residue number systems [5], high-radix scalable architectures [6], and carry-save additions [7]. A review of general hardware architectures for modular exponentiation can be found in [8]. In fact, fast modular exponentiation can be achieved in two ways: (1) fast algorithms or architectures for modular multiplication; (2) efficient algorithms or architectures for the modular exponentiation itself. Besides the time and area performance of modular exponentiation, resistance to deliberate attacks should also be considered, since modular exponentiations themselves serve as the mathematical barrier for security. In this paper, we aim to improve modular exponentiation by addressing these requirements together.

* Principal corresponding author (T. Wu). ** Corresponding author (S. Li).

In this work, a fast, compact and symmetric modular exponentiation architecture is proposed, which outperforms most results reported so far in the literature in both speed and area overhead. The architecture has several characteristics:

(1) It implements, in its processing elements, a feedforwarding mechanism similar to that of a low-latency scalable Montgomery modular multiplier [9,10].
(2) It applies the common-multiplicand Montgomery modular multiplication algorithm [11,12] to modular exponentiation.
(3) It employs the Montgomery ladder in the left-to-right binary method, and the resulting symmetry protects the modular exponentiation against fault and simple power attacks [13].

In particular, although the architecture does not contain two Montgomery modular multipliers, it can be divided into three parts: two symmetric groups of accumulation units and one group of central processing elements. Together with the input pattern of the Montgomery powering ladder, this centrosymmetry makes the architecture resistant to fault and simple power attacks. Nevertheless, the proposed architecture only deals with operands of fixed binary length, and is therefore not scalable [14].

The rest of this paper is organized as follows: Section 2 introduces the original common-multiplicand Montgomery modular multiplication; in Section 3, we develop a common-multiplicand algorithm for hardware implementation of modular exponentiations; then in Section 4 the architecture implementing the common-multiplicand Montgomery modular exponentiation is proposed, where both radix-2 and radix-4 modular exponentiation architectures are presented; in Section 5 some experiments are given for comparison with results in the literature; and Section 6 concludes this paper.

0167-9260/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.vlsi.2012.09.002

Please cite this article as: T. Wu, et al., Fast, compact and symmetric modular exponentiation architecture by common-multiplicand Montgomery modular multiplications, INTEGRATION, the VLSI journal (2012), http://dx.doi.org/10.1016/j.vlsi.2012.09.002

2. Common-multiplicand Montgomery modular multiplication

Common-multiplicand Montgomery modular multiplication was proposed for modular exponentiation in [11,12] and has been further developed in [12,15,16]. In [12], combined with a signed-digit recoding technique, about 66.7% of the computational cost is saved. However, on the one hand, this algorithm is aimed at software implementations and has only been applied in hardware architectures for modular multiplication/exponentiation in $GF(2^m)$ [17]. On the other hand, the more advanced common-multiplicand Montgomery modular multiplications [12,15,16] are difficult to implement in hardware because of their complex operations. Therefore, in this paper we focus on the primitive algorithm proposed in [11], from which a fast, compact and symmetric modular exponentiation architecture can be devised.

In this section, we first give an overview of the common-multiplicand Montgomery modular multiplication algorithm [11]. Suppose F, S and M are all n-bit integers, and there are two Montgomery modular multiplications: $\mathrm{Mtul}(S,F) = S \cdot F \cdot 2^{-n} \bmod M$ and $\mathrm{Mtul}(S,S) = S \cdot S \cdot 2^{-n} \bmod M$. The common-multiplicand Montgomery modular multiplication exploits the common multiplicand S shared by the two Montgomery modular multiplications and divides them into two parallel processes:

• common modular reductions: $T_i := S \cdot 2^{-i} \bmod M$, for $i = 1, 2, \ldots, n$;
• two separate accumulations: suppose $s_i$ and $f_i$ are, respectively, the i-th digits of S and F; then

$$X = \sum_{i=1}^{n} f_i\, S \cdot 2^{-i} \pmod{M}, \tag{1}$$

$$Y = \sum_{i=1}^{n} s_i\, S \cdot 2^{-i} \pmod{M}. \tag{2}$$

In the above equations, the Montgomery modular multiplication Mtul(S,F) modulo M has been divided into a sum of partial products as follows [11]:

$$\mathrm{Mtul}(S,F) = S \cdot F \cdot 2^{-n} \pmod{M} = S \cdot \Big(\sum_{i=0}^{n-1} f_i \cdot 2^{i}\Big) \cdot 2^{-n} \pmod{M} = f_0 \cdot S \cdot 2^{-n} + f_1 \cdot S \cdot 2^{-(n-1)} + \cdots + f_{n-1} \cdot S \cdot 2^{-1} \pmod{M}. \tag{3}$$

A similar expression exists for Mtul(S,S). The common multiplicand S in Eqs. (1) and (2) then enables parallel computation of the two Montgomery modular multiplications. An algorithm that utilizes this parallelism was first proposed in [11] and is shown below.

Algorithm 1. Original common-multiplicand Montgomery modular multiplication.
Input: S and F are both n-bit numbers, with $S = \sum_{i=0}^{n-1} s_i \cdot 2^i$ and $F = \sum_{i=0}^{n-1} f_i \cdot 2^i$. M is the modulus, with $2^{n-1} < M < 2^n$ and GCD(M,2) = 1. In addition, $r = -M_0^{-1} \bmod 2$.
Output: X = Mtul(S,F), Y = Mtul(S,S).
1: X := 0, Y := 0;
2: T := S;
3: for i = n−3 downto 0 do
4:   $m := T_0 \cdot r \bmod 2$;
5:   T := (T + mM)/2;
6:   $X := X + f_i T$, $Y := Y + s_i T$;
7: end for
8: for i = 0 to 1 do
9:   $X := X + f_{n-2+i} S$, $Y := Y + s_{n-2+i} S$;
10:  $m_X := X_0 r \bmod 2$, $m_Y := Y_0 r \bmod 2$;
11:  $X := (X + m_X M)/2$, $Y := (Y + m_Y M)/2$;
12: end for
13: while X ≥ M do
14:   X := X − M;
15: end while
16: while Y ≥ M do
17:   Y := Y − M;
18: end while
19: return X, Y.

Algorithm 1 effectively reduces the computational effort of Montgomery modular multiplication by sharing common computation. While it is suitable for software implementation, no hardware implementation in prime fields has been reported so far. We will present an exponentiation architecture based on such common-multiplicand Montgomery modular multiplications in the following sections. The advanced common-multiplicand algorithms in [15,16] find more common computation in modular exponentiations, but they require complex operations and are inefficient for hardware implementation.

3. Montgomery modular exponentiation based on revised common-multiplicand Montgomery modular multiplications

Montgomery modular exponentiation refers to modular exponentiation performed by Montgomery modular multiplications; the latter is shown in Algorithm 2.

Algorithm 2. Montgomery modular multiplication.
Input: $M < R = 2^n$, GCD(M,2) = 1. $0 \le A \le 2^n$, $0 \le B \le 2^n$, and $0 \le A \cdot B < M \cdot R$. Precompute $m = -M^{-1} \bmod R$.
Output: $S = A \cdot B \cdot R^{-1} \bmod M$, $0 \le S < M$.
1: T = A · B;
2: Q = T · m mod R;
3: S = (T + Q · M)/R;
4: if S ≥ M then
5:   S := S − M;
6: end if
7: return S.
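As an illustration of Algorithm 2, the following Python sketch performs one Montgomery modular multiplication on arbitrary-precision integers. The function name `mont_mul` and the on-the-fly computation of m are assumptions made for this example only; a hardware design would precompute m and operate on fixed-width words.

```python
def mont_mul(A, B, M, n):
    """Return A * B * R^-1 mod M with R = 2**n (M odd, M < R, A*B < M*R)."""
    R = 1 << n
    m = (-pow(M, -1, R)) % R   # m = -M^-1 mod R; normally precomputed once
    T = A * B                  # step 1
    Q = (T * m) % R            # step 2: makes T + Q*M divisible by R
    S = (T + Q * M) >> n       # step 3: exact division by R
    if S >= M:                 # steps 4-6: one conditional subtraction
        S -= M
    return S
```

Because T + Q·M < 2·M·R under the stated input condition, a single conditional subtraction is enough to bring the result below M.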

Montgomery modular multiplication has many variant forms [18] and is efficient for hardware implementation. Combined with the binary method, it can be used to perform modular exponentiation.

3.1. Common-multiplicand Montgomery modular multiplication for hardware implementation

For the sake of hardware implementation, the original common-multiplicand Montgomery modular multiplication is rewritten in Algorithm 3, in which the last two steps with $m_X$ and $m_Y$ are replaced by regular operations.

Algorithm 3. Regular common-multiplicand Montgomery modular multiplication.
Input: S and F are both n-bit numbers, and the modulus is M, with $S = \sum_{i=0}^{n-1} s_i 2^i$, $F = \sum_{i=0}^{n-1} f_i 2^i$, $2^{n-1} < M < 2^n$, and GCD(M,2) = 1.
Output: X = Mtul(F,S), Y = Mtul(S,S).
1: X := 0, Y := 0;
2: T := S;
3: for i = n−1 downto 0 do
4:   $q_i := T_0 \bmod 2$;
5:   $T := (T + q_i M)/2$;
6:   $X := X + f_i T$, $Y := Y + s_i T$;
7: end for
8: while X ≥ M do
9:   X := X − M;
10: end while
11: while Y ≥ M do
12:   Y := Y − M;
13: end while
14: return X, Y.
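The regular form of Algorithm 3 can be sketched in a few lines of Python. This is only a software model under the assumption that the loop runs from the most significant bit down to bit 0, so that bit $f_i$ meets $T = S \cdot 2^{-(n-i)}$ as in Eq. (3); the name `cm_mont_mul` is hypothetical.

```python
def cm_mont_mul(S, F, M, n):
    """Return (S*F*2^-n mod M, S*S*2^-n mod M) with one shared reduction chain."""
    X = Y = 0
    T = S
    for i in range(n - 1, -1, -1):
        q = T & 1                 # q_i: least significant bit of T
        T = (T + q * M) >> 1      # shared reduction, T = S * 2^-(n-i) (mod M)
        if (F >> i) & 1:          # accumulate for Mtul(S, F)
            X += T
        if (S >> i) & 1:          # accumulate for Mtul(S, S)
            Y += T
    while X >= M:                 # final reductions, as in steps 8-13
        X -= M
    while Y >= M:
        Y -= M
    return X, Y
```

The reduction chain on T is computed once and feeds both accumulators, which is exactly the saving that the common multiplicand provides.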

In the updated algorithm, $T \cdot 2^{-i}$ for $i \ge 2$ can be accumulated at each loop. Meanwhile, the accumulation leads to $(n + \log_2 n)$-bit operands, which should be reduced in later modular multiplications. Therefore, we can define a new Montgomery modular multiplication as

$$\mathrm{Nmul}(S,F) = S \cdot F \cdot 2^{-(n+2g)} \bmod M, \tag{4}$$

with $2g \ge 1$. In order to determine the proper value of g, we have to consider both the reduction and accumulation processes. First, in the reduction process, after k loops, there is

$$T \cdot 2^{k} = S + \sum_{i=0}^{k-1} q_i 2^{i} M \le S + \sum_{i=0}^{k-1} 1 \cdot 2^{i} M = S + (2^{k}-1)M < S + 2^{k} M. \tag{5}$$

Therefore, $T < S/2^{k} + M$. We can choose a proper value of k, so that $S/2^{k} < 2^{n}$ and $T < 2^{n} + M < 2^{n+1}$. Then, for $i \ge k$ all the values of T will fall below $2^{n+1}$. Second, we only have to accumulate values from $i = k$ to $i = n-1+k$, and the final sum will stay below $(n+1) + \log_2(n+1)$ bits. Meanwhile, the condition $S/2^{k} < 2^{n}$ is equivalent to $S/2^{n} < 2^{k}$, which can be ensured by $k > (n+1) + \log_2(n+1) - n = 1 + \log_2(n+1)$. So we can set $g = 1 + \lceil \log_2(n+1) \rceil$, and then the accumulated result will stay below $(n+1) + \log_2(n+1) \le n + g$ bits. In this situation, define two numbers $0 \le S < 2^{n+g}$ and $F = \sum_{i=0}^{n+g-1} f_i 2^{i}$, and the new Montgomery modular multiplication reads

$$H = \mathrm{Nmul}(S,F) = S \cdot F \cdot 2^{-(n+2g)} \bmod M = \sum_{i=0}^{n+g-1} f_i S\, 2^{\,i-n-2g} \pmod{M} = \sum_{i=0}^{n+g-1} f_{n+g-1-i}\, S \cdot 2^{-(g+1+i)} \pmod{M}. \tag{6}$$

Theorem 1. Define $U_1 = S \cdot 2^{-(g+1)} \bmod M$ in Eq. (6), with $0 \le S < 2^{n+g}$; then we have $U_1 < 2M$.

Proof. According to Algorithm 3, there is

$$2^{g+1} U_1 = S + \sum_{i=0}^{g} q_i 2^{i} M \le S + \sum_{i=0}^{g} 1 \cdot 2^{i} M = S + (2^{g+1}-1)M < S + 2^{g+1} M.$$

Dividing both sides of the above equation by $2^{g+1}$, we get

$$U_1 < S/2^{g+1} + M < 2^{n-1} + M < 2M. \tag{7}$$

□

Theorem 2. Define $U_2 = S \cdot 2^{-(n+2g)} \bmod M = U_1 \cdot 2^{-(n+g-1)} \bmod M$; then there is $U_2 \le M$.

Proof. Similarly, we have

$$2^{n+g-1} U_2 = U_1 + \sum_{i=1}^{n+g-1} q_i \cdot 2^{\,n+g-1-i} M \le U_1 + \sum_{i=1}^{n+g-1} 1 \cdot 2^{\,n+g-1-i} M = U_1 + (2^{n+g-1}-1)M. \tag{8}$$

Dividing both sides of the above equation by $2^{n+g-1}$ yields

$$U_2 < U_1/2^{n+g-1} + M < 2M/2^{n+g-1} + M = (1 + 2^{-(n+g-2)})M. \tag{9}$$

Because $U_2$ is an integer and $2^{-(n+g-2)} \ll 1$, the above equation leads to $U_2 \le M$. If $U_2$ is the final result for cryptographic applications, no final subtraction is needed [19]. □

Theorem 1 is used at once, while Theorem 2 will be applied in Section 3.2. Substituting $U_1$ for $S \cdot 2^{-(g+1)} \bmod M$ in Eq. (6) yields

$$H = S \cdot F \cdot 2^{-(n+2g)} \bmod M = \sum_{i=0}^{n+g-1} f_{n+g-1-i}\, U_1 2^{-i} \pmod{M}. \tag{10}$$

Furthermore, all the partial products in Eq. (6) are below 2M. From Eqs. (10) and (7), one gets

$$\log_2 H < \log_2\{(n+g) \cdot 2M\} < n + \log_2\{2(n+g)\}. \tag{11}$$

When n = 1024, $g = 1 + \lceil \log_2(n+1) \rceil = 12$. Meanwhile, as $g \ll n$, we still get $g := 1 + \lceil \log_2(n+g) \rceil = 12$. Substituting this value into Eq. (11), one finds that

$$\log_2 H < n + \log_2\{2(n+12)\} = n + 11.02 < n + 12 = n + g, \tag{12}$$

which shows the self-consistency of g. Accordingly, about 2g = 24 extra clock cycles are required to obtain the final result, in addition to the n clock cycles of the original Montgomery modular multiplication. Now, we can develop an optimized common-multiplicand Montgomery modular multiplication algorithm for modular exponentiation.

Algorithm 4. Proposed common-multiplicand Montgomery modular multiplication.
Input: S and F are both (n+g)-bit numbers, with $g = 1 + \lceil \log_2(n+1) \rceil$, $S = \sum_{i=0}^{n+g-1} s_i 2^{i}$, and $F = \sum_{i=0}^{n+g-1} f_i 2^{i}$. The modulus is M, with $2^{n-1} < M < 2^{n}$ and gcd(M,2) = 1.
Output: $X = \mathrm{Nmul}(S,F) = S \cdot F \cdot 2^{-(n+2g)} \bmod M$, $Y = \mathrm{Nmul}(S,S) = S^2 \cdot 2^{-(n+2g)} \bmod M$, with $0 \le X < 2^{n+g}$, $0 \le Y < 2^{n+g}$.
1: X := 0, Y := 0;
2: T := S;
3: for i = 1 to n+2g do
4:   $q_i := T_0 \bmod 2$;
5:   $T := (T + q_i M)/2$;
6:   if g+1 ≤ i ≤ n+2g then
7:     $X := X + f_{n+2g-i} \cdot T$, $Y := Y + s_{n+2g-i} \cdot T$;
8:   end if
9: end for
10: return X, Y.
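A software model of Algorithm 4 may help to see how the g + 1 reduction-only iterations keep the accumulators within n + g bits without any final subtraction. The sketch below is a minimal illustration, not the hardware datapath; `nmul_pair` is a hypothetical name, and the results are only congruent to the products modulo M, not fully reduced.

```python
def nmul_pair(S, F, M, n, g):
    """Model of Algorithm 4: returns (X, Y) with
    X = S*F*2^-(n+2g) and Y = S*S*2^-(n+2g) (mod M), both unreduced."""
    k = n + 2 * g
    X = Y = 0
    T = S
    for i in range(1, k + 1):
        q = T & 1
        T = (T + q * M) >> 1          # T = S * 2^-i (mod M)
        if i >= g + 1:                # the first g+1 loops only reduce T
            if (F >> (k - i)) & 1:    # bit f_{n+2g-i}
                X += T
            if (S >> (k - i)) & 1:    # bit s_{n+2g-i}
                Y += T
    return X, Y
```

For n = 16 the rule $g = 1 + \lceil \log_2(n+1) \rceil$ gives g = 6, and the outputs stay below $2^{n+g}$, so they can be fed back as operands of the next multiplication.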

As shown in Algorithm 4, the accumulation process starts at i = g+1 and ends at i = n+2g. Meanwhile, all the partial products are below $2^{n+1}$, and only (n+g)-bit registers are required to store X and Y. The word-based algorithm for common-multiplicand Montgomery modular multiplication is shown in Algorithm 5.

Algorithm 5. Word-based common-multiplicand Montgomery modular multiplication.
Input: S and F are both (n+g)-bit numbers, with $g = 1 + \lceil \log_2(n+1) \rceil$, $S = \sum_{i=0}^{n+g-1} s_i \cdot 2^{i}$, and $F = \sum_{i=0}^{n+g-1} f_i \cdot 2^{i}$, $s_i \in \{0,1\}$, $f_i \in \{0,1\}$. With $m = \lceil (n+g)/w \rceil$, all the operands can be separated into m words. M is the modulus, with $2^{n-1} < M < 2^{n}$ and gcd(M,2) = 1.
Output: $X = \mathrm{Nmul}(S,F) = S \cdot F \cdot 2^{-(n+2g)} \bmod M$, $Y = \mathrm{Nmul}(S,S) = S^2 \cdot 2^{-(n+2g)} \bmod M$, with $0 \le X < 2^{n+g}$, $0 \le Y < 2^{n+g}$.
1: X := 0, Y := 0;
2: T := S;
3: for i = 1 to n+2g do
4:   $q^{(i)} := T_0 \bmod 2$;
5:   $S^{(0)} := 0$, $D^{(0)} := 0$, $c_0 := 0$;
6:   for j = 0 to m−1 do
7:     $S_j^{(i)} := T_{j,w-2..0}^{(i-1)} + q^{(i)} M_j + c_j$;
8:     $d_j^{(i)} := S_{j,0}^{(i)}$;
9:     $c_{j+1} \cdot 2^{w} + T_j^{(i)} = S_{j,w-1..1}^{(i)} + d_{j+1}^{(i-1)} \cdot 2^{w-1}$;
10:   end for
11:   if g+1 ≤ i ≤ n+2g then
12:     for j = 0 to m−1 do
13:       $c_f \cdot 2^{w} + X_j := X_j + f_{n+2g-i} \cdot T_j + c_f$;
14:       $c_s \cdot 2^{w} + Y_j := Y_j + s_{n+2g-i} \cdot T_j + c_s$;
15:     end for
16:   end if
17: end for
18: return X, Y.
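To make the word-level view of the reduction step more concrete, the sketch below performs one iteration T := (T + qM)/2 word by word, with a ripple carry between words and a one-bit shift across word boundaries. It is a simplified, non-pipelined model: the feedforward timing of the shifted bits in Algorithm 5 is deliberately not reproduced, and all names (`to_words`, `from_words`, `reduce_step_words`) are hypothetical.

```python
def to_words(x, m, w):
    """Split x into m little-endian words of w bits."""
    mask = (1 << w) - 1
    return [(x >> (j * w)) & mask for j in range(m)]

def from_words(words, w):
    """Inverse of to_words."""
    return sum(v << (j * w) for j, v in enumerate(words))

def reduce_step_words(T, M, w):
    """One step T := (T + q*M) / 2 on word lists T and M (M odd)."""
    mask = (1 << w) - 1
    q = T[0] & 1                       # quotient bit from the lowest word
    carry = 0
    sums = []
    for tj, mj in zip(T, M):
        s = tj + q * mj + carry        # word addition with incoming carry
        sums.append(s & mask)
        carry = s >> w                 # at most 1
    sums.append(carry)                 # extra top bit before the shift
    # Right shift by one: each word takes bit 0 of its upper neighbour.
    return [(sums[j] >> 1) | ((sums[j + 1] & 1) << (w - 1))
            for j in range(len(T))]
```

The bit passed down from word j+1 to word j in the last line plays the role of the shifted bit $d_{j+1}$ in Algorithm 5; in the hardware it arrives one clock cycle later, which is what the feedforward mechanism compensates for.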

3.2. Common-multiplicand modular exponentiation

Based on the above common-multiplicand modular multiplication algorithm, the left-to-right binary method and the Montgomery powering ladder [20,21], our algorithm for modular exponentiation is given in Algorithm 6.
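The ladder of Algorithm 6 can be modelled in software as follows. This is a sketch under several assumptions: the helper `nmul_pair` is a hypothetical big-integer model of the common-multiplicand multiplication of Algorithm 4, the constants are computed with Python's `pow` instead of being supplied as inputs, and the operand bound $2^{n+g}$ is taken from the analysis in Section 3.1 rather than checked for arbitrary n.

```python
def nmul_pair(S, F, M, n, g):
    """Return (S*F, S*S) * 2^-(n+2g), congruent mod M but not fully reduced."""
    k = n + 2 * g
    X = Y = 0
    T = S
    for i in range(1, k + 1):
        T = (T + (T & 1) * M) >> 1
        if i >= g + 1:
            if (F >> (k - i)) & 1:
                X += T
            if (S >> (k - i)) & 1:
                Y += T
    return X, Y

def ladder_modexp(A, E, M, n):
    """A^E mod M for an n-bit odd modulus M and an exponent E < 2^n."""
    g = 1 + n.bit_length()              # equals 1 + ceil(log2(n + 1))
    k = n + 2 * g
    lam = pow(2, 2 * k, M)              # lambda = 2^(2n+4g) mod M
    S, _ = nmul_pair(A, lam, M, n, g)   # A in the Montgomery domain
    F = pow(2, k, M)                    # 1 in the Montgomery domain
    for i in range(n - 1, -1, -1):
        if (E >> i) & 1:
            F, S = nmul_pair(S, F, M, n, g)   # F := S*F, S := S*S
        else:
            S, F = nmul_pair(F, S, M, n, g)   # S := F*S, F := F*F
    Z, _ = nmul_pair(F, 1, M, n, g)     # leave the Montgomery domain
    return Z % M                        # maps the corner case Z = M to 0
```

Each exponent bit triggers exactly one call with one common multiplicand (S for a 1 bit, F for a 0 bit), so the sequence of operations does not depend on the bit pattern of E.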

Algorithm 6. Montgomery modular exponentiation by common-multiplicand Montgomery modular multiplication and Montgomery powering ladder.
Input: $0 \le A < M < 2^{n}$, $E = \sum_{i=0}^{n-1} e_i 2^{i}$. $\lambda = 2^{(2n+4g)} \bmod M$, $\delta = 2^{n+2g} \bmod M$, $C = R \bmod M$, GCD(M,2) = 1.
Output: $Z = A^{E} \bmod M$, $0 \le Z \le M$.
S := A · R mod M = Mtul(A, λ), F := C;
for i = n−1 to 0 do
  if $e_i = 1$ then
    F := Mtul(S,F), S := Mtul(S,S);
  else
    F := Mtul(F,F), S := Mtul(S,F);
  end if
end for
Z := Mtul(1,F);
return Z.

In Algorithm 6, the modular multiplication Nmul(A, λ) first transforms A from the integer domain into the Montgomery domain as $A \cdot 2^{n+2g} \bmod M$. The following loops compute $A^{E} \bmod M$ in the Montgomery domain, and a modular multiplication and a modular squaring are performed simultaneously whether the current exponent bit is 1 or 0. On Lines 5 and 7 of Algorithm 6, the values of F and S are updated at the same time. The final result is then transformed back to the integer domain by Nmul(Z,1). Finally, from Theorem 2 we know that $U_2 = S \cdot 2^{-(n+2g)} \bmod M \le M$ with $S < 2^{n+g}$. If we set the common multiplicand as F in Nmul(F,1) at Step 9, then $Z = F \cdot 2^{-(n+2g)} \bmod M \in [0, M]$. In cryptographic applications, there is little chance that $A^{E} \equiv M \pmod{M}$, so that no final subtraction of M is needed.

4. Hardware implementation

The hardware architecture for the common-multiplicand modular exponentiation algorithm is shown in Fig. 1. Unlike the popular scalable Montgomery modular multiplier [22], this architecture divides one group of processing elements into one group of processing elements and two other groups of accumulation units, all of which are w-bit processors. The outputs of each processing element are summed by the two accumulation units on its two sides. Both the elements and the units work in a pipeline, with a latency of one clock cycle. During modular exponentiation, the inputs to the processing elements are updated at the beginning of every new modular multiplication; they are selected from either group of accumulation units by multiplexers. As a result, the datapath and power are the same for different exponent bits.

Fig. 1. The architecture of the whole system. ‘PE’ denotes processing element, ‘X-acc’ denotes the accumulation unit for X, and ‘Y-acc’ denotes the accumulation unit for Y.

The mechanism of the exponentiation architecture is symmetric, much like riding a bicycle. As shown in Fig. 2, the processing elements act like the chain receiving the force, while the two groups of accumulation units resemble the two wheels running ahead. However, there is still some difference between the two. In a bicycle, the front wheel receives force from the fixed supporting parts, while both accumulation units X and Y are directly driven by the processing elements. In other words, the proposed architecture is more symmetric, resembling a two-chain bicycle. Since the front wheel and the back wheel always move in the same manner, it is extremely difficult to distinguish one from the other by observing their rolling traces. In this way, together with the left-to-right binary method and the Montgomery powering ladder, there is no power variation with different patterns of exponent bits, and the system is resistant to simple power attacks. Moreover, just as the distance between the two wheels is constant, there is a fixed relation between S and F: S = Mtul(F, λ). In the course of modular exponentiation, if this relation no longer holds, a fault attack is detected [13].

Fig. 2. Bicycle analogy of the modular exponentiation architecture.

In Fig. 1, the boxes ‘F:=X’ and ‘S:=Y’ are two shift registers with some control logic. The box ‘F:=X’ selects one word $X_k$ among the buffering registers of X every w clock cycles, and shifts a bit $f_i$ into X-acc 0 every clock cycle. Similarly, the box ‘S:=Y’ selects one word $Y_k$ among the buffering registers of Y every w clock cycles, and feeds $s_i$ into Y-acc 0 every clock cycle. As a result, the boxes contain two levels of selection logic and buffers for both $f_i$ and $s_i$. They are depicted in Fig. 3 and are, respectively, an m:1 multiplexer and a w:1 multiplexer. During the last (m+g) clock cycles of one new Montgomery modular multiplication ‘Nmul’, both $s_i$ and $f_i$ are set to zero.

Fig. 3. Boxes of F := X or S := Y.

Each new Montgomery modular multiplication ‘Nmul’ consists of (N+m+g) clock cycles, as shown in Fig. 4. At first glance, it requires (N+m+2g+2) clock cycles to produce one ‘Nmul’, since the highest bit is generated at the last word but is used at the first clock cycle for accumulation. In fact, since the first (g+2) clock cycles are only used for reduction rather than accumulation, these cycles can be shared. Therefore, the last (g+2) words of temporary results in the previous ‘Nmul’ are loaded during the first (g+2) clock cycles of a new ‘Nmul’.

4.1. Processing elements

The word-based processing elements (PEs) are critical for the proposed architecture; they coordinate with the accumulation units to perform the common-multiplicand modular multiplication.
A feedforward mechanism is applied in the processing element to decrease the latency between neighboring elements from two clock cycles to one. This idea originates from the feedforwarding processing elements in [9], which halve the clock cycles of a scalable Montgomery modular multiplication [14]. However, this work differs from those in that the processing element processes only one partial product instead of two. In detail, it computes $(T_j + q^{(i)} M_j)/2$ rather than $(T_j + q^{(i)} \cdot M_j + f_i \cdot T_j)/2$, while the latter partial product is accumulated by separate adders. The processing elements are shown in Figs. 5 to 7. Suppose that the j-th word in the i-th loop reads $S_j^{(i)}$; then only the lowest (w−1) bits are available at the current clock cycle, i.e., $S_{j,w-2..0}^{(i)}$. The highest bit $S_{j,w-1}^{(i)}$ should be a shifted bit from the (j+1)-th word, which comes from the next processing element after one clock cycle. In the feedforward architecture, we can just compute $S_j^{(i+1)} := T_{j,w-2..0}^{(i)} + q^{(i)} M_j + c_j$ at the current clock cycle, and adjust the top bit and carry out with this result at the next clock cycle, i.e., $c_{j+1} \cdot 2^{w} + T_j^{(i+1)} = S_j^{(i+1)} + 2^{w-1} \cdot d_{j+1}^{(i)}$.

The first processing element is shown in Fig. 5, in which the bit $d_0^{(i)}$ denotes a shifted bit for feedforwarding. In one ‘Nmul’, there are N successive operations of $(T_j + q^{(i)} M_j)/2$, and the quotients $q_i$ are generated by $T_{0,0}$ in this first processing element. However, at the least significant digit, it does not generate a feedforwarding signal and just shifts out zero bits every clock cycle. The signal ‘null’ sets $q^{(i)}$ to zero at the end of modular exponentiation, which avoids adding a multiple of M to the final result. The intermediate processing elements are shown in Fig. 6, where shifted bits are both output and input for feedforwarding, i.e., $d_{j-1}^{(i+1)}$ and $d_j^{(i)}$. The last processing element is shown in Fig. 7, in which there is only a shifted bit output to the previous processing element.

Fig. 4. The delay of a modular multiplication in modular exponentiation, measured in clock cycles.

Fig. 5. PE 0: the first processing element. $u_1$ and $u_0$ are the highest two bits of the sum; $d_0^{(i)}$ denotes the shifted bit from PE 1 to PE 0. The circle symbol denotes the concatenation $T_{0,\mathrm{out}}^{(i)} = \{T_{0,w-1}^{(i)}, T_{0,w-2..0}^{(i)}\}$.


Fig. 6. PE j for 0 < j < m−1. $u_1$ and $u_0$ are the highest two bits of the sum; $d_j^{(i)}$ denotes the shifted bit from PE j+1 to PE j. The circle symbol denotes the concatenation $T_{j,\mathrm{out}}^{(i)} = \{T_{j,w-1}^{(i)}, T_{j,w-2..0}^{(i)}\}$.

Fig. 7. PE m−1: the last processing element.

4.2. Accumulation units

The accumulation units X-acc and Y-acc are identical and are illustrated in Fig. 8. X-acc units accumulate the updated value of $A^{k} \cdot 2^{n+2g} \bmod M$, while Y-acc units accumulate the value of $A^{k+1} \cdot 2^{n+2g} \bmod M$ during the Montgomery modular exponentiation. The final result after Nmul(Z,1) is stored in X-acc, for which no final subtraction is needed in cryptographic applications [19]. Both X-acc and Y-acc accumulate the temporary results word by word until all the loops of a modular multiplication end. For example, when the j-th accumulation element is processing the i-th loop, the (j−1)-th element has entered the (i+1)-th loop and the (j+1)-th element is finishing the word of the (i−1)-th loop. In Fig. 8, $s_i$ is the i-th bit of S in Algorithm 4, which also flows through all the accumulation elements during every modular multiplication. The structure of X-acc differs only in that $f_i$ and F take the place of $s_i$ and S. As far as Algorithm 6 is concerned, the ‘X-acc’ units accumulate Nmul(F,F) or Nmul(F,S), while the ‘Y-acc’ units accumulate Nmul(S,F) or Nmul(S,S), depending on the current exponent bit.

Fig. 8. Y-acc: accumulation unit II, with 0 ≤ j ≤ m−1.

4.3. Radix-4 implementation

Assuming that n and g are even, with h = n/2, t = g/2, $S < 4^{h+t}$ and $F = \sum_{i=0}^{h+t-1} f_i \cdot 4^{i}$, the radix-4 new Montgomery modular multiplication can be written as

$$\mathrm{Mtul}(S,F) = S \cdot F \cdot 4^{-(h+2t)} \pmod{M} = (S \cdot 4^{-(t+1)}) \cdot \Big(\sum_{i=0}^{h+t-1} f_i \cdot 4^{i}\Big) \cdot 4^{-(h+t-1)} \pmod{M} = V_1 \cdot \sum_{i=0}^{h+t-1} f_{h+t-1-i} \cdot 4^{-i} \pmod{M}, \tag{13}$$

where $V_1 = S \cdot 4^{-(t+1)} \bmod M$. It can be found that

$$4^{t+1} \cdot V_1 = S + \sum_{i=0}^{t} q^{(i)} \cdot 4^{i} \cdot M \le (4^{h+t}-1) + M \cdot \sum_{i=0}^{t} 3 \cdot 4^{i} \le (4^{h+t}-1) + M \cdot (4^{t+1}-4) = 4^{h+t} + 4^{t+1} \cdot M - 4M - 1. \tag{14}$$

Therefore,

$$V_1 \le 4^{h-1} + M < 2M. \tag{15}$$

Meanwhile, $V_1 \cdot 4^{-1} \bmod M < 2M/4 + 3M/4 = 5M/4$. Thus by induction $V_1 \cdot 4^{-i} \bmod M < 5M/4$ for $1 \le i \le h+t-1$. From Eq. (13) we have

$$\mathrm{Mtul}(S,F) < \sum_{i=0}^{h+t-1} f_{h+t-1-i} \cdot 2M < 3 \cdot 2M + \sum_{i=1}^{h+t-1} 3 \cdot 5M/4 = \frac{3M}{4}(5h + 5t + 3). \tag{16}$$

Assuming h = n/2 = 512, t = g/2 = 6 and $M < 2^{n}$, we get

$$\log_4\{\mathrm{Mtul}(S,F)\} < 517.463 < h + t. \tag{17}$$

Therefore, the parameters are self-consistent for modular exponentiation. In the case of F = 1, with $V_2 = S \cdot 4^{-(h+2t)} \bmod M = V_1 \cdot 4^{-(h+t-1)} \bmod M$, there is

$$4^{h+t-1} \cdot V_2 = V_1 + \sum_{i=0}^{h+t-2} q^{(i)} \cdot 4^{i} \cdot M \le V_1 + \sum_{i=0}^{h+t-2} 3 \cdot 4^{i} \cdot M \le 2M + (4^{h+t-1}-1)M = (4^{h+t-1}+1)M. \tag{18}$$

Dividing both sides by $4^{h+t-1}$, the above equation yields

$$V_2 \le (1 + 4^{-(h+t-1)})M. \tag{19}$$

Since $V_2$ is an integer, there is $V_2 \le M$. Algorithm 7 shows the radix-4 word-based common-multiplicand Montgomery modular multiplication. From Eq. (19), at the end of modular exponentiation only one subtraction is required.
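The numerical self-consistency checks in Eqs. (12) and (17) can be reproduced with a few lines of Python. The sketch simply evaluates the two bounds for n = 1024 and g = 12, using the worst case $M < 2^{n} = 4^{h}$; the variable names are chosen for this example.

```python
import math

n, g = 1024, 12
h, t = n // 2, g // 2

# Eq. (12): log2 H < n + log2(2(n + g)); the extra part must stay below g.
extra_bits_radix2 = math.log2(2 * (n + g))

# Eq. (17): log4 of (3M/4)(5h + 5t + 3) with M < 4^h must stay below h + t.
digits_radix4 = h + math.log(0.75 * (5 * h + 5 * t + 3), 4)

print(round(extra_bits_radix2, 2), round(digits_radix4, 3))
```

Both values fall below their limits (12 extra bits and h + t = 518 radix-4 digits), which is what allows the accumulated results to be reused as operands without intermediate subtractions.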


Algorithm 7. Radix-4 word-based common-multiplicand Montgomery modular multiplication.
Input: S and F are both (n+g)-bit numbers, with $g = 1 + \lceil \log_2(n+1) \rceil$. Also, n = 2h, g = 2t. $S = \sum_{i=0}^{h+t-1} s_i \cdot 4^{i}$ and $F = \sum_{i=0}^{h+t-1} f_i \cdot 4^{i}$, with $s_i \in \{0,1,2,3\}$, $f_i \in \{0,1,2,3\}$. Set $m = \lceil (n+g)/w \rceil$; then all the operands can be partitioned into m words. M is the modulus, with $2^{n-1} < M < 2^{n}$ and gcd(M,2) = 1.
Output: $X = \mathrm{Nmul}(S,F) = S \cdot F \cdot 4^{-(h+2t)} \bmod M$, $Y = \mathrm{Nmul}(S,S) = S^2 \cdot 4^{-(h+2t)} \bmod M$, with $0 \le X < 2^{n+g}$, $0 \le Y < 2^{n+g}$.
1: X := 0, Y := 0;
2: T := S;
3: for i = 1 to h+2t do
4:   $q_0^{(i)} := T_{0,0}$, $q_1^{(i)} := \overline{M_1} \cdot \overline{T_1} \cdot T_0 + T_1 \cdot (\overline{T_0} + M_1)$;
5:   $S^{(0)} := 0$, $D^{(0)} := 0$, $c_0 := 0$;
6:   for j = 0 to m−1 do
7:     $S_j^{(i)} := T_{j,w-3..0}^{(i-1)} + (q^{(i)} \cdot M)_j + c_j$;
8:     $d_j^{(i)} := S_{j,1..0}^{(i)}$;
9:     $c_{j+1} \cdot 2^{w} + T_j^{(i)} = S_{j,w-1..2}^{(i)} + d_{j+1}^{(i-1)} \cdot 2^{w-2}$;
10:   end for
11:   if t+1 ≤ i ≤ h+2t then
12:     for j = 0 to m−1 do
13:       $(c_{f,1} + c_{f,2}) \cdot 2^{w} + X_j := X_j + f_{h+2t-i} \cdot T_j + c_{f,1} + c_{f,2}$;
14:       $(c_{s,1} + c_{s,2}) \cdot 2^{w} + Y_j := Y_j + s_{h+2t-i} \cdot T_j + c_{s,1} + c_{s,2}$;
15:     end for
16:   end if
17: end for
18: return X, Y.

Fig. 9. PE 0: the first radix-4 processing element. The circle symbol denotes the concatenation $T_{0,\mathrm{out}}^{(i)} = \{T_{0,w-1..w-2}^{(i)}, T_{0,w-3..0}^{(i)}\}$.

Fig. 10. PE j: radix-4 processing element for 0 < j < m−1. $d_j^{(i)}$ denotes the shifted bits from PE j+1 to PE j. The circle symbol denotes the concatenation $T_{j,\mathrm{out}}^{(i)} = \{T_{j,w-1..w-2}^{(i)}, T_{j,w-3..0}^{(i)}\}$.

In Algorithm 7, both $d_j^{(i)}$ and $q^{(i)}$ have two binary bits. The expression $(q^{(i)} \cdot M)_j$ denotes the j-th word of $q^{(i)} \cdot M$. The radix-4 processing elements are shown in Figs. 9 to 11. The system-level view is the same as Fig. 1, except that the boxes ‘F:=X’ and ‘S:=Y’ output 2 bits rather than 1 bit. In Fig. 9, $q^{(i)}$ is determined by the second least significant bit of the modulus, $M_{0,1}$ (assuming $M_{0,0} = 1$), and the two least significant bits of the current word, $T_{1..0}$. To make $q^{(i)} \cdot M_{1..0} + T_{1..0} \equiv 0 \pmod{4}$, $q_i$ can be encoded as in Table 1. Furthermore, by Karnaugh map we get $q_1^{(i)} = \overline{M_1} \cdot \overline{T_1} \cdot T_0 + T_1 \cdot (\overline{T_0} + M_{0,1})$ and $q_0^{(i)} = T_0$. Finally, at the last moment of modular exponentiation, the signal ‘null’ sets $q_{1..0}^{(i)} = 0$ to avoid adding a multiple of M to the final result.

In both Figs. 9 and 10, the feedforwarding signals $d_j^{(i)}$ and the buffered two-bit signal $u_0$ are combined through ‘OR’ gates. When a new ‘Nmul’ is initiated, $u_0$ passes through the register and the feedforwarding signals $d_j^{(i)}$ are zeros. After the initial clock cycle, $u_0$ becomes zero and $d_j^{(i)}$ becomes effective. Therefore, the two signals are separated in time, and they can be combined by ‘OR’ gates.

The radix-4 accumulation units are shown in Fig. 12. Except for different inputs and outputs, the accumulation unit for $X_j$ is the same as in Fig. 12. In order to account for 2X and 3X in the accumulation, we have partitioned them into two parts: 2X := 1 · 2X + 0 · X, and 3X := 1 · 2X + 1 · X. The doubling 2X is achieved by transferring the highest bit of the current unit to the next one. Then they are added to $T_{i,j}$ by carry-save additions.
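The radix-4 quotient encoding described above can be checked exhaustively. The following sketch evaluates the Boolean expressions for $q_1$ and $q_0$, reading the complemented terms as $\overline{M_1}$, $\overline{T_1}$ and $\overline{T_0}$ (an interpretation that agrees with Table 1), and confirms the defining congruence. The function name `q_radix4` is hypothetical.

```python
def q_radix4(M1, T_low):
    """Two-bit quotient digit from bit 1 of M (bit 0 of M assumed to be 1)
    and the two least significant bits of T."""
    T1, T0 = (T_low >> 1) & 1, T_low & 1
    q0 = T0
    q1 = ((1 - M1) & (1 - T1) & T0) | (T1 & ((1 - T0) | M1))
    return (q1 << 1) | q0

# Check q * M + T = 0 (mod 4) for every combination of M_1 and T_{1..0}.
for M1 in (0, 1):
    M_low = 2 * M1 + 1                 # two low bits of an odd modulus
    for T_low in range(4):
        q = q_radix4(M1, T_low)
        assert (q * M_low + T_low) % 4 == 0
```

Since the low two bits of $q \cdot M + T$ are always zero, the division by 4 in each radix-4 iteration is exact.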

5. Complexity analysis and experiment results Considering the computational effort of one N-bit full addition as C, then on average an N-bit traditional modular exponentiation

Fig. 11. PE m1: the last radix-4 processing element.

Table 1 Encoding qi by M 0,1 and T 1: :0 . M1

T 1: :0

qi

0 0 0 0 1 1 1 1

00 01 10 11 00 01 10 11

00 11 10 01 00 01 10 11

Please cite this article as: T. Wu, et al., Fast, compact and symmetric modular exponentiation architecture by common-multiplicand Montgomery modular multiplications, INTEGRATION, the VLSI journal (2012), http://dx.doi.org/10.1016/j.vlsi.2012.09.002

8

T. Wu et al. / INTEGRATION, the VLSI journal ] (]]]]) ]]]–]]]

costs an effort of $(3/2)N \times 2C = 3NC$. Also, common-multiplicand Montgomery modular exponentiation costs $N \times (2C + C) = 3NC$. Therefore, the total computational complexity is about the same as that of traditional modular exponentiation.

In the proposed architecture, the number of clock cycles for an $N$-bit modular exponentiation is not affected by the Hamming weight of the exponent; it depends on the word size $w$, the radix of the iteration step, and the binary length $N$ of the modulus. To complete a general $N$-bit modular exponentiation with word size $w$, the proposed radix-2 design requires about $N \times (N + m + g) = N \times (N + N/w + g)$ clock cycles, while the radix-4 design needs about $N \times (N/2 + N/w + g/2)$ clock cycles (Table 3). For a full 1024-bit modular exponentiation, the clock cycles for selected radices $r$ and word sizes $w$ are shown in Table 2.

The analysis of time and area costs is shown in Table 3, in which $A_{\mathrm{REG}}$, $A_{\mathrm{MUX}(2:1)}$ and $A_{\mathrm{FA}}$ are the area costs of a 1-bit register, a 1-bit 2:1 multiplexer, and a 1-bit full adder. Meanwhile, $T_{\mathrm{FA}}$ and $T_{\mathrm{MUX}(2:1)}$ denote the logic delay of a 1-bit full adder and a 1-bit 2:1 multiplexer. In particular, the area cost $A_{\mathrm{Other}}$ includes $(k/w) \times (1 A_{\mathrm{XOR}} + 2 A_{\mathrm{MUX}(2:1)} + A_{\mathrm{INV}})$, where the subscript 'XOR' denotes an exclusive-OR gate and 'INV' denotes an inverter. Since these terms are divided by the word length $w$, $A_{\mathrm{Other}}$ is insignificant in the overall area overhead. Compared with the design in [7], both the area and the clock cycles are reduced by almost a half with the radix-2 design, and with the radix-4 design the clock cycles are only about 1/4 of those in [7]. In [23] the area overhead is not parameterized, and the parameter $n_c$ denotes the clock cycles spent waiting for carry-save-to-binary conversion. The number of clock cycles in [23] is about the same as in our proposed radix-4 design, while its critical path is shorter.

The critical path is determined by the word size and the number of processing elements. In other words, it depends on the balance between the path delay through the first processing element and the path delay through the selection logic in 'F:=X' or 'S:=Y'. In fact, we have chosen balanced parameters $(w, m) \in \{(32, 33), (16, 65)\}$ in this work, so that the critical path always occurs at the first processing element. As shown in Table 3, the critical path for the radix-4 implementation includes $2\,T_{\mathrm{AND}}$, $3\,T_{\mathrm{OR}}$, $1\,T_{\mathrm{MUX}(4:1)}$, $1\,T_{\mathrm{MUX}(2:1)}$, $1\,T_{2\text{-bit FA}}$, and $1\,T_{(w-2)\text{-bit FA}}$. Although the critical path is longer than that in [7], the local and symmetric communications in a word-based architecture lead to short delays in the place-and-route process, which partly offsets the increase in the critical path.

Several 1024-bit modular exponentiation units with the proposed architecture have been described in the Verilog hardware description language, synthesized with Synplify Pro 9.6.2, and then placed and routed in Xilinx ISE 10.1. They target several FPGA devices, and the comparisons are shown in Tables 4–6. The word size and the number of processing elements are chosen as $(w, m) = (16, 65)$ or $(w, m) = (32, 33)$, ensuring a balanced architecture with $m > g = 12$. The Roman labels I, II, III, and IV after 'This work' denote the four groups of parameters in Table 2. The results show that the proposed architecture is fast for modular exponentiation, while its area cost is also much lower than that of other work. This is mainly owing to several advantages:

1. low latency due to feedforwarding processing elements and pipelined accumulation units;
2. the Montgomery ladder algorithm decreases the sequential modular multiplications in a modular exponentiation from $1.5N \sim 2N$ to $N + 2$;
3. the Montgomery ladder can be implemented by common-multiplicand Montgomery modular multipliers, which decreases the area overhead;
4. there is little or no carry-save logic in the word-based architecture.

Meanwhile, resistance to power attacks is also obtained with the common-multiplicand Montgomery modular multiplier. In the tables, the last column 'FA-SPA' denotes 'fault attack and simple power attack'.
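As an illustration of why the Montgomery powering ladder pairs naturally with a common-multiplicand multiplier, the following minimal Python sketch runs the ladder with ordinary modular arithmetic standing in for Montgomery multiplication. The function name `ladder_pow` is hypothetical, the conversions to and from Montgomery form are omitted, and the software branch on the exponent bit is only for readability; it is not a model of the hardware or a constant-time implementation.

```python
def ladder_pow(a: int, e: int, n: int) -> int:
    """Left-to-right Montgomery powering ladder: returns a**e mod n.

    Every exponent bit triggers exactly one multiplication and one squaring,
    so the sequence of operations does not depend on the Hamming weight of e.
    """
    r0, r1 = 1 % n, a % n  # invariant: r1 == r0 * a (mod n)
    for i in reversed(range(e.bit_length())):
        if (e >> i) & 1:
            # Both products share the multiplicand r1: r0*r1 and r1*r1.
            r0, r1 = (r0 * r1) % n, (r1 * r1) % n
        else:
            # Both products share the multiplicand r0: r0*r0 and r0*r1.
            r0, r1 = (r0 * r0) % n, (r0 * r1) % n
    return r0


if __name__ == "__main__":
    # Cross-check against Python's built-in modular exponentiation.
    print(ladder_pow(7, 65537, 1000003) == pow(7, 65537, 1000003))
```

In each iteration the two products have one operand in common, which is the property a common-multiplicand multiplier can exploit.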

Fig. 12. Radix-4 accumulation unit for Yj.

Table 2 Clock cycles of 1024-bit modular exponentiation for different word sizes and radices.

| Type | I | II | III | IV |
|---|---|---|---|---|
| $(r, w)$ | (2, 16) | (2, 32) | ($2^2$, 16) | ($2^2$, 32) |
| PE number | 65 | 33 | 65 | 33 |
| Clock cycles | 1,130,735 | 1,097,870 | 599,327 | 566,431 |

Table 4 Performance of 1024-bit modular exponentiation in Xilinx Virtex-E FPGA.

| Reference | Radix | Technology | Max freq. (MHz) | Area (LUTs) | Time (ms) | FA-SPA resistance |
|---|---|---|---|---|---|---|
| This work II | 2 | XCV1000E-8 | 113 | 11,491 | 9.72 | Yes |
| This work III | $2^2$ | XCV1000E-8 | 115 | 18,731 | 5.21 | Yes |
| [4] | $2^4$ | XC4000-9 | 45.6 | 19,899 | 11.95 | No |
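The approximate clock-cycle expressions quoted in the analysis can be compared with the cycle counts reported in Table 2. The sketch below is only a sanity check under stated assumptions: it uses $m = N/w$ and $g = 12$ as given in the text, and the function name `cycles_estimate` is hypothetical. Because the formulas are stated as approximations, the estimates are expected to land close to the reported values rather than match them exactly.

```python
def cycles_estimate(n_bits: int, w: int, g: int, radix: int) -> float:
    """Approximate clock cycles for an n_bits modular exponentiation."""
    if radix == 2:
        return n_bits * (n_bits + n_bits / w + g)
    if radix == 4:
        return n_bits * (n_bits / 2 + n_bits / w + g / 2)
    raise ValueError("only radix 2 and radix 4 are covered by the formulas")


# (radix, word size) -> clock cycles as reported in Table 2
TABLE2 = {
    (2, 16): 1_130_735,
    (2, 32): 1_097_870,
    (4, 16): 599_327,
    (4, 32): 566_431,
}
G = 12  # value of g stated in the text

if __name__ == "__main__":
    for (radix, w), reported in TABLE2.items():
        est = cycles_estimate(1024, w, G, radix)
        rel = abs(est - reported) / reported
        print(f"radix-{radix}, w={w}: estimate {est:.0f}, "
              f"reported {reported}, relative gap {rel:.2%}")
```

For all four parameter sets the gap between the estimate and the reported count stays below one percent.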

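Table 3 Analysis of time and area complexity of N-bit full modular exponentiation.

| Reference | Area | Critical path | Clock cycles |
|---|---|---|---|
| This work, Radix-2 | $6N \cdot A_{\mathrm{REG}} + 5N \cdot A_{\mathrm{MUX}(2:1)} + 3N \cdot A_{\mathrm{FA}} + A_{\mathrm{Other}}$ | $T_{w\text{-bit FA}} + T_{\mathrm{MUX}(2:1)}$ | $N\left(N + \frac{N}{w} + g\right)$ |
| This work, Radix-4 | $7N \cdot A_{\mathrm{REG}} + 4N \cdot A_{\mathrm{MUX}(2:1)} + N \cdot A_{\mathrm{MUX}(4:1)} + 3N \cdot A_{\mathrm{FA}} + 2N \cdot A_{\mathrm{XOR}} + 3N \cdot A_{\mathrm{AND}} + A_{\mathrm{Other}}$ | $T_{w\text{-bit FA}} + T_{\mathrm{MUX}(2:1)} + T_{\mathrm{MUX}(4:1)} + 2T_{\mathrm{AND}} + 3T_{\mathrm{OR}}$ | $N\left(\frac{N}{2} + \frac{N}{w} + \frac{g}{2}\right)$ |
| [7] | $13N \cdot A_{\mathrm{REG}} + 2N \cdot A_{\mathrm{FA}} + 4N \cdot A_{\mathrm{AND}} + 5N \cdot A_{\mathrm{MUX}(2:1)} + N \cdot A_{\mathrm{MUX}(3:1)}$ | $T_{\mathrm{FA}} + T_{\mathrm{MUX}(2:1)} + T_{\mathrm{AND}}$ | $2N(N + 5)$ |
| [23], Radix-4 | – | $3T_{\mathrm{FA}} + T_{\mathrm{AND}}$ | $N(N/2 + n_c)$ |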


Table 5 Performance of 1024-bit modular exponentiation in Xilinx Virtex-5 FPGA.

| Reference | Radix | Technology | Max freq. (MHz) | Area (slices) | Time (ms) | FA-SPA resistance |
|---|---|---|---|---|---|---|
| This work II | 2 | XC5VLX50T | 345 | 3218 | 3.18 | Yes |
| This work IV | $2^2$ | XC5VLX50T | 290 | 5225 | 1.95 | Yes |
| [13] | – | XC5VLX50T | 274 | 7158 | 3.83 | Yes |
| [23] | 2 | XC5V | 526 | 2982 | 2.98 | No |
| [23] | $2^2$ | XC5V | 385 | 7303 | 1.38 | Yes |
| [24] | $2^{32}$ | 90 nm CMOS | 471.70 | 11,437 gates | 7.27 | Yes |
| [24] | $2^{128}$ | 90 nm CMOS | 421.94 | 153,862 gates | 0.67 | Yes |

Table 6 Performance of 1024-bit modular exponentiation in Xilinx Virtex-2 FPGA.

| Reference | Radix | Technology | Max freq. (MHz) | Area (slices) | Time (ms) | FA-SPA resistance |
|---|---|---|---|---|---|---|
| This work I | 2 | XC2V6000 | 196.0 | 7197 | 5.77 | Yes |
| This work IV | $2^2$ | XC2V6000 | 150.0 | 11,850 | 3.78 | Yes |
| [7] | 2 | XC2V6000 | 215.83 | 15,826 | 7.20 | No |
| [25] | 2 | XC2V6000 | 97.08 | 23,208 | 17.92 | No |
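The 'Time' entries for this work in Tables 4–6 appear to be consistent with dividing the clock cycles of Table 2 by the maximum frequency. The short Python sketch below reproduces that arithmetic; the helper name `exec_time_ms` is hypothetical, and the rows are transcribed from the tables as read here.

```python
def exec_time_ms(cycles: int, freq_mhz: float) -> float:
    """Execution time in milliseconds for a given cycle count and clock frequency."""
    return cycles / (freq_mhz * 1e6) * 1e3


# (label, clock cycles from Table 2, max frequency in MHz, reported time in ms)
ROWS = [
    ("Table 4, Type II", 1_097_870, 113.0, 9.72),
    ("Table 4, Type III", 599_327, 115.0, 5.21),
    ("Table 5, Type II", 1_097_870, 345.0, 3.18),
    ("Table 5, Type IV", 566_431, 290.0, 1.95),
    ("Table 6, Type I", 1_130_735, 196.0, 5.77),
    ("Table 6, Type IV", 566_431, 150.0, 3.78),
]

if __name__ == "__main__":
    for label, cycles, freq, reported in ROWS:
        print(f"{label}: computed {exec_time_ms(cycles, freq):.2f} ms, "
              f"reported {reported:.2f} ms")
```

Because the cycle count is independent of the exponent's Hamming weight, the same figure serves as both the average-case and worst-case time.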

In [4], a high-radix systolic architecture for modular exponentiation is implemented in a Xilinx XC40250-9 FPGA, an early and classic FPGA device. For the sake of comparison, we have implemented our design in a more recent device, the Xilinx Virtex 1000E-8FG680. As shown in Table 4, both the area and the time are superior to those in [4]. Only radix-2 digits are used in our architecture, which leads to a higher frequency than that of the radix-16 design in [4]. The low latency and the common-multiplicand approach have also helped our design run faster than that in [4].

In [13], the Montgomery ladder algorithm [20] is used to perform modular multiplication and squaring in parallel at the cost of more hardware resources. The CSA-based architecture and the Montgomery ladder accelerate the Montgomery modular exponentiation, and the ladder also provides resistance to fault and simple power attacks. The time for a 1024-bit modular exponentiation in [13] is estimated from its throughput (14.5 Mbps) in calculating $A^E$ with $E = 2^{16} + 1$, i.e., $t = 10^3 / (14.5 \times (16 + 2)) = 3.83$ ms. Table 5 shows that this work is faster than [13] and that its hardware resource usage is much smaller.

The data for [23] are taken from Table IX of that reference. One entry is the radix-2 LSB design and the other is the radix-4 MSB design; neither includes the area of the memory units used for precomputation. It can be seen that the proposed radix-4 design is better than the MSB radix-4 design in [23]. Since the latter uses two Montgomery modular multipliers in parallel, we assume that it can be resistant to FA-SPA. Meanwhile, our radix-2 design is slower than the radix-2 LSB design in [23]. However, this work is very close to its counterpart in performance at a much lower frequency. Moreover, the proposed design is resistant to FA-SPA, while its counterpart does not enjoy this advantage.
In Table 5, the balanced radix-$2^{32}$ design and the high-speed radix-$2^{128}$ design from [24] are also listed for reference. They are synthesized in a 90 nm CMOS ASIC process, and only the multiplication block is included, without memory units and sequencer modules. Also, their RSA time without the Chinese Remainder Theorem is used as the modular exponentiation time. Since different technologies are employed, it is difficult to compare the results in [24] with this work in terms of Time × Area. However, we give a rough estimate here. First, the Virtex-5 device (65 nm CMOS node) is assumed to offer performance similar to that of a 0.18 µm CMOS process in ASIC implementation [23], which is in turn estimated to offer half the performance of the 90 nm CMOS process in [24]. Also, the hardware implementation results on Virtex-5 include 8794 LUTs (1% 1-input LUT, 27% 2-input LUT, 37% 3-input LUT, 10% 4-input LUT, 9% 5-input LUT, 16% 6-input LUT; 3218 slices) for Type II, and 17,025 LUTs (0.6% 1-input LUT, 30% 2-input LUT, 31% 3-input LUT, 11% 4-input LUT, 17% 5-input LUT, 10% 6-input LUT; 5225 slices) for Type IV. Supposing that one LUT in a Virtex-5 slice accounts for 5 to 6 ASIC gates, this work offers performance similar to the designs in [24]. In addition, the synthesis results in [24] may degrade after the place-and-route process. Finally, while the Montgomery powering ladder is used in [24], no symmetric computational parts exist in its modular exponentiation architecture, which differs from [13] and this work. Nevertheless, we still mark it as resistant to FA-SPA in case some other technique has been applied.

In [25], a carry-save architecture is proposed for Montgomery modular exponentiation, and in [7] a carry-save architecture, a parallel computation of the high part and the low part, and a quotient pipeline are used to accelerate modular exponentiation. The time for a 1024-bit modular exponentiation in [25] is also estimated, from the time of 0.21 ms for calculating $A^{2^{16}+1}$. Since there is no Montgomery ladder in their modular exponentiation method, a factor of $1.5 \times 1024 / (16 + 2)$ is applied, i.e., $t = 0.21 \times 1.5 \times 1024 / (16 + 2) = 17.92$ ms. The time for a 1024-bit modular exponentiation in [7] is calculated from its throughput in ASIC implementation and the ratio of the frequencies in FPGA and ASIC implementation: $t = 10^3 / (265.44 \times 215.83 / 550) = 9.60$ ms. However, since the authors of [7] assume the worst case of modular exponentiation, we multiply it by 3/4 to get 7.20 ms for the average case. By contrast, the average and the worst case in this work are the same. The maximum frequency (or the critical path) of our proposed design is close to that in [7], since the local connections in a word-based architecture shorten the routed paths. Overall, this work is faster and more area-efficient than those in [7,25].
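The rough Time × Area comparison with [24] described above can be traced numerically. The sketch below is one reading of that estimate, not the authors' own calculation: it assumes 5 to 6 ASIC gates per LUT, and it interprets "half the performance" as halving the FPGA execution time when projecting to 90 nm. The helper name `time_area_range` is hypothetical, and the [24] figures are taken from Table 5 as read here.

```python
def time_area_range(luts, time_ms, gates_per_lut=(5, 6), speedup=2.0):
    """Return (low, high) Time x Area in gate-ms after projecting to 90 nm CMOS.

    speedup=2.0 encodes the assumption that the 90 nm process is twice as fast
    as the Virtex-5 implementation; gates_per_lut is the assumed 5..6 range.
    """
    projected_ms = time_ms / speedup
    low, high = (luts * g * projected_ms for g in gates_per_lut)
    return low, high


THIS_WORK = {"Type II": (8794, 3.18), "Type IV": (17025, 1.95)}  # LUTs, ms
REF_24 = {"radix-2^32": (11437, 7.27), "radix-2^128": (153862, 0.67)}  # gates, ms

if __name__ == "__main__":
    for name, (luts, t) in THIS_WORK.items():
        low, high = time_area_range(luts, t)
        print(f"This work {name}: {low:,.0f} to {high:,.0f} gate-ms")
    for name, (gates, t) in REF_24.items():
        print(f"[24] {name}: {gates * t:,.0f} gate-ms")
```

Under these assumptions all four figures fall in the same order of magnitude, which is the sense in which the text calls the designs similar in performance.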

6. Conclusions

A fast, area-efficient and symmetric modular exponentiation architecture has been proposed and analyzed in this work. Its efficiency is demonstrated by acceleration of computation, reduction in area overhead, and resistance to fault and simple power attacks. First of all, we have designed a new modular exponentiation algorithm based on the common-multiplicand Montgomery modular multiplication algorithm, in which no final subtractions are needed. Then, we propose a word-based and pipelined architecture for its hardware implementation. The whole architecture consists of three parts: one group of processing elements along the center line, and two groups of accumulation units along the side lines, which are symmetric in both function and layout. With the help of the left-to-right binary method and the Montgomery powering ladder, the power consumption is not influenced by the input pattern. Therefore, the architecture is resistant to fault and simple power attacks. Meanwhile, a feedforwarding mechanism for the scalable Montgomery modular multiplier is applied in the architecture. Together with the pipelined accumulation units, the modular exponentiation architecture enjoys a very low latency.

The common-multiplicand Montgomery modular multiplication algorithm itself reduces the computational effort of modular exponentiation to a great extent. In this work, this reduction in computational effort is converted into a decrease in Time × Area for hardware implementation and into symmetry in computation, the latter of which provides resistance to fault and simple power attacks. In fact, the proposed architecture with the Montgomery powering ladder has about the same computational complexity as modular exponentiation with the conventional Montgomery algorithm.

Both radix-2 and radix-4 architectures have been presented in this work. In particular, there is no carry-save logic in the radix-2 units, while one stage of carry-save logic is applied in the radix-4 units to decrease the path delay. In general, the radix-4 architecture has a higher data throughput at a lower frequency than the radix-2 architecture. Although its area overhead is larger than that of the radix-2 design in this work, it is still more area-efficient than most work in the literature. The radix-4 architecture can be extended to even higher-radix implementations, which may require more selection logic in the processing elements and more carry-save logic in the accumulation units.
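To make the split between one reduction path and two accumulation paths more concrete, here is a bit-serial Python sketch of the common-multiplicand idea. It is an illustration under simplifying assumptions, not the word-based, pipelined algorithm of this paper: the function name `cm_montgomery` is hypothetical, the shared multiplicand is halved modulo `n` once per step, two independent accumulators pick up that shared value according to the bits of `y` and `z`, and a final `% n` is used for simplicity even though the paper's algorithm needs no final subtraction.

```python
def cm_montgomery(x: int, y: int, z: int, n: int, k: int):
    """Return (x*y*2^-k mod n, x*z*2^-k mod n) for odd n and y, z < 2^k.

    The sequence x*2^-1, x*2^-2, ..., x*2^-k (mod n) is computed once
    (the shared reduction path) and reused by both accumulators.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    if y >> k or z >> k:
        raise ValueError("y and z must fit in k bits")
    xj = x % n
    acc_y = acc_z = 0
    for i in reversed(range(k)):          # scan y and z from the MSB down
        # Shared reduction step: xj <- xj / 2 (mod n); stays in [0, n).
        xj = (xj + n) // 2 if xj & 1 else xj // 2
        # Two symmetric accumulation steps reuse the same xj.
        if (y >> i) & 1:
            acc_y += xj
        if (z >> i) & 1:
            acc_z += xj
    return acc_y % n, acc_z % n


if __name__ == "__main__":
    n, k = 101, 7
    p, q = cm_montgomery(55, 77, 90, n, k)
    # Multiplying back by 2^k should recover the plain products modulo n.
    print((p * 2**k) % n == (55 * 77) % n, (q * 2**k) % n == (55 * 90) % n)
```

The point of the sketch is that the reduction work on `x` is done once per step, while each extra product costs only one more accumulator.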

Acknowledgement

This work was partly supported by the National High-Technology Research and Development Program of China (No. 2012AA012402), the National Natural Science Foundation of China (No. 61073173), and the Independent Research and Development Program of Tsinghua University (No. 2011Z05116). The authors would also like to thank the editor and reviewers for their constructive comments.

References

[1] R. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM 21 (1978) 120–126.
[2] T. ElGamal, A public-key cryptosystem and a signature scheme based on discrete logarithms, IEEE Transactions on Information Theory 31 (4) (1985) 469–472.
[3] W. Diffie, M. Hellman, New directions in cryptography, IEEE Transactions on Information Theory 22 (1976) 644–654.
[4] T. Blum, C. Paar, High-radix Montgomery modular exponentiation on reconfigurable hardware, IEEE Transactions on Computers 50 (7) (2001) 759–764.
[5] H. Nozaki, M. Motoyama, A. Shimbo, S. Kawamura, Implementation of RSA algorithm based on RNS Montgomery modular multiplication, in: Third International Workshop on Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, vol. 2162, Springer, Berlin, 2001, pp. 364–376.
[6] N. Jiang, D. Harris, Quotient pipelined very high radix scalable Montgomery multipliers, in: Fortieth Asilomar Conference on Signals, Systems and Computers, 2006, pp. 1673–1677.
[7] M. Shieh, J. Chen, W. Lin, H. Wu, A new algorithm for high-speed modular multiplication design, IEEE Transactions on Circuits and Systems-I: Regular Papers 56 (9) (2009) 2009–2019.
[8] A.E. Cohen, K.K. Parhi, Architecture optimizations for the RSA public key cryptosystem: a tutorial, IEEE Circuits and Systems Magazine (2011) 24–34.
[9] M. Huang, K. Gaj, T. El-Ghazawi, New hardware architectures for Montgomery modular multiplication algorithm, IEEE Transactions on Computers 60 (7) (2011) 923–936.
[10] T. Wu, S. Li, L. Liu, CSA-based design of feedforward scalable Montgomery modular multiplier, in: IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 2011, pp. 54–59.
[11] J. Ha, S. Moon, A common-multiplicand method to the Montgomery algorithm for speeding up exponentiation, Information Processing Letters 66 (2) (1998) 105–107.
[12] C. Wu, D. Lou, T. Chang, An efficient Montgomery exponentiation algorithm for public-key cryptosystems, in: IEEE International Conference on Intelligence and Security Informatics, 2008, pp. 284–285.
[13] A.P. Fournaris, Fault and simple power attack resistant RSA using Montgomery modular multiplication, in: IEEE International Symposium on Circuits and Systems, 2010, pp. 1875–1878.
[14] A. Tenca, Ç. Koç, A scalable architecture for modular multiplication based on Montgomery’s algorithm, IEEE Transactions on Computers 52 (9) (2003) 1215–1221.
[15] C.-L. Wu, An efficient common-multiplicand-multiplication method to the Montgomery algorithm for speeding up exponentiation, Information Sciences 179 (4) (2009) 410–421.
[16] A. Rezai, P. Keshavarzi, High-performance modular exponentiation algorithm by using a new modified modular multiplication algorithm and common-multiplicand-multiplication method, in: World Congress on Internet Security, 2011, pp. 192–197.
[17] H.S. Kim, K.Y. Yoo, Area efficient exponentiation using modular multiplier/squarer in GF($2^m$), in: Computing and Combinatorics, Lecture Notes in Computer Science, vol. 2108, Springer, Berlin, 2001, pp. 262–267.
[18] H. Orup, Simplifying quotient determination in high-radix modular multiplication, in: 12th IEEE Symposium on Computer Arithmetic, Bath, England, UK, 1995, pp. 193–199.
[19] C. Walter, Montgomery exponentiation needs no final subtractions, Electronics Letters 35 (21) (1999) 1831–1832.
[20] M. Joye, S. Yen, The Montgomery powering ladder, in: International Workshop on Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, vol. 2523, Springer-Verlag, 2002, pp. 291–302.
[21] T. Wu, S. Li, L. Liu, A two-stage pipelined architecture for parallel modular exponentiation, in: IEEE International Conference on Information Science and Technology, 2012, pp. 215–218.
[22] A. Tenca, Ç. Koç, Scalable architecture for Montgomery multiplication, in: First International Workshop on Cryptographic Hardware and Embedded Systems, Worcester, USA, 1999, pp. 94–108.
[23] G.D. Sutter, J.-P. Deschamps, J.L. Imaña, Modular multiplication and exponentiation architectures for fast RSA cryptosystem based on digit serial computation, IEEE Transactions on Industrial Electronics 58 (7) (2011) 3101–3109.
[24] A. Miyamoto, N. Homma, T. Aoki, A. Satoh, Systematic design of RSA processors based on high-radix Montgomery multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (7) (2011) 1136–1146.
[25] C. McIvor, M. McLoone, J. McCanny, Modified Montgomery modular multiplication and RSA exponentiation techniques, IEE Proceedings on Computers and Digital Techniques 151 (6) (2004) 402–408.

Tao Wu received the Bachelor’s degree from Wuhan University in 2003 and the Master’s degree from Tsinghua University in 2006, both in electronic science and technology. From September 2006 to April 2007, he served as a temporary assistant in the Device and System Laboratory of the Institute of Microelectronics at Tsinghua University. He then worked in the Department of Physics and Electronic Engineering at Guangxi Normal University for one year, from July 2007 to July 2008. Since September 2008 he has been pursuing the Ph.D. degree at Tsinghua University.

Shuguo Li received the Bachelor’s, Master’s and Ph.D. degrees in computer science from Xidian University, Shandong University and Northwestern Polytechnical University, China, in 1986, 1993 and 1999, respectively. In 2001, he completed his postdoctoral research at Tsinghua University. He is now a research professor at the Institute of Microelectronics at Tsinghua University. His current research interests include algorithms for cryptography and the design of encryption processors and microprocessors.

Litian Liu received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1970. He is currently a full professor at the Institute of Microelectronics, Tsinghua University. His research interests include the development of semiconductor devices and integrated circuits.
