Parallelization of scalable elliptic curve cryptosystem processors in GF(2m)


Accepted Manuscript

Parallelization of Scalable Elliptic Curve Cryptosystem Processors in GF(2^m)
K. C. Cinnati Loi, Seok-Bum Ko

PII: S0141-9331(16)00044-2
DOI: 10.1016/j.micpro.2016.02.013
Reference: MICPRO 2359

To appear in: Microprocessors and Microsystems

Received date: 14 May 2015
Revised date: 14 December 2015
Accepted date: 23 February 2016

Please cite this article as: K. C. Cinnati Loi, Seok-Bum Ko, Parallelization of Scalable Elliptic Curve Cryptosystem Processors in GF(2^m), Microprocessors and Microsystems (2016), doi: 10.1016/j.micpro.2016.02.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Parallelization of Scalable Elliptic Curve Cryptosystem Processors in GF(2^m)

K. C. Cinnati Loi^a, Seok-Bum Ko^a,*

^a Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, Canada

Abstract

The parallelization of scalable elliptic curve cryptography (ECC) processors (ECPs) is investigated in this paper. The proposed scalable ECPs support all 5 pseudo-random curves or all 5 Koblitz curves recommended by the National Institute of Standards and Technology (NIST) without the need to reconfigure the hardware. The proposed ECPs parallelize the finite field arithmetic unit and the elliptic curve point multiplication (ECPM) algorithm to gain performance improvements. The finite field multiplication is separated such that the reduction step is executed in parallel with the next polynomial multiplication. Subsequently, the finite field arithmetic of the ECPs is further parallelized, and the performance can be improved by over 50%. Since the multiplier blocks consume few hardware resources, the latency reduction outweighs the cost of the extra multiplier, resulting in more efficient ECP designs. The technique is applied to both pseudo-random curve and Koblitz curve algorithms. A novel ECPM algorithm is also proposed for Koblitz curves that takes advantage of the proposed finite field arithmetic architecture. The implementation results show that the proposed parallelized scalable ECPs have better performance than state-of-the-art scalable ECPs that support the same set of elliptic curves.

* Corresponding author at: Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK S7N 5A9, Canada
Email addresses: [email protected] (K. C. Cinnati Loi), [email protected] (Seok-Bum Ko)

Preprint submitted to Microprocessors and Microsystems, February 27, 2016

Keywords: FPGA, Elliptic Curve Cryptography (ECC), elliptic curve point multiplication (ECPM), binary finite field arithmetic

1. Introduction

In the 1980s, Miller [1] and Koblitz [2] independently proposed the use of Elliptic Curve Cryptography (ECC). ECC has recently gained much popularity over other public-key cryptography algorithms, such as Rivest-Shamir-Adleman (RSA) [3], since the same level of security can be provided using shorter key sizes. As a result, implementations of ECC consume fewer resources and can achieve higher throughput. Due to its many advantages, ECC has been adopted by many standards, such as NIST [4], SEC [5], and FIPS 186-3 [6].

The main operation in ECC protocols is the elliptic curve point multiplication (ECPM). Software implementations of the ECPM are available (e.g. [7]) and can be optimized to run very fast. However, due to its complexity, many implementations offload the ECPM operation to hardware co-processors to free up the processor for other operations and, as a result, speed up the overall system. Offloading the ECPM operation to a hardware platform also provides a power advantage to the system. This paper presents the architecture of these hardware co-processors, or elliptic curve processors (ECPs), and proposes a parallel architecture for scalable ECPs for increased performance.

Scalability refers to a hardware architecture that supports the evaluation of the ECPM for multiple elliptic curves and key sizes recommended by NIST [4]. The scalable ECC processors proposed in this paper support the ECPM calculation for all 5 pseudo-random or Koblitz curves recommended by NIST without the need to reconfigure the hardware. The advantage of a scalable design is its ability to modify the key size on-the-fly, which is useful for security protocols such as Transport Layer Security/Secure Socket Layer (TLS/SSL), where the ECC parameters are negotiated at run-time [8, 9].

There are many ECPs proposed in the literature that are not scalable designs. In [10], the authors propose an algorithm for Koblitz curves that makes use of


concurrent τ and τ^{−1} Frobenius operations to parallelize the point multiplication. In [11], the authors propose an ECC processor that uses 4 parallel finite field multipliers to speed up the point multiplication. In [12], the paper presents an ECC co-processor for binary fields that runs at 66 MHz and can perform the ECPM in 0.233 ms on generic curves and in 0.075 ms on Koblitz curves. In [13], an ECP based on Gaussian Normal Bases (GNB) is presented that can execute the ECPM for the 163-bit binary field in 5.05 µs. The ECP presented in [14] proposes the use of a hybrid Karatsuba multiplier to reduce the resource utilization on the FPGA. These ECPs may be fast and require low resource utilization. However, in order for these designs to support multiple key sizes, multiple instances of the ECP must be instantiated in the design, which would result in either an increase in routing delay, which lowers the maximum frequency, or an insufficient amount of hardware resources, which demands the use of multiple FPGAs.

In 2009, Hassan and Benaissa [15] proposed a scalable ECP that supports binary field SEC curves up to 193 bits. The design uses the hardware/software

co-design (HSC) approach, making use of the PicoBlaze soft-core microcontroller in Xilinx FPGAs. Their design goal is to reduce area consumption for area-constrained platforms, such as RFID, mobile handsets, smart cards, and wireless sensor networks [15]. In addition, Hassan and Benaissa have also proposed scalable designs that support curves up to 571 bits recommended by NIST [16, 17], also for area-constrained environments.

Hardware implementations of prime field scalable ECPs have also been explored by the authors of this paper [18], but the implementation of prime field ECPs is out of the scope of this paper.

In 2013, the authors of this paper published scalable ECP designs that support all 5 Koblitz curves [19] or all 5 pseudo-random curves [20] without the need to reconfigure the hardware. In these publications, the authors proposed a novel finite field arithmetic unit (FFAU) design that efficiently performs finite field arithmetic for all 5 binary fields recommended by NIST [4] in the same hardware. As a result, the designs outperform other scalable designs in the literature. However, there are some drawbacks in the architecture of the designs in [19] and [20].


The FFAU can perform finite field multiplication or finite field squaring along with finite field addition, but subsequent multiplication and squaring operations must be performed sequentially. Furthermore, the reduction step for multiplication and squaring is performed for every instruction before the next instruction can be executed, and it consumes a significant number of clock cycles. Thus, this paper proposes architectures that further improve the performance of the scalable ECPs proposed in [19] and [20] by exploring the parallelization of the finite field arithmetic and the ECPM algorithm. The proposed designs separate the multiplication and squaring operations to allow for simultaneous computation of the two operations. The arithmetic blocks also separate the reduction step from the finite field multiplication to further improve the performance of the ECP. As in the authors' previous works, the proposed ECPs only support the NIST-recommended binary pseudo-random curves and Koblitz curves.

The main contribution of this paper is the proposed architecture of a scalable ECP that parallelizes the ECPM operations. The effect of parallelization is analyzed for both pseudo-random and Koblitz curves. Firstly, the polynomial multiplication step is separated from the reduction step of the finite field multiplication, such that these operations can be performed in parallel. Subsequently, the multiplier block that performs the polynomial multiplication is also replicated for further parallelization. Since the hardware resource utilization of the multiplier block is relatively low compared to that of the square-add (SA) block, the latency reduction of using 2 multiplier blocks outweighs the increase in hardware utilization, increasing the performance. For Koblitz curves, a novel τNAF ECPM algorithm is also proposed, which is made possible by the finite field arithmetic block's efficiency in performing repeated finite field squaring. The efficiency of repeated finite field squaring also improves the performance of the finite field inversion operation using the Itoh-Tsujii algorithm [21].

The rest of this paper is organized as follows: Section 2 reviews finite field operations and elliptic curve cryptography for both pseudo-random and Koblitz curves; Section 3 discusses the hardware architecture and implementation of the proposed scalable ECPs; Section 4 presents a latency analysis, the FPGA implementation results, and a comparison with other designs in the literature; and Section 5 concludes the paper.

2. Elliptic Curve Cryptography

This section is organized into two subsections. Firstly, the finite field operations used by the designs in this paper are described. Secondly, the algorithms used for the elliptic curve point multiplication (ECPM) for both pseudo-random and Koblitz curves are briefly described.


2.1. Scalable Finite Field Operations

In this paper, binary finite field (FF) operations represented in polynomial basis are used for implementing the elliptic curve operations. These FF operations include FF addition (FFADD), FF squaring (FFSQ), FF multiplication (FFMULT) and FF inversion (FFINV). Among these operations, FFADD is the most trivial and can be implemented using a bit-wise exclusive-OR (XOR) operation. FFINV is the most complex operation, but using the Itoh-Tsujii algorithm [21], FFINV is simplified to a series of FFMULT and FFSQ operations. FFSQ can be implemented by using the following property:



A(t)^2 = a_{m−1}t^{2m−2} + · · · + a_1t^2 + a_0  mod P(t)        (1)

which simply interleaves 0 bits with the operand bits.

Thus, the most complicated finite field operation that needs to be implemented by the ECP hardware is FFMULT, which has the highest impact on the performance of the ECP in terms of speed and area. In addition, in order to implement scalable ECPs, the algorithm used for the FF operations must result


in architectures that support multiple key sizes with the same hardware.

In this paper, FFMULT is implemented using the Comba algorithm [22] with the digit width, w, chosen to be 32. The Comba algorithm is a digit-wise multiplication algorithm that processes the operands digit-by-digit, which facilitates scalability.

facilitates scalability.



Both FFMULT and FFSQ require a modulo P(t) operation called the reduction operation. P(t) is an irreducible polynomial chosen for each specific curve, as shown in [4]. In this paper, the reduction operation is not performed when evaluating FFMULT. Rather, it is only performed when computing either FFSQ or FF addition/reduction (FFADDRED). By doing so, the complexity of the system is greatly reduced, because FFMULT can be simplified to performing only the polynomial multiplication.
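The deferred-reduction idea can be sketched as follows: FFMULT produces an unreduced product of degree up to 2m − 2, and the reduction happens only inside the addition step that consumes it. The Python fragment below models that split using generic shift-and-XOR reduction (the hardware instead uses the equivalent per-field reduction matrix described next; function names are illustrative):

```python
def polymul(a: int, b: int) -> int:
    """FFMULT as used here: plain polynomial multiplication, NO reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def ffaddred(a: int, b: int, P: int) -> int:
    """FFADDRED: XOR-addition followed by reduction mod P(t). Products from
    polymul are reduced only here, by shift-and-XOR."""
    c = a ^ b
    m = P.bit_length() - 1
    while c.bit_length() > m:
        c ^= P << (c.bit_length() - m - 1)
    return c

# GF(2^4) example with P(t) = t^4 + t + 1: t^3 * t^3 = t^6 stays unreduced
# until the FFADDRED step that consumes it.
P = 0b10011
prod = polymul(0b1000, 0b1000)      # t^6: degree > m, not yet reduced
res = ffaddred(prod, 0b0001, P)     # (t^6 + 1) mod P(t)
```

Because the reduction now lives in the addition/squaring path, it can run in parallel with the next polynomial multiplication, which is the key parallelization exploited later in this paper.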

In this paper, the reduction operation is performed using a reduction matrix for each finite field, namely m = 163, 233, 283, 409 and 571, such that the reduction operation is defined as follows:

D(t) = R × C(t)        (2)

where C(t) is a binary column vector of the coefficients of the polynomial to be reduced, (c_{2m−2}, . . . , c_1, c_0), R is the m × (2m − 1) reduction matrix, and D(t) is the reduced column vector, (d_{m−1}, . . . , d_1, d_0). The multiplication and addition operations in the matrix multiplication are performed in GF(2).

2.2. ECPM for Pseudo-Random Curves



Pseudo-random curves recommended by NIST [4] over GF(2^m) have the following form:

E : y^2 + xy = x^3 + x^2 + b        (3)


where b is a constant specific to each curve. The main operation in ECC is the elliptic curve point multiplication (ECPM). Given a point, P, defined on the curve E and an integer, k, the ECPM is defined as follows:

Q = kP = P + P + · · · + P  (k times)        (4)

where Q is the resultant point, which is also on the curve E. In this paper, the algorithm chosen for computing the ECPM on pseudo-random curves is the Lopez-Dahab (LD) algorithm [23], shown in Algorithm 1.
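Since Algorithm 1 itself is not reproduced in this excerpt, the scalar multiplication in (4) can at least be illustrated with a generic left-to-right double-and-add loop; the point operations are passed in as callbacks, and integers stand in for curve points purely as a sanity check (all names here are illustrative, not the paper's):

```python
def ecpm(k, P, dbl, add, identity):
    """Evaluate Q = kP from (4) by left-to-right double-and-add.
    Algorithm 1 (Lopez-Dahab) replaces dbl/add with Mdouble/Madd below."""
    Q = identity
    for bit in bin(k)[2:]:
        Q = dbl(Q)
        if bit == "1":
            Q = add(Q, P)
    return Q

# Sanity check with integers standing in for curve points:
# doubling becomes *2 and addition becomes +, so ecpm must return k*P.
q = ecpm(201, 7, dbl=lambda a: 2 * a, add=lambda a, b: a + b, identity=0)
print(q)  # 1407
```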


In Algorithm 1, Madd is defined as:

(X, Z) ← Madd(X1, X2, Z1, Z2, x):
    Z ← (X1Z2 + X2Z1)^2
    X ← X1X2Z1Z2 + x(X1Z2 + X2Z1)^2        (5)

and Mdouble is defined as:

(X, Z) ← Mdouble(X1, Z1, b):
    X ← X1^4 + bZ1^4
    Z ← (X1Z1)^2        (6)

Furthermore, the projective-to-affine coordinate conversion shown in Algorithm 1 requires 3 FFINV operations, for x, Z1 and Z2. In this paper, the conversion algorithm has been modified as follows so that only 1 inversion is required:

x0 ← (xZ2X1) / (xZ1Z2)

y0 ← (Z2(xZ1 + X1) / (xZ1Z2)) · ((x(xZ1 + X1) / (xZ1Z2)) · (xZ2 + X2) + x^2 + y) + y        (7)

The Lopez-Dahab (LD) algorithm uses a standard projective coordinate system that uses 3 coordinates to represent a point, (X, Y, Z), where x = X/Z and y = Y/Z. One of the main advantages of using the LD algorithm is that only the X and Z coordinates need to be computed in the loop, and the conversion back to affine coordinates can be obtained by simply using the X and Z coordinates of the resultant point and the x and y affine coordinates of the original point. A summary of the number of FF operations for each point operation is provided in TABLE 1.

coordinates of the resultant point and the x and y affine coordinates of the original point. A summary of the number of FF operations for each point operation is provided in TABLE 1.




2.3. ECPM for Koblitz Curves

Koblitz curves [24] recommended by NIST [4] have the following form:

E_a : y^2 + xy = x^3 + ax^2 + 1        (8)

where a = 0 or 1. As for pseudo-random curves, the fundamental operation in Koblitz curve ECC is the ECPM. In this paper, Lopez-Dahab (LD) coordinates [25] are used to delay the need for an FFINV until the end of the algorithm. The mixed LD and affine coordinate point addition (PADD) [26] is used to reduce the number of operations to 9 FFMULT, 5 FFSQ and 9 FFADD. The expression for the PADD of a point in LD coordinates, (X1, Y1, Z1), with a point in affine coordinates, (x, y), resulting in a point in LD coordinates, (X3, Y3, Z3), is given as follows:

Z3 = (Z1(xZ1 + X1))^2

X3 = (yZ1^2 + Y1)^2 + (xZ1 + X1)^2(Z1(xZ1 + X1) + aZ1^2) + (yZ1^2 + Y1)(Z1(xZ1 + X1))

Y3 = ((yZ1^2 + Y1)(Z1(xZ1 + X1)) + Z3)(X3 + xZ3) + (x + y)Z3^2        (9)


Since a in (8) is 0 or 1, and the sum (x + y) can be precomputed, the total number of operations can be reduced to 8 FFMULT, 5 FFSQ and 8 FFADD.

When evaluating the ECPM for Koblitz curves, the scalar, k, is converted into τ-non-adjacent form (τNAF) to simplify the point doubling (PDBL) operations [27]. The τNAF-converted algorithm performs the Frobenius endomorphism (PFRB) instead of PDBL, which reduces the number of operations from 4 FFMULT, 5 FFSQ and 4 FFADD to 3 FFSQ. A summary of the number of FF operations is provided in TABLE 1.

In some systems [28, 29, 30], the τNAF conversion is included in the ECP

implementation. However, as noted in [31], in some systems the τNAF-converted digits of k can be generated randomly and converted back to their binary equivalent. In these systems, a separate τNAF-to-binary converter may be used in parallel with the ECP. In addition, similar work in the literature [19, 32] also does not include the τNAF conversion in the ECP. In order to more easily compare the proposed ECPs with the ECPs in the literature, the τNAF conversion is considered out of the scope of this paper; interested readers can refer to [27], [31] and [33]. Nevertheless, should τNAF conversion be required when using the ECP proposed in this paper, the converters presented in [28, 29, 30] may be used in parallel with the proposed ECP.

In this paper, a novel τNAF ECPM algorithm for Koblitz curves is presented that improves on the one used in [19]. The novel ECPM algorithm is shown in Algorithm 2. The τNAF ECPM algorithm used in [19] only performs the else section of the main loop shown in Algorithm 2, where the PFRB is evaluated at every iteration and PADD is executed if the current digit is non-zero. In the proposed algorithm, if the current digit is zero, a further optimization is performed by using the proposed double Frobenius endomorphism (PDFRB) (Q ← τ^2 Q), followed by a PADD if the next digit is non-zero. As will be shown in the next section, the architecture of the finite field arithmetic block allows for efficient repeated squaring, where each subsequent squaring only requires 1 additional clock cycle. Thus, the proposed PDFRB step is more efficient than performing PFRB twice, and Algorithm 2 reduces the number of iterations of the main loop, resulting in a lower latency.

digit, ui , is 0, then a PDFRB operation is executed and the index, i, is reduced by 2 instead of 1. If the next indexed τ NAF digit, ui−1 , is also 0, then no

other operation is required. Otherwise, PADD executes a point addition or subtraction depending on the sign of ui . If ui is non-zero, then a PFRB is

AC

195

performed, followed by a PADD, and the index is decremented by 1. The index i is decremented by 1 twice after PDFRB in Algorithm 2 is to make PADD perform the same operation in both ui = 0 and ui = 1/ − 1 cases.

9

ACCEPTED MANUSCRIPT

200

3. Design and Architecture of the Scalable ECPs In this section, the hardware architectures of the proposed scalable ECPs are presented. This section is divided into 3 subsections. In the first subsection,

CR IP T

the architectures of the finite field arithmetic blocks are presented. In the second subsection, the architecture of the scalable ECPs using a single multiplier is

presented. In the third subsection, the architecture of the ECPs with multiple multipliers is explored. For the remainder of this paper, random ECP refers to a scalable ECP for pseudo-random curves and Koblitz ECP refers to a scalable

AN US

ECP for Koblitz curves recommended by NIST. 3.1. Finite Field Arithmetic Blocks

In [19] and [20], the finite field arithmetic is performed using a finite field

210

arithmetic unit (FFAU) that can either perform FFMULT or FFSQ. In this paper, the proposed ECP uses 2 finite field arithmetic blocks that work closely

M

with each other, the multiplier block (MULT) and the square-add block (SA). 3.1.1. Multiplier (MULT) Block

The MULT block performs the Comba algorithm and is shown in Fig. 1.

ED

215

Its inputs are 32-bit buses and the values are stored in dual-port RAMs. Since the digit size is 32, the RAMs are s = d571/32e = 18 words deep. The MULT

PT

block uses 2 ‘multiplier units’ (‘x’ in Fig. 1) in parallel. Each ‘x’ is a purely combinational 32-bit Karatsuba-Ofman multiplier [34]. The digits are read out to the ‘x’ block according to the indexes in the inner and outer loops of the

CE

220

Comba algorithm. The output of the ‘multiplier units’ are accumulated in the 63-bit ‘UV reg-

AC

ister’. The addition operation is performed using XOR operations. Once the inner loop is completed the least-significant 32 bits of ‘UV register’ are sent to

225

the 'FIFO C' or 'SIPO C' for storage, and the register is right-shifted by 32 bits to prepare for the next inner-loop calculation. Both 'FIFO C' and 'SIPO C' are storage units for the resultant product. 'FIFO C' is a first-in-first-out unit that is used for the least-significant ⌈m/32⌉


digits of the product. ‘SIPO C’ is a digit-serial-in-parallel-out shift register that 230

stores the remaining most-significant digits. The separation of the product’s storage is due to the architecture of the SA unit which is discussed below. Thus,

CR IP T

the MULT block has 2 outputs. ‘C’ is the output of ‘FIFO C’ that outputs the least-significant 32-bit digits, one digit at a time, whereas ‘C msd’ is the parallel

output of ‘SIPO C’, which requires a maximum of (2×571−1)−(d571/32e×32) = 235

565 bits.

Using the proposed architecture, the MULT block completes its operation

in (s/2)2 × 2 + s/2 + s + 3 clock cycles, where s = dm/32e and s + 3 clock cycles

3.1.2. Square-Add (SA) Block

AN US

are used for loading the input digits, the pipelining stages.

The SA block performs both FF addition/reduction (FFADDRED) and re-

240

peated FFSQ and is shown in Fig. 2. The ‘A’ and ‘B’ inputs of the SA block are 32-bit digits. When performing FFADDRED, the ‘A’ and ‘B’ are added by

M

a 2-input 32-bit XOR block. During FFSQ, input ‘B’ is set to 0. ‘SREG C’ is a shift register with both digit-serial and parallel inputs and outputs. The output of the adder connects to the digit-serial input (‘s in’), which shifts by 32 bits

ED

245

on every clock cycle. Once all the digits are collected, ‘SREG C’ outputs the complete value through the 576-bit parallel output port.

PT

For FFADDRED, the value is concatenated with the input ‘B full’, which is connected to the ‘C msd’ output of the MULT block. By doing so, the SA block 250

effectively adds ‘A’ and ‘B’, where ‘B’ can be the output of a polynomial multi-

CE

plication to be reduced. The concatenated value is chosen by the multiplexers to input into 5 reduction blocks, ‘R163’,‘R233’, ‘R283’,‘R409’, and ‘R571’, which

AC

are combinational logic blocks derived from the R matrix in (2) for each of the

5 finite fields. The reduction blocks output to a multiplexer, which selects the

255

appropriate value to be stored back in ‘SREG C’ through its parallel input port. Finally, the result is output through the digit-serial output port of ‘SREG C’. When operating for repeated FFSQ, the parallel output of ‘SREG C’ is input into the ‘SQ’ block, which interleaves 0s to perform polynomial squaring. The 11


result is selected by the multiplexer to input into the reduction blocks. Similar 260

to FFADDRED, the reduced value is input back into ‘SREG C’ via the parallel input port. At this point, if another FFSQ is required, ‘SREG C’ outputs the

CR IP T

value through its parallel output port again and the process is repeated as many times as required. By doing so, apart from the first FFSQ, which requires s clock

cycles to load the operand, every FFSQ can be completed in 1 additional clock 265

cycle. This characteristic is especially useful for performing FF inversion using Itoh-Tsujii algorithm, where FFSQ is repeated many times [35].

AN US

Based on the above description, the operations that the SA block supports     r are: (A + B) mod P or (A2 ) mod P , where A has size m, B in FFADDRED mode has size 2m − 1, r ≥ 1 is the number of times FFSQ is repeated, P

270

is the reduction polynomial. The SA block completes a FFADDRED operation in s + 1 clock cycles and a repeated FFSQ operation in s + 1 + (r − 1) clock cycles, where s + 1 clock cycles are used for loading and r − 1 clock cycles for repeated squaring.

275

M

The architecture of the MULT and SA blocks are an improvement over the finite field arithmetic unit (FFAU) used in [19] and [20] as follows. The

ED

reduction step of FFMULT is removed from the multiplier, which reduces the latency. Instead, the reduction step for both FFMULT and FFSQ are performed in the SA block. As previously mentioned, one of the drawbacks of the FFAU

280

PT

in [19] and [20] is the long latency of the FFMULT and FFSQ operations, which includes the reduction step in a single operation. In addition, the ability for the

CE

SA block to compute repeated FFSQ with 1 additional clock cycle allows for the use of Algorithm 2 for Koblitz curves to further reduce latency, whereas the

AC

FFAU in [19] and [20] does not have this ability. 3.2. Single-Multiplier Scalable ECPs

285

In this subsection, the ECP architecture of Koblitz ECP and random ECP

using the finite field arithmetic blocks presented in Section 3.1 are described. The Koblitz ECP is presented first, followed by the random ECP.



3.2.1. Koblitz ECPs The block diagram of the single multiplier (1-MULT) scalable Koblitz ECP 290

is shown in Fig. 3. The scalable ECP evaluates Algorithm 2 after the τ NAF(k)

CR IP T

computation. The inputs x1 and y1 are 32-bit buses that enter the affine co-

ordinates of a point digit-by-digit. The τ NAF converted value of k, with a magnitude and sign are input by 32-bit buses into the controller. The outputs

of the ECP are 32-bit x3 and y3 buses for the affine coordinates of the resultant 295

point. Since values are transferred digit-by-digit in and out of the ECP and it

takes s = dm/we clock cycles. For simplicity, the finite state machine (FSM)

AN US

and some control signals are not shown. The advantage of using 32-bit ports is

to allow for simpler interfacing with general purpose processors that commonly operate in 32- or 64-bit data paths.

The inputs to the MULT and SA blocks are controlled by the current state

300

of the FSM and 2 program counters, MULT PC and SA PC. The instructions executed by the processor in each state is shown in TABLE 2. The RAM stores

M

the input values x1 and y1 into x and y and their sum into xy. It also stores the temporary values X1 , Y1 , Z1 , T1 , T2 , T3 that are used in TABLE 2. The RAM stores all the values in 32-bit digits. As a result, the total size of the

ED

305

RAM is d571/32e × (32 × 9) = 18 × 288 bits. The outputs of the MULT block are connected to the ‘B full’ port of the SA block and the multiplexer for input

PT

‘B’. The advantage of the proposed architecture is that there is no need to store the product, which would require twice the number of words in the RAM. The disadvantage is that every multiplication must be followed by an addition

CE

310

performed on the SA block. However, since the addition can be performed in parallel with the next multiplication, in many occasions, the number of clock

AC

cycles used by the addition does not affect the latency. Finally, the result of x3 and y3 are obtained from T1 and T2 , respectively.

315

In TABLE 2, the MULT operations are performed without reduction and the

SA operations are reduced by P (t) as described in Section 3.1. There are a few special features that contribute to the improved performance of the proposed



design. The PDQA state is a combination of PADD with PFRB or PDFRB, and the PQUAD state performs PDFRB or PFRB if a single 0 digit is the 320

least significant digit of τ NAF(k). The operations are combined as such so

CR IP T

that PDQA is the only state that needs to perform PADD. The PQUAD state will only need to implement FFSQ or double FFSQ (A4 ). These states execute the main loop in Algorithm 2. However, some operations in these states are optimized to perform instructions for the previous or next iteration.

The most important feature of the sequence of instructions presented is the

325

ability for several SA block operations to be executed simultaneously with a

AN US

single MULT block operation because of the number of clock cycles required by the MULT block operation. Due to this feature, the clock cycles required by the reduction step of FFMULT performed in the SA block are masked by the 330

execution of the next MULT block operation.

The FSM of the scalable ECP is shown in Fig. 4. The FSM resets to the IDLE state. The ECPM operation is triggered by asserting the load signal,

M

which moves the FSM to the LOAD state. The FSM only stays in the LOAD state for 1 clock cycle. At the LOAD state, the first point addition of Q ← ∞±P in Algorithm 2 is performed by loading the appropriate values into the RAM.

ED

335

If the magnitude of the 3rd most-significant digit of k is 1, the FSM moves to the PDQA state, otherwise it moves to the PQUAD state.

PT

When the operations for PDQA and PQUAD states are complete, the FSM goes to the PDQA state if the magnitude of either the current digit of k (cur k) or the next digit of k (next k) is 1, otherwise it goes to PQUAD state. k count

CE

340

is used to keep track of the current index of k that is being processed. When

k count is 0, the main loop in Algorithm 2 is completed except for the evaluation

AC

of Y1 , which is performed in the BX state. After the BX state, the FSM enters the ISQ state, which initiates the Itoh-Tsujii algorithm [21].

345

Instead of performing both 1/Z3 and 1/Z32 for coordinate conversion, only

1/Z3 is performed. Subsequently, 1/Z32 can be obtained by (1/Z3 )2 . The ISQ state computes repeated FFSQ operations, followed by IMULT, which computes 1 multiplication, IRED, which reduces the product, and returns to the ISQ state. 14


The number of times the ISQ and IMULT states cycle depends on the selected 350

field. Once FFINV is completed, the FSM moves to the FMULT state which com-

CR IP T

putes 2 multiplications to complete the coordinate conversion. After the FMULT state, the FSM moves to the FINAL state and to the

WAIT state, where x3 and y3 are output. The FSM is able to move immedi355

ately back to the LOAD state if the load signal is detected at the WAIT state, otherwise it will return to the IDLE state.

AN US

3.2.2. Pseudo-Random ECPs

Using an architecture similar to the Koblitz ECP, the 1-MULT random ECP is implemented for the pseudo-random curves. As in the Koblitz ECP, the inputs and outputs of the ECP are 32-bit buses, x1, y1, x3 and y3. The binary representation of the scalar multiplier k is also input through a 32-bit bus. The core of the 1-MULT random ECP likewise consists of the MULT and SA blocks.

The order of instructions executed by the processor in each FSM state is presented in TABLE 3. These instructions are stored in the controller along with the PCs and the ROM that stores the b coefficients. The RAM stores the input values x1 and y1 in x and y and the temporary values X1, X2, Z1, Z2, T1, T2, T3 that are used in TABLE 3, so it also has dimensions 18 × 288 bits, as in the Koblitz ECP.

The FSM of the 1-MULT random ECP is similar to that of the 1-MULT Koblitz ECP shown in Fig. 4, except that the main loop (LOOP state) executes the Lopez-Dahab algorithm and a couple of multiplications are required prior to the inversion states. In TABLE 3, the operations given in the LOOP state are obtained by rearranging the Madd and Mdouble operations in Algorithm 1. The MUL1, MUL1R, MUL2 and MUL2R states compute the value of xZ1Z2 to set up for the FFINV operation in the ISQ, IMULT and IRED states. Finally, the CONV state converts the projective coordinates to affine.


3.3. Multiple-Multiplier Scalable ECP

Since the MULT block only needs to perform the polynomial multiplication, its hardware resource utilization is much lower than that of the SA block. In this subsection, the use of multiple MULT blocks is explored to improve the performance of the scalable ECP. The use of 2 MULT blocks is examined for random ECPs first. Subsequently, the Koblitz ECP using 2 MULT blocks is also presented.

For the 2-MULT ECPs, the architecture of the MULT block does not need to be modified. However, since 2 MULT blocks are used, the architecture of the SA block is modified to interact with both MULT blocks. The block diagram of the new SA block is shown in Fig. 5. The main difference between this SA block and the one shown in Fig. 2, in terms of hardware resources, is the addition of an extra 32-bit 2-input XOR gate and the shift register 'SREG C', which are shaded in Fig. 5.

In order to interface with the outputs of the 2 MULT blocks, the SA block is modified to take 2 sets of inputs. 'A1', 'A1 full' and 'B1' connect to one of the MULT blocks, and 'A2', 'A2 full' and 'B2' connect to the other MULT block. The operation of the SA block has also been modified slightly, to combine FFSQ, FFADD and FF reduction into 1 type of operation. Thus, the new SA block always performs the operations (A1+B1)^(2^r) mod P(t) and (A2+B2) mod P(t), where r is the number of repeated FFSQ operations; if r = 0, only addition is performed. Note that the sum of 'A2' and 'B2' cannot be subsequently squared. During operation, all inputs are loaded into the SA block simultaneously, so both 'SREG C' registers are loaded simultaneously. Once all digits are input, the data from the top 'SREG C' goes through the multiplexers into the appropriate reduction block and is stored back in 'SREG C'. A control signal indicates whether the second 'SREG C' is being used; if so, the second 'SREG C' is selected as input to the reduction blocks and the result is stored back in the second 'SREG C'. Finally, the results are output through 'C1' and 'C2' as 32-bit digits. Thus, the latency of FFADD for only the first set of operands is s + 1 clock cycles, repeated FFSQ is s + 1 + r clock cycles and FFADD using both sets of operands is s + 2, where s = ⌈m/32⌉ is the number of 32-bit digits and m is the key size. To take advantage of the 2 MULT blocks, the operations in TABLE 3 have

been parallelized to produce the operations in TABLE 4. The major differences are shaded in TABLE 4, where MULT and SA block operations are parallelized. Comparing the operations in TABLE 3 and TABLE 4, the most significant difference occurs in the LOOP state, where the latency of 6 FFMULT (6M) operations is reduced to 3 FFMULT (3M) operations plus the 2 FFADD (2A) operations in the MUL1 state that must now run in every iteration. This reduction is very significant, as the LOOP state is the most time-consuming step of the ECPM and must execute m − 1 times. The CONV state operations are also reduced from 7M to 4M + 1A.

The same parallelization technique has been applied to the 1-MULT Koblitz ECP. The PDQA state in the 1-MULT Koblitz ECP requires 8M + 1A operations. Due to the data dependency in PADD, only certain FFMULTs can be parallelized, and the resultant algorithm requires 5M + 4A + 1S (FFSQ) operations. Thus, the latency reduction in the Koblitz ECP using 2 MULT blocks is not as significant as in the random ECP. The resultant series of operations is shown in TABLE 5.

The same technique can be used to further parallelize the multiplication instructions across 3 or 4 MULT blocks. In the random ECP, the LOOP state reduces to 2M + 3A + 1S operations and, in the Koblitz ECP, the PDSA state reduces to 4M + 9A + 1S operations. For the random ECP, further parallelization using 4 MULT blocks does not further reduce the latency, due to the data dependency of the Lopez-Dahab algorithm. In the Koblitz ECP, the method of interleaving multiplications used in [36] may be applied to the proposed ECP designs by using 4 MULT blocks and 2 SA blocks. However, the structure of the SA block would require some modifications and, due to the data dependency structure in the proposed ECP, not all addition and squaring operations can be completely masked by multiplication. Thus, the number of operations in the PDSA state reduces to approximately 2M + 5A + 2S. Using this method, the number of clock cycles of the ECPM reduces by approximately 30% to 40%, but the hardware resource utilization doubles compared to the 2-MULT Koblitz ECP. Since the latency reduction is not significant, the increase in hardware resource utilization from using 3 or 4 MULT blocks outweighs the benefit of the latency reduction. Thus, the use of 3 or 4 MULT blocks worsens the efficiency (as defined in Section 4.2) of the ECP, so their implementation results are not shown in this paper.

The parallelization of the SA block is not considered in this paper, because the SA block occupies the majority of the hardware resources of the ECP. Since the latency of FFMULT is the bottleneck of the operations and the SA operations are masked by the MULT block operations, parallelizing the SA block does not have a great impact on the latency of the system. Thus, it is not worthwhile to parallelize the SA operations, which would cause the hardware resource utilization to increase dramatically with only a minor decrease in latency.

4. Implementation Results and Analysis

4.1. Latency Estimation

According to the designs of the scalable ECPs described above, TABLE 6 presents the latency in terms of the number of clock cycles required for each operation. tMULT is the latency of the MULT block; it does not change from one design to another because all 4 ECPs use the same MULT block. tSA has been discussed previously, but it must be noted that for the 1-MULT ECPs, the latency of FFADD is s + 1 clock cycles and repeated FFSQ is s + 1 + (r − 1), whereas for the 2-MULT ECPs, the latency is s + 1 + r, where s is the number of digits and r is the exponent of the repeated FFSQ. Thus, for the 2-MULT ECPs, the repeated FFSQ operation requires an extra clock cycle. Note that the tSA value shown in TABLE 6 and in the expressions below only represents the latency of an FFADD operation, tSA = s + 1.

For the 1-MULT random ECP, tINIT is given by 2tSA + 1. tLOOP is determined by the number of iterations of the LD algorithm, which is m − 1. Each iteration requires 6 FFMULT operations, so tLOOP = (6tMULT)(m − 1). The number of clock cycles for FFINV, tINV, is given by the number of times the ISQ, IMULT and IRED states are entered, which is field-dependent and also depends on how many times the ISQ state has been entered. In total, tINV = g × tMULT + 2g × tSA + m − 2 − g, where g = ⌊log2(m − 1)⌋ + h(m − 1) − 1 and h(x) is the Hamming weight of x. tP2AC is the total number of clock cycles required by the projective-to-affine conversion, including the time of the inversion, and it is given by 2tMULT + 2tSA + tINV + 8tMULT + tSA, where the first 2tMULT and 2tSA are from the states MUL1, MUL2, MUL1R and MUL2R, the 8tMULT are from the CONV state and the last tSA is from the FINAL state. Finally, the number of clock cycles of the complete ECPM is tECPM = 1 + tINIT + tLOOP + tP2AC + s, where 1 clock cycle is used by the LOAD state and s, the number of 32-bit digits, is consumed by the WAIT state.

For the 2-MULT random ECP, tINIT is given by 3tSA + 3. Each iteration

of the LOOP state requires 3 FFMULT, 1 FFADD with 2 sets of inputs and 1 FFADD with 1 set of inputs, except for the final iteration, which executes an additional FFMULT to mask the SA operations. Thus, tLOOP = (3tMULT)(m − 1) + (2tSA + 1)(m − 2) + tMULT. Due to the setup of FFINV, tINV is slightly different from the 1-MULT case and is given by tINV = g × tMULT + (2g + 1) × tSA + g + 1 + m − 2 − g. The latency of the coordinate conversion is given by tP2AC = 2tMULT + tSA + tINV + 4tMULT + tSA + 1 + tSA + 1. tECPM is given by the same expression as in the 1-MULT case.
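The 1-MULT random ECP expressions above can be collected into a single estimator. tMULT is left as a parameter here, since the MULT block latency depends on its digit-serial design and is not restated in this section.

```python
from math import ceil, floor, log2

def ecpm_cycles_1mult_random(m, t_mult):
    """Clock-cycle estimate for the 1-MULT random ECP (Section 4.1 formulas)."""
    s = ceil(m / 32)                        # number of 32-bit digits
    t_sa = s + 1                            # FFADD latency of the SA block
    g = floor(log2(m - 1)) + bin(m - 1).count("1") - 1
    t_init = 2 * t_sa + 1
    t_loop = 6 * t_mult * (m - 1)           # LD main loop, 6 FFMULT/iteration
    t_inv = g * t_mult + 2 * g * t_sa + m - 2 - g
    t_p2ac = 2 * t_mult + 2 * t_sa + t_inv + 8 * t_mult + t_sa
    return 1 + t_init + t_loop + t_p2ac + s  # +1 for LOAD, +s for WAIT
```

For m = 163, s = 6, tSA = 7 and g = ⌊log2 162⌋ + h(162) − 1 = 7 + 3 − 1 = 9.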

For the 1-MULT Koblitz ECPs, the number of clock cycles in the PDQA state is given by 8 × tMULT + tSA, since most of the SA operations execute in parallel with the MULT operations. The number of clock cycles spent in the PQUAD state is given by 4 × tSA + 3, where the 3 clock cycles are the result of 1 clock cycle per double-square. In order to estimate the number of clock cycles for the ECPM, one must estimate the average number of times the PDQA and PQUAD states are entered. Since τNAF(k) has an average Hamming weight of m/3 [27], the PDQA state is entered on average m/3 times. A Markov chain is used to estimate the number of times that the PQUAD state is executed.

The 3-state Markov chain is shown in Fig. 6. The '00' state executes the PQUAD state. The '01' and '1' states both execute the PDQA state. According to [27], τNAF(k) cannot have 2 successive non-zero digits. Thus, state '1' can only be followed by states '00' or '01', each with a probability of 0.5. Similarly, state '01' can only be followed by '00' or '01', each with a probability of 0.5. Finally, the '00' state can be followed by any of the 3 states, each with a probability of 0.333. From the Markov chain, the following transition matrix can be written:

              (00)    (01)    (1)
      (00)  [ 0.333   0.333   0.333 ]
  P = (01)  [ 0.5     0.5     0     ]                                    (10)
      (1)   [ 0.5     0.5     0     ]

where the first row represents the transitions from '00', the second row the transitions from '01' and the third row the transitions from '1'.

From the transition matrix, the steady-state vector of the Markov chain can be obtained, and we find that the steady-state probability of state '00' is 3/7, of state '01' is 3/7 and of state '1' is 1/7. Based on this analysis, the ratio of PDQA to PQUAD executions is 4:3, which means that if the PDQA state is entered m/3 times, the PQUAD state is entered m/4 times. Using these estimates, tPDQA = ⌈(m − 1)/3⌉ × (8tMULT + tSA) and tPQUAD = ⌈(m − 1)/4⌉ × (4tSA + 3).

The number of clock cycles for FFINV,


random ECP. tP2AC is the total number of clock cycles required by the projective to affine conversion and it is given by tSA + tINV + 2tMULT + tSA . Finally,

CE

tECPM = 1 + tPDQA + tPQUAD + tP2AC + s, where 1 clock cycle is used by the 510

LOAD state and s = dm/32e clock cycles are used by the WAIT state. For the 2-MULT Koblitz ECP, the same ratio of PDQA to PQUAD is used

AC

and the latencies are given by tPDQA = d(m − 1)/3e × (5tMULT + 5tSA + 3) and tPQUAD = d(m − 1)/4e × (4tSA + 6). The FFINV latency is once again slightly

different due to the change in the algorithm and is given by tINV = g×tMULT +g×

515

(2tSA +1)+m−2−g. Finally, tP2AC is given by tSA +tINV +tSA +1+tMULT +tSA +1 and tECPM is given by 1 + tPDQA + tPQUAD + tP2AC + s. Comparing the 1-MULT and 2-MULT tECPM for each of the ECPs, one can 20


notice that for the random ECPs, the 2-MULT implementation decreases the latency by between 41% and 46%. However, the same impact is not observed in the Koblitz ECPs, where the decrease is only 20% to 30%. This observation is consistent with the earlier discussion, where the 2-MULT random ECP reduces the LOOP state from 6M to 3M + 2A operations, whereas the 2-MULT Koblitz ECP only reduces the PDQA state from 8M + 1A operations to 5M + 4A + 1S operations.
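The steady-state probabilities quoted above (3/7, 3/7 and 1/7) follow directly from the transition matrix in (10) and can be checked numerically by power iteration:

```python
def steady_state(P, iters=200):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# States ordered ('00', '01', '1'); rows are transition probabilities (Fig. 6).
P = [[1/3, 1/3, 1/3],
     [1/2, 1/2, 0.0],
     [1/2, 1/2, 0.0]]
pi = steady_state(P)
```

Since PDQA runs in states '01' and '1', the PDQA-to-PQUAD ratio is (3/7 + 1/7) : 3/7 = 4 : 3, as used in the latency estimates.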

4.2. FPGA Implementation Results

The proposed scalable ECPs have been implemented using the Xilinx ISE 11.5 software. The target FPGA selected is the Xilinx Virtex-5 XC5LX110T, for comparison purposes with other ECP designs in the current literature. The post-place-and-route hardware utilization and timing performance results are shown in TABLE 7, along with other ECP designs in the current literature. To better compare the performance of the various designs shown in TABLE 7, an efficiency metric is used that takes into account both the hardware utilization and the timing latency. The efficiency metric is defined as follows:

    Efficiency = (Number of ECPMs per second) / (Number of slices)          (11)
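As a concrete reading of (11): a design that completes one ECPM in tECPM clock cycles at clock frequency f performs f/tECPM ECPMs per second. The values in the usage below are purely illustrative and are not taken from TABLE 7.

```python
def efficiency(clock_hz, ecpm_cycles, slices):
    """Eq. (11): ECPM operations per second, per occupied slice."""
    ecpm_per_sec = clock_hz / ecpm_cycles
    return ecpm_per_sec / slices

# Illustrative values only: hypothetical 100 MHz clock, 1e6-cycle ECPM.
eff = efficiency(100e6, 1_000_000, 2000)
```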

The design in [16] uses a hardware-software co-design (HSC) approach. It uses the PicoBlaze soft-core microcontroller in the FPGA to implement the majority of the control signals, and only the finite field operations are implemented in hardware. Due to the use of a different target FPGA, a fair comparison cannot be made with the proposed ECP.

The design in [37] is not a scalable design. It is highly optimized for a specific curve and does not need to handle multiple curves, so it is much more efficient than the designs proposed in this paper; simply using the efficiency metric to compare the design in [37] with the proposed designs is not fair. The most significant advantage of the proposed designs is the scalability of the ECP while maintaining a low resource utilization. For instance, the 163-bit design in [37] uses 6150 slices, whereas the proposed 1-MULT random ECP only


requires 2290 slices and supports all 5 pseudo-random curves recommended by NIST.

For Koblitz ECPs, the design in [17] is similar to the design in [16], where the HSC approach is used. However, the design in [17] only implements 3 of the 5 NIST-recommended Koblitz curves, and the latencies are much higher due to the software operations.

The design in [38] presents a non-scalable ECP that is optimized for the 163-bit key size. Even though the latency is low, the number of slices required is extremely high, which is the same observation made for [37] for pseudo-random curves.

The proposed 1-MULT random and Koblitz ECPs most closely resemble the designs in [19] and [20], which are previous designs published by the authors of this paper for the pseudo-random and Koblitz curves recommended by NIST [4]. Both designs [19, 20] are scalable and support all 5 key sizes. There are some major improvements that make the ECPs proposed in this paper superior. Firstly, the designs in [19, 20] use 1 finite field arithmetic unit (FFAU) that can only perform 1 operation at a time, whereas the proposed 1-MULT ECPs parallelize the FFAU into the MULT and SA blocks. This allows the current designs to perform FFMULT and FFSQ or FFADD simultaneously, reducing the number of clock cycles. Secondly, the proposed 1-MULT ECPs do not perform reduction for FFMULT until the subsequent FFADDRED, which further reduces the number of clock cycles per operation. Furthermore, for Koblitz curves, a novel τNAF ECPM algorithm is proposed that takes advantage of the efficient repeated-FFSQ capability of the SA block to reduce the latency. Overall, the proposed 1-MULT ECPs reduce the number of clock cycles of the ECPM dramatically.

The increase in hardware utilization is due to the use of the reduction blocks for each of the 5 key lengths. Even though the hardware utilization of both proposed 1-MULT ECPs is higher than that of their counterparts in [19] and [20], the benefit of the latency reduction outweighs the area increase, as shown by the increase in efficiency for both 1-MULT ECPs.

In another recent publication by the authors of this paper [32], the MULT


block of the 1-MULT ECPs has been replaced with Karatsuba-Ofman multipliers. As shown in TABLE 7, even though the latencies of the proposed ECPs are higher, the efficiency metric shows that for the lower key sizes, the proposed 2-MULT ECPs outperform the designs in [32].

Comparing the proposed 1-MULT and 2-MULT random ECPs confirms some of the observations stated in previous sections. By further parallelizing the FF arithmetic using 2 MULT blocks, the number of slices only increases from 2290 to 2708. The increase in the number of registers and LUTs is due to the use of the extra 'SREG C' shift register and the additional MULT block. Since the critical path of the design is not affected, the change in the maximum clock frequency is minimal. However, the decrease in latency is significant in the 2-MULT random ECP, as described in Section 4.1. Thus, the overall efficiency increases by between 43% and 57%. Furthermore, if the 3-MULT random ECP implementation results were obtained, it is expected that the efficiency would decrease, because the number of slices would increase by approximately 500 but the LOOP state would only decrease from 3 to 2 MULT operations, which is not as significant.

In addition, the latency improvement gained by using 2 MULT blocks in the random ECP does not translate well to the Koblitz ECP. The hardware resource utilization shows a similar increase from the 1-MULT Koblitz ECP to the 2-MULT Koblitz ECP in terms of the number of slices, registers and LUTs. Thus, the efficiency of the 2-MULT Koblitz ECP only increases by 1.4% to 16% compared to the 1-MULT Koblitz ECP.

As previously mentioned, the use of 3 or 4 MULT blocks does not further


improve the efficiency, because the increase in the number of slices required outweighs the decrease in the latency of the ECPM operation. The interleaving of multiplications described in [36] could be applied to the proposed Koblitz ECPs, but the reduction of the MULT block result must be followed by an SA block operation, so the latency only improves by 30% to 40% compared to the 2-MULT Koblitz ECP. Furthermore, the ECPs described in [36] only support 1 Koblitz curve at a time, which is not a scalable design as defined in this paper; in [36], the reduction step for multiplication and squaring can therefore be much more highly optimized. If the design in [36] supported multiple finite fields simultaneously, the hardware resource utilization would increase dramatically.

Efficient software implementations such as the ones in [7] may provide an alternative to the proposed ECP. However, as previously mentioned, the goal of the hardware ECP is to offload the operation from the software so that the main processor is freed up to perform other tasks. In addition, the use of a hardware accelerator is expected to reduce the power consumption of the system compared to a software-only solution. Furthermore, due to the compact nature of the proposed ECP, it can be instantiated multiple times in a single FPGA to achieve higher throughput. For example, the Virtex-5 XC5LX110T FPGA has 17,280 slices and 148 Block RAMs [39], which means up to 7 1-MULT ECPs or 6 2-MULT ECPs may be instantiated on the FPGA.
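The instance counts above follow from the slice budget alone; the Block RAM budget would also need to be checked, which this sketch leaves out.

```python
def max_instances(total_slices, slices_per_core):
    """Upper bound on the number of ECP cores by slice budget alone."""
    return total_slices // slices_per_core

cores_1mult = max_instances(17280, 2290)   # 1-MULT random ECP slice count
cores_2mult = max_instances(17280, 2708)   # 2-MULT random ECP slice count
```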

Overall, the proposed designs outperform the scalable ECPs in the current literature and provide much higher flexibility than the highly optimized designs that target a single key length.

5. Conclusion

This paper proposes the parallelization of scalable ECPs that can support all 5 pseudo-random curves or all 5 Koblitz curves recommended by NIST [4] without reconfiguring the hardware. The proposed designs are implemented with the Virtex-5 FPGA as the target platform for comparison with designs in the current literature. Compared to other scalable ECPs, both the 1-MULT and 2-MULT ECPs show superior performance when comparing their efficiency to other designs computing the same type of curves. Compared with designs that are highly optimized and non-scalable, even though the proposed ECPs have longer latencies, their hardware utilization is much lower.


The implementation results show the effect of the parallelization of operations in the ECPM for pseudo-random and Koblitz curves. For pseudo-random curves, the parallelization of the LD algorithm using 2 MULT blocks proves to be extremely advantageous, as the efficiency increases by between 43% and 57%. For Koblitz curves, the efficiency increases by between 1.4% and 16%. Further parallelization using 3 MULT blocks or multiple SA blocks is not beneficial.

The proposed scalable ECPs are highly efficient designs that are very suitable for both server-side and client-side applications using security protocols such as TLS/SSL, where the ECC parameters are negotiated at the start of each session. For server-side applications, the scalability of the proposed design is beneficial, as the same logic implementation is able to support multiple key sizes, eliminating the need to reconfigure the FPGA when the key size changes. For client-side applications, the proposed ECPs have lower hardware resource utilizations, while maintaining high-speed operation by parallelizing the MULT block. In contrast, high-speed ECPs in the literature that only implement a single key size require many more hardware resources.

Future work of this research looks to transfer the success of the scalable architecture designed for binary finite fields to the prime fields recommended by NIST. Furthermore, an efficient implementation of a scalable Koblitz τNAF converter that supports all 5 NIST-recommended key sizes may also be investigated.

Acknowledgment

This work is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Alexander Graham Bell Canada Graduate Scholarship.

References

[1] V. Miller, Use of elliptic curves in cryptography, in: CRYPTO '85: Proceedings of Advances in Cryptology, Springer-Verlag, 1986, pp. 417–426.

[2] N. Koblitz, Elliptic curve cryptosystems, Mathematics of Computation 48 (177) (1987) 203–209.

[3] R. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM 21 (2) (1978) 120–126.

[4] National Institute of Standards and Technology, Recommended Elliptic Curves for Federal Government Use (July 1999).

[5] Standards for Efficient Cryptography, SEC 2: Recommended Elliptic Curve Domain Parameters (July 2000).

[6] Federal Information Processing Standard, FIPS PUB 186-3: Digital Signature Standard (DSS) (June 2009).

[7] D. Aranha, A. Faz-Hernández, J. López, F. Rodríguez-Henríquez, Faster implementation of scalar multiplication on Koblitz curves, in: A. Hevia, G. Neven (Eds.), Progress in Cryptology - LATINCRYPT 2012, Vol. 7533 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2012, pp. 177–193.

[8] M. Morales-Sandoval, C. Feregrino-Uribe, R. Cumplido, I. Algredo-Badillo, A reconfigurable GF(2m) elliptic curve cryptographic coprocessor, in: 2011 VII Southern Conference on Programmable Logic (SPL), 2011, pp. 209–214.

[9] S. Blake-Wilson, N. Bolyard, V. Gupta, C. Hawk, B. Moeller, Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS), RFC 4492 (Informational), updated by RFC 5246 (May 2006). URL http://www.ietf.org/rfc/rfc4492.txt

[10] O. Ahmadi, D. Hankerson, F. Rodríguez-Henríquez, Parallel formulations of scalar multiplication on Koblitz curves, Journal of Universal Computer Science 14 (3) (2008) 481–504.

[11] R. Azarderakhsh, A. Reyhani-Masoleh, High-performance implementation of point multiplication on Koblitz curves 60 (1) (2013) 41–45.

[12] J. Lutz, A. Hasan, High performance FPGA based elliptic curve cryptographic co-processor, in: Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on, Vol. 2, 2004, pp. 486–492.

[13] P. Realpe-Muñoz, V. Trujillo-Olaya, J. Velasco-Medina, Design of elliptic curve cryptoprocessors over GF(2163) on Koblitz curves, in: Circuits and Systems (LASCAS), 2014 IEEE 5th Latin American Symposium on, 2014, pp. 1–4.

[14] C. Rebeiro, D. Mukhopadhyay, High speed compact elliptic curve cryptoprocessor for FPGA platforms, in: D. Chowdhury, V. Rijmen, A. Das (Eds.), Progress in Cryptology - INDOCRYPT 2008, Vol. 5365 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2008, pp. 376–388.

[15] M. Hassan, M. Benaissa, Low area - scalable hardware/software co-design for elliptic curve cryptography, in: 3rd International Conference on New Technologies, Mobility and Security (NTMS), 2009, pp. 1–5.

[16] M. Hassan, M. Benaissa, A scalable hardware/software co-design for elliptic curve cryptography on PicoBlaze microcontroller, in: Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), 2010, pp. 2111–2114.

[17] M. Hassan, M. Benaissa, Flexible hardware/software co-design for scalable elliptic curve cryptography for low-resource applications, in: 21st IEEE International Conference on Application-specific Systems Architectures and Processors (ASAP), 2010, pp. 285–288.

[18] K. C. C. Loi, S.-B. Ko, Scalable elliptic curve cryptosystem FPGA processor for NIST prime curves, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 23 (11) (2015) 2753–2756.

[19] K. C. C. Loi, S.-B. Ko, High performance scalable elliptic curve cryptosystem processor for Koblitz curves, Microprocessors and Microsystems 37 (4-5) (2013) 394–406.

[20] K. C. C. Loi, S.-B. Ko, High performance scalable elliptic curve cryptosystem processor in GF(2m), in: IEEE International Symposium on Circuits and Systems (ISCAS) 2013, 2013, pp. 2585–2588.

[21] T. Itoh, S. Tsujii, A fast algorithm for computing multiplicative inverses in GF(2m) using normal bases, Information and Computation 78 (3) (1988) 171–177.

[22] P. G. Comba, Exponentiation cryptosystems on the IBM PC, IBM Systems Journal 29 (4) (1990) 526–538.

[23] J. Lopez, R. Dahab, Fast multiplication on elliptic curves over GF(2m) without precomputation, in: CHES '99: Proceedings of the First International Workshop on Cryptographic Hardware and Embedded Systems, Springer-Verlag, 1999, pp. 316–327.

[24] N. Koblitz, CM-curves with good cryptographic properties, in: CRYPTO '91: Proceedings of Advances in Cryptology, Vol. 576 of Lecture Notes in Computer Science, Springer, 1991, pp. 279–287.

[25] J. Lopez, R. Dahab, Improved algorithms for elliptic curve arithmetic in GF(2n), in: S. Tavares, H. Meijer (Eds.), Selected Areas in Cryptography, Vol. 1556 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 1999, pp. 201–212.

[26] E. Al-Daoud, R. Mahmod, M. Rushdan, A. Kilicman, A new addition formula for elliptic curves over GF(2n), Computers, IEEE Transactions on 51 (8) (2002) 972–975.

[27] J. Solinas, Efficient arithmetic on Koblitz curves, Designs, Codes and Cryptography 19 (2000) 195–249.

[28] K. Järvinen, J. Skyttä, High-speed elliptic curve cryptography accelerator for Koblitz curves, in: Proceedings of the 16th IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2008, IEEE Computer Society, 2008, pp. 109–118.

[29] K. Järvinen, Optimized FPGA-based elliptic curve cryptography processor for high-speed applications, INTEGRATION, the VLSI Journal 44 (2011) 270–279.

[30] V. Dimitrov, K. Järvinen, M. Jacobson, W. Chan, Z. Huang, Provably sublinear point multiplication on Koblitz curves and its hardware implementation 57 (11) (2008) 1469–1481.

[31] B. B. Brumley, K. U. Järvinen, Conversion algorithms and implementations for Koblitz curve cryptography 59 (1) (2010) 81–92.

[32] K. C. C. Loi, S. An, S.-B. Ko, FPGA implementation of low latency scalable elliptic curve cryptosystem processor in GF(2m), in: Circuits and Systems (ISCAS), 2014 IEEE International Symposium on, 2014, pp. 822–825.

[33] J. Adikari, V. Dimitrov, K. Järvinen, A fast hardware architecture for integer to τNAF conversion for Koblitz curves, Computers, IEEE Transactions on 61 (5) (2012) 732–737.

[34] A. Karatsuba, Y. Ofman, Multiplication of multi-digit numbers on automata, Soviet Physics Doklady 7 (1963) 595–596.

[35] K. Järvinen, On repeated squarings in binary fields, in: M. J. Jacobson, V. Rijmen, R. Safavi-Naini (Eds.), Selected Areas in Cryptography, Vol. 5867 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2009, pp. 331–349.

[36] K. Järvinen, J. Skyttä, Fast point multiplication on Koblitz curves: Parallelization method and implementations, Microprocessors and Microsystems 33 (2009) 106–116.

[37] G. D. Sutter, J.-P. Deschamps, J. L. Imaña, Efficient elliptic curve point multiplication using digit-serial binary field operations 60 (1) (2013) 217–225.

[38] S. Roy, C. Rebeiro, D. Mukhopadhyay, A parallel architecture for Koblitz curve scalar multiplications on FPGA platforms, in: Digital System Design (DSD), 2012 15th Euromicro Conference on, 2012, pp. 553–559.

[39] Xilinx, Virtex-5 Family Overview (February 2009). URL http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf

K. C. Cinnati Loi received his dual B.Sc. in Electrical Engineering and in Computer Science in 2008 from the University of Saskatchewan, Canada. He received his M.Sc. from the University of Saskatchewan in 2010. He is currently a Ph.D. candidate at the University of Saskatchewan. His research interests are the hardware implementation of cryptosystems, high-performance FPGA applications and hardware/software co-design.

Seok-Bum Ko received his Ph.D. in Electrical and Computer Engineering from the University of Rhode Island, USA, in 2002. He is currently a professor in Electrical and Computer Engineering at the University of Saskatchewan, Canada. His research interests include computer arithmetic, computer architecture, computer networks and biomedical engineering. Dr. Ko is a senior member of the IEEE.


Table 1: Summary of Point Operations

Curve           Point Operation         Number of FF operations
Pseudo-Random   Madd                    4M + 1S + 2A
                Mdouble                 2M + 5S + 1A
                Coordinate Conversion   1I + 10M + 1S + 5A
Koblitz         PADD                    8M + 5S + 8A
                PFRB                    3S
                Coordinate Conversion   1I + 2M + 1S

I = FFINV; M = FFMULT; S = FFSQ; A = FFADD


Table 2: Instructions executed by the 1-MULT Koblitz ECP

[The per-state instruction schedule (columns MULT PC, MULT, SA PC, SA over the PDQA, PQUAD, BX, ISQ, IMULT, IRED, FMULT and FINAL states) could not be recovered from the extracted text.]

Table 3: Instructions executed by the 1-MULT Random ECP

State  | PC | MULT               | SA PC | SA
INIT   | 0  |                    | 0     | Z2 = x^2
       |    |                    | 1     | R = R^2
LOOP   | 0  | (X1|X2) × (Z2|Z1)  | 0     | (X2|X1) = (M|R) + (T2|b)
       |    |                    | 1     | T3 = (Z1|Z2)^4
       | 1  | (X2|X1) × (Z1|Z2)  | 0     | T2 = M + 0
       | 2  | (X1|X2) × (Z1|Z2)  | 0     | T1 = M + 0
       | 3  | T2 × T1            | 0     | R = M + 0
       |    |                    | 1     | (Z1|Z2) = R^2
       |    |                    | 2     | T1 = T1 + T2
       | 4  | b × T3             | 0     | T2 = M + 0
       |    |                    | 1     | (Z2|Z1) = T1^2
       |    |                    | 2     | T3 = (X1|X2)^4
       | 5  | x × (Z2|Z1)        | 0     | (X1|X2) = M + T3
MUL1   | 0  | x × Z2             | 0     | (X2|X1) = M + T2
MUL1R  | 0  |                    | 0     | T2 = M + 0
MUL2   | 0  | R × Z1             | 0     |
MUL2R  | 0  |                    | 0     | T1 = M + 0
ISQ    | r  |                    | 0     | R = R^2
IMULT  | 0  | R × (T1|T3)        | 0     |
IRED   | 0  |                    | 0     | T3 = M + 0
CONV   | 0  | x × Z1             | 0     | T3 = R^2
       |    |                    | 1     | T1 = X2 + T2
       | 1  | T2 × T3            | 0     | T2 = M + X1
       | 2  | T2 × T3            | 0     | Z1 = M + 0
       | 3  | x × T1             | 0     | T1 = M + 0
       | 4  | T1 × Z2            | 0     | T3 = M + 0
       | 5  | T1 × T3            | 0     | T1 = M + 0
       |    |                    | 1     | R = x^2
       |    |                    | 2     | T3 = R + y
       | 6  | X1 × Z1            | 0     | T2 = M + T3
       | 7  | T2 × T1            | 0     | T1 (x3) = M + 0
FINAL  | 0  |                    | 0     | R (y3) = M + y

Table 4: Instructions executed by the 2-MULT Random ECP

[The per-state instruction schedule (columns MULT PC, MULT 1, MULT 2, SA PC, SA 1, SA 2 over the INIT, LOOP, MUL1, MUL1R, MUL2, MUL2R, ISQ, IMULT, IRED, CONV and FINAL states) could not be recovered from the extracted text.]

Table 5: Instructions executed by the 2-MULT Koblitz ECP

[The per-state instruction schedule (columns MULT PC, MULT 1, MULT 2, SA PC, SA 1, SA 2 over the PDQA, PQUAD, BX, ISQ, IMULT, IRED, FMULT and FINAL states) could not be recovered from the extracted text.]

Table 6: Clock Cycles of ECPM

1-MULT Random ECP
  m  | tMULT | tSA | tINIT |  tLOOP | tINV | tP2AC |  tECPM
 163 |    30 |   7 |    15 |  29160 |  548 |   869 |  30051
 233 |    47 |   9 |    19 |  65424 |  871 |  1368 |  66820
 283 |    57 |  10 |    21 |  96444 | 1117 |  1717 |  98192
 409 |   107 |  14 |    29 | 261936 | 1881 |  2993 | 264972
 571 |   192 |  19 |    39 | 656640 | 3546 |  5523 | 662221

2-MULT Random ECP
  m  | tMULT | tSA | tINIT |  tLOOP | tINV | tP2AC |  tECPM
 163 |    30 |   7 |    24 |  17025 |  565 |   768 |  17824
 233 |    47 |   9 |    30 |  37148 |  891 |  1202 |  38389
 283 |    57 |  10 |    33 |  54180 | 1139 |  1513 |  55736
 409 |   107 |  14 |    45 | 142878 | 1907 |  2593 | 145530
 571 |   192 |  19 |    60 | 350703 | 3579 |  4790 | 355572

1-MULT Koblitz ECP
  m  | tMULT | tSA |  tPDQA | tPQUAD | tINV | tP2AC |  tECPM
 163 |    30 |   7 |  13338 |   1271 |  548 |   622 |  15238
 233 |    47 |   9 |  30030 |   2262 |  871 |   983 |  33284
 283 |    57 |  10 |  43804 |   3053 | 1117 |  1251 |  48118
 409 |   107 |  14 | 118320 |   6018 | 1881 |  2123 | 126475
 571 |   192 |  19 | 295450 |  11297 | 3546 |  3968 | 310734

2-MULT Koblitz ECP
  m  | tMULT | tSA |  tPDQA | tPQUAD | tINV | tP2AC |  tECPM
 163 |    30 |   7 |  10152 |   1394 |  557 |   610 |  12163
 233 |    47 |   9 |  22074 |   2436 |  881 |   957 |  25476
 283 |    57 |  10 |  31772 |   3266 | 1128 |  1217 |  36265
 409 |   107 |  14 |  82688 |   6324 | 1892 |  2043 |  91069
 571 |   192 |  19 | 201020 |  11726 | 3559 |  3810 | 216575
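Across all four sub-tables of Table 6 the totals are internally consistent: tECPM = tINIT + tLOOP + tP2AC + tSA for the Random ECPs, and tECPM = tPDQA + tPQUAD + tP2AC + tSA for the Koblitz ECPs. A short script re-checks every row (the data are transcribed from the table; the identity is an observation about the tabulated values, not a formula quoted from the text):

```python
# Consistency check on Table 6: for each design and field size m,
# tECPM == phase1 + phase2 + tP2AC + tSA, where (phase1, phase2) is
# (tINIT, tLOOP) for the Random ECPs and (tPDQA, tPQUAD) for the
# Koblitz ECPs. Row tuples: (tMULT, tSA, phase1, phase2, tINV, tP2AC, tECPM).

TABLES = {
    "1-MULT Random": {
        163: (30, 7, 15, 29160, 548, 869, 30051),
        233: (47, 9, 19, 65424, 871, 1368, 66820),
        283: (57, 10, 21, 96444, 1117, 1717, 98192),
        409: (107, 14, 29, 261936, 1881, 2993, 264972),
        571: (192, 19, 39, 656640, 3546, 5523, 662221),
    },
    "2-MULT Random": {
        163: (30, 7, 24, 17025, 565, 768, 17824),
        233: (47, 9, 30, 37148, 891, 1202, 38389),
        283: (57, 10, 33, 54180, 1139, 1513, 55736),
        409: (107, 14, 45, 142878, 1907, 2593, 145530),
        571: (192, 19, 60, 350703, 3579, 4790, 355572),
    },
    "1-MULT Koblitz": {
        163: (30, 7, 13338, 1271, 548, 622, 15238),
        233: (47, 9, 30030, 2262, 871, 983, 33284),
        283: (57, 10, 43804, 3053, 1117, 1251, 48118),
        409: (107, 14, 118320, 6018, 1881, 2123, 126475),
        571: (192, 19, 295450, 11297, 3546, 3968, 310734),
    },
    "2-MULT Koblitz": {
        163: (30, 7, 10152, 1394, 557, 610, 12163),
        233: (47, 9, 22074, 2436, 881, 957, 25476),
        283: (57, 10, 31772, 3266, 1128, 1217, 36265),
        409: (107, 14, 82688, 6324, 1892, 2043, 91069),
        571: (192, 19, 201020, 11726, 3559, 3810, 216575),
    },
}

for name, rows in TABLES.items():
    for m, (t_mult, t_sa, p1, p2, t_inv, t_p2ac, t_ecpm) in rows.items():
        # every row's total matches the sum of its phases
        assert p1 + p2 + t_p2ac + t_sa == t_ecpm, (name, m)
```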

Table 7: Results Comparison (latency in ms; efficiency in 1/(s·slice))

NIST Curve: Pseudo-Random

2010 [16], Spartan-3 XC3S200, 650 registers, 2025 LUTs, 1127 slices, 4 BRAM, 68 MHz, scalable:
    m = 163: 38 / 0.023;  233: 73.4 / 0.012;  283: 104 / 0.009;  409: 251 / 0.004;  571: 287.4 / 0.003

2013 [37], Virtex-5 XC5VLX110, registers and BRAM n/a, not scalable (one implementation per m):
    m = 163: 22936 LUTs, 6150 slices, 250 MHz, 0.0055 / 29.56
    m = 233: 22340 LUTs, 6487 slices, 192 MHz, 0.020 / 7.746
    m = 283: 25030 LUTs, 7069 slices, 189 MHz, 0.034 / 4.21
    m = 409: 28503 LUTs, 10236 slices, 161 MHz, 0.103 / 0.952
    m = 571: 32432 LUTs, 11640 slices, 127 MHz, 0.348 / 0.247

2013* [20], Virtex-5 XC5LX110T, 1225 registers, 3191 LUTs, 1150 slices, 5 BRAM, 181.19 MHz, scalable:
    m = 163: 0.380 / 2.290;  233: 0.860 / 1.011;  283: 1.105 / 0.787;  409: 3.037 / 0.286;  571: 7.243 / 0.120

2014 [32], Virtex-5 XC5LX110T, 12983 registers, 24974 LUTs, 7978 slices, 0 BRAM, 154.35 MHz, scalable:
    m = 163: 0.059 / 2.119;  233: 0.084 / 1.489;  283: 0.102 / 1.228;  409: 0.147 / 0.852;  571: 0.205 / 0.611

Proposed 1-MULT Random, Virtex-5 XC5LX110T, 1650 registers, 7128 LUTs, 2290 slices, 5 BRAM, 224.84 MHz, scalable:
    m = 163: 0.135 / 3.246;  233: 0.299 / 1.460;  283: 0.440 / 0.993;  409: 1.186 / 0.368;  571: 2.965 / 0.147

Proposed 2-MULT Random, Virtex-5 XC5LX110T, 3118 registers, 8784 LUTs, 2708 slices, 5 BRAM, 223.26 MHz, scalable:
    m = 163: 0.080 / 4.626;  233: 0.172 / 2.148;  283: 0.250 / 1.479;  409: 0.652 / 0.567;  571: 1.593 / 0.232

NIST Curve: Koblitz

2010 [17], Spartan-3 XC3S200, 913 registers, 2028 LUTs, 1278 slices, 4 BRAM, 90 MHz, scalable:
    m = 163: 15.5 / 0.050;  283: 45.1 / 0.017;  571: 121.4 / 0.0065

2012 [38], Virtex-4, registers, LUTs and BRAM n/a, 12430 slices, 45.5 MHz, not scalable:
    m = 163: 0.012 / 6.649

2013* [19], Virtex-5 XC5LX110T, 1401 registers, 3003 LUTs, 1246 slices, 8 BRAM, 206.27 MHz, scalable:
    m = 163: 0.206 / 3.903;  233: 0.455 / 1.764;  283: 0.554 / 1.449;  409: 1.451 / 0.553;  571: 3.266 / 0.246

2014 [32], Virtex-5 XC5LX110T, 13076 registers, 26111 LUTs, 7427 slices, 0 BRAM, 162.07 MHz, scalable:
    m = 163: 0.029 / 4.599;  233: 0.042 / 3.213;  283: 0.050 / 2.667;  409: 0.073 / 1.855;  571: 0.101 / 1.331

Proposed 1-MULT Koblitz, Virtex-5 XC5LX110T, 1704 registers, 7073 LUTs, 2199 slices, 5 BRAM, 223.46 MHz, scalable:
    m = 163: 0.068 / 6.669;  233: 0.149 / 3.053;  283: 0.215 / 2.112;  409: 0.566 / 0.803;  571: 1.391 / 0.327

Proposed 2-MULT Koblitz, Virtex-5 XC5LX110T, 3134 registers, 8609 LUTs, 2708 slices, 5 BRAM, 222.67 MHz, scalable:
    m = 163: 0.055 / 6.760;  233: 0.114 / 3.228;  283: 0.163 / 2.267;  409: 0.409 / 0.903;  571: 0.973 / 0.380

* Results re-implemented for Virtex-5
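The efficiency metric in Table 7 is 1/(s·slice), i.e. the reciprocal of the ECPM latency in seconds times the occupied slices. Recomputing it for the four proposed designs (data transcribed from the table) reproduces the tabulated efficiencies to within rounding of the published figures:

```python
# Recompute Table 7's efficiency metric, 1/(latency_in_seconds * slices),
# for the proposed designs and compare against the tabulated column.
# Data transcribed from the table; agreement is checked to within 2%,
# which absorbs the rounding of the published latency values.

PROPOSED = {
    # name: (slices, {m: (latency_ms, tabulated_efficiency)})
    "1-MULT Random": (2290, {163: (0.135, 3.246), 233: (0.299, 1.460),
                             283: (0.440, 0.993), 409: (1.186, 0.368),
                             571: (2.965, 0.147)}),
    "2-MULT Random": (2708, {163: (0.080, 4.626), 233: (0.172, 2.148),
                             283: (0.250, 1.479), 409: (0.652, 0.567),
                             571: (1.593, 0.232)}),
    "1-MULT Koblitz": (2199, {163: (0.068, 6.669), 233: (0.149, 3.053),
                              283: (0.215, 2.112), 409: (0.566, 0.803),
                              571: (1.391, 0.327)}),
    "2-MULT Koblitz": (2708, {163: (0.055, 6.760), 233: (0.114, 3.228),
                              283: (0.163, 2.267), 409: (0.409, 0.903),
                              571: (0.973, 0.380)}),
}

for name, (slices, rows) in PROPOSED.items():
    for m, (lat_ms, eff) in rows.items():
        recomputed = 1.0 / (lat_ms * 1e-3 * slices)
        assert abs(recomputed - eff) / eff < 0.02, (name, m, recomputed)
```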

Figure 1: Block diagram of the multiplier (MULT) block.

Figure 2: Block diagram of the square-add (SA) block.

Figure 3: Block diagram of the 1-MULT Koblitz ECP.

Figure 4: FSM of the 1-MULT Koblitz ECP.

Figure 5: Block diagram of the SA block for 2 MULT blocks.

Figure 6: Markov chain analysis for PQUAD state.

Algorithm 1 Lopez-Dahab algorithm
Input: k = (kt-1, ..., k1, k0) with kt-1 = 1, P(x, y), b - curve-specific coefficient
Output: Q(x0, y0) = kP
// Initialization - Affine to Projective Conversion and processing kt-1 = 1
(X1, Z1) <- (x, 1), (X2, Z2) <- (x^4 + b, x^2)
// Main Loop
for i from t - 2 down to 0 do
    if ki = 1 then
        (X1, Z1) <- Madd(X1, X2, Z1, Z2, x)
        (X2, Z2) <- Mdouble(X2, Z2, b)
    else
        (X2, Z2) <- Madd(X1, X2, Z1, Z2, x)
        (X1, Z1) <- Mdouble(X1, Z1, b)
    end if
end for
// Mxy - Projective to Affine Conversion
x0 <- X1/Z1
y0 <- (1/x)(x + X1/Z1)[(x + X1/Z1)(x + X2/Z2) + x^2 + y] + y
return Q(x0, y0)
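As a concrete illustration of the ladder's control flow, the sketch below runs Algorithm 1 over a toy field GF(2^4) with f(z) = z^4 + z + 1 on the curve y^2 + xy = x^3 + 1. The toy field and curve are chosen only so the example stays small and checkable; they are not the NIST curves targeted by the processors:

```python
# Executable sketch of Algorithm 1 (Lopez-Dahab Montgomery ladder) over
# the toy field GF(2^4), curve y^2 + xy = x^3 + b with b = 1. Toy
# parameters for illustration only, not the NIST curves.

M, F = 4, 0b10011            # GF(2^4), f(z) = z^4 + z + 1

def fadd(a, b): return a ^ b

def fmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:     # interleaved reduction modulo f(z)
            a ^= F
    return r

def finv(a):                 # Fermat inversion: a^(2^M - 2)
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = fmul(r, a)
        a, e = fmul(a, a), e >> 1
    return r

def madd(X1, X2, Z1, Z2, x):
    # Madd: x-only mixed addition; the two inputs differ by P(x, .)
    U = fadd(fmul(X1, Z2), fmul(X2, Z1))
    Z3 = fmul(U, U)
    X3 = fadd(fmul(x, Z3), fmul(fmul(X1, Z2), fmul(X2, Z1)))
    return X3, Z3

def mdouble(X, Z, b):
    # Mdouble: X3 = X^4 + b*Z^4, Z3 = X^2 * Z^2
    XX, ZZ = fmul(X, X), fmul(Z, Z)
    return fadd(fmul(XX, XX), fmul(b, fmul(ZZ, ZZ))), fmul(XX, ZZ)

def ladder(k, P, b):
    x, y = P
    X1, Z1 = x, 1                                             # P
    X2, Z2 = fadd(fmul(fmul(x, x), fmul(x, x)), b), fmul(x, x)  # 2P
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            X1, Z1 = madd(X1, X2, Z1, Z2, x)
            X2, Z2 = mdouble(X2, Z2, b)
        else:
            X2, Z2 = madd(X1, X2, Z1, Z2, x)
            X1, Z1 = mdouble(X1, Z1, b)
    # Mxy: projective-to-affine conversion of Algorithm 1
    if Z1 == 0:
        return None                       # kP is the point at infinity
    if Z2 == 0:
        return (x, fadd(x, y))            # (k+1)P = O, so kP = -P
    x0 = fmul(X1, finv(Z1))
    t = fadd(fmul(fadd(x, x0), fadd(x, fmul(X2, finv(Z2)))),
             fadd(fmul(x, x), y))
    y0 = fadd(fmul(fmul(fadd(x, x0), t), finv(x)), y)
    return (x0, y0)
```

For example, with P = (1, 0) on this curve, `ladder(2, (1, 0), 1)` returns (0, 1), which matches the affine doubling of P.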

Algorithm 2 Modified τNAF ECPM on Koblitz Curves
Input: k - a binary integer, P(x, y) - a point on Ea
Output: Q = kP
Compute τNAF(k) = sum_{i=0}^{l-1} ui τ^i
// Perform the first point addition of Q <- ∞ ± P
if ul-1 = 1 then
    Q(X3, Y3, Z3) <- P(x, y)
else
    Q(X3, Y3, Z3) <- P(x, x + y)
end if
i <- l - 2
while i >= 0 do // Main loop
    if ui = 0 then // Section added for this algorithm
        // Perform proposed PDFRB (Q <- τ^2 Q)
        Q(X3, Y3, Z3) <- Q(X3^4, Y3^4, Z3^4)
        if ui-1 = 0 then
            i <- i - 2
        else
            i <- i - 1
            // Perform PADD
            if ui = 1 then
                Q(X3, Y3, Z3) <- Q(X3, Y3, Z3) + P(x, y)
            else // ui = -1
                Q(X3, Y3, Z3) <- Q(X3, Y3, Z3) + P(x, x + y)
            end if
            i <- i - 1
        end if
    else // This is performed traditionally
        // Perform PFRB (Q <- τQ)
        Q(X3, Y3, Z3) <- Q(X3^2, Y3^2, Z3^2)
        // Perform PADD
        if ui = 1 then
            Q(X3, Y3, Z3) <- Q(X3, Y3, Z3) + P(x, y)
        else // ui = -1
            Q(X3, Y3, Z3) <- Q(X3, Y3, Z3) + P(x, x + y)
        end if
        i <- i - 1
    end if
end while
return Q(x3, y3) <- Q(X3/Z3, Y3/Z3^2)