Accepted manuscript. To appear in: Microprocessors and Microsystems. DOI: 10.1016/j.micpro.2016.02.013
Received 14 May 2015; Revised 14 December 2015; Accepted 23 February 2016
Parallelization of Scalable Elliptic Curve Cryptosystem Processors in GF(2^m)

K. C. Cinnati Loi, Seok-Bum Ko*

Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, Canada
Abstract

The parallelization of scalable elliptic curve cryptography (ECC) processors (ECPs) is investigated in this paper. The proposed scalable ECPs support all 5 pseudo-random curves or all 5 Koblitz curves recommended by the National Institute of Standards and Technology (NIST) without the need to reconfigure the hardware. The proposed ECPs parallelize the finite field arithmetic unit and the elliptic curve point multiplication (ECPM) algorithm to gain performance improvements. The finite field multiplication is separated such that the reduction step is executed in parallel with the next polynomial multiplication. Subsequently, the finite field arithmetic of the ECPs is further parallelized and the performance is further improved by over 50%. Since the multiplier blocks consume a low number of hardware resources, the latency reduction outweighs the cost of the extra multiplier, resulting in more efficient ECP designs. The technique is applied to both pseudo-random curve and Koblitz curve algorithms. A novel ECPM algorithm is also proposed for Koblitz curves that takes advantage of the proposed finite field arithmetic architecture. The implementation results show that the proposed parallelized scalable ECPs have better performance compared to state-of-the-art scalable ECPs that support the same set of elliptic curves.

* Corresponding author at: Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK S7N 5A9, Canada.
Email addresses: [email protected] (K. C. Cinnati Loi), [email protected] (Seok-Bum Ko)
Keywords: FPGA, Elliptic Curve Cryptography (ECC), elliptic curve point multiplication (ECPM), binary finite field arithmetic.
1. Introduction
In the 1980s, Miller [1] and Koblitz [2] independently proposed the use of Elliptic Curve Cryptography (ECC). ECC has gained much popularity recently over other public-key cryptography algorithms, such as Rivest-Shamir-Adleman (RSA) [3], since the same level of security can be provided using shorter key sizes. As a result, implementations of ECC consume fewer resources and can achieve higher throughput. Due to its many advantages, ECC has been adopted by many standards, such as NIST [4], SEC [5], and FIPS 186-3 [6].

The main operation in ECC protocols is the elliptic curve point multiplication (ECPM). Software implementations of ECPM are available (e.g. [7]) and can be optimized to run very fast. However, due to its complexity, many implementations offload the ECPM operation to hardware co-processors to free up the processor for other operations and, as a result, speed up the performance of the overall system. Offloading the ECPM operation to a hardware platform also provides a power advantage to the system. This paper presents the architecture of these hardware co-processors, or elliptic curve processors (ECPs), and proposes a parallel architecture for scalable ECPs for increased performance.

Scalability refers to a hardware architecture that supports the evaluation of the ECPM for multiple elliptic curves and key sizes recommended by NIST [4]. The scalable ECC processors proposed in this paper support the ECPM calculation for all 5 pseudo-random or Koblitz curves recommended by NIST without the need to reconfigure the hardware. The advantage of a scalable design is in its ability to modify the key size on-the-fly, which is useful for security protocols such as Transport Layer Security/Secure Socket Layer (TLS/SSL), where the ECC parameters are negotiated at run-time [8, 9].

There are many ECPs proposed in the literature that are not scalable designs. In [10], the authors propose an algorithm for Koblitz curves that makes use of
concurrent τ and τ⁻¹ Frobenius operations to parallelize the point multiplication. In [11], the authors propose an ECC processor that uses 4 parallel finite field multipliers to speed up point multiplication. In [12], the paper presents an ECC co-processor for binary fields that runs at 66 MHz and can perform the ECPM in 0.233 ms on generic curves and in 0.075 ms on Koblitz curves. In [13], an ECP based on Gaussian Normal Bases (GNB) is presented that can execute the ECPM for the 163-bit binary field in 5.05 µs. The ECP presented in [14] proposes the use of a hybrid Karatsuba multiplier to reduce the resource utilization on the FPGA. These ECPs may be fast and require low resource utilization. However, in order for these designs to support multiple key sizes, multiple instances of the ECP must be instantiated in the design, which would result in either an increase in routing delay, which lowers the maximum frequency, or an insufficient amount of hardware resources, which demands the use of multiple FPGAs.

In 2009, Hassan and Benaissa [15] proposed a scalable ECP that supports binary field SEC curves up to 193 bits. The design uses the hardware/software co-design (HSC) approach, making use of the PicoBlaze soft-core microcontroller in Xilinx FPGAs. Their design goal is to reduce area consumption for area-constrained platforms, such as RFID, mobile handsets, smart cards, and wireless sensor networks [15]. In addition, Hassan and Benaissa have also proposed scalable designs that support curves up to 571 bits recommended by NIST [16, 17], also for area-constrained environments.

Hardware implementations of prime field scalable ECPs have also been explored by the authors of this paper [18], but the implementation of prime field ECPs is out of the scope of this paper.

In 2013, the authors of this paper published scalable ECP designs that support all 5 Koblitz curves [19] or all 5 pseudo-random curves [20] without the need to reconfigure the hardware. In these publications, the authors proposed a novel finite field arithmetic unit (FFAU) design that performs finite field arithmetic for all 5 binary fields recommended by NIST [4] in the same hardware efficiently. As a result, the designs outperform other scalable designs in the literature. However, there are some drawbacks in the architecture of the designs in [19] and [20].
The FFAU can perform finite field multiplication or finite field squaring along with finite field addition, but subsequent multiplication and squaring operations must be performed sequentially. Furthermore, the reduction step for multiplication and squaring is performed for every instruction before the next instruction can be executed, and it consumes a significant number of clock cycles. Thus, this paper proposes architectures that further improve the performance of the scalable ECPs proposed in [19] and [20] by exploring the parallelization of the finite field arithmetic and the ECPM algorithm. The proposed designs separate the multiplication and squaring operations to allow for simultaneous computation of the two operations. The arithmetic blocks also separate the reduction step from the finite field multiplication to further improve the performance of the ECP. As in the authors' previous works, the proposed ECPs only support the NIST-recommended binary pseudo-random curves and Koblitz curves.

The main contribution of this paper is the proposed architecture of a scalable ECP that parallelizes the ECPM operations. The effect of parallelization is analyzed for both pseudo-random and Koblitz curves. Firstly, the polynomial multiplication step is separated from the reduction step of finite field multiplication, such that these operations can be performed in parallel. Subsequently, the multiplier block that performs the polynomial multiplication is also replicated for further parallelization. Since the hardware resource utilization of the multiplier block is relatively low compared to that of the square-add (SA) block, the latency reduction of using 2 multiplier blocks outweighs the increase in hardware utilization, increasing the performance. For Koblitz curves, a novel τNAF ECPM algorithm is also proposed, which is made possible by the finite field arithmetic block's efficiency in performing repeated finite field squaring. The efficiency of repeated finite field squaring also improves the performance of the finite field inversion operation using the Itoh-Tsujii algorithm [21].

The rest of this paper is organized as follows: Section 2 reviews finite field operations and elliptic curve cryptography for both pseudo-random and Koblitz curves; Section 3 discusses the hardware architecture and implementation of the proposed scalable ECPs; Section 4 presents a latency analysis and the FPGA implementation results and comparison with other designs in the literature; and Section 5 concludes the paper.
2. Elliptic Curve Cryptography

This section is organized into two subsections. Firstly, the finite field operations used by the designs in this paper are described. Secondly, the algorithms used for elliptic curve point multiplication (ECPM) for both pseudo-random and Koblitz curves are briefly described.
2.1. Scalable Finite Field Operations
In this paper, binary finite field (FF) operations represented in polynomial basis are used for implementing elliptic curve operations. These FF operations include FF addition (FFADD), FF squaring (FFSQ), FF multiplication (FFMULT) and FF inversion (FFINV). Among these operations, FFADD is the most trivial and can be implemented using a bit-wise exclusive-OR (XOR) operation. FFINV is the most complex operation, but using the Itoh-Tsujii algorithm [21], FFINV is simplified to a series of FFMULT and FFSQ. FFSQ can be implemented by using the following property:

A(t)^2 = (a_{m−1} t^{2m−2} + · · · + a_1 t^2 + a_0) mod P(t)    (1)

which is simply interleaving 0 bits and operand bits.

Thus, the most complicated finite field operation that needs to be implemented by the ECP hardware is FFMULT, which has the highest impact on the performance of the ECP in terms of speed and area. In addition, in order to implement scalable ECPs, the algorithm used for the FF operations must result in architectures that support multiple key sizes with the same hardware.

In this paper, FFMULT is implemented using the Comba algorithm [22] with the digit width, w, chosen to be 32. The Comba algorithm is a digit-wise multiplication algorithm that processes the operands digit-by-digit, which facilitates scalability.
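As a point of reference for the digit-serial structure, the following is a minimal software sketch (a Python model, not the RTL) of column-wise Comba polynomial multiplication over GF(2) with w = 32; all names in it are illustrative and not taken from the paper.

```python
# Minimal software model (not the hardware) of digit-serial, column-wise
# (Comba) polynomial multiplication over GF(2) with 32-bit digits.
W = 32  # digit width w

def clmul32(a, b):
    """Carry-less (GF(2)) multiplication of two W-bit digits."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def comba_polymul(a, b, m):
    """Product scanning: one output digit per outer iteration, accumulated by
    XOR in a 'UV register'-style accumulator; the result is left unreduced."""
    s = -(-m // W)                              # s = ceil(m / W) digits
    A = [(a >> (W * i)) & (2**W - 1) for i in range(s)]
    B = [(b >> (W * i)) & (2**W - 1) for i in range(s)]
    uv, c = 0, 0
    for k in range(2 * s - 1):
        for i in range(max(0, k - s + 1), min(k, s - 1) + 1):
            uv ^= clmul32(A[i], B[k - i])
        c |= (uv & (2**W - 1)) << (W * k)       # emit least-significant digit
        uv >>= W                                # shift right by one digit
    return c | (uv << (W * (2 * s - 1)))        # residual high bits

# Small check: (t^2 + 1)(t + 1) = t^3 + t^2 + t + 1
assert comba_polymul(0b101, 0b11, 163) == 0b1111
```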
Both FFMULT and FFSQ require a modulo P(t) operation called the reduction operation. P(t) is an irreducible polynomial chosen for each specific curve and it is shown in [4]. In this paper, the reduction operation is not performed when evaluating FFMULT. Rather, it is only performed when computing either FFSQ or FF addition/reduction (FFADDRED). By doing so, the complexity of the system is greatly reduced because FFMULT can be simplified to only performing polynomial multiplication.
In this paper, the reduction operation is performed using a reduction matrix for each finite field, namely 163, 233, 283, 409 and 571, such that the reduction operation is defined as follows:

D(t) = R × C(t)    (2)

where C(t) is a binary column vector of the coefficients of the polynomial to be reduced, (c_{2m−2}, . . . , c_1, c_0), R is the m × (2m − 1) reduction matrix and D(t) is the reduced column vector, (d_{m−1}, . . . , d_1, d_0). The multiplication and addition operations in the matrix multiplication are performed in GF(2).
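To make the reduction step concrete, the sketch below (our own illustration, not the combinational reduction blocks themselves) builds the R matrix of (2) for a given irreducible polynomial and applies it as a GF(2) matrix-vector product; interleave_square mirrors the zero-interleaving view of FFSQ in (1). The B-163/K-163 polynomial is used only as an example, and all function names are ours.

```python
# Illustrative reduction via the m x (2m-1) matrix R of (2), plus polynomial
# squaring by zero interleaving as in (1).
P163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # t^163 + t^7 + t^6 + t^3 + 1
M163 = 163

def build_reduction_matrix(p, m):
    """Column i of R is t^i reduced modulo P(t), for 0 <= i < 2m-1."""
    cols = []
    for i in range(2 * m - 1):
        v = 1 << i
        for j in range(i, m - 1, -1):        # clear bit positions >= m
            if (v >> j) & 1:
                v ^= p << (j - m)
        cols.append(v)
    return cols

def reduce_with_matrix(c, cols):
    """D(t) = R x C(t) over GF(2): XOR the columns selected by the bits of C."""
    d = 0
    for i, col in enumerate(cols):
        if (c >> i) & 1:
            d ^= col
    return d

def interleave_square(a, m):
    """Polynomial squaring in GF(2)[t]: insert a 0 bit between operand bits."""
    return sum(((a >> i) & 1) << (2 * i) for i in range(m))

R163 = build_reduction_matrix(P163, M163)
# (t^3 + t + 1)^2 = t^6 + t^2 + 1, already of degree < 163, so reduction is a no-op
assert reduce_with_matrix(interleave_square(0b1011, M163), R163) == 0b1000101
```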
2.2. ECPM for Pseudo-Random Curves

Pseudo-random curves recommended by NIST [4] over GF(2^m) have the following form:

E : y^2 + xy = x^3 + x^2 + b    (3)

where b is a constant specific to each curve. The main operation in ECC is the elliptic curve point multiplication (ECPM). Given a point, P, defined on the curve E and an integer, k, ECPM is defined as follows:

Q = kP = P + P + · · · + P  (k times)    (4)
where Q is the resultant point, which is also on the curve E. In this paper, the algorithm chosen for computing ECPM on pseudo-random curves is the Lopez-Dahab (LD) algorithm [23], which is shown in Algorithm 1.
In Algorithm 1, Madd is defined as:

(X, Z) ← Madd(X1, X2, Z1, Z2, x):
    X ← X1 X2 Z1 Z2 + x (X1 Z2 + X2 Z1)^2
    Z ← (X1 Z2 + X2 Z1)^2    (5)

and Mdouble is defined as:

(X, Z) ← Mdouble(X1, Z1, b):
    X ← X1^4 + b Z1^4
    Z ← (X1 Z1)^2    (6)
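For concreteness, the sketch below (ours, not the processor's instruction sequence) expresses Madd and Mdouble of (5) and (6) on top of the illustrative helpers from the earlier listings (comba_polymul, interleave_square, reduce_with_matrix), together with the standard Montgomery-ladder loop of the Lopez-Dahab point multiplication [23] in which they are used; Algorithm 1 itself appears as a figure in the paper.

```python
# Madd/Mdouble of (5)-(6) and the Lopez-Dahab ladder, built on the helpers
# defined in the earlier sketches (illustration only).
def ffmul(a, b, m=M163, cols=R163):
    return reduce_with_matrix(comba_polymul(a, b, m), cols)

def ffsq(a, m=M163, cols=R163):
    return reduce_with_matrix(interleave_square(a, m), cols)

def madd(x1, x2, z1, z2, x):
    """(X, Z) <- Madd(X1, X2, Z1, Z2, x), equation (5)."""
    z = ffsq(ffmul(x1, z2) ^ ffmul(x2, z1))
    return ffmul(ffmul(x1, x2), ffmul(z1, z2)) ^ ffmul(x, z), z

def mdouble(x1, z1, b):
    """(X, Z) <- Mdouble(X1, Z1, b), equation (6)."""
    return ffsq(ffsq(x1)) ^ ffmul(b, ffsq(ffsq(z1))), ffsq(ffmul(x1, z1))

def ld_ladder(k, x, b):
    """Main loop of the Lopez-Dahab point multiplication for k >= 1 and base
    point with affine x-coordinate x; only X and Z coordinates are tracked."""
    X1, Z1 = x, 1
    X2, Z2 = ffsq(ffsq(x)) ^ b, ffsq(x)       # (x^4 + b, x^2)
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            X1, Z1 = madd(X1, X2, Z1, Z2, x)
            X2, Z2 = mdouble(X2, Z2, b)
        else:
            X2, Z2 = madd(X1, X2, Z1, Z2, x)
            X1, Z1 = mdouble(X1, Z1, b)
    return (X1, Z1), (X2, Z2)                 # converted to affine using (7)
```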
Furthermore, the projective to affine coordinate conversion shown in Algorithm 1 requires 3 FFINV operations, for x, Z1 and Z2. In this paper, the conversion algorithm has been modified to the following such that only 1 inversion is required:

x′ ← (x Z2 X1) / (x Z1 Z2)
y′ ← ( Z2 (x Z1 + X1) / (x Z1 Z2) ) ( x (x Z1 + X1)(x Z2 + X2) / (x Z1 Z2) + x^2 + y ) + y    (7)
The Lopez-Dahab (LD) algorithm uses a standard projective coordinate system that uses 3 coordinates to represent a point, (X, Y, Z), where x = X/Z and y = Y/Z. One of the main advantages of using the LD algorithm is that only the X and Z coordinates need to be computed in the loop, and the conversion back to affine coordinates can be obtained by simply using the X and Z coordinates of the resultant point and the x and y affine coordinates of the original point. A summary of the number of FF operations for each point operation is provided in TABLE 1.
2.3. ECPM for Koblitz Curves

Koblitz curves [24] recommended by NIST [4] have the following form:

E_a : y^2 + xy = x^3 + a x^2 + 1    (8)

where a = 0 or 1. Similar to pseudo-random curves, the fundamental operation in Koblitz curve ECC is also the ECPM. In this paper, Lopez-Dahab (LD) coordinates [25] are used to delay the need for an FFINV until the end of the algorithm. The mixed LD and affine coordinate point addition (PADD) [26] is used to reduce the number of operations to 9 FFMULT, 5 FFSQ and 9 FFADD.
The expression for PADD of a point in LD coordinates, (X1, Y1, Z1), with a point in affine coordinates, (x, y), resulting in a point in LD coordinates, (X3, Y3, Z3), is given as follows:

Z3 = (Z1 (x Z1 + X1))^2
X3 = (y Z1^2 + Y1)^2 + (x Z1 + X1)^2 (Z1 (x Z1 + X1) + a Z1^2) + (y Z1^2 + Y1)(Z1 (x Z1 + X1))
Y3 = ((y Z1^2 + Y1)(Z1 (x Z1 + X1)) + Z3)(X3 + x Z3) + (x + y) Z3^2    (9)
Since a in (8) is 0 or 1, and the addition of (x + y) can be precomputed, the total number of operations can be reduced to 8 FFMULT, 5 FFSQ and 8 FFADD.

When evaluating ECPM for Koblitz curves, the scalar, k, is converted into τ-non-adjacent form (τNAF) to simplify the point doubling (PDBL) operations [27]. The τNAF-converted algorithm performs the Frobenius endomorphism (PFRB) instead of PDBL, which reduces the number of operations from 4 FFMULT, 5 FFSQ and 4 FFADD to 3 FFSQ. A summary of the number of FF operations is provided in TABLE 1.

In some systems [28, 29, 30], the τNAF conversion is included in the ECP
implementation. However, as noted in [31], in some systems the τNAF-converted digits of k can be generated randomly and converted back to their binary equivalent. In these systems, a separate τNAF-to-binary converter may be used in parallel with the ECP. In addition, similar works in the literature [19, 32] also do not include the τNAF conversion in the ECP. In order to more easily compare the proposed ECPs with the ECPs in the literature, the τNAF conversion is out of the scope of this paper. Interested readers can refer to [27], [31] and [33]. Nevertheless, should τNAF conversion be required when using the ECP proposed in this paper, the converters presented in [28, 29, 30] may be used in parallel with the proposed ECP.
In this paper, a novel τNAF ECPM algorithm for Koblitz curves is presented that improves on the one used in [19]. The novel ECPM algorithm is shown in Algorithm 2. The τNAF ECPM algorithm used in [19] only performs the else section of the main loop shown in Algorithm 2, where the PFRB is evaluated at every iteration and PADD is executed if the current digit is non-zero. In the proposed algorithm, if the current digit is zero, further optimization is performed by using the proposed double Frobenius endomorphism (PDFRB) (Q ← τ²Q), followed by a PADD if the next digit is non-zero. As will be shown in the next section, the architecture of the finite field arithmetic block allows for efficient repeated squaring, where each subsequent squaring only requires 1 additional clock cycle. Thus, the proposed PDFRB step is more efficient than performing PFRB twice, and Algorithm 2 reduces the number of iterations of the main loop, resulting in a lower latency.
The operation of the algorithm is as follows: if the currently indexed τNAF digit, ui, is 0, then a PDFRB operation is executed and the index, i, is reduced by 2 instead of 1. If the next indexed τNAF digit, ui−1, is also 0, then no other operation is required. Otherwise, PADD executes a point addition or subtraction depending on the sign of ui−1. If ui is non-zero, then a PFRB is performed, followed by a PADD, and the index is decremented by 1. The index i is decremented by 1 twice after PDFRB in Algorithm 2 so that PADD performs the same operation in both the ui = 0 and ui = ±1 cases.
3. Design and Architecture of the Scalable ECPs

In this section, the hardware architectures of the proposed scalable ECPs are presented. This section is divided into 3 subsections. In the first subsection, the architecture of the finite field arithmetic blocks is presented. In the second subsection, the architecture of the scalable ECPs using a single multiplier is presented. In the third subsection, the architecture of the ECPs with multiple multipliers is explored. For the remainder of this paper, random ECP refers to a scalable ECP for pseudo-random curves and Koblitz ECP refers to a scalable ECP for Koblitz curves recommended by NIST.

3.1. Finite Field Arithmetic Blocks

In [19] and [20], the finite field arithmetic is performed using a finite field arithmetic unit (FFAU) that can either perform FFMULT or FFSQ. In this paper, the proposed ECP uses 2 finite field arithmetic blocks that work closely with each other: the multiplier block (MULT) and the square-add block (SA).

3.1.1. Multiplier (MULT) Block
The MULT block performs the Comba algorithm and is shown in Fig. 1. Its inputs are 32-bit buses and the values are stored in dual-port RAMs. Since the digit size is 32, the RAMs are s = ⌈571/32⌉ = 18 words deep. The MULT block uses 2 'multiplier units' ('x' in Fig. 1) in parallel. Each 'x' is a purely combinational 32-bit Karatsuba-Ofman multiplier [34]. The digits are read out to the 'x' blocks according to the indexes in the inner and outer loops of the Comba algorithm.

The outputs of the 'multiplier units' are accumulated in the 63-bit 'UV register'. The addition operation is performed using XOR operations. Once the inner loop is completed, the least-significant 32 bits of the 'UV register' are sent to 'FIFO C' or 'SIPO C' for storage and the register is right-shifted by 32 bits to prepare for the next inner loop calculation.

Both 'FIFO C' and 'SIPO C' are storage units for the resultant product. 'FIFO C' is a first-in-first-out unit that is used for the least-significant ⌈m/32⌉ digits of the product. 'SIPO C' is a digit-serial-in-parallel-out shift register that stores the remaining most-significant digits. The separation of the product's storage is due to the architecture of the SA unit, which is discussed below. Thus, the MULT block has 2 outputs. 'C' is the output of 'FIFO C' that outputs the least-significant 32-bit digits, one digit at a time, whereas 'C msd' is the parallel output of 'SIPO C', which requires a maximum of (2 × 571 − 1) − (⌈571/32⌉ × 32) = 565 bits.

Using the proposed architecture, the MULT block completes its operation in (s/2)^2 × 2 + s/2 + s + 3 clock cycles, where s = ⌈m/32⌉ and s + 3 clock cycles are used for loading the input digits and the pipelining stages.
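For reference, the expression above can be tabulated per field with a few lines of code (our own tabulation; s/2 is taken as ⌈s/2⌉ for odd digit counts, which is an assumption).

```python
# MULT block cycle count, (s/2)^2 x 2 + s/2 + s + 3 with s = ceil(m/32)
from math import ceil
for m in (163, 233, 283, 409, 571):
    s = ceil(m / 32)
    print(m, 2 * ceil(s / 2) ** 2 + ceil(s / 2) + s + 3)
```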
3.1.2. Square-Add (SA) Block
The SA block performs both FF addition/reduction (FFADDRED) and repeated FFSQ and is shown in Fig. 2. The 'A' and 'B' inputs of the SA block are 32-bit digits. When performing FFADDRED, 'A' and 'B' are added by a 2-input 32-bit XOR block. During FFSQ, input 'B' is set to 0. 'SREG C' is a shift register with both digit-serial and parallel inputs and outputs. The output of the adder connects to the digit-serial input ('s in'), which shifts by 32 bits on every clock cycle. Once all the digits are collected, 'SREG C' outputs the complete value through the 576-bit parallel output port.

For FFADDRED, the value is concatenated with the input 'B full', which is connected to the 'C msd' output of the MULT block. By doing so, the SA block effectively adds 'A' and 'B', where 'B' can be the output of a polynomial multiplication to be reduced. The concatenated value is chosen by the multiplexers to input into 5 reduction blocks, 'R163', 'R233', 'R283', 'R409', and 'R571', which are combinational logic blocks derived from the R matrix in (2) for each of the 5 finite fields. The reduction blocks output to a multiplexer, which selects the appropriate value to be stored back in 'SREG C' through its parallel input port. Finally, the result is output through the digit-serial output port of 'SREG C'.

When operating for repeated FFSQ, the parallel output of 'SREG C' is input into the 'SQ' block, which interleaves 0s to perform polynomial squaring. The result is selected by the multiplexer to input into the reduction blocks. Similar to FFADDRED, the reduced value is input back into 'SREG C' via the parallel input port. At this point, if another FFSQ is required, 'SREG C' outputs the value through its parallel output port again and the process is repeated as many times as required. By doing so, apart from the first FFSQ, which requires s clock cycles to load the operand, every FFSQ can be completed in 1 additional clock cycle. This characteristic is especially useful for performing FF inversion using the Itoh-Tsujii algorithm, where FFSQ is repeated many times [35].
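The inversion itself is not spelled out in the text, so the following is a minimal sketch of the Itoh-Tsujii approach [21], built on the illustrative ffmul/ffsq helpers from the earlier listings; it uses g = ⌊log2(m − 1)⌋ + h(m − 1) − 1 multiplications, matching the count used in Section 4.1, with all other work done as repeated squarings.

```python
# Itoh-Tsujii inversion sketch: a^(-1) = a^(2^m - 2) in GF(2^m), computed as
# repeated squarings plus a small number of multiplications (illustration only).
def ff_repeated_sq(a, r, m=M163, cols=R163):
    for _ in range(r):
        a = ffsq(a, m, cols)
    return a

def itoh_tsujii_inverse(a, m=M163, cols=R163):
    """Build beta_k = a^(2^k - 1) along the binary expansion of m-1, then
    square once: (a^(2^(m-1) - 1))^2 = a^(2^m - 2) = a^(-1)."""
    bits = bin(m - 1)[2:]                    # most-significant bit first
    beta, k = a, 1
    for b in bits[1:]:
        beta = ffmul(ff_repeated_sq(beta, k, m, cols), beta, m, cols)  # k -> 2k
        k *= 2
        if b == '1':
            beta = ffmul(ffsq(beta, m, cols), a, m, cols)              # k -> k+1
            k += 1
    return ffsq(beta, m, cols)
```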
Based on the above description, the operations that the SA block supports are (A + B) mod P or (A^{2^r}) mod P, where A has size m, B in FFADDRED mode has size 2m − 1, r ≥ 1 is the number of times FFSQ is repeated, and P is the reduction polynomial. The SA block completes an FFADDRED operation in s + 1 clock cycles and a repeated FFSQ operation in s + 1 + (r − 1) clock cycles, where s + 1 clock cycles are used for loading and r − 1 clock cycles for repeated squaring.

The architecture of the MULT and SA blocks is an improvement over the finite field arithmetic unit (FFAU) used in [19] and [20] as follows. The reduction step of FFMULT is removed from the multiplier, which reduces the latency. Instead, the reduction steps for both FFMULT and FFSQ are performed in the SA block. As previously mentioned, one of the drawbacks of the FFAU in [19] and [20] is the long latency of the FFMULT and FFSQ operations, which include the reduction step in a single operation. In addition, the ability of the SA block to compute repeated FFSQ with 1 additional clock cycle allows for the use of Algorithm 2 for Koblitz curves to further reduce latency, whereas the FFAU in [19] and [20] does not have this ability.

3.2. Single-Multiplier Scalable ECPs

In this subsection, the ECP architectures of the Koblitz ECP and random ECP using the finite field arithmetic blocks presented in Section 3.1 are described. The Koblitz ECP is presented first, followed by the random ECP.
3.2.1. Koblitz ECPs The block diagram of the single multiplier (1-MULT) scalable Koblitz ECP 290
is shown in Fig. 3. The scalable ECP evaluates Algorithm 2 after the τ NAF(k)
CR IP T
computation. The inputs x1 and y1 are 32-bit buses that enter the affine co-
ordinates of a point digit-by-digit. The τ NAF converted value of k, with a magnitude and sign are input by 32-bit buses into the controller. The outputs
of the ECP are 32-bit x3 and y3 buses for the affine coordinates of the resultant 295
point. Since values are transferred digit-by-digit in and out of the ECP and it
takes s = dm/we clock cycles. For simplicity, the finite state machine (FSM)
AN US
and some control signals are not shown. The advantage of using 32-bit ports is
to allow for simpler interfacing with general purpose processors that commonly operate in 32- or 64-bit data paths.
The inputs to the MULT and SA blocks are controlled by the current state
300
of the FSM and 2 program counters, MULT PC and SA PC. The instructions executed by the processor in each state is shown in TABLE 2. The RAM stores
M
the input values x1 and y1 into x and y and their sum into xy. It also stores the temporary values X1 , Y1 , Z1 , T1 , T2 , T3 that are used in TABLE 2. The RAM stores all the values in 32-bit digits. As a result, the total size of the
ED
305
RAM is d571/32e × (32 × 9) = 18 × 288 bits. The outputs of the MULT block are connected to the ‘B full’ port of the SA block and the multiplexer for input
PT
‘B’. The advantage of the proposed architecture is that there is no need to store the product, which would require twice the number of words in the RAM. The disadvantage is that every multiplication must be followed by an addition
CE
310
performed on the SA block. However, since the addition can be performed in parallel with the next multiplication, in many occasions, the number of clock
AC
cycles used by the addition does not affect the latency. Finally, the result of x3 and y3 are obtained from T1 and T2 , respectively.
315
In TABLE 2, the MULT operations are performed without reduction and the
SA operations are reduced by P (t) as described in Section 3.1. There are a few special features that contribute to the improved performance of the proposed
13
ACCEPTED MANUSCRIPT
design. The PDQA state is a combination of PADD with PFRB or PDFRB, and the PQUAD state performs PDFRB, or PFRB if a single 0 digit is the least significant digit of τNAF(k). The operations are combined as such so that PDQA is the only state that needs to perform PADD. The PQUAD state only needs to implement FFSQ or double FFSQ (A^4). These states execute the main loop in Algorithm 2. However, some operations in these states are optimized to perform instructions for the previous or next iteration.

The most important feature of the presented sequence of instructions is the ability for several SA block operations to be executed simultaneously with a single MULT block operation, because of the number of clock cycles required by the MULT block operation. Due to this feature, the clock cycles required by the reduction step of FFMULT performed in the SA block are masked by the execution of the next MULT block operation.

The FSM of the scalable ECP is shown in Fig. 4. The FSM resets to the IDLE state. The ECPM operation is triggered by asserting the load signal, which moves the FSM to the LOAD state. The FSM only stays in the LOAD state for 1 clock cycle. In the LOAD state, the first point addition of Q ← ∞ ± P in Algorithm 2 is performed by loading the appropriate values into the RAM. If the magnitude of the 3rd most-significant digit of k is 1, the FSM moves to the PDQA state, otherwise it moves to the PQUAD state.

When the operations of the PDQA and PQUAD states are complete, the FSM goes to the PDQA state if the magnitude of either the current digit of k (cur k) or the next digit of k (next k) is 1, otherwise it goes to the PQUAD state. k count is used to keep track of the current index of k that is being processed. When k count is 0, the main loop in Algorithm 2 is completed except for the evaluation of Y1, which is performed in the BX state. After the BX state, the FSM enters the ISQ state, which initiates the Itoh-Tsujii algorithm [21].
Instead of performing both 1/Z3 and 1/Z3² for the coordinate conversion, only 1/Z3 is performed. Subsequently, 1/Z3² can be obtained as (1/Z3)². The ISQ state computes repeated FFSQ operations, followed by IMULT, which computes 1 multiplication, and IRED, which reduces the product and returns to the ISQ state. The number of times the ISQ and IMULT states cycle depends on the selected field. Once FFINV is completed, the FSM moves to the FMULT state, which computes 2 multiplications to complete the coordinate conversion.

After the FMULT state, the FSM moves to the FINAL state and then to the WAIT state, where x3 and y3 are output. The FSM is able to move immediately back to the LOAD state if the load signal is detected in the WAIT state, otherwise it returns to the IDLE state.
3.2.2. Pseudo-Random ECPs
Using an architecture similar to the Koblitz ECP, the 1-MULT random ECP is implemented for pseudo-random curves. As in the Koblitz ECP, the inputs and outputs of the ECP are 32-bit buses, x1, y1, x3 and y3. The binary representation of the scalar multiplier k is also input through a 32-bit bus. The core of the 1-MULT random ECP is also the MULT and SA blocks.

The order of instructions executed by the processor in each FSM state is presented in TABLE 3. These instructions are stored in the controller along with the PCs and the ROM that stores the b coefficients. The RAM stores the input values x1 and y1 in x and y and the temporary values X1, X2, Z1, Z2, T1, T2, T3 that are used in TABLE 3, so it also has dimensions 18 × 288 bits, as in the Koblitz ECP.

The FSM of the 1-MULT random ECP is similar to that of the 1-MULT Koblitz ECP shown in Fig. 4, except that the main loop (LOOP state) executes the Lopez-Dahab algorithm and a couple of multiplications are required prior to the inversion states. In TABLE 3, the operations given in the LOOP state are obtained from rearranging the Madd and Mdouble operations in Algorithm 1. The MUL1, MUL1R, MUL2 and MUL2R states compute the value of xZ1Z2 to set up for the FFINV operation in ISQ, IMULT and IRED. Finally, the CONV state converts the projective coordinates to affine.
3.3. Multiple-Multiplier Scalable ECP

Since the MULT block only needs to perform the polynomial multiplication, its hardware resource utilization is much lower than that of the SA block. In this subsection, the use of multiple MULT blocks is explored to improve the performance of the scalable ECP. The use of 2 MULT blocks is examined for random ECPs first. Subsequently, the Koblitz ECP using 2 MULT blocks is also presented.

For the 2-MULT ECPs, the architecture of the MULT block does not need to be modified. However, since 2 MULT blocks are used, the architecture of the SA block is modified to interact with both MULT blocks. The block diagram of the new SA block is shown in Fig. 5. The main difference between this SA block and the one shown in Fig. 2, in terms of hardware resources, is the addition of an extra 32-bit 2-input XOR gate and an extra 'SREG C' shift register, which are shaded in Fig. 5.

In order to interface with the outputs of the 2 MULT blocks, the SA block is modified to take 2 sets of inputs. 'A1', 'A1 full', and 'B1' connect to one of the MULT blocks and 'A2', 'A2 full' and 'B2' connect to the other MULT block.
The operation of the SA block has also been modified slightly, to combine FFSQ, FFADD and FF reduction into 1 type of operation. Thus, the new SA block always performs the operations (A1 + B1)^{2^r} mod P(t) and (A2 + B2) mod P(t), where r is used for repeated FFSQ. Thus, if r = 0, only addition is performed. Note that the sum of 'A2' and 'B2' cannot be subsequently squared. During its operation, all inputs are loaded into the SA block simultaneously. Thus, both 'SREG C' registers are loaded simultaneously. Once all digits are input, the data from the top 'SREG C' goes through the multiplexers into the appropriate reduction block and is stored back in 'SREG C'. A control signal is used to indicate whether or not the second 'SREG C' is being used. If so, the second 'SREG C' is selected as the input to the reduction blocks and the result is stored back in the second 'SREG C'. Finally, the results are output through 'C1' and 'C2' as 32-bit digits. Thus, the latency of FFADD for only the first set of operands is s + 1 clock cycles, repeated FFSQ is s + 1 + r clock cycles and FFADD using both sets of operands is s + 2, where s = ⌈m/32⌉ is the number of 32-bit digits and m is the key size.

To take advantage of the 2 MULT blocks, the operations in TABLE 3 have been parallelized to produce the operations in TABLE 4. The major differences are shaded in TABLE 4, where MULT and SA block operations are parallelized. Comparing the operations in TABLE 3 and TABLE 4, the most significant difference occurs in the LOOP state, where the latency of 6 FFMULT (6M) is reduced to 3 FFMULT (3M) plus the 2 FFADD (2A) operations from the MUL1 state that now must be run in every iteration. This reduction is very significant, as the LOOP is the most time-consuming step of the ECPM and must execute m − 1 times. The CONV state operations are also reduced from 7M to 4M + 1A.

The same parallelization technique has been applied to the 1-MULT Koblitz
ECP. The PDQA state in the 1-MULT Koblitz ECP requires 8M + 1A operations. Due to the data dependency in PADD, only certain FFMULTs can be parallelized and the resultant algorithm requires 5M + 4A + 1S (FFSQ) operations. Thus, the latency reduction in the Koblitz ECP using 2 MULT blocks is not as significant as in the random ECP. The resultant series of operations is shown in TABLE 5.

The same technique can be used to further parallelize the multiplication instructions to use 3 or 4 MULT blocks. In the random ECP, the LOOP state reduces to 2M + 3A + 1S operations, and in the Koblitz ECP, the PDQA state reduces to 4M + 9A + 1S operations. For the random ECP, further parallelization using 4 MULT blocks does not further reduce the latency due to the data dependency of the Lopez-Dahab algorithm. In the Koblitz ECP, the method of interleaving multiplications used in [36] may be applied to the proposed ECP designs by using 4 MULT blocks and 2 SA blocks. However, the structure of the SA block would require some modifications, and due to the data dependency structure in the proposed ECP, not all addition and squaring operations can be completely masked by multiplication. Thus, the number of operations in the PDQA state reduces to approximately 2M + 5A + 2S. Using this method, the number of clock cycles of the ECPM reduces by approximately 30% - 40%, but the hardware resource utilization doubles compared to the 2-MULT Koblitz ECP. Since the latency reduction is not significant, the increase in hardware resource utilization from using 3 or 4 MULT blocks outweighs the benefits of the latency reduction. Thus, the use of 3 or 4 MULT blocks worsens the efficiency (as defined in Section 4.2) of the ECP, so their implementation results are not shown in this paper.

The parallelization of the SA block is not considered in this paper because the SA block occupies a majority of the hardware resources of the ECP. Since the latency of the FFMULT is the bottleneck of the operations and the SA operations are masked by the MULT block operations, parallelizing the SA block does not have a great impact on the latency of the system. Thus, it is not feasible to parallelize the SA operations, which would cause the hardware resource utilization to increase dramatically, with only a minor decrease in latency.
4. Implementation Results and Analysis

4.1. Latency Estimation

According to the designs of the scalable ECPs described above, TABLE 6 is constructed to present the latency in terms of the number of clock cycles required for each operation. tMULT is the latency of the MULT block and it does not change from one design to another because all 4 ECPs use the same MULT block. tSA has been previously discussed, but it must be noted that for the 1-MULT ECPs, the latency of FFADD is s + 1 clock cycles and repeated FFSQ is s + 1 + (r − 1), whereas for the 2-MULT ECPs, the latency is s + 1 + r, where s is the number of digits and r is the exponent of the repeated FFSQ. Thus, for the 2-MULT ECPs, the repeated FFSQ operation requires an extra clock cycle. Note that the tSA value shown in TABLE 6 and in the expressions below only represents the latency of an FFADD operation, tSA = s + 1.

For the 1-MULT random ECP, tINIT is given by 2 × tSA + 1. tLOOP is given by the number of iterations of the LD algorithm, which is m − 1. Each iteration requires 6 FFMULT operations, so tLOOP = (6 tMULT)(m − 1). The number of clock cycles for FFINV, tINV, is given by the number of times the ISQ, IMULT and IRED states are entered, which is field-dependent. t_m depends on the field and also on how many times the ISQ state has been entered. In total, tINV = g × tMULT + 2g × tSA + m − 2 − g, where g = ⌊log2(m − 1)⌋ + h(m − 1) − 1, and h(x) is the Hamming weight of the number x. tP2AC is the total number of clock cycles required by the projective to affine conversion, including the time of the inversion, and it is given by 2tMULT + 2tSA + tINV + 8tMULT + tSA, where the first 2 tMULT and tSA are from the states MUL1, MUL2, MUL1R, and MUL2R, the 8 tMULT are from the CONV state and the last tSA is in the FINAL state. Finally, the number of clock cycles of the complete ECPM is tECPM = 1 + tINIT + tLOOP + tP2AC + s, where 1 clock cycle is used in the LOAD state and s is the number of 32-bit digits, consumed by the WAIT state.

For the 2-MULT random ECP, tINIT is given by 3tSA + 3. Each iteration of the LOOP state requires 3 FFMULT, 1 FFADD with 2 sets of inputs and 1 FFADD with 1 set of inputs, except for the final iteration, which executes FFMULT to mask the SA operations. Thus, tLOOP = (3tMULT)(m − 1) + (2tSA + 1)(m − 2) + tMULT. Due to the set-up of FFINV, tINV is slightly different from the 1-MULT case and is given by tINV = g × tMULT + (2g + 1) × tSA + g + 1 + m − 2 − g. The latency of the coordinate conversion is given by tP2AC = 2tMULT + tSA + tINV + 4tMULT + tSA + 1 + tSA + 1. tECPM is given by the same expression as in the 1-MULT case.
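As a quick sanity check of these expressions, the sketch below tabulates tECPM for the 1-MULT and 2-MULT random ECPs over the five fields (our own tabulation of the stated formulas; ⌈s/2⌉ is assumed in tMULT for odd digit counts, and TABLE 6 remains the authoritative source).

```python
# Our tabulation of t_ECPM for the 1-MULT and 2-MULT random ECPs using the
# expressions above (ceil(s/2) assumed in t_MULT for odd digit counts).
from math import ceil, floor, log2

def hamming(x):
    return bin(x).count('1')

def latency_random_ecp(m, two_mult=False):
    s = ceil(m / 32)
    t_mult = 2 * ceil(s / 2) ** 2 + ceil(s / 2) + s + 3
    t_sa = s + 1
    g = floor(log2(m - 1)) + hamming(m - 1) - 1
    if not two_mult:
        t_init = 2 * t_sa + 1
        t_loop = 6 * t_mult * (m - 1)
        t_inv = g * t_mult + 2 * g * t_sa + m - 2 - g
        t_p2ac = 2 * t_mult + 2 * t_sa + t_inv + 8 * t_mult + t_sa
    else:
        t_init = 3 * t_sa + 3
        t_loop = 3 * t_mult * (m - 1) + (2 * t_sa + 1) * (m - 2) + t_mult
        t_inv = g * t_mult + (2 * g + 1) * t_sa + g + 1 + m - 2 - g
        t_p2ac = 2 * t_mult + t_sa + t_inv + 4 * t_mult + t_sa + 1 + t_sa + 1
    return 1 + t_init + t_loop + t_p2ac + s

for m in (163, 233, 283, 409, 571):
    print(m, latency_random_ecp(m), latency_random_ecp(m, two_mult=True))
```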
For the 1-MULT Koblitz ECP, the number of clock cycles in the PDQA state is given by 8 × tMULT + tSA, since most of the SA operations execute in parallel with the MULT operations. The number of clock cycles spent in the PQUAD state is given by 4 × tSA + 3, where the 3 clock cycles are a result of 1 clock cycle per double-square. In order to estimate the number of clock cycles for the ECPM, one must estimate the average number of times the PDQA and PQUAD states are entered. Since τNAF(k) has an average Hamming weight of m/3 [27], the PDQA state is entered on average m/3 times. A Markov chain is used to estimate the number of times that the PQUAD state is executed.

The 3-state Markov chain is shown in Fig. 6. The '00' state executes the PQUAD state. The '01' and '1' states both execute the PDQA state. According to [27], τNAF(k) cannot have 2 successive non-zero digits. Thus, state '1' can only be followed by states '00' or '01', each with a probability of 0.5. Similarly, state '01' can only be followed by '00' or '01', each with a probability of 0.5. Finally, the '00' state can be followed by any of the 3 states, each with a probability of 0.333. From the Markov chain, the following transition matrix can be written:
             (00)     (01)     (1)
      (00)   0.333    0.333    0.333
P =   (01)   0.5      0.5      0                    (10)
      (1)    0.5      0.5      0
where the first row represents the transition from '00', the second row represents the transition from '01' and the third row represents the transition from '1'. From the transition matrix, the steady state vector of the Markov chain can be obtained, and we find that the steady state probability of state '00' is 3/7, state '01' is 3/7, and state '1' is 1/7. Based on this analysis, the ratio of PDQA to PQUAD is 4:3, which means that if PDQA is entered m/3 times, the PQUAD state is entered m/4 times. Using these estimates, tPDQA = ⌈(m − 1)/3⌉ × (8tMULT + tSA) and tPQUAD = ⌈(m − 1)/4⌉ × (4tSA + 3).
entered m/4 times. Using these estimates, tPDQA = d(m−1)/3e×(8tMULT +tSA ) and tPQUAD = d(m − 1)/4e × (4tSA + 3). The number of clock cycles for FFINV, tINV , is the same as in the 1-MULT
PT
random ECP. tP2AC is the total number of clock cycles required by the projective to affine conversion and it is given by tSA + tINV + 2tMULT + tSA . Finally,
CE
tECPM = 1 + tPDQA + tPQUAD + tP2AC + s, where 1 clock cycle is used by the 510
LOAD state and s = dm/32e clock cycles are used by the WAIT state. For the 2-MULT Koblitz ECP, the same ratio of PDQA to PQUAD is used
AC
and the latencies are given by tPDQA = d(m − 1)/3e × (5tMULT + 5tSA + 3) and tPQUAD = d(m − 1)/4e × (4tSA + 6). The FFINV latency is once again slightly
different due to the change in the algorithm and is given by tINV = g×tMULT +g×
515
(2tSA +1)+m−2−g. Finally, tP2AC is given by tSA +tINV +tSA +1+tMULT +tSA +1 and tECPM is given by 1 + tPDQA + tPQUAD + tP2AC + s. Comparing the 1-MULT and 2-MULT tECPM for each of the ECPs, one can 20
ACCEPTED MANUSCRIPT
notice that for random ECPs, the 2-MULT implementation decreases the latency between 41% and 46%. However, the same impact is not observed in Koblitz 520
ECPs, where the decrease is only 20% to 30%. This observation is consistent
CR IP T
with the observation discussed earlier where the 2-MULT random ECP reduces the LOOP state from 6M to 3M + 2A operations, whereas 2-MULT Koblitz ECP only reduces the PDQA state from 8M + 1A operation to 5M + 4A + 1S operations. 525
4.2. FPGA Implementation Results
AN US
The proposed scalable ECPs have been implemented using the Xilinx ISE
11.5 software. The target FPGA selected is the Xilinx Virtex-5 XC5LX110T for comparison purposes with other designs of ECP in the current literature. The post-place-and-route hardware utilization and timing performance results 530
are shown in TABLE 7 along with other ECP designs in the current literature. To better compare the performance of the various designed shown in TABLE 7,
M
an efficiency metric is used to take into account both the hardware utilization and timing latency. The efficiency metric is defined as follows: Number of ECPMs per second Number of slices
ED Efficiency =
(11)
The design in [16] uses a hardware-software co-design (HSC) approach. The design uses the PicoBlaze soft-core microcontroller in the FPGA to implement
PT
535
a majority of the control signals and only the finite field operations are im-
CE
plemented in hardware. Due to the use of a different target FPGA, a fair comparison cannot be made with the proposed ECP. The design in [37] is not a scalable design. It is highly optimized for a
specific curve and does not need to handle multiple curves, so it is much more
AC
540
efficient than the design proposed in this paper. Simply using the efficiency metric to compare the design in [37] and the proposed designs is not fair. The most significant advantage of the proposed designs is in the scalability of the ECP, while maintaining a low resource utilization. For instance, the 163-bit
545
design in [37] uses 6150 slices, whereas the proposed 1-MULT random ECP only 21
ACCEPTED MANUSCRIPT
requires 2290 slices and supports all 5 pseudo-random curves recommended by NIST. For Koblitz ECPs, the design in [17] is similar to the design in [16], where
550
CR IP T
the HSC approach is used. However, the design in [17] only implements 3 of the 5 NIST recommended Koblitz curves and the latencies are much higher due to the software operations.
The design in [38] shows a non-scalable ECP design that is optimized for 163-bit key size. Even though, the latency is low, the number of slices required is extremely high, which is the same observation made for [37] for pseudo-random curves.
AN US
555
The proposed 1-MULT random and Koblitz ECPs most resemble the designs in [19] and [20], which are previous designs published by the authors of this paper for pseudo-random and Koblitz curves recommended by NIST [4]. Both designs [19, 20] are scalable and support all 5 key sizes. There are some major 560
improvements that make the proposed ECPs in this paper superior. Firstly, the
M
designs in [19, 20] use 1 finite field arithmetic unit (FFAU) that can only perform 1 operation at a time, whereas the proposed 1-MULT ECPs parallelize the FFAU
ED
into the MULT and SA blocks. This allows the current designs to perform FFMULT and FFSQ or FFADD simultaneously, reducing the number of clock 565
cycles. Secondly, the proposed 1-MULT ECPs do not perform reduction for
PT
FFMULT until the subsequent FFADDRED, which further reduces the number of clock cycles per operation. Furthermore, for Koblitz curves, a novel τ NAF
CE
ECPM algorithm is proposed that takes advantage of the efficient repeated FFSQ capability of the SA block to reduce the latency. Overall, the proposed
570
1-MULT ECPs reduce the number of clock cycles of the ECPM dramatically.
AC
The hardware utilization increases is due to the use of the reduction blocks for each of the 5 key lengths. Even though the hardware utilization of both proposed 1-MULT ECPs are higher than their counterpart in [19] and [20], the benefit of the latency reduction outweighs the area increase, as shown by the
575
increase in efficiency for both 1-MULT ECPs. In another recent publication by the authors of this paper [32], the MULT 22
ACCEPTED MANUSCRIPT
block of the 1-MULT ECPs have been replaced with Karatsuba-Ofman multipliers. As shown in TABLE 7, even though the latencies of the proposed ECPs are higher, the efficiency metric shows that for lower key sizes, the proposed 2-MULT ECPs outperform the designs in [32].
CR IP T
580
Comparing the proposed 1-MULT and 2-MULT random ECPs asserts some
of the observations stated in previous sections. By further parallelizing the FF arithmetic into using 2 MULT blocks, the number of slices only increases from 2290 to 2708. The increase in the number of registers and LUTs is due to the 585
use of the extra ‘SREG C’ shift register and the additional MULT block. Since
AN US
the critical path of the design is not affected, the change in the maximum clock
frequency is minimal. However, the decrease in latency is significant in the 2MULT random ECP as described in Section 4.1. Thus, the overall efficiency increases between 43% to 57%. Furthermore, if the 3-MULT random ECP 590
implementation results were obtained, it is expected that the efficiency would decrease because the number of slices would increase by approximately 500, but
not as significant.
M
the LOOP state would only decrease from 3 to 2 MULT operations, which is
595
ED
In addition, the latency improvement gained by using 2 MULT blocks in the random ECP does not translate well into the Koblitz ECP. The hardware resource utilization shows a similar increase from the 1-MULT Koblitz ECP to
PT
the 2-MULT Koblitz ECP in terms of the number of slices, registers and LUTs. Thus, the efficiency of the 2-MULT Koblitz ECP only increases by 1.4% to 16%
CE
compared to the 1-MULT Koblitz ECP. As previously mentioned, the use of 3 or 4 MULT block does not further
600
improve the efficiency because the increase in the number of slices required out-
AC
weighs the decrease in latency of the ECPM operation. The use of interleaving multiplications described in [36] could be applied to the proposed Koblitz ECPs, but the reduction of the MULT block operation must be followed by an SA block
605
operation to reduce its result. Thus, the latency only improves between 30% to 40% compared to the 2-MULT Koblitz ECP. Furthermore, the ECPs described in [36] only support 1 Koblitz curve at a time, which is not a scalable design as 23
ACCEPTED MANUSCRIPT
defined in this paper. Thus, in [36], the reduction step for multiplication and squaring can be much more optimized. If the design in [36] supported multi610
ple finite fields simultaneously, the hardware resource utilization would increase
CR IP T
dramatically. Efficient software implementations such as the ones in [7] may provide an alternative for the proposed ECP. However, as previously mentioned, the goal
of the hardware ECP is to offload the operation from the software such that the 615
main processor may be freed up to perform other tasks. In addition, the use of a
hardware accelerator is expected to reduce the power consumption of the system
AN US
compared to a software-only solution. Furthermore, due to the compact nature of the proposed ECP, it can be instantiated multiple times in a single FPGA to achieve higher throughput. For example, the Virtex-5 XC5LX110T FPGA has 620
17,280 slices [39] and 148 Block RAMs, which means up to 7 1-MULT ECPs or 6 2-MULT ECPs may be instantiated on the FPGA.
Overall, the proposed designs have a better performance compared to scal-
M
able ECPs in the current literature and provide much higher flexibility compared to the highly optimized designs targeting a single key length presented in the current literature.
ED
625
PT
5. Conclusion
This paper proposes the parallelization of scalable ECPs that can support all 5 pseudo-random curves or all 5 Koblitz curves recommended by NIST [4]
CE
without reconfiguring the hardware. The proposed designs are implemented for
630
Virtex-5 FPGA as the target platform for comparison with designs in current literature. Compared to other scalable ECPs, the both 1-MULT and 2-MULT
AC
ECPs show a superior performance when comparing their efficiency to other designs computing the same type of curves. Comparing with designs that are highly optimized and non-scalable, even though the proposed ECPs have longer
635
latencies, the hardware utilization of the proposed scalable ECPs are much lower.
24
ACCEPTED MANUSCRIPT
The implementation results show the effect of the parallelization of operations in ECPM for pseudo-random and Koblitz curves. In pseudo-random curves, the parallelization of the LD algorithm by using 2 MULT blocks shows to be extremely advantageous as the efficiency increases between 43% and 57%.
CR IP T
640
In Koblitz curves, the efficiency increases between 1.4% and 16%. Further parallelization by using 3 MULT blocks or multiple SA blocks is not beneficial.
The proposed scalable ECPs are highly efficient designs that are very suitable for both server-side and client-side applications using security protocols 645
such as TLS/SSL, where the ECC parameters are negotiated at the start of each
AN US
session. For server-side applications, the scalability of the proposed design is
beneficial as the same logic implementation is able to support multiple key sizes, eliminating the need to reconfigure the FPGA when the key size changes. For client-side applications, the proposed ECPs have lower hardware resource uti650
lizations, while maintaining a high-speed operation by parallelizing the MULT block. In contrast, high-speed ECPs in the literature that only implement a
M
single key size require many more hardware resources.
Future work of this research looks to transfer the success of the scalable archi-
655
ED
tecture designed for binary finite fields into prime fields recommended by NIST. Furthermore, an efficient implementation of a scalable Koblitz τ NAF converter
PT
that supports all 5 NIST-recommended key sizes may also be investigated.
Acknowledgment
CE
This work is supported by the National Science and Engineering Research
Council of Canada (NSERC) Alexander Graham Bell Canada Graduate Scholarship.
AC
660
References [1] V. Miller, Use of elliptic curves in cryptography, in: CRYPTO85: Proceedings of the Advances in Cryptology, Springer–Verlag, 1986, pp. 417–426.
25
ACCEPTED MANUSCRIPT
[2] N. Koblitz, Elliptic curve cryptosystems, Mathematics of Computation 48 (177) (1987) 203–209.
665
[3] R. Rivest, A. Shamir, L. Adleman, A method for obtaining Digital Signa-
(1978) 120–126.
CR IP T
tures and Public-Key Cryptosystems, Communications of the ACM 21 (2)
[4] National Institute of Standards and Technology, Recommended Elliptic Curves for Federal Government Use (July 1999).
670
[5] Standards for Efficient Cryptography, SEC 2: Recommended Elliptic Curve
AN US
Domain Parameters (July 2000).
[6] Federal Information Processing Standard, FIPS PUB 186-3: Digital Signature Standard (DSS) (June 2009). 675
[7] D. Aranha, A. Faz-Hernndez, J. Lpez, F. Rodrguez-Henrquez, Faster implementation of scalar multiplication on koblitz curves, in: A. Hevia, G. Neven
M
(Eds.), Progress in Cryptology LATINCRYPT 2012, Vol. 7533 of Lecture
ED
Notes in Computer Science, Springer Berlin Heidelberg, 2012, pp. 177–193. [8] M. Morales-Sandoval, C. Feregrino-Uribe, R. Complido, I. Algredo-Badillo, A reconfigurable GF (2m ) elliptic curve cryptographic coprocessor, in: 2011
680
PT
VII Southern Conference on Programmable Logic (SPL), 2011, pp. 209– 214.
CE
[9] S. Blake-Wilson, N. Bolyard, V. Gupta, C. Hawk, B. Moeller, Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS), RFC 4492 (Informational), updated by RFC 5246 (May 2006).
AC
685
URL http://www.ietf.org/rfc/rfc4492.txt
[10] O. Ahmadi, D. Hankerson, F. Rodrguez-Henrquez, Parallel formulations of scalar multiplication on koblitz curves, Journal of Universal Computer Science 14 (3) (2008) 481–504.
26
ACCEPTED MANUSCRIPT
690
[11] R. Azarderakhsh, A. Reyhani-Masoleh, High-Performance Implementation of Point Multiplication on Koblitz Curves 60 (1) (2013) 41–45. [12] J. Lutz, A. Hasan, High performance fpga based elliptic curve crypto-
CR IP T
graphic co-processor, in: Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on, Vol. 2, 2004, pp. 486–492 Vol.2.
695
[13] P. Realpe-Mue noz, V. Trujillo-Olaya, J. Velasco-Medina, Design of elliptic
curve cryptoprocessors over gf(2163) on koblitz curves, in: Circuits and
AN US
Systems (LASCAS), 2014 IEEE 5th Latin American Symposium on, 2014, pp. 1–4. 700
[14] C. Rebeiro, D. Mukhopadhyay, High speed compact elliptic curve cryptoprocessor for fpga platforms, in: D. Chowdhury, V. Rijmen, A. Das (Eds.), Progress in Cryptology - INDOCRYPT 2008, Vol. 5365 of Lecture Notes
M
in Computer Science, Springer Berlin Heidelberg, 2008, pp. 376–388. [15] M. Hassan, M. Benaissa, Low Area - Scalable Hardware/Software Co-design for Elliptic Curve Cryptography, in: 3rd International Conference on New
ED
705
Technologies, Mobility and Security (NTMS), 2009, pp. 1–5. [16] M. Hassan, M. Benaissa, A scalable hardware/software co-design for elliptic
PT
curve cryptography on PicoBlaze microcontroller, in: Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), 2010, pp. 2111–2114.
CE
710
[17] M. Hassan, M. Benaissa, Flexible Hardware/Software Co-design for Scal-
AC
able Elliptic Curve Cryptography for Low-Resource Applications, in: 21st
715
IEEE International Conference on Application-specific Systems Architectures and Processors (ASAP), 2010, pp. 285–288.
[18] K. C. C. Loi, S.-B. Ko, Scalable elliptic curve cryptosystem fpga processor for nist prime curves, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 23 (11) (2015) 2753–2756. 27
ACCEPTED MANUSCRIPT
[19] K. C. C. Loi, S.-B. Ko, High Performance Scalable Elliptic Curve Cryptosystem Processor for Koblitz Curves, Microprocessors and Microsystems 37 (45) (2013) 394 – 406.
720
CR IP T
[20] K. C. C. Loi, S.-B. Ko, High Performance Scalable Elliptic Curve Cryp-
tosystem Processor in GF (2m ), in: IEEE International Symposium on Circuits and Systems (ISCAS) 2013, 2013, pp. 2585–2588.
[21] T. Itoh, S. Tsujii, A Fast Algorithm for Computing Multiplicative Inverses in GF (2m ) Using Normal Bases, Information and Computation 78 (3)
725
AN US
(1988) 171–177.
[22] P. G. Comba, Exponentiation cryptosystems on the IBM PC, IBM Systems Journal 29 (4) (1990) 526–538.
[23] J. Lopez, R. Dahab, Fast multiplication on elliptic curves over GF (2m ) without precomputation, in: CHES99: Proceedings of the First Inter-
730
M
national Workshop on Cryptographic Hardware and Embedded Systems, Springer-Verlag, 1999, pp. 316–327. CM-Curves with Good Cryptographic Properties,
ED
[24] N. Koblitz,
in:
CRYPTO91: Proceedings of the Advances in Cryptology, Lecture Notes in Computer Science, Vol. 576, Springer, 1991, pp. 279–287.
PT
735
[25] J. Lopez, R. Dahab, Improved algorithms for elliptic curve arithmetic in gf(2n), in: S. Tavares, H. Meijer (Eds.), Selected Areas in Cryptography,
CE
Vol. 1556 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 1999, pp. 201–212.
[26] E. Al-Daoud, R. Mahmod, M. Rushdan, A. Kilicman, A new addition
AC
740
formula for elliptic curves over gf(2n), Computers, IEEE Transactions on 51 (8) (2002) 972–975.
[27] J. Solinas, Efficient Arithmetic on Koblitz Curves, Designs, Codes and Cryptography 19 (2000) 195–249.
28
ACCEPTED MANUSCRIPT
745
[28] K. J¨ arvinen, J. Skytt¨ a, High-speed elliptic curve cryptography accelerator for Koblitz curves, in: Proceedings of the 16th IEEE Symposium on Fieldprogrammable Custom Computing Machines,FCCM 2008,IEEE Computer
CR IP T
Society, 2008, pp. 109–118. [29] K. J¨ arvinen, Optimized FPGA-based elliptic curve cryptography processor
for high-speed applications, INTEGRATION, the VLSI journal 44 (2011)
750
270–279.
[30] V. Dimitrov, K. J¨ arvinen, M. Jacobson, W. Chan, Z.Huang, Provably sub-
AN US
linear point multiplication on Koblitz curves and its hardware implementation 57 (11) (2008) 1469–1481. 755
[31] B. B. Brumley, K. U. J¨ arvinen, Conversion Algorithms and Implementations for Koblitz Curve Cryptography 59 (1) (2010) 81–92.
[32] K. C. C. Loi, S. An, S.-B. Ko, FPGA implementation of low latency scal-
M
able Elliptic Curve Cryptosystem processor in GF (2m ), in: Circuits and Systems (ISCAS), 2014 IEEE International Symposium on, 2014, pp. 822– 825.
ED
760
[33] J. Adikari, V. Dimitrov, K. Jarvinen, A fast hardware architecture for integer to taunaf conversion for koblitz curves, Computers, IEEE Transactions
PT
on 61 (5) (2012) 732–737.
[34] A. Karatsuba, Y. Ofman, Multiplication of multi-digit numbers on automata, Soviet Physics Doklady 7 (1963) 595–596.
CE
765
[35] K. J¨ arvinen, On repeated squarings in binary fields, in: J. Jacobson,
AC
MichaelJ., V. Rijmen, R. Safavi-Naini (Eds.), Selected Areas in Cryptog-
770
raphy, Vol. 5867 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2009, pp. 331–349.
[36] K. J¨ arvinen, J. Skytt¨ a, Fast point multiplication on Koblitz curves: Parallelization method and implementations, Microprocessors and Microsystems 33 (2009) 106–116. 29
ACCEPTED MANUSCRIPT
[37] G. D. Sutter, J.-P. Deschamps, J. L. Ima˜ na, Efficient Elliptic Curve Point Multiplication Using Digit-Serial Binary Field Operations 60 (1) (2013) 217–225.
775
CR IP T
[38] S. Roy, C. Rebeiro, D. Mukhopadhyay, A parallel architecture for koblitz curve scalar multiplications on fpga platforms, in: Digital System Design (DSD), 2012 15th Euromicro Conference on, 2012, pp. 553–559. [39] Xilinx, Virtex-5 Family Overview (February 2009).
URL http://www.xilinx.com/support/documentation/data_sheets/
780
AN US
ds100.pdf
K. C. Cinnati Loi received his dual B.Sc. in Electrical Engineering and in Computer Science in 2008 from the University of Saskatchewan, Canada. He received his M.Sc. at the University of Saskatchewan in 2010. He is currently a Ph.D. candidate at the University of Saskatchewan. His research interests are hardware implementation of cryptosystems, high-performance FPGA applications and hardware/software co-design.

Seok-Bum Ko received his Ph.D. in Electrical and Computer Engineering at the University of Rhode Island, USA, in 2002. He is currently a professor in Electrical and Computer Engineering at the University of Saskatchewan, Canada. His research interests include computer arithmetic, computer architecture, computer networks and biomedical engineering. Dr. Ko is a senior member of the IEEE.
Table 1: Summary of Point Operations

Curve         | Point Operation       | Number of FF operations
Pseudo-Random | Madd                  | 4M + 1S + 2A
              | Mdouble               | 2M + 5S + 1A
              | Coordinate Conversion | 1I + 10M + 1S + 5A
Koblitz       | PADD                  | 8M + 5S + 8A
              | PFRB                  | 3S
              | Coordinate Conversion | 1I + 2M + 1S
I = FFINV; M = FFMULT; S = FFSQ; A = FFADD
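For illustration only (this small calculation is ours, not part of the manuscript), Table 1 translates directly into a per-key-bit cost for the pseudo-random-curve ladder of Algorithm 1, since every processed key bit triggers one Madd and one Mdouble:

madd = {"M": 4, "S": 1, "A": 2}           # field-operation counts taken from Table 1
mdouble = {"M": 2, "S": 5, "A": 1}
per_bit = {op: madd[op] + mdouble[op] for op in "MSA"}
print(per_bit)                            # {'M': 6, 'S': 6, 'A': 3} per processed key bit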
Table 2: Instructions executed by the 1-MULT Koblitz ECP
(Columns: MULT PC | MULT | SA PC | SA)
PDQA State
1 2
x × (Z1 |R)
0 1
T2 × (y|xy)
7 8
X1 = X1
T1 = X1 + M Y1 = Y1
T1 × Z1
0
X1 = Y1 + M
1
T1 = T12
0
Z1 = 0 + M
X1 × R
0
Y1 = Z1 + a · T2
T1 × Y1 x × Z1
(2|4)
6
(2|4)
2 0
1
T3 × (xy|y)
5
T2 = Z12
1
3 4
Y1 = T1 + (Y1 |M )
0
T2 × T3
Z1 = Z12
0
T2 = 0 + M
1
X1 = X12
0
Y1 = T2 + M
1
T3 = Z12
2
X1 = X1 + Y1
0
T3 = X1 + M
1
T2 = Z1 + T2
0
T1 = 0 + M
1
Z1 = Z1
(2|4)
PQUAD State 0
0
1 2 3
(4|2)
Y1 = R 1
(4|2)
X1 = X1 Z1 =
(4|2) Z1
BX State
0
0 ISQ State
0
0 IMULT State
0
Y1 = T1 + (Y1 |M )
(Z1 |T3 ) × R
Y1 = T1 + Y1 |R R = (Z1 |R)2
r
0
IRED State
0
0
T3 = 0 + M
FMULT State 0 1
X1 × R
Y1 × T3
0
T3 = R2
0
T1 (x3 ) = 0 + M
FINAL State
0
0
T2 (y3 ) = 0 + M
Table 3: Instructions executed by the 1-MULT Random ECP
(Columns: MULT PC | MULT | SA PC: SA)

INIT State
0 | - | 0: Z2 = x^2; 1: R = R^2
LOOP State
0 | (X1|X2) × (Z2|Z1) | 0: (X2|X1) = (M|R) + (T2|b); 1: T3 = (Z1|Z2)^4
1 | (X2|X1) × (Z1|Z2) | 0: T2 = M + 0
2 | (X1|X2) × (Z1|Z2) | 0: T1 = M + 0
3 | T2 × T1 | 0: R = M + 0; 1: (Z1|Z2) = R^2; 2: T1 = T1 + T2
4 | b × T3 | 0: T2 = M + 0; 1: (Z2|Z1) = T1^2; 2: T3 = (X1|X2)^4
5 | x × (Z2|Z1) | 0: (X1|X2) = M + T3
MUL1 State
0 | x × Z2 | 0: (X2|X1) = M + T2
MUL1R State
0 | - | 0: T2 = M + 0
MUL2 State
0 | R × Z1 | 0: -
MUL2R State
0 | - | 0: T1 = M + 0
ISQ State
r | - | 0: R = R^2
IMULT State
0 | R × (T1|T3) | 0: -
IRED State
0 | - | 0: T3 = M + 0
CONV State
0 | x × Z1 | 0: T3 = R^2; 1: T1 = X2 + T2
1 | T2 × T3 | 0: T2 = M + X1
2 | T2 × T3 | 0: Z1 = M + 0
3 | x × T1 | 0: T1 = M + 0
4 | T1 × Z2 | 0: T3 = M + 0
5 | T1 × T3 | 0: T1 = M + 0; 1: R = x^2; 2: T3 = R + y
6 | X1 × Z1 | 0: T2 = M + T3
7 | T2 × T1 | 0: T1(x3) = M + 0
FINAL State
0 | - | 0: R(y3) = M + y
Table 4: Instructions executed by the 2-MULT Random ECP
(Columns: MULT PC | MULT 1 | MULT 2 | SA PC | SA 1 | SA 2)
INIT State 0 1 2 LOOP State 0 1
M1 = (X1 |R1 ) × Z2
M2 = (X2 |R1 ) × Z1
M1 = (X1 |X2 ) × (Z1 |Z2 ) M2 = T3 × b
0 0 1 2
M1 = x × (Z2 |Z1 )
M2 = T2 × T1
0
M1 = x × Z2
M1 = x × Z 1 M1 = T2 × Z1
0 0
0
1
2
M1 = x × X2
M1 = T2 × T3
M1 = T2 × Z2
M1 = R1 × R2
R1 = M1 + 0
R1 = M1 + 0
1
(X2 |X1 ) = R1 + R2
0
T2 = M1 + 0
0
Z1 = M1 + X1
1
X2 = T2 + X2
0
T1 = M1 + 0
0
(R1 |T3 ) = (R1 + 0)2
(X1 |X2 ) = M2 + T3
R2 = M2 + 0
MUL2R State
r
0 IRED State 0
T3 = M1 + 0
CONV State M2 = R1 × Z1
M2 = T2 × T3
3 4
T3 = (0 + (X1 |X2 ))4
0
IMULT State
M1 = R1 × (T1 |T3 )
0
T2 = M2 + 0
(Z2 |Z1 ) = (R1 + R2 )2
ISQ State
0
T1 = M1 + 0
MUL2 State
0
T3 = (Z1 |Z2 + 0)4
(Z1 |Z2 ) = (R1 + 0)2
MUL1R State
0
X2 = R1 + b
1
MUL1 State
0
R1 = (R1 + 0)2
2
Z2 = (x + 0)2
0
M2 = T2 × X1
0 0
T3 = M1 + 0
1
T1 = (x + 0)2
T2 = M2 + 0
0
T2 = M1 + 0
1
T3 = T1 + y
0
R1 = M1 + 0
R2 = M2 + T3
T1 (y3 ) = M1 + y
T2 (x3 ) = M2 + 0
0
FINAL State
0
0
Table 5: Instructions executed by the 2-MULT Koblitz ECP
(Columns: MULT PC | MULT 1 | MULT 2 | SA PC | SA 1 | SA 2)
M1 = x × (Z1|R1)
0 1
1 2
M1 = R1 × Z1
M2 = T2 × (y|xy)
5
M1 = T1 × Y1
M2 = X1 × T3
M1 = x × Z1
M2 = T3 × (xy|y)
M1 = T2 × R1
0
0
0
M1 = R1 × (Z1 |T3 )
0
R1 = M1 + X1
0
T1 = R12
1
Y1 = Y1
0
T3 = M1 + 0
1
Y1 = R 1 + T 2 · a
X1 = M2 + Y1
Z1 = T32
X1 = X12
1
T3 = Z12
0
R1 = M1 + 0
1
R1 = R1 + R2
2
X1 = R1 + X1
3
T2 = Z1 + T2
0
R1 = M1 + X1
0
Z1 = Z1
0 2
T2 = M2 + 0
T1 = M2 + 0
(2|4)
3
R1 = (Y1 |M1 ) + (0|T1 ) (4|2)
Y1 = R 1
(4|2)
X1 = X1 Z1 =
(4|2) Z1
BX State 0 ISQ State 0 IMULT State
Y1 = (Y1 |M1 ) + (0|T1 ) R1 = (Z1 |R1 )2
r
0 IRED State 0
T3 = M1 + 0
FMULT State
0
0 M1 = R1 × Y1
(2|4)
0
1
0
1
(2|4)
X1 = X1
PQUAD State
0
6 7
T2 = Z12
2
2
4
Y1 = (Y1 |M1 ) + (0|T1 )
3
PDQA State 0
SA 2
M2 = X1 × T3
R1 = R12
0
FINAL State
0
0
T2 (y3 ) = M1 + 0
T1 (x3 ) = M2 + 0
Table 6: Clock Cycles of ECPM

1-MULT Random ECP
m   | tMULT | tSA | tINIT | tLOOP  | tINV | tP2AC | tECPM
163 | 30    | 7   | 15    | 29160  | 548  | 869   | 30051
233 | 47    | 9   | 19    | 65424  | 871  | 1368  | 66820
283 | 57    | 10  | 21    | 96444  | 1117 | 1717  | 98192
409 | 107   | 14  | 29    | 261936 | 1881 | 2993  | 264972
571 | 192   | 19  | 39    | 656640 | 3546 | 5523  | 662221

2-MULT Random ECP
m   | tMULT | tSA | tINIT | tLOOP  | tINV | tP2AC | tECPM
163 | 30    | 7   | 24    | 17025  | 565  | 768   | 17824
233 | 47    | 9   | 30    | 37148  | 891  | 1202  | 38389
283 | 57    | 10  | 33    | 54180  | 1139 | 1513  | 55736
409 | 107   | 14  | 45    | 142878 | 1907 | 2593  | 145530
571 | 192   | 19  | 60    | 350703 | 3579 | 4790  | 355572

1-MULT Koblitz ECP
m   | tMULT | tSA | tPDQA  | tPQUAD | tINV | tP2AC | tECPM
163 | 30    | 7   | 13338  | 1271   | 548  | 622   | 15238
233 | 47    | 9   | 30030  | 2262   | 871  | 983   | 33284
283 | 57    | 10  | 43804  | 3053   | 1117 | 1251  | 48118
409 | 107   | 14  | 118320 | 6018   | 1881 | 2123  | 126475
571 | 192   | 19  | 295450 | 11297  | 3546 | 3968  | 310734

2-MULT Koblitz ECP
m   | tMULT | tSA | tPDQA  | tPQUAD | tINV | tP2AC | tECPM
163 | 30    | 7   | 10152  | 1394   | 557  | 610   | 12163
233 | 47    | 9   | 22074  | 2436   | 881  | 957   | 25476
283 | 57    | 10  | 31772  | 3266   | 1128 | 1217  | 36265
409 | 107   | 14  | 82688  | 6324   | 1892 | 2043  | 91069
571 | 192   | 19  | 201020 | 11726  | 3559 | 3810  | 216575
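As an illustrative reading of Table 6 (our own post-processing of the tabulated cycle counts, not the authors' tooling), dividing the tECPM columns of the 1-MULT and 2-MULT designs gives the cycle-count speed-up obtained by the second multiplier block:

t_ecpm = {                                 # (1-MULT, 2-MULT) total ECPM cycles from Table 6
    "Random":  {163: (30051, 17824), 233: (66820, 38389), 283: (98192, 55736),
                409: (264972, 145530), 571: (662221, 355572)},
    "Koblitz": {163: (15238, 12163), 233: (33284, 25476), 283: (48118, 36265),
                409: (126475, 91069), 571: (310734, 216575)},
}
for curve, rows in t_ecpm.items():
    for m, (one_mult, two_mult) in sorted(rows.items()):
        print(f"{curve} m={m}: {one_mult / two_mult:.2f}x fewer cycles with two MULT blocks")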
Table 7: Results Comparison

Work | FPGA | Registers | LUTs | Slices | BRAM | Max. Freq. (MHz) | m | ECPM Latency (ms) | Efficiency 1/(s·slice) | Scalability

NIST Curve: Pseudo-Random
[16] (2010) | Spartan-3 XC3S200 | 650 | 2025 | 1127 | 4 | 68 | 163 | 38 | 0.023 | Yes
 | | | | | | | 233 | 73.4 | 0.012 |
 | | | | | | | 283 | 104 | 0.009 |
 | | | | | | | 409 | 251 | 0.004 |
 | | | | | | | 571 | 287.4 | 0.003 |
[37] (2013) | Virtex-5 XC5VLX110 | n/a | 22936 | 6150 | n/a | 250 | 163 | 0.0055 | 29.56 | No
 | | n/a | 22340 | 6487 | n/a | 192 | 233 | 0.020 | 7.746 |
 | | n/a | 25030 | 7069 | n/a | 189 | 283 | 0.034 | 4.21 |
 | | n/a | 28503 | 10236 | n/a | 161 | 409 | 0.103 | 0.952 |
 | | n/a | 32432 | 11640 | n/a | 127 | 571 | 0.348 | 0.247 |
[20] (2013*) | Virtex-5 XC5LX110T | 1225 | 3191 | 1150 | 5 | 181.19 | 163 | 0.380 | 2.290 | Yes
 | | | | | | | 233 | 0.860 | 1.011 |
 | | | | | | | 283 | 1.105 | 0.787 |
 | | | | | | | 409 | 3.037 | 0.286 |
 | | | | | | | 571 | 7.243 | 0.120 |
[32] (2014) | Virtex-5 XC5LX110T | 12983 | 24974 | 7978 | 0 | 154.35 | 163 | 0.059 | 2.119 | Yes
 | | | | | | | 233 | 0.084 | 1.489 |
 | | | | | | | 283 | 0.102 | 1.228 |
 | | | | | | | 409 | 0.147 | 0.852 |
 | | | | | | | 571 | 0.205 | 0.611 |
Proposed 1-MULT Random | Virtex-5 XC5LX110T | 1650 | 7128 | 2290 | 5 | 224.84 | 163 | 0.135 | 3.246 | Yes
 | | | | | | | 233 | 0.299 | 1.460 |
 | | | | | | | 283 | 0.440 | 0.993 |
 | | | | | | | 409 | 1.186 | 0.368 |
 | | | | | | | 571 | 2.965 | 0.147 |
Proposed 2-MULT Random | Virtex-5 XC5LX110T | 3118 | 8784 | 2708 | 5 | 223.26 | 163 | 0.080 | 4.626 | Yes
 | | | | | | | 233 | 0.172 | 2.148 |
 | | | | | | | 283 | 0.250 | 1.479 |
 | | | | | | | 409 | 0.652 | 0.567 |
 | | | | | | | 571 | 1.593 | 0.232 |

NIST Curve: Koblitz
[17] (2010) | Spartan-3 XC3S200 | 913 | 2028 | 1278 | 4 | 90 | 163 | 15.5 | 0.050 | Yes
 | | | | | | | 283 | 45.1 | 0.017 |
 | | | | | | | 571 | 121.4 | 0.0065 |
[38] (2012) | Virtex-4 | n/a | n/a | 12430 | n/a | 45.5 | 163 | 0.012 | 6.649 | No
[19] (2013*) | Virtex-5 XC5LX110T | 1401 | 3003 | 1246 | 8 | 206.27 | 163 | 0.206 | 3.903 | Yes
 | | | | | | | 233 | 0.455 | 1.764 |
 | | | | | | | 283 | 0.554 | 1.449 |
 | | | | | | | 409 | 1.451 | 0.553 |
 | | | | | | | 571 | 3.266 | 0.246 |
[32] (2014) | Virtex-5 XC5LX110T | 13076 | 26111 | 7427 | 0 | 162.07 | 163 | 0.029 | 4.599 | Yes
 | | | | | | | 233 | 0.042 | 3.213 |
 | | | | | | | 283 | 0.050 | 2.667 |
 | | | | | | | 409 | 0.073 | 1.855 |
 | | | | | | | 571 | 0.101 | 1.331 |
Proposed 1-MULT Koblitz | Virtex-5 XC5LX110T | 1704 | 7073 | 2199 | 5 | 223.46 | 163 | 0.068 | 6.669 | Yes
 | | | | | | | 233 | 0.149 | 3.053 |
 | | | | | | | 283 | 0.215 | 2.112 |
 | | | | | | | 409 | 0.566 | 0.803 |
 | | | | | | | 571 | 1.391 | 0.327 |
Proposed 2-MULT Koblitz | Virtex-5 XC5LX110T | 3134 | 8609 | 2708 | 5 | 222.67 | 163 | 0.055 | 6.760 | Yes
 | | | | | | | 233 | 0.114 | 3.228 |
 | | | | | | | 283 | 0.163 | 2.267 |
 | | | | | | | 409 | 0.409 | 0.903 |
 | | | | | | | 571 | 0.973 | 0.380 |

* Results re-implemented for Virtex-5
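The efficiency column appears to be the reciprocal of the ECPM latency (in seconds) multiplied by the occupied slices; under that assumption, the following check (ours, for illustration only) reproduces the tabulated values for the proposed processors at m = 163 to within the rounding of the listed latency:

table7_m163 = {                            # latency (ms), slices, efficiency as listed in Table 7
    "Proposed 1-MULT Random":  (0.135, 2290, 3.246),
    "Proposed 2-MULT Random":  (0.080, 2708, 4.626),
    "Proposed 1-MULT Koblitz": (0.068, 2199, 6.669),
    "Proposed 2-MULT Koblitz": (0.055, 2708, 6.760),
}
for name, (latency_ms, slices, tabulated) in table7_m163.items():
    recomputed = 1.0 / (latency_ms * 1e-3 * slices)
    print(f"{name}: recomputed {recomputed:.3f} vs {tabulated} in Table 7")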
Figure 1: Block diagram of the multiplier (MULT) block.
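A 32 x 32-bit carry-less (GF(2)[x]) word product is at most 63 bits wide, which is the width of the partial-product buses in the MULT block of Figure 1. The following Python sketch is purely illustrative of that word-serial arithmetic and is not a description of the MULT block's datapath: it forms 32-bit word products, accumulates them schoolbook-style, and checks the result against a bit-level carry-less multiplication.

import random

def clmul32(a, b):
    """Carry-less product of two 32-bit words; the result is at most 63 bits wide."""
    r = 0
    for i in range(32):
        if (b >> i) & 1:
            r ^= a << i
    return r

def word_serial_mul(a_words, b_words):
    """Schoolbook word-serial GF(2)[x] multiplication of word arrays (least-significant word first)."""
    c = [0] * (len(a_words) + len(b_words))
    for j, bw in enumerate(b_words):
        for i, aw in enumerate(a_words):
            p = clmul32(aw, bw)
            c[i + j] ^= p & 0xFFFFFFFF     # low word of the 63-bit partial product
            c[i + j + 1] ^= p >> 32        # high word (at most 31 bits)
    return c

def to_int(words):
    return sum(w << (32 * i) for i, w in enumerate(words))

def clmul(a, b):                           # bit-level carry-less reference
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = a << 1, b >> 1
    return r

random.seed(1)
aw = [random.getrandbits(32) for _ in range(6)]    # two ~192-bit operands
bw = [random.getrandbits(32) for _ in range(6)]
assert to_int(word_serial_mul(aw, bw)) == clmul(to_int(aw), to_int(bw))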
Figure 2: Block diagram of the square-add (SA) block.
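The SA block squares an operand and reduces it modulo one of the five NIST reduction polynomials (the R163 to R571 stages named in Figure 2). As a software illustration of that square-then-reduce structure (a sketch under our own conventions, not the SA block's hardware), squaring in GF(2^163) interleaves zero bits into the operand and the widened result is reduced modulo f(x) = x^163 + x^7 + x^6 + x^3 + 1, the reduction polynomial of the 163-bit NIST curves:

M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1    # x^163 + x^7 + x^6 + x^3 + 1

def gf2_square(a):
    """Square a(x) in GF(2)[x]: spread the bits of a by inserting zeros between them."""
    s = 0
    for i in range(M):
        if (a >> i) & 1:
            s |= 1 << (2 * i)
    return s

def gf2_reduce(c):
    """Reduce c(x) modulo f(x) by cancelling the leading term repeatedly."""
    while c.bit_length() > M:
        c ^= F << (c.bit_length() - 1 - M)
    return c

def ff_square(a):
    return gf2_reduce(gf2_square(a))

# Sanity checks: squaring is GF(2)-linear, and m successive squarings are the identity.
a, b = 0x123456789ABCDEF0123456789ABCDEF012345678, 0x0F0E0D0C0B0A090807060504030201
assert ff_square(a ^ b) == ff_square(a) ^ ff_square(b)
t = a
for _ in range(M):
    t = ff_square(t)
assert t == a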
Figure 3: Block diagram of the 1-MULT Koblitz ECP.
Figure 4: FSM of the 1-MULT Koblitz ECP.
Figure 5: Block diagram of the SA block for 2 MULT blocks.
Figure 6: Markov chain analysis for PQUAD state.
Algorithm 1 López-Dahab algorithm
Input: k = (kt−1, . . . , k1, k0) with kt−1 = 1, P(x, y), b – curve-specific coefficient
Output: Q(x0, y0) = kP
// Initialization - Affine to Projective Conversion and processing kt−1 = 1
(X1, Z1) ← (x, 1), (X2, Z2) ← (x^4 + b, x^2)
// Main Loop
for i from t − 2 down to 0 do
  if ki = 1 then
    (X1, Z1) ← Madd(X1, X2, Z1, Z2, x)
    (X2, Z2) ← Mdouble(X2, Z2, b)
  else
    (X2, Z2) ← Madd(X1, X2, Z1, Z2, x)
    (X1, Z1) ← Mdouble(X1, Z1, b)
  end if
end for
// Mxy - Projective to Affine Conversion
x0 ← X1/Z1
y0 ← (x + X1/Z1)[(x + X1/Z1)(x + X2/Z2) + x^2 + y](1/x) + y
return Q(x0, y0)
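Algorithm 1 can be exercised end to end in software over a toy field. The sketch below (ours, for illustration only) uses GF(2^5) with f(x) = x^5 + x^2 + 1, the standard López-Dahab Madd/Mdouble formulas, the Mxy conversion shown above, and a brute-force search for a base point on an arbitrarily chosen curve with a = b = 1; it then checks the ladder against a naive affine double-and-add reference. None of these parameters are the NIST curves targeted by the proposed processors.

M, F = 5, 0b100101                       # toy field GF(2^5), f(x) = x^5 + x^2 + 1

def fmul(a, b):                          # multiplication in GF(2^5)
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = a << 1, b >> 1
        if a >> M:
            a ^= F
    return r

def finv(a):                             # a^(2^5 - 2) = a^(-1) by Fermat (fine for a tiny field)
    r = 1
    for _ in range(2 ** M - 2):
        r = fmul(r, a)
    return r

def madd(X1, Z1, X2, Z2, x):             # Madd: 4M + 1S + 2A (cf. Table 1)
    t1, t2 = fmul(X1, Z2), fmul(X2, Z1)
    z = fmul(t1 ^ t2, t1 ^ t2)
    return fmul(x, z) ^ fmul(t1, t2), z

def mdouble(X, Z, b):                    # Mdouble: X <- X^4 + b*Z^4, Z <- X^2*Z^2
    xs, zs = fmul(X, X), fmul(Z, Z)
    return fmul(xs, xs) ^ fmul(b, fmul(zs, zs)), fmul(xs, zs)

def ladder(k, P, b):                     # Algorithm 1 (the ladder itself never uses a)
    x, y = P
    X1, Z1 = x, 1                                        # affine-to-projective conversion
    X2, Z2 = fmul(fmul(x, x), fmul(x, x)) ^ b, fmul(x, x)
    for bit in bin(k)[3:]:                               # bits k_{t-2} ... k_0
        if bit == '1':
            X1, Z1 = madd(X1, Z1, X2, Z2, x)
            X2, Z2 = mdouble(X2, Z2, b)
        else:
            X2, Z2 = madd(X1, Z1, X2, Z2, x)
            X1, Z1 = mdouble(X1, Z1, b)
    x1, x2 = fmul(X1, finv(Z1)), fmul(X2, finv(Z2))      # Mxy conversion
    t = fmul(x ^ x1, x ^ x2) ^ fmul(x, x) ^ y
    return x1, fmul(fmul(x ^ x1, t), finv(x)) ^ y

def aff_add(P, Q, a):                    # affine reference on y^2 + xy = x^3 + a*x^2 + b
    if P is None or Q is None:
        return P or Q
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y2 == x1 ^ y1:                # Q = -P (also catches doubling a 2-torsion point)
            return None
        lam = x1 ^ fmul(y1, finv(x1))    # point doubling
        x3 = fmul(lam, lam) ^ lam ^ a
        return x3, fmul(x1, x1) ^ fmul(lam ^ 1, x3)
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))   # general point addition
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    return x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1

def aff_mul(k, P, a):                    # naive double-and-add reference
    Q = None
    for bit in bin(k)[2:]:
        Q = aff_add(Q, Q, a)
        if bit == '1':
            Q = aff_add(Q, P, a)
    return Q

def order(P, a):                         # order of P by repeated addition
    Q, n = P, 1
    while Q is not None:
        Q, n = aff_add(Q, P, a), n + 1
    return n

a, b = 1, 1                              # arbitrary nonsingular toy curve
points = [(x, y) for x in range(1, 2 ** M) for y in range(2 ** M)
          if fmul(y, y) ^ fmul(x, y) == fmul(fmul(x, x), x) ^ fmul(a, fmul(x, x)) ^ b]
P = max(points, key=lambda p: order(p, a))
for k in range(2, 9):
    if aff_mul(k, P, a) and aff_mul(k + 1, P, a):        # skip degenerate multiples
        assert ladder(k, P, b) == aff_mul(k, P, a)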
Algorithm 2 Modified τNAF ECPM on Koblitz Curves
Input: k – a binary integer, P(x, y) – a point on Ea
Output: Q = kP
Compute τNAF(k) = Σ_{i=0}^{l−1} ui τ^i
// Perform the first point addition of Q ← ∞ ± P
if ul−1 = 1 then
  Q(X3, Y3, Z3) ← P(x, y)
else
  Q(X3, Y3, Z3) ← P(x, x + y)
end if
i ← l − 2
while i ≥ 0 do // Main loop
  if ui = 0 then // Section added for this algorithm
    // Perform proposed PDFRB (Q ← τ^2 Q)
    Q(X3, Y3, Z3) ← Q(X3^4, Y3^4, Z3^4)
    if ui−1 = 0 then
      i ← i − 2
    else
      i ← i − 1
      // Perform PADD
      if ui = 1 then
        Q(X3, Y3, Z3) ← Q(X3, Y3, Z3) + P(x, y)
      else // ui = −1
        Q(X3, Y3, Z3) ← Q(X3, Y3, Z3) + P(x, x + y)
      end if
      i ← i − 1
    end if
  else // This is performed traditionally
    // Perform PFRB (Q ← τQ)
    Q(X3, Y3, Z3) ← Q(X3^2, Y3^2, Z3^2)
    // Perform PADD
    if ui = 1 then
      Q(X3, Y3, Z3) ← Q(X3, Y3, Z3) + P(x, y)
    else // ui = −1
      Q(X3, Y3, Z3) ← Q(X3, Y3, Z3) + P(x, x + y)
    end if
    i ← i − 1
  end if
end while
return Q(x3, y3) ← Q(X3/Z3, Y3/Z3^2)
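To see what the zero-digit handling in Algorithm 2 buys, its control flow can be exercised symbolically. In the sketch below (an illustration of ours, with hand-written τNAF-like digit strings standing in for a real integer-to-τNAF conversion), the accumulator is kept as a list of τ-adic coefficients, so a Frobenius application is a coefficient shift and a point addition increments the constant term. Both the traditional digit-at-a-time loop and the modified loop must reproduce the original digit string, and the operation counters show how one PDFRB replaces a pair of PFRB operations.

def tau(Q):                       # Q <- tau*Q : shift the tau-adic coefficients up by one
    return [0] + Q

def add(Q, u):                    # Q <- Q + u*P, u in {+1, -1}
    Q = list(Q) if Q else [0]
    Q[0] += u
    return Q

def traditional(digits):          # one PFRB (tau) per digit, plus PADD when the digit is nonzero
    Q, ops = add([], digits[-1]), {"PFRB": 0, "PDFRB": 0, "PADD": 1}
    for i in range(len(digits) - 2, -1, -1):
        Q, ops["PFRB"] = tau(Q), ops["PFRB"] + 1
        if digits[i]:
            Q, ops["PADD"] = add(Q, digits[i]), ops["PADD"] + 1
    return Q, ops

def modified(digits):             # Algorithm 2: merge two Frobenius maps whenever u_i = 0
    Q, ops = add([], digits[-1]), {"PFRB": 0, "PDFRB": 0, "PADD": 1}
    i = len(digits) - 2
    while i >= 0:
        if digits[i] == 0 and i > 0:
            Q, ops["PDFRB"] = tau(tau(Q)), ops["PDFRB"] + 1      # proposed PDFRB (tau^2 Q)
            if digits[i - 1]:
                Q, ops["PADD"] = add(Q, digits[i - 1]), ops["PADD"] + 1
            i -= 2
        else:                                                     # traditional PFRB (+ PADD)
            Q, ops["PFRB"] = tau(Q), ops["PFRB"] + 1
            if digits[i]:
                Q, ops["PADD"] = add(Q, digits[i]), ops["PADD"] + 1
            i -= 1
    return Q, ops

# Hand-written tauNAF-like strings (u_0 first, u_{l-1} = 1 last, no two adjacent nonzero digits);
# they only stand in for a real integer-to-tauNAF conversion, which is outside this sketch's scope.
for d in ([1, 0, -1, 0, 0, 0, 1, 0, 0, 1], [-1, 0, 0, 1, 0, 1, 0, 0, 0, 1]):
    q1, c1 = traditional(d)
    q2, c2 = modified(d)
    assert q1 == d and q2 == d    # both loops recover the same sum of u_i * tau^i
    print(c1, "->", c2)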