Area-efficient architectures for double precision multiplier on FPGA, with run-time-reconfigurable dual single precision support







Manish Kumar Jaiswal, Ray C.C. Cheung
Department of Electronic Engineering, City University of Hong Kong, Hong Kong


Abstract

Article history: Received 24 June 2012; received in revised form 10 February 2013; accepted 12 February 2013; available online 15 March 2013.

Floating point arithmetic (FPA) is a crucial basic building block in many application domains, such as scientific, numerical and signal processing applications, and multiplication is one of its most commonly used operations. This paper presents three architectures targeting double precision (D.P.) multiplication, one of which is capable of performing run-time-reconfigurable (RTR) dual single precision (S.P.) multiplication. The first design is based on a novel block-level truncated multiplication, which reduces the number of multiplier blocks by one third while maintaining high performance, and remains within 1 ULP (unit in the last place) of the IEEE-754 floating-point standard precision. The second design regains the accuracy lost in the first design, with the same number of multiplier blocks but some extra hardware, and also achieves better performance with less latency than existing work. The third architecture can operate on either a single double (extended) precision operand pair or dual single (extended) precision operand pairs, without any pipeline stall, and with attractive area, speed and latency results. The first design is suitable for applications with slightly relaxed precision requirements, whereas the other two designs are fully compatible with the IEEE standard accuracy. Design-1 achieves around 300 MHz and 450 MHz on Virtex-4 (V4) and Virtex-5 (V5), respectively, with only 6 DSP48 blocks and a latency of 9 cycles. Design-2 achieves about 325 MHz (V4) and 400 MHz (V5), also with only 6 DSP48 blocks, with full precision support. The third design achieves more than 250 MHz (V4) and 325 MHz (V5), providing on-the-fly dual precision support with a hardware requirement similar to the double-precision-only implementations in the literature. Promising results are obtained by comparing the proposed designs with the best reported floating point multipliers in the literature.

Keywords: Arithmetic; Floating point multiplication; FPGA; Double precision; Run-time-reconfigurable; Truncated block multiplier; Karatsuba multiplication; High performance computing

1. Introduction

The floating point number system is a common choice for many scientific computations due to its wide dynamic range. Floating point arithmetic is widely used in many areas, especially in scientific computation, numerical processing and signal processing [1–6]. The IEEE 754 standard [7,8] defines the floating point formats, and efficient hardware implementations of arithmetic operations for this standard form a crucial part of many processors. Among the floating point arithmetic operations, multiplication is a dominating core operation in a large number of scientific and signal processing computations. These applications often aim at high-performance and area-efficient implementations of floating point arithmetic, and thus efficient implementation of floating point multipliers is a very crucial consideration.



Over the past few decades, much work has been dedicated to improving the performance of floating point computation, at both the algorithmic level and the implementation level [2,9–15]. Several works have also focused on FPGA-based implementations, including [13,15–20,11]. All of these works follow the general arithmetic flow for floating point implementation. Most of them use normalized numbers only and are not fully compatible with the IEEE 754 standard, mainly in terms of exceptional case handling and rounding methods. In [18,13], Hemmert and Underwood used FPGA-specific optimizations as key components for improving the area and speed of the design. Venishetti and Akoglu [11] used a Vedic mathematics approach for the mantissa multiplication in the floating point multiplication unit. The works in [11,18] also support denormalized numbers. Belanovic [16] and Wang [17] presented libraries of parameterized, fully pipelined floating point arithmetic units. Lienhart [19] and Paschalakis [20] presented pipelined implementations of double precision floating point arithmetic that are not fully IEEE 754 compatible. Most of the available implementations are not fully compatible with the IEEE-754 standard and mainly support normal numbers only. All of these previous approaches have put effort into



optimizing the floating point arithmetic unit in terms of area and speed. Xilinx [21] also provides a soft IP core for floating point arithmetic on its FPGA platforms, but it does not support denormalized numbers and is not fully compatible with the IEEE 754 standard. Some open source libraries [22,23] are also available for floating point arithmetic on hardware platforms; these libraries can be reconfigured for various exceptional case handling and rounding methods. In general, FPGA circuit design differs significantly from traditional VLSI circuit design due to the mapping onto specific FPGA primitives. An FPGA can be reprogrammed at the hardware level for different computations, depending on the requirements of each application. Similarly, a floating point unit can be adapted to the requirements of the application, for instance by changing the rounding mode, the normal/denormal support, or the latency. FPGAs are now becoming a major competitor to high performance computing machines, due to their lower energy requirement in the same process technology [24,25]. The available speed, amount of logic and several on-chip intellectual property (IP) cores make them suitable for a large set of applications. They are now used, with significant performance, in various fields of numerical and scientific computation [1–3,26], image processing [4,27], communications [5,28], and cryptography [29–31]. Even current-era supercomputers and hybrid computers use FPGAs [32–36] to off-load and accelerate parallelizable complex routines. Our proposed work on the efficient implementation of a crucial floating point arithmetic operation, namely double precision floating point multiplication, is therefore very desirable for FPGA platforms. The floating point multiplier is relatively simple compared to other floating point arithmetic operations. The crucial part of floating point multiplication lies in the mantissa multiplication, which is the bottleneck in the performance of an FPU multiplier. The mantissa portion of a double precision floating point number is 53 bits long (including one hidden bit), and in general this would require a 53 x 53 multiplier in hardware, which is very expensive. In this work, we propose three designs for the mantissa multiplication of double precision floating point numbers. The first design is based on the "Block Multiplication" technique [37]. Here we use a modified version of it, named the "Truncated Block Multiplication (TBM) Method" (Design-1), which uses fewer multipliers with a maximum precision loss of 1 ULP. The second design uses a 3-partitioning version of the "Karatsuba Multiplication" technique [38], a well-known multiplication method, as its basis, and is named the "3-Partition Karatsuba Multiplication (3-PKM) Method" (Design-2). The third design adds the capability of run-time reconfiguration between one double precision and two parallel single precision floating point multiplications; we call it Design-3 hereafter. A proper shared allocation of resources between the double and single precision computations leads to a significant area reduction. This capability of the third design helps to reduce the total effective area while adding computational power to the module.
The TBM method provides support only for normal numbers because of its accuracy limitation, whereas 3-PKM and Design-3 (DPdSP) support both normal and denormal numbers. The major goal of this paper is to achieve the best possible area reduction with better performance on FPGA platforms, and to prove the applicability of the designs in high performance reconfigurable computing. We have compared our results with optimized implementations of [39], Hemmert [18], Xilinx [21], Govindu [15], Venishetti [11] and the NEU [16,17] (Northeastern University, Boston) floating point library multipliers. We have implemented our modules for optimum area, with balanced latency and performance. We have

used Xilinx ISE synthesis tool, ModelSim SE simulation tool, and Xilinx Virtex-2 Pro, Virtex-4 and Virtex-5 FPGA platforms for implementation and to compare the results. Specifically, the key contributions of this paper are as follows:

- Proposed three FPGA-friendly architectures for double precision floating point multiplication, with both normal and denormal support. The first two designs (TBM and 3-PKM) reduce the number of multiplier blocks in the mantissa multiplication by 33%.
- Proposed a third design that is a run-time-reconfigurable double (extended) precision / dual single (extended) precision (DPdSP) multiplier with approximately 70% resource sharing (compared to a double-precision-only implementation) and promising performance, along with normal and denormal support.
- Analyzed the accuracy of truncated block multiplication and proposed its optimum use for double precision floating point mantissa multiplication.
- Provided extensive comparisons between our proposed designs and previously reported implementations in the literature in terms of area and performance efficiency.

This paper is organized as follows. Section 2 discusses the basic operations of floating point multiplication. Section 3 explains our design methodologies for the mantissa multiplication with a reduced number of multipliers in the first two designs, and for the RTR multiplier design. Section 4 discusses the complete implementation and the further processing of the floating point multiplication. Section 5 gives the implementation details (hardware utilization and performance measures) of the proposed designs. Comparisons with previously reported implementations are shown in Section 6, and the paper is concluded in Section 7.

2. Background

The binary formats of single and double precision floating point numbers are as follows:

For single precision: 1-bit sign | 8-bit exponent | 23-bit mantissa
For double precision: 1-bit sign | 11-bit exponent | 52-bit mantissa

A floating point arithmetic implementation processes the sign, exponent and mantissa parts of the operands separately; after the required rounding and normalization, the result is framed back into the desired format. The floating point multiplier is a basic and comparatively simple arithmetic unit, except that it requires a large integer multiplier for the mantissa multiplication, which causes the performance and area overhead in the hardware design. The computational flow of this arithmetic operation is as follows:

1. Pre-normalization (in the case of denormal operands).
2. The output sign bit is the XOR of the sign bits of both operands.
3. The product exponent is the sum of both operands' exponents, after proper bias adjustment.
4. Perform the mantissa multiplication of the operands.
5. Post-normalization and rounding of the mantissa.

Steps 2 to 4 can be stated by the following expression:


y = Operand1 x Operand2
  = (-1)^sign_1 . 2^exp_1 . 1.mant_1  x  (-1)^sign_2 . 2^exp_2 . 1.mant_2
  = (-1)^(sign_1 xor sign_2) . 2^((exp_1 - bias) + (exp_2 - bias) + bias) . (1.mant_1 x 1.mant_2)
  = (-1)^(sign_1 xor sign_2) . 2^(exp_1 + exp_2 - bias) . (1.mant_1 x 1.mant_2)

where sign_1, exp_1, mant_1 are the sign, exponent and mantissa of the first operand, and sign_2, exp_2, mant_2 are the sign, exponent and mantissa of the second operand. Getting the result correctly rounded is an essential part of any floating point arithmetic operation. Even with floating point, the computations have finite precision; the results of calculations therefore need to be restricted to the given precision, which makes rounding necessary. The IEEE 754 standard defines a number of different "rounding modes"; the details can be obtained from [7,8,40]. Normalization, used in all the operations, brings the final result into the standard floating point representation format.
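To make the flow above concrete, the following Python sketch (not part of the original design; the helper names are chosen for illustration) splits two IEEE-754 double precision operands into sign, exponent and mantissa fields and multiplies them following steps 2–4, with simple truncation in place of rounding and with exceptional cases and denormals omitted.

    import struct

    def fields(x):
        """Split a Python float into its IEEE-754 double precision fields."""
        bits = struct.unpack('>Q', struct.pack('>d', x))[0]
        sign = bits >> 63
        exp = (bits >> 52) & 0x7FF
        mant = bits & ((1 << 52) - 1)
        return sign, exp, mant

    def fp_mult(a, b):
        """Illustration of steps 2-4 for normal operands; truncation instead of rounding."""
        s1, e1, m1 = fields(a)
        s2, e2, m2 = fields(b)
        sign = s1 ^ s2                                 # step 2: XOR of the sign bits
        exp = e1 + e2 - 1023                           # step 3: add exponents, remove one bias
        prod = (m1 | (1 << 52)) * (m2 | (1 << 52))     # step 4: 53 x 53-bit mantissa product
        if prod >> 105:                                # step 5: product in [2, 4) -> shift right, bump exponent
            mant = (prod >> 53) & ((1 << 52) - 1)
            exp += 1
        else:                                          # product in [1, 2)
            mant = (prod >> 52) & ((1 << 52) - 1)
        bits = (sign << 63) | (exp << 52) | mant
        return struct.unpack('>d', struct.pack('>Q', bits))[0]

    print(fp_mult(1.5, -2.25))   # -3.375 (exactly representable, so truncation is exact)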

3. Design methodologies

This section discusses the underlying design methodologies for the architectures of all three proposed double precision floating point multiplier modules. Section 3.1 discusses the basic concept and implementation strategy of the Truncated Block Multiplication (TBM) design. This design uses fewer block multipliers to reduce the required area, at a slight accuracy cost of 1 ULP, and achieves improved performance; it supports only normal numbers. The design strategy of the 3-PKM method is discussed in Section 3.2. This design reduces the multiplier blocks in the same way as TBM, but without any precision loss, and therefore supports both normal and denormal formats. The DPdSP design methodology is provided in Section 3.3. This design also fulfils the complete precision requirement of the double precision format and supports normal and denormal numbers as well. Moreover, proper resource sharing of the hardware units gives the design its dual functionality: it can perform either one double precision multiplication or two single precision multiplications in each clock cycle, without any pipeline stall. All the proposed architectures are focused on better area and speed metrics, with a fully pipelined structure.

3.1. Truncated Block Multiplication (TBM) method (Design-1)

Truncated multipliers [39] are a well-known and efficient way of reducing the area requirement when the precision requirement is relaxed and only some MSBs of the product are needed. A substantial body of literature is available on the accuracy and rounding methods of truncated multipliers; a recent example is [41]. Generally, truncation is based on discarding the generation of part of the lower-order partial product matrix (saving the area related to the discarded partial products and their subsequent addition). We have proposed a form of truncated multiplier for the double precision format [42], based on truncated block-based multiplication: some LSB block multipliers are discarded, since the general bit-level methodology is not directly applicable to a block multiplication strategy. The reported work [42] used only 6 multiplier blocks with an accuracy loss of 1 ULP (based on a vast random-test simulation). Here, we present the maximum error bound of the design, along with an improved implementation that incorporates the benefit of the DSP48 IPs on the FPGA platform.


For the double precision mantissa multiplication, the two 53-bit mantissas are multiplied. After the multiplication, however, the result has to be trimmed back down to a 53-bit mantissa (by suitably truncating or rounding it), so a complete 53 x 53 multiplier is not required. For this purpose, the 53-bit mantissa operands A and B are partitioned, from LSB to MSB, into three 17-bit blocks (A1, B1), (A2, B2), (A3, B3) and one 2-bit MSB block (A4, B4).

The mantissa product can then be expressed as

A x B = A4.B4.2^102 + (A4.B3 + A3.B4).2^85
      + (A4.B2 + A3.B3 + A2.B4).2^68
      + (A4.B1 + A3.B2 + A2.B3 + A1.B4).2^51
      + (A3.B1 + A2.B2 + A1.B3).2^34
      + (A2.B1 + A1.B2).2^17 + A1.B1                              (1)

In Eq. (1), removing the terms (A2.B1 + A1.B2).2^17 + A1.B1 causes only a minor accuracy loss for the present purpose. The maximum error caused by this reduction can be quantified as follows. The maximum value of a 17-bit block is 0x1FFFF, and the product of two such maximum-value blocks is the 34-bit value

0x1FFFF x 0x1FFFF = 0x3FFFC0001

Thus, the maximum discarded value is

{0x3FFFC0001, 17'b0} + {0x3FFFC0001, 17'b0} + 0x3FFFC0001 = 0xFFFF400000001

which occupies only the 55 LSBs of the 106-bit product. The decimal equivalent of this maximum error is 2.2204206382618959e-16, while 2^-52 = 2.2204460492503131e-16. Thus, the maximum error affects only the last bit of the double precision floating point number, i.e. it is within 1 ULP. For most signal processing and graphics applications this level of error is acceptable [43,44]; for scientific computations, the user can decide whether this level of accuracy is acceptable in exchange for higher performance and smaller area. The proposed truncated block multiplication for the double precision mantissa is shown in Fig. 1, and the architecture of the TBM mantissa multiplier is shown in Fig. 2. The architecture in Fig. 2 targets the DSP48, but it can equally be built with simple MULT18x18 IPs. Six MULT18x18/DSP48 blocks are used for the 17 x 17 multiplications, all 2 x 17 multiplications are done in the FPGA fabric, and 4 look-up tables (LUTs) perform the 2 x 2 multiplication. The partial products are then arranged suitably and added to obtain the final result. By efficiently using the DSP48 IP available on Virtex-4 and Virtex-5, the amount of logic used for the addition of partial products can also be reduced: in this implementation most of the partial product additions are ported onto the DSP48 blocks. The sums of partial products P1, P2, P3 and P4, of P5, P6 and P7, and of P8, P9 and P10 are carried out on the DSP48s along with the multiplications, which helps reduce the area of the design on the Virtex-4 and Virtex-5 FPGAs. The latency of this TBM mantissa multiplier is 6 clock cycles. Furthermore, since this design is prone to a 1-ULP precision loss, it is suitable for normalized floating point operands. For denormal numbers, the precision loss can be larger: the left shifting of the mantissa product during post-normalization may move the 1-ULP error to a higher bit position and could lead to a larger ULP error. So, this design is aimed at normal support only. The mantissa product is further normalized and the exponent is adjusted accordingly; final rounding is then performed to obtain the complete floating point multiplication result.
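The error bound above can be checked mechanically. The following Python sketch (illustrative only, using the same 17/17/17/2-bit partitioning) rebuilds the block product of Eq. (1) with and without the three discarded terms, confirms that the worst-case discarded value is 0xFFFF400000001, and checks on random mantissas that the truncation error never exceeds the weight 2^52 of the last kept bit.

    import random

    def split53(x):
        """Partition a 53-bit mantissa into A1..A4 (17, 17, 17 and 2 bits, LSB first)."""
        return x & 0x1FFFF, (x >> 17) & 0x1FFFF, (x >> 34) & 0x1FFFF, x >> 51

    def tbm(a, b, truncate=True):
        """Block product of Eq. (1); with truncate=True the three LSB block products are dropped."""
        A1, A2, A3, A4 = split53(a)
        B1, B2, B3, B4 = split53(b)
        p = (A4 * B4) << 102
        p += (A4 * B3 + A3 * B4) << 85
        p += (A4 * B2 + A3 * B3 + A2 * B4) << 68
        p += (A4 * B1 + A3 * B2 + A2 * B3 + A1 * B4) << 51
        p += (A3 * B1 + A2 * B2 + A1 * B3) << 34
        if not truncate:
            p += (A2 * B1 + A1 * B2) << 17
            p += A1 * B1
        return p

    # Worst-case discarded value: every 17-bit block at its maximum 0x1FFFF.
    worst = 2 * (0x1FFFF * 0x1FFFF << 17) + 0x1FFFF * 0x1FFFF
    assert 0x1FFFF * 0x1FFFF == 0x3FFFC0001
    assert worst == 0xFFFF400000001 and worst < 1 << 52   # below the weight of the last kept bit

    # Random check: the truncated product never deviates by more than the worst case.
    for _ in range(10000):
        a, b = (random.getrandbits(52) | 1 << 52 for _ in range(2))
        assert 0 <= a * b - tbm(a, b) <= worst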


[Fig. 1. 53-bit mantissa block multiplication: the ten kept block products P1–P10 and the ignored LSB part.]

[Fig. 2. Architecture of the 53-bit TBM, built around six DSP48 blocks, with the operands partitioned as a[52:0] = {a2[52:51], aU[50:34], aM[33:17], aL[16:0]} and b[52:0] = {b2[52:51], bU[50:34], bM[33:17], bL[16:0]}.]

3.2. 3-Partition Karatsuba Multiplication (3-PKM) method (Design-2)

In this design we use the complete 53 x 53-bit multiplication, as opposed to Design-1 (TBM). Thus no precision loss occurs, while the same saving in multiplier blocks is obtained at the cost of some extra logic. The basis of this design is adopted from the Karatsuba multiplication technique [38]. Karatsuba multiplication is a fast multiplication algorithm: it reduces the multiplication of two n-digit numbers from the straightforward n^2 single-digit multiplications to at most 3n^(log2 3) ≈ 3n^1.585. The algorithm follows the divide-and-conquer paradigm and proceeds as follows. Let W and X be two n-digit numbers. By breaking these numbers into two parts, for some base B, we can rewrite them as W = W1.B^m + W0 and X = X1.B^m + X0, where W0 and X0 are m-digit. The product of W and X can then be written as

W.X = W1.X1.B^2m + (W1.X0 + W0.X1).B^m + W0.X0 = alpha.B^2m + beta.B^m + gamma         (2)

Eq. (2) requires four multiplications to obtain the complete result, whereas, using the Karatsuba method and rewriting beta as in Eq. (3) or Eq. (4), only three multiplications are needed:

beta = (W1 + W0)(X1 + X0) - alpha - gamma                                               (3)

or

beta = alpha + gamma - (W1 - W0)(X1 - X0)                                               (4)
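As a quick sanity check of Eqs. (2)–(4), a small Python sketch (illustrative; B = 2^17 chosen to match the block width used later for the mantissa) computes the product with three multiplications instead of four:

    def karatsuba2(W, X, m=17):
        """Two-partition Karatsuba: three multiplications, per Eqs. (2) and (4)."""
        B = 1 << m
        W1, W0 = W >> m, W & (B - 1)
        X1, X0 = X >> m, X & (B - 1)
        alpha = W1 * X1
        gamma = W0 * X0
        beta = alpha + gamma - (W1 - W0) * (X1 - X0)   # Eq. (4); Eq. (3) uses (W1+W0)(X1+X0)-alpha-gamma
        return alpha * B * B + beta * B + gamma

    assert all(karatsuba2(w, x) == w * x
               for w in (0, 1, 0x1FFFE, 0x2ABCD1234)
               for x in (0, 0x3FFFF, 0x1FFFFFFFF))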

Similarly, by extending this technique, i.e. by splitting the operands into three parts, the number of multiplications can be reduced from 9 to 6. The details are as follows. We divide the operands W and X and form the product W.X as

W = W2.B^2m + W1.B^m + W0,    X = X2.B^2m + X1.B^m + X0

W.X = a2.B^4m + a1.B^2m + a0 + b2.B^3m + b1.B^2m + b0.B^m                               (5)

where a2 = W2.X2, b2 = W2.X1 + W1.X2, a1 = W1.X1, b1 = W2.X0 + W0.X2, a0 = W0.X0 and b0 = W1.X0 + W0.X1.


Up to this point, 9 multiplications are needed to accomplish the task. The number of multiplications can be reduced to 6 by rewriting b2, b1 and b0 as

b2 = (W2 + W1)(X2 + X1) - a2 - a1
b1 = (W2 + W0)(X2 + X0) - a2 - a0                                                       (6)
b0 = (W1 + W0)(X1 + X0) - a1 - a0

or

b2 = a2 + a1 - (W2 - W1)(X2 - X1)
b1 = a2 + a0 - (W2 - W0)(X2 - X0)                                                       (7)
b0 = a1 + a0 - (W1 - W0)(X1 - X0)

As a result, the number of multiplications is reduced from 9 to 6, while still giving the complete and correct multiplication result. The overhead is only some extra additions and subtractions, whose cost is much lower than that of a multiplier. We have adopted this method for the mantissa multiplication of the double precision floating point multiplier. Each 53-bit mantissa (including one hidden bit) is decomposed into three parts: a 17-bit MSB part (W2, X2) and two 18-bit parts (W1, X1 and W0, X0).

As in Eq. (5), a2 = W2.X2 requires a 17-bit unsigned multiplier, while a1 = W1.X1 and a0 = W0.X0 each need an 18-bit unsigned multiplier. The computation of each of b2, b1 and b0 requires a 19-bit unsigned multiplier. The compositions of the 18-bit and 19-bit multipliers are shown in Fig. 3. For these multipliers, partial product P0 is computed by a dedicated hard-core multiplier IP, whereas all other partial products (P1, P2 and P3) are computed using logic resources; their implementation is simple and straightforward (a set of AND gates and adder logic). In general, all the partial products can simply be added to obtain the multiplication result. However, this can be optimized when the DSP48 IP is used for the 17-bit multiplication of partial product P0 (on Virtex-4 FPGAs): the built-in 47-bit adder of the DSP48 IP can be used to save some logic resources. Here, the sum of the partial products P1, P2 and P3 is supplied to the DSP48 IP (when used) and added to P0 to obtain the multiplication result. On Virtex-5 or newer FPGAs, the DSP48E has 24 x 17 unsigned (25 x 18 signed) multiplication capability. There, the 18-bit multiplier is realized with an 18 x 17 multiplication on the DSP48E and an 18 x 1 multiplication in logic, with the further addition of partial products on the same DSP48E. Similarly, the 19-bit multiplier needs one 19 x 17 multiplication on the DSP48E and a 19 x 2 multiplication in logic, with the final partial product addition on the same DSP48E (Fig. 4). This gives some extra advantage over the Virtex-4 FPGA. Using these 18-bit and 19-bit multipliers and some extra adders/subtractors (3 adders and 1 subtractor for each of the b's), we can compute all the a's and b's. Finally, by combining them, we obtain the full 53-bit mantissa multiplication result using only 6 multiplier IPs, i.e. a reduction of at least 33% (compared to at least 9 multiplier IPs for a 53-bit multiplier). The architecture of the 53-bit multiplier using the 3-PKM method is shown in Fig. 5. This module has a latency of 8 clock cycles on Virtex-4 and 7 clock cycles on Virtex-5; the 18-bit and 19-bit multipliers have a latency of 3 clock cycles on Virtex-4 and 2 clock cycles on Virtex-5. In Fig. 5, m00, m11 and m22 are the 18-bit multipliers used to generate the a's, and m10, m20 and m21 are the 19-bit multipliers used to generate the b's; the remaining additions and subtractions take 5 further clock cycles. Since the design performs the complete multiplication (though with fewer multiplier blocks), there is no precision loss, so it targets both normal and denormal support. The operands are first pre-normalized (in the case of denormal operands), then processed for the sign, exponent and mantissa computations, followed by post-normalization and rounding to bring the result back into the standard format.
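The decomposition can be modelled behaviourally as follows. This Python sketch (an illustration of the arithmetic only, not of the pipelined Verilog implementation) applies the 17/18/18-bit partitioning and Eqs. (5) and (6), and reproduces the exact 106-bit mantissa product from six block multiplications:

    import random

    def pkm3(W, X):
        """3-partition Karatsuba for 53-bit operands: six multiplications, exact result."""
        m = 18                                     # W0, W1, X0, X1: 18 bits; W2, X2: 17 bits
        W0, W1, W2 = W & 0x3FFFF, (W >> 18) & 0x3FFFF, W >> 36
        X0, X1, X2 = X & 0x3FFFF, (X >> 18) & 0x3FFFF, X >> 36
        a2, a1, a0 = W2 * X2, W1 * X1, W0 * X0     # the three "a" block products
        b2 = (W2 + W1) * (X2 + X1) - a2 - a1       # Eq. (6)
        b1 = (W2 + W0) * (X2 + X0) - a2 - a0
        b0 = (W1 + W0) * (X1 + X0) - a1 - a0
        # Recombination per Eq. (5)
        return (a2 << 4 * m) + (b2 << 3 * m) + ((a1 + b1) << 2 * m) + (b0 << m) + a0

    for _ in range(10000):
        w, x = (random.getrandbits(52) | 1 << 52 for _ in range(2))
        assert pkm3(w, x) == w * x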

3.3. RTR Double/Dual Single Precision (DPdSP) multiplier (Design-3)

In this design of double (extended) precision floating point multiplication, we incorporate the capability of on-the-fly dual single (extended) precision floating point multiplication. This is achieved by sharing the most costly and critical part, the mantissa multiplication, between the single precision and double precision operands. The architecture of this dual single precision and double precision multiplication is shown in Fig. 6. It consists of separate pre-processing and post-processing for the single precision and double precision operands, with a shared unit for the mantissa multiplications. In the initial pre-processing stage, the input data are processed separately for exceptional case detection and normal/denormal status checks.

[Fig. 3. Composition of the 18-bit and 19-bit multipliers on Virtex-4: partial product P0 from a 17 x 17 hard multiplier, P1 and P2 from 1 x 17 / 2 x 17 logic multiplications, and P3 from a 1 x 1 / 2 x 2 logic multiplication.]

[Fig. 4. Composition of the 18-bit and 19-bit multipliers on Virtex-5: P0 from a 17 x 18 / 17 x 19 DSP48E multiplication and P1 from a 1 x 18 / 2 x 19 logic multiplication.]

[Mapping of the 18/19-bit multiplication onto (a) a Virtex-4 DSP48 and (b) a Virtex-5 DSP48E.]

[Fig. 5. 53-bit multiplier architecture using 3-PKM, with the operands partitioned as W[52:0] = {W2[52:36], W1[35:18], W0[17:0]} and X[52:0] = {X2[52:36], X1[35:18], X0[17:0]}, block multipliers m22, m11, m00 (for a2, a1, a0) and m21, m20, m10 (for b2, b1, b0), and the final additions/subtractions producing the 106-bit product.]

The processed mantissas of the single precision and double precision operands are then multiplexed to the next stage, the shared mantissa multiplier block. The mantissa multiplication component in Fig. 6 implements either one 66-bit or dual 34-bit integer multiplications, so it can easily be used for single and double precision and their extended precision versions. Using the two-partition Karatsuba method, a 66-bit multiplier can be implemented with two 33-bit unsigned multipliers (m00, m11) and one 34-bit unsigned multiplier (m10), with the help of Eqs. (2) and (3). Each 33/34-bit multiplier block is in turn implemented using the two-partition Karatsuba method [45], which further needs two 17-bit unsigned multipliers and one 18-bit signed multiplier (realized with the DSP48s available on the FPGA), using Eqs. (2) and (4), as shown in Fig. 7. Thus, in total, 9 DSP48 blocks are required for the 66-bit multiplication. The latency of the 34-bit multiplier is 5 clock cycles with registered inputs, as shown in Fig. 7. For the 66-bit multiplier, the remaining additions and subtractions require an additional 4 clock cycles, for a total of 9 clock cycles. In order to add reconfigurability, the input operand mantissas are multiplexed over the 34-bit multipliers m00 and m11: the inputs of m00 are multiplexed between the first set of single precision mantissa operands and one half of the double precision mantissa operands, and likewise the inputs of m11 are multiplexed between the second set of single precision mantissa operands and the other half of the double precision mantissa operands. For single precision processing, the outputs of m00 and m11 are passed to their single precision post-processing counterparts, whereas for double precision their outputs are processed further, together with m10, and then passed to the double precision post-processing counterpart. Thus, we can process either

a double (extended) precision operand pair or dual single (extended) precision operand pairs in each clock cycle, without any pipeline stall. Moreover, in the case of dual single (extended) precision processing, since the mantissa computation passes through only one level of 34-bit multipliers, fewer clock cycles are needed to produce the complete result than for the double (extended) precision result. The post-processing stage includes a leading-one detector (LOD), in effect a priority encoder, for denormal processing. This part could also be shared between the single precision and double precision mantissa products using controlled multiplexers; however, this would have an equivalent area cost with some extra delay overhead. The mantissa products, along with the LOD outputs, then feed dynamic shift registers that normalize the mantissa products. The normalized products are then rounded (the round-to-nearest method is used here), which needs some logic and an adder of mantissa width. Finally, the multiplication results are produced with proper exponent biasing and exceptional case handling.
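A behavioural model of the shared multiplier helps to see the two-level structure: a 34-bit block built from two 17-bit unsigned products and one signed product on the operand differences (Eqs. (2) and (4)), and the 66-bit double precision path built from two such blocks plus one more on the sums of the 33-bit halves (Eqs. (2) and (3)). The Python sketch below is illustrative only (no pipelining, multiplexing control or DSP48 mapping); the mode argument mimics the run-time selection between one double precision product and two independent single precision products.

    import random

    def mult34(w, x):
        """34-bit block: two 17 x 17 unsigned products plus one signed product (Eqs. (2), (4))."""
        w1, w0 = w >> 17, w & 0x1FFFF
        x1, x0 = x >> 17, x & 0x1FFFF
        hi, lo = w1 * x1, w0 * x0                    # the two unsigned DSP48-style products
        mid = hi + lo - (w1 - w0) * (x1 - x0)        # 18-bit signed product, Eq. (4)
        return (hi << 34) + (mid << 17) + lo

    def dpdsp_mantissa(mode, w, x, w2=None, x2=None):
        """mode='dp': one 53 x 53 product; mode='sp': two independent 24 x 24 products."""
        if mode == 'sp':
            return mult34(w, x), mult34(w2, x2)      # m00 and m11 used independently
        w1, w0 = w >> 33, w & ((1 << 33) - 1)        # 33-bit halves of the DP mantissa
        x1, x0 = x >> 33, x & ((1 << 33) - 1)
        hi, lo = mult34(w1, x1), mult34(w0, x0)      # the shared blocks m11 and m00
        mid = mult34(w1 + w0, x1 + x0) - hi - lo     # m10 on the 34-bit sums, Eq. (3)
        return (hi << 66) + (mid << 33) + lo

    # Exactness checks against Python's integer multiplication
    a, b = (random.getrandbits(52) | 1 << 52 for _ in range(2))
    assert dpdsp_mantissa('dp', a, b) == a * b
    sa1, sb1, sa2, sb2 = (random.getrandbits(23) | 1 << 23 for _ in range(4))
    assert dpdsp_mantissa('sp', sa1, sb1, sa2, sb2) == (sa1 * sb1, sa2 * sb2)

The subtractive form of Eq. (4) keeps the middle operands (W1 - W0) and (X1 - X0) within 18 signed bits, which is why a single signed 18-bit multiplier (one DSP48) suffices for the middle product of each 34-bit block.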

4. Implementations

The implementation of the floating point multiplication requires processing of the sign, exponent and mantissa separately. The mantissa multiplications of all the proposed designs are implemented as discussed in Section 3, and the further computational processing is similar for all of them. Next, we discuss the details of each operation.


[Fig. 6. DP/Dual-SP FP multiplication architecture: pre-processing of the input operands (one set of double precision, two sets of single precision), shared multiplier blocks m00, m11, m10 producing the SP-1, SP-2 and DP mantissa products, and post-processing for one set of double precision and two sets of single precision results. Notation: DP-M1-L/H = double precision mantissa-1 bits [32:0]/[52:33], DP-M2-L/H = double precision mantissa-2 bits [32:0]/[52:33], SPX-MY = single precision data-X mantissa-Y.]

[Fig. 7. 34-bit multiplier architecture: the 17-bit operand halves W1, W0 and X1, X0 feed the DSP48 multipliers m11 and m00, an 18-bit signed DSP48 multiplier m10 operates on (W1 - W0) and (X1 - X0), and the partial results are combined through the adders/subtractors a0, s0, s1, a1 into a 68-bit product.]

4.1. Sign and exponent computations

These computations are performed in a straightforward way. The output sign is the logical XOR of the sign bits of both operands:

Sign_out = Sign_in1 xor Sign_in2

The output exponent is given by the addition of both input exponents, adjusted by the bias:

Exp_out = Exp_in1 + Exp_in2 - 1023

For double precision floating point numbers the bias is 1023 (2^(11-1) - 1); in general, the bias of a floating point format is 2^(exponent bits - 1) - 1.

4.2. Exceptional case handling

As defined by the IEEE 754 standard [7,8], several exceptional cases, such as NaN, INFINITY, ZERO, UNDERFLOW and OVERFLOW, can appear in any floating point arithmetic. The main computation is therefore combined with the detection of all exceptional cases, and the final output is determined as defined by the standard. The exceptional cases are handled in line with the Xilinx Core multiplier, which differs slightly from the IEEE 754 standard. For example, if one or both of the operands is infinite, we produce infinity as the output (with the computed sign bit). If either of the input operands is denormalized, the output will be zero (with the respective sign bit). If the output exponent reaches zero or below, UNDERFLOW is activated, and if it goes beyond 11'h7fe (2046 in decimal), OVERFLOW is activated. In addition, when one operand is infinite and the other is denormalized, an INVALID operation is indicated and results in a NaN output.
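The exception policy described above can be summarized in a few lines. The Python sketch below is illustrative only: the flag and helper names are invented for this example, and NaN propagation for NaN inputs is an assumption on top of the cases listed in the text.

    def classify(exp, mant):
        """Classify a double precision operand from its 11-bit exponent and 52-bit mantissa."""
        if exp == 0x7FF:
            return 'nan' if mant else 'inf'
        if exp == 0:
            return 'zero' if mant == 0 else 'denormal'
        return 'normal'

    def exception_flags(exp1, mant1, exp2, mant2, exp_out):
        """Flags following the policy described above (normal-only configuration)."""
        c1, c2 = classify(exp1, mant1), classify(exp2, mant2)
        invalid = 'nan' in (c1, c2) or {c1, c2} >= {'inf', 'denormal'}    # inf x denormal -> NaN
        # NaN inputs propagating to NaN is an assumption, not stated explicitly in the text.
        infinite = not invalid and 'inf' in (c1, c2)
        zero = not invalid and not infinite and ('denormal' in (c1, c2) or 'zero' in (c1, c2))
        underflow = not (invalid or infinite or zero) and exp_out <= 0
        overflow = not (invalid or infinite or zero) and exp_out > 0x7FE  # beyond 11'h7fe = 2046
        return dict(invalid=invalid, infinite=infinite, zero=zero,
                    underflow=underflow, overflow=overflow)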


4.3. Normalization and rounding

The requirement to put the final result back into the 64-bit sign–exponent–mantissa format demands normalization of the result. The mantissa multiplication often produces an extra bit in the MSB before the binary point, and the same situation can appear after rounding; these cases must be fixed to obtain the mandatory format. Whenever an extra carry is generated by the multiplication or the rounding, the product is right-shifted by one bit and the exponent is incremented by one, so that the result is correctly normalized. Similarly, when the operands are denormal, the mantissa product needs to be left-shifted, with the help of a leading-one detector (LOD) circuit, to bring the result into normalized form; the LOD is a priority encoder used to find the leading one in the mantissa product. Rounding is required to trim the 106-bit mantissa multiplication result back to the standard 53 bits. We have implemented only the round-to-nearest rounding specified by the IEEE standard; the other rounding modes can also be used depending on the requirements of the application. Since the purpose of this work is to reduce the expensive multiplication area cost, we have focused on a single simple rounding method, but it can be extended to the other methods. Round to nearest is the most commonly used rounding method, so we prefer to include it; its implementation requires one adder and some slices for AND, OR and NOT operations (to decide whether to add an ULP or not). The other rounding modes are round to zero, round to +inf and round to -inf. Round to zero is simply truncation of the significand to its desired length and thus needs no extra hardware. The remaining two modes also need one adder and some logic for AND, OR and NOT operations (to decide whether to add an ULP or not, based on the sign and round bits), so their hardware requirement is almost the same as for round to nearest. The error behaviour of the designs has been tested with 5 million unique random test cases, generated by a 64-bit Linear Feedback Shift Register (LFSR). The error was found to be at most 1 ULP for the TBM method (due to the omission of some blocks), whereas the 3-PKM and DPdSP designs produce fully correctly rounded results.
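The normalization and rounding step can be sketched as follows. Round-to-nearest-even is assumed here (the IEEE 754 default tie-breaking rule; the text above only specifies round to nearest), and only the normal-operand path is shown:

    def normalize_and_round(prod, exp):
        """Take the 106-bit mantissa product of two normal operands and return a rounded
        53-bit mantissa (hidden bit included) with the adjusted exponent."""
        if prod >> 105:                   # product in [2, 4): one extra MSB
            exp += 1
            shift = 53
        else:                             # product in [1, 2)
            shift = 52
        mant = prod >> shift              # 53 kept bits, hidden bit included
        round_bit = (prod >> (shift - 1)) & 1
        sticky = (prod & ((1 << (shift - 1)) - 1)) != 0
        if round_bit and (sticky or mant & 1):   # round to nearest, ties to even (assumed)
            mant += 1
            if mant >> 53:                # carry out of the 53-bit field:
                mant >>= 1                # right-shift the mantissa by one bit
                exp += 1                  # and increment the exponent
        return mant, exp

    # 1.5 x 1.5 = 2.25: the product has an extra MSB, so the exponent is incremented; no rounding needed.
    m, e = normalize_and_round((3 << 51) * (3 << 51), 0)
    assert m == (1 << 52) + (1 << 49) and e == 1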

5. Results

The designs have been implemented in Verilog-HDL, and synthesized and placed-and-routed on Virtex-2 Pro (xc2vp30-7-ff896), Virtex-4 (V4) (xc4vfx100-12ff1517) and Virtex-5 (V5) (xc5vlx155-3ff1760) FPGAs using Xilinx ISE. Simulation results have been analyzed in ModelSim-SE. The proposed TBM design has a latency of 9 cycles (6 cycles for the mantissa multiplication and 3 cycles for rounding and post-processing). Design-2, the 3-PKM design, has a latency of 11 (V4) and 10 (V5) cycles for normal support and 14 (V4) and 13 (V5) cycles for denormal support: with normal support only, it needs 8 (V4) / 7 (V5) cycles for the mantissa multiplication and 3 cycles for rounding and post-processing, and for denormal support it requires 3 additional cycles for the LOD and dynamic shift registers (denormal-to-normal conversion). For normal support, the DPdSP design has a total latency of 12 cycles for the double precision result (9 cycles for the 66-bit multiplier and 3 cycles for rounding and post-processing) and a latency of 7 cycles for the single precision results (5 cycles for the 34-bit multiplier and 2 cycles for rounding and post-processing). For denormal support, it needs 2 extra cycles for single precision and 3 extra cycles for double precision, for the LOD and dynamic shift registers. All designs have a throughput of one result per clock cycle.

The hardware utilization and performance of the proposed TBM, 3-PKM and DPdSP implementations are shown in Tables 1–3, respectively, on Virtex-2 Pro, Virtex-4 and Virtex-5 FPGAs. All reported hardware results are post place-and-route data from the FPGA design tools. The TBM design is presented for normal support only, due to its inherent precision loss (as discussed in Section 3.1), whereas 3-PKM is reported for both normal and denormal support. The DPdSP design is likewise reported with normal and denormal support; moreover, this design can also be used for the extended versions of single and double precision. As the results show, the TBM design is faster and needs fewer resources than the 3-PKM method, but it has an accuracy loss of 1 ULP, whereas the 3-PKM design is fully accurate. Both designs reduce the number of DSP48 multiplier blocks by 33%, a large saving in total equivalent area compared with previous work, which is discussed further in the next section. The DPdSP implementation requires somewhat more area (compared to a double-precision-only implementation) to provide the capability of dual single (extended) precision computation along with double (extended) precision computation.

Table 1. Implementation details of the TBM method.

                      Virtex-2 Pro   Virtex-4        Virtex-5
Latency               9              9               9
Slices                620            414 (539 a)     522 LUTs, 546 FFs
Freq (MHz)            255            280 (375 a)     440
MULT18x18 / DSP48     6              6               6

a: With speed optimization (-c = 100).

Table 2. Implementation details of the 3-PKM method.

                      Virtex-4   Virtex-5
Normal support
  Latency             11         10
  Slices              704        272 (765 LUTs, 790 FFs)
  Freq (MHz)          320        390
  DSP48               6          6
Denormal support
  Latency             14         13
  Slices              1086       429 (1103 LUTs, 1035 FFs)
  Freq (MHz)          300        335
  DSP48               6          6

Table 3. Implementation details of the DPdSP design.

                      Virtex-4              Virtex-5
Normal support
  Latency             7 (S.P.), 12 (D.P.)
  Slices              943                   467 (1168 LUTs, 1373 FFs)
  Freq (MHz)          260                   336
  DSP48               9                     9
Denormal support
  Latency             9 (S.P.), 15 (D.P.)
  Slices              1550                  662 (1748 LUTs, 1812 FFs)
  Freq (MHz)          255                   335
  DSP48               9                     9


The performance of this design is also promising, and it can be sped up further with extra pipelining.

6. Comparisons

In this section, fair comparisons with the most optimized implementations of double precision floating point multipliers are discussed. The comparisons with the Xilinx [21], Hemmert [18], Govindu [15], Venishetti [11], NEU [16,17] and FloPoCo [39] multipliers are shown in Tables 4–6. Only the implementations of [18,11] support denormalized numbers. The number of block multipliers in [15,16] and Xilinx can also be reduced to 9 at the expense of some extra slices (i.e. six 17 x 2 and one 2 x 2 multiplications can be implemented in logic).

Table 4. Area and timing comparison on the Virtex-2 Pro FPGA.

Design                        Latency      MULT18x18   Slices   Freq (MHz)
Venishetti [11] (Denormal)    11           9           1527     222
Hemmert [18] (Denormal)       6            9           1333     156
-                             11           9           841      206
Govindu [15] (Normal)         12 (Opti.)   16          910      205
Govindu [15] (Normal)         15 (Max.)    16          1019     215
Govindu [15] (Normal)         6 (Min.)     16          491      60
Wang [17] (Normal)            5            9           633      135
Belanovic [16] (Normal)       5            16          1074     98
Xilinx [21] (Normal)          8            16          648      170
TBM (Normal)                  9            6           620      255

Table 5. Area and timing comparison on the Virtex-4 FPGA.

Design                        Latency               DSP48   Slices                   Freq (MHz)
Venishetti [11] (Denormal)    11                    -       2471 (LUT), 1601 (FF)    228
Hemmert [18] (Denormal)       14                    9       737                      274
Banescu [39] (Normal)         16                    10      729                      338
Wang [17] (Normal)            5                     13      1048                     98
Xilinx [21] (Normal)          17                    9       1165                     247
TBM a (Normal)                9                     6       414                      280
TBM b (Normal)                9                     6       539                      375
3-PKM (Normal)                11                    6       704                      320
3-PKM (Denormal)              14                    6       1086                     300
DPdSP (Normal)                7 (S.P.), 12 (D.P.)   9       943                      260
DPdSP (Denormal)              9 (S.P.), 15 (D.P.)   9       1550                     255

a: Area optimized (-c = 0). b: Speed optimized (-c = 100).

Table 6. Area and timing comparison on the Virtex-5 FPGA.

Design                        Latency               DSP48   LUTs   FFs    Freq (MHz)
Xilinx [21,39] (Normal)       18                    10      339    482    319
Banescu [39] (Normal)         14                    9       804    804    407
Banescu [39] (Normal)         13                    9       1184   1080   407
TBM (Normal)                  9                     6       522    546    440
3-PKM (Normal)                10                    6       765    790    390
3-PKM (Denormal)              13                    6       1103   1035   335
DPdSP (Normal)                7 (S.P.), 12 (D.P.)   9       1168   1373   336
DPdSP (Denormal)              9 (S.P.), 15 (D.P.)   9       1748   1812   335


Although our approaches use slightly more slices than [18], they achieve a significant advantage of a 33% reduction in the number of multiplier blocks, together with a much improved operating frequency. In terms of speed, the designs of [18] and Xilinx are close to ours but use more multiplier blocks (MULT18x18). Moreover, [18] and Xilinx manually placed their logic on the FPGA to obtain an optimized design for the targeted platform: Hemmert and Underwood [18] optimized for a particular FPGA platform, whereas our proposed modules are general and can be placed on any platform with similar design metrics. The most important advantages of our designs are the smaller number of multiplier IP cores and the higher operating frequency at a smaller latency. Banescu [39] reports somewhat better speed numbers than our designs, but occupies more area, both in terms of slices and of DSP48 blocks. Moreover, Banescu et al. [39] support only normal numbers, and their implementation results are provided after synthesis rather than after complete place and route. The 3-PKM implementations have higher slice counts, but if we account for the equivalent of the 3 multiplier blocks saved, the design has a significant area advantage. With a simple synthesis in the Xilinx ISE tool, the equivalent hardware of three embedded DSP48 multiplier blocks amounts to 548 slices (1065 LUTs), where a 17 x 17-bit multiplier and a 34-bit adder are taken as the equivalent of one DSP48 block for the present purpose. Thus, from a total equivalent hardware perspective, the presented design offers a considerable hardware saving. Comparing the TBM and 3-PKM designs using the implementation results, TBM has better area and performance numbers than 3-PKM, but it carries the additional accuracy penalty of a 1-ULP loss with respect to the IEEE 754 standard, whereas Design-2 causes no accuracy loss. Moreover, DPdSP (Design-3) takes equivalent or slightly more hardware resources, with similar performance (which can easily be improved further with more pipelining), compared to previously reported double precision multipliers. However, it has the benefit of on-the-fly dual functionality for single (extended) and double (extended) precision computation. This design can also support the extended single and double precision formats, due to its wide multiplier widths (34-bit for S.P. and 66-bit for D.P.), for which it needs only 9 DSP48 blocks rather than the 16 of the traditional method, which is again a major area saving on top of the dual functionality.

7. Conclusions

In this paper, we have presented efficient architectures for the implementation of IEEE 754 double precision floating point multiplication on FPGAs. The three proposed architectures achieve higher performance than previous designs in the literature. By trading off a slight precision loss (a maximum error of 1 ULP) with respect to the IEEE standard, TBM (Design-1) reduces the number of multipliers by at least 33%. In contrast, 3-PKM (Design-2) incurs no precision loss while obtaining the same saving in the number of multipliers. Together, the two proposed designs deliver a significant reduction in resources, assured accuracy and improved performance. Moreover, the third architecture offers reconfigurable dual functionality, covering both double precision and dual single precision (along with their extended precision) multiplication, with promising area and performance numbers and without any pipeline stall. This paper thus provides efficient and concise building blocks for applications using floating point multipliers. Future work includes using the proposed designs in applications (such as FFT and matrix multiplication) to measure the advantage of our


solution, and further optimizing the three proposed architectures. An ASIC-optimized architecture will also be explored, specifically for DPdSP (Design-3), to better assess its benefits.

References

[1] O. Storaasli, R.C. Singleterry, S. Brown, Scientific computation on a NASA reconfigurable hypercomputer, in: MAPLD International Conference, September 2002.
[2] K. Underwood, FPGAs vs. CPUs: trends in peak floating-point performance, in: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (FPGA '04), ACM, New York, NY, USA, 2004, pp. 171–180.
[3] M.K. Jaiswal, N. Chandrachoodan, FPGA-based high-performance and scalable block LU decomposition architecture, IEEE Trans. Comput. 61 (2012) 60–62.
[4] G. Zhi, N. Walid, V. Frank, V. Kees, A quantitative analysis of the speedup factors of FPGAs over processors, in: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (FPGA-04), ACM, New York, NY, USA, 2004, pp. 162–170.
[5] H. Parizi, A. Niktash, A. Kamalizad, N. Bagherzadeh, A reconfigurable architecture for wireless communication systems, in: Third International Conference on Information Technology: New Generations, vol. 0, 2006, pp. 250–255.
[6] S. Kestur, J.D. Davis, E.S. Chung, Towards a universal FPGA matrix-vector multiplication architecture, in: IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM-2012), April 2012, pp. 9–16.
[7] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985, 1985.
[8] IEEE Standard for Floating-Point Arithmetic, Technical Report, August 2008.
[9] S.F. Anderson, J.G. Earle, R.E. Goldschmidt, D.M. Powers, The IBM system/360 model 91: floating-point execution unit, IBM J. Res. Dev. 11 (January) (1967) 34–53.
[10] R. Strzodka, D. Göddeke, Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components, in: Fourteenth Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM '06, 2006, pp. 259–270.
[11] S.K. Venishetti, A. Akoglu, A highly parallel FPGA based IEEE-754 compliant double-precision binary floating-point multiplication algorithm, in: International Conference on Field-Programmable Technology (ICFPT 2007), December 2007, pp. 145–152.
[12] D.W. Sweeney, An analysis of floating-point addition, IBM Syst. J. 4 (March) (1965) 31–42.
[13] K.S. Hemmert, K.D. Underwood, Open source high performance floating-point modules, in: Fourteenth Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM-06), April 2006, pp. 349–350.
[14] M. Huang, L. Wang, T. El-Ghazawi, Accelerating double precision floating-point Hessenberg reduction on FPGA and multicore architectures, in: Symposium on Application Accelerators in High Performance Computing (SAAHPC'10), July 2010.
[15] G. Govindu, L. Zhuo, S. Choi, V. Prasanna, Analysis of high-performance floating-point arithmetic on FPGAs, in: Proceedings of the 18th International Parallel and Distributed Processing Symposium, IEEE, 2004, pp. 149–156.
[16] P. Belanovic, M. Leeser, A library of parameterized floating-point modules and their use, in: Twelfth International Conference on Field-Programmable Logic and Applications (FPL-02), Springer-Verlag, London, UK, September 2002, pp. 657–666.
[17] X. Wang, M. Leeser, VFloat: a variable precision fixed- and floating-point library for reconfigurable hardware, ACM Trans. Reconfigurable Technol. Syst. 3 (September) (2010).
[18] K.S. Hemmert, K.D. Underwood, Fast, efficient floating point adders and multipliers for FPGAs, ACM Trans. Reconfigurable Technol. Syst. 3 (September (3)) (2010).
[19] G. Lienhart, A. Kugel, R. Männer, Using floating-point arithmetic on FPGAs to accelerate scientific N-body simulations, in: Tenth Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'02), IEEE Computer Society, 2002.

[20] S. Paschalakis, P. Lee, Double precision floating-point arithmetic on FPGAs, in: Second IEEE International Conference on Field Programmable Technology (FPT'03), 2003, pp. 352–358.
[21] Xilinx, Xilinx Floating-Point IP Core. [Online]. Available: http://www.xilinx.com.
[22] M. Leeser, VFLOAT: The Northeastern Variable precision FLOATing point library. [Online]. Available: http://www.ece.neu.edu/groups/rcl/projects/floatingpoint/index.html.
[23] Usselmann Rudolf, Floating Point Unit :: Overview. [Online]. Available: http://opencores.com/project,fpu.
[24] A.H.T. Tse, D.B. Thomas, W. Luk, Design exploration of quadrature methods in option pricing, IEEE Trans. VLSI Syst. 99 (April) (2011) 1–9.
[25] S. Kestur, J.D. Davis, O. Williams, BLAS comparison on FPGA, CPU and GPU, in: IEEE Annual Symposium on VLSI, July 2010, pp. 288–293.
[26] M.C. Smith, J.S. Vetter, X. Liang, Accelerating scientific applications with the SRC-6 reconfigurable computer: methodologies and analysis, in: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, April 2005, p. 157b.
[27] V. Aggarwal, A.D. George, K.C. Slatton, Reconfigurable computing with multiscale data fusion for remote sensing, in: Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays (FPGA-06), ACM, New York, NY, USA, 2006, p. 235.
[28] Koohi, N. Bagherzadeh, C. Pan, A fast parallel Reed–Solomon decoder on a reconfigurable architecture, in: First IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, October 2003, pp. 59–64.
[29] W.N. Chelton, M. Benaissa, Fast elliptic curve cryptography on FPGA, IEEE Trans. Very Large Scale Integration (VLSI) Syst. 16 (February (2)) (2008) 198–205.
[30] C.H. Kim, S. Kwon, C.P. Hong, FPGA implementation of high performance elliptic curve cryptographic processor over GF(2^163), J. Syst. Archit. 54 (10) (2008) 893–900.
[31] H.M. Choi, C.P. Hong, C.H. Kim, High performance elliptic curve cryptographic processor over GF(2^163), in: IEEE International Workshop on Electronic Design, Test and Applications, vol. 0, 2008, pp. 290–295.
[32] SRC Supercomputers, 2008. [Online]. Available: http://www.srccomp.com/.
[33] SGI Supercomputers. [Online]. Available: http://www.sgi.com/.
[34] Cray XD1 Supercomputers, 2008. [Online]. Available: http://www.cray.com/.
[35] Convey Computers. [Online]. Available: http://www.conveycomputer.com/.
[36] A. George, H. Lam, G. Stitt, Novo-G: at the forefront of scalable reconfigurable supercomputing, Comput. Sci. Eng. 13 (December (1)) (2010) 82–86.
[37] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd ed., Oxford University Press, New York, 2010.
[38] A. Karatsuba, Y. Ofman, Multiplication of many-digital numbers by automatic computers, in: Proceedings of the USSR Academy of Sciences, vol. 145, 1962, pp. 293–294.
[39] S. Banescu, F. de Dinechin, B. Pasca, R. Tudoran, Multipliers for floating-point double precision and beyond on FPGAs, SIGARCH Comput. Archit. News 38 (January) (2011) 73–79.
[40] D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Comput. Surv. 23 (1) (1991) 5–48.
[41] V. Garofalo, N. Petra, E. Napoli, Analytical calculation of the maximum error for a family of truncated multipliers providing minimum mean square error, IEEE Trans. Comput. 60 (September (9)) (2011) 1366–1371.
[42] M.K. Jaiswal, N. Chandrachoodan, Efficient implementation of IEEE double precision floating-point multiplier on FPGA, in: IEEE Region 10 and the Third International Conference on Industrial and Information Systems (ICIIS-2008), December 2008, pp. 1–4. [Online]. Available: http://dx.doi.org/10.1109/ICIINFS.2008.4798393.
[43] J. Hopf, A parameterizable HandelC divider generator for FPGAs with embedded hardware multipliers, in: Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology, December 2004, pp. 355–358.
[44] J.S. Meredith, G. Alvarez, T.A. Maier, T.C. Schulthess, J.S. Vetter, Accuracy and performance of graphics processors: a Quantum Monte Carlo application case study, Parallel Comput. 35 (3) (2009) 151–163.
[45] M.K. Jaiswal, R.C.C. Cheung, Area-efficient architectures for large integer and quadruple precision floating point multipliers, in: The 20th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, IEEE Computer Society, Los Alamitos, CA, USA, 2012, pp. 25–28.