Microprocessors and Microsystems 37 (2013) 287–298
Contents lists available at SciVerse ScienceDirect
Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro
High performance FPGA-based decimal-to-binary conversion schemes for decimal arithmetic

Osama Al-Khaleel a,⇑, Zakaria Al-Qudah b, Mohammad Al-Khaleel c, Christos Papachristou d

a Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan
b Department of Computer Engineering, Yarmouk University, Irbid, Jordan
c Department of Mathematics, Yarmouk University, Irbid, Jordan
d Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, United States
Article info

Article history: Available online 6 February 2013

Keywords: FPGAs; Decimal arithmetic; BCD; Conversion; LUT; Schemes
Abstract

Although decimal arithmetic has been recognized as more suitable than binary arithmetic for human-centric applications, binary arithmetic is still predominant in today's computers. One approach to bridging this gap involves converting the decimal operands to binary, performing arithmetic in binary, and converting the result back to decimal. Based on this approach, this paper presents novel high-performance decimal-to-binary conversion circuits to support decimal arithmetic over different FPGA families. Our circuits are based on a simple, yet effective idea. Bits of the BCD input are grouped into a number of groups. The contribution of each group to the overall binary result is computed separately. These contributions are then added to form the final binary result. The performance evaluation presented in this paper indicates that the proposed circuits perform significantly better than existing BCD-to-binary conversion circuits. Furthermore, for a given FPGA family, the comparison reveals that certain bit groupings may perform better than others. In addition, we have studied the growth in area and time for each bit-grouping scheme with respect to the number of digits in the BCD input. © 2013 Elsevier B.V. All rights reserved.
⇑ Corresponding author. Tel.: +962 772107130.
E-mail addresses: [email protected] (O. Al-Khaleel), [email protected] (Z. Al-Qudah), [email protected] (M. Al-Khaleel), [email protected] (C. Papachristou).
0141-9331/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.micpro.2013.01.002

1. Introduction

Many of today's applications, such as financial, Internet-based, and scientific applications, handle decimal operands. There are two major approaches for performing arithmetic on decimal operands.1 The first approach directly manipulates these decimal operands (as in [1]), which has the advantage of reducing potential rounding errors [2]. Furthermore, for applications that require extensive processing of decimal operands, direct manipulation of these decimal numbers might promise better performance because it eliminates the BCD-to-binary and binary-to-BCD conversions.

1 We assume Binary-Coded-Decimal (BCD) representation for these decimal operands. Therefore, we use the terms decimal and BCD numbers interchangeably.

The second approach, as illustrated in Fig. 1, involves converting the decimal operands to binary, performing the required arithmetic in binary, and converting the result back to decimal. The advantage of this approach is that it utilizes the already predominant binary arithmetic hardware. Furthermore, some applications convert the operands to binary once, perform several operations on the converted operands, and convert only the final result back to decimal. For these applications, the second approach promises better performance due to the use of the already optimized binary arithmetic hardware.

Fig. 1. BCD arithmetic based on binary hardware. (Flow: BCD operands → BCD2BIN conversion circuits → binary operands → binary arithmetic hardware → binary results → BIN2BCD conversion circuits → BCD results.)

In this paper, we present several high-performance architectures for decimal-to-binary conversion to support decimal arithmetic. We first split the BCD input into several groups of bits. We then compute the contribution of each group to the final result and add the contributions to form the final result. For example, suppose that the BCD operand to be converted to binary is $(346)_{10}$ or $(0011\ 0100\ 0110)_{BCD}$. For a group size of 4 bits (one BCD digit), the digit 6 contributes $(0110)_2$ to the final binary result. The digit 4 contributes $(40)_{10}$ or $(101000)_2$. The digit 3 contributes $(300)_{10}$ or $(100101100)_2$. After obtaining the contribution of each digit in the BCD operand, we add these contributions to obtain the final result. Adding $(0110)_2 + (101000)_2 + (100101100)_2$ in binary yields $(101011010)_2 = (346)_{10}$.

While this approach seems simple, we show in Section 6 that it performs significantly better than existing techniques for converting BCD numbers to their binary equivalents on FPGAs. There are several reasons for this superior performance. First, all the contributions are computed in parallel, and the addition of these contributions is done via the fast carry-chain logic in the FPGA, which results in a fast circuit. Second, the outputs of the contribution generation circuit for a given group are functions of a number of variables equal to the group size. For example, in a 4-bit grouping, the outputs of the group contribution generation are functions of the 4 bits of the group. By choosing a group size that matches the look-up table size of an FPGA family, each function requires only one look-up table (LUT) in the FPGA, which results in a compact overall design.

The rest of this paper is organized as follows: Section 2 discusses techniques for decimal-to-binary conversion and other related work. Section 3 discusses the proposed techniques in detail. Section 4 provides area and delay analysis for the 4-bit grouping scheme. Section 5 discusses the implementation of our schemes on various FPGA architectures. Section 6 discusses the performance of our techniques. We conclude in Section 7.

2. Related work

As mentioned in Section 1, there are two major approaches to decimal arithmetic: direct manipulation of BCD numbers and BCD arithmetic based on binary hardware. For example, the authors in [3,4] present architectures for BCD digit by BCD digit multiplication. Vazquez et al. [1] present an architecture that operates on multiple BCD digits. The authors in [5,6] present techniques for BCD addition/subtraction on two BCD operands, while [7] operates on multiple operands.

The architectures described in [8–10] are examples of BCD arithmetic based on binary hardware (the BCD number is first converted to its binary equivalent, the arithmetic is performed in binary, and the result is converted back to BCD). The authors highlight several techniques to perform decimal-to-binary conversion. For example, one technique is the traditional successive division by two. In this technique, the BCD input is shifted right by one bit position. Each BCD digit in the shifted number is tested. If the BCD digit is greater than or equal to 8, the number 3 is subtracted from the digit [8,11–13], or the most significant bit in the digit is cleared and the number 5 is added to the digit [8]. The procedure is repeated until all bits are generated.

Another method for converting a BCD number to its binary equivalent is direct computation based on the following formula in binary:

$$D_{n-1} \cdots D_1 D_0 = D_{n-1} \times 10^{n-1} + \cdots + D_1 \times 10^{1} + D_0 \times 10^{0}. \tag{1}$$

A disadvantage of converting BCD numbers based on this formula is that it requires multiplication by powers of 10. As the number of digits grows, the size of the multipliers needed grows quickly. The authors in [8] rewrite the above equation using Horner's rule as follows:

$$D = ((\cdots(D_{n-1} \times 10 + D_{n-2}) \times 10 + \cdots) \times 10 + D_0). \tag{2}$$

The authors in [8] also present other arrangements of this formula to increase the parallelism of the computations, as follows:

$$D = ((\cdots(D_{n-1} \times 10 + D_{n-2}) \times 100 + \cdots) \times 100 + (D_1 \times 10 + D_0)), \tag{3}$$

$$D = ((\cdots(D_{n-1} \times 100 + D_{n-2} \times 10 + D_{n-3}) \times 10^{3} + \cdots) \times 10^{3} + (D_2 \times 100 + D_1 \times 10 + D_0)), \tag{4}$$

$$D = (\cdots(((D_{n-1} \times 10 + D_{n-2}) \times 100 + D_{n-3} \times 10 + D_{n-4}) \times 10^{4} + \cdots) \times 10^{4} + ((D_7 \times 10 + D_6) \times 100 + D_5 \times 10 + D_4)) \times 10^{4} + ((D_3 \times 10 + D_2) \times 100 + D_1 \times 10 + D_0). \tag{5}$$

In [13], the authors proposed a method to convert a BCD number to its binary equivalent based on expanding the BCD number and then shifting the individual BCD digits to the left (multiplying by powers of 2) according to their position in the BCD number. For example, the BCD number 76 is expanded to $7 \times 10 + 6 \times 1 = 7 \times (8 + 2) + 6 \times 1 = 7 \times 2^{3} + 7 \times 2^{1} + 6 \times 2^{0}$. This means that the binary equivalent of the BCD number 76 can be obtained by adding 7 shifted to the left 3 times $(0111000)_2$, 7 shifted to the left 1 time $(01110)_2$, and 6 not shifted $(0110)_2$. The binary result is $(0111000)_2 + (01110)_2 + (0110)_2 = (1001100)_2$. The authors in [13] employed 4-bit carry-look-ahead addition in a complex tree structure to design an 8-digit to 27-bit converter.

The authors in [14] proposed a faster implementation for the expansion method of [13]. The method was demonstrated by presenting the implementation of a 7-digit to 24-bit converter. Instead of using 4-bit carry-look-ahead adders to add the bits of the expanded numbers, the bits are carefully grouped according to their positions such that the result of adding the arranged bits within each group does not exceed $(15)_{10} = (1111)_2$. By applying this rule, partial sums are obtained without any carry propagation at this stage of the design. PROMs are used to generate these partial sums. The same approach is followed in the second level of logic. The final result, which is the sum of the outcomes of the first and second stages and any individual bits that do not belong to any group, is obtained in the third stage, where 4-bit carry-look-ahead adders and other logic are used whenever needed.

The BCD-to-binary conversion method presented in [15] employs a code converter that converts consecutive pairs of BCD digits to their binary equivalent. The binary codes are generated using PROMs. The final binary result is obtained by adding up individual bits in the binary codes using the same approach as in [14].
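To make two of the conversion formulas discussed in this section concrete, the following is a minimal software sketch (not the hardware of [8] or [13]) of Horner's rule from Eq. (2) and of the shift-based expansion illustrated with the number 76. The function names are hypothetical and chosen only for illustration; digits are passed most significant first.

```python
def horner_bcd_to_binary(digits):
    """Eq. (2): fold the decimal digits (most significant first) with Horner's rule."""
    value = 0
    for d in digits:
        value = value * 10 + d
    return value


def shift_add_bcd_to_binary(digits):
    """Expansion method: write each power of ten as a sum of powers of two
    and add the correspondingly left-shifted copies of the digit."""
    total = 0
    for position, d in enumerate(reversed(digits)):
        weight = 10 ** position
        for shift in range(weight.bit_length()):
            if (weight >> shift) & 1:   # this power of two appears in 10**position
                total += d << shift     # add the digit shifted left by `shift`
    return total


if __name__ == "__main__":
    print(bin(horner_bcd_to_binary([7, 6])))
    print(bin(shift_add_bcd_to_binary([7, 6])))
```

For the digits 7 and 6, the second function adds 7 shifted left by 3, 7 shifted left by 1, and 6 unshifted, which mirrors the expansion of 76 given above.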
In this paper, we compute the binary equivalent of a BCD number based on the direct formula (1) using a novel method. Instead of using multipliers to compute each term of the formula, we compute the contribution of each digit using direct Boolean functions. These contributions are then added appropriately to form the final binary result. The authors in [10] have employed a similar technique to design a 4-digit BCD-to-binary conversion circuit, which they used in developing an iterative decimal multiplier. However, the circuit they developed does not constitute a generic BCD-to-binary conversion circuit. Furthermore, they did not discuss the implementation and performance of this conversion circuit as a separate module, since their focus was on creating an iterative BCD multiplier. In our case, we focus on designing a generic circuit for parallel BCD-to-binary conversion which can be used in any BCD arithmetic circuit that utilizes binary hardware. While the technique in [15] groups input bits into groups of 8 bits
similar to one of the techniques presented in this work (the 8-bit grouping), the design of the code converter and the addition stages in [15] differs from ours. In addition, the design in [15] always separates the least significant BCD digit and treats it as a group by itself. More importantly, we explore various grouping schemes in addition to the 8-bit grouping. When implementing the work of [14,15] for performance evaluation, we found various typos that we report in Section 6.
3. Proposed architecture

Our scheme is based on splitting the input BCD number into groups of consecutive bits from the least significant position to the most significant position. Throughout this work, each group is denoted as $G_i$, where i is an integer that represents the position of the group, counted from right to left. For example, if there are M groups, then the least significant group is denoted as $G_0$ and the most significant group is denoted as $G_{M-1}$. The binary contribution of each group varies based on its position or index. It is generated using digital hardware that has the individual bits of the group as inputs and the individual bits of the binary contribution as outputs. The binary contributions from all groups are added using a binary addition stage to generate the binary equivalent of the input BCD number. This section discusses three different bit-grouping schemes that have been investigated in this work for BCD-to-binary conversion: 4-bit grouping, 6-bit grouping, and 8-bit grouping.

3.1. 4-Bit grouping

In this scheme the size of the group is 4 bits (1 BCD digit). The scheme is outlined in Fig. 2, where $WD_0$ represents the size of the output of the $D_0$ contribution generator unit, $WD_1$ represents the size of the output of the $D_1$ contribution generator unit, and so on. The BCD input consists of N BCD digits, $D_{N-1}D_{N-2} \cdots D_1D_0$. Each digit is fed to its corresponding contribution generator unit, which computes the contribution of that digit to the final binary result. The contribution of $D_0$ is the same four bits representing $D_0$. For example, $(4)_{10} = (0100)_{BCD}$ contributes $(0100)_2$ to the final binary result. $D_1$ contributes the binary equivalent of $10 \times D_1$ to the final binary result. In general, $D_i$ contributes the binary equivalent of $10^{i} \times D_i$.

We list the contributions of each digit to the final result and derive the Boolean functions that compute these contributions. Table 1 lists these contributions for the digit $D_2$ as an example. For the invalid inputs (greater than 1001), the functions' outputs are don't cares. As shown, $D_2$'s contribution to the final binary result, $wd_2$, is 10 bits in size.

Table 1
Example: D2 contribution generator output functions.

BCD digit    Contribution in binary (wd2)
0000         00 0000 0000
0001         00 0110 0100
0010         00 1100 1000
0011         01 0010 1100
0100         01 1001 0000
0101         01 1111 0100
0110         10 0101 1000
0111         10 1011 1100
1000         11 0010 0000
1001         11 1000 0100

From the table, we can derive the equations for these bits as follows:

$$\begin{aligned}
wd_2[9] &= A_3 + A_2 A_1,\\
wd_2[8] &= \overline{A_2} A_1 A_0 + A_3 + A_2 \overline{A_1},\\
wd_2[7] &= A_3 A_0 + \overline{A_2} A_1 \overline{A_0} + A_2 \overline{A_1} + A_2 A_0,\\
wd_2[6] &= A_1 \overline{A_0} + \overline{A_3}\,\overline{A_1} A_0,\\
wd_2[5] &= \overline{A_3} A_0 + A_3 \overline{A_0},\\
wd_2[4] &= A_2,\quad wd_2[3] = A_1,\quad wd_2[2] = A_0,\quad wd_2[1] = 0,\quad wd_2[0] = 0,
\end{aligned}$$

where $A_3 \cdots A_0$ are the 4-bit BCD representation of $D_2$, an overbar denotes complement, juxtaposition denotes AND, and + denotes OR. These Boolean functions represent the contribution generation box corresponding to $D_2$ among the contribution generation boxes shown in Fig. 2.
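As a sanity check on the D2 contribution equations, the sketch below evaluates one sum-of-products form for each output bit (with the complemented literals chosen so that every valid digit reproduces the corresponding row of Table 1) and compares the result with 100 × digit. This is an illustrative software model only; the function name `wd2` and the variable names for complemented inputs are assumptions made for this example.

```python
def wd2(digit):
    """Evaluate the D2 contribution bit by bit from the BCD inputs A3..A0."""
    a3, a2, a1, a0 = (digit >> 3) & 1, (digit >> 2) & 1, (digit >> 1) & 1, digit & 1
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0  # complemented inputs

    bits = [0] * 10                      # bits[0] and bits[1] stay zero
    bits[9] = a3 | (a2 & a1)
    bits[8] = (n2 & a1 & a0) | a3 | (a2 & n1)
    bits[7] = (a3 & a0) | (n2 & a1 & n0) | (a2 & n1) | (a2 & a0)
    bits[6] = (a1 & n0) | (n3 & n1 & a0)
    bits[5] = (n3 & a0) | (a3 & n0)
    bits[4] = a2
    bits[3] = a1
    bits[2] = a0
    return sum(b << i for i, b in enumerate(bits))


if __name__ == "__main__":
    for d in range(10):                  # only valid BCD digits; 10..15 are don't cares
        print(f"{d:04b} -> {wd2(d):010b}")
```

Each output depends only on the four input bits, which is why every bit of the contribution maps to a single 4-input look-up table.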
Fig. 2. Architecture for the 4-bit grouping scheme. (Each 4-bit digit $D_i$ feeds its own contribution generator, which outputs $WD_i$ bits; the contributions are summed by binary adders organized in a tree structure to produce the W-bit result.)
We note that a digit at position i contributes $wd_i$ to the final binary result, with size $WD_i$, where

$$WD_i = \lfloor \log_2(9 \times 10^{i}) \rfloor + 1 \text{ bits}, \quad i = 0, 1, \ldots, N-1, \tag{6}$$

and the final binary result w requires W bits, where

$$W = \lfloor \log_2(10^{N} - 1) \rfloor + 1 \text{ bits}, \tag{7}$$

where, for any number x, we define $\lfloor x \rfloor = \ell$, with $\ell$ the unique integer such that $\ell \le x < \ell + 1$.

We observe the following characteristics of these contributions. The contribution of a digit at position i has its least significant i bits always equal to zero (the same observation was reported in [10]). In the example for $D_2$ above, the least significant 2 bits are equal to zero for all possible input combinations. In the contributions addition stage, these bits are not added to other contributions, which results in smaller adders. Some other bits of the contributions are always zero as well for all possible input combinations. For example, for $D_5$, bit number 14 of the contribution is always zero. We note, however, that the number of these zero bits is small. Furthermore, the contribution generation Boolean equations are functions of 4 bits ($A_3 \cdots A_0$). Therefore, each of these functions fits into a single look-up table on FPGAs with 4-input LUTs, which results in an overall design with a small area.

We note that the contribution generation for all digits is done in parallel. The next stage is to add these contributions to form the final binary result. We organize the adders in a tree-like structure to speed up the addition. Further, the fast carry chain logic available in the FPGA is used for speedy addition. An example to illustrate the 4-bit grouping approach is shown in Fig. 3. In this example, the input size is 9 BCD digits (i.e., nine 4-bit groups). The binary contribution of each group is first generated, and then all these contributions are added using the binary addition stage to generate the binary equivalent.
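The widths in Eqs. (6) and (7) and the always-zero bits can be checked numerically. The sketch below is an illustration under the assumption that Python's `int.bit_length()` is used in place of $\lfloor \log_2 x \rfloor + 1$ (the two agree for positive integers and avoid floating-point rounding); the helper names are hypothetical.

```python
def contribution_width(i):
    """WD_i from Eq. (6): bits needed for the largest contribution 9 * 10**i."""
    return (9 * 10 ** i).bit_length()


def result_width(n):
    """W from Eq. (7): bits needed for the largest N-digit value 10**N - 1."""
    return (10 ** n - 1).bit_length()


def always_zero_bits(i):
    """Bit positions of the digit-i contribution that are zero for every digit 0..9."""
    seen = 0
    for d in range(10):
        seen |= d * 10 ** i              # OR together all ten possible contributions
    return [b for b in range(contribution_width(i)) if not (seen >> b) & 1]


if __name__ == "__main__":
    print([contribution_width(i) for i in range(4)])
    print(result_width(9))
    print(always_zero_bits(2), always_zero_bits(5))
```

The low i bits vanish because $10^{i} = 2^{i} \times 5^{i}$, so every contribution of digit i is a multiple of $2^{i}$.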
3.2. 6-Bit grouping

In this scheme, the size of each group is 6 bits (1.5 BCD digits). The BCD input consists of N BCD digits $D_{N-1}D_{N-2} \ldots D_1D_0$. The groups are referenced as $G_0, G_1, G_2, \ldots, G_m$, where $m = \lceil 2N/3 \rceil - 1$. For an even integer i, the least significant four bits of group $G_i$ are the BCD digit $D_{3i/2}$, while the most significant two bits of group $G_i$ are the least significant two bits of the BCD digit $D_{3i/2+1}$. For an odd integer i, the least significant two bits of group $G_i$ are the most significant two bits of the BCD digit $D_{(3i-1)/2}$, while the most significant four bits of group $G_i$ are the BCD digit $D_{(3i-1)/2+1}$.

Given a group $G_i$, if i is even then the greatest decimal value of the group is $39 \times 10^{3i/2}$. This is because the most significant two bits of the group come from the least significant two bits of a BCD digit. Therefore, they can only be 00, 01, 10, or 11 in binary (0, 1, 2, or 3 in decimal), whereas the least significant four bits can be any BCD digit (0–9). On the other hand, if i is odd then the greatest decimal value of the group is $98 \times 10^{(3i-1)/2}$. This is because the least significant two bits of the group come from the most significant two bits of a BCD digit. Therefore, they can take the values 00, 01, and 10 in binary (0, 4, and 8 in decimal), whereas the most significant four bits can be any BCD digit (0–9). It should be pointed out that the size of the most significant group in this scheme can be 6 bits, 4 bits, or 2 bits, according to the number of BCD digits in the input.

When converting the decimal equivalent of each group to binary, the least significant K bits of the binary equivalent are zeros. K increases according to the position of the group from right to left. For a given group $G_i$, $K_i$ is calculated using the following equation:

$$K_i = \begin{cases} \dfrac{3i}{2}, & i \text{ even},\\[2mm] \dfrac{3i+1}{2}, & i \text{ odd}. \end{cases} \tag{8}$$
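Eq. (8) can be cross-checked by brute force. In the sketch below (an illustration, not part of the proposed hardware), each BCD bit at absolute position p is assigned the decimal weight $2^{p \bmod 4} \times 10^{\lfloor p/4 \rfloor}$, and the number of guaranteed trailing zeros of a 6-bit group is taken as the smallest number of trailing zeros among the weights of its bits. The function names are hypothetical.

```python
def bit_weight(p):
    """Decimal weight of BCD input bit p: 2**(p mod 4) * 10**(p div 4)."""
    return (1 << (p % 4)) * 10 ** (p // 4)


def k_formula(i):
    """K_i from Eq. (8)."""
    return 3 * i // 2 if i % 2 == 0 else (3 * i + 1) // 2


def guaranteed_zero_bits(i, group_size=6):
    """Low-order bits that are zero in every contribution of group G_i.

    Every contribution is a sum of bit weights, so it is divisible by the
    largest power of two that divides all weights in the group."""
    weights = [bit_weight(group_size * i + j) for j in range(group_size)]
    return min((w & -w).bit_length() - 1 for w in weights)  # trailing zeros of each weight


if __name__ == "__main__":
    for i in range(6):
        print(i, guaranteed_zero_bits(i), k_formula(i))
```

For $G_1$, for instance, the weights are 40, 80, 100, 200, 400, and 800, all multiples of 4, which agrees with $K_1 = 2$.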
Fig. 3. An example to illustrate the approach for the case of 4-bit grouping (1 BCD digit). (The 9-digit input D8 ... D0 = 1001 0111 0110 0010 0100 0000 0101 1001 0010 is split into nine 4-bit groups with decimal equivalents 9×10^8, 7×10^7, 6×10^6, 2×10^5, 4×10^4, 0×10^3, 5×10^2, 9×10^1, and 2×10^0; each is converted to binary and the contributions are added to form the 30-bit Binary_out.)
Fig. 4. An example to illustrate the approach for the case of 6-bit grouping (1.5 BCD digits). (The same 9-digit input, D8 ... D0 = 1001 0111 0110 0010 0100 0000 0101 1001 0010, is split into six 6-bit groups G5 ... G0 = 100101, 110110, 001001, 000000, 010110, 010010 with decimal equivalents 94×10^7, 36×10^6, 24×10^4, 00×10^3, 58×10^1, and 12×10^0; each is converted to binary and the contributions are added to form the 30-bit Binary_out.)
Also, the size of the binary equivalent of each group increases according to the position of the group from right to left. If the size of the binary equivalent of group Gi is WGi, then WGi is calculated using the following equation:
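$$WG_i = \begin{cases} \lfloor \log_2(39 \times 10^{3i/2}) \rfloor + 1, & i \text{ even},\\[2mm] \lfloor \log_2(98 \times 10^{(3i-1)/2}) \rfloor + 1, & i \text{ odd}. \end{cases} \tag{9}$$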
The Boolean equation of each bit in the binary equivalent of group Gi is formulated based on the different combinations of the group and its position among all groups. An example to illustrate the 6-bit grouping approach is shown in Fig. 4. In this example, the input size is 9 BCD digits (i.e., six 6-bit groups). The decimal equivalent of each group is formed based on the 2 and 4 bits that compose the group as has been mentioned. The binary contribution of each group is first generated and then all these contributions are added using the binary addition stage to generate the binary equivalent.
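The grouping idea can be modeled in software for any group size. The sketch below is a behavioral illustration only (it enumerates bit weights rather than implementing per-bit Boolean functions in LUTs); `bcd_encode` and `group_contributions` are hypothetical helper names. It uses the 9-digit input 976240592 that appears in Figs. 3 and 4.

```python
def bcd_encode(value, n_digits):
    """Pack the decimal digits of `value` into a BCD bit vector, D0 in the low nibble."""
    bcd = 0
    for i in range(n_digits):
        bcd |= ((value // 10 ** i) % 10) << (4 * i)
    return bcd


def bit_weight(p):
    """Decimal weight of BCD input bit p: 2**(p mod 4) * 10**(p div 4)."""
    return (1 << (p % 4)) * 10 ** (p // 4)


def group_contributions(bcd, n_digits, group_size):
    """Contribution of each group G0, G1, ... (least significant group first)."""
    total_bits = 4 * n_digits
    contributions = []
    for start in range(0, total_bits, group_size):
        stop = min(start + group_size, total_bits)   # the top group may be narrower
        contributions.append(
            sum(bit_weight(p) for p in range(start, stop) if (bcd >> p) & 1)
        )
    return contributions


if __name__ == "__main__":
    bcd = bcd_encode(976240592, 9)
    for g in (4, 6, 8):
        parts = group_contributions(bcd, 9, g)
        print(g, parts, sum(parts))
```

With a group size of 6, the six contributions correspond to the decimal equivalents listed in Fig. 4, and for every group size the contributions sum to the original value.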
3.3. 8-Bit grouping

In this approach the size of the group is 8 bits (2 BCD digits). The number of groups in this case is $\lceil N/2 \rceil$. Group $G_i$ is composed of the BCD digit $D_{2i}$, which lies on the right-hand side of the group, and the BCD digit $D_{2i+1}$, which lies on the left-hand side of the group. The largest decimal equivalent of $G_i$ is $99 \times 10^{2i}$. If the size of the binary equivalent of group $G_i$ is $WG_i$, then $WG_i$ is given by $\lfloor \log_2(99 \times 10^{2i}) \rfloor + 1$, and the number of least significant bits in the binary equivalent of group $G_i$ that are zeros is $2i$. The size of the most significant group in this scheme can be 8 bits or 4 bits, according to the number of BCD digits in the input.

4. Area and delay analysis for the 4-bit grouping scheme

This section presents an estimate of the area and delay for the implementation of the 4-bit grouping scheme on 4-input LUT FPGAs. A similar discussion can easily be derived for the other two grouping schemes (6-bit and 8-bit on 6-input LUT FPGAs and 8-input LUT FPGAs, respectively).

As mentioned before, the proposed BCD-to-binary converter using the 4-bit grouping has two main stages: the BCD digits' contributions generation stage and the binary addition stage. Each BCD digit consists of 4 bits, and therefore the individual bits of the contribution of a BCD digit are computed based on its four bits. Hence, the Boolean logic functions of the contribution generator blocks of Fig. 2 are functions of four variables. This means that, when targeting an FPGA family with 4-input or larger LUTs, the contributions generation stage is implemented with a single LUT per Boolean function. Since all of these functions are computed concurrently, the logic delay of the contributions generation stage is equal to the delay of a single LUT. If the logic delay of a single LUT is $\Delta T_{LUT}$, then:

$$CG_{logic\_delay} = \Delta T_{LUT}, \tag{10}$$

where $CG_{logic\_delay}$ is the logic delay of the contributions generation stage. It should be noted that $CG_{logic\_delay}$ is independent of the number of BCD digits.

To estimate the area of the contributions generation stage, the number of Boolean functions that come out of this stage is computed using Eq. (6) by substituting the values of i and then adding up the number of Boolean functions per contribution generator block. Observing that the least significant i bits of the contribution of the ith BCD digit are always zeros, for N BCD digits the total number of Boolean functions (NBF) is calculated as follows:
$$\begin{aligned}
NBF &= [(\lfloor \log_2(9 \times 10^{0}) \rfloor + 1) - 0] + [(\lfloor \log_2(9 \times 10^{1}) \rfloor + 1) - 1] + \cdots\\
&\quad + [(\lfloor \log_2(9 \times 10^{N-2}) \rfloor + 1) - (N-2)] + [(\lfloor \log_2(9 \times 10^{N-1}) \rfloor + 1) - (N-1)]\\
&= \sum_{i=0}^{N-1} \{(\lfloor \log_2(9 \times 10^{i}) \rfloor + 1) - i\}.
\end{aligned} \tag{11}$$

Since each of these functions can be mapped to a single LUT, the total area of the contributions generation stage ($CG_{area}$) can be calculated as:

$$CG_{area} = NBF \times \Delta Area_{LUT}, \tag{12}$$

where $\Delta Area_{LUT}$ is the area of a single LUT.

In the binary addition stage, the adders are arranged in a tree structure. For simplicity, let us assume that N is an integer power of two. Then the number of levels of the tree is $\log_2(N)$. The number of adders in each level from top to bottom is $N/2, N/4, N/8, \ldots, 2, 1$. The size of the adders in the same level increases from right to left. Therefore, the logic delay of the addition stage is dominated by the delay of the group of adders located at the leftmost side of the architecture of Fig. 2. In FPGAs, binary adders are implemented using the fast carry chain. An n-bit binary adder consumes n LUTs (n stages of the fast carry chain). The size of the leftmost adder in the top level of the addition stage is equal to the number of Boolean functions from the $D_{N-1}$ contribution generator block of Fig. 2, which is $(\lfloor \log_2(9 \times 10^{N-1}) \rfloor + 1)$. If we assume that the size of the leftmost adder in each level is the same as the size of this adder (which is the worst case), and if we assume that the delay of a single stage in the fast carry chain is $\Delta T_{FCC}$, then the estimated logic delay of the binary addition stage in the proposed architecture ($BA_{logic\_delay}$) is calculated using the following equation:

$$BA_{logic\_delay} = \log_2(N) \times (\lfloor \log_2(9 \times 10^{N-1}) \rfloor + 1) \times \Delta T_{FCC}. \tag{13}$$

To estimate the area of the binary addition stage, one should observe that in the top level of the tree, the contributions of two consecutive BCD digits are added using one adder, and the adder size is dominated by the size of the contribution of the digit in the odd position (given that positions start from 0). Since the adders are implemented using the fast carry chain, the number of LUTs required to implement a level in the addition stage is equal to the sum of the adder sizes in that level. Based on this, and using Eq. (6), the total number of LUTs (NLUTs) of the binary addition stage is as follows:

$$\begin{aligned}
NLUTs &= \sum_{i=0}^{\frac{N}{2}-1} \{(\lfloor \log_2(9 \times 10^{2i+1}) \rfloor + 1) - (2i+1)\}\\
&\quad + \sum_{i=0}^{\frac{N}{4}-1} \{(\lfloor \log_2(9 \times 10^{4i+3}) \rfloor + 1) - (4i+3)\} + \cdots\\
&\quad + [(\lfloor \log_2(9 \times 10^{N-1}) \rfloor + 1) - (N-1)].
\end{aligned} \tag{14}$$

Therefore, the area of the binary addition stage ($BA_{area}$) in terms of $\Delta Area_{LUT}$ is:

$$BA_{area} = NLUTs \times \Delta Area_{LUT}. \tag{15}$$

Based on Eqs. (10) and (13), the time delay ($Archi_{time}$) of the proposed architecture is estimated as follows:

$$Archi_{time} = CG_{logic\_delay} + BA_{logic\_delay} + \Delta T_{routing}, \tag{16}$$

where $\Delta T_{routing}$ is the routing delay after the placement and routing of the design in the FPGA. The estimated area (in terms of LUTs) of the proposed architecture ($Archi_{area}$) is calculated using Eqs. (12) and (15) as follows:

$$Archi_{area} = CG_{area} + BA_{area}. \tag{17}$$

Note that Eq. (17) does not account for the area needed for the routing logic, as the area in the equation is measured in terms of the number of LUTs. Furthermore, we note that some of the contribution bits might be functions of one variable and therefore do not require a LUT. Therefore, Eq. (17) represents an upper bound on the area (in number of LUTs).
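The estimates of Eqs. (11), (13), (14), (16), and (17) can be evaluated with a few lines of code. The following sketch assumes N is a power of two, counts area in LUTs (i.e., it takes the area of one LUT as the unit), and treats the per-LUT, per-carry-stage, and routing delays as caller-supplied parameters; all function and parameter names are hypothetical.

```python
def wd(i):
    """WD_i from Eq. (6), computed exactly with integer arithmetic."""
    return (9 * 10 ** i).bit_length()


def nbf(n):
    """Eq. (11): Boolean functions in the contributions generation stage."""
    return sum(wd(i) - i for i in range(n))


def adder_luts(n):
    """Eq. (14): LUTs in the adder tree, assuming n is a power of two."""
    total, step = 0, 2
    while step <= n:                       # one iteration per tree level
        for i in range(n // step):
            pos = step * i + step - 1      # 2i+1, then 4i+3, ..., finally n-1
            total += wd(pos) - pos
        step *= 2
    return total


def ba_logic_delay(n, dt_fcc):
    """Eq. (13): worst-case adder-tree logic delay for n a power of two."""
    levels = n.bit_length() - 1            # log2(n)
    return levels * wd(n - 1) * dt_fcc


def archi_time(n, dt_lut, dt_fcc, dt_routing):
    """Eq. (16): one LUT delay plus adder-tree delay plus routing delay."""
    return dt_lut + ba_logic_delay(n, dt_fcc) + dt_routing


def archi_area_luts(n):
    """Eq. (17) expressed in LUTs (upper bound)."""
    return nbf(n) + adder_luts(n)


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, nbf(n), adder_luts(n), archi_area_luts(n))
```

Since Eq. (17) is stated as an upper bound, the values printed by this sketch should be read as ceilings on the LUT count rather than as predictions of synthesis results.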
5. FPGA implementation

The implementation of our various bit-grouping schemes varies from one FPGA family to another based on the size of the look-up table (LUT) and the fabrication technology of the FPGA family. For example, a logic function of 4 variables fits into a single 4-input LUT. On the other hand, a 6-variable function requires a hierarchy of 4-input LUTs to be implemented on 4-input LUT FPGAs. A 4-variable function would fit in a single 6-input LUT, but the utilization of the LUT would be low because only 25% of the LUT is used and the rest is wasted (i.e., a 4-input function has 16 different input combinations, whereas the 6-input LUT has the capacity for 64 combinations).

Fig. 5 shows the expected implementation of a four-BCD-digit to binary converter on 4-input LUT FPGAs. In this figure, the implementation of the binary contribution of digit D1 or G1 (D1_Cont) is shown in detail, where each bit in this contribution is generated using a single 4-input LUT, as each bit is a function of four variables (the bits of D1). The least significant bit of D1_Cont is zero. All binary contributions of the BCD digits are added using the binary addition stage, which employs the dedicated fast carry chain in the FPGA.

The implementation of the 6-bit grouping scheme on 4-input LUT FPGAs requires each bit in the binary contributions of the groups to be implemented using a hierarchy of 4-input LUTs, since each bit is a function of six variables. However, the number of levels in the addition stage and the number of adders in each level are reduced by approximately a factor of 1.5 when compared to the case of 4-bit grouping. On the other hand, the adders in the addition stage become larger than those in the 4-bit grouping. The implementation example of the four-BCD-digit to binary converter using 6-bit grouping on 4-input LUT FPGAs is illustrated in Fig. 6. Again, we show the implementation of the bits of the binary contribution of G1 (G1_Cont) in detail. The least significant two bits (G1_Cont[0] and G1_Cont[1]) are both zeros.

The implementation of the 4-bit grouping on 6-input LUT FPGAs is illustrated in Fig. 7. In this case, the binary addition stage is similar to that of the 4-bit grouping on 4-input LUTs. The drawback here is that the utilization of the 6-input LUTs used to implement the binary contributions generation stage is low, because two out of the six inputs of the look-up table are unused.

The implementation of the 4-digit BCD-to-binary converter using 6-bit grouping on 6-input LUT FPGAs produces an addition stage similar to that of the 6-bit grouping on 4-input LUT FPGAs. However, in this case each bit in the binary contribution generator is implemented using one 6-input LUT, as opposed to the case of 6-bit grouping over 4-input LUTs, in which a hierarchy of LUTs is needed. An illustration of this implementation is shown in Fig. 8. The implementation of the binary contribution of the group G1 (G1_Cont) is shown in detail.

In the case of 8-bit grouping, both 4-input and 6-input LUT FPGAs require a hierarchy of LUTs to implement the contributions generation stage. However, the number of levels in the addition stage as well as the number of adders in each level are reduced by a factor of two (compared to 4-bit grouping) at the expense of larger adders. In the next section, we experimentally evaluate the different bit-grouping schemes on FPGAs with different look-up table sizes and fabrication technologies.
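The mapping of one contribution bit to one LUT, and the utilization figure quoted above, can be illustrated as follows. This sketch builds the truth table of a single contribution bit as an integer mask indexed by the BCD digit value, with the invalid codes 10 to 15 (don't cares) arbitrarily set to 0; it is an assumption-laden illustration and not the output format of any particular synthesis tool. The function names are hypothetical.

```python
def lut_truth_table(position, bit):
    """Truth table of bit `bit` of the digit-`position` contribution.

    Entry d of the returned mask is bit `bit` of d * 10**position for the
    valid BCD digits 0..9; don't-care entries 10..15 are left at 0 here."""
    mask = 0
    for d in range(10):
        if ((d * 10 ** position) >> bit) & 1:
            mask |= 1 << d
    return mask


def lut_utilization(variables, lut_inputs):
    """Fraction of a LUT's entries used by a function of `variables` inputs."""
    return 2 ** variables / 2 ** lut_inputs


if __name__ == "__main__":
    for bit in range(7):                   # D1_Cont is 7 bits wide
        print(bit, format(lut_truth_table(1, bit), "016b"))
    print(lut_utilization(4, 6))
```

For D1, bit 0 has an all-zero truth table, matching the statement that the least significant bit of D1_Cont is zero, and bit 1 simply reproduces the least significant input bit.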
Fig. 5. The implementation of 4-digit BCD to binary conversion using 4-bit groups on 4-input look-up tables (LUTs-4). (Each bit D1_Cont[1] ... D1_Cont[6] is produced by one LUT-4 fed by D1[0] ... D1[3], with D1_Cont[0] tied to 0; D0_Cont, D1_Cont, D2_Cont, and D3_Cont are 4, 7, 10, and 14 bits wide and are summed by binary adders using the fast carry chain to produce Binary_Out.)
Fig. 6. The implementation of 4-digit BCD to binary conversion using 6-bit groups on 4-input look-up tables (LUTs-4). (Each bit G1_Cont[2] ... G1_Cont[9] is produced by a hierarchy of LUTs-4 fed by G1[0] ... G1[5], with the two least significant bits tied to 0; G0_Cont, G1_Cont, and G2_Cont are 6, 10, and 14 bits wide and are summed by binary adders using the fast carry chain to produce Binary_Out.)
Fig. 7. The implementation of 4-digit BCD to binary conversion using 4-bit groups on 6-input look-up tables (LUTs-6). (Each bit D1_Cont[1] ... D1_Cont[6] is produced by one LUT-6 fed by D1[0] ... D1[3], with D1_Cont[0] tied to 0; D0_Cont, D1_Cont, D2_Cont, and D3_Cont are 4, 7, 10, and 14 bits wide and are summed by binary adders using the fast carry chain to produce Binary_Out.)

Fig. 8. The implementation of 4-digit BCD to binary conversion using 6-bit groups on 6-input look-up tables (LUTs-6). (Each bit G1_Cont[2] ... G1_Cont[9] is produced by one LUT-6 fed by G1[0] ... G1[5], with the two least significant bits tied to 0; G0_Cont, G1_Cont, and G2_Cont are 6, 10, and 14 bits wide and are summed by binary adders using the fast carry chain to produce Binary_Out.)

6. Results

In this section, we first compare the performance of the proposed 4-bit grouping (one BCD digit grouping) scheme to other existing schemes. We then compare the performance of the different grouping schemes (4-bit or 1 BCD digit, 6-bit or 1.5 BCD digits, and 8-bit or 2 BCD digits) on a variety of FPGA devices (devices with 4-input and 6-input LUTs).

6.1. 4-Bit grouping scheme versus existing schemes

We first select the 4-bit grouping scheme for comparison with existing schemes, since most of these schemes are implemented on 4-input LUT FPGAs, and we believe that the best performance will be obtained with 4-bit grouping in this case. We have implemented the 4-bit grouping scheme along with six different existing schemes for BCD-to-binary conversion (discussed in Section 2) using the Verilog HDL dataflow model. All schemes are functionally verified by simulation. The code of each scheme is then synthesized using the Xilinx ISE 10.1 CAD tool targeting the Xilinx Virtex-4 SX35-12 FPGA. Table 2 shows the area results and Table 3 shows the delay results. As shown, our scheme achieves between 31% and 65% reduction in delay for various input sizes (number of BCD
digits) when compared to the best of the other schemes. Furthermore, when compared to the best of the other schemes, our scheme achieves 33%, 20%, 17%, and 14% reduction in area for operands of two, four, eight, and 16 digits, respectively.

Table 2
Comparison of resource utilization of our scheme (the 4-bit grouping) versus other schemes (in number of LUTs).

Design                 Number of digits
                       2       4       8       16
Our 4-bit grouping     6       34      141     538
ShiftAdd5 [8]          20      133     575     2429
ShiftSub3 [8,11–13]    22      142     653     2792
Eq. (2) [8]            11      48      205     838
Eq. (3) [8]            9       43      173     671
Eq. (4) [8]            9       54      236     915
Eq. (5) [8]            9       43      170     627

Table 3
Comparison of delay of our scheme (the 4-bit grouping) versus other schemes (in ns).

Design                 Number of digits
                       2       4       8       16
Our 4-bit grouping     1.02    2.99    4.49    6.46
ShiftAdd5 [8]          5.06    14.41   30.33   63.40
ShiftSub3 [8,11–13]    5.06    14.41   31.11   64.18
Eq. (2) [8]            2.03    6.49    15.45   33.42
Eq. (3) [8]            1.99    5.42    12.12   25.82
Eq. (4) [8]            1.99    4.33    11.18   22.81
Eq. (5) [8]            1.99    5.42    10.00   19.39

We then used the same implementation environment (Xilinx ISE 10.1 CAD tool targeting a Xilinx Virtex-4 SX35-12 FPGA) to implement the designs presented in [13–15] and compared their results with ours. In the implementation of the design of [13], we used the fast carry chain instead of the 4-bit carry-lookahead adders, which improves the performance of their scheme. We did not attempt to design larger conversion circuits based on their approach, since the design complexity increases dramatically as the BCD input size increases. Therefore, we implement an 8-digit BCD-to-binary converter based on their approach and compare it with an 8-digit BCD-to-binary converter using our 4-bit grouping approach. The results are shown in Table 4. Although we used the fast carry chain in implementing the design of [13], our 4-bit grouping scheme achieves a 9.6% reduction in size and an 18% speedup.

Table 4
Performance comparison between our scheme (the 4-bit grouping) and [13] for an 8-digit BCD to binary converter.

Design                 Area (LUTs)    Delay (ns)
[13]’s Design          156            5.47
Our 4-bit grouping     141            4.49

For [14],² instead of using the old PROM-based technology that they suggest, we use 4-bit adders structured in a way similar to their approach. Again, we implement only a 7-digit BCD-to-binary converter based on their approach, since implementing larger converters involves a significant design effort to find the appropriate groups. We then compare their scheme with a 7-digit BCD-to-binary converter based on our 4-bit grouping. The results are shown in Table 5. Our 4-bit grouping scheme achieves about 33% reduction in area and about 49% speedup.

Table 5
Performance comparison between our scheme (the 4-bit grouping) and [14] for a 7-digit BCD to binary converter.

Design                 Area (LUTs)    Delay (ns)
[14]’s Design          152            8.57
Our 4-bit grouping     102            4.38

For the case of [15], we replaced the PROMs with Boolean equations but preserved the same addition stage structure that the authors of [15] presented. Table 6 shows the results.³ Our 4-bit grouping scheme achieves 41% and 48% speedup for 8-digit and 16-digit BCD-to-binary converters, respectively. Furthermore, 67% and 73% reductions in area are achieved for 8-digit and 16-digit BCD-to-binary converters, respectively.

Table 6
Performance comparison between our scheme (the 4-bit grouping) and [15].

Design                 Delay (ns)             Area (LUTs)
                       8-Digit    16-Digit    8-Digit    16-Digit
Our 4-bit grouping     4.49       6.46        141        538
[15]’s Design          7.66       12.37       426        2005

² We note that the scheme as described in Fig. 3 in [14] involves a typo. Specifically, it is not clear where bits E2 and E1 at position 9 and bits E2, E1, F8, and F2 at position 10 should be added. We found two possible ways to account for these bits: either individually or with group P7, since adding these bits to P7 in the first stage of the design preserves the property that the group sum does not exceed 15 (i.e., no carry out). We have verified the correctness of both options and found that the best results are achieved when the bits are included in P7. Therefore, we report these results.

³ The design they provided also includes several typos. (i) In Fig. 3B in [15], at position 31, the bits g20 and h18 should not be included in group z4, to ensure that the binary addition of the group does not generate a carry. Also, these bits have been included in group N in Fig. 4B. (ii) In Fig. 4B, the bit c4 at position 11 in group W should be replaced with e4. (iii) Bit z44 at position 32 in group N should be replaced by z41. (iv) Bit x42 at position 33 in group M should be replaced by y42. We note that we discovered these typos while implementing their approach up to 16 digits. Therefore, other typos may exist for larger converters.

6.2. Comparisons of the various grouping schemes

Fig. 9. Area comparison of the implementations of a 16-digit BCD to binary converter using the three grouping schemes and targeting different FPGA families. [Bar chart: area (number of LUTs) of the 1, 1.5, and 2 BCD digit groupings on Virtex-4, Virtex-5, Virtex-6, and Virtex-7.]

Fig. 10. Delay comparison of the implementations of a 16-digit BCD to binary converter using the three grouping schemes and targeting different FPGA families. [Bar chart: delay (ns) of the 1, 1.5, and 2 BCD digit groupings on Virtex-4, Virtex-5, Virtex-6, and Virtex-7.]

The performance of a particular scheme on a particular FPGA architecture is affected by the performance of the contributions generation stage and the performance of the addition stage. Larger bit grouping may require a hierarchy of LUTs on FPGAs with smaller LUTs, which may result in poor contributions generation performance. On the other hand, larger grouping requires fewer levels in the addition stage and fewer (but wider) adders in each level. Conversely, smaller bit grouping may allow each bit of the contributions generation stage to fit into one LUT (i.e., may eliminate the need for a LUT hierarchy), at the cost of more levels and more (but narrower) adders per level in the addition stage. Therefore, it is not clear beforehand which bit grouping will win
on a particular architecture, especially since the routing delay may contribute significantly to the overall system delay.

To evaluate the performance of our various schemes (i.e., various bit groupings) on a variety of FPGA architectures, we implement (in addition to the 4-bit grouping scheme) the 6-bit grouping (i.e., 1.5 BCD digits) and the 8-bit grouping (i.e., 2 BCD digits) in Verilog HDL for a 16-digit BCD input, using Verilog data-flow modeling. Each of these three schemes (1, 1.5, and 2 BCD digit groupings) has been synthesized on an FPGA device with 4-input LUTs (Xilinx Virtex-4 xc4vlx200-11-ff1513) and on three FPGA devices with 6-input LUTs (Xilinx Virtex-5 xc5vlx330t-2-ff1738, Xilinx Virtex-6 xc6vlx760-2-ff1760, and Xilinx Virtex-7 xc7v2000t-2-ffg1925). The area results are shown in Fig. 9 and the delay results are shown in Fig. 10.

As shown in Fig. 9, on Virtex-4, the best bit grouping is one BCD digit, followed by 1.5 and then 2 BCD digits. This indicates that the performance gains achieved by fitting each bit of the contributions into one LUT outweigh the performance losses caused by a larger binary addition hierarchy. The same observation holds for the delay performance of the schemes, as shown in Fig. 10.

On 6-input LUT FPGAs (Virtex-5, Virtex-6, and Virtex-7), the delay and area (in number of LUTs) of the 6-bit grouping are broadly similar to those of the 4-bit grouping. We note, though, that in the case of 4-bit grouping on 6-input LUTs, the utilization of the LUTs is very low. While the 6-bit grouping beats the 4-bit grouping in terms of area on the three 6-input LUT FPGAs we use, the delay of the 6-bit grouping is better in some cases and worse in others than that of the 4-bit grouping. This indicates that some FPGA devices optimize the routing delay better than others for our architectures. For example, despite the fact that 6-bit grouping results in fewer levels in the binary addition stage than the 4-bit grouping, the overall delay of the 4-bit grouping is better than that of the 6-bit grouping on Virtex-5. The same does not hold for Virtex-6 and Virtex-7.

For the 8-bit grouping scheme, a hierarchy of LUTs is needed for the contributions generation stage on all the FPGA devices we use. As a general observation, the scheme works better on 6-input LUT FPGAs than on 4-input LUT FPGAs. Among the 6-input LUT devices, the scheme shows similar performance on Virtex-6 and Virtex-7 FPGAs. However, the scheme occupies significantly smaller area on Virtex-5, with slightly more delay than on Virtex-6 and Virtex-7.

Fig. 11. Growth in delay and area of the 4-bit grouping scheme as the number of BCD digits to be converted to binary grows. [Two line plots: delay (ns) and area (LUTs) versus number of BCD digits, one curve each for Virtex-4, Virtex-5, Virtex-6, and Virtex-7.]

Fig. 12. Growth in delay and area of the 6-bit grouping scheme as the number of BCD digits to be converted to binary grows. [Two line plots: delay (ns) and area (LUTs) versus number of BCD digits, one curve each for Virtex-4, Virtex-5, Virtex-6, and Virtex-7.]

Fig. 13. Growth in delay and area of the 8-bit grouping scheme as the number of BCD digits to be converted to binary grows. [Two line plots: delay (ns) and area (LUTs) versus number of BCD digits, one curve each for Virtex-4, Virtex-5, Virtex-6, and Virtex-7.]

The other aspect that we evaluate is the growth in area and delay of the various schemes as the number of digits in the BCD operand increases. We evaluate the area and delay for 4-bit, 6-bit, and 8-bit groupings on Virtex-4, Virtex-5, Virtex-6, and Virtex-7 FPGAs with 2-digit, 4-digit, 8-digit, and 16-digit BCD inputs. Figs. 11–13 show the results. As a general note, the area of our schemes grows exponentially as the number of digits increases, whereas the delay grows in a logarithmic fashion. As shown, while the Virtex-6 and Virtex-7 FPGAs perform similarly for all schemes, Virtex-5 (which also has 6-input LUTs) performs very differently. While we do not know for certain the root cause of this behavior, these results indicate significant architectural differences between the Virtex-5 FPGA on the one hand and the Virtex-6 and Virtex-7 FPGAs on the other.
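The logarithmic growth in delay noted above is in line with the depth of a tree of adders. The Python sketch below is a rough model only, under two assumptions that the text does not spell out in this form: an n-digit operand is split into ceil(4n / k) groups of k bits, and the group contributions are summed pairwise in a balanced binary tree. It counts adder levels and ignores LUT and routing delay, so it does not predict the measured times. The function names are hypothetical.

```python
import math


def num_groups(num_digits, group_bits):
    """Number of bit groups for an operand of `num_digits` BCD digits."""
    return math.ceil(4 * num_digits / group_bits)


def adder_levels(num_digits, group_bits):
    """Levels of a balanced pairwise adder tree over all group contributions.

    Assumption: an odd contribution at some level is passed on unchanged.
    """
    groups = num_groups(num_digits, group_bits)
    levels = 0
    while groups > 1:
        groups = (groups + 1) // 2
        levels += 1
    return levels


def output_bits(num_digits):
    """Width of the binary result for the largest `num_digits`-digit value."""
    return (10 ** num_digits - 1).bit_length()


for digits in (2, 4, 8, 16):
    print(digits, num_groups(digits, 4), adder_levels(digits, 4),
          output_bits(digits))
```

For the 4-bit grouping this model gives 1, 2, 3, and 4 adder levels for 2, 4, 8, and 16 digits, that is, one extra level each time the operand size doubles, which is in line with the delay figures reported for the 4-bit grouping in Table 3.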
7. Conclusions
In this paper, we present a range of efficient decimal-to-binary conversion schemes to support BCD arithmetic based on binary hardware. Our circuits employ several ideas. First, we split the BCD input into several groups of bits and compute the binary contribution of each group to the overall binary result. The contributions are then added using a tree-structured bank of adders that utilizes the fast carry chain logic available in FPGAs. We select the group size such that it matches the size of the look-up tables on the target FPGA. With this choice, each output function of the circuit that computes the contribution of a given group fits exactly in one look-up table, which results in a compact design. We demonstrate in this paper that the proposed architecture outperforms existing architectures in terms of area and speed.
Furthermore, we have discussed the growth in area and delay of the proposed schemes on various FPGA families as the number of BCD digits in the input grows. The general conclusion is that the area grows in an exponential fashion whereas the delay grows in a logarithmic fashion.

References
[1] A. Vazquez, E. Antelo, P. Montuschi, A new family of high performance parallel decimal multipliers, in: 18th IEEE Symposium on Computer Arithmetic, 2007, ARITH ’07, pp. 195–204.
[2] M.F. Cowlishaw, Decimal floating-point: algorism for computers, in: Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH16’03), ARITH ’03, IEEE Computer Society, Washington, DC, USA, 2003, p. 104.
[3] G. Jaberipur, A. Kaivani, Binary-coded decimal digit multipliers, IET Computers & Digital Techniques 1 (2007) 377–381.
[4] R. James, T. Shahana, K. Jacob, S. Sasi, Decimal multiplication using compact BCD multiplier, in: International Conference on Electronic Design, 2008, ICED, 2008, pp. 1–6.
[5] A. Singh, A. Gupta, S. Veeramachaneni, M.B. Srinivas, A high performance unified BCD and binary adder/subtractor, in: Proceedings of the 2009 IEEE Computer Society Annual Symposium on VLSI, IEEE Computer Society, Washington, DC, USA, 2009, pp. 211–216.
[6] M. Vazquez, G. Sutter, G. Bioul, J.P. Deschamps, Decimal adders/subtractors in FPGA: efficient 6-input LUT implementations, in: International Conference on Reconfigurable Computing and FPGAs, vol. 0, 2009, pp. 42–47.
[7] R.D. Kenney, M.J. Schulte, High-speed multioperand decimal adders, IEEE Transactions on Computers 54 (2005) 953–963.
[8] M. Véstias, H. Neto, Parallel decimal multipliers using binary multipliers, in: VI Southern Programmable Logic Conference (SPL), 2010, pp. 73–78.
[9] H. Neto, M. Véstias, Decimal multiplier on FPGA using embedded binary multipliers, in: International Conference on Field Programmable Logic and Applications, 2008, FPL, 2008, pp. 197–202.
[10] M. Véstias, H. Neto, Iterative decimal multiplication using binary arithmetic, in: VII Southern Conference on Programmable Logic (SPL), 2011, pp. 257–262.
[11] BCD-to-Binary/Binary-to-BCD Number Converter MC-4001P, Application Note: Motorola semiconductor products, 1969.
[12] R.F. Tinder, Engineering Digital Design, second ed., Elsevier, 2002.
[13] L.C. Beougher, A method for high speed BCD-to-binary conversion, Computer Design (1973) 53–59.
[14] L.P. Flora, D.P. Wiener, BCD-to-Binary Converter. US patent, 1982.
[15] D. Wiener, BCD to Binary Converter. US patent, 1982.

Osama Al-Khaleel, assistant professor of Computer Engineering in the Department of Computer Engineering of Jordan University of Science and Technology (Irbid, Jordan), received his B.S. in Electrical Engineering from Jordan University of Science and Technology in 1999, and his M.Sc. and Ph.D. in Computer Engineering from Case Western Reserve University, Cleveland, OH, USA in 2003 and 2006, respectively. Currently, his main research interests are in embedded systems design, reconfigurable computing, computer arithmetic, and logic design.
Zakaria Al-Qudah is an assistant professor of computer engineering at Yarmouk University, Jordan. He earned his Ph.D. and M.Sc. degrees from the Electrical Engineering and Computer Science (EECS) department at Case Western Reserve University (CWRU) in 2010 and 2007, respectively. He received his B.Sc. degree from Yarmouk University, Jordan in 2004. He is interested generally in distributed systems and Internet research. Specific subjects include the performance and security of Content Delivery Networks (CDNs), efficient utility computing platforms, and Internet measurements.
Mohammad Al-Khaleel received the M.Sc. and Ph.D. degrees in Applied Mathematics-numerical analysis from McGill University, Montreal, QC, Canada, in 2003 and 2007, respectively. Since 2007, he has been an Assistant Professor of Mathematics with the Department of Mathematics, Yarmouk University, Irbid, Jordan.
Chris Papachristou is Professor of Electrical Engineering and Computer Science at Case Western Reserve. He received the Ph.D. degree in Electrical Engineering and Computer Science from Johns Hopkins University. His research interests include Design Automation, Testing and Reliability of VLSI Systems, Reconfigurable Computing Architecture Design, and Wireless Digital Systems. He has published numerous articles in these areas, consulted with industry and government, served as Program Chair and General Chair of several IEEE/ACM conferences and been on the program committees of many international conferences and workshops. He is a Fellow of the IEEE and a member of the ACM and Sigma XI, and is listed in Who’s Who in America.