Optimized structures of hybrid ripple carry and hierarchical carry lookahead adders



Microelectronics Journal 46 (2015) 783–794

Contents lists available at ScienceDirect

Microelectronics Journal journal homepage: www.elsevier.com/locate/mejo

Atef Ibrahim a,b,c,*, Fayez Gebali b

a Sattam Bin AbdulAziz University, Kharj, Saudi Arabia
b ECE Department, University of Victoria, Victoria, BC, Canada
c Electronics Research Institute, Cairo, Egypt

Article info

Article history: Received 3 December 2014; Received in revised form 12 June 2015; Accepted 14 June 2015

Keywords: Hybrid adders; Hierarchical carry lookahead adders; Fast adders; Optimized adder structures; ASIC implementation; Digital VLSI design

Abstract

This paper proposes improved structures for fast adders that include carry lookahead (CLA) and hierarchical carry lookahead (HCLA) adders. It also proposes optimized novel structures of hybrid ripple carry/hierarchical carry lookahead (RCA/HCLA) adders. A general methodology is presented for constructing M-bit hierarchical carry lookahead adders using n-bit modules. The only restriction on the values of M and n is $n \le M$. Two algorithms are developed to efficiently construct hierarchical carry lookahead adders for the case when M is not an integer power or an integer multiple of n. The improved hierarchical levels of carry lookahead adders are integrated with the ripple carry adder to construct the novel hybrid RCA/HCLA adders. Area and time complexities of the resulting designs are reported for different values of the radix n and the practical values M = 32 and M = 64 bits. An ASIC implementation of the proposed structures and previously published recent designs shows that one of the proposed hybrid RCA/HCLA adders achieves a 28.2–77.7% reduction in area–delay product and a 40.5–75.8% reduction in energy, for M = 64 and n = 8, over the different compared adder designs. © 2015 Elsevier Ltd. All rights reserved.

1. Introduction

The design of adders with high performance in speed of addition, power consumption and silicon area is important for many applications such as advanced digital signal processors, crypto-processors and embedded wireless mobile devices that require strong encryption to provide the needed security for the users. In traditional very large scale integration (VLSI) design, the system designer must take into consideration the design area and power consumption [1,2]. Managing the power in a VLSI chip does not only target power reduction, but also ensures that no hotspots are present within the die [3]. Wide adders are among the most power-dense processor modules, creating thermal hotspots and severe temperature gradients [4–6]. The existence of various arithmetic logic units (ALUs) in current superscalar processors [7,8] and of different execution cores on the same chip [8–10] further worsens the problem, affecting circuit reliability and increasing cooling costs. At the same time, wide adders are also crucial for performance, and appear inside the ALUs and floating point units (FPUs) of microprocessor datapaths. In a perfect world, a datapath adder would achieve the highest performance using the minimal amount of power and would have a small layout footprint so as to reduce interconnect delays in the core [6]. These conflicting requirements make choosing the best adder architecture and circuit implementation a challenging issue.

The literature gives a variety of solutions for optimizing adders using different techniques such as carry-select adders [11–13], carry save adders [14], carry lookahead adders [15–18], hybrids between carry-select and carry-lookahead adders [19–24], carry skip adders [25,26], and conditional-sum adders [27,28].

The main contribution of this paper is constructing M-bit hybrid ripple-carry and hierarchical carry lookahead (RCA/HCLA) adder structures using an arbitrary choice of n-bit HCLA modules. Two crucial differences exist between the proposed hierarchical structures and the structure of the HCLA. The first difference is that the n-bit HCLA modules at the first level of the proposed hierarchical structures are modified so that they produce only the propagate and generate signals; there is no need to generate the carry signals at this level. The second difference is that the n-bit HCLA module at the top-most hierarchy level generates only the carry signals; there is no need to produce the propagate and generate signals at this level. At the same time, the delay in the RCA section of these adder structures is only that of an n-bit RCA module, since the carry-in signal for each n-bit RCA module is obtained directly from the n-bit HCLA module in the second level of the hierarchy. Therefore, these new structures achieve a significant reduction in area and power with a minimal delay penalty, since they use n-bit RCA modules.

* Corresponding author at: Sattam Bin AbdulAziz University, Kharj, Saudi Arabia. E-mail addresses: [email protected], [email protected] (A. Ibrahim), [email protected] (F. Gebali). http://dx.doi.org/10.1016/j.mejo.2015.06.008 0026-2692/© 2015 Elsevier Ltd. All rights reserved.


A. Ibrahim, F. Gebali / Microelectronics Journal 46 (2015) 783–794

This paper is organized as follows. Section 2 explains how system performance is modeled, to give an idea of the impact of the different parameters on the different adder structures. Section 3 presents the basic modules of the RCA, CLA (carry lookahead adder) and HCLA adders. Section 4 describes constructing an efficient M-bit CLA when M is not an integer multiple of n. Section 5 describes constructing an efficient M-bit HCLA when M is not an integer multiple of n. Section 6 describes the proposed M-bit hybrid ripple-carry and hierarchical carry lookahead (RCA/HCLA) adder structures using an arbitrary choice of n-bit HCLA modules. Section 7 shows the complexity analysis results for the different types of adders investigated. Section 8 compares the ASIC implementation results of the different types of adders investigated and previously reported efficient adders. Finally, Section 9 concludes the paper.

2. Performance modeling

Exploring the optimal VLSI design can rely on performance estimation using back annotation following full place and route in a specific technology. This approach gives realistic estimates but offers little insight into the main parameters affecting system performance. In our case the main parameters are M, n and rough estimates of the elementary gate technology parameters, and extensive implementations using different parameter settings would be necessary to gain any insight. Simple analytic models of system performance, by contrast, give only approximate performance estimates but at least identify the important parameters that impact system performance. The assumptions employed in the next subsection are adequate for simplified models of area and speed, but they are not realistic for modeling power. This is because the dynamic power component of CMOS gates depends on the switching activity factor, which is a strong function of the signal statistics, inter-signal correlations, and glitching transitions. Therefore, we relied only on actual power simulations to study the power consumption of the proposed designs.

2.1. Modeling area

In order to study the order-of-magnitude complexity of the proposed designs, we used the standard layout results discussed in basic VLSI technology texts such as [29]. We make the following assumptions for our numerical results:

1. Standard static CMOS technology is used.
2. All logic modules are implemented in terms of NAND gates. According to the analysis given in [30], a NAND gate has better area and delay than a NOR gate in static CMOS technology.
3. Large-fan-in gates with more than 2 inputs are implemented using basic 2-input NAND gates in order to limit power consumption and maintain symmetric rise and fall times with reasonable transistor sizing.
4. We normalize all areas relative to the area of a 2-input NAND gate.
5. We ignore the inverter areas since their number is very small relative to the total number of other gates. This assumption is based on the observation that the basic module structures of adders have an AND-gate level followed by an OR-gate level (AND–OR structure). We converted the AND–OR structure of these modules to the corresponding NAND–NAND structure. Therefore, when implementing n-input NAND gates with 2-input ones, the inverters required by the n-input NAND gates of the first level are offset by the inverters required by the n-input NAND gates of the second level. Also, the very few remaining inverters are not on the critical path of the modules and thus have no effect on the delay. For these reasons, inverters have been ignored in most of the adder architectures mentioned in this paper.

Based on the above assumptions, the normalized area of an i-input NAND gate is given by $A_i = i - 1$, normalized relative to the area of a 2-input NAND gate. A 2-input XOR gate can be implemented using four 2-input NAND gates, as shown in Fig. 1 [29]. Consequently, the normalized 2-input XOR gate area is $A_X = 4$.

Fig. 1. Constructing 2-input XOR gate using 2-input NAND gates [29].

2.2. Modeling delay

Similar to gate areas, we normalize a gate delay relative to the delay of a 2-input NAND gate driving a similar minimum-area 2-input NAND gate. The normalized delay of an i-input NAND gate is given by $T_i = \lceil \log_2 i \rceil$, normalized relative to the delay of a 2-input NAND gate. According to Fig. 1, a 2-input XOR gate has a normalized delay of $T_X = 3$.
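The normalized gate model of Sections 2.1 and 2.2 can be written down in a few lines. The following is a minimal sketch, not the authors' tooling; the function and constant names (`nand_area`, `nand_delay`, `A_XOR`, `T_XOR`) are illustrative assumptions.

```python
# Normalized gate model: all values are relative to a 2-input NAND gate.
# Names are hypothetical; only the formulas come from the text.

A_XOR = 4  # a 2-input XOR built from four 2-input NAND gates (Fig. 1)
T_XOR = 3  # normalized XOR delay quoted in Section 2.2


def nand_area(i: int) -> int:
    """Normalized area of an i-input NAND built from 2-input NANDs: A_i = i - 1."""
    if i < 2:
        raise ValueError("a NAND gate needs at least 2 inputs")
    return i - 1


def nand_delay(i: int) -> int:
    """Normalized delay of an i-input NAND: T_i = ceil(log2(i))."""
    if i < 2:
        raise ValueError("a NAND gate needs at least 2 inputs")
    # (i - 1).bit_length() equals ceil(log2(i)) for i >= 1, using integers only.
    return (i - 1).bit_length()
```

Using integer arithmetic for the ceiling of the logarithm avoids floating-point rounding at exact powers of two.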

3. RCA, CLA and HCLA basic modules

We provide in this section a detailed analysis of the basic modules used in constructing M-bit RCA, CLA and HCLA adders.

3.1. Ripple-carry adder module (RCA)

We start this section with the standard 1-bit ripple-carry adder (RCA) module construction shown in Fig. 2. When two M-bit numbers are to be added, the addend a and augend b are supplied to the M-bit RCA. The RCA is composed of two blocks: the bit-parallel P & G block and the bit-serial S & C block. The P & G block is used extensively in CLA as well as HCLA structures, as will be discussed below. The S & C block, on the other hand, operates on the input data serially. Each sum output bit $s_i$ is produced after the carry-out bit $c_{i-1}$ of the previous stage is produced. With reference to Fig. 2, the normalized area of the RCA module is estimated as $2A_X + 3$, where $A_X$ is the normalized area of the 2-input XOR gate. The normalized delay of the RCA module is taken as the delay of the carry-out signal, which is estimated as $T_X + 2$, where $T_X$ is the normalized delay of the 2-input XOR gate. This estimate takes into account that the RCA delay is bounded by the carry propagate signal, as opposed to the sum delay.

Fig. 2. A ripple-carry adder (RCA) is composed of a parallel P & G part and a serial S & C part.
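The split into a parallel P & G part and a serial S & C part can be illustrated with a small bit-level model. This is a sketch under the usual textbook definitions, which are assumed here rather than quoted from Fig. 2: propagate $p_i = a_i \oplus b_i$, generate $g_i = a_i b_i$, sum $s_i = p_i \oplus c_i$ and carry $c_{i+1} = g_i + p_i c_i$. The function name `ripple_carry_add` is hypothetical.

```python
def ripple_carry_add(a: int, b: int, width: int, c_in: int = 0):
    """Add two width-bit integers bit by bit; returns (sum, carry_out)."""
    # P & G part: every bit position can be evaluated in parallel.
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]

    # S & C part: each stage waits for the carry of the previous stage.
    s, c = 0, c_in
    for i in range(width):
        s |= (p[i] ^ c) << i       # sum bit uses the incoming carry
        c = g[i] | (p[i] & c)      # carry handed to the next stage
    return s, c
```

For example, `ripple_carry_add(5, 3, 4)` walks the carry through all four stages before the final sum bit settles, which is the serial dependence the delay estimate above refers to.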


3.2. n-Bit carry-lookahead adder module (n-bit CLA)

Fig. 3 shows an n-bit carry lookahead adder module (n-bit CLA). This module accepts the n-bit signals $p[n-1:0]$ and $g[n-1:0]$ and the carry-in signal $c_{in}$ and produces the $(n+1)$-bit carry signal $c[n:0]$ according to the following equations:

$$c_1 = c_{in} p_0 + g_0 \tag{1}$$

$$c_2 = c_{in} p_0 p_1 + g_0 p_1 + g_1 \tag{2}$$

$$c_3 = c_{in} p_0 p_1 p_2 + g_0 p_1 p_2 + g_1 p_2 + g_2 \tag{3}$$

$$\vdots$$

$$c_{n-1} = c_{in} \prod_{i=0}^{n-2} p_i + \sum_{i=0}^{n-2} g_i \prod_{j=i+1}^{n-2} p_j \tag{4}$$

In general, we can write the carry at bit i as

$$c_i = c_{in} \prod_{j=0}^{i-1} p_j + \sum_{j=0}^{i-1} g_j \prod_{k=j+1}^{i-1} p_k, \quad 0 \le i \le n \tag{5}$$

There are two options for generating the carry-out signal $c_n$. The following two equations illustrate our options:

$$c_n = c_{in} \prod_{i=0}^{n-1} p_i + \sum_{i=0}^{n-1} g_i \prod_{j=i+1}^{n-1} p_j \tag{6}$$

$$c_n = c_{n-1} p_{n-1} + g_{n-1} \tag{7}$$

The first option (6) provides low delay but at the expense of increased area and energy consumption. The second option (7) reduces area and power but at the cost of a slight increase in delay due to propagation through two 2-input NAND gates, which are equivalent to a 2-input AND gate followed by a 2-input OR gate. For the remainder of this paper we shall assume that we are using the latter option in Eq. (7) to generate $c_n$.

Fig. 3. n-Bit carry lookahead adder module (n-bit CLA).

Fig. 4 shows the structure of a 4-bit carry lookahead adder module. The design requires several sizes of AND and OR gates to generate the carry signals $c_0$–$c_n$. The number of i-input NAND gates in an n-bit CLA is given by

$$n_i = \begin{cases} n + 2, & i = 2 \\ n - i + 2, & 2 < i \le n \end{cases} \tag{8}$$

Fig. 4. 4-Bit carry lookahead adder module structure.

As was mentioned in Section 2.1, all logic modules will be implemented in terms of NAND gates. Therefore, the AND/OR levels of Fig. 4 will be converted to two NAND levels. We can write the normalized area complexity of an n-bit CLA module as

$$A_{n\text{-CLA}} = \sum_{i=2}^{n} n_i A_i \tag{9}$$

$$A_{n\text{-CLA}} = (n + 2) A_2 + \sum_{i=3}^{n} (n - i + 2) A_i \tag{10}$$

$$A_{n\text{-CLA}} = n + 2 + \sum_{i=3}^{n} (n - i + 2)(i - 1) = \frac{n^3}{6} + \frac{n^2}{2} - \frac{2n}{3} + 2 \tag{11}$$

where $A_i$ was given in Section 2.1. We see that the area complexity increases as the cube of the number of input bits. The delay of the n-bit CLA module equals the delay of the carry-out signal $c_n$ which, according to Eq. (7) and Section 2.2, is given by

$$T_{n\text{-CLA}} = 2T_n + 2 = 2\lceil \log_2 n \rceil + 2 \tag{12}$$

where $T_n$ is the delay of an n-input NAND gate, as was explained in Section 2.2.

3.3. n-Bit hierarchical carry-lookahead adder module (n-bit HCLA)

An n-bit HCLA module (n-bit HCLA) accepts the n-bit signals $p[n-1:0]$ and $g[n-1:0]$ and produces two one-bit signals $p_{out}$ and $g_{out}$. Fig. 5 shows an n-bit HCLA module intended for use in a hierarchical carry lookahead adder (HCLA). The carry-out signal $c_n$ is now replaced with the two signals group carry propagate $p_{out}$ and group carry generate $g_{out}$. These two signals are given by

$$p_{out} = \prod_{j=0}^{n-1} p_j \tag{13}$$

$$g_{out} = \sum_{i=0}^{n-1} g_i \prod_{j=i+1}^{n-1} p_j \tag{14}$$

The carry signals $c_0, c_1, \ldots, c_{n-1}$ are given by (5). There is no carry-out signal.

Fig. 5. n-Bit module for use in a hierarchical carry lookahead adder (n-bit HCLA).

Fig. 6 shows the structure of a 4-bit hierarchical carry lookahead adder module. The number of i-input NAND gates for the n-bit HCLA is given by

$$n_i = \begin{cases} n - i + 3, & 2 \le i < n \\ 5, & i = n \end{cases} \tag{15}$$

Fig. 6. 4-Bit hierarchical carry lookahead adder module structure.

We can write the area complexity of an n-bit HCLA module as

$$A_{n\text{-HCLA}} = \sum_{i=2}^{n} n_i A_i \tag{16}$$

$$A_{n\text{-HCLA}} = 5A_n + \sum_{i=2}^{n-1} (n - i + 3) A_i \tag{17}$$

$$A_{n\text{-HCLA}} = 5(n - 1) + \sum_{i=2}^{n-1} (n - i + 3)(i - 1) = \frac{n^3}{6} + n^2 + \frac{5n}{6} - 2 \tag{18}$$

where $A_i$ was given in Section 2.1. We see that the area complexity increases as the cube of the number of input bits. The delay of the n-bit HCLA module equals the delay of the group generate signal $g_{out}$ which, according to Eq. (14) and Section 2.2, is given by

$$T_{n\text{-HCLA}} = 2T_n = 2\lceil \log_2 n \rceil \tag{19}$$

where $T_n$ is the delay of an n-input NAND gate, as was explained in Section 2.2.

4. Constructing efficient M-bit CLA adders ($\mathrm{CLA}^M_n$)

Constructing a CLA adder that adds two M-bit numbers using only n-bit CLA modules requires that M be an integer multiple of n. For the case when $M > n$ but M is not an integer multiple of n, the module dealing with the most significant bits will have x unused inputs, where x is given by

$$x = n \left\lceil \frac{M}{n} \right\rceil - M, \quad 0 \le x < n \tag{20}$$

A more efficient construction uses a radix-k (k-bit) CLA module for the most-significant word, with $0 < k < n$, to eliminate any unused inputs. In this case, the number of n-bit CLA modules needed is given by

$$q = \left\lfloor \frac{M}{n} \right\rfloor \tag{21}$$

The radix k of the most-significant word module is given by

$$k = M - qn \tag{22}$$

Fig. 7 shows a block diagram of a $\mathrm{CLA}^M_n$ adder that uses n-bit CLA and k-bit CLA modules. The adder can be divided into three stages:

Stage 1 is composed of $q + \lceil k/n \rceil$ P & G blocks that were shown in Fig. 2. Stage 1 accepts the M-bit addend $a[M-1:0]$ and augend $b[M-1:0]$ and produces the M-bit signals $p[M-1:0]$ and $g[M-1:0]$ as

$$g_i = a_i \cdot b_i, \quad p_i = a_i \oplus b_i, \quad 0 \le i < M \tag{23}$$

Stage 2 is composed of q n-bit CLA modules that were discussed in Section 3.2, where q was given in Eq. (21). An extra k-bit CLA module is added when $0 < k < n$. Stage 2 accepts the M-bit signals $p[M-1:0]$ and $g[M-1:0]$ and produces the $(M+1)$-bit carry signal $c[M:0]$.

Stage 3 is composed of an array of M XOR gates and does not comprise the S & C logic of Fig. 2 responsible for producing the sum signals $s_i$. Stage 3 accepts the two M-bit signals $p[M-1:0]$ and $c[M-1:0]$ and produces the sum output $s[M-1:0]$:

$$s_i = c_i \oplus p_i, \quad 0 \le i < M \tag{24}$$

Fig. 7. M-bit CLA adder that uses n-bit CLA and k-bit CLA modules when M is not an integer multiple of n.

According to Fig. 7, the carry-out signal $c_M$ is produced by the final k-bit CLA block in Stage 2 according to Eq. (7). The area of the $\mathrm{CLA}^M_n$ is given by

$$A_{\mathrm{CLA}^M_n} = q A_{n\text{-CLA}} + \left\lceil \frac{k}{n} \right\rceil A_{k\text{-CLA}} + (2A_X + 1) M \tag{25}$$

where q and k are obtained from Eqs. (21) and (22), respectively, and the area of the n-bit CLA modules was derived in Section 3.2. The delay of the $\mathrm{CLA}^M_n$ is given by

$$T_{\mathrm{CLA}^M_n} = q T_{n\text{-CLA}} + \left\lceil \frac{k}{n} \right\rceil T_{k\text{-CLA}} + T_X \tag{26}$$

where $T_{n\text{-CLA}}$ was derived in Section 3.2.
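The module area formulas and the word split of Section 4 lend themselves to a quick numerical check. The sketch below evaluates the gate-count sums of Eqs. (9) and (16) directly and compares them with closed-form cubics; the closed form used for Eq. (18) is the one that the summation in Eq. (17) evaluates to, and all function names are illustrative assumptions.

```python
from fractions import Fraction


def nand_area(i: int) -> int:
    """Normalized area of an i-input NAND gate, A_i = i - 1 (Section 2.1)."""
    return i - 1


def cla_area(n: int) -> int:
    """Eq. (9) with the gate counts of Eq. (8)."""
    count = lambda i: n + 2 if i == 2 else n - i + 2
    return sum(count(i) * nand_area(i) for i in range(2, n + 1))


def cla_area_closed(n: int) -> Fraction:
    """Closed form of Eq. (11): n^3/6 + n^2/2 - 2n/3 + 2."""
    return Fraction(n**3, 6) + Fraction(n**2, 2) - Fraction(2 * n, 3) + 2


def hcla_area(n: int) -> int:
    """Eq. (16) with the gate counts of Eq. (15)."""
    count = lambda i: 5 if i == n else n - i + 3
    return sum(count(i) * nand_area(i) for i in range(2, n + 1))


def hcla_area_closed(n: int) -> Fraction:
    """Closed form obtained by evaluating the sum in Eq. (17)."""
    return Fraction(n**3, 6) + n**2 + Fraction(5 * n, 6) - 2


def split_word(M: int, n: int):
    """Eqs. (20)-(22): returns (q, k, x) for an M-bit adder built from n-bit modules."""
    q, k = divmod(M, n)          # q full n-bit modules, k leftover bits
    x = n * -(-M // n) - M       # inputs left unused if only n-bit modules were used
    return q, k, x
```

As a usage example, `split_word(20, 8)` describes a 20-bit adder built from two 8-bit CLA modules plus one 4-bit module, which avoids the four unused inputs a third 8-bit module would have.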

5. Constructing efficient hierarchical M-bit HCLA adders ($\mathrm{HCLA}^M_n$)

Hierarchical CLA adders (HCLA) efficiently process the two M-bit signals $p[M-1:0]$ and $g[M-1:0]$ to produce the one-bit output signals $p_{out}$ and $g_{out}$ and the carry signals $c[M-1:0]$. For a straightforward implementation of these hierarchical adders, we require that M be an integer power of n, i.e. $M = n^k$, $k > 0$. In general M is not an integer power of n and the resulting HCLA will have many unused input ports or even unused modules. Fig. 8 shows a block diagram of an M-bit hierarchical CLA adder that uses n-bit HCLA modules ($\mathrm{HCLA}^M_n$). For simplicity it was assumed that M = 16 is an integer power of n = 4, so that the total number of n-bit HCLA modules (m) is estimated as

$$m = \frac{M - 1}{n - 1} \tag{27}$$

The adder can be divided into three stages, which are ordered starting with the input data and following the subsequent operations on the intermediate results.

Stage 1 (middle of figure) accepts the M-bit addend $a[M-1:0]$ and augend $b[M-1:0]$ and produces the M-bit signals $p[M-1:0]$ and $g[M-1:0]$ according to Eq. (23).

Stage 2 (on the right of figure) is the n-bit HCLA module array, which accepts the M-bit signals $p[M-1:0]$ and $g[M-1:0]$ and produces the two signals $p_{out}$ and $g_{out}$. The n-bit HCLA modules are arranged as an n-tree. The bold numbers inside each box indicate the radix of the HCLA module, which is 4-bit HCLA in this case. Two hierarchy levels are needed and the carry-out signal $c_{16}$ is obtained at the top level of the hierarchy using an extra 2-input AND and 2-input OR, which are equivalent to two 2-input NAND gates.

Stage 3 (on the left) accepts the two M-bit signals $p[M-1:0]$ and $c[M-1:0]$ and produces the sum output $s[M-1:0]$:

$$s_i = c_i \oplus p_i, \quad 0 \le i < M \tag{28}$$

Fig. 8. M-bit hierarchical CLA adder that uses n-bit HCLA modules ($\mathrm{HCLA}^M_n$) for the case M = 16 and n = 4.

We provide here two algorithms to construct hierarchical $\mathrm{HCLA}^M_n$ adders that address the problem of unused inputs for the practical case when M is not an integer power of n. The algorithms we propose here allow us to add two M-bit numbers using an arbitrary choice of n-bit HCLA modules, where M and n have arbitrary values.

5.1. Type 1 hierarchical HCLA structure

When we construct an HCLA for M bits using only n-bit HCLA modules, we get what we call the Type 1 $\mathrm{HCLA}^M_n$ structure. This type deals with the case when M and n have arbitrary values and identical n-bit HCLA modules are used to generate an unbalanced n-tree structure. As an example, Fig. 9 shows a hierarchical Type 1 $\mathrm{HCLA}^{15}_3$ based on the 3-bit HCLA modules in Fig. 5. The bold numbers inside each box indicate the radix of the HCLA module, which is 3-bit HCLA in this case. Three hierarchy levels are needed since $\lceil \log_3 15 \rceil = 3$. Notice the zero high-order inputs that are needed to pad the numbers a and b, modifying $M \to M'$ such that $M'$ is an integer power of n. The boxes in red indicate n-bit HCLA adders that have zero inputs. The zero inputs for the high-order bits of the n-bit HCLA adders are shown as gray lines and indicate zero inputs/outputs. Therefore the n-bit HCLA adders indicated by red boxes and dashed lines could be removed, and the resulting structure will be an unbalanced n-tree structure. Algorithm 1 explains how we can construct efficient Type 1 structures using unbalanced n-tree construction when M is not an integer power of n.

Fig. 9. A hierarchical Type 1 $\mathrm{HCLA}^{15}_3$ based on 3-bit HCLA modules in Fig. 5. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Algorithm 1. Pseudocode for constructing a Type 1 $\mathrm{HCLA}^M_n$ adder based on an unbalanced n-tree using n-bit HCLA modules.
Require Input: M, n
1: $i = 0$; $h \leftarrow 0$;
2: $q_0 \leftarrow \lceil M/n \rceil$;
3: $Q \leftarrow [q_0]$ /* Initialization */
4: $M \leftarrow q_0$;
5: while $M > 1$ do
6: $i{+}{+}$; $h \leftarrow h + 1$;
7: $q_i \leftarrow \lceil q_{i-1}/n \rceil$;
8: $Q \leftarrow [Q \,\&\, q_i]$;
9: $M \leftarrow q_i$
10: end while
11: return Q;

We provide here a short explanation of the algorithm:

Line 2: The number of n-bit HCLA modules at the lowest level of the hierarchy ($h = 0$) is $\lceil M/n \rceil$. This is stored in the vector Q.
Line 6: The hierarchy index i is incremented and an extra hierarchy level is added since $q > 1$.
Line 7: The number of modules at the current hierarchy level (q) is estimated.
Line 8: The number of n-bit HCLA modules at the current hierarchy level is appended to the vector Q.

Algorithm 1 allows us to construct a hierarchical CLA even when M is not an integer multiple of n. However, some of the high-order bits at each level of the hierarchy will have zero inputs, and the gates associated with these inputs will degrade the performance of the adder. According to Algorithm 1, the number of hierarchy levels needed for the Type 1 $\mathrm{HCLA}^M_n$ is given by

$$h = \lceil \log_n M \rceil \tag{29}$$
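Algorithm 1 translates almost line for line into executable form. The following is a minimal sketch; the names `type1_levels` and `ceil_log` are hypothetical, and the returned list plays the role of the vector Q.

```python
def type1_levels(M: int, n: int) -> list:
    """Number of n-bit HCLA modules per hierarchy level (the vector Q of Algorithm 1)."""
    if n < 2 or M < 1:
        raise ValueError("expects n >= 2 and M >= 1")
    q = -(-M // n)              # line 2: ceil(M / n) modules at level 0
    Q = [q]                     # line 3
    while q > 1:                # line 5: more than one module still needs merging
        q = -(-q // n)          # line 7: ceil(q_{i-1} / n)
        Q.append(q)             # line 8
    return Q


def ceil_log(M: int, n: int) -> int:
    """Integer ceil(log_n(M)), the level count of Eq. (29)."""
    h, span = 0, 1
    while span < M:
        span *= n
        h += 1
    return h
```

For the 15-bit example with 3-bit modules, `type1_levels(15, 3)` gives five modules at the lowest level, two at the next and one at the top, so the length of the list agrees with Eq. (29).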

The following lemmas prove that the above algorithm results in the least possible hardware when M bits are to be added using n-bit HCLA modules.


Lemma 1. In a Type 1 $\mathrm{HCLA}^M_n$ adder, where M is not an integer power of n, a maximum of $n - 1$ inputs could be unused at any level of the hierarchy.

Proof. Consider the i-th hierarchy level ($0 \le i < \lceil \log_n M \rceil$). According to Algorithm 1, we can write:

$$q_i = \lceil q_{i-1}/n \rceil \tag{30}$$

where $q_i$ was defined in Algorithm 1. The number of n-bit HCLA modules at level $h_i$ is $q_i$. The number of unused inputs x is given by

$$x = q_i n - q_{i-1} \tag{31}$$

The range of x is $0 \le x < n$. Thus we could have at most $n - 1$ unused inputs. □
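Eq. (31) can be evaluated level by level to see how many inputs a Type 1 structure leaves unused. This is a small illustrative sketch; the function name `unused_inputs` is an assumption, and level 0 is taken to receive the M input bit pairs.

```python
def unused_inputs(M: int, n: int) -> list:
    """Unused inputs x = q_i * n - q_{i-1} at each level of a Type 1 structure."""
    unused = []
    signals = M                          # level 0 sees the M input p/g pairs
    while True:
        modules = -(-signals // n)       # q_i = ceil(q_{i-1} / n), Eq. (30)
        unused.append(modules * n - signals)   # Eq. (31)
        if modules == 1:                 # a single module is the top of the tree
            break
        signals = modules                # each module feeds one signal pair upward
    return unused
```

For a 15-bit adder with 3-bit modules this yields `[0, 1, 1]`, and for a 14-bit adder `[1, 1, 1]`; every entry stays below n, as the lemma states.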


The following lemma estimates the maximum number of unused inputs in the HCLA system.


Lemma 2. When two M-bit numbers are to be added and M is not an integer power or integer multiple of n, a maximum of $(n - 1)\lceil \log_n M \rceil$ inputs could be unused when using n-bit HCLAs.

Proof. The number of hierarchy levels is $h = \lceil \log_n M \rceil$, according to Eq. (29). The maximum number of zero inputs at a given hierarchy level is $n - 1$, according to Lemma 1. Therefore, a maximum of $(n - 1)\lceil \log_n M \rceil$ inputs could be unused when using n-bit HCLAs. □

Using Eq. (18), the area complexity of the Type 1 $\mathrm{HCLA}^M_n$ design is obtained with reference to Figs. 8 and 9 and is given by the sum of the areas of Stages 1–3:

$$A = (2A_X + 1)M + 2 + A_{n\text{-HCLA}} \sum_{i=0}^{h-1} q_i \tag{32}$$

where h was defined in Eq. (29) and $q_i$ is the i-th component of the vector Q, which was defined in Algorithm 1. The first term accounts for the areas of Stages 1 and 3 shown in Fig. 8. The Type 1 $\mathrm{HCLA}^M_n$ delay is related to evaluating the carry signals $c_0$–$c_{M-1}$. Using Eq. (19) and Figs. 8 and 9, the delay is given by

$$T = 2T_X + (2h - 1)T_{n\text{-HCLA}} \tag{33}$$

where $T_{n\text{-HCLA}}$ was derived in Section 3.3. The first term is due to the delay through Stages 1 and 3 shown in Fig. 8.

5.2. Type 2 hierarchical HCLA structure

The Type 2 $\mathrm{HCLA}^M_n$ structure deals with the case when M and n have arbitrary values. The structure uses n-bit HCLA modules as well as k-bit HCLA modules with $2 \le k \le n$. The resulting structure is a heterogeneous tree where all inputs are utilized, and the value of n need only satisfy the very general requirement $M > n$. We propose the iterative Algorithm 2, which produces an unbalanced heterogeneous hierarchical adder tree.

Algorithm 2. Pseudocode for constructing an M-bit adder based on radix-k HCLA modules ($2 \le k \le n$).
Require Input: M, n
1: $i = 0$; $h \leftarrow 0$;
2: $q_0 \leftarrow \lfloor M/n \rfloor$; $r_0 \leftarrow M - n q_0$;
3: $Q \leftarrow [q_0]$; $R \leftarrow [r_0]$;
4: $M \leftarrow q_0 + \lceil r_0/n \rceil$;
5: while $M > 1$ do
6: $i{+}{+}$; $h \leftarrow h + 1$;
7: $q_i \leftarrow \lfloor q_{i-1}/n \rfloor$; $r_i \leftarrow q_{i-1} - n q_i$;
8: $Q \leftarrow [Q \,\&\, q_i]$; $R \leftarrow [R \,\&\, r_i]$;
9: $M \leftarrow q_i + \lceil r_i/n \rceil$;
10: end while
11: return Q; R;

Fig. 10. A hierarchical 14-bit Type 2 $\mathrm{HCLA}^M_n$ based on radix-2 and radix-3 HCLAs for the case M = 14.

Fig. 10 shows the construction of a hierarchical Type 2 $\mathrm{HCLA}^M_n$ for the case M = 14 and 3-bit HCLA modules. The bold number inside each box indicates the radix of that HCLA module. Notice that there is no redundant hardware and there are no redundant input/output signals. Had we used a homogeneous system using radix-3 HCLAs, the number of unused inputs would have been 3. The iterative Algorithm 2 creates the least-area hierarchical structure for an M-bit adder based on a mix of n-bit HCLA and k-bit HCLA modules with $2 \le k \le n$. We provide here a short explanation of the code in Algorithm 2:

Lines 1 and 2: The number of n-bit HCLA modules at the lowest level of the hierarchy ($h = 0$) is $q = \lfloor M/n \rfloor$. r is the radix of the extra r-bit HCLA, if needed.
Line 3: The number of n-bit HCLA modules at level 0 is stored in vector Q and the radix of the extra r-bit HCLA is stored in vector R.
Line 4: The number of modules at level $h = 0$ is estimated to decide if another hierarchy level is needed or not.
Line 6: An extra hierarchy level is added since $M > 1$.
Line 7: The number of n-bit HCLA modules at level h is q, and one extra r-bit HCLA module is needed for the remaining signals.
Line 8: The number of n-bit HCLA modules at the current hierarchy level is appended to the vector Q and the radix of the extra HCLA module is appended to vector R.
Line 9: Calculate the number of signals at the current hierarchy level.

The number of hierarchy levels needed for the Type 2 $\mathrm{HCLA}^M_n$ is also given by Eq. (29). The following lemma proves that there will be no unused inputs in the Type 2 $\mathrm{HCLA}^M_n$ construction when Algorithm 2 is used.

Lemma 3. In a Type 2 $\mathrm{HCLA}^M_n$ adder, where M is not an integer power of n, there will be no unused inputs at any level of the hierarchy.

Proof. Consider the first level of the hierarchy (i.e. $h = 0$) without loss of generality. The number of bits to be added is M using k-bit HCLAs, where $2 \le k \le n$. The general case is when M is neither an integer power nor an integer multiple of n. We can write

$$M = qn + r \tag{34}$$

where q is a positive integer and $0 < r < n$. In that case, q n-bit HCLA modules are needed at level $h = 0$. When q = 1, no HCLA modules are needed and the p and g signals are forwarded to the next level of the hierarchy. In accordance with Algorithm 2 we will need an extra r-bit HCLA to take care of the remaining inputs. Thus there will be no unused inputs. □

Lemma 4. At the last level of a Type 2 $\mathrm{HCLA}^M_n$ we will have exactly one n-bit HCLA module or one r-bit HCLA module.

Proof. Assume we have h hierarchy layers, where h was given by Eq. (29). Accordingly, we can write the following inequality:

$$n^{h-1} < M \le n^h \tag{35}$$

Dividing the above inequality by the factor $n^{h-1}$, the number of input P & G signal pairs is given by $M'$:

$$1 < M' \le n \tag{36}$$

We have two cases for $M'$:

Case 1, when $M' = n$: We have q = 1 and r = 0. The last hierarchy level would consist of an n-bit HCLA module.
Case 2, when $1 < M' < n$: We have q = 0 and r = $M'$. The last hierarchy level would consist of an r-bit HCLA module. □
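The construction behind Algorithm 2 and Lemmas 3 and 4 can be tried out numerically. The sketch below is one possible reading, not a transcription: it assumes that each level divides the number of signal pairs handed up by the previous level (the running value M of lines 4 and 9) by n, because that reading reproduces the radix-3 and radix-2 modules of Fig. 10. The function name `type2_levels` is hypothetical.

```python
def type2_levels(M: int, n: int):
    """Per-level module counts for a Type 2 structure.

    Returns (Q, R): Q[i] is the number of n-bit HCLA modules at level i and
    R[i] is the radix of the extra smaller module at that level (0 if none).
    Assumption: each level operates on the signal count passed up from below.
    """
    if n < 2 or M < 2:
        raise ValueError("expects n >= 2 and M >= 2")
    Q, R = [], []
    signals = M
    while True:
        q, r = divmod(signals, n)          # q full modules, r leftover signals
        Q.append(q)
        R.append(r)
        signals = q + (1 if r else 0)      # q + ceil(r / n) signals for the next level
        if signals <= 1:                   # one module left: top of the hierarchy
            break
    return Q, R
```

For M = 14 and n = 3 this gives `Q = [4, 1, 0]` and `R = [2, 2, 2]`: four 3-bit modules plus one 2-bit module at the lowest level, one of each at the next, and a single 2-bit module at the top. Every level consumes exactly the signals it receives, and the top level holds exactly one module.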

Using Eq. (18), the area complexity of the Type 2 $\mathrm{HCLA}^M_n$ design is obtained with reference to Figs. 8 and 10 and is given by the sum of the areas of Stages 1–3:

$$A = (2A_X + 1)M + 2 + A_{n\text{-HCLA}} \sum_{i=0}^{h-1} q_i + \sum_{i=0}^{h-1} \left\lceil \frac{r_i}{n} \right\rceil A_{r_i\text{-HCLA}} \tag{37}$$

where h was defined in Algorithm 2 and $q_i$ and $r_i$ are the i-th components of the vectors Q and R, respectively, which were also defined in Algorithm 2. The first term accounts for the areas of Stages 1 and 3 shown in Fig. 8. The Type 2 $\mathrm{HCLA}^M_n$ delay is related to evaluating the carry signals $c_0$–$c_{M-1}$. Using Eq. (19) and Figs. 8 and 10, the delay is given by

$$T = 2T_X + (2h - 1)T_{n\text{-HCLA}} \tag{38}$$

or

$$T = 2T_X + 2(h - 1)T_{n\text{-HCLA}} + T_{r\text{-HCLA}} \tag{39}$$

Eq. (38) is used to estimate the delay when the last level of the adder has an n-bit HCLA module, while Eq. (39) is used when the last level of the adder has an r-bit HCLA module. $T_{n\text{-HCLA}}$ and $T_{r\text{-HCLA}}$ were derived in Section 3.3. The first term in the previous two equations is due to the delay through Stages 1 and 3 shown in Fig. 8.

6. Hybrid RCA/HCLA adders

In this section we propose hybrid ripple-carry and hierarchical carry lookahead (RCA/HCLA) adder structures. To illustrate our proposed hybrid design, let us choose M to be an integer power of n for simplicity. The hybrid construction is shown in Fig. 11. The design consists of three stages.

Stage 1 consists of the 4-bit parallel P & G block of the RCA adder shown in Fig. 2.

Stage 2 is a hierarchical $\mathrm{HCLA}^{16}_4$ adder similar to the one shown in Fig. 8. Two crucial differences exist between the hierarchical structure in Fig. 11 and the other hierarchical structures shown in Figs. 8, 9 and 10. The first difference is that the n-bit HCLA modules at Level 0 are now modified so that they only produce the group $p_{out}$ and $g_{out}$ signals. The hardware for the carry signals is not needed. According to the design details in Section 3.3, this constitutes a significant area saving. The second difference is that the n-bit HCLA module at the top-most hierarchy level produces the carry signals only. The hardware for the $p_{out}$ and $g_{out}$ signals is not needed. Again, this constitutes a significant area saving.

Stage 3 is composed of the bit-serial S & C blocks of the RCA adder. Notice that the delay in the RCA section is only for n = 4 bits, since the carry-in signal for each 4-bit block is obtained directly from the 4-bit HCLA module in Level 1 of the hierarchy. That construction saves area and power at the M/n leaves of the tree, and the delay penalty is minimal since it uses n-bit RCA adders. More area and power savings can be accomplished with a small increase in delay if we extract the carry signals from Level 2. We can even go one level higher with more savings and a slight increase in delay.

Fig. 11. A hybrid RCA/HCLA 16-bit adder based on radix-4 HCLAs ($\mathrm{RCA/HCLA}^{16}_4$).

6.1. Type 1 $\mathrm{RCA/HCLA}^M_n$ adder

Fig. 12 shows a hybrid Type 1 $\mathrm{RCA/HCLA}^{15}_3$ 15-bit adder based on radix-3 HCLAs. We have several observations to note about this structure:

1. The 3-bit HCLA modules in Level 0 of the hierarchical adder do not produce a carry signal.
2. The 3-bit HCLA module in Level 2 of the hierarchical adder does not produce $p_{out}$ or $g_{out}$ signals.
3. The carry-in signals fed to the S & C blocks of the RCA adders in Stage 3 come from the carry-out signals of the 3-bit HCLA modules in Level 1 of the hierarchical adder.

Fig. 12. A hybrid Type 1 $\mathrm{RCA/HCLA}^{15}_3$ 15-bit adder based on radix-3 HCLAs.

Using Eq. (18), and the specific structure of the hybrid RCA/HCLA adders, the area complexity of the Type 1 $\mathrm{RCA/HCLA}^M_n$ design


is obtained with reference to Fig. 12 as

$$A = (2A_X + 3)M + A_{\text{level}\,0} + A_{\text{level}\,1:h-1} + A_{\text{level}\,h} \tag{40}$$

where the first term corresponds to the areas of Stages 1 and 3 and $A_{\text{level}\,0}$ is the area of the modified HCLA modules in Level 0 of the hierarchy. $A_{\text{level}\,1:h-1}$ is the area of the regular HCLA modules in Level 1 to Level h − 1 of the hierarchy. $A_{\text{level}\,h}$ is the area of the modified HCLA module in Level h of the hierarchy. We can write expressions for the three areas as

$$A_{\text{level}\,0} = q_0\,\frac{(n-1)(n+4)}{2} \tag{41}$$

$$A_{\text{level}\,1:h-1} = \sum_{i=1}^{h-1} q_i\,A_{n\text{-HCLA}} \tag{42}$$

$$A_{\text{level}\,h} = q_h \sum_{i=2}^{n} (n-i+3)(i-1) \tag{43}$$

where h and $q_i$ were defined in Algorithm 1 and $A_{n\text{-HCLA}}$ was defined in Eq. (18). The hybrid Type 1 $\mathrm{RCA/HCLA}^{M}_{n}$ delay is related to evaluating the carry signals $c_0$–$c_{M-1}$. Using Eq. (19) and Fig. 12, the delay is given by

$$T = \bigl(2T_X + 2(n-2)\bigr) + 2(h-1)\,T_{n\text{-HCLA}} \tag{44}$$

The first term is due to the delay through Stages 1 and 3.

6.2. Type 2 $\mathrm{RCA/HCLA}^{M}_{n}$ adder

Fig. 13 shows a hybrid Type 2 $\mathrm{RCA/HCLA}^{14}_{3}$ 14-bit adder based on radix-3 HCLAs. The same observations that were made for the Type 1 $\mathrm{RCA/HCLA}^{M}_{n}$ also apply here, with the added benefit of extra hardware savings since no unused inputs are present. Using Eq. (18) and the specific structure of hybrid RCA/HCLA adders, the area complexity of the Type 2 $\mathrm{RCA/HCLA}^{M}_{n}$ design is obtained with reference to Fig. 13 as

$$A = (2A_X + 3)M + A_{\text{level}\,0} + A_{\text{level}\,1:h-1} + A_{\text{level}\,h} \tag{45}$$

where the first term corresponds to the areas of Stages 1 and 3 and $A_{\text{level}\,0}$ is the area of the modified HCLA modules in Level 0 of the hierarchy. $A_{\text{level}\,1:h-1}$ is the area of the regular HCLA modules in Level 1 to Level h − 1 of the hierarchy. $A_{\text{level}\,h}$ is the area of the modified HCLA module in Level h of the hierarchy. We can write expressions for the three areas as

$$A_{\text{level}\,0} = q_0\,\frac{(n-1)(n+4)}{2} + r_0\,\frac{(r_0-1)(r_0+4)}{2} \tag{46}$$

$$A_{\text{level}\,1:h-1} = \sum_{i=1}^{h-1} \bigl(q_i\,A_{n\text{-HCLA}} + r_i\,A_{r_i\text{-HCLA}}\bigr) \tag{47}$$

$$A_{\text{level}\,h} = q_h \sum_{i=2}^{n} (n-i+3)(i-1) + r_h\,A_{r_h\text{-HCLA}} \tag{48}$$

where h and $q_i$ were defined in Algorithm 2 and $A_{n\text{-HCLA}}$ and $A_{r\text{-HCLA}}$ were defined in Eq. (18). The hybrid Type 2 $\mathrm{RCA/HCLA}^{M}_{n}$ delay is related to evaluating the carry signals $c_0$–$c_{M-1}$. Using Eq. (19) and Fig. 13, the delay is given by

$$T = \bigl(2T_X + 2(n-2)\bigr) + 2(h-1)\,T_{n\text{-HCLA}} \tag{49}$$

or

$$T = \bigl(2T_X + 2(n-2)\bigr) + \bigl(2(h-1) - 1\bigr)\,T_{n\text{-HCLA}} + T_{r\text{-HCLA}} \tag{50}$$

Eq. (49) is used to estimate the delay when the last level of the adder has an n-bit HCLA module, while Eq. (50) is used when the last level of the adder has an r-bit HCLA module. $T_{n\text{-HCLA}}$ and $T_{r\text{-HCLA}}$ were derived in Section 3.3. The first term in the previous two equations is due to the delay through Stages 1 and 3 shown in Fig. 8.

7. Complexity analysis results

From the results of Sections 4–6 we can investigate the dependence of the performance parameters on the two main system parameters: M and n. In order to simplify the discussion we fixed M at the practical values of 32 and 64 bits and varied the radix n over the integer values in the range 2 ≤ n < M. Fig. 14 shows the dependence of the normalized area on the radix n for the case when M = 32 and the six different types of adders investigated:

1. Ripple-carry adder ($\mathrm{RCA}^{M}$).
2. Carry lookahead adder ($\mathrm{CLA}^{M}_{n}$).
3. Type 1 hierarchical carry lookahead adder (Type 1 $\mathrm{HCLA}^{M}_{n}$).
4. Type 2 hierarchical carry lookahead adder (Type 2 $\mathrm{HCLA}^{M}_{n}$).
5. Hybrid Type 1 hierarchical carry lookahead adder (hybrid Type 1 $\mathrm{HCLA}^{M}_{n}$).
6. Hybrid Type 2 hierarchical carry lookahead adder (hybrid Type 2 $\mathrm{HCLA}^{M}_{n}$).

As expected, the RCA adder has the least area. The CLA and the two types of HCLA adders perform the worst in terms of area.

Fig. 13. A hybrid Type 2 $\mathrm{RCA/HCLA}^{14}_{3}$ 14-bit adder based on radix-3 and radix-2 HCLAs.

Fig. 14. Normalized area complexity for the six types of adders considered in this paper for the case M = 32. [Plot: total normalized area (logarithmic scale) versus radix n; curves for RCA, CLA, Type 1 HCLA, Type 2 HCLA, Hybrid Type 1 HCLA and Hybrid Type 2 HCLA.]


However, we notice from the figure that the two hybrid HCLA adder types have areas closest to the RCA adder when the radix n = 6. The area of hybrid $\mathrm{HCLA}^{32}_{6}$ is almost three orders of magnitude smaller than that of CLA or $\mathrm{CLA}^{32}_{6}$. Furthermore, the hybrid Type 2 $\mathrm{HCLA}^{M}_{n}$ is the best lookahead adder to consider for the range of the radix n ≥ 6.

Fig. 15 shows the dependence of the normalized delay on the radix n for the case when M = 32 and the six different types of adders investigated. As expected, the RCA adder has the highest delay. The CLA and the two types of HCLA adders perform the worst in terms of delay. However, we notice from the figure that the two hybrid HCLA adder types have the least delays when the radix n = 6. The delay of hybrid $\mathrm{HCLA}^{32}_{6}$ is almost two orders of magnitude smaller than that of CLA or $\mathrm{CLA}^{32}_{6}$. Furthermore, the hybrid Type 2 $\mathrm{HCLA}^{M}_{n}$ is the best lookahead adder to consider for the range of the radix n ≥ 6.

Fig. 16 shows the dependence of the normalized area × delay complexity on the radix n for the case when M = 32 and the six different types of adders investigated. For most of the range of the radix n, CLA and HCLA perform worse than the RCA adder. However, for the radix n ≥ 6, hybrid Type 2 $\mathrm{HCLA}^{M}_{n}$ is the best performer.

Fig. 17 shows the dependence of the normalized area on the radix n for the case when M = 64 and the six different types of adders investigated. As expected, the RCA adder has the least area. The CLA and the two types of HCLA adders perform the worst in terms of area. However, we notice from the figure that the two hybrid HCLA adder types have areas closest to the RCA adder when the radix n = 8. The area of hybrid $\mathrm{HCLA}^{64}_{8}$ is almost three orders of magnitude smaller than that of CLA or $\mathrm{CLA}^{64}_{8}$. Furthermore, the hybrid Type 2 $\mathrm{HCLA}^{M}_{n}$ is the best lookahead adder to consider for the range of the radix n ≥ 8.

Fig. 18 shows the dependence of the normalized delay on the radix n for the case when M = 64 and the six different types of adders investigated. As expected, the RCA adder has the highest delay. The CLA and the two types of HCLA adders perform the worst in terms of delay. However, we notice from the figure that the two hybrid HCLA adder types have the least delays when the radix n = 8. The delay of hybrid $\mathrm{HCLA}^{64}_{8}$ is almost two orders of magnitude smaller than that of CLA or $\mathrm{CLA}^{64}_{8}$. Furthermore, the hybrid Type 2 $\mathrm{HCLA}^{M}_{n}$ is the best lookahead adder to consider for the range of the radix n ≥ 8.

Fig. 19 shows the dependence of the normalized area × delay complexity on the radix n for the case when M = 64 and the six different types of adders investigated.

Fig. 15. Normalized delay complexity for the six types of adders considered in this paper for the case M = 32. [Plot: total normalized delay (logarithmic scale) versus radix n, one curve per adder type.]

Fig. 16. Normalized area × delay complexity for the six types of adders considered in this paper for the case M = 32. [Plot: area × delay complexity (logarithmic scale) versus radix n, one curve per adder type.]

Fig. 17. Normalized area complexity for the six types of adders considered in this paper for the case M = 64. [Plot: total normalized area (logarithmic scale) versus radix n, one curve per adder type.]

Fig. 18. Normalized delay complexity for the six types of adders considered in this paper for the case M = 64.


For most of the range of the radix n, CLA and HCLA perform worse than the RCA adder. However, for the radix n ≥ 8, hybrid Type 2 $\mathrm{HCLA}^{M}_{n}$ is the best performer.
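To see how the radix n enters the hybrid area model of Section 6, the following Python sketch evaluates the Level 0 term of Eq. (41), the top-level term of Eq. (43), and the total of Eq. (40). All function and argument names are hypothetical. The module counts q (from Algorithm 1) and the areas $A_X$ and $A_{n\text{-HCLA}}$ (from Eq. (18)) are not reproduced in this section, so they are passed in as plain inputs, and the numbers in the usage line are placeholders. The closed form n(n − 1)(n + 7)/6 for the sum in Eq. (43) is our own simplification, not a formula from the paper.

```python
def level0_area(q0, n):
    """Eq. (41): q0 modified n-bit modules that only form group p/g signals."""
    return q0 * (n - 1) * (n + 4) // 2


def top_level_area(qh, n):
    """Eq. (43): qh modified n-bit modules that only form carry signals."""
    return qh * sum((n - i + 3) * (i - 1) for i in range(2, n + 1))


def top_level_area_closed_form(qh, n):
    """Same quantity via the simplification n(n-1)(n+7)/6 (our derivation)."""
    return qh * n * (n - 1) * (n + 7) // 6


def hybrid_type1_area(M, n, q, a_x, a_n_hcla):
    """Eq. (40) with q = [q_0, ..., q_h]; middle levels use regular modules."""
    middle = a_n_hcla * sum(q[1:-1])          # Eq. (42)
    return ((2 * a_x + 3) * M
            + level0_area(q[0], n)
            + middle
            + top_level_area(q[-1], n))


# Placeholder inputs: four Level 0 modules and one top-level module.
print(hybrid_type1_area(M=16, n=4, q=[4, 1], a_x=3, a_n_hcla=0))
```

Both modified-module terms grow polynomially in n (quadratically for Level 0, cubically for the top level), which is one way to read the radix dependence plotted in Figs. 14 and 17.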

Fig. 19. Normalized area × delay complexity for the six types of adders considered in this paper for the case M = 64. [Plot: area × delay complexity (logarithmic scale) versus radix n, one curve per adder type.]

8. ASIC implementation comparison

The six types of adders discussed above and the previously published efficient adders of [16–18] were implemented in VHDL at the register transfer level and synthesized for both cases, M = 32, n = 6 and M = 64, n = 8, using the Nangate 45 nm Open Cell Library. We used the Synopsys synthesis tools package, version 2005.09-SP2. We used the typical corner (VDD = 1.1 V and $T_j$ = 25 °C) and unit drive strength for all the utilized primitives. The power was estimated at the maximum operating frequency of each design. The switching activities were recorded during simulation in a Switching Activity Interchange Format (SAIF) file, which was then read by Synopsys Design Compiler to generate the power report. Simulations used Mentor Graphics ModelSim SE 6.0a.

We designed a test bench to simulate the adders developed in this paper. The test bench cycles through 200 possible combinations of the 32-bit inputs, allowing the user to check the accuracy of the outputs. Instead of writing all the input combinations in the test bench code, a simple loop is used to generate them. Furthermore, a signal named incorrect is used to check the output correctness automatically, so that the designer only needs to look at the final value of incorrect to tell whether the circuit is functioning properly: if incorrect is "0" at the end of the simulation, the adder is working correctly; otherwise, incorrect is asserted as soon as an output is wrong and remains asserted throughout the rest of the simulation. To allow the operator to see the output resulting from each set of inputs, a delay of 50 ns is inserted between tests using a wait statement. The value of 50 ns was selected somewhat arbitrarily, since this simulation does not include gate propagation delay. We also performed post-layout simulation to include the extra pin cost and the gate propagation delay, and hence obtained an accurate estimate of area, speed and power consumption.

Table 1 compares the ASIC implementations of the different adders. In this table, the column entitled "fmax" represents the maximum operating frequency of the adders. The column entitled "Area" represents the area of the adders. The "Adder delay" column is the total computation time required by the adder to complete a single operation. The "Power" column is the power consumed by the adder at its maximum operating frequency. The columns "ADP" and "PDP" are the area–delay product and power–delay product (energy) design metrics, respectively. These design metrics were calculated using the synthesis results in order to measure the degree of optimization achieved in each adder. The columns "% ADP" and "% PDP" represent the percentage reduction in ADP and PDP, respectively, that the proposed hybrid Type 2 RCA/HCLA adder achieves over the different compared adder designs. From Table 1 we notice that the proposed hybrid Type 2 RCA/HCLA adder achieves 28.2–77.7% reduction in ADP and 40.5–75.8% reduction in PDP (energy), for M = 64 and n = 8, over the different compared adder designs. Also, it achieves 9.3–57.5% reduction in ADP and 9.5–58.6% reduction in PDP (energy), for M = 32 and n = 6, over them.

In an attempt to compare the ASIC implementation results of area and delay to the results of the high-level abstracted estimates
obtained earlier for the six different types of adder, we described the 2-input NAND gate in VHDL and synthesized it using the same cell library (Nangate 45 nm Open Cell Library) to estimate its area in μm² and delay in ns. The obtained value for each of them is multiplied by the corresponding normalized value obtained in Section 7 for both cases, M = 32, n = 6 and M = 64, n = 8. In this way, we can calculate the absolute values of the area and delay for the six different types of adder. Table 2 shows the comparison between the ASIC implementation results and the estimated results obtained from the proposed modeling of area and delay. We notice from this table that the estimated results for both area and delay are close to the results obtained from the ASIC implementation. There is a slight difference in the results that does not exceed, in the worst case, 10%, and this is attributed to the neglect of the interconnect wires in the proposed modeling. However, the values obtained are good enough to compare the relative sizes and performance of the different designs.

Table 1. Comparison of ASIC synthesis results.

| Adder | M | n | fmax (GHz) | Area (μm²) | Adder delay (ns) | Power (mW) | ADP (μm² ns) | PDP (pJ) | % ADP | % PDP |
|---|---|---|---|---|---|---|---|---|---|---|
| RCA | 32 | 6 | 2.05 | 796.4 | 0.488 | 3.652 | 388.6 | 1.782 | 57.5 | 58.6 |
| RCA | 64 | 8 | 1.87 | 1573.0 | 0.968 | 5.874 | 1522.7 | 5.687 | 75.7 | 75.8 |
| CLA | 32 | 6 | 2.84 | 1236.4 | 0.351 | 4.752 | 433.9 | 1.672 | 61.9 | 55.9 |
| CLA | 64 | 8 | 2.55 | 3295.6 | 0.503 | 10.241 | 1657.7 | 5.148 | 77.7 | 73.3 |
| Type 1 HCLA | 32 | 6 | 4.44 | 1661.0 | 0.225 | 6.017 | 373.7 | 1.353 | 55.8 | 45.5 |
| Type 1 HCLA | 64 | 8 | 4.08 | 4085.4 | 0.245 | 13.574 | 1001.0 | 3.322 | 63.1 | 58.6 |
| Type 2 HCLA | 32 | 6 | 4.57 | 1436.0 | 0.219 | 5.489 | 321.2 | 1.199 | 48.6 | 38.5 |
| Type 2 HCLA | 64 | 8 | 4.08 | 4085.4 | 0.245 | 13.574 | 1001.0 | 3.322 | 63.1 | 58.6 |
| Hybrid Type 1 HCLA | 32 | 6 | 4.61 | 838.2 | 0.217 | 3.729 | 182.6 | 0.814 | 9.3 | 9.5 |
| Hybrid Type 1 HCLA | 64 | 8 | 4.31 | 1590.6 | 0.232 | 5.907 | 369.6 | 1.375 | 0.0 | 0.0 |
| Hybrid Type 2 HCLA | 32 | 6 | 4.95 | 818.4 | 0.202 | 3.674 | 165.0 | 0.737 | 0.0 | 0.0 |
| Hybrid Type 2 HCLA | 64 | 8 | 4.31 | 1590.6 | 0.232 | 5.907 | 369.6 | 1.375 | 0.0 | 0.0 |
| Zlatanovici [16] | 32 | 6 | 4.15 | 1322.2 | 0.241 | 5.225 | 319.0 | 1.254 | 48.3 | 41.2 |
| Zlatanovici [16] | 64 | 8 | 3.53 | 2886.4 | 0.283 | 12.331 | 816.2 | 3.487 | 54.7 | 60.6 |
| Morrison [17] | 32 | 6 | 4.33 | 1311.2 | 0.231 | 5.005 | 303.6 | 1.166 | 45.7 | 36.8 |
| Morrison [17] | 64 | 8 | 3.80 | 2470.6 | 0.263 | 11.132 | 649.0 | 2.926 | 43.1 | 53.0 |
| Perri [18] | 32 | 6 | 4.71 | 1062.6 | 0.212 | 4.432 | 224.4 | 0.939 | 26.5 | 21.5 |
| Perri [18] | 64 | 8 | 4.12 | 2116.4 | 0.243 | 9.516 | 514.8 | 2.312 | 28.2 | 40.5 |

Table 2. ASIC results versus modeling results of the discussed six different types of adder.

| Adder | M | n | ASIC area (μm²) | ASIC delay (ns) | Modeling area (μm²) | Modeling delay (ns) |
|---|---|---|---|---|---|---|
| RCA | 32 | 6 | 796.4 | 0.488 | 720.2 | 0.451 |
| RCA | 64 | 8 | 1573.0 | 0.968 | 1498.4 | 0.925 |
| CLA | 32 | 6 | 1236.4 | 0.351 | 1188.2 | 0.322 |
| CLA | 64 | 8 | 3295.6 | 0.503 | 3195.8 | 0.497 |
| Type 1 HCLA | 32 | 6 | 1661.0 | 0.225 | 1592.2 | 0.211 |
| Type 1 HCLA | 64 | 8 | 4085.4 | 0.245 | 3982.7 | 0.213 |
| Type 2 HCLA | 32 | 6 | 1436.0 | 0.219 | 1387.0 | 0.199 |
| Type 2 HCLA | 64 | 8 | 4085.4 | 0.245 | 3982.7 | 0.223 |
| Hybrid Type 1 HCLA | 32 | 6 | 838.2 | 0.217 | 791.2 | 0.197 |
| Hybrid Type 1 HCLA | 64 | 8 | 1590.6 | 0.232 | 1501.3 | 0.205 |
| Hybrid Type 2 HCLA | 32 | 6 | 818.4 | 0.202 | 778.3 | 0.195 |
| Hybrid Type 2 HCLA | 64 | 8 | 1590.6 | 0.232 | 1501.3 | 0.205 |
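The derived columns of Table 1 can be spot-checked with a few lines of Python. The sketch below assumes that ADP is area times delay, that PDP is power times delay (mW × ns = pJ), and that the percentage columns are computed as (reference − proposed)/reference; the last assumption is inferred from the table entries rather than stated in the text. The function names are hypothetical, and the inputs are values copied from Table 1. Recomputing from the rounded table entries may differ in the last digit for some rows.

```python
def adp(area_um2, delay_ns):
    """Area-delay product in um^2 * ns."""
    return area_um2 * delay_ns


def pdp(power_mw, delay_ns):
    """Power-delay product (energy) in pJ, since mW * ns = pJ."""
    return power_mw * delay_ns


def percent_reduction(reference, proposed):
    """Reduction achieved by `proposed` relative to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


# RCA row, M = 32: area 796.4 um^2, delay 0.488 ns, power 3.652 mW.
print(round(adp(796.4, 0.488), 1))    # 388.6
print(round(pdp(3.652, 0.488), 3))    # 1.782

# Hybrid Type 2 versus RCA (M = 32) and versus Perri [18] (M = 64),
# using the ADP and PDP entries of Table 1.
print(round(percent_reduction(388.6, 165.0), 1))   # 57.5
print(round(percent_reduction(514.8, 369.6), 1))   # 28.2
print(round(percent_reduction(2.312, 1.375), 1))   # 40.5
```

The last two lines reproduce the lower ends of the 28.2–77.7% ADP and 40.5–75.8% PDP ranges quoted for M = 64 and n = 8.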

9. Conclusions

This paper presented improved hierarchical structures for CLA and HCLA adders. The improved hierarchical levels of the HCLA adders are integrated with the RCA adder to construct novel structures of hybrid RCA/HCLA adders. A general methodology is presented for constructing M-bit HCLA adders using n-bit HCLA modules. The only restriction on the values of M and n is n ≤ M. Two algorithms are developed to efficiently construct HCLA adders for the case when M is not an integer power or an integer multiple of n. Area and time complexities of the resulting designs are reported for fixed values of M and different values of the radix n. An ASIC implementation of the proposed structures and previously published recent designs shows that the proposed hybrid Type 2 RCA/HCLA adder achieves 28.2–77.7% reduction in area–delay product and 40.5–75.8% reduction in energy, for M = 64 and n = 8, over the different compared adder designs.

Acknowledgements

The authors would like to acknowledge the support of a Discovery Grant from the Natural Sciences and Engineering Research Council to the second author and the support of Sattam Bin AbdulAziz University and Electronics Research Institute for the first author.


References

[1] H. Elmiligi, M. El Kharashi, F. Gebali, Power consumption of 3D networks-on-chips: modeling and optimization, Microprocess. Microsyst. 37 (2013) 530–543.
[2] B.H. Meyer, J.J. Pieper, J.M. Paul, J.E. Nelson, S.M. Pieper, A.G. Rowe, Power-performance simulation and design strategies for single-chip heterogeneous multiprocessors, IEEE Trans. Comput. 54 (6) (2005) 684–697.
[3] N. Banerjee, P. Vellanki, K.S. Chatha, A power and performance model for network-on-chip architectures, in: Proceedings of the Design, Automation and Test in Europe (DATE04), Paris, France, 2004, pp. 21250–21256.
[4] S. Sahoo, K. Mahapatra, Modified circuit design technique for feedthrough logic, in: National Conference on Computing and Communication Systems (NCCCS), 2012, pp. 1–5.
[5] T.T. Hoang, M. Sjalander, P. Larsson-Edefors, High-speed, energy-efficient 2-cycle multiply-accumulate architecture, in: IEEE International SOC Conference, 2009, pp. 119–122.
[6] S. Mathew, M.A. Anders, B. Bloechel, T. Nguyen, R.K. Krishnamurthy, S. Borkar, A 4 GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90 nm CMOS, IEEE J. Solid-State Circuits 40 (1) (2005) 44–51.
[7] E.S. Fetzer, M. Gibson, A. Klein, N. Calick, Z. Chengyu, E. Busta, B. Mohammad, A fully bypassed six-issue integer datapath and register file on the itanium-2 microprocessor, IEEE J. Solid-State Circuits 37 (11) (2002) 1433–1440.
[8] S. Naffziger, B. Stackhouse, T. Grutkowski, D. Josephson, J. Desai, E. Alon, M. Horowitz, The implementation of a 2-core multithreaded itanium family processor, IEEE J. Solid-State Circuits 41 (1) (2006) 197–209.
[9] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, A dual-core multi-threaded xeon processor with 16 mb l3 cache, in: IEEE ISSCC Digest of Technical Papers, 2006, pp. 102–103.
[10] M. Golden, S. Arekapudi, G. Dabney, M. Haertel, S. Hale, L. Herlinger, Y. Kim, K. McGrath, V. Palisetti, M. Singh, A 2.6 GHz dual-core 64b x86 microprocessor with ddr2 memory support, in: IEEE ISSCC Digest of Technical Papers, 2006, pp. 104–105.
[11] K. Rawat, T. Darwish, M. Bayoumi, A low power and reduced area carry select adder, in: The 2002 45th Midwest Symposium on Circuits and Systems (MWSCAS), vol. 1, 2002, pp. I-467–470.
[12] S. Sakthikumaran, S. Salivahanan, V. Bhaaskaran, V. Kavinilavu, B. Brindha, C. Vinoth, A very fast and low power carry select adder circuit, in: 3rd International Conference on Electronics Computer Technology (ICECT), vol. 1, 2011, pp. 273–276.
[13] U.S. Kumar, K.K. Salih, K. Sajith, Design and implementation of carry select adder without using multiplexer, in: IEEE International Conference on Emerging Technology Trends in Electronics, Communication and Networking, vol. A247, 2012, pp. 529–551.
[14] S. Jia, S. Lyu, X. Li, L. Liu, Y. He, Simplified carry save adder-based array multiplier scheme and circuits design, Int. J. Circuit Theory Appl. (published online: 30 April 2014), http://dx.doi.org/10.1002/cta.1998.
[15] R. Zlatanovici, B. Nikolic, Power-performance optimal 64-bit carry-lookahead adders, in: Proceedings of the 29th European Solid-State Circuits Conference (ESSCIRC'03), 2003, pp. 321–324.
[16] R. Zlatanovici, S. Kao, B. Nikolic, Energy-delay optimization of a 64-bit carry-lookahead adders with a 420 ps 90 nm CMOS design example, IEEE J. Solid-State Circuits 44 (2) (2009) 569–583.
[17] M. Morrison, M. Lewandowski, R. Meana, N. Ranganathan, Design of a novel reversible ALU using an enhanced carry look-ahead adder, in: 11th IEEE Conference on Nanotechnology (IEEE-NANO), 2011, pp. 1436–1440.
[18] S. Perri, M. Lanuzza, P. Corsonello, Design of high-speed low-power parallel-prefix adder trees in nanometer technologies, Int. J. Circuit Theory Appl. 42 (7) (2014) 731–743.
[19] Y. Wang, C. Pai, X. Song, The design of hybrid carry-lookahead/carry-select adders, IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process. 49 (1) (2002) 16–24.
[20] J.-F. Li, J.-D. Yu, Y.-J. Huang, A design methodology for hybrid carry-lookahead/carry-select adders with reconfigurability, in: IEEE International Symposium on Circuits and Systems (ISCAS), vol. 1, 2005, pp. 77–80.
[21] Y. He, C.-H. Chang, A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two's complement converter, IEEE Trans. Circuits Syst. I: Regul. Pap. 55 (1) (2008) 336–346.
[22] P. hua Chen, J. Zhao, G. bo Xie, Y.-J. Li, An improved 32-bit carry-lookahead adder with conditional carry-selection, in: 4th International Conference on Computer Science & Education (ICCSE'09), 2009, pp. 1911–1913.
[23] H. Tamar, A. Tamar, K. Hadidi, A. Khoei, P. Hoseini, High speed area reduced 64-bit static hybrid carry-lookahead/carry-select adder, in: 18th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2011, pp. 460–463.
[24] S. Parmar, K. Singh, Design of high speed hybrid carry select adder, in: IEEE 3rd International Advance Computing Conference (IACC), 2013, pp. 1656–1663.
[25] A. Guyot, B. Hochet, J.-M. Muller, A way to build efficient carry-skip adders, IEEE Trans. Comput. C36 (10) (1987) 1144–1152.
[26] Y. Kobayashi, A. Satoh, S. Munetoh, Carry Skip Adder. URL http://www.google.com.ar/patents/US6199091, 2001.
[27] W. Haixia, S. Zhong, Q. Xiaonan, X. Qianbin, C. Yueyang, Design of a conditional sum adder based on multiple-valued logic, in: International Conference on Electronics, Communications and Control (ICECC), 2011, pp. 810–813.
[28] P. Phaneendra, S. Veeramachaneni, N. Muthukrishnan, M. Srinivas, Conditional sum block for high sparse adders, in: 2011 Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics (PrimeAsia), 2011, pp. 110–114.
[29] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, Addison Wesley, Reading, Massachusetts, 1993.
[30] J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits, Prentice-Hall, Upper Saddle River, 2003.