Improving the performance of transmission gate and hybrid CMOS Full Adders in chain and tree structure architectures


Integration, the VLSI Journal xxx (xxxx) xxx

Contents lists available at ScienceDirect

Integration, the VLSI Journal journal homepage: www.elsevier.com/locate/vlsi

Manan Mewada a,*, Mazad Zaveri b, Rajesh Thakker c

a School of Engineering and Applied Science, Ahmedabad, India
b Pandit Deendayal Petroleum University, Gandhinagar, India
c Vishwakarma Government Engineering College, Ahmedabad, India

ARTICLE INFO

Keywords: Full adder; Power delay product; Ripple carry adder; Multiplier; Triplet design

ABSTRACT

A Full Adder (FA) is the basic building block of many VLSI sub-systems, and it generally falls on the critical path of the system. Several Transmission Gate (TG) and hybrid CMOS FA designs have been proposed in the literature to achieve a low Power Delay Product (PDP). However, their performance degrades when they are used in chain and tree structures, mainly because of poor driving capability. This paper introduces a new design approach, the triplet design, which improves the performance of TG and hybrid CMOS FA designs in chain and tree structures without inserting buffers. Two new hybrid CMOS FA designs that are suitable for the triplet design approach are also proposed. Six different FA designs (TG and hybrid CMOS FAs) are used to build 4, 8 and 16 bit Ripple Carry Adders (RCA) and multipliers, and the improvement in PDP obtained with the triplet design approach is studied. Schematics and layouts of the RCAs and multipliers are designed in Cadence Virtuoso using the gpdk045 library and compared on the basis of PDP.

1. Introduction

Demand for portable electronic devices with high speed and long battery life is increasing rapidly. These devices contain processing elements such as an Arithmetic and Logic Unit (ALU), and the FA is one of the basic building blocks of an ALU. FAs generally fall on the critical path of the system [1–4]; hence, high performance, low power FA design has been an important research area for many years. FAs are part of multi-bit adders such as the RCA (chain structure) and of multipliers (tree structure) [1,2]. TG and hybrid CMOS FA designs are suitable for achieving low PDP (footnote 1), but their performance degrades when they are used in chain and tree structure architectures because of poor driving capability [4].

In this paper, we propose a new design approach, the triplet design, to improve the performance of TG and hybrid CMOS FA designs in chain and tree structures. We also propose two new hybrid CMOS FA designs that are suitable for the triplet design approach and that also achieve low PDP in chain and tree structures.

The rest of the paper is organized as follows. Section 2 briefly summarizes the different FA designs (logic styles) reported in the literature. Section 3 explains the triplet design approach for improving the performance of TG and hybrid CMOS FA designs in chain and tree structures. Section 4 introduces two new hybrid CMOS FA designs suitable for conversion to the triplet design. Section 5 presents the simulation environment and simulation results. Finally, Section 6 concludes the paper.

* Corresponding author. E-mail addresses: [email protected], [email protected] (M. Mewada), [email protected] (M. Zaveri), rathakker2008@gmail.com (R. Thakker).

Footnote 1: PDP is a measure that correlates power dissipation and propagation delay and is used as a figure of merit [1–8].

https://doi.org/10.1016/j.vlsi.2019.09.002
Received 23 March 2019; Received in revised form 5 August 2019; Accepted 5 September 2019. Available online xxxx. 0167-9260/© 2019 Elsevier B.V. All rights reserved.

2. Full Adder design styles

FA designs can be divided into two broad categories: dynamic CMOS FA designs and static CMOS FA designs. Several dynamic CMOS FA designs have been reported in the literature [2,5–8]. These FA circuits are built from high-speed NMOS transistors and use few PMOS transistors (mostly controlled by the clock signal), which reduces the input capacitance. Hence, dynamic CMOS FA designs can achieve higher speed than static CMOS FA designs. However, the major drawback of dynamic CMOS design is high power dissipation due to the large clock load and unnecessary switching in idle mode. Dynamic CMOS design is also more susceptible to leakage current than static CMOS design. Dynamic CMOS FA designs are therefore not suitable for battery operated devices [3,4,9] and are not considered in this paper.

Static CMOS design consumes less power than dynamic CMOS design; hence, FAs designed in a static CMOS style are more suitable for battery operated devices. Static CMOS FA designs can be further divided into five broad categories: conventional CMOS logic style, Pass Transistor (PT) logic style, Complementary Pass-transistor Logic (CPL) style, TG logic style, and hybrid CMOS logic style.

The conventional CMOS logic style is based on two complementary networks: the PMOS pull-up network and the NMOS pull-down network [1,4,5]. Its advantage is robustness against voltage scaling, which is essential for reliable operation at low voltage. It also has small rise/fall times, which allows operation at higher frequencies [1,4]. Its drawback is a higher input capacitance, which can degrade the performance of the FA, especially when FAs are connected in cascade [4]. Some conventional CMOS FA designs are reported in Refs. [5,7].

The FA with the lowest number of transistors reported in the literature is designed in the PT logic style; it requires only 6 transistors to generate the outputs Sum and Cout [10]. Other FA designs in the PT logic style require only 8 to 10 transistors [11–13]. Hence, FAs designed in the PT logic style achieve low power dissipation. Their major drawbacks are poor driving capability and threshold voltage drop, which make them unsuitable for chain and tree structures [1,4].

The CPL style is based on two mutually exclusive NMOS networks. An FA designed in this style can achieve high speed, full swing and good driving capability because of its output inverters. One such FA design, using 32 transistors, is reported in Ref. [2]. However, this design consumes high power because of its many internal nodes. In addition, the larger number of transistors occupies more silicon area, and the irregular arrangement of transistors increases layout complexity [4].

The TG logic style is a special kind of PT circuit in which NMOS and PMOS transistors are connected in parallel. The gate terminals of the NMOS and PMOS transistors are controlled by complementary signals, so the combination of both transistors propagates a logic value without any threshold voltage drop. The TG logic style thus eliminates the threshold voltage drop of the PT logic style at the cost of extra NMOS or PMOS transistors in the FA design. Some TG FA designs reported in the literature [14,15] achieve high speed and full voltage swing with low power dissipation (compared to conventional CMOS and CPL styles) [2,4]. These FA designs are suitable for chain and tree structures if buffers are inserted periodically to restore signal strength.

The hybrid CMOS logic style captures the advantages of different static CMOS design styles by focusing on reducing the number of transistors while maintaining sufficient driving capability and speed [4,9]. Many hybrid CMOS FA designs have been proposed in the past few decades to achieve low PDP [1,4,9,16–18].

For battery operated devices, power dissipation and speed carry equal weight [4,9]. An FA design therefore needs high speed, low power dissipation, good driving strength and full voltage swing at the output [19]. Of the FA design styles discussed above, the TG and hybrid CMOS styles are the most suitable for battery operated devices, and FAs designed in these styles would be able to achieve a low PDP value in chain and tree structures.

List of abbreviations

ALU: Arithmetic and Logic Unit
CPA: Carry Propagation Adder
CSA: Carry Save Adder
CSL: Carry Select adder
CSK: Carry Skip adder
CMOS: Complementary Metal Oxide Semiconductor
CPL: Complementary Pass-transistor Logic
FA: Full Adder
LSB: Least Significant Bit
LPHS18T: Low Power High Speed 18 Transistor full adder
LPHS22T: Low Power High Speed 22 Transistor full adder
MSB: Most Significant Bit
MUX: Multiplexer
NMOS: N-channel Metal Oxide Semiconductor
PMOS: P-channel Metal Oxide Semiconductor
PT: Pass Transistor
PDP: Power Delay Product
RCA: Ripple Carry Adder
TG: Transmission Gate
VLSI: Very Large Scale Integration

3. A new design approach: the triplet design

FAs are generally used to build multi-bit two-operand adders (such as the RCA, Carry Select adder (CSL) and Carry Skip adder (CSK); see footnote 2) and multipliers (such as Braun, Baugh-Wooley and Booth multipliers) that are based on a multi-bit multi-operand adder, referred to as a Carry Save Adder (CSA) [20]. In this paper, multi-bit two-operand adders are referred to as a “chain structure”, which requires fast generation of Cout in each FA in order to speed up carry propagation through the chain of cascaded FAs and to reduce glitches [1]. Multipliers consist of a CSA tree (for adding multiple partial products) and are hence referred to as a “tree structure”. Here, the FAs are arranged in layers that form a tree (for example, a Dadda or Wallace tree [21]); each layer generates intermediate sum and carry vectors, and the final sum and carry vectors go through a CPA to generate the product [1]. In tree structures, the performance of both outputs, Sum and Cout, of each FA is crucial for high speed.

The basic architecture of a 4-bit RCA is shown in Fig. 1. In the RCA, the carry bit ripples (propagates) through all the FAs, from the Least Significant Bit (LSB) FA to the Most Significant Bit (MSB) FA. An n-bit RCA (where n is small, between 4 and 8) is also part of larger adders, such as the CSL and CSK, and of the last layer of a CSA [20,21]. Fig. 2 shows the CSA based tree structure used in a 4-bit multiplier. The rippling of Sum and Cout plays an important role in determining the speed and power dissipation (due to unwanted glitches) of the multiplier. Hence, good driving strength and speed of the FA are the basic requirements of chain and tree structures.
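The chain structure described above can be illustrated at the logic level. The following is a minimal behavioural sketch, assuming standard full adder logic; the function names `full_adder` and `ripple_carry_add` are hypothetical and chosen only for illustration. It models how the carry ripples from the LSB FA to the MSB FA, and says nothing about transistor-level drive strength or delay.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, cout) for the bits a, b and cin."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, n=4, c0=0):
    """Add two n-bit unsigned integers with a chain of n full adders."""
    carry = c0
    result = 0
    for i in range(n):  # LSB FA first, MSB FA last
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)  # carry ripples to the next stage
        result |= s << i
    return result, carry  # n-bit sum and the final carry-out
```

For example, `ripple_carry_add(9, 7)` returns `(0, 1)`: the 4-bit sum is 0 and the carry-out is 1, which together encode 16.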

3.1. Driving capability issue with transmission gate and hybrid CMOS Full Adder designs

Many TG and hybrid CMOS FA designs proposed in the literature lack proper driving strength when they are used in chain and tree structures [2,4,9]. The main reason is that carry propagation in these structures is based on their transmission function (in other words, on a TG chain) [2,4,9]. This places a higher load on input C0 of the LSB FA when all transmission gates along the carry propagation path are active. The effect can be seen in Fig. 3, where a 3-bit RCA is built from three TG CMOS FAs. The inputs of FA0 can propagate directly to any of the Sum and Cout outputs of FA0 to FA2 when these FAs receive particular input transitions, and every FA stage (FA0 to FA2) adds one inverter load to input C0 of FA0. Capacitive loading is therefore a typical problem for conventional chain and tree structures, especially when they are built from TG and hybrid CMOS FA designs. The dotted lines in Figs. 1 and 2 indicate all possible signal propagation paths through the FAs of a 4-bit RCA and multiplier designed using TG CMOS FAs without any signal restoration; the bold dotted line indicates the worst such propagation path. According to the Elmore delay model, the total propagation time of the TG chain is [22]:

$$t_p = 0.69 \cdot CR \cdot \frac{n(n+1)}{2}$$

where t_p is the total propagation time, C is the total lumped capacitance of an FA, R is the total lumped resistance of an FA, and n is the number of stages.

If the FAs of the 4-bit RCA and multiplier in Figs. 1 and 2 are considered identical, the total propagation time of the 4-bit RCA (n = 4) is

$$t_p = 0.69 \cdot CR \cdot \frac{4(4+1)}{2} = 6.9 \cdot CR$$

Similarly, the total propagation time of the 4-bit multiplier (n = 6) is

$$t_p = 0.69 \cdot CR \cdot \frac{6(6+1)}{2} = 14.49 \cdot CR$$

One solution for reducing the total propagation time is to insert buffers periodically in chain and tree structures, to break the TG chain and restore the drive strength [22]. However, buffers require additional area and contribute to power dissipation and propagation delay [5].

Footnote 2: In general, the RCA, CSL and CSK are also referred to as “Carry Propagate Adder” (CPA).

Fig. 1. 4-bit RCA designed using regular FA.

Fig. 2. 4-bit multiplier designed using regular FA (considering partial products generated from 4-bit inputs X and Y).

Fig. 3. Load on input C0 when TG CMOS [14] FA is used in RCA.

3.2. Triplet design approach

Hence, we propose a new architecture approach, referred to as a “triplet design”, which reduces the load on input Cin by breaking its propagation in the carry chain. It also eliminates the need for buffers in chain and tree structures to restore drive strength. In the triplet design, three versions of an FA (see Fig. 4) are used to construct chain and tree structures. The first version accepts inputs A, B and Cin and generates outputs Sum and Cout (it is considered a regular FA). The second version has the same inputs as the first, but generates output $\overline{C_{out}}$ instead of Cout. The third version accepts input $\overline{C_{in}}$ instead of Cin and generates outputs Sum and Cout.

Fig. 4. Triplets of FA.

Fig. 5. Load is reduced on input C0 when triplets of TG CMOS [14] FA are used in RCA.

As shown in Fig. 5, FA0 is a version 2 FA, FA1 a version 3 FA, and FA2 a version 1 FA. In FA0, input C0 has to drive only output Sum0. Input $\overline{C_1}$ of FA1 is connected to output $\overline{C_1}$ of FA0, and this output is driven by the outputs of the inverters (signal $\overline{A_0}$ or $\overline{C_0}$ of FA0) instead of the inputs A0, B0 and C0 of FA0. Likewise, input $\overline{C_1}$ of FA1 has to drive only output Sum1, and output C2 is driven by input A1 or signal C1 of FA1. FA2 acts as a regular FA (see footnote 3) in the RCA, and outputs Sum2 and C3 are driven by input C2 of FA2. Hence, the triplet design breaks the carry propagation path in the TG chain and improves the driving strength of these structures without buffers, and it requires only a minor change in the internal connections (related to the Boolean expressions) of the TG or hybrid CMOS FA designs. The three versions of the triplet design do not increase the area of multi-bit adders and multipliers (the number of transistors in Figs. 3 and 5 is the same).

A 4-bit RCA and a 4-bit multiplier designed using triplets of FA are shown in Fig. 6 and Fig. 7 respectively. Only the internal signals differ from those of the 4-bit RCA and multiplier designed using regular FAs; the input and output signals are the same in both design approaches (regular and triplet). The bold dotted lines in Figs. 6 and 7 indicate the worst propagation paths through the FAs of the 4-bit RCA and multiplier without any signal restoration. The number of FAs on these worst propagation paths is smaller than the number of FAs on the worst propagation paths of the RCA and multiplier designed using regular FAs (see Figs. 1 and 2).

Fig. 6. 4-bit RCA designed using triplets of FA.

Fig. 7. CSA tree based 4-bit multiplier designed using triplets of FA (considering partial products generated from 4-bit inputs X and Y).

According to the Elmore delay model, the total propagation time of the 4-bit RCA designed using triplets of FA is:

$$t_p = 2\left[0.69 \cdot CR \cdot \frac{2(2+1)}{2}\right] + 2 \cdot t_{inv} = 4.14 \cdot CR + 2 \cdot t_{inv}$$

Here, t_inv is the inverter propagation delay (in the carry propagation path) of a version 2 or version 3 FA in the 4-bit RCA.
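The Elmore estimates above reduce to simple arithmetic on the stage count. The sketch below is only a numerical check of the coefficients quoted in the text, under the same assumption of identical FAs; the helper names `tg_chain_coeff` and `segmented_coeff` are hypothetical. It returns the coefficient k in t_p = k · CR and leaves the inverter delay t_inv of the triplet design as a separate symbolic term.

```python
def tg_chain_coeff(n):
    """Coefficient k such that t_p = k * C * R for an unbroken n-stage TG chain."""
    return 0.69 * n * (n + 1) / 2


def segmented_coeff(segments):
    """Coefficient for a chain broken into independent TG segments.

    `segments` lists the number of FA stages in each unbroken segment.
    Inverter delays between segments are not included here.
    """
    return sum(tg_chain_coeff(n) for n in segments)


regular_rca = tg_chain_coeff(4)         # about 6.9, regular 4-bit RCA
regular_mult = tg_chain_coeff(6)        # about 14.49, regular 4-bit multiplier
triplet_rca = segmented_coeff([2, 2])   # about 4.14, plus 2 * t_inv
```

Because the coefficient grows quadratically with the length of an unbroken chain, splitting one 4-stage chain into two 2-stage segments lowers the RC term from about 6.9 to about 4.14 at the cost of two inverter delays.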
Similarly, the total propagation time of the 4-bit multiplier designed using triplets of FA is:

$$t_p = 2\left[0.69 \cdot CR \cdot \frac{2(2+1)}{2}\right] + 2\left[0.69 \cdot CR \cdot \frac{1(1+1)}{2}\right] + 4 \cdot t_{inv} = 5.52 \cdot CR + 4 \cdot t_{inv}$$

In both cases, the triplet design has a lower lumped RC load than the regular design. Signals are also restored after a few FA stages in the triplet design; hence, signal restoring buffers are not explicitly needed.

Two conditions must be met to convert a given TG or hybrid CMOS FA design to the triplet design. First, the FA design must have both signals, Cin and $\overline{C_{in}}$. Second, the carry out (either Cout or $\overline{C_{out}}$) must be generated by TG type multiplexers. Some of the TG and hybrid CMOS FA designs proposed in the literature satisfy both conditions [14,15,17,23].

4. New hybrid CMOS Full Adder designs

This section introduces our two new hybrid CMOS FA designs (see Fig. 8), referred to as the “Low Power High Speed 22 Transistor” FA (LPHS22T) and the “Low Power High Speed 18 Transistor” FA (LPHS18T). Initial results for the LPHS22T FA design were reported in our earlier paper [24]; a detailed explanation is given in this paper. Both FA designs have three stages: Inverter, XOR-XNOR and Multiplexer (MUX).
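Before turning to the circuit details, the triplet wiring of Section 3.2 can be sanity-checked at the logic level. The sketch below is a behavioural model only, assuming the 3-bit arrangement described for Fig. 5 (FA0 as version 2, FA1 as version 3, FA2 as version 1); the function names are hypothetical, and the model does not capture drive strength, loading or delay, which are the actual motivation for the triplet design.

```python
def fa_v1(a, b, cin):
    """Version 1 (regular FA): true carry in, true carry out."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def fa_v2(a, b, cin):
    """Version 2: true carry in, inverted carry out."""
    s, cout = fa_v1(a, b, cin)
    return s, 1 - cout


def fa_v3(a, b, cin_bar):
    """Version 3: inverted carry in, true carry out."""
    return fa_v1(a, b, 1 - cin_bar)


def triplet_rca3(x, y, c0=0):
    """3-bit RCA wired as FA0 = version 2, FA1 = version 3, FA2 = version 1."""
    a = [(x >> i) & 1 for i in range(3)]
    b = [(y >> i) & 1 for i in range(3)]
    s0, c1_bar = fa_v2(a[0], b[0], c0)      # FA0 emits the inverted carry
    s1, c2 = fa_v3(a[1], b[1], c1_bar)      # FA1 consumes the inverted carry
    s2, c3 = fa_v1(a[2], b[2], c2)          # FA2 is a regular FA
    return s0 | (s1 << 1) | (s2 << 2), c3
```

Since the inverted carry produced by version 2 is consumed directly by version 3, the external inputs and outputs match those of a regular RCA, which is consistent with the statement that only internal signals differ between the two approaches.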

Fig. 8. Proposed hybrid CMOS FA designs (a) LPHS22T [24] (b) LPHS18T.

4.1. Inverter stage

The inverter stage generates the inverted input signals required by the XOR-XNOR and MUX stages. In the LPHS22T FA design, the transistor pairs MP1-MN1, MP2-MN2 and MP3-MN3 generate $\overline{A}$, $\overline{B}$ and $\overline{C_{in}}$ respectively. $\overline{A}$ and $\overline{B}$ are used in the XOR-XNOR stage, while $\overline{C_{in}}$ is used in the MUX stage to generate Sum. Similarly, in the LPHS18T FA design only $\overline{B}$ and $\overline{C_{in}}$ are generated, through the transistor pairs MP1-MN1 and MP2-MN2 respectively.

Footnote 3: Normally FAs accept inputs A, B and Cin and generate Sum and Cout; in this paper we consider them as regular FAs. An RCA and multiplier designed using only regular FAs are considered a regular design. The three versions of an FA are considered triplets of FA. An RCA and multiplier designed using triplets of FA are considered a triplet design.

4.2. XOR-XNOR stage

This stage generates the XOR and XNOR signals that control the selection lines of the 2-input TG type multiplexers (in the MUX stage). The LPHS22T and LPHS18T FA designs use two different strategies for the XOR-XNOR stage.

In the LPHS22T FA design, we aimed at reducing the propagation delay by generating the XOR and XNOR signals simultaneously (they later go to the MUX stage as selection lines). Both signals are generated in the Double Pass-transistor Logic (DPL) style [25]. The DPL style provides a lower path resistance between source and destination than the PT and TG styles, and hence reduces the XOR and XNOR signal generation time [25]. The following Boolean equations are used to generate the XOR and XNOR signals in the LPHS22T FA design:

$$XOR = \overline{A}B + A\overline{B} \quad \text{(LPHS22T: MP4-MP5-MN4-MN5)}$$

$$XNOR = AB + \overline{A}\,\overline{B} \quad \text{(LPHS22T: MP6-MP7-MN6-MN7)}$$

Fig. 9. Layouts of triplets of LPHS18T FA design (a) Version 1 (b) Version 2 (c) Version 3.

Fig. 10. (a) 8-bit RCA designed using regular FA (b) 8-bit multiplier designed using regular FA (c) 8-bit RCA designed using triplets of FA (d) 8-bit multiplier designed using triplets of FA.

In the LPHS18T design, we focused on reducing power dissipation.

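Table 1. Simulation results of 4, 8 and 16 bit RCA designed using regular FA and triplets of FA.

4-bit RCA

| FA | Regular/Triplet | Max. Prop. Delay (ps) | Average Power Dissipation (uW) | PDP (fWs) | Reduction in PDP (%) |
|---|---|---|---|---|---|
| 20T conventional | Regular | 299 | 5.72 | 1.71 | |
| 20T conventional | Triplet | 267 | 5.86 | 1.56 | 8.57 |
| Kamsani | Regular | 279 | 5.61 | 1.57 | |
| Kamsani | Triplet | 267 | 5.78 | 1.54 | 1.4 |
| LPHS18T | Regular | 279 | 5.61 | 1.56 | |
| LPHS18T | Triplet | 266 | 5.69 | 1.51 | 3.22 |
| LPHS22T | Regular | 279 | 5.79 | 1.61 | |
| LPHS22T | Triplet | 267 | 6.09 | 1.62 | -0.61 |
| TFA | Regular | 278 | 5.15 | 1.61 | – |
| TG CMOS | Regular | 280 | 5.49 | 1.54 | |
| TG CMOS | Triplet | 265 | 5.9 | 1.56 | -1.72 |

8-bit RCA

| FA | Regular/Triplet | Max. Prop. Delay (ps) | Average Power Dissipation (uW) | PDP (fWs) | Reduction in PDP (%) |
|---|---|---|---|---|---|
| 20T conventional | Regular | 850 | 6.14 | 5.22 | |
| 20T conventional | Triplet | 597 | 6.25 | 3.73 | 28.46 |
| Kamsani | Regular | 849 | 5.55 | 4.71 | |
| Kamsani | Triplet | 594 | 5.84 | 3.47 | 26.36 |
| LPHS18T | Regular | 856 | 5.32 | 4.55 | |
| LPHS18T | Triplet | 593 | 5.54 | 3.29 | 27.8 |
| LPHS22T | Regular | 852 | 6.16 | 5.25 | |
| LPHS22T | Triplet | 597 | 6.59 | 3.93 | 25.05 |
| TFA | Regular | 912 | 4.79 | 4.37 | – |
| TG CMOS | Regular | 855 | 5.94 | 5.08 | |
| TG CMOS | Triplet | 590 | 6.17 | 3.64 | 28.31 |

16-bit RCA

| FA | Regular/Triplet | Max. Prop. Delay (ps) | Average Power Dissipation (uW) | PDP (fWs) | Reduction in PDP (%) |
|---|---|---|---|---|---|
| 20T conventional | Regular | 2919 | 3.22 | 9.4 | |
| 20T conventional | Triplet | 1154 | 3.26 | 3.76 | 59.98 |
| Kamsani | Regular | 2905 | 3 | 8.73 | |
| Kamsani | Triplet | 1155 | 3.03 | 3.5 | 59.87 |
| LPHS18T | Regular | 2921 | 2.58 | 7.53 | |
| LPHS18T | Triplet | 1153 | 2.78 | 3.2 | 57.87 |
| LPHS22T | Regular | 2906 | 3.25 | 9.45 | |
| LPHS22T | Triplet | 1154 | 3.53 | 4.08 | 56.84 |
| TFA | Regular | 2989 | 2.32 | 6.93 | – |
| TG CMOS | Regular | 2912 | 3.14 | 9.14 | |
| TG CMOS | Triplet | 1157 | 3.37 | 3.89 | 57.41 |

Note: Simulation results are based on limited input transitions that excite the longest propagation path.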

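Table 2. Simulation results of 4, 8 and 16 bit multiplier designed using regular FA and triplets of FA.

4-bit multiplier

| FA | Regular/Triplet | Max. Prop. Delay (ps) | Average Power Dissipation (uW) | PDP (fWs) | Reduction in PDP (%) |
|---|---|---|---|---|---|
| 20T conventional | Regular | 519 | 6.31 | 3.27 | |
| 20T conventional | Triplet | 366 | 6.59 | 2.41 | 26.3 |
| Kamsani | Regular | 469 | 6.75 | 3.17 | |
| Kamsani | Triplet | 312 | 7 | 2.18 | 31.08 |
| LPHS18T | Regular | 346 | 6.3 | 2.18 | |
| LPHS18T | Triplet | 288 | 6.19 | 1.78 | 18.28 |
| LPHS22T | Regular | 471 | 7.19 | 3.39 | |
| LPHS22T | Triplet | 313 | 7.05 | 2.21 | 34.86 |
| TFA | Regular | 425 | 5.96 | 2.53 | – |
| TG CMOS | Regular | 473 | 6.52 | 3.08 | |
| TG CMOS | Triplet | 313 | 6.82 | 2.13 | 30.82 |

8-bit multiplier

| FA | Regular/Triplet | Max. Prop. Delay (ps) | Average Power Dissipation (uW) | PDP (fWs) | Reduction in PDP (%) |
|---|---|---|---|---|---|
| 20T conventional | Regular | 1963 | 8.02 | 15.7 | |
| 20T conventional | Triplet | 857 | 8.08 | 6.92 | 56.04 |
| Kamsani | Regular | 1790 | 8.12 | 14.5 | |
| Kamsani | Triplet | 773 | 8.45 | 6.53 | 55.07 |
| LPHS18T | Regular | 1330 | 7.71 | 10.3 | |
| LPHS18T | Triplet | 597 | 7.79 | 4.65 | 54.66 |
| LPHS22T | Regular | 1814 | 8.46 | 15.3 | |
| LPHS22T | Triplet | 794 | 8.69 | 6.9 | 55.05 |
| TFA | Regular | 1576 | 7.44 | 11.7 | – |
| TG CMOS | Regular | 1850 | 7.9 | 14.6 | |
| TG CMOS | Triplet | 816 | 8.04 | 6.56 | 55.14 |

16-bit multiplier

| FA | Regular/Triplet | Max. Prop. Delay (ps) | Average Power Dissipation (uW) | PDP (fWs) | Reduction in PDP (%) |
|---|---|---|---|---|---|
| 20T conventional | Regular | 7765 | 4.84 | 37.6 | |
| 20T conventional | Triplet | 1937 | 4.86 | 9.42 | 74.97 |
| Kamsani | Regular | 7110 | 4.96 | 35.3 | |
| Kamsani | Triplet | 1826 | 5.34 | 9.75 | 72.37 |
| LPHS18T | Regular | 5489 | 4.76 | 26.1 | |
| LPHS18T | Triplet | 1154 | 4.15 | 4.79 | 81.7 |
| LPHS22T | Regular | 7276 | 4.93 | 35.9 | |
| LPHS22T | Triplet | 1888 | 5.26 | 9.93 | 72.33 |
| TFA | Regular | 5924 | 4.33 | 25.6 | – |
| TG CMOS | Regular | 7415 | 4.69 | 34.7 | |
| TG CMOS | Triplet | 1902 | 4.98 | 9.46 | 72.76 |

Note: Simulation results are based on limited input transitions that excite the longest propagation path.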

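Table 3. Simulation results of 4-bit multiplier for best and worst operating conditions.

Best operating condition

| FA | Regular/Triplet | Max. Propagation Delay (ps) | Avg. Power Dissipation (uW) | PDP (fWs) | Reduction in PDP (%) |
|---|---|---|---|---|---|
| 20T conventional | Regular | 332 | 7.88 | 2.62 | |
| 20T conventional | Triplet | 228 | 8.42 | 1.92 | 26.63 |
| Kamsani | Regular | 299 | 8.14 | 2.43 | |
| Kamsani | Triplet | 198 | 8.29 | 1.64 | 32.53 |
| LPHS18T | Regular | 224 | 7.81 | 1.75 | |
| LPHS18T | Triplet | 185 | 7.41 | 1.37 | 21.65 |
| LPHS22T | Regular | 300 | 8.31 | 2.49 | |
| LPHS22T | Triplet | 200 | 8.54 | 1.71 | 31.54 |
| TFA | Regular | 276 | 7.09 | 1.96 | – |
| TG CMOS | Regular | 301 | 7.82 | 2.35 | |
| TG CMOS | Triplet | 199 | 8.45 | 1.68 | 28.58 |

Worst operating condition

| FA | Regular/Triplet | Max. Propagation Delay (ps) | Avg. Power Dissipation (uW) | PDP (fWs) | Reduction in PDP (%) |
|---|---|---|---|---|---|
| 20T conventional | Regular | 776 | 5.18 | 4.02 | |
| 20T conventional | Triplet | 565 | 5.36 | 3.03 | 24.63 |
| Kamsani | Regular | 703 | 5.37 | 3.77 | |
| Kamsani | Triplet | 471 | 5.8 | 2.73 | 27.68 |
| LPHS18T | Regular | 510 | 5.12 | 2.61 | |
| LPHS18T | Triplet | 432 | 5.11 | 2.21 | 15.58 |
| LPHS22T | Regular | 707 | 5.57 | 3.94 | |
| LPHS22T | Triplet | 472 | 5.79 | 2.74 | 30.6 |
| TFA | Regular | 624 | 4.79 | 2.99 | – |
| TG CMOS | Regular | 711 | 5.16 | 3.67 | |
| TG CMOS | Triplet | 475 | 5.38 | 2.56 | 30.41 |

Note: Simulation results are based on limited input transitions that excite the longest propagation path.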

Table 4. Simulation results of 4-bit RCA designed using regular FA and triplets of FA.

| Full Adder | Regular/Triplet | Max. Propagation Delay (ps) | Average Power Dissipation (uW) | PDP (fWs) | Reduction in PDP |
|---|---|---|---|---|---|
| 20T conventional | Regular | 717 | 8.77 | 6.29 | |
| 20T conventional | Triplet | 676 | 9.17 | 6.2 | 1.43% |
| Kamsani | Regular | 707 | 9.84 | 6.95 | |
| Kamsani | Triplet | 662 | 10.1 | 6.69 | 3.74% |
| LPHS18T | Regular | 752 | 7.97 | 5.99 | |
| LPHS18T | Triplet | 686 | 8.34 | 5.72 | 4.51% |
| LPHS22T | Regular | 736 | 10.1 | 7.32 | |
| LPHS22T | Triplet | 664 | 10.4 | 6.81 | 6.97% |
| TFA | Regular | 785 | 7.38 | 5.79 | – |
| TG CMOS | Regular | 745 | 8.77 | 6.53 | |
| TG CMOS | Triplet | 688 | 9.06 | 6.24 | 4.44% |

Note: Simulation results are based on all possible input transitions.

Power dissipation can be reduced by reducing the number of transistors in the design. The XOR-XNOR design suggested in Ref. [9], referred to as the “invertible inverter” and TG XNOR [5], is used in the LPHS18T design. It consists of 6 transistors, which is two fewer than the XOR-XNOR stage designed in the DPL style (8 transistors). The design suggested in Ref. [9] also does not require the $\overline{A}$ signal; hence, one less inverter is needed in the Inverter stage. Overall, 4 transistors are saved, at the cost of asynchronous generation of the XOR and XNOR signals, which may increase the propagation delay of the FA. The following Boolean equations are used to generate the XOR and XNOR signals in the LPHS18T FA design:

$$XNOR = AB + \overline{A}\,\overline{B} \quad \text{(LPHS18T: MP3-MP4-MN3-MN4)}$$

$$XOR = \overline{XNOR} \quad \text{(LPHS18T: MP5-MN5)}$$

4.3. MUX stage

TG type multiplexers are commonly used to generate the outputs Sum and Cout in many FA designs [9,14,16,23,26,27]. The simulation results in Ref. [18] show that FAs designed using TG type multiplexers generally achieve lower PDP values. In addition, a TG type multiplexer that generates Sum and Cout is a basic requirement for converting a regular FA design to triplets. Hence, both our FA designs contain TG type multiplexers to generate Sum and Cout. The XOR and XNOR signals generated by the XOR-XNOR stage are used as the selection lines of the multiplexers. Cin and $\overline{C_{in}}$ are the inputs of the multiplexer that generates Sum, and A and Cin are the inputs of the multiplexer that generates Cout. The final Boolean expressions for the outputs Sum and Cout, with $\oplus$ denoting XOR and $\odot$ denoting XNOR, are:

$$Sum = \overline{C_{in}}(A \oplus B) + C_{in}(A \odot B) \quad \text{(LPHS22T: MP8-MP9-MN8-MN9)} \quad \text{(LPHS18T: MP6-MP7-MN6-MN7)}$$

$$C_{out} = C_{in}(A \oplus B) + A(A \odot B) \quad \text{(LPHS22T: MP10-MP11-MN10-MN11)} \quad \text{(LPHS18T: MP8-MP9-MN8-MN9)}$$

The LPHS18T and LPHS22T FA designs satisfy the conditions given in the previous section for conversion into triplets. Layouts of the three versions (triplets) of the LPHS18T FA design are shown in Fig. 9 for reference.

5. Simulation environment and simulation results

To understand the effect of the triplet design approach on PDP, schematics of 4, 8 and 16 bit RCAs and multipliers are designed using triplets of FA, and their results are compared with those of RCAs and multipliers designed using regular FAs. Six FA designs (TG and hybrid CMOS FAs) are chosen for simulation and analysis: 20T conventional [17], Kamsani [15], LPHS18T, LPHS22T, TFA (see footnote 4) [9] and TG CMOS [14]. For a rough estimate of the maximum propagation delay and average power dissipation of the 4, 8 and 16 bit RCAs and multipliers, the longest signal propagation path is excited in the regular and triplet designs using a limited number of input transitions. Fig. 10 shows the excited path and the FAs used in the 8-bit RCA and multiplier for the regular and triplet designs. In Fig. 10(c-d), ‘*’ indicates the inverted input/output of version 2 or version 3 of the triplets. The simulation results for this arrangement are shown in Tables 1 and 2 for the 4, 8 and 16 bit RCAs and multipliers respectively. The operating conditions are: temperature = 50 °C, supply voltage = 1 V, input driver strength = 2× inverter (two times the minimum size inverter available in the chosen technology), output load = 8× inverter, and frequency = 200 MHz (for 4-bit), 100 MHz (for 8-bit) and 25 MHz (for 16-bit). The schematics of the 4, 8 and 16 bit RCAs and multipliers are designed and simulated in Cadence Virtuoso using the gpdk045 library.

Footnote 4: TFA is not convertible to triplet.

The simulation results show that the PDP advantage of the triplet design grows as the number of bits in the RCA and multiplier increases. A negative reduction in PDP is observed for the 4-bit RCA designed using LPHS22T and TG CMOS, because the effect of the triplet is smaller on short chains. In addition, not all possible transitions were applied, and these results are based on the schematic design. A positive reduction in PDP may therefore be obtained with thorough characterization (thorough characterization of the 4-bit RCA is covered later in this section).

The effect of temperature and supply voltage on the performance of triplets is also studied for the 4-bit multiplier. Two extreme cases are considered: temperature = 70 °C with supply voltage = 0.9 V as the worst case, and temperature = 0 °C with supply voltage = 1.1 V as the best case. Simulation results for the worst and best cases are shown in Table 3. According to these results, the triplet design provides lower PDP than the regular design under both best and worst operating conditions.

The rough estimation of maximum propagation delay and average power dissipation on the worst propagation path of the 4, 8 and 16 bit RCAs and multipliers indicates the superiority of the triplet design over the regular design. Still, thorough characterization of the n-bit RCA and multiplier is needed to prove the superiority of the triplet design over the regular design for any possible input transition.

The layout of an RCA is simple; however, the RCA is relatively slow, since each FA must wait for the carry bit to be calculated by the previous FA. Hence, the RCA is not used directly to design larger multi-bit adders, but smaller RCAs (i.e., 4-bit RCAs) are part of larger multi-bit adders designed as CSL, CSK, etc. [21]. Similarly, multi-bit multipliers (greater than 4 bits) are generally designed using ‘Divide and Conquer’ strategies [28–32] based on smaller multipliers (generally 4-bit multipliers). Hence, we have chosen to comprehensively test the 4-bit RCA and multiplier. Layouts of the 4-bit RCA and multiplier are simulated in Cadence Virtuoso using the gpdk045 library under the operating conditions temperature = 50 °C, supply voltage = 1 V, input driver strength = 2× inverter, output load = 8× inverter and frequency = 200 MHz, and by applying all possible transitions. Transistor widths of all FA designs are optimized to achieve low PDP for the given environment/condition. Table 4 and Table 5 show the simulation results of the 4-bit RCA and multiplier, respectively, designed using regular FAs and triplets of FA. For both the 4-bit RCA (chain structure) and the multiplier (tree structure), the triplet design outperforms the regular design. A significant reduction in PDP is observed (maximum 18.46% for the LPHS18T FA design) when the triplet design is used in the tree structure.

Table 5. Simulation results of 4-bit multiplier designed using regular FA and triplets of FA.

| Full Adder | Regular/Triplet | Max. Propagation Delay (ps) | Average Power Dissipation (uW) | PDP (fWs) | Reduction in PDP |
|---|---|---|---|---|---|
| 20T conventional | Regular | 1665 | 20.7 | 34.5 | |
| 20T conventional | Triplet | 1462 | 20.8 | 30.4 | 11.88% |
| Kamsani | Regular | 1598 | 22.4 | 35.8 | |
| Kamsani | Triplet | 1303 | 23 | 30 | 16.81% |
| LPHS18T | Regular | 1588 | 20.4 | 32.5 | |
| LPHS18T | Triplet | 1268 | 20.9 | 26.5 | 18.46% |
| LPHS22T | Regular | 1549 | 23.4 | 36.1 | |
| LPHS22T | Triplet | 1233 | 24 | 29.5 | 18.28% |
| TFA | Regular | 1505 | 18.3 | 27.5 | – |
| TG CMOS | Regular | 1449 | 21 | 30.5 | |
| TG CMOS | Triplet | 1166 | 21.7 | 25.3 | 17.05% |

Note: Simulation results are based on all possible input transitions.
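The PDP and “Reduction in PDP” figures reported in the tables follow from simple arithmetic on delay and power. The sketch below shows that arithmetic, assuming PDP is the product of maximum propagation delay and average power dissipation and that the reduction is measured relative to the regular design; the function names are hypothetical. Values recomputed from the rounded table entries can differ slightly from the tabulated PDP.

```python
def pdp_fws(delay_ps, power_uw):
    """PDP in fWs from delay in ps and power in uW.

    1 ps * 1 uW = 1e-12 s * 1e-6 W = 1e-18 Ws = 1e-3 fWs.
    """
    return delay_ps * power_uw * 1e-3


def pdp_reduction_percent(pdp_regular, pdp_triplet):
    """Reduction in PDP of the triplet design relative to the regular design."""
    return 100.0 * (pdp_regular - pdp_triplet) / pdp_regular


# LPHS18T 4-bit multiplier, triplet row of Table 5: 1268 ps and 20.9 uW
triplet_pdp = pdp_fws(1268, 20.9)                   # close to the tabulated 26.5 fWs
# Reduction from the tabulated PDP values 32.5 fWs (regular) and 26.5 fWs (triplet)
reduction = pdp_reduction_percent(32.5, 26.5)       # close to the tabulated 18.46%
```

A negative result from `pdp_reduction_percent` means that the triplet design has a higher PDP than the regular design for that case.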

6. Conclusion

The triplet design approach proposed in this paper improves the performance of TG and hybrid CMOS FA designs in chain and tree structures. The approach is applicable to TG and hybrid CMOS FA designs in which both Cin and its complement are available as inputs, and in which the carry output (either Cout or its complement) is generated by TG-type multiplexers. The triplet design approach reduces the load on the input Cin by breaking its propagation path in chain and tree structures, and eliminates the need for buffers. Our two new hybrid CMOS FA designs (LPHS22T and LPHS18T) are suitable for triplet design. Six FA designs (TG and hybrid CMOS FAs), including our proposed hybrid CMOS FA designs, are used to build 4, 8 and 16 bit RCAs (chain structure) and multipliers (tree structure). In all cases, the triplet design achieves a lower PDP than the regular design. This suggests that our proposed FA designs, along with the proposed triplet design approach, are well suited for incorporation in processors for battery-operated devices.

References

[1] C.H. Chang, J.M. Gu, M. Zhang, A review of 0.18-μm full adder performances for tree structured arithmetic circuits.
[2] M. Alioto, G. Palumbo, Analysis and comparison on full adder block in submicron technology, IEEE Trans. Very Large Scale Integr. Syst. 10 (6) (2002) 806–823.
[3] M. Aguirre-Hernandez, M. Linares-Aranda, CMOS full-adders for energy-efficient arithmetic applications, IEEE Trans. Very Large Scale Integr. Syst. 19 (4) (2011) 718.
[4] S. Goel, A. Kumar, M.A. Bayoumi, Design of robust, energy-efficient full adders for deep-submicrometer design using hybrid-CMOS logic style, IEEE Trans. Very Large Scale Integr. Syst. 14 (12) (2006) 1309–1321.
[5] N.H. Weste, D. Harris, CMOS VLSI Design: a Circuits and Systems Perspective, Pearson Education India, 2015.
[6] M. Alioto, G. Palumbo, Delay uncertainty due to supply variations in static and dynamic full adders, in: 2006 IEEE International Symposium on Circuits and Systems, IEEE, 2006, p. 4.
[7] S. Purohit, M. Margala, Investigating the impact of logic and circuit implementation on full adder performance, IEEE Trans. Very Large Scale Integr. Syst. 20 (7) (2012) 1327–1331.
[8] A.A. Fayed, M.A. Bayoumi, Noise-tolerant design and analysis for a low-voltage dynamic full adder cell, in: 2002 IEEE International Symposium on Circuits and Systems. Proceedings (Cat. No. 02CH37353), vol. 3, IEEE, 2002, p. III.
[9] P. Bhattacharyya, B. Kundu, S. Ghosh, V. Kumar, A. Dandapat, Performance analysis of a low-power high-speed hybrid 1-bit full adder circuit, IEEE Trans. Very Large Scale Integr. Syst. 23 (10) (2015) 2001–2008.
[10] K. Chandra, R. Kumar, S. Uniyal, V. Ramola, A new design 6T full adder circuit using novel 2T XNOR gates, IOSR J. VLSI Signal Process. 5 (3) (2015) 63–68.
[11] H.A. Mahmoud, M.A. Bayoumi, A 10-transistor low-power high-speed full adder cell, in: ISCAS'99. Proceedings of the 1999 IEEE International Symposium on Circuits and Systems VLSI (Cat. No. 99CH36349), vol. 1, IEEE, 1999, pp. 43–46.
[12] J.-F. Lin, Y.-T. Hwang, M.-H. Sheu, Low power 10-transistor full adder design based on degenerate pass transistor logic, in: 2012 IEEE International Symposium on Circuits and Systems, IEEE, 2012, pp. 496–499.
[13] H.T. Bui, Y. Wang, Y. Jiang, Design and analysis of low-power 10-transistor full adders using novel XOR-XNOR gates, IEEE Trans. Circuits Syst. Part II Analog Digital Signal Process. 49 (1) (2002) 25–30.
[14] A.M. Shams, T.K. Darwish, M.A. Bayoumi, Performance analysis of low-power 1-bit CMOS full adder cells, IEEE Trans. Very Large Scale Integr. Syst. 10 (1) (2002) 20–29.
[15] N.A. Kamsani, V. Thangasamy, S.J. Hashim, Z. Yusoff, M.F. Bukhori, M.N. Hamidon, A low power multiplexer based pass transistor logic full adder, in: Micro and Nanoelectronics (RSM), 2015 IEEE Regional Symposium on, IEEE, 2015, pp. 1–4.
[16] M. Aguirre, M. Linares, An alternative logic approach to implement high-speed low-power full adder cells, in: Integrated Circuits and Systems Design, 18th Symposium on, IEEE, 2005, pp. 166–171.
[17] N. Zhuang, H. Wu, A new design of the CMOS full adder, IEEE J. Solid State Circuits 27 (5) (1992) 840–844.
[18] P.T. Yen, N.F.Z. Abidin, A.B. Ghazali, Performance analysis of full adder (FA) cells, in: Computers & Informatics (ISCI), 2011 IEEE Symposium on, IEEE, 2011, pp. 141–146.
[19] M. Mewada, M. Zaveri, An input test pattern for characterization of a full-adder and n-bit ripple carry adder, in: Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference on, IEEE, 2016, pp. 250–255.
[20] P. Behrooz, Computer Arithmetic: Algorithms and Hardware Designs, vol. 19, Oxford University Press, 2000, pp. 512583–512585.
[21] K.-S. Yeo, K. Roy, Low Voltage, Low Power VLSI Subsystems, McGraw-Hill, Inc., 2004.
[22] J.M. Rabaey, A.P. Chandrakasan, B. Nikolic, Digital Integrated Circuits, vol. 2, Prentice Hall, Englewood Cliffs, 2002.
[23] N.R. Konijeti, J. Ravindra, P. Yagateela, Power aware and delay efficient hybrid CMOS full-adder for ultra deep submicron technology, in: Modelling Symposium (EMS), 2013 European, IEEE, 2013, pp. 697–700.
[24] M. Mewada, M. Zaveri, A low-power high-speed hybrid full adder, in: VLSI Design and Test (VDAT), 2016 20th International Symposium on, IEEE, 2016, pp. 1–2.
[25] M. Suzuki, N. Ohkubo, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, Y. Nakagome, A 1.5-ns 32-b CMOS ALU in double pass-transistor logic, IEEE J. Solid State Circuits 28 (11) (1993) 1145–1151.
[26] R. Zimmermann, R. Gupta, Low-power logic styles: CMOS vs CPL, in: Solid-State Circuits Conference, 1996. ESSCIRC'96. Proceedings of the 22nd European, IEEE, 1996, pp. 112–115.
[27] M.J.A. Morad, S.R. Talebiyan, E. Pakniyat, Design of new low-power high-performance full adder with new XOR-XNOR circuit, in: Technology, Communication and Knowledge (ICTCK), 2015 International Congress on, IEEE, 2015, pp. 153–158.
[28] H. Fan, J. Sun, M. Gu, K.-Y. Lam, Overlap-free Karatsuba–Ofman polynomial multiplication algorithms, IET Inf. Secur. 4 (1) (2010) 8–14.
[29] C. Paar, A new architecture for a parallel finite field multiplier with low complexity based on composite fields, IEEE Trans. Comput. 45 (7) (1996) 856–861.
[30] Y. Li, Y. Zhang, X. Guo, C. Qi, N-term Karatsuba algorithm and its application to multiplier designs for special trinomials, IEEE Access 6 (2018) 43056–43069.
[31] A. Zanoni, Toom-Cook 8-way for long integers multiplication, in: 2009 11th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, IEEE, 2009, pp. 54–57.
[32] Z. Gu, S. Li, A division-free Toom-Cook multiplication based Montgomery modular multiplication, IEEE Transactions on Circuits and Systems II: Express Briefs.

⁵ A 4-bit RCA has 9 input bits (4 bits for input A, 4 bits for input B and 1 bit for input Cin), so the number of all possible input transitions is $2^9(2^9-1) = 261632$. In the same way, a 4-bit multiplier has 8 input bits (4 bits for input X and 4 bits for input Y), so the number of all possible input transitions is $2^8(2^8-1) = 65280$ [19].
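The transition counts in this footnote can be checked by counting ordered pairs of distinct input vectors. The following is a minimal sketch of that arithmetic only; the function name `count_input_transitions` is a hypothetical helper introduced here for illustration and is not part of the paper or of the test pattern in [19].

```python
def count_input_transitions(num_input_bits: int) -> int:
    """Count ordered (previous, next) pairs of distinct input vectors.

    With n input bits there are 2**n vectors, and each one can be
    followed by any of the other 2**n - 1 vectors.
    """
    num_vectors = 2 ** num_input_bits
    return num_vectors * (num_vectors - 1)


# 4-bit RCA: 4 bits for A, 4 bits for B, 1 bit for Cin
rca_input_bits = 4 + 4 + 1
# 4-bit multiplier: 4 bits for X, 4 bits for Y
multiplier_input_bits = 4 + 4

print(count_input_transitions(rca_input_bits))         # 261632
print(count_input_transitions(multiplier_input_bits))  # 65280
```

The same expression grows as roughly $2^{2n}$, which is one reason exhaustive transition testing is practical for the 4-bit circuits but becomes costly for wider ones.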
