Low power and high speed multiplier design with row bypassing and parallel architecture


Microelectronics Journal 41 (2010) 639–650


Ko-Chi Kuo*, Chi-Wen Chou

Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan

Article info

Article history: Received 10 October 2009; received in revised form 16 June 2010; accepted 21 June 2010; available online 8 July 2010

Keywords: Low power; Bypassing multiplier; Parallel architecture; Ripple carry array

Abstract

This paper presents a low power and high speed row bypassing multiplier. The primary power reduction is obtained by turning off MOS components through multiplexers when the operand bits of the multiplier are zero. Analysis of conventional DSP applications shows that the average probability of zero bits in the multiplier operands is 73.8 percent. Therefore, the proposed bypassing multiplier can reduce power consumption significantly. The proposed multiplier adopts the ripple-carry adder with fewer additional hardware components. In addition, the proposed bypassing architecture enhances operating speed through an additional parallel architecture that shortens the delay of the multiplier. Multipliers for both unsigned and signed operands are developed. Post-layout simulations are performed with standard TSMC 0.18 µm CMOS technology and a 1.8 V supply voltage using Cadence Spectre simulation tools. Simulation results show that the proposed design reduces both power consumption and delay compared with its counterparts. For a 16 × 16 multiplier, the proposed design achieves 17 and 36 percent reductions in power consumption and delay, respectively, at the cost of a 20 percent increase in chip area in comparison with conventional array multipliers. In addition, the proposed design achieves average reductions of 11 and 38 percent in power consumption and delay with 46 percent less chip area in comparison with its counterparts for both unsigned and signed multipliers. The proposed design is suitable for low power and high speed arithmetic applications.

© 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +886 75252000 x4322; fax: +886 75254301. E-mail address: [email protected] (K.-C. Kuo).

0026-2692/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.mejo.2010.06.009

1. Introduction

With the growing demand for portable electronic devices, low power design has received increasing attention in recent years. The primary concern for a portable device is to extend its operating time without changing the battery. Although advances in technology have extended battery life, the complicated operations in high-end portable devices are still power hungry, which makes low power design critical. Low power design can be pursued at the system, architecture, logic, circuit, and technology levels. Power savings can be significant if low power design is planned early, at the system level. Optimizing the logic level of a circuit is also critical, and dedicated software needs to be developed to reach this goal. As technology continues to shrink, power consumption scales down as well. Many efforts to meet low power requirements at the circuit level have been reported in the literature, including voltage scaling, threshold voltage scaling, power-down strategies, and the choice of logic style. These options can be chosen at the circuit and topology levels to implement different arithmetic functions. For example, to implement a specific function at the architecture level, a ripple-carry, carry-save, or carry look-ahead adder can be adopted. By choosing one of these architectures, low power consumption can be achieved by trading off other specifications such as speed or chip area.

Digital signal processing (DSP) is one of the most important functions in electronic devices. DSP performs fundamental operations that include video processing for displaying streaming images and baseband processing for communication. These applications consume significant power. For example, the fast Fourier transform (FFT) is one of the essential building blocks in DSP. Owing to the popularity of orthogonal frequency division multiplexing (OFDM) in various portable communication devices, a low power FFT is a critical requirement. As stated in [1], multiplier modules occupy 46 percent of the chip area in the 64-point Split-Radix FFT. Therefore, significant power savings can be achieved by reducing the power consumption of the multiplier.

Different parallel multipliers have been proposed in the literature and are classified into tree-based multipliers [2–4] and array-based multipliers [4,5]. The advantage of the tree-based multiplier is that its delay grows only with the logarithm of the operand length [3]. On the other hand, array-based architectures are more popular because of their regular layout. However, the array-based multiplier potentially consumes more power than the tree-based architecture [6]. The reason is that additional adders embedded in the tree-based multiplier absorb spurious switching and hence reduce power consumption [7]. However, the layout of the tree-based multiplier tends to be complicated and induces more parasitic capacitance. In addition, the tree-based multiplier is limited to shorter operand lengths (< 16 bits) [8]. Multipliers with longer operand lengths can be implemented by the modified Booth encoding Wallace multiplier. Therefore, the array-based architecture is used in the proposed design.

Many researchers have focused on reducing the power consumption of multipliers [9–20]. A modified binary tree multiplier is presented in [9]; all partial products can be generated in one step, and power reduction is achieved by minimizing the power consumption of the full adder. A multiplier with array and tree architecture is proposed in [10] to enhance performance with low power consumption and a smaller time-delay product. A leapfrog multiplier [11] is modified such that the sum and carry signals to different rows of adders arrive at the same time, which reduces power consumption. The multiplier of [12] divides all partial products into four clusters and uses latches to disable a cluster when it is in the zero condition. Low power multiplication is achieved by operand decomposition in [13]; decomposition is performed on both the multiplicand and the multiplier to reduce logic transitions. The work in [14] uses a pre-computation based method to reduce power in a sequential multiplier. In [15], low power is accomplished by reducing the complexity of the multiplication architecture and its switching activity. In [16–18], significant power reductions are obtained by developing new adder cells for different multiplier designs. In [19,20], a low-power multiplier design is proposed with a bypassing method that turns off devices when the inputs of the multiplier are zero.

These power reduction techniques have been verified and implemented in many DSP and related applications at certain additional expense. Among these techniques, the most effective is reducing dynamic power consumption, which dominates total power consumption. Hence, average power consumption can be reduced significantly by adopting this method. Consequently, this paper develops a new design to achieve a low power and high speed multiplier. A novel low power multiplier is proposed that minimizes the switching activity of the multiplier while maintaining its speed by adopting a parallel architecture. The bypassing method achieves significant power savings if zeros make up more than half of the operand bits. However, the additional hardware required by the bypassing method lies in the critical path and reduces the operating speed of the multiplier. Hence, a parallel architecture is adopted to enhance the speed of the multiplier.

This paper is organized as follows. The concept of the multiplier and power consumption issues are described in Section 2. The estimated probability of zeros in the multiplier operands is also presented to illustrate the power saving advantage of the bypassing multiplier. In Section 3, a novel multiplier design based on the bypassing scheme with parallel architecture is proposed for both unsigned and signed operands. The simulation results of the proposed design and performance comparisons with counterpart circuits are shown in Section 4. Finally, the conclusion is given in Section 5.

2. Multiplier concepts

2.1. Conventional array multiplier

Conventional multipliers can be classified into iterative and array multipliers. An iterative multiplier accomplishes multiplication through a series of shift and addition operations. Since it reuses the same hardware to perform the multiplication, it occupies less area than other multipliers. However, it needs more clock cycles to complete a multiplication and cannot be realized in a pipeline structure. The array multiplier, on the other hand, is common in multiplier design because of its regular and compact structure. It is organized as several stages of adders and AND gates. It generates all the partial products after only one AND-gate delay and then sums the partial products sequentially. The arrangement of its adders is very regular, which is favorable for layout, and it can also be realized with a parallel structure. However, it occupies more area and hardware than the iterative multiplier.

The conventional array multiplier is primarily used for computing the product of two input operands. For example, two unsigned n-bit binary numbers A = a(n−1) a(n−2) a(n−3) … a0 and B = b(n−1) b(n−2) b(n−3) … b0 generate a 2n-bit product P (bits P(2n−1) … P0), which is defined as

$$P = A \times B = \left(\sum_{i=0}^{n-1} a_i 2^i\right)\left(\sum_{j=0}^{n-1} b_j 2^j\right) = \sum_{j=0}^{n-1}\sum_{i=0}^{n-1} (a_i b_j)\, 2^{i+j} \tag{1}$$

where i and j are the bit indices of A and B, respectively. An example of 4-bit multiplication is shown in Fig. 1.

Fig. 1. A 4 × 4 basic multiplication.

Fig. 2. Ripple carry array multiplier.

Two conventional array multipliers can be used to sum the partial products. According to the way the carry propagates, they are classified into two structures: the ripple-carry array (RCA) (Fig. 2) and the carry-save array (CSA). In the RCA multiplier, all adder cells are RCA adders. For example, it needs 3N adders to accomplish multiplication in an N × N multiplier. However, the delay in the worst case is (2N + 1) full adder delays. In the CSA multiplier, the main adder cells are CSA adders, and RCA adders are used in the final row. This array also needs 3N adders to accomplish multiplication, but the delay in the worst case is (N + 2) full adder delays. In order to achieve low power and high speed at the same time, the proposed multiplier is based on the RCA adder.
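The double sum in Eq. (1) can be checked with a short behavioural sketch. The function names below are illustrative and do not come from the paper; the two delay helpers only restate the worst-case full-adder counts quoted above for the RCA and CSA arrays.

```python
def array_multiply(a: int, b: int, n: int) -> int:
    """Sum the partial products (a_i AND b_j) * 2^(i+j) of Eq. (1)."""
    product = 0
    for j in range(n):              # one row of partial products per bit of B
        b_j = (b >> j) & 1
        for i in range(n):          # one AND gate per bit of A
            a_i = (a >> i) & 1
            product += (a_i & b_j) << (i + j)
    return product


def rca_worst_case_delay(n: int) -> int:
    """Worst-case delay of the RCA array in full-adder delays: 2N + 1."""
    return 2 * n + 1


def csa_worst_case_delay(n: int) -> int:
    """Worst-case delay of the CSA array in full-adder delays: N + 2."""
    return n + 2


if __name__ == "__main__":
    print(array_multiply(0b1111, 0b1001, 4))                    # 15 * 9 = 135
    print(rca_worst_case_delay(16), csa_worst_case_delay(16))   # 33 18
```

This is a functional model only: it reproduces the arithmetic of the array, not its gate-level timing or switching behaviour.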

2.2. Power consumption

Power consumption is a critical parameter in designing electronic circuits, especially in portable electronic and communication devices. CMOS technology has been widely used for VLSI circuit design owing to its low power consumption. The power consumption of CMOS circuits can be divided into static and dynamic power consumption [21]. Eq. (2) gives the power consumption of digital CMOS circuits [22]:

$$P_s = \alpha f C_L V_{DD}^2 + I_{SC} V_{DD} + I_{leakage} V_{DD} \tag{2}$$

where α is the switching probability, f is the average number of transitions, C_L is the output capacitance, V_DD is the supply voltage, I_SC is the short circuit current, and I_leakage is the leakage current. In submicron technology, leakage current also accounts for a significant portion of the power [23,24]. Some leakage reduction methods can be found in [23,24]. This paper mainly focuses on reducing the dynamic power consumption of the multiplier by minimizing switching activity.

2.3. Bypassing multiplier based on CSA

The bypassing multiplier [20] disables adders according to the multiplier bit bj (0 ≤ j ≤ n − 1); hence, power consumption can be reduced. In order to disable the adders, the conventional full adder needs to be modified as shown in Fig. 3. The bypassing multiplier based on the modified full adder is shown in Fig. 4. The modified full adder contains three tri-state buffers and two multiplexers to perform bypassing. The tri-state buffers decide whether or not to disable the adder based on the value of the multiplier bit bj, and the two multiplexers select the correct outputs. For instance, if bit bj is 0, the adders in the third row of the multiplier can be disabled, and the outputs of the adders in the second row are passed directly to the adders in the fourth row. Note that, because the rightmost full adder in the third row is disabled, the array by itself cannot perform the addition needed to generate the correct output. Therefore, extra hardware has to be added to perform the correct addition.

Fig. 3. Modified carry save full adder.

Fig. 4. A 4 × 4 bypassing multiplier based on carry save array.

2.4. Analysis of switching probability

The power saving achieved by the bypassing method depends primarily on the number of zeros in the input data of the multiplier.

The probability of a "zero" bit in the multiplier data is 50 percent for uniformly distributed data. In actual applications, however, such as Adaptive Differential Pulse Code Modulation (ADPCM), the G.723.1 speech coder, and wavelet-based image coders, the input data can be analyzed to illustrate the power reduction efficiency of the bypassing multiplier. The input data of these applications are taken from the analysis of effective dynamic range presented in [25]. The ADPCM data are a 0.125 s audio signal used in the multiplications of low and high pass band splitting. For the G.723.1 speech coder, 0.05 s of speech data sampled at 8 kHz is used in the multiplications of the autocorrelation step of linear prediction coding. In the last application, one fortieth of the multiplications for a 512 × 512 pixel image is performed for low and high pass filtering. The original data are fed into a 16 × 16 multiplier. The histograms of effective dynamic range show the probability distribution of the input vectors in terms of effective bit number. The zero bits can be used in the bypassing multiplier to disable devices. The effective bit numbers of these three applications are shown in Table 1 for multiplicand X and multiplier Y. The probability of each dynamic range is calculated from the actual input vectors. To estimate the number of zeros within the effective bits, it is assumed that 50 percent of the effective bits are zeros. Since the effective data range runs from 1 to 16 bits, the number of non-effective bits runs from 15 to 0. The non-effective bits are all zeros since they do not represent any value. Based on these assumptions, the probability of zeros can be estimated by the

Table 1
Probability of effective bit number of three applications: (A) ADPCM audio coder; (B) G.723.1 speech coder; and (C) wavelet-based image coder.

Effective        Case (A)         Case (B)         Case (C)
dynamic range    X (%)   Y (%)    X (%)   Y (%)    X (%)   Y (%)
−16              0       0        0       0        0       0
−15              2       0        0       0        0       0
−14              10      0        1       1        2       0
−13              9       5        2       3        5       0
−12              6       0        18      5        9       0
−11              5       10       12      4        5       0
−10              2       0        6       4        4       0
−9               2       0        4       3        3       0
−8               3       0        2       2        3       0
−7               3       10       1       1        2       6
−6               3       0        1       0        1       0
−5               2       0        0       0        1       0
−4               0       0        0       0        1       11
−3               0       0        0       0        0       11
−2               0       0        0       0        1       0
−1               0       0        0       0        1       11
1                2       37       0       38       12      11
2                0       0        0       0        1       0
3                0       0        0       0        1       23
4                0       0        0       0        2       0
5                2       5        1       0        1       0
6                4       0        2       0        1       22
7                2       5        2       0        2       5
8                2       5        2       1        3       0
9                2       5        4       3        4       0
10               2       0        6       4        8       0
11               5       0        11      6        9       0
12               6       5        10      5        11      0
13               11      5        9       2        6       0
14               13      0        3       5        1       0
15               2       5        3       13       0       0
16               0       0        0       0        0       0

Fig. 5. Probability of effective dynamic data from 1 to 16 bit (vertical axis: probability; horizontal axis: effective dynamic data range).

Table 2
Probability of zeros in three different applications: (A) ADPCM audio coder; (B) G.723.1 speech coder; and (C) wavelet-based image coder.

Operand                      Probability of zeros (%)
Case (A) Multiplicand X      65.28
Case (A) Multiplier Y        77.41
Case (B) Multiplicand X      66.91
Case (B) Multiplier Y        75.72
Case (C) Multiplicand X      72.13
Case (C) Multiplier Y        88.22

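following equation:

$$\sum_{i=1}^{n} \mathrm{prob}(D_{i}) \times \left(\frac{i}{n}\times 50\% + \frac{n-i}{n}\right) + \sum_{i=1}^{n} \mathrm{prob}(D_{-i}) \times \left(\frac{i}{n}\times 50\% + \frac{n-i}{n}\right) \tag{3}$$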

where n is the number of bits in multiplicand X and multiplier Y, D_i is the effective data, and prob is the probability of the specified effective data from Table 1. The probability of zeros as a function of the effective data range is shown in Fig. 5. From this figure, we can conclude that the fewer the effective data bits, the higher the probability of zeros, which is the case for most input vectors in the different applications. Most data lie in the lower and middle effective data ranges. Based on Eq. (3), the estimated probabilities of zeros in these three applications are calculated and summarized in Table 2; the probability of zeros for both multiplicand X and multiplier Y is above 65 percent, which is higher than the 50 percent of uniformly distributed data. In the wavelet-based image coder, the "zero" probability of multiplier Y even reaches 88 percent. Therefore, power consumption can be reduced significantly by adopting the bypassing method. In addition, it is observed that multiplier Y has a larger probability of zeros than multiplicand X. Therefore, the operand with the larger probability of zeros can be used to control the bypassing multiplier.
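As a rough check of this estimate, the sketch below evaluates Eq. (3) for the Case (A) columns of Table 1. It assumes that the sum runs over both the negative and the positive effective dynamic ranges and that only the magnitude of the range matters; under that assumption it gives the Case (A) values listed in Table 2. The function and variable names are illustrative.

```python
def zero_probability(prob_by_range: dict, n: int = 16) -> float:
    """Estimate the percentage of zero bits from a dynamic-range histogram.

    prob_by_range maps a signed effective dynamic range (-n..-1, 1..n) to its
    probability in percent. Half of the |i| effective bits are assumed to be
    zero, and all n - |i| non-effective bits are zero.
    """
    total = 0.0
    for dynamic_range, prob in prob_by_range.items():
        i = abs(dynamic_range)
        total += prob * (0.5 * i / n + (n - i) / n)
    return total


# Non-zero entries of Table 1, Case (A): ADPCM audio coder.
CASE_A_X = {-15: 2, -14: 10, -13: 9, -12: 6, -11: 5, -10: 2, -9: 2, -8: 3,
            -7: 3, -6: 3, -5: 2, 1: 2, 5: 2, 6: 4, 7: 2, 8: 2, 9: 2, 10: 2,
            11: 5, 12: 6, 13: 11, 14: 13, 15: 2}
CASE_A_Y = {-13: 5, -11: 10, -7: 10, 1: 37, 5: 5, 7: 5, 8: 5, 9: 5,
            12: 5, 13: 5, 15: 5}

if __name__ == "__main__":
    print(f"{zero_probability(CASE_A_X):.2f}")   # 65.28
    print(f"{zero_probability(CASE_A_Y):.2f}")   # 77.41
```

Each term weights a histogram bin by the fraction of the 16 bits expected to be zero, so operands concentrated at small dynamic ranges yield the highest zero probability.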

3. The proposed low power and high speed multiplier with row bypassing and parallel architecture

3.1. Unsigned bypassing multiplier design

The array multiplier is composed of rows of adders as shown in Fig. 2. The sum and carry signals generated by the previous row are fed into two of the three inputs of each adder in the current row. Power consumption is lower if the transitions of these input signals are less frequent. As shown in Table 2, the average zero probability of the input signals in different DSP applications is over 73.8 percent. Therefore, the most effective way to reduce the power of an array-based multiplier is to suppress the transitions of its adders. The operating principle of the bypassing multiplier is discussed in Section 2.3. The CSA-based bypassing multiplier can save a certain amount of power. However, the circuit implementation of the CSA-based multiplier shown in Figs. 3 and 4 is complicated, and the additional circuits required by the bypassing method can degrade the operating speed of the multiplier. As mentioned in Section 2.1, the CSA-based multiplier achieves a faster operating speed than the RCA-based multiplier. However, its hardware cost is 50 percent more than that of the conventional array multiplier [20].

The proposed multiplier adopts the ripple-carry adder with fewer hardware components and a parallel architecture. The new bypassing architecture is proposed to enhance the operating speed and reduce the power consumption of the ripple-carry adder at the same time. An RCA adder with bypassing ability is adopted in each row of adders. The reason for adopting the RCA adder instead of the CSA adder is to enable the parallel architecture. Two tri-state buffers are placed at two inputs of the full adder to disable its operation when bj is 0. The tri-state buffer is implemented with a transmission gate (TG). A multiplexer is placed at the sum output of the full adder, so that the sum is selected from either the bypassed value or the sum output of the full adder according to the value of bj. The proposed design does not need a multiplexer for the carry output or a tri-state buffer for the carry input of the full adder. The reason is that two inputs of every full adder in the jth row are disabled while bj is 0, so the carry outputs of the full adders in that row cannot change. Thereby, the full adder needs only two tri-state buffers and one multiplexer. Moreover, an AND gate is inserted at the last carry output of each row of full adders to correct the output when bj is 0. Therefore, a significant portion of the extra hardware is saved without degrading speed. In addition, power consumption is reduced as a result of the reduced hardware activity.

Fig. 6 shows the proposed full adder, and Fig. 7 shows the proposed 4 × 4 multiplier based on the modified RCA full adder. The proposed RCA full adder needs only two tri-state buffers and one multiplexer, whereas the full adder design in [20] needs three tri-state buffers and two multiplexers. It is evident that the proposed design reduces hardware area.

Fig. 6. Proposed RCA with row bypassing technique.

Fig. 7. A 4 × 4 row bypassing multiplier based on RCA.

A multiplication test vector of 1111 × 1001 applied to the proposed design is shown in Fig. 8. The values beside the arrows indicate the value of the sum bit or carry bit. In this example, the partial products to be summed in the first and second rows of adders are all zero because b1 = b2 = 0, so the sum outputs of these rows equal the results from the previous row of adders. It is noteworthy that the output carry bit of each full adder in such a row is zero and that the carry signals propagate in the same direction. Thus, all carry signals passed to the next full adder in the jth row are zero when bj is 0.

Fig. 8. An example for 4 × 4 multiplier with RCA.

Besides the above-mentioned method, the proposed multiplier also adopts a parallel architecture to shorten the delay. For an 8 × 8 multiplication, two 8 × 4 bypassing multipliers based on RCA, as shown in Fig. 9, are used. The partial sums and carry outputs of these two 8 × 4 multipliers are computed simultaneously. Note that the final stage consists of RCA adders on both sides and CSA adders in the middle. This configuration establishes the parallelism of the proposed multiplier and shortens the delay of the RCA multiplier. The final proposed multiplier is shown in Fig. 10. Less extra hardware is used than in [20]. The proposed multiplier needs ((5/2)N − 3) full adder delays in the worst case for an N × N multiplier. The proposed parallel architecture is not suitable for the CSA-based Braun multiplier or for [20]: a CSA-based multiplier cannot be decomposed into two parallel 8 × 4 multipliers because the inputs of the CSA adders in the current row come from the upper row. The 16 × 16 signed multiplier can be designed by a similar procedure.

Fig. 9. An 8 × 4 row bypassing multiplier based on RCA.

Fig. 10. An 8 × 8 row bypassing multiplier based on RCA.

Fig. 11. A 4 × 4 signed Braun multiplier.
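The row bypassing idea and the split of an 8 × 8 multiplication into two 8 × 4 blocks can be illustrated with a behavioural sketch. This is not the authors' circuit: it models only the arithmetic, treats every row whose multiplier bit bj is 0 as bypassed, and uses hypothetical function names. The final addition stands in for the last adder stage that merges the two blocks.

```python
def row_bypass_multiply(a: int, b: int, n_a: int, n_b: int):
    """Multiply an n_a-bit a by an n_b-bit b, skipping rows where b_j == 0.

    Returns (product, active_rows), where active_rows counts the rows that
    actually performed an addition.
    """
    a &= (1 << n_a) - 1
    acc = 0
    active_rows = 0
    for j in range(n_b):
        if (b >> j) & 1:
            acc += a << j          # row j adds its partial products
            active_rows += 1
        # else: row j is bypassed and acc passes through unchanged
    return acc, active_rows


def parallel_multiply_8x8(a: int, b: int):
    """Split b into two 4-bit halves, as in the two 8 x 4 blocks of the text."""
    low, rows_low = row_bypass_multiply(a, b & 0xF, 8, 4)
    high, rows_high = row_bypass_multiply(a, (b >> 4) & 0xF, 8, 4)
    # The two blocks are independent; the final stage merges their results.
    return low + (high << 4), rows_low + rows_high


if __name__ == "__main__":
    print(row_bypass_multiply(0b1111, 0b1001, 4, 4))   # (135, 2)
    print(parallel_multiply_8x8(200, 0b00010001))      # (3400, 2)
```

For the test vector 1111 × 1001 used above, the rows for b1 and b2 are bypassed and only two rows are active, which matches the observation that b1 = b2 = 0 leaves the running sum unchanged.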

3.2. Signed bypassing multiplier design

The multiplier introduced in the previous section computes unsigned numbers. However, it is essential to design signed multipliers because computer systems usually manipulate signed numbers. Several signed multiplication algorithms are presented in [26]. In conventional array multipliers such as the Braun multiplier, signed multiplication can be realized through the Baugh–Wooley multiplication algorithm [26], which is often used for this purpose. The algorithm uses 2's complement to represent signed numbers and uses the same framework as the array multiplier. Its advantage is that signed multiplication is accomplished without extending the sign bits. Consequently, the hardware cost is not increased and no extra power is dissipated. Only the AND gates of the corresponding operands are changed to NAND gates, and an inverter is inserted at the final carry output. Fig. 11 shows the architecture of a 4 × 4 signed Braun multiplier [26].

For example, two signed 4-bit binary numbers A = a3a2a1a0 and B = b3b2b1b0 generate a product P, which is defined as follows:

$$P = 1 \times 2^7 + a_3 b_3 2^6 + (\overline{a_3 b_2} + \overline{b_3 a_2})2^5 + (\overline{a_3 b_1} + \overline{b_3 a_1} + 1)2^4 + (\overline{a_3 b_0} + \overline{b_3 a_0})2^3 + (b_2 2^2 + b_1 2^1 + b_0)(a_2 2^2 + a_1 2^1 + a_0) \tag{4}$$

where the overbar denotes a complemented (NAND) partial product and P is taken as an 8-bit 2's complement number.

Next, the same algorithm is used to design the proposed bypassing multiplier with signed operands. The bypassing multiplier of [20] can also use the Baugh–Wooley multiplication algorithm [26] to realize a signed bypassing multiplier. According to the algorithm, some AND gates of the original design in [20] must be changed to NAND gates for the corresponding operands. However, general CSA adders would be used instead of the modified full adders shown in Fig. 3 for the last row of operands. The reason is as follows. Adders are disabled only when the operand is zero, and the probability that the output (AB)′ of a 2-input NAND gate is zero is only 25 percent. If the adder shown in Fig. 3 were used for this row, additional logic would have to be added, and the power consumed by this logic may be large. In other words, the adders in this row would dissipate more power most of the time. Consequently, general CSA adders are used for this row because they do not dissipate power in additional logic. Since a one has to be added in the final step of the Baugh–Wooley multiplication algorithm [26], one additional row of adders for carry propagation is placed in the last row of the signed bypassing multiplier. The complete circuit architecture of a 4 × 4 signed bypassing multiplier is shown in Fig. 12.

The proposed multiplier also adopts the Baugh–Wooley algorithm [26] for signed multiplication. In an 8 × 8 signed multiplication [26], the operands are separated into two parts. Full adders are used to compute the last row of operands according to the analysis in the previous paragraph. Therefore, two different 8 × 4 signed ripple-carry array multipliers, Blocks 1 and 2, need to be designed. Block 1, shown in Fig. 13, handles the upper part of the operands; it is similar to the multiplier shown in Fig. 9 except for some changes to the gates. Similarly, Block 2, shown in Fig. 14, handles the lower part of the operands. Block 1 uses the proposed full adder shown in Fig. 6 to compute all operands, whereas Block 2 differs only in the last row of operands, which is computed by ripple-carry adders. Since the proposed multiplier does not need an additional full adder to correct the multiplication, the addition in the final step can be computed without adding other full adders. Thus, the hardware requirement of the proposed signed multiplier is less than that of the signed bypassing multiplier. Finally, the two blocks are combined and an inverter is placed at the carry output. The proposed signed multiplier is shown in Fig. 15. The 16 × 16 signed multiplier can be designed by a similar procedure.

Fig. 12. A 4 × 4 signed bypassing multiplier.

Fig. 13. An 8 × 4 signed RCA multiplier with row bypassing block 1.

Fig. 14. An 8 × 4 signed RCA multiplier with row bypassing block 2.
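The 4-bit Baugh–Wooley expression of Eq. (4) can be verified exhaustively with a small sketch. It assumes that the partial products containing exactly one sign bit are complemented (the NAND gates described in Section 3.2) and that the result is kept to 8 bits; the function names are illustrative and not taken from the paper.

```python
def to_signed(x: int, bits: int) -> int:
    """Interpret the low `bits` bits of x as a 2's complement number."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x & (1 << (bits - 1)) else x


def nand(x: int, y: int) -> int:
    """Single-bit NAND."""
    return 1 - (x & y)


def baugh_wooley_4x4(a: int, b: int) -> int:
    """Evaluate Eq. (4) for 4-bit patterns a and b; returns an 8-bit pattern."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    p = 1 << 7
    p += (a_bits[3] & b_bits[3]) << 6
    p += (nand(a_bits[3], b_bits[2]) + nand(b_bits[3], a_bits[2])) << 5
    p += (nand(a_bits[3], b_bits[1]) + nand(b_bits[3], a_bits[1]) + 1) << 4
    p += (nand(a_bits[3], b_bits[0]) + nand(b_bits[3], a_bits[0])) << 3
    p += (a & 0b111) * (b & 0b111)      # unsigned product of the lower 3 bits
    return p & 0xFF                     # keep the 8-bit result


if __name__ == "__main__":
    pattern = baugh_wooley_4x4(0b1101, 0b0011)   # (-3) * 3
    print(pattern, to_signed(pattern, 8))        # 247 -9
```

Comparing the result with ordinary signed multiplication for all 256 operand pairs confirms that no sign extension is needed; the constant terms 2^7 and 2^4 absorb the corrections introduced by the complemented partial products.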

4. Simulation results and performance comparisons

In this section, the performance of the proposed multiplier is evaluated and compared with that of the conventional Braun multiplier. The performance metrics are power consumption, delay, power-delay product, and layout area. The circuits are designed at the transistor level without using any standard cells from the technology library. Post-layout simulations are performed with the standard TSMC 0.18 µm CMOS technology and a 1.8 V supply voltage using the Cadence Spectre simulation tools. The design and simulation flow is shown in Fig. 16. In the design process, the multiplier is first constructed at the circuit level in the Cadence design environment, and the power consumption and speed of the proposed design are obtained by simulation. After the circuit level is verified, the proposed unsigned and signed multipliers are converted to layouts with the Cadence Virtuoso Layout Editor. The layouts are verified with the Cadence DRC and LVS tools and are finally extracted with the Cadence LPE tool. An example layout of the proposed 16 × 16 signed multiplier is shown in Fig. 17.

Fig. 15. The proposed 8 × 8 signed RCA multiplier.

Fig. 16. Simulation flow.

To evaluate the proposed method, multipliers of two different sizes, 8 × 8 and 16 × 16, are simulated, and 20 test patterns are generated randomly for both the 8 × 8 and 16 × 16 multipliers. The test patterns are randomly generated with uniformly distributed probability, and the post-layout simulations are performed for both unsigned and signed multipliers in order to verify the feasibility of the proposed design. The performance comparisons of the proposed design and the other counterparts for both unsigned and signed multipliers are listed in Tables 3–5 in terms of power consumption, delay, and power-delay product. Table 3 shows the power consumption of the proposed and above-mentioned multipliers. The simulation results of delay and power-delay product for the Braun multiplier, [20], and the proposed multiplier are shown in Tables 4 and 5, respectively. Table 6 shows the layout area of the proposed multiplier, the Braun multiplier, and [20]. For a 16 × 16 multiplier, the proposed design achieves 17 and 36 percent reduction in power consumption and delay, respectively, at the cost of a 20 percent increase in chip area in comparison with the conventional array multiplier. In addition, the proposed design achieves averages of 11 and 38 percent reduction in power consumption and delay, respectively, with 46 percent less chip area in comparison with the counterpart [20] for both unsigned and signed multipliers. From these simulation results, it is evident that the proposed design outperforms the other counterparts in terms of power, delay, and power-delay product at the cost of an average of 24 percent area overhead. The proposed multiplier can achieve more power savings if the probability of zero is greater than the probability of one in the multiplier operand, which can be confirmed in Section 2.4.

Table 7 shows the performance comparison of the proposed multiplier design with results from other recently published papers. These multiplier designs are ROM based and low power bypassing, respectively. The ROM based multiplier achieves low power by using a single-transistor ROM cell that eliminates identical rows and columns [27]. The other bypassing method uses additional logic implemented in the adder to skip redundant signal transitions [28]. Both of these designs adopt the principle of reducing switching activity to lower the power consumption. The comparison is based on the multiplier performance reported in [27,28]. The power consumption, delay, and power-delay product of the proposed design are the lowest among these designs.
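The 16 × 16 figures quoted against the conventional array multiplier can be cross-checked from the table entries. The following Python sketch is illustrative only. It copies the Braun and proposed 16 × 16 values from Tables 3, 4, and 6, and it assumes (the text does not state this explicitly) that each quoted percentage is the mean of the unsigned and signed cases. The dictionary layout and the function names are hypothetical.

```python
# Cross-check of the 16 x 16 comparison against the conventional Braun multiplier.
# Values are copied from Tables 3 (power, mW), 4 (delay, ns) and 6 (area).
# Assumption: each quoted percentage is the mean of the unsigned and signed cases.

TABLES_16X16 = {
    # metric: {design: (unsigned, signed)}
    "power": {"braun": (11.050, 11.532), "proposed": (9.111, 9.619)},
    "delay": {"braun": (7.584, 8.104), "proposed": (4.713, 5.334)},
    "area": {"braun": (307585, 307585), "proposed": (367908, 372177)},
}


def mean_relative_change(metric: str) -> float:
    """Mean of (proposed - braun) / braun over the (unsigned, signed) pair."""
    braun = TABLES_16X16[metric]["braun"]
    proposed = TABLES_16X16[metric]["proposed"]
    changes = [(p - b) / b for p, b in zip(proposed, braun)]
    return sum(changes) / len(changes)


def power_delay_product(power_mw: float, delay_ns: float) -> float:
    """Power-delay product in pJ, since 1 mW x 1 ns = 1 pJ."""
    return power_mw * delay_ns


if __name__ == "__main__":
    for metric in ("power", "delay", "area"):
        print(f"{metric}: {100 * mean_relative_change(metric):+.1f} %")
    pdp = power_delay_product(9.111, 4.713)
    print(f"PDP of the proposed unsigned 16 x 16 design: {pdp:.3f} pJ")
```

Under the averaging assumption, the relative changes round to a 17 percent power reduction, a 36 percent delay reduction, and a 20 percent area increase, which is consistent with the figures quoted in the text.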

Fig. 17. The layout of the proposed 16 × 16 signed multiplier.

Table 3 Power consumption (in mW) and power saving. Columns give multiplier size and normalized ratio.

| Design | 8 × 8 | Ratio | 16 × 16 | Ratio |
|---|---|---|---|---|
| Braun (unsigned) | 2.413 | 1.00 | 11.050 | 1.00 |
| [10] (unsigned) | 2.238 | 0.93 | 9.561 | 0.87 |
| Proposed (unsigned) | 2.144 | 0.89 | 9.111 | 0.82 |
| Braun (signed) | 2.867 | 1.00 | 11.532 | 1.00 |
| [10] (signed) | 2.991 | 1.04 | 10.671 | 0.93 |
| Proposed (signed) | 2.445 | 0.85 | 9.619 | 0.83 |

Table 4 Delay (in ns) and improvement. Columns give multiplier size and normalized ratio.

| Design | 8 × 8 | Ratio | 16 × 16 | Ratio |
|---|---|---|---|---|
| Braun (unsigned) | 3.504 | 1.00 | 7.584 | 1.00 |
| [10] (unsigned) | 3.188 | 0.91 | 6.537 | 0.86 |
| Proposed (unsigned) | 2.243 | 0.64 | 4.713 | 0.62 |
| Braun (signed) | 3.713 | 1.00 | 8.104 | 1.00 |
| [10] (signed) | 3.472 | 0.93 | 7.238 | 0.89 |
| Proposed (signed) | 2.543 | 0.69 | 5.334 | 0.66 |

Table 5 Power-delay product (10⁻¹² J) and improvement. Columns give multiplier size and normalized ratio.

| Design | 8 × 8 | Ratio | 16 × 16 | Ratio |
|---|---|---|---|---|
| Braun (unsigned) | 8.455 | 1.00 | 83.803 | 1.00 |
| [10] (unsigned) | 7.135 | 0.84 | 62.844 | 0.75 |
| Proposed (unsigned) | 4.808 | 0.57 | 42.940 | 0.51 |
| Braun (signed) | 10.645 | 1.00 | 93.455 | 1.00 |
| [10] (signed) | 10.384 | 0.97 | 77.236 | 0.82 |
| Proposed (signed) | 6.217 | 0.58 | 51.307 | 0.55 |

Table 6 Total area (in µm²) and area overhead. Columns give multiplier size and normalized ratio.

| Design | 8 × 8 | Ratio | 16 × 16 | Ratio |
|---|---|---|---|---|
| Braun (unsigned) | 73524 | 1.00 | 307585 | 1.00 |
| [10] (unsigned) | 132342 | 1.80 | 538879 | 1.75 |
| Proposed (unsigned) | 92449 | 1.26 | 367908 | 1.20 |
| Braun (signed) | 73524 | 1.00 | 307585 | 1.00 |
| [10] (signed) | 139604 | 1.90 | 553681 | 1.80 |
| Proposed (signed) | 93592 | 1.27 | 372177 | 1.21 |

Table 7 Performance comparison of recent published papers. Columns give performance and normalized ratio.

| Design | Power (mW) | Ratio | Delay (ns) | Ratio | Power-delay product (pJ) | Ratio |
|---|---|---|---|---|---|---|
| [27] ROM based 16 × 16 multiplier | 13.50 | 1.48 | 5.55 | 1.18 | 74.9 | 1.75 |
| [28] Low power bypassing 8 × 8 multiplier | 16.30 | 1.79 | 14.28 | 3.03 | 232.7 | 5.42 |
| Proposed bypassing 16 × 16 multiplier | 9.11 | 1.00 | 4.71 | 1.00 | 42.9 | 1.00 |

5. Conclusions

A low power and high speed CMOS array multiplier is presented. The proposed multiplier reduces power consumption by disabling the adders in the multiplier when the inputs are zero. The delay of the multiplier is also shortened by adopting a parallel architecture. To validate the effectiveness of the proposed design, power consumption and delay are evaluated by Cadence Spectre post-layout simulation with the standard TSMC 0.18 µm CMOS technology. Simulation results show that the proposed design achieves greater power efficiency and a lower power-delay product, with less extra hardware, than the other counterparts. For a 16 × 16 multiplier, the proposed design achieves 17 and 36 percent reduction in power consumption and delay, respectively, at the cost of a 20 percent increase in chip area in comparison with the conventional array multiplier. In addition, the proposed design achieves 11 and 38 percent reduction in power consumption and delay, respectively, with 46 percent less chip area in comparison with [20]. The test patterns are randomly generated with uniformly distributed probability. As mentioned in Section 2.4, the average zero input of the multiplier operand for typical DSP applications is 73.8 percent. Therefore, the proposed multiplier can achieve even greater power saving if the probability of zero in the inputs of the multiplier is larger than 0.5. Compared with other recently published papers [27,28], the proposed bypassing multiplier achieves the lowest power, delay, and power-delay product. Hence, the proposed design achieves the goal of low power and high speed performance at the same time.
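The dependence of the saving on the zero probability can be illustrated with a small behavioural model. The sketch below is not the authors' circuit. It is a hypothetical Python model of an unsigned shift-and-add multiplier in which a row of adders is counted as bypassed whenever the corresponding multiplier bit is zero. The function names, the fixed seed, the per-bit interpretation of the zero probability, and the use of the bypassed-row fraction as a rough proxy for avoided switching activity are all assumptions made for illustration.

```python
import random
from typing import Tuple


def row_bypass_multiply(a: int, b: int, width: int = 16) -> Tuple[int, int]:
    """Return (product, bypassed_rows) for unsigned width-bit operands.

    Row j adds (a << j) only when bit j of the multiplier b is 1;
    otherwise the row is skipped, mimicking a bypassed adder row.
    """
    product = 0
    bypassed = 0
    for j in range(width):
        if (b >> j) & 1:
            product += a << j      # active row: partial product is added
        else:
            bypassed += 1          # zero multiplier bit: row is bypassed
    return product, bypassed


def random_operand(width: int, p_zero: float, rng: random.Random) -> int:
    """Draw a width-bit operand whose bits are zero with probability p_zero."""
    value = 0
    for j in range(width):
        if rng.random() >= p_zero:
            value |= 1 << j
    return value


def bypassed_fraction(p_zero: float, patterns: int = 20, width: int = 16,
                      seed: int = 0) -> float:
    """Average fraction of bypassed rows over randomly generated patterns."""
    rng = random.Random(seed)
    total = 0
    for _ in range(patterns):
        a = random_operand(width, 0.5, rng)       # multiplicand: uniform bits
        b = random_operand(width, p_zero, rng)    # multiplier: biased bits
        product, bypassed = row_bypass_multiply(a, b, width)
        assert product == a * b    # bypassing must not change the result
        total += bypassed
    return total / (patterns * width)


if __name__ == "__main__":
    for p_zero in (0.5, 0.738):
        print(p_zero, bypassed_fraction(p_zero, patterns=2000))
```

In this model the fraction of bypassed rows tracks the zero probability of the multiplier bits, so a 73.8 percent zero probability leaves fewer active rows than the uniform 0.5 case used for the test patterns. The model counts rows only and says nothing about the actual power drawn by the multiplexers or the remaining adders.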

Acknowledgements

The authors would like to acknowledge the financial support of the National Science Council, Taiwan, Republic of China, under grant number NSC96-2220-E-110-008. The authors would also like to express their greatest thanks to CIC (Chip Implementation Center) of NARL (National Applied Research Laboratories), Taiwan, for their thoughtful chip fabrication service.

References

[1] W.C. Yeh, C.W. Jen, High-speed and low-power split-radix FFT, IEEE Transactions on Signal Processing 51 (2003) 864–874.
[2] C.S. Wallace, A suggestion for a fast multiplier, IEEE Transactions on Computer 13 (1964) 14–17.
[3] V.G. Oklobdzija, D. Villeger, S.S. Liu, A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach, IEEE Transactions on Computers 45 (1996) 294–306.
[4] B. Parhami, Computer Arithmetic, Algorithms, and Hardware Design, Oxford University Press, New York, 2000.
[5] K.Z. Pekmestzi, Multiplexer-based array multiplier, IEEE Transactions on Computers 48 (1999) 15–23.
[6] P.C.H. Meier, R.A. Rutenbar, L.R. Carley, Exploring multiplier architecture and layout for low power, in: Proceedings of the IEEE Custom Integrated Circuits Conference, 1996, pp. 513–516.
[7] K.S. Chong, B.H. Gwee, J.S. Chang, A micropower low-voltage multiplier with reduced spurious switching, IEEE Transactions on Very Large Scale Integrated Systems 13 (2005) 255–265.
[8] C.H. Han, H.J. Park, L.S. Kim, A low-power array multiplier using separated multiplication technique, IEEE Transactions on Circuits and Systems-II, Analog and Digital Signal Processing 48 (2001) 866–871.
[9] E. Abu-Shama, M.B. Maaz, M.A. Bayoumi, A fast and low power multiplier architecture, in: Proceedings of the IEEE Midwest Symposium on Circuits and Systems, 1996, pp. 53–56.
[10] R. Mudassir, H. El-Razouk, Z. Abid, New designs of signed multiplier, in: Proceedings of the IEEE Northeast Workshop on Circuits and Systems, 2005, pp. 259–262.
[11] S. Mahant-Shetti, P. Balsara, C. Lemonds, High performance low power array multiplier using temporal tiling, IEEE Transactions on Very Large Scale Integrated Systems 7 (1999) 121–124.
[12] A.A. Fayed, M.A. Bayoumi, A novel architecture for low-power design of parallel multipliers, in: Proceedings of the IEEE Computer Society Workshop on VLSI, 2001, pp. 149–154.
[13] M. Ito, D. Chinnery, K. Keutzer, Low power multiplication algorithm for switching activity reduction through operand decomposition, in: Proceedings of the 21st International Conference on Computer Design, 2003, pp. 21–26.
[14] N. Honarmand, M.R. Javaheri, N. Sedaghati-Mokhtari, A. Afzali-Kusha, Power efficient sequential multiplication using pre-computation, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2006, pp. 2709–2712.
[15] L.H. Chen, O.T.-C. Chen, T.Y. Wang, Y.C. Ma, Multiplication-accumulation computation unit with optimized compressors and minimized switching activities, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2005, pp. 6118–6121.
[16] C. Senthilpari, A.K. Singh, K. Diwakar, Design of a low-power, high performance, 8 × 8 bit multiplier using a Shannon-based adder cell, Microelectronics Journal 39 (2008) 812–821.
[17] Z. Abid, H. El-Razouk, D.A. El-Dib, Low power multipliers based on new hybrid full adders, Microelectronics Journal 39 (2008) 1509–1515.
[18] K. Navi, V. Foroutan, M. Rahimi Azghadi, M. Maeen, M. Ebrahimpour, M. Kaveh, O. Kavehei, A novel low-power full-adder cell with new technique in designing logical gates based on static CMOS inverter, Microelectronics Journal 40 (2009) 1441–1448.
[19] S. Hong, S. Kim, M.C. Papaefthymiou, W.E. Stark, Low power parallel multiplier design for DSP applications through coefficient optimization, in: Proceedings of the IEEE International ASIC/SOC Conference, 1999, pp. 286–290.
[20] J. Ohban, V.G. Moshnyaga, K. Inoue, Multiplier energy reduction through bypassing of partial products, in: Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems, 2002, pp. 13–17.
[21] A.P. Chandrakasan, S. Sheng, R. Brodersen, Low-power CMOS digital design, IEEE Journal of Solid-State Circuits 27 (1992) 473–484.
[22] M. Psilogeorgopoulos, M. Munteanu, T.-S. Chuang, P.A. Ivey, L. Seed, Contemporary techniques for lower power circuit design, PREST Deliverable D2.1, The Department of Electronic and Electrical Engineering, The University of Sheffield, Mappin Street, Sheffield S1 3JD, UK, 1998, pp. 1–91 <http://www.engr.newpaltz.edu/~damu/spring_2008/resource/cont_tech.pdf>.
[23] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin, M. Kandemir, V. Narayanan, Leakage current: Moore's law meets static power, IEEE Computer 36 (2003) 68–75.
[24] J. Kao, S. Narendra, A. Chandrakasan, Subthreshold leakage modeling and reduction techniques, in: Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2002, pp. 141–148.
[25] O.T.C. Chen, S. Wang, Y.W. Wu, Minimization of switching activities of partial product for designing low-power multipliers, IEEE Transactions on Very Large Scale Integrated Systems 11 (2003) 418–433.
[26] R. Mudassir, H. El-Razouk, Z. Abid, New designs of signed multiplier, in: Proceedings of the IEEE International NEWCAS Conference, 2005, pp. 259–262.
[27] B.C. Paul, S.F. Fujita, M. Okajima, ROM-based logic (RBL) design: a low-power 16 bit multiplier, IEEE Journal of Solid-State Circuits 44 (2009) 2935–2942.
[28] C.C. Wang, G.N. Sung, Low-power multiplier design using a bypassing technique, Journal of Signal Processing Systems 57 (2009) 331–338.