Computers and Electrical Engineering 39 (2013) 202–213
High-speed low-power multiplexer-based selector for priority policy

Jih-ching Chiu 1, Kai-ming Yang *

Department of Electrical Engineering, National Sun Yat-Sen University, 70 Lien-hai Rd., Kaohsiung 804, Taiwan ROC

Article history: Received 8 December 2011; received in revised form 3 December 2012; accepted 3 December 2012; available online 9 January 2013.

Abstract — The nature of priority policy causes inefficient throughput in synchronous clock systems because of unbalanced propagation paths. To improve speed, the proposed priority scheme removes the extreme delay imbalance between the highest and lowest weights by integrating a multiplexer-based data selector with the priority encoder. The balanced propagation paths are analyzed at the gate level and demonstrated by post-layout simulation. In terms of scalability, the design extends easily in width and has a latency of only O(log m) for m requests. The proposed design also shortens the critical path by using delayed-precharge dynamic logic and transmission gates at the transistor level. The simulation results show that, for cases of 8–128 requests, this approach achieves balanced propagation paths from the fastest to the slowest path. The proposed design achieves a 4.5× speedup and a 57.2% decrease in power dissipation. Crown Copyright © 2012 Published by Elsevier Ltd. All rights reserved.

1. Introduction

A priority selector in a central management controller can function as a concurrent signal arbiter, as a priority gating system, and in other special applications [1–4]. In a multi-bit priority encoder, each bit is weighted according to its position. When several masters request a single resource such as a bus, an I/O, a memory, or an interconnection network router, the priority encoder grants and selects the request with the highest priority. The priority selector is widely used to manage hardware resources in the central arbiter of a shared resource and in special applications such as instruction fetch units for variable-length instructions and multi-instruction-issue processors. However, the priority encoder of a priority selector has extremely unbalanced propagation paths. Higher weight requests can be determined more easily than lower weight requests, and the propagation path of each request grows as its weight decreases. Therefore, the lowest weight request has the most complex logic and thus the worst propagation path. Fig. 1a shows all propagation paths under the priority policy. For high-speed design, the priority policy naturally causes an unbalanced delay, which is inefficient in synchronous clock systems. To improve the efficiency of priority encoders, current approaches adopt look-ahead signals to determine high weight requests in advance [4–7]. A look-ahead signal from a selected request with a higher priority disables lower priority requests. Although these approaches improve operating speed compared with the conventional design, they are still unable to balance all propagation paths, as shown in Fig. 1b. Therefore, as the number of input requests increases, the unbalanced delay becomes unacceptable for synchronous clock systems. As a result, the most effective way to improve performance is to balance all propagation paths under the priority policy.

Reviews processed and approved for publication by Editor-in-Chief Dr. Manu Malek.
* Corresponding author. Tel.: +886 7 5252000x4183; fax: +886 7 5254199.
1 Tel.: +886 7 5252000x4142; fax: +886 7 5254199.
E-mail addresses: [email protected] (J.-c. Chiu), [email protected] (K.-m. Yang).
0045-7906/$ - see front matter Crown Copyright © 2012 Published by Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compeleceng.2012.12.002


Fig. 1. Propagation paths of: (a) the native priority policy, (b) a look-ahead scheme, and (c) the balanced scheme. The native priority policy has an extremely unbalanced encoder signal delay, as shown in (a). Current approaches forward look-ahead signals to alleviate the unbalanced delay, as shown in (b). This study improves efficiency by integrating the multiplexer-based data selector and the priority policy to balance all propagation paths, as shown in (c).

Considering the multiplexer-based (Mux-based) data selector, its binary tree structure, in which propagation paths have similar lengths from leaf to root, has already been used in other designs [8,9]. In this work, a technique that balances all propagation paths is proposed, combining the priority encoder with the mux-based data selector to improve the performance of the otherwise inefficient priority selector, as shown in Fig. 1c. Furthermore, a significant performance improvement has been achieved by applying the same techniques, which verifies the proposed approach. This paper is organized as follows. Section 2 defines and explains a priority selector with Boolean equations in the background subsection and states the main problem of the priority encoder in Section 2.2; the design of current priority encoders with a look-ahead structure is then briefly reviewed. The proposed approach is described in Section 3. Section 4 presents the implemented VLSI design. Section 5 summarizes the simulation results and the analyses of speed, area, and power. Finally, conclusions are given in Section 6.

2. Background and previous works

2.1. Background

The data selector for the priority policy chooses the data of the asserted input request with the highest weight, giving it preference over the other asserted input requests; the inputs corresponding to lower weight positions are discarded. The priority selector can be expressed by the following Boolean equation:

Dout = PE0·D0 + PE1·D1 + ⋯ + PEm−1·Dm−1        (a)

where Di, i = 0, …, m−1, is the ith input data and the PEi are the priority encoder outputs. The operator '·' denotes logical AND. Di is the data corresponding to PEi. The PE corresponding to the winning request is logic 'high'; all the others are logic 'low'. Eq. (a) shows that, if two or more input requests are asserted simultaneously, the input with the highest priority takes precedence. Therefore, when two or more input signals send requests, Dout is assigned the input data of the chosen highest-priority request. The PEi for bit position i are expressed as

PE0 = M0
PE1 = M1·M̄0
PE2 = M2·M̄1·M̄0
⋮
PEm−1 = Mm−1·M̄m−2·⋯·M̄0

PEx = Mx · ∏ M̄y (y = 0 to x−1)        (b)

where Mi is the input request of the priority encoder for an m-bit case, and the least significant bit is assumed to have the highest priority. The ∏ represents the logical AND over the complemented Mi. If several requests (M) are asserted, only the PE of the highest-weight asserted request is set to logic '1'. For example, when the M3, M5 and M11 requests are asserted simultaneously, these inputs are logic 'high' and, according to Eq. (b), only PE3, which has the highest priority among them, is enabled. As shown in Eq. (a), Dout is then the data of the highest-weight request.2

2 In fact, since the data selector is usually implemented with tri-state buffers rather than OR gates, the priority selector is realized by the priority encoder and tri-state buffers. The PE results switch each tri-state buffer between high impedance and data bypass: the PE with the highest priority among all asserted requests opens its tri-state buffer to send data, whereas the others are set to high impedance. However, this paper is concerned with the PEs, and the difference between the tri-state buffer implementation and the Eq. (a) form does not affect the discussion.
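For illustration, the following is a minimal behavioral sketch of Eqs. (a) and (b) in Python (the function and variable names are illustrative only, not from the paper). It reproduces the example above, in which asserting M3, M5 and M11 enables only PE3.

```python
def priority_encode(m_req):
    """Eq. (b): PE_x = M_x AND (complement of every M_y with y < x); bit 0 has the highest priority."""
    pe, higher_all_clear = [], True
    for m_x in m_req:
        pe.append(1 if (m_x and higher_all_clear) else 0)
        higher_all_clear = higher_all_clear and not m_x
    return pe

def select_data(pe, data):
    """Eq. (a): Dout = PE0*D0 + PE1*D1 + ... + PE(m-1)*D(m-1) for a one-hot pe."""
    return next((d for p, d in zip(pe, data) if p), None)

# Example from the text: M3, M5 and M11 asserted -> only PE3 is high, so D3 is selected.
m = [0] * 12
m[3] = m[5] = m[11] = 1
pe = priority_encode(m)
assert pe[3] == 1 and sum(pe) == 1
assert select_data(pe, [f"D{i}" for i in range(12)]) == "D3"
```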


Fig. 2. Critical path of: (a) series connection, (b) multi-level design, and (c) binary-tree structure.

2.2. The problem of the priority encoder

Eq. (b) shows that the complexity of the PEs increases as the weight decreases. All PE propagation paths therefore have extremely unbalanced delays, which causes inefficient performance in synchronous clock systems (Fig. 1a). As a result, the critical path is on the order of O(n), because the gate delay of PEm−1, the lowest-priority output, grows with the number of requests. The authors of [8] also report that the operating speed of a long priority encoder is not fast enough when the number of requests exceeds 32 bits.

2.3. Previous work

Current approaches typically forward look-ahead signals to reduce the critical path delay. For example, [6] groups each 4-bit priority input into a macro block (MAC). Each MAC cascades a forwarding signal to a lower-priority MAC, which is then enabled or disabled by the look-ahead signal, allowing faster propagation of the priority status to the cells. The resulting propagation delay of the cascaded MACs is the series delay of the constituent four-bit blocks (Fig. 2a). A similar priority encoder with a series-connected structure was also proposed in [7]; in contrast to [6], it groups 8-bit priority encoders into a MAC and generates the look-ahead signals with dynamic OR gates. However, the series-connected structure is still inefficient in synchronous clock systems because of the huge difference between the critical path and the shortest propagation path. For finer scalability, a multi-level folding architecture that forwards more look-ahead signals to other MACs than [6] does was presented in [8]. Similarly, Maurya and Clark proposed a static circuit for the multilevel structure [5]. These approaches replace cascaded look-ahead signals with hierarchical forwarding signals, as shown in Fig. 2b. For N inputs, the critical path grows as O(log N), whereas the shortest path remains O(1). These methods are still unable to overcome the native feature of the priority policy, i.e., unbalanced propagation paths. As the difference between the shortest and longest paths increases, these approaches cause performance bottlenecks in synchronous clock systems. To balance all propagation paths, this study applies a binary-tree structure with the same length from leaf to root (Fig. 2c). The details of the proposed priority selector design are discussed in the next section.

3. Balanced propagation path for priority policy selector

This section describes the proposed priority selector, which integrates the mux-based data selector with the priority encoder. The mux-based data selector has a binary search tree structure. The main principle for balancing propagation paths is the continuous partition of the request set into high- and low-priority subsets (subtrees) by the proposed expressions. In other words, the proposed expressions determine whether the asserted request with the highest priority lies in the high-priority or the low-priority subset. Each subset can then be divided again by further expressions into two subsets, and by repeating these dividing operations the asserted request with the highest priority is found. Based on this principle, the expressions that determine the subsets can be implemented by multiplexers.
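As a minimal sketch of this partition principle, assuming the requests are held in a Python list with index 0 carrying the highest weight (the function name is illustrative), each decision halves the candidate range, so the winning request is located in O(log m) steps:

```python
def find_winner(m_req, lo=0, hi=None):
    """Return the index of the highest-priority (lowest-index) asserted request
    by repeatedly splitting the range into a high-priority half and a
    low-priority half and descending into the half that holds the winner."""
    if hi is None:
        hi = len(m_req)
    if hi - lo == 1:
        return lo
    mid = (lo + hi) // 2
    if any(m_req[lo:mid]):                 # L = 0: the high-priority half holds a request
        return find_winner(m_req, lo, mid)
    return find_winner(m_req, mid, hi)     # L = 1: search the low-priority half

# M7-M0 = 0110_0100 (listed here as M0..M7): the winner is M2.
assert find_winner([0, 0, 1, 0, 0, 1, 1, 0]) == 2
```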
In the binary tree structure of the mux-based data selector, the proposed expressions provide the control signals used to search for the qualified request. If the value of the expression is logic 'low', the higher-priority subtree is selected; if it is logic 'high', the lower-priority subtree is selected. Through these expressions, the proposed priority encoder scheme controls each multiplexer to select the set of requests containing the highest weight. The proposed design effectively balances all propagation paths, since each path has the same length in the binary-tree structure (Fig. 2c). Therefore, the balanced priority selector efficiently minimizes the critical path. The expressions that control each multiplexer are discussed below.

3.1. Proposed priority encoder scheme

The proposed scheme uses extracted equations (L) to distinguish two sets (subtrees). Each separated subtree is repeatedly split by these equations until the data of the highest-priority asserted request is selected. Fig. 3 shows an example for an 8-bit priority policy, where the truth table shows that L4 can identify the set with the higher priority (set (1)). Assume the input requests (M7–M0) are 01100100. L4 is low, which disables PE4–PE7 (set (2)); according to Eq. (a), D4–D7 are masked.

Fig. 3. An example of an 8-bit priority selector for the first level: (a) truth table for the priority encoder, giving L4 = M̄0·M̄1·M̄2·M̄3; (b) L4 selects the set with the higher priority (L4 = 0 chooses set (1), M0–M3; L4 = 1 chooses set (2), M4–M7).

Based on this concept, L4 can select the higher-priority data set by controlling multiplexers, as shown in Fig. 3b. This technique can be extended to the next level. Following this principle, the L expression can be extended as follows:

The priority encoder equations of Eq. (b) are partitioned into high- and low-priority subsets. The equation Li (M̄i·⋯·M̄0, with i = m/2) distinguishes the two subsets. When one of the requests in the high-priority set (M0–Mi) is asserted, Li selects the data set of the high-priority set (set (1)). In contrast, when all of M0–Mi are unasserted, Li is set to logic 'high' and selects the data set of the low-priority set (set (2)). As the equations of set (2) show, the common factor (M̄i·⋯·M̄0) in subset (2) can be replaced by Li. Besides, the equations in subset (1) can attach an additional L̄i without changing their results. According to the equations of sets (1) and (2), Li can be used to distinguish set (1) from set (2). In other words, Li indicates whether the asserted requests lie in the lower-priority set (Mm−1–Mi+1) or in the higher-priority set (Mi–M0). If Li is logic state 0, one of Mi–M0 owns the priority; otherwise, one of the Mm−1–Mi+1 requests is given priority when Li is logic state 1. The subset of corresponding data is selected as the output. Thus the multiplexer, with its selective function, implements this decision (i.e., Li differentiates the two subsets). Similarly, subsets (1) and (2) can also be divided into (1.1) and (1.2), and (2.1) and (2.2), as follows:

Subsets (1.1) and (1.2) are the twofold split of set (1), and they contain the same number of PE equations. The weights of all requests in subset (1.1) are higher than those of all requests in subset (1.2). The Lj expression classifies the two sets within set (1) and determines whether an input request comes from the lower-weight subset (1.2) or from the higher-weight subset (1.1).


Similarly, the equations in subset (1) can be restructured with Li and Lj, as shown in the above equations.

The scheme for subsets (2.1) and (2.2) is similar to that for set (1). As in the above equations, the only difference is that set (2) starts at position i; the subsets are distinguished by defining Lk as M̄k·⋯·M̄i+1. As in the above scheme, Lj distinguishes between subsets (1.1) and (1.2), and Lk distinguishes between (2.1) and (2.2). Fig. 4 shows the 8-bit example for Lj and Lk, where L2 and L6 play the roles of Lj and Lk, respectively: L2 distinguishes within the M0–M3 set and L6 distinguishes within the M4–M7 set. As with L4, these expressions can be used for the further analysis of priority; L2 and L6 drive the multiplexers that select a subset from set (1) and set (2). Repeating the above operations defines all L equations as the control signals of the multiplexers. For the example of an 8-bit priority policy, all PEs can be rewritten in terms of L equations as in Eq. (c). Meanwhile, placing the Li in the multiplexers according to the in-order traversal rule completes the eight-request priority selector, as shown in Fig. 5.

PE0 = L̄4·L̄2·L̄1        PE4 = L4·L̄6·L̄5
PE1 = L̄4·L̄2·L1        PE5 = L4·L̄6·L5
PE2 = L̄4·L2·L̄3        PE6 = L4·L6·L̄7
PE3 = L̄4·L2·L3        PE7 = L4·L6·L7        (c)

L4 = M̄3·M̄2·M̄1·M̄0,   L2 = M̄0·M̄1,   L6 = M̄4·M̄5,
L1 = M̄0,   L3 = M̄2,   L5 = M̄4,   L7 = M̄6

In the proposed design, each multiplexer in the last level selects one of two data inputs, keeping the one with the higher weight. The dotted box in Fig. 6 shows that, if M0 is asserted (L1 = 0), D0 is output; if M0 is not asserted, D1 is selected by the multiplexer. In the next level, the output of each multiplexer is the data of the highest-weight request among four requests, and in the level after that, among all eight. The data output at the first level is therefore the data of the highest-weight asserted request. Fig. 5 shows the priority selector for eight requests, completed with multiplexers controlled by the L expressions. Assume the requests (M7–M0) are 01100100, as in Fig. 5. Of the two requests at each MUX in level 3, the one with the higher weight is given priority, and the corresponding data D is selected and sent to the next level. After the data from level 3 are passed on, each multiplexer in level 2 selects the data with the higher weight of its two inputs. Finally, according to L4, the data of the highest-weight request is passed to the output. The next section shows why the L expressions at the gate level are not on the critical path of the mux-based selector design.
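A behavioral Python sketch of the tree in Fig. 5 may help here (the helper names and the string data values are illustrative, not from the paper): the L signals of Eq. (c) drive the three multiplexer levels, and the example input 0110_0100 from the text selects D2.

```python
def mux(sel, d0, d1):
    """2-to-1 multiplexer: output d0 when sel = 0, d1 when sel = 1."""
    return d1 if sel else d0

def priority_select8(m, data):
    """Eight-request priority selector built from the L expressions of Eq. (c)."""
    L1, L3, L5, L7 = (int(not m[i]) for i in (0, 2, 4, 6))
    L2 = int(not (m[0] or m[1]))
    L6 = int(not (m[4] or m[5]))
    L4 = int(not any(m[0:4]))
    # Level 3: each MUX keeps the higher-weight input of its pair.
    a = mux(L1, data[0], data[1]); b = mux(L3, data[2], data[3])
    c = mux(L5, data[4], data[5]); d = mux(L7, data[6], data[7])
    # Level 2: pick the winning pair inside each half.
    high_half = mux(L2, a, b)
    low_half = mux(L6, c, d)
    # Level 1: L4 chooses between the high- and low-priority halves.
    return mux(L4, high_half, low_half)

# Example from the text: M7-M0 = 0110_0100 (listed as M0..M7) selects D2.
m = [0, 0, 1, 0, 0, 1, 1, 0]
assert priority_select8(m, [f"D{i}" for i in range(8)]) == "D2"
```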

Fig. 4. An example of the 8-bit priority selector for the second level: L2 = M̄0·M̄1 distinguishes subsets (1.1) and (1.2) within set (1), and L6 = M̄4·M̄5 distinguishes subsets (2.1) and (2.2) within set (2).

Fig. 5. Implementation of the eight-request priority selector (example input M7–M0 = 0110_0100).

Fig. 6. Priority selector with L expressions for m requests.

Fig. 6 shows the complete m-request priority selector with L expressions. Consider the multiplexer MUXj, which is controlled by Lj. Assertion of any one of the M0–Mj requests, which belong to corresponding positions in subset (1.1), ensures that data from this subset are selected by Lj. If none of the requests M0–Mj is asserted, the output data come from the highest-weight request in subset (1.2). Similarly, consider the multiplexer MUXk, which is controlled by Lk.

Li Boolean Equation Algorithm
Input: m: maximum available number of priority input requests; start: initial position of the current range of requests; d: number of requests in the current range; i: identification of the current Li_Eq call.
Output: Ln: the nth Boolean equation that distributes the priority order, n = 1 to m−1.
Initial: start = 0; d = i = m/2.
1: Procedure Li_Eq(i, start, d):
2: Begin
3:   If (d < 1)
4:     End
5:   else
6:     Li = ∏ M̄x (x = start to start+d−1)
7:     Li_Eq(i − d/2, start, d/2);      // recur into the high-priority subset
8:     Li_Eq(i + d/2, start + d, d/2);  // recur into the low-priority subset
9: End

Fig. 7. Li Boolean Equation Algorithm.


Assertion of any one of the Mi+1–Mk requests, which belong to corresponding positions in subset (2.1), ensures that the data come from Di+1–Dk and are selected by Lk. If none of the requests in subset (2.1) is asserted, the data come from Mk+1–Mm−1, with corresponding positions in subset (2.2). In the first level, Li determines whether the data output comes from the Lj side or the Lk side. All L expressions can be defined by recursion (Fig. 7); the recursive algorithm is described in Section 3.3 below. These expressions provide the control signals of all multiplexers according to an in-order traversal of the binary tree.

3.2. Analytical latency

For a fair comparison, consider the delay D of a 2-input logic gate, and call the delay of one multiplexer MD (Multiplexer Delay). Assume the number of requests is m and that a propagation path from the last level to the output passes through p = log2 m levels. The largest L expression contains 2^(p−1) terms, which are NANDed and realized as a tree of two-input NAND gates, so its delay at the first level is D·(p−1). The delay from a primary input to the output of the root multiplexer, however, is MD·p. In the example of Fig. 3, if D0 is passed on to the first level, the selector delay is 4D, whereas the L4 Boolean equation expends only 2D. At level x, the data-path delay MD·x is always longer than that of its L expression, D·(x−1), since MD > D and x > x−1. That is, the L expressions are not on the critical path of the mux-based selector design, and the proposed priority selector does not add delay to the priority function even when the priority width is extended. Accordingly, the critical path of the proposed priority selector scheme has only MD·log2 n delay along the multiplexer path for n requests. The multiplexer itself can, of course, be improved by many techniques; if MD were reduced below the delay of the L expressions, the NAND-gate tree would become the worst-case latency.

3.3. A novel expression generation algorithm

This section describes how all L expressions are generated by a recursive process. The proposed algorithm resembles a binary search: if L is low, the higher-priority subset is searched; if L is high, the lower-priority subset is searched. This process is repeated until the highest-priority data are found. Fig. 7 shows the recursive algorithm that generates all L expressions; the function name is Li_Eq. The variable start in Fig. 7 denotes the initial position of the current range, while d and i indicate the number of NANDed terms in the subrange and the position of the L Boolean equation, respectively. The current Li_Eq call generates a Boolean equation Li that distinguishes the higher- and lower-weight subsets, as stated in Section 3.1. After completing the current L expression, the parameters (start, d and i) are reassigned to the higher- and lower-weight subsets (lines 7–8 in Fig. 7). In the higher-weight subset, Li_Eq recurs as Li_Eq(i − d/2, start, d/2) until d < 1; the lower-weight subset recurs as Li_Eq(i + d/2, start + d, d/2) until d < 1. All Li can be expressed by this algorithm. Finally, the proposed mux-based priority selector is completed by placing the Li expressions, in order of i, at the in-order traversal positions.
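A Python transcription of the recursive generator in Fig. 7 follows as a sketch (the dictionary-of-strings representation of each Li is illustrative only). Called with start = 0 and d = i = m/2, it emits every Li as the AND of complemented requests over its subrange; for m = 8 the output matches the L expressions of Eq. (c).

```python
def li_equations(m):
    """Generate all L_i expressions of the proposed scheme for m requests (m a power of two)."""
    eqs = {}

    def li_eq(i, start, d):
        if d < 1:                                   # line 3 of Fig. 7: terminate
            return
        # line 6: L_i = ~M_start & ~M_(start+1) & ... & ~M_(start+d-1)
        eqs[i] = " & ".join(f"~M{x}" for x in range(start, start + d))
        li_eq(i - d // 2, start, d // 2)            # line 7: high-priority subset
        li_eq(i + d // 2, start + d, d // 2)        # line 8: low-priority subset

    li_eq(m // 2, 0, m // 2)
    return eqs

# For m = 8 this reproduces Eq. (c):
# {4: '~M0 & ~M1 & ~M2 & ~M3', 2: '~M0 & ~M1', 1: '~M0', 3: '~M2',
#  6: '~M4 & ~M5', 5: '~M4', 7: '~M6'}
print(li_equations(8))
```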
4. Implemented VLSI design

This section describes the design optimized with dynamic circuits. Considering performance and extensibility, dynamic NOR gates in a parallel structure are used to complete the L expressions; they are easily scaled up to avoid excessive latency as the number of inputs is extended. On the multiplexer side, a design similar to wave pipelining [11,12] is used to solve the internal race problem between dynamic MUXes. Consider the asynchronous pipeline design in single-rail style with matched delay [13,14]: in this style, the evaluation phase of a dynamic MUX is postponed by a delayed precharge phase until all of its inputs stabilize. In the priority selector, the precharge phase is extended so that the evaluation phase of each level begins with a delay after the previous level; connecting the delayed precharge of all MUXes ensures that the data are passed on correctly. The remainder of this section is organized into two subsections. The first describes the transistor-level design realized for the proposed priority selector. The second discusses how a faster circuit is achieved by using a DP-MUX (Delayed-Precharge Multiplexer) instead of conventional domino CMOS logic.

4.1. Design methodology for priority selectors using delayed precharge

The proposed design includes two components: (1) NOR gates for the L expressions and (2) MUXes for the select function. This section describes how these components are implemented at the transistor level. The L expressions, all of which are ANDs of inverted requests, are realized by NOR gates for scalable evaluation, as shown in Fig. 8a. With the parallel structure of a dynamic NOR gate, the number of inputs is easily scaled up. At the transistor level, realizing the L expressions with NOR gates still keeps them off the critical path.
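As a rough illustration of the delayed-precharge handshake described above, the following Python sketch models only the timing relation (all delay values are illustrative placeholders, not measured figures): each level's evaluation phase is allowed to start only after the previous level's matched delay has elapsed, so the data sampled by a level are guaranteed to be stable.

```python
# Illustrative timing model of the delayed-precharge chain (arbitrary time units).
T_EVAL = 1.0    # assumed worst-case evaluation time of one DP-MUX level
T_MATCH = 1.1   # matched delay, chosen slightly above T_EVAL
assert T_MATCH > T_EVAL  # the matched delay must cover the MUX worst case

def evaluation_start_times(levels):
    """Start time of the evaluation phase at each multiplexer level: level k
    may evaluate only after level k-1 has released its delayed precharge
    signal (PreCo), i.e. after the previous matched delay has elapsed."""
    starts = [0.0]
    for _ in range(1, levels):
        starts.append(starts[-1] + T_MATCH)
    return starts

# For an 8-request selector (log2 8 = 3 levels) the phases start at 0.0, 1.1 and 2.2.
print(evaluation_start_times(3))
```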


Fig. 8. (a) For scalability, this study realizes the L expressions with dynamic NOR gates; the parallel structure scales up the number of requests. (b) Multiplexer in dynamic logic style: before selecting data, it must be precharged by PreCi. The gray block indicates the delay circuit used to extend the precharge phase until all signals of the multiplexer are stable.

The other component, the DP-MUX (Delayed-Precharge Multiplexer), consists of a dynamic multiplexer and a delay circuit. This delay circuit, called a matched delay, is set equal to the worst-case delay of the corresponding DP-MUX in Fig. 8b [13]. In practice, a matched delay can be implemented in several ways. One technique cascades inverters or a chain of transmission gates tuned to a similar delay. Another duplicates the worst-case path of the function block as the total delay, which provides a more accurate matched delay. During the evaluation phase, PreCi turns on transistor Tnchr and shuts off transistor Tpchr. If Sel is logic '0', the output of the DP-MUX is D0; when Sel is logic '1', the output is D1. In the precharge phase, the MUX output is charged to logic 1, and PreCo is driven to logic 0 through the matched delay. As the precharge phase transfers to the evaluation phase, the delayed-precharge circuit holds back the evaluation-phase signal long enough to cover the critical path of the DP-MUX. A delayed precharge signal must ensure that all inputs have been processed and that all outputs are ready for the evaluation phase.

Fig. 9. (a) Traditional domino logic cascades a static inverter at every output; the static inverter not only increases the latency on the critical path but also raises the load capacitance at every stage. (b) Using a delayed precharge completes the select function; every stage drives fewer MOSs and incurs no extra overhead on the longest propagation path.


The traditional domino logic design cascades a static inverter to solve the race condition. The matched-delay circuitry instead solves the race problem by suspending the precharge signal PreCi until all outputs are stable. Because of the matched delay, the worst-case latency of the DP-MUX guarantees that the MUX output is stable before the delayed precharge signal (PreCo) propagates to the next level.

4.2. Propagation delay

Using delayed precharge to avoid the race problem has two main advantages. First, each stage drives fewer MOSs than domino logic. Fig. 8b shows that the DP-MUX output is connected to a single input of the next stage, so one MUX output drives only one nMOS, whereas a domino logic output, which cascades static logic, must drive at least two MOSs. The DP-MUX therefore has a lower input capacitance than domino logic. Fig. 9 shows the second advantage, where the dotted line denotes the critical propagation path and tinv and tEVLt indicate the delay of a static inverter and the evaluation phase, respectively. Domino logic spends tinv + tEVLt to pass data through every stage, so for an n-request (log2 n level) priority selector this conventional style has a (tinv + tEVLt)·log2 n delay on the critical path, as shown in Fig. 9a. The DP-MUX has a delay of only tEVLd·log2 n, where tEVLd is the delayed-precharge time, as shown in Fig. 9b; advantage (1) ensures that tEVLd is lower than tEVLt. Because of these features, the DP-MUX is faster than domino logic for optimizing propagation time in a priority selector.

5. Experimental results

This section divides the verification into two parts (gate level with a cell-based flow and transistor level with a full-custom flow): (1) the performance of the proposed balanced-propagation-path scheme at the gate level, and (2) the use of a delayed-precharge multiplexer to optimize intrinsic delay at the transistor level. First, two priority selectors based on a TSMC 0.18 μm standard-cell library were designed and compared with the Synopsys Design Compiler: (1) the proposed design, which uses the L expressions to balance the propagation paths, in the 'Proposed' row of Table 1, and (2) a priority selector described behaviorally in RTL, in the DBPS row (Described Behavior of Priority Selector with RTL). For an accurate comparison, the same drive, load, and other constraints were used for the DBPS and the proposed design, and the synthesized circuits were tuned for maximum speed regardless of area. To compare the proposed design with the DBPS in terms of speed and scalability, priority selectors with 8–256 requests were completed, as shown in Table 1. The speed of the balanced design is better than that of the DBPS circuits under the same constraints, and the balanced design is also superior in total area. Specifically, as the number of requests increases, the design gains further in both speed and area. The experiments confirm that the balanced propagation path technique is suitable for large numbers of requests. Second, the optimization results simulated at the transistor level are reported in terms of speed. The following designs were laid out in a full-custom design flow in a 0.18 μm TSMC process and simulated with HSPICE, with extracted wire and layout parasitic capacitances at the typical corner. The HSPICE simulator was run on the Ubuntu operating system.
The simulations use the same random input patterns for the different designs, and the same W/L ratios, wire widths, and load capacitances were used for a fair comparison. The post-layout simulation in Fig. 10 shows that the critical path lies on the multiplexers, not on the NOR gates of the L expressions. DPx denotes the delayed-precharge time of the evaluation phase at level x, and the control signal for multiplexer selection, realized by a NOR gate, is labeled NORy. For example, the period from the beginning of the precharge phase to the evaluation phase in level 2 is marked DP2, and the control signal at the same level is called NOR2. In the first level, the control signals are connected directly to the input requests, because those multiplexers are controlled by a single inverted request and need no NOR gate; DP1 is therefore clearly the shortest, so only the other levels are considered in this discussion. Fig. 10 shows the delayed precharge signals DP2–DP7 for the 128-request case. The simulation results show that, when DP7 rises to 0.9 Vcc (the start of its evaluation phase), NOR64 is already at logic '0' (defined as 0.1 Vcc) in the same level. Even in the transistor-level design, the control signals add no overhead on the longest propagation path.

Table 1. Pre-layout comparison with the synthesized priority selector.

Requests                              8        16       32       64       128      256
Critical path delay (ns), Proposed    0.68     0.87     0.99     1.14     1.25     1.41
Critical path delay (ns), DBPS        0.83     1.04     1.22     1.43     1.66     1.98
Speedup                               1.221    1.195    1.232    1.254    1.328    1.40
Area (μm²), Proposed                  1089.6   2179     3775     8480     16883    29346
Area (μm²), DBPS                      3545.1   6917     16223    34273    66804    110676
Area saving (%)                       325.3    317.4    429.7    404.1    395.6    377.1


Fig. 10. Propagation delay of the L expressions and select function for a 128-request priority selector (delayed precharge signals DP2–DP7 and control signals NOR2–NOR64; thresholds at 0.9 Vcc and 0.1 Vcc).

Fig. 11. Difference in intrinsic delay (ps) between the shortest and longest propagation paths for the series, folding, static, D-balance, and T-balance designs.

Fig. 11 shows the range of propagation delays between the fastest and slowest paths. The critical path of the series-type design increases by approximately O(n) as the number of requests increases. Although the folding type retards the growth of the longest path, it still shows a 9-fold difference between the fastest and slowest delays. The proposed balanced design was implemented with two kinds of multiplexers: the first, called D-balance, uses the dynamic logic design of Fig. 9b; in the second, called T-balance, a transmission gate cascading an inverter to drive the next stage replaces the dynamic multiplexer. The simulation results show that the D-balance and T-balance designs obtain similar maximum and minimum delays and similar improvements in speed. The transistor count is highest in the folding design. Fig. 12 compares the power consumption observed in the post-layout simulations. Compared to the other designs, the D-balance design obtained better throughput but consumed more power. Considering the trade-off between power and speed, the T-balance design, which has no charged load capacitance in the precharge phase, is more energy efficient than the D-balance design.

Fig. 12. Power consumption (nW) versus the number of input requests.

Fig. 13. Transistor count versus the number of input requests.

Because the static implementation limits the activity factor of the precharge circuits, its power savings are superior. In terms of transistor count, the folding design has a high area cost (Fig. 13), because it must forward more look-ahead signals to more high-priority components; the proposed T-balance design is the most efficient in terms of area. Taking the series design as the baseline, the 128-bit priority selectors compare as follows in speed and power savings. The folding design improves speed by about 2.07 times and reduces power dissipation by 49.53%. The D-balance design achieves a 5.77× speedup but raises power consumption by 8.5%. Although the static implementation shows an 80% reduction in power consumption, it achieves only a 1.868× speed improvement for the 128-bit priority selector. Considering power and performance together, the T-balance design provides not only a 4.5× speed improvement but also a 57.2% reduction in power consumption, and its transistor count is lower.

6. Conclusion

In this paper, we have proposed a priority selector that integrates a priority encoder with the mux-based data selector, using Boolean equations to control each multiplexer. With the proposed Boolean equations, all propagation paths from the highest to the lowest weight have similar lengths in the mux-based selector. As a result, the inefficiency of the traditional priority policy in synchronous clock systems, namely the considerable difference in delay between the fastest and slowest paths, is removed. This study has demonstrated the proposed designs through both cell-based and full-custom flows. At the gate level, the balanced design is superior in total area, and as the number of requests increases, the proposed designs stand out in both speed improvement and area reduction. Besides, the balanced propagation path technique is suitable for large numbers of requests. For the new high-speed and low-power design, the structure is realized efficiently at the transistor level by the transmission-gate design, called T-balance. In terms of scalability, the presented design does not add extra latency on the multiplexer path for the selection function. For the 128-bit priority selector, the new technique, which enhances performance in synchronous clock systems, is about 4.5 times faster and reduces power consumption by 57.2%.


Acknowledgments

The authors would like to thank the reviewers for their many constructive comments and suggestions for improving this paper. We also thank National Sun Yat-Sen University and the Aim for the Top University Project under Grant 01C030710 for their support.

References

[1] Adamides ED, Lliades P, Argyrakis I, Tsalides P, Thanailakis A. Cellular logic bus arbitration. IEE Proc Part E, Comput Dig Tech 1993;140(6):289–96.
[2] Kadota H, Miyake J, Nishimichi Y, Kudoh H, Kagawa K. An 8-kb content-addressable and reentrant memory. IEEE J Solid-State Circ 1985;SC-20:951–7.
[3] Chiu JC, Yang KM. Novel instruction stream buffer for VLIW architectures. Comput Electr Eng 2010;36(1):190–8.
[4] Kumar VC, Phaneendra PS, Ahmed SE, Sreehari V, Muthukrishnan NM, Srinivas MB. A reconfigurable INC/DEC/2's complement/priority encoder circuit with improved decision block. In: Proc Int Symp Electr Syst Des; December 2011. p. 100–05.
[5] Maurya SK, Clark LT. Fast and scalable priority encoding using static CMOS. In: Proc IEEE Int Symp Circ Syst; May 2010. p. 433–36.
[6] Delgado-Frias JG, Nyathi J. A high-performance encoder with priority lookahead. IEEE Trans Circ Syst I 2000;47:1390–3.
[7] Wang J-S, Huang C-H. High-speed and low-power CMOS priority encoders. IEEE J Solid-State Circ 2000;35:1511–4.
[8] Huang C-H, Wang J-S, Huang Y-C. Design of high-performance CMOS priority encoders and incrementer/decrementers using multilevel lookahead and multilevel folding techniques. IEEE J Solid-State Circ 2002;37:63–76.
[9] Maurya SK. A dynamic longest prefix matching content addressable memory for IP routing. IEEE Trans Very Large Scale Integr (VLSI) Syst 2011;19:963–72.
[11] Wong D, De Micheli G, Flynn M. Designing high-performance digital circuits using wave-pipelining. IEEE Trans Comput-Aided Des Integr Circ Syst 1993;12(1):24–46.
[12] Liu W, Gray CT, Fan D, Farlow WJ, Hughes TA, Cavin RK. A 250-MHz wave pipelined adder in 2-μm CMOS. IEEE J Solid-State Circ 1994;29(9):1117–28.
[13] Singh M, Nowick SM. The design of high-performance dynamic asynchronous pipelines: lookahead style. IEEE Trans Very Large Scale Integr (VLSI) Syst 2007;15:1256–69.
[14] Yee G, Sechen C. Clock-delayed domino for adder and combinational logic design. In: Proc IEEE Int Conf Comput Des; October 1996. p. 332–37.

Jih-Ching Chiu was born in Pingtung, Taiwan. He received the B.S. and M.S. degrees in electrical engineering from National Sun Yat-Sen University and National Cheng-Kung University, Taiwan, and the Ph.D. degree in Computer Science and Information Engineering from National Chiao Tung University, Taiwan. In 1989 he joined National Sun Yat-sen University, Taiwan. His research interests are in the areas of ILP CPU design, computer system integration, embedded system design, new-generation processor design, and reconfigurable computing.

Kai-Ming Yang was born on June 19, 1982 in Taiwan. He received the M.S. degree in electrical engineering from National Sun Yat-Sen University in 2004 and 2006. Currently, he is pursuing the Ph.D. degree in electrical engineering at National Sun Yat-Sen University. His current research interests include computer architecture, parallel processing, and circuit design.