A closed-loop ASIC design approach based on logical effort theory and artificial neural networks


Integration, the VLSI Journal 69 (2019) 10–22 Contents lists available at ScienceDirect Integration, the VLSI Journal journal homepage: www.elsevier...



Integration, the VLSI Journal journal homepage: www.elsevier.com/locate/vlsi

Kunwar Singh ∗, Satish Chandra Tiwari, Maneesha Gupta

VLSI Design Laboratory, Room No. 18, Department of ECE, Netaji Subhas University of Technology, Sector-3, Dwarka, New Delhi, 110078, India

ARTICLE INFO

Keywords: Digital VLSI; Logical effort theory; Artificial neural networks; Logic synthesis; Configurable standard cell; ASIC design

ABSTRACT

Standard cell libraries are the backbone of the modern application-specific integrated circuit (ASIC) design flow provided by electronic design automation (EDA) vendors worldwide. In these libraries, standard cells are generally available at discrete drive strengths, with a higher drive strength indicating a faster version of a cell that implements some predefined logic function. However, a higher drive strength also increases area and power consumption compared with a slower, lower drive strength cell, which reflects the underlying tradeoff between speed, area and power. A standard cell with a discrete drive strength is not always what logic synthesis requires, and the non-availability of standard cells with fractional drive strengths in these libraries significantly impacts the overall performance of the resulting digital integrated circuits (ICs) and systems [1]. In this paper, a novel technique is introduced for on-demand generation and inclusion of standard cells in the logic synthesis process. It makes a continuous spectrum of drive strengths available and ultimately provides a platform for a closed-loop ASIC design flow. Logical effort (LE) theory is used alongside artificial neural networks (ANNs) to implement the proposed methodology. Extensive circuit simulations have been performed using HSPICE in a 130 nm/1.2 V CMOS process technology. Preliminary results are encouraging, with up to 41.7% and 62.8% reduction in power-delay product (PDP) and power-delay-area product (PDAP), respectively, for a 5-stage gate-level test circuit. An 8-bit counter realized using the proposed methodology shows up to 29.77% reduction in power dissipation and 37.8% savings in area at varying capacitive loads when compared to the conventional logic synthesis technique, which employs a library of standard cells with discrete drive strengths. The proposed approach is a general technique that can be readily mapped to high-complexity circuits and advanced technology nodes. To support this claim, simulation results are also provided at the 90 nm/1 V CMOS process node for the 5-stage test circuit.

1. Introduction

It is well established that the number of standard cells on current-generation CMOS chips exceeds 10 million. Moreover, as CMOS design rules continue to shrink, the number of on-chip standard cells in next-generation chips will increase manifold. Although device size and power supply scaling reduce the average power dissipation per standard cell, the sharp increase in the total number of on-chip standard cells is causing the total power consumption of digital systems to increase drastically. This large power increase has become the limiting factor in determining the maximum amount of logic functionality that can be realized on a single CMOS chip. Thus, in order to implement more complex CMOS chips, the total power dissipation due to all standard cells must be reduced to an absolute minimum.

At present, an ASIC design is developed by writing high-level code in a hardware description language (HDL) such as Verilog or VHDL (Very High Speed Integrated Circuits Hardware Description Language). In general, when HDL code is synthesized using any standard synthesis tool, a final RTL netlist is generated. The standard cell netlist produced by present-day synthesis tools must adhere to three strict criteria [2]:

1. The netlist must implement the desired logic functionality.
2. It must generally meet the system timing constraints.
3. It must occupy minimal standard cell area.

∗ Corresponding author. E-mail address: [email protected] (K. Singh). https://doi.org/10.1016/j.vlsi.2019.07.006 Received 10 July 2016; Received in revised form 9 December 2017; Accepted 27 July 2019 Available online 31 July 2019 0167-9260/© 2019 Elsevier B.V. All rights reserved.


If the obtained RTL netlist violates any of the above-mentioned constraints, the designer can further improve the HDL code; if the violations persist, full custom design is needed for some specific part of, or the entire, RTL netlist. The process might involve several iterations, but designing at the full custom level gives the designer the freedom to implement standard cells of specific drive strengths, which may be fractional as well (e.g., a standard cell of drive strength 1.56×). This methodology substantially increases time to market, eliminating the very essence of ASIC design. To limit this problem to some extent, standard cell libraries are continuously being loaded with more standard cells of different drive strengths. However, designing standard cells for every drive strength is not possible, and adding a large number of standard cells also stresses the library design process, increasing both computational effort and time to market.

Modern logic synthesis tools utilize standard cell libraries for ASIC design. However, the current generation of standard cell libraries is composed of cells that are static in terms of drive strength. As a result, the synthesis tool has limited options to satisfy the timing constraints, which leads to an overhead in terms of power dissipation and silicon area [1]. In addition, the speed of ASICs lags by a factor of six to eight when compared to the fastest custom circuits in the same CMOS process [3], seriously degrading the performance of ASIC designs relative to custom designs.

ANN modeling techniques have been widely used for solving problems in electronics engineering. An ANN has been used for determining the channel length and width of a MOS transistor for a specific drain current in [4]. A multi-layer perceptron ANN has been utilized to model the signal and noise behaviour of microwave transistors by Gunes et al. [5]. ANNs have been utilized for modeling an on-chip spiral inductor [6]. ANNs have also been used in Ref. [7] for the design of CMOS circuits in the nanometer regime, and they have been adopted for automated synthesis of operational amplifiers by Wolfe and Vemuri [8]. Kahraman and Yildirim provided an insight into the design of technology-independent CMOS circuits for both digital and analog domains [9,10]. Jafari et al. employed an ANN for optimum transistor sizing of a three-stage CMOS operational amplifier under a fixed set of constraints in [11]. However, none of the previously reported methodologies for circuit optimization in the digital VLSI domain using ANNs provides a technique to curtail the design space and hence generate a finite number of datasets, which increases their computational effort and leads to sub-optimal results. A solution to this problem is reported in this paper, which ultimately leads to the formulation of a closed-loop ASIC design flow.

In this work, the central idea is to make the drive strength of the standard cell itself a variable, so that the synthesis tool can pick standard cells that adjust their drive strengths dynamically (on the fly) to satisfy the timing, area and power constraints. Hence, a novel methodology is presented to realize standard cells with dynamic (variable) drive strengths based on LE theory and ANNs. LE theory is used to limit the design space for optimization, so that finite datasets are generated using SPICE simulations in terms of input capacitance Cin (which is related to the transistor sizes through LE theory), delay, power, area, etc. at a target load capacitance CL for a logic cell/block. The ANN, in turn, uses this data for training and subsequently predicts the appropriate transistor sizes for that logic cell/block based on the target specifications.
The proposed technique offers a wide range of advantages:

(i) The need for loading the library with a huge number of standard cells of different drive strengths is eliminated.
(ii) Standard cells are generated on the fly during synthesis and can even have fractional drive strengths.
(iii) The technology tends to fill the wide gap between ASIC and full custom design approaches, which is a critical issue with current-generation CAD tools.
(iv) Utilizing ANNs to predict the set of transistor widths for a targeted speed, power and/or area, based on data extracted from actual layouts, is more accurate than mathematically modeling these parameters based on LE theory alone. These parameters show significant deviations of up to 15–20% even with the best modeling approaches available in the literature, especially below 100 nm, where leakage power and interconnect delays have a major impact on the operating characteristics and optimization of circuits.
(v) The overall notion is to create an ASIC design flow which is on par with full custom design methodology, is fully automated, and offers minimum computational requirements and the least time to market.

The rest of the paper is structured as follows. Section 2 elaborates the simulation parameters and the posynomial functions which are utilized to determine the optimum gate delays using convex optimization. Section 3 describes the conventional open-loop ASIC design flow and its associated limitations. It also highlights the methodology used for creating standard cells with dynamic drive strengths. Section 4 demonstrates the proposed closed-loop ASIC design flow. Section 5 presents the simulation results obtained by comparing the traditional and proposed synthesis methodologies on a 5-stage arbitrary gate-level circuit and an eight-bit asynchronous counter. Conclusions are drawn in Section 6.

2. Simulation parameters

HSPICE has been used as the simulation platform at 130 nm/1.2 V and 90 nm/1 V using Predictive Technology Model (PTM) technology files [12]. The signal slope for data signals is limited to 100 ps. The area is estimated by multiplying the sum of the widths of all the MOSFETs in the circuit by the channel length. PDP and PDAP have been used as the Figures of Merit (FOMs). Tool command language (Tcl) scripts have been utilized to extract parameters such as power, delay, etc. from the output files. The absolute gate capacitance (CGATE_fF) in fF (femtofarads) and the absolute transistor width (W) in nm (nanometers) at the 130 nm process node are related by the following expression:

$$C_{\mathrm{GATE\_fF}} = (1.56 \times 10^{-3})\, W \tag{1}$$
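As a quick illustration of Eq. (1) and of the figures of merit defined above, the following Python sketch converts between transistor width and gate capacitance and evaluates area, PDP and PDAP. The function names are illustrative, and taking the channel length equal to the 130 nm node name is an assumption made here, not a value stated in the paper.

```python
# Fitted constant from Eq. (1): gate capacitance in fF per nm of width (130 nm node).
FF_PER_NM = 1.56e-3

def gate_cap_fF(width_nm: float) -> float:
    """Eq. (1): absolute gate capacitance (fF) for a transistor width in nm."""
    return FF_PER_NM * width_nm

def width_nm_for(cap_fF: float) -> float:
    """Inverse of Eq. (1): width (nm) that presents the given gate capacitance."""
    return cap_fF / FF_PER_NM

def area_um2(widths_nm, length_nm: float = 130.0) -> float:
    """Area estimate: (sum of all MOSFET widths) x channel length.
    length_nm = 130 is an assumed drawn length; nm^2 is converted to um^2."""
    return sum(widths_nm) * length_nm * 1e-6

def pdp_fJ(power_uW: float, delay_ps: float) -> float:
    """Power-delay product in fJ (1 uW x 1 ps = 1e-3 fJ)."""
    return power_uW * delay_ps * 1e-3

def pdap(power_uW: float, delay_ps: float, area: float) -> float:
    """Power-delay-area product in fJ.um^2."""
    return pdp_fJ(power_uW, delay_ps) * area

if __name__ == "__main__":
    print(gate_cap_fF(1000.0))         # capacitance of a 1 um wide gate, in fF
    print(width_nm_for(9.72))          # width equivalent to the 8X load of 9.72 fF
    print(area_um2([1040.0, 2500.0]))  # area estimate for two example widths
```

The helpers only restate the definitions in the text; measured PDP values in the result tables come from simulation and need not coincide exactly with the product of the rounded delay and power entries.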

The relation in Eq. (1) is obtained by fitting simulation data as explained in [13,14]. Capacitive loads of 8X, 16X and 24X have been used for simulation in this work, which correspond to 9.72 fF, 19.44 fF and 29.16 fF respectively.

2.1. Representation of gate delays as posynomial functions

The latest arrival time at the output of a circuit can be minimized if the nth gate is characterized by its LE (gn), parasitic delay component (pn), and drive strength (xn). The arrival time equations are represented using a special class of functions called posynomials, which become convex under a logarithmic change of variables and therefore provide a single optimum [15–17]. The delay equations for each gate in a circuit are obtained using the following relationships [13]:

$$d = C_{out}/x + p \tag{2}$$

$$x = C_{in}/g \tag{3}$$

where Cout and Cin represent the output and input capacitance of the logic gate. These equations can be modeled in the form of an optimization problem in order to minimize the arrival time at the output of the circuit under consideration:

$$\text{minimize} \quad z_0(s) \tag{4}$$

$$\text{subject to} \quad z_i(s) \le 1,\ i = 1, \ldots, m, \qquad w_i(s) = 1,\ i = 1, \ldots, m \tag{5}$$

where zi are posynomial functions, wi are monomials, and s = (s1, …, sn) is a vector of optimization variables with components si and the implicit constraint that the variables are positive, i.e., si > 0 [18]. In this work, the quantity to be minimized is the arrival time of the signal at the output load of the test circuit, which is a function of the drive strengths of the logic gates in the input-to-output path.
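To make the formulation in Eqs. (4) and (5) concrete, the sketch below minimizes a small posynomial delay objective for a chain of three inverters with the first drive strength fixed at 1. The authors solve their problem with CVX; this example does not. It instead uses coordinate-wise minimization, which reaches the global optimum here because the objective is convex in the variables log x2 and log x3. The circuit, the function name and the load value are assumptions chosen for illustration, and delays are in normalized units.

```python
import math

def size_inverter_chain(c_load: float, p: float = 1.0, sweeps: int = 60):
    """Minimize D(x2, x3) = x2/x1 + x3/x2 + c_load/x3 + 3p with x1 = 1.

    Each term follows d = Cout/x + p with g = 1 (inverters), so D is a posynomial.
    For fixed x3 the best x2 is sqrt(x1*x3); for fixed x2 the best x3 is
    sqrt(x2*c_load). Alternating these two updates is coordinate descent.
    """
    x1, x2, x3 = 1.0, 1.0, 1.0
    for _ in range(sweeps):
        x2 = math.sqrt(x1 * x3)
        x3 = math.sqrt(x2 * c_load)
    delay = x2 / x1 + x3 / x2 + c_load / x3 + 3 * p
    return x2, x3, delay

if __name__ == "__main__":
    x2, x3, delay = size_inverter_chain(c_load=64.0)
    # At the optimum the stage efforts x2/x1, x3/x2 and c_load/x3 are equal.
    print(x2, x3, delay)
```

For larger circuits with branching and mixed gate types, a general-purpose geometric programming solver is the more practical choice; the closed-form updates above exist only because this toy objective is so small.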


3. Methodology to create standard cells with dynamic drive strengths

It is well known that the conventional ASIC design methodology depends on a standard cell library whose cells have fixed drive strengths in order to satisfy design constraints during logic synthesis. Hence, two major shortcomings of the traditional ASIC flow are:

1. The methodology utilizes a static library of standard cells and is unable to create design-constraint-specific cells which can be optimally used to converge towards the design goal.
2. It is considered to be an open-loop approach.

On the other hand, the proposed methodology can automatically generate cells on demand and is considered to be a closed-loop design approach. The process of creating cells which are configurable in terms of drive strengths using ANNs is divided into two phases: the cell characterization phase (Section 3.1) and the ANN training and validation phase (Section 3.2).

3.1. Cell characterization phase

Consider a two-input logic gate as shown in Fig. 1. Cin is the input capacitance of the gate with respect to input B, and is directly proportional to the width of the transistors connected to that input terminal. If the width of the transistors increases, the delay reduces, but the power dissipation and area of the cell generally increase. Therefore, the width of transistors cannot be increased indefinitely to satisfy timing constraints. This is well supported by conventional LE theory used for speed optimization of CMOS logic circuits [19]. Further, the upper bound on the width of transistors for a cell is determined by the delay sensitivity factor. Transistor width is related to the input capacitance: as width increases, Cin also increases but delay reduces. Accordingly, the delay sensitivity factor S_D^Cin is defined as the differential of delay with respect to Cin [20]. It is expressed as

$$S_D^{C_{in}} = \frac{\partial D}{\partial C_{in}} \frac{C_{in}}{D} = -\frac{1}{N} \frac{\Delta_t}{\Delta_t + 1} \tag{6}$$

where S_D^Cin is the delay sensitivity factor, Cin is the input capacitance, D is the optimized delay, N is the number of stages in the critical path and Δ_t is the relative delay increment with respect to the parasitic delay.

Fig. 1. A logic gate/block.

When the modulus of S_D^Cin falls below a target value set by the library developer (5% in our case), the characterization is stopped. This implies that increasing Cin, and hence the transistor widths, has no considerable further effect on cell delay reduction, and it can be concluded that the delay has saturated. Cin is augmented in increments of Cmin, where Cmin represents the capacitance offered by the input terminal of the logic gate/block when the size of the transistors is set to minimum. Once the speed is saturated for a given CL and Cin, the absolute widths of the transistors are obtained using LE theory for each combination of Cin and CL. Using these widths, the layout can be generated automatically/semi-automatically using a layout synthesis tool such as CELLERITY [21], which also includes the effects of interconnects. Automation scripts in scripting languages like Tcl can then be used to extract precise values of parameters such as delay, power, rise time, fall time, leakage power, etc. by performing post-layout simulations of these layouts using SPICE netlists that include parasitics. However, in this paper datasets are prepared from pre-layout schematics, using HSPICE as the simulation platform and extracting the values of relevant parameters with Tcl scripts, since tools for automated generation of layouts are not available commercially. The aforementioned methodology for creating ANN-based standard cells configurable in terms of drive strengths is conceptualized in the form of a pseudocode as illustrated below.
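A minimal Python sketch of such a characterization loop is given here under stated assumptions; it is not the authors' pseudocode. The callback `delay_at` stands in for an HSPICE run, and the stopping test uses the relative delay reduction gained by the latest Cmin increment as a discrete proxy for the sensitivity factor of Eq. (6). Both choices are assumptions. The example replays the two-input NOR delays reported in Table 1 instead of invoking a simulator.

```python
from typing import Callable, List, Tuple

def characterize(delay_at: Callable[[float], float], c_min: float,
                 threshold: float = 0.05, max_steps: int = 64) -> List[Tuple[float, float]]:
    """Sweep Cin = c_min, 2*c_min, ... for one load and stop once an increment
    of c_min reduces the delay by less than `threshold` (relative)."""
    rows = [(c_min, delay_at(c_min))]
    for k in range(2, max_steps + 1):
        c_in = k * c_min
        delay = delay_at(c_in)
        previous = rows[-1][1]
        rows.append((c_in, delay))
        if (previous - delay) / previous < threshold:
            break  # delay is treated as saturated; this Cin is the upper bound
    return rows

# Delays (ps) from Table 1: two-input NOR gate, 16X load, Cin = 1.04 fF ... 10.4 fF.
TABLE1_DELAY_PS = [184.06, 117.76, 88.73, 74.14, 65.32, 59.36, 55.23, 51.98, 49.37, 47.25]
C_MIN_FF = 1.04

def table1_delay(c_in: float) -> float:
    """Stand-in for a SPICE simulation: look the delay up in Table 1."""
    return TABLE1_DELAY_PS[round(c_in / C_MIN_FF) - 1]

if __name__ == "__main__":
    rows = characterize(table1_delay, C_MIN_FF)
    print(len(rows), rows[-1])
```

With this proxy the sweep over the Table 1 data ends at the tenth point, Cin = 10.4 fF, which is where the paper reports saturation. Repeating the call for each load from 1X to 32X would yield the per-load datasets used later for ANN training.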

The characterization of a two-input NOR gate at 16X capacitive load is exemplified in Table 1. It shows the variation of delay and power with increasing Cin values. Note that the delay saturates at 47.25 ps for Cin = 10.4 fF. This leads to the determination of the upper bound on transistor widths early in the design cycle and hence defines the limits of the power (energy)-delay design space. This process is repeated for capacitive load values up to MX, where M is a positive integer whose value is set to 32 in this work. Thus, there will be 32 sets of data values similar to those in Table 1 for the range of capacitive loads 1X to 32X.

Table 1
Characterization of two-input NOR gate at 16X capacitive load.

| Cin (fF) | Delay (ps) | Power (μW) | Area (μm²) |
|---|---|---|---|
| 1.04 | 184.06 | 12.68 | 0.16 |
| 2.08 | 117.76 | 17.73 | 0.32 |
| 3.12 | 88.73 | 20.94 | 0.48 |
| 4.16 | 74.14 | 23.80 | 0.64 |
| 5.2 | 65.32 | 26.52 | 0.80 |
| 6.24 | 59.36 | 29.34 | 0.96 |
| 7.28 | 55.23 | 31.90 | 1.12 |
| 8.32 | 51.98 | 34.58 | 1.28 |
| 9.36 | 49.37 | 37.31 | 1.44 |
| 10.4 | 47.25 | 40.02 | 1.6 |

Fig. 2. Input and output parameters of the ANN.

Fig. 3. Regression plots for the trained ANN.

Fig. 5. Performance plot for the trained ANN.

Fig. 4. Histogram plot for the trained ANN.

3.2. Artificial neural network training and validation phase

In order to construct the ANN model of each logic cell, a standard 3-layer feedforward multi-layer perceptron (MLP) architecture has been considered, as shown in Fig. 2. Such a network generally includes an input layer, one or more hidden layers, and an output layer. MLPs have been successfully employed to solve difficult problems through supervised training with the error back-propagation algorithm. To minimize the training error, the weight parameters and the bias values are adjusted during the training procedure. The Levenberg-Marquardt (LM) back-propagation method has been utilized for training.

3.2.1. Prediction of transistor gate sizes using ANN

The prediction of transistor gate sizes corresponding to a target speed or power is achieved through three steps. First, an MLP ANN is created with input, hidden and output layers of 2, 20 and 1 neurons, respectively. The inputs of the ANN are load capacitance, delay and/or power, etc., and the target output is the Cin of the logic gate/block. In the next step, the ANN is trained using the back-propagation (BP) method on 320 datasets, i.e., 10 readings (Cin varying in fixed increments from Cmin to 10Cmin) for each of the 32 capacitive loads. In the BP training procedure, 70% of the datasets are used for training, 15% for testing, and 15% for validation of the ANN. Figs. 3–6 show the regression plot, histogram plot, performance plot and training graph, respectively, for a two-input NOR gate. Fig. 3 shows that the ANN has been fitted well to predict the value of Cin for all the sets of data, viz. the training, validation and test sets, and an overall accuracy of 99.9% is achieved. Once the ANN is trained and validated, it can be used to predict the response to any input (in our case CL and delay are the inputs and Cin is the output). Fig. 4 shows the histogram of the error values, i.e., the difference between target and predicted values. Fig. 5 represents the validation performance graph of mean squared error (MSE) versus number of epochs. The best validation performance of the trained ANN in terms of MSE is 0.013542 at epoch 157. Fig. 6 summarizes the training graph for the ANN under consideration. It can be clearly observed that the training procedures have been executed with good accuracy.

Fig. 6. Training graph for the trained ANN.

Fig. 7. Proposed closed-loop ASIC design flow based on ANNs.

The predicted Cin value corresponding to a fractional drive strength is used to estimate the transistor sizes based on LE theory on an automated basis throughout this work. The ANN is trained using M sets of data as mentioned previously in Section 3.1, and the resulting trained ANN represents a standard cell with dynamic drive strength. The power and area requirements for a standard cell of a particular drive strength at a specific load are determined well in advance using this ANN, and the cell can be appropriately picked or dropped by the synthesis tool in accordance with the design constraints, creating considerable scope for highly optimized power- and area-aware synthesis. It is also worth noting that although more emphasis is given to delay, power and area, a number of other circuit characterization parameters such as leakage power, noise, etc. can be used as either inputs or outputs for training the ANN.
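The dataset bookkeeping described in Section 3.2.1 can be sketched as follows. The sample layout, the fixed random seed and the helper names are assumptions for illustration; the training itself is done in the paper with the LM algorithm, which this snippet does not reproduce. It only enumerates the 32 × 10 characterization points, splits them 70/15/15, and counts the weights and biases of a fully connected 2-20-1 network.

```python
import random

def build_index(num_loads: int = 32, readings_per_load: int = 10):
    """One entry per characterization point: (load multiple, Cin multiple of Cmin)."""
    return [(load, k) for load in range(1, num_loads + 1)
            for k in range(1, readings_per_load + 1)]

def split_70_15_15(samples, seed: int = 0):
    """Shuffle and split into training, validation and test subsets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def mlp_parameter_count(layers=(2, 20, 1)) -> int:
    """Weights plus biases of a fully connected feedforward network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

if __name__ == "__main__":
    train, val, test = split_70_15_15(build_index())
    print(len(train), len(val), len(test))  # 224 48 48
    print(mlp_parameter_count())            # 81
```

With only 81 trainable parameters against 224 training samples, the network is small relative to the dataset, which is one plausible reason a second-order method such as LM is practical here.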

4. Proposed closed-loop ASIC design flow

The proposed approach describes a closed-loop process of ASIC design and optimization via automatic or manual creation of design-specific cells with desired characteristics (such as speed, area, power, noise, etc.), which are implemented in the form of ANN cells as described in the previous section. The cells obtained on the fly using the ANN in the loop are optimized with respect to the design constraints, thus bringing the powerful characteristics of design-specific customization to the standard cell-based design methodology. The drive strength corresponding to each logic gate/block, obtained as the solution of a convex optimization problem using posynomial equations [15] to satisfy the design constraints, is provided as input to a library consisting of dynamic drive strength cells in the form of trained ANNs, where each trained ANN in the library corresponds to a single logic cell/block. Using these values of drive strengths, the minimum delay for each stage in the critical path of the design under consideration is evaluated. Once the optimum delay for each logic cell/block is known, the input capacitance of the last-stage logic gate driving the output load capacitance is predicted using the ANN-based cell, which accepts CL and delay as inputs and predicts the Cin value. This Cin value serves as the load capacitance for the penultimate-stage logic gates. Iteratively applying this process to all remaining logic gates in the critical path, proceeding backwards from the last stage to the first stage, leads to the prediction of their respective Cin values using the ANN simulator (MATLAB 7.10 in our case). Since the CL and Cin values are available, transistor sizes can be readily evaluated based on LE theory using automation scripts in the loop. These transistor sizes are utilized to create target-constraint-specific cells either through manual layout design or through a semi-automatic/automatic layout generator such as CELLERITY [21], as depicted in Fig. 7. Although actual standard cells in the form of layouts are required after chip floorplanning and placement, important information such as the layout area corresponding to design-specific cells is provided during the floorplanning and placement stage for efficient utilization of chip area. Additionally, information regarding dynamic and leakage power dissipation, etc. is also available and can be utilized to satisfy power constraints optimally.
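The backward pass described in this section can be sketched as below. In the paper each predictor is a trained ANN; here `le_predictor` is a hypothetical stand-in that inverts the basic LE delay model d = τ(g·CL/Cin + p), and a single fan-out per stage is assumed so that each predicted Cin is the whole load of the preceding stage. The value τ = 7.3 ps is the one reported in Section 5.1 for the 130 nm PTM process; all function names are illustrative.

```python
from typing import Callable, List

# A predictor maps (load capacitance in fF, target delay in ps) to Cin in fF.
Predictor = Callable[[float, float], float]

def le_predictor(g: float, p: float, tau_ps: float = 7.3) -> Predictor:
    """Hypothetical stand-in for a trained ANN cell, from d = tau*(g*CL/Cin + p)."""
    def predict(c_load_fF: float, delay_ps: float) -> float:
        effort_delay = delay_ps / tau_ps - p
        if effort_delay <= 0:
            raise ValueError("target delay is below the parasitic delay")
        return g * c_load_fF / effort_delay
    return predict

def size_critical_path(predictors: List[Predictor], delays_ps: List[float],
                       c_load_fF: float) -> List[float]:
    """Walk from the last stage to the first; each predicted Cin becomes the
    load of the preceding stage (single fan-out assumed)."""
    c_in = [0.0] * len(predictors)
    load = c_load_fF
    for i in reversed(range(len(predictors))):
        c_in[i] = predictors[i](load, delays_ps[i])
        load = c_in[i]
    return c_in

if __name__ == "__main__":
    inverter = le_predictor(g=1.0, p=1.0)
    # Three inverters, each given a 36.5 ps budget (tau * 5), driving 64 fF.
    print(size_critical_path([inverter] * 3, [36.5] * 3, 64.0))
```

Where a stage drives several gates, the load passed to the preceding predictor would be the sum of the relevant input capacitances rather than a single Cin; that bookkeeping is omitted here for brevity.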

5. Simulation results and discussion

5.1. Implementation of a 5-stage arbitrary test circuit

A 5-stage circuit as shown in Fig. 8 has been employed to compare the proposed methodology with the conventional synthesis approach. Corresponding to each gate in the circuit, a test library is designed using HSPICE at 130 nm/1.2 V CMOS technology, choosing an inverter with Wn = 1.04 μm and Wp = 2.50 μm as the reference inverter (β = 2.42 ensures equal rising and falling delays, where β represents the ratio of the sizes of the PMOS and NMOS transistors). The details of the standard cells implemented in the test library are shown in Table 2. In this work, the circuit delay has been modeled in the form of posynomial functions as shown in Eqs. (7)−(12), and CVX, a package for specifying and solving convex programs [22], has been used to obtain the minimum arrival time at the output. Note, however, that the drive strength of the inverter in the first stage is kept fixed at 1×, equivalent to that of the reference inverter.

$$d_1 = 1.4 + 1.33x_2 + 1.66x_3 \tag{7}$$

$$d_2 = 4.2 + 1.66\,x_3 x_2^{-1} \tag{8}$$

$$d_3 = 4.2 + 1.66\,x_4 x_3^{-1} \tag{9}$$

$$d_4 = 4.2 + 1.66\,x_5 x_4^{-1} \tag{10}$$

$$d_5 = 2.8 + x_6 x_5^{-1} \tag{11}$$

$$d_6 = 1.4 + C_L x_6^{-1} \tag{12}$$

where di represents the delay of the ith gate, while xi expresses the drive strength of the gates G1−G6 (refer to Fig. 8). In the traditional methodology, the optimum drive strength for each gate is determined from the signal arrival times using the posynomial equations, and appropriate cells are picked from the cell library to satisfy the performance constraints. The fractional drive strengths obtained are rounded off to the nearest integer [13].

Fig. 8. Test circuit.

Table 2
Composition of the standard cell based test library.

| Logic gate | Standard cells |
|---|---|
| Inverter | inv_x1, inv_x2, inv_x4, inv_x8 |
| NAND3 | nand3_x1, nand3_x2, nand3_x4 |
| AND2 | and2_x1, and2_x2, and2_x4 |
| NOR2 | nor2_x1, nor2_x2, nor2_x4 |
| OR2 | or2_x1, or2_x2, or2_x4 |

A different approach is used in the proposed methodology, where the optimized delay of every gate is determined using the posynomial equations, based on the arrival time of signals at the output of each gate. Then, moving backwards from the last stage, which drives the external load (8X, 16X and 24X in our case), the optimum input capacitance is predicted by the ANN simulator using CL and delay as inputs and Cin as the output. For this purpose, an ANN library is utilized which is composed of the cells shown with the corresponding regression values in Table 3. The regression values indicate the accuracy with which each ANN has been trained and modeled using the datasets obtained from SPICE simulations, as described in the previous section. Note that the input capacitance of a gate in, say, the jth stage is one component of the load capacitance of the gate in the (j − 1)th stage; it is the entire load when the gate in the (j − 1)th stage drives only one gate in the jth stage, as with G5 and G6. Transistor widths are evaluated from the corresponding Cin values using LE theory and can be further used to generate schematic/netlist-driven layouts. These layouts potentially represent the design-constraint-specific cells needed to achieve the target specifications.

Table 3
Composition of the ANN based test library.

| Cell | Regression value |
|---|---|
| INVERTER_ANN | 0.96 |
| NAND3_ANN | 0.99 |
| AND2_ANN | 0.99 |
| NOR2_ANN | 0.99 |
| OR2_ANN | 0.99 |

Table 4 lists the arrival times an normalized with respect to the process-dependent parameter τ based on LE theory, whose value is determined as 7.3 ps from the delay vs fan-out graph for the 130 nm PTM CMOS process. It also shows the optimized delays dn corresponding to the nth gate (G1−G6) of the circuit under analysis. The drive strengths of the cells picked to satisfy the performance constraints are also highlighted along with their respective input capacitances.

Table 4
Cells utilized for the test circuit for 16X load at 130 nm CMOS process.

| Gate | an | dn (ps) | Conventional approach: standard cell picked | Conventional approach: Cin (fF) | Proposed approach: ANN based cell picked | Proposed approach: Cin (fF) |
|---|---|---|---|---|---|---|
| G1 | 3.63 | 26.4 | inv_x1 | 5.69 | inv_x1 | 5.69 |
| G2 | 9.3 | 41.3 | and2_x1 | 7.48 | and2_x1.1 | 8.3 |
| G3 | 9.3 | 41.3 | or2_x1 | 7.6 | or2_x1.36 | 10.4 |
| G4 | 14.97 | 41.3 | nand3_x1 | 4.39 | nand3_x1.72 | 7.9 |
| G5 | 20.64 | 33.9 | nor2_x2 | 16.2 | nor2_x1.22 | 10.4 |
| G6 | 24.29 | 26.6 | inv_x8 | 45.52 | inv_x1.86 | 10.1 |

It is clearly observed from Table 4 that the utilization of target-constraint-specific cells with fractional drive strengths leads to an overall reduction in the capacitance to be charged or discharged per signal transition, which has a direct impact on the power consumption of the circuit. Additionally, lower values of Cin result in the utilization of gates with smaller transistor widths. As a consequence, there is a reduction of 17.6%, 31.6%, 30.6%, 45.3% and 62% respectively in optimum delay, power dissipation, area, PDP and PDAP of the design under consideration, as demonstrated in Table 5. Fig. 9 shows the performance results for 8X and 24X loading conditions. The trend of considerable improvements continues, with the corresponding percentage reductions indicated against each parameter. Furthermore, the test circuit exhibited robust performance at the process corners TT, FF, FS, SF and SS. The timing and power characteristics of the test circuit at the 130 nm CMOS process node with applied process/voltage/temperature variations are shown in Fig. 10.

Table 5
Comparison of output parameters for 16X load at 130 nm CMOS process.

| Approach | Delay (ps) | Power dissipation (μW) | Area (μm²) | PDP (fJ) | PDAP (fJ·μm²) |
|---|---|---|---|---|---|
| Conventional | 194.2 | 337.9 | 13.65 | 64.9 | 885.8 |
| Proposed | 160 | 231 | 9.47 | 35.5 | 336.1 |

Fig. 9. Performance of test circuit at 8X and 24X (@ 130 nm).

Fig. 10. PVT analysis for test circuit at 16X load.

In order to demonstrate the effectiveness of the proposed methodology at advanced process nodes, simulations have been carried out using 90 nm PTM CMOS technology models for the same test circuit. The steps mentioned earlier for conducting simulations using 130 nm PTM CMOS technology models are repeated (except that the drive strength of the inverter in the first stage is no longer fixed), and the results obtained for 16X capacitive loading are given in Tables 6 and 7. Again, it is clearly observed from Table 7 that power dissipation, area, PDP and PDAP are reduced by 34.7%, 26.4%, 32.9% and 50.7% respectively with a negligible delay penalty. Additionally, the improvements in various parameters are reported in Fig. 11 for 8X and 20X loading at the 90 nm PTM CMOS process.

Table 6
Cells utilized for the test circuit for 16X load at 90 nm CMOS process.

| Gate | an | dn (ps) | Conventional approach: standard cell picked | Conventional approach: Cin (fF) | Proposed approach: ANN based cell picked | Proposed approach: Cin (fF) |
|---|---|---|---|---|---|---|
| G1 | 4.3 | 17.4 | inv_x8 | 7.0 | inv_x4.7 | 4.82 |
| G2 | 12.2 | 31.3 | and2_x2 | 7.2 | and2_x1.7 | 5.72 |
| G3 | 12.2 | 31.3 | or2_x2 | 2.7 | or2_x2.2 | 2.96 |
| G4 | 20.0 | 31.3 | nand3_x2 | 4.4 | nand3_x1.4 | 2.47 |
| G5 | 26.1 | 24.4 | nor2_x4 | 6.6 | nor2_x2.8 | 5.72 |
| G6 | 30.5 | 17.6 | inv_x8 | 7.0 | inv_x5.5 | 5.75 |

Table 7
Comparison of output parameters at 16X load at 90 nm CMOS process.

| Approach | Delay (ps) | Power dissipation (μW) | Area (μm²) | PDP (fJ) | PDAP (fJ·μm²) |
|---|---|---|---|---|---|
| Conventional | 91.22 | 148.4 | 7.29 | 13.53 | 98.6 |
| Proposed | 93.78 | 96.76 | 5.36 | 9.07 | 48.6 |

Fig. 11. Performance of test circuit at 8X and 20X (@ 90 nm).

5.2. Implementation of 8-bit asynchronous counter

An eight-bit asynchronous counter has been implemented using the proposed methodology at 130 nm/1.2 V CMOS technology as another proof of concept. For this purpose, a toggle flip-flop (T_FF) is designed using the benchmark transmission gate flip-flop (TGFF) [23]. The circuit diagram of the T_FF is shown in Fig. 12. The transistor sizing of the T_FF has been done in accordance with LE theory [20]. There are five stages in the critical path of the T_FF, and w1-w5 represent the normalized transistor widths in these stages. These widths are normalized with respect to the minimum transistor size (260 nm) at the CMOS technology under consideration. The feedback transistors are kept at minimum technology widths. If Cin and CL are known, the capacitance values at the internal nodes of the T_FF are evaluated using the capacitance transformation formula in Eq. (13), where Cout-i is the output capacitance, Cin-i is the input capacitance and gi is the logical effort of the ith stage in the network, while f̂ is the optimum stage effort (refer to [19] for further details).

$$C_{in-i} = (C_{out-i} \times g_i)/\hat{f} \tag{13}$$

The transistor sizes of the entire network are then calculated by utilizing the relationship between gate capacitance and transistor width given in Eq. (1). A standard cell library has been implemented, and details of standard timing parameters [24] such as setup time (Tsetup), clock-to-output delay (TCQ) and data-to-output delay (TDQ), etc. are provided in Table 8. In order to realize a configurable cell corresponding to the T_FF based on the proposed methodology, cell characterization is performed and datasets are obtained using SPICE simulations for capacitive loads rang-

Fig. 12. LE theory based sizing of Toggle Flip-flop.

Table 8 Details of standard cells designed for Toggle flip-flop. Cell

Cin (fF)

Tsetup (ps)

TCQ (ps)

TDQ (ps)

Capacitive load (fF)

T_FF_1x T_FF_2x T_FF_4x T_FF_8x

2.4 4.8 9.6 19.2

50 39 39 30

92.28 91.98 95.91 97.43

142.28 130.98 134.91 127.43

4X 8X 16X 32X

18

K. Singh et al.

Integration, the VLSI Journal 69 (2019) 10–22

Table 9 Simulation results for 20X capacitive load. Cell

Cin (fF)

Tsetup (ps)

TCQ (ps)

TDQ (ps)

Power dissipation (μW)

Area (μm2 )

T_FF_4x T_FF_8x T_FF_6.07x

9.60 19.20 10.63 (predicted)

38 31 32

97.92 93.9 93.97

135.92 124.90 125.97

203.20 314 232.60

2.88 5.64 3.70

Table 10 Simulation results for 28X capacitive load. Cell

Cin (fF)

Tsetup (ps)

TCQ (ps)

TDQ (ps)

Power dissipation (μW)

Area (μm2 )

T_FF_4x T_FF_8x T_FF_5.76x

9.60 19.20 8.42 (predicted)

39 30 32

102.20 96.40 97.64

141.2 126.40 129.64

205.60 318.80 214.70

2.88 5.64 3.10

ing from 1X-32X which is achieved by varying Cin in fixed increments of 1.21 fF (1.21 fF - 12.1 fF corresponding to Cmin - 10Cmin ) for each load value and recording the values of delay, power and area as mentioned previously in subsection 3.1. This represents cell characterization phase. Hence, 320 datasets (10 sets of values for each load ranging from 1X-32X) are obtained and then an ANN is trained with CL and delay as inputs and Cin as output till a regression of 0.97 is achieved. This trained ANN can now be used to predict the value of Cin for target CL and delay values and these steps form part of training and validation phase. Let us consider the following two design goals as examples: 1. CL = 20X and TDQ = 125ps 2. CL = 28X and TDQ = 130ps With respect to these design goals, it is observed from Tables 9 and 10 that in both cases only T_FF_8x cell is appropriately picked from the standard cell library because T_FF_4x cell fails to meet the target delay constraint for the specified loads. In case of proposed methodology, ANN has been utilized for prediction of Cin by specifying the design goals in terms of capacitive loads and respective delays. The predicted Cin values thus obtained for both cases are found to be 10.63 fF and 8.42 fF. These Cin values along with target capacitive loads are utilized to obtain transistor sizes for the design goal specific cells using LE theory. Using these transistor widths, the drive strengths of cells obtained using proposed methodology are determined using SPICE with respect to the cells in standard cell library and are estimated as 6.07x and 5.76x for the first and second design goals respectively. Simulation results indicate that for these drive strengths, delay constraints are met and the resulting cells lead to 25.9% and 32.6% reduction in power dissipation while area is reduced by 34.4% and 44.8% respectively when compared to T_FF_8x standard cell. The net improvement in terms of PDAP is highlighted in Fig. 13. 
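The characterization and training steps described above can be illustrated with a small, self-contained sketch. Everything specific in it is an assumption made for illustration: the SPICE runs are replaced by a hypothetical linear delay model (`surrogate_delay_ps`), and the network size, learning rate and plain stochastic gradient descent are arbitrary choices, since the text does not state the ANN topology or training tool. Only the sampling grid (10 values of Cin in 1.21 fF steps for each load from 1X to 32X, giving 320 samples) and the input/output roles (CL and delay in, Cin out) come from the text.

```python
import math
import random

C_MIN_FF = 1.21  # Cin step and load unit from the text (1X = 1.21 fF)


def surrogate_delay_ps(c_in_ff, c_load_ff, tau_ps=10.0, parasitic=4.0):
    """Hypothetical stand-in for one SPICE run (linear LE-style delay model)."""
    return tau_ps * (parasitic + c_load_ff / c_in_ff)


def build_dataset():
    """10 Cin values for each load from 1X to 32X: 320 (CL, delay, Cin) rows."""
    rows = []
    for load_mult in range(1, 33):
        c_load = load_mult * C_MIN_FF
        for k in range(1, 11):
            c_in = k * C_MIN_FF
            rows.append((c_load, surrogate_delay_ps(c_in, c_load), c_in))
    return rows


def _mean_std(values):
    mean = sum(values) / len(values)
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def train_inverse_model(rows, hidden=6, epochs=100, lr=0.02, seed=0):
    """Train a 2-input, one-hidden-layer tanh network mapping (CL, delay) -> Cin."""
    rng = random.Random(seed)
    stats = [_mean_std([r[c] for r in rows]) for c in range(3)]
    data = [tuple((r[c] - stats[c][0]) / stats[c][1] for c in range(3)) for r in rows]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(hidden)]  # w1, w2, bias
    v = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]  # output weights + bias

    def forward(a, b):
        h = [math.tanh(wj[0] * a + wj[1] * b + wj[2]) for wj in w]
        return h, sum(vj * hj for vj, hj in zip(v, h)) + v[-1]

    def mse():
        return sum((forward(a, b)[1] - t) ** 2 for a, b, t in data) / len(data)

    initial_mse = mse()
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:
            a, b, t = data[idx]
            h, out = forward(a, b)
            err = out - t
            for j in range(hidden):
                grad_pre = err * v[j] * (1.0 - h[j] ** 2)  # uses v[j] before its update
                v[j] -= lr * err * h[j]
                w[j][0] -= lr * grad_pre * a
                w[j][1] -= lr * grad_pre * b
                w[j][2] -= lr * grad_pre
            v[-1] -= lr * err

    def predict(c_load_ff, delay_ps):
        """Predict Cin (fF) for a target load (fF) and delay (ps)."""
        a = (c_load_ff - stats[0][0]) / stats[0][1]
        b = (delay_ps - stats[1][0]) / stats[1][1]
        return forward(a, b)[1] * stats[2][1] + stats[2][0]

    return {"predict": predict, "initial_mse": initial_mse, "final_mse": mse()}


def regression_r(model, rows):
    """Pearson correlation between predicted and actual Cin over the dataset."""
    preds = [model["predict"](cl, d) for cl, d, _ in rows]
    targets = [c for _, _, c in rows]
    mp, mt = sum(preds) / len(preds), sum(targets) / len(targets)
    cov = sum((p - mp) * (t - mt) for p, t in zip(preds, targets))
    var_p = sum((p - mp) ** 2 for p in preds)
    var_t = sum((t - mt) ** 2 for t in targets)
    return cov / math.sqrt(var_p * var_t)


if __name__ == "__main__":
    rows = build_dataset()
    model = train_inverse_model(rows)
    print("R on training data:", regression_r(model, rows))
    print("Predicted Cin for CL = 20X:", model["predict"](20 * C_MIN_FF, 125.0))
```

Because the delay data here are synthetic, the predicted Cin values are not expected to reproduce the 10.63 fF and 8.42 fF reported in the text; the sketch only shows the shape of the inverse-modeling step, in which the design goal is the input and the cell parameter is the output.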
It is noteworthy that, in addition to delay, power and area can also be included as design constraints in the target specifications. We have assumed that the circuit is being designed for a speed-critical application, and hence power and area constraints have not been applied in the design goals. In addition, the power dissipation and area of the T_FFs obtained by estimating the transistor widths from the Cin value predicted by the ANN are significantly lower for a target CL and delay. Both cells are equivalent to a fully customized cell designed for a target load and delay value, and thus represent cells with fractional drive strengths, viz. 6.07x and 5.76x (drive strength values obtained through calibrated simulations).
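Once Cin has been predicted, the LE-based sizing step applies Eq. (13) stage by stage from the output back to the input. The sketch below assumes a five-stage path with no branching and uses placeholder logical effort values; the actual gi values of the T_FF stages are not listed in the text, so `HYPOTHETICAL_G` is purely illustrative. The optimum stage effort is taken as the N-th root of the path effort, which is the standard LE result [19].

```python
# Back-propagation of capacitances along a path using Eq. (13):
#   C_in_i = C_out_i * g_i / f_hat
# Assumptions: no branching, and placeholder logical efforts (not the paper's values).

HYPOTHETICAL_G = [1.0, 2.0, 1.0, 2.0, 1.0]  # illustrative g_i for five stages


def optimum_stage_effort(c_in_ff, c_load_ff, logical_efforts):
    """f_hat = (G * H)^(1/N), with G the product of g_i and H = CL / Cin."""
    path_g = 1.0
    for g in logical_efforts:
        path_g *= g
    path_h = c_load_ff / c_in_ff
    return (path_g * path_h) ** (1.0 / len(logical_efforts))


def stage_input_caps(c_in_ff, c_load_ff, logical_efforts):
    """Return (f_hat, [C_in_1, ..., C_in_N]) by applying Eq. (13) from the load backwards."""
    f_hat = optimum_stage_effort(c_in_ff, c_load_ff, logical_efforts)
    caps = []
    c_out = c_load_ff
    for g in reversed(logical_efforts):
        c_in_stage = c_out * g / f_hat
        caps.append(c_in_stage)
        c_out = c_in_stage  # this stage's input loads the previous stage
    caps.reverse()
    return f_hat, caps


if __name__ == "__main__":
    # First design goal: predicted Cin = 10.63 fF, CL = 20X = 24.2 fF.
    f_hat, caps = stage_input_caps(10.63, 24.2, HYPOTHETICAL_G)
    print("stage effort:", round(f_hat, 3))
    print("stage input capacitances (fF):", [round(c, 2) for c in caps])
```

A useful sanity check is that the input capacitance computed for the first stage reproduces the Cin that was supplied, which holds whenever the stage effort is the N-th root of the path effort.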

Fig. 13. PDAP improvement in ANN based Toggle Flip-flops.
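The comparison behind Fig. 13 can be approximated from Tables 9 and 10. The sketch below picks the smallest library cell that meets the TDQ target and then reports the savings of the ANN-based cell relative to it. The helper names are hypothetical, and PDAP is computed here with TDQ as the delay term, which is an assumption; the figure itself may have been generated with a different delay metric.

```python
# Cell selection and savings for the two design goals, using Tables 9 and 10.
# Tuples are (name, TDQ in ps, power in uW, area in um^2).

LIBRARY_20X = [("T_FF_4x", 135.92, 203.20, 2.88), ("T_FF_8x", 124.90, 314.0, 5.64)]
ANN_CELL_20X = ("T_FF_6.07x", 125.97, 232.60, 3.70)

LIBRARY_28X = [("T_FF_4x", 141.2, 205.60, 2.88), ("T_FF_8x", 126.40, 318.80, 5.64)]
ANN_CELL_28X = ("T_FF_5.76x", 129.64, 214.70, 3.10)


def pick_standard_cell(library, target_tdq_ps):
    """Return the smallest-area library cell whose TDQ meets the target, or None."""
    feasible = [cell for cell in library if cell[1] <= target_tdq_ps]
    return min(feasible, key=lambda cell: cell[3]) if feasible else None


def pdap(cell):
    """Power-delay-area product in fJ.um^2, using TDQ as the delay (assumption)."""
    _, tdq_ps, power_uw, area_um2 = cell
    return tdq_ps * power_uw * 1e-3 * area_um2


def savings_pct(reference, candidate):
    """Percentage savings of the candidate cell relative to the reference cell."""
    def pct(ref, cand):
        return 100.0 * (ref - cand) / ref
    return {
        "power": pct(reference[2], candidate[2]),
        "area": pct(reference[3], candidate[3]),
        "pdap": pct(pdap(reference), pdap(candidate)),
    }


if __name__ == "__main__":
    for library, ann_cell, target in (
        (LIBRARY_20X, ANN_CELL_20X, 125.0),
        (LIBRARY_28X, ANN_CELL_28X, 130.0),
    ):
        chosen = pick_standard_cell(library, target)
        print(chosen[0], "->", ann_cell[0], savings_pct(chosen, ann_cell))
```

Small differences between the recomputed percentages and those quoted in the text are attributable to rounding of the tabulated values.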

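5.2.1. Calculation of capacitances at internal nodes of Toggle_FF (refer Fig. 12)

Node A:
\[ C_A = \sum_{i=1,3} \left[ C_{gd}(XN_i) + C_{gd}(XP_i) + C_{db}(XN_i) + C_{db}(XP_i) \right] \tag{14} \]

Node B:
\[ C_B = C_g(XN_2) + C_g(XP_3) \tag{15} \]

Node C:
\[ C_C = \sum_{i=2,3} \left[ C_{gd}(XN_i) + C_{gd}(XP_i) + C_{db}(XN_i) + C_{db}(XP_i) \right] + \sum_{i=4} \left[ C_g(XN_i) + C_g(XP_i) \right] \tag{16} \]

Node D:
\[ C_D = \sum_{i=4,5} \left[ C_{gd}(XN_i) + C_{gd}(XP_i) + C_{db}(XN_i) + C_{db}(XP_i) \right] \tag{17} \]

Node E:
\[ C_E = \sum_{i=5,7} \left[ C_{gd}(XN_i) + C_{gd}(XP_i) + C_{db}(XN_i) + C_{db}(XP_i) \right] + \sum_{i=8} \left[ C_g(XN_i) + C_g(XP_i) \right] \tag{18} \]

Node F:
\[ C_F = \sum_{i=8,9} \left[ C_{gd}(XN_i) + C_{gd}(XP_i) + C_{db}(XN_i) + C_{db}(XP_i) \right] + \sum_{i=6} \left[ C_g(XN_i) + C_g(XP_i) \right] \tag{19} \]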


Table 11. Technology parameters used for capacitance calculations.

| Parameter | Cgdo (F/m) | Cgso (F/m) | Cjsw (F/m) | Cj (F/m²) |
|---|---|---|---|---|
| NMOS | 3.68E-10 | 3.68E-10 | 2.06E-10 | 0.00137 |
| PMOS | 3.63E-10 | 3.63E-10 | 9.8E-11 | 0.00136 |
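As a rough illustration of how the Table 11 parameters translate into per-transistor capacitances, the sketch below uses a common first-order estimate: the gate-to-drain term as overlap capacitance (Cgdo times width) and the drain-body term as zero-bias junction capacitance (Cj times drain area plus Cjsw times drain perimeter). This is an assumption made for illustration and not necessarily the exact procedure of [25]; the drain diffusion length `DRAIN_LENGTH_M` is a hypothetical value, and the full drain perimeter is used for the sidewall term.

```python
# First-order parasitic capacitance estimate from the Table 11 parameters.
# Assumptions: zero-bias junctions, full-perimeter sidewall term, and a
# hypothetical drain diffusion length (not given in the text).

TECH = {
    "nmos": {"cgdo": 3.68e-10, "cgso": 3.68e-10, "cjsw": 2.06e-10, "cj": 1.37e-3},
    "pmos": {"cgdo": 3.63e-10, "cgso": 3.63e-10, "cjsw": 9.8e-11, "cj": 1.36e-3},
}
DRAIN_LENGTH_M = 0.30e-6  # hypothetical drain diffusion length
W_MIN_M = 260e-9          # minimum transistor width quoted in the text


def c_gd(kind, width_m):
    """Gate-to-drain overlap capacitance in farads: Cgdo * W."""
    return TECH[kind]["cgdo"] * width_m


def c_db(kind, width_m, drain_length_m=DRAIN_LENGTH_M):
    """Drain-body junction capacitance in farads: Cj * area + Cjsw * perimeter."""
    area = width_m * drain_length_m
    perimeter = 2.0 * (width_m + drain_length_m)
    return TECH[kind]["cj"] * area + TECH[kind]["cjsw"] * perimeter


def parasitic_sum(devices):
    """Sum Cgd + Cdb over (kind, width_m) pairs, as in the parasitic terms of the node equations."""
    return sum(c_gd(kind, w) + c_db(kind, w) for kind, w in devices)


if __name__ == "__main__":
    # Example: one NMOS/PMOS pair at minimum width (widths are illustrative).
    pair = [("nmos", W_MIN_M), ("pmos", W_MIN_M)]
    print("Cgd (NMOS, 260 nm): %.4g fF" % (c_gd("nmos", W_MIN_M) * 1e15))
    print("Cdb (NMOS, 260 nm): %.4g fF" % (c_db("nmos", W_MIN_M) * 1e15))
    print("Parasitic sum for the pair: %.4g fF" % (parasitic_sum(pair) * 1e15))
```

Node capacitances of the kind defined for the T_FF are then obtained by summing such terms over the transistors attached to each node and adding the gate capacitance of the driven devices.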

Node G:
\[ C_G = \sum_{i=9,11} \left[ C_{gd}(XN_i) + C_{gd}(XP_i) + C_{db}(XN_i) + C_{db}(XP_i) \right] + \sum_{i=12} \left[ C_g(XN_i) + C_g(XP_i) \right] \tag{20} \]

Node H:
\[ C_H = \sum_{i=12} \left[ C_{gd}(XN_i) + C_{gd}(XP_i) + C_{db}(XN_i) + C_{db}(XP_i) \right] + \sum_{i=10,13} \left[ C_g(XN_i) + C_g(XP_i) \right] \tag{21} \]

Clock load:
\[ C_{clock-load} = \sum_{i=5,7,9,11} \left[ C_g(XN_i) + C_g(XP_i) \right] \tag{22} \]

Eqs. (14)–(22) represent the expressions for evaluating the absolute capacitance values at the various nodes of the T_FF, where Cg is the gate capacitance, Cgd is the gate-to-drain capacitance and Cdb is the drain-body capacitance. We have assumed that Cgs (gate-to-source capacitance) and Csb (source-to-body capacitance) are equal to Cgd and Cdb respectively. Capacitance estimation is done using the technology parameters listed in Table 11 and the procedure explained in [25]. A comparison of the total gate capacitance (Cgtotal), the parasitic capacitance at internal nodes (Cinternal) and the clock load (Cclock-load) for T_FF_8x and the cells obtained from the ANN is given in Table 12. It is noteworthy that T_FF_6.07x offers approximately 34%, whereas T_FF_5.76x offers approximately 44%, reduction in all three types of capacitance considered, viz. Cgtotal, Cinternal and Cclock-load. This is the reason behind the significant improvement in power dissipation of the cells obtained using the proposed methodology.

Table 12. Comparison of capacitances for T_FF_8x and cells obtained from ANN.

| Cell | Cgtotal (fF) | Cinternal (fF) | Cclock-load (fF) |
|---|---|---|---|
| T_FF_8x | 255.77 | 175.41 | 22.85 |
| T_FF_6.07x | 170.40 | 114.56 | 14.92 |
| T_FF_5.76x | 144.54 | 97.32 | 12.80 |

The sizing ratios of the transistors lying in the critical path of the T_FF (Fig. 12) have been obtained using LE theory, and the resulting normalized scale factors (w1-w5) are listed in Table 13. The scale factor in our case represents the absolute transistor width normalized with respect to 260 nm, which is fixed as the minimum transistor size in the 130 nm CMOS process.

Table 13. Scale factors for Toggle_FF at 130 nm CMOS technology.

| Parameter | w1 | w2 | w3 | w4 | w5 |
|---|---|---|---|---|---|
| T_FF_8x | 15.96 | 13.76 | 11.96 | 10.38 | 18.07 |
| T_FF_6.07x | 8.69 | 8.23 | 7.88 | 7.53 | 14.38 |
| T_FF_5.76x | 6.76 | 6.76 | 6.76 | 6.76 | 13.53 |

Both the cells obtained using the ANN show reliable circuit behaviour against process/voltage/temperature variations, as demonstrated in Fig. 14.

Fig. 14. PVT analysis for T_FF_6.07x and T_FF_5.76x cells.

These cells have been further utilized to implement an 8-bit counter as illustrated in Fig. 15. In order to maintain uniform loading conditions for every flip-flop in the counter, a buffer (consisting of two inverters) has been inserted between the output terminal of each flip-flop and the clock terminal of the flip-flop in the next stage, as indicated in Fig. 15. Both the inverters comprising the buffer are identical and have been sized such that the input capacitance of the first inverter is equal to 20X (24.2 fF) and 28X (33.8 fF) respectively for the target specifications under consideration. The average power consumption of the counters in each case is estimated over 256 clock cycles at 125 MHz, 250 MHz and 500 MHz clock frequencies. The reduced power consumption levels are indicated in Figs. 16 and 17. The simultaneous improvements in area occupied for the test counter at 20X and 28X loading are reported as 28.8% and 37.8% respectively. Although the circuit has been tested for only two target design specifications, there can be numerous points in the design space where the proposed approach produces significant improvement over the traditional open-loop ASIC design methodology.

Fig. 15. Circuit diagram of counter.

Fig. 16. Histogram showing power savings of counter @ 20X load.

In order to validate correct behaviour of the optimized 5-stage test circuit and the fractional drive strength cells, viz. T_FF_6.07x and T_FF_5.76x, implemented using the set of widths predicted by the ANNs, Monte Carlo simulations were also executed. The number of runs was restricted to 100, wherein both the transistor sizes and the threshold voltage (obtained from model parameters) were varied using a Gaussian pdf with a standard deviation of 5%. All the circuits under consideration exhibited normal circuit behaviour in terms of logical functionality for the aforementioned analysis.

6. Conclusion

The paper introduces a novel closed-loop ASIC design flow. The proposed design flow is based on a mechanism for realization of dynamic drive strength standard cells utilizing LE theory and ANNs. The closed-loop methodology is capable of generating cells of any drive strength on the fly for the targeted circuit design specifications, subject to the availability of an automated layout generator in the design flow. Hence, it represents a critical step towards eliminating the gap between ASIC and full custom design techniques. Simulation results indicate that by using the proposed approach for synthesis, substantial savings are obtained on a consistent basis in power dissipation, area, PDP and PDAP for a wide range of capacitive loading conditions. Therefore, the proposed methodology leads to power and area efficient logic synthesis which can be used for rapid design of digital ICs, optimized in a three dimensional design space, viz. timing, power and area on chip. The improvements are expected to increase further with advancements in accurate automated layout generation tools and ANN modeling techniques.

Fig. 17. Histogram showing power savings of counter @ 28X load.

Acknowledgement

The authors would like to thank the authorities at NSIT for the financial support provided to file the patent application based on this work in India. This work is protected by Indian Patent Application 3282/DEL/2012.

References

[1] D.G. Chinnery, K. Keutzer, Closing the gap between ASIC and custom: an ASIC perspective, in: Proceedings of Design Automation Conference, Los Angeles, USA, 2000, pp. 637–642.
[2] R. Pasqualini, Method and Metric for Low Power Standard Cell Logic Synthesis, U.S. Patent 7260808 (2007).
[3] D. Chinnery, K. Keutzer, Closing the Gap between ASIC and Custom: Tools and Techniques for High-Performance ASIC Design, Kluwer Academic Publishers, 2004.
[4] M. Avci, T. Yildirim, Neural network based MOS transistor geometry decision for TSMC 0.18 μm process technology, in: Proceedings of ICCS, Part IV, LNCS 3994, 2006, pp. 615–622.
[5] F. Gunes, F. Gurgen, G. Torpi, Signal-noise neural network model for active microwave devices, IEE Proc. Circuits Dev. Syst. 143 (1) (1996) 1–8.
[6] S.K. Mandal, S. Sural, A. Patra, ANN and PSO-based synthesis of on-chip spiral inductors for RF ICs, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 27 (1) (2008) 188–192.
[7] F. Djeffal, M. Chahdi, A. Benhaya, M.L. Hafiane, An approach based on neural computation to simulate the nanoscale CMOS circuit, Solid State Electron. 51 (1) (2007) 48–56.
[8] G. Wolfe, R. Vemuri, Extraction and use of neural network models in automated synthesis of operational amplifiers, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 22 (2) (2003) 198–212.
[9] N. Kahraman, T. Yildirim, Technology independent circuit sizing for standard cell based design using neural networks, Digit. Signal Process. 19 (4) (2009) 708–714.
[10] N. Kahraman, T. Yildirim, Technology independent circuit sizing for fundamental analog circuits using artificial neural networks, in: Proceedings of Ph.D. Research in Microelectronics and Electronics, 2008, pp. 1–4.
[11] A. Jafari, S. Sadri, M. Zekri, Design optimization of analog integrated circuits by using artificial neural networks, in: International Conference on Soft Computing and Pattern Recognition, 2010, pp. 385–388.
[12] Y. Cao, T. Sato, D. Sylvester, M. Orshansky, C. Hu, New paradigm of predictive MOSFET and interconnect modeling for early circuit design, in: IEEE Custom Integrated Circuits Conference, 2000, pp. 201–204.
[13] N.H.E. Weste, D.M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, fourth ed., Addison-Wesley, 2011.
[14] G. Palumbo, M. Pennisi, Design guidelines for high-speed transmission-gate latches: analysis and comparison, in: IEEE International Conference on Electronics, Circuits and Systems, 2008, pp. 145–148.
[15] S. Boyd, S.J. Kim, D. Patil, M. Horowitz, Digital circuit optimization via geometric programming, Oper. Res. 53 (6) (2005) 899–932.
[16] S. Boyd, L. Vandenberghe, Convex Optimization, first ed., Cambridge University Press, 2004.
[17] S. Sapatnekar, V. Rao, P. Vaidya, S.M. Kang, An exact solution to the transistor sizing problem for CMOS circuits using convex optimization, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 12 (11) (1993) 1621–1634.
[18] S. Boyd, S.J. Kim, L. Vandenberghe, et al., A tutorial on geometric programming, Oper. Res. 67 (8) (2007) 67–127.
[19] I.E. Sutherland, B. Sproull, D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, 1999.
[20] M. Alioto, E. Consoli, G. Palumbo, General strategies to design nanometer flip-flops in the energy-delay space, IEEE Trans. Circ. Syst.-Part I 57 (7) (2010) 1583–1596.
[21] M. Guruswamy, R.L. Maziasz, D. Dulitz, S. Raman, V. Chiluvuri, A. Fernandez, L.G. Jones, CELLERITY: a fully automatic layout synthesis system for standard cell libraries, in: Proceedings of IEEE Design and Automation Conference, 1997, pp. 327–332.
[22] CVX Research, CVX: Matlab Software for Disciplined Convex Programming, Version 2.0, 2014, http://cvxr.com/cvx (accessed January 2014).
[23] G. Gerosa, S. Gary, C. Dietz, D. Pham, K. Hoover, J. Alvarez, H. Sanchez, P. Ippolito, T. Ngo, S. Litch, J. Eno, J. Golab, N. Vanderschaaf, J. Kahle, A 2.2 W, 80 MHz superscalar RISC microprocessor, IEEE J. Solid State Circ. 29 (12) (1994) 1440–1452.
[24] V. Oklobdzija, V. Stojanovic, D. Markovic, V. Nedovic, Digital System Clocking: High Performance and Low Power Aspects, Wiley, 2003.
[25] M. Alioto, E. Consoli, G. Palumbo, Flip-flop Design in Nanometer CMOS: from High Speed to Low Energy, Springer, 2015.