Integration, the VLSI Journal 70 (2020) 21–31
Towards an automated design flow for memristor based VLSI circuits
Lei Xie ∗, Hao Cai, Chao Wang, Jun Yang
National ASIC Engineering Center, Southeast University, Nanjing, China
ABSTRACT
As today's CMOS technology gradually scales down to its physical limits, emerging technologies such as carbon nanotubes, magnetic tunneling junctions and memristors are being investigated as future alternatives. Among them, the memristor is a promising candidate for implementing future VLSI circuits, as it provides great scalability, near-zero standby power consumption, etc. In order to design memristor-based VLSI circuits and explore their potential, it is crucial to develop an automated design flow. However, such a design flow is still missing. This paper proposes an automated design flow, Mosys, which reuses parts of existing CMOS VLSI circuit design tools. Mosys provides a circuit design flow from a Verilog programming interface to performance estimation models. In addition, it employs a probabilistic power estimation model instead of one based on exhaustive search. In our experiments, this model reduces the running time by a factor of over 3000 with a marginal error (<1%) as compared to the state-of-the-art. To verify the whole Mosys flow, several integer arithmetic functional units (e.g., add, multiply) are described in Verilog and implemented. In addition, Mosys is compared with the state-of-the-art using the EPFL benchmark suite. The results show that Mosys improves the area (6.29x) and delay (4.68x) on average.
1. Introduction

Today's down-scaling of CMOS technology is gradually approaching its physical device limits. As a consequence, major challenges have accumulated, such as reduced reliability, saturated performance gain and increased leakage power consumption [1–4]. In order to address such challenges, emerging device technologies (e.g., carbon nanotubes, tunneling field-effect transistors, graphene transistors, magnetic tunneling junctions and memristors [1–4]) are under investigation as alternatives for implementing future VLSI circuits. Among them, the memristor is a promising candidate [2–5], since massive numbers of memristor devices can easily be implemented using a crossbar architecture, where memristors are located at the intersections of row and column nanowires. A memristor crossbar provides great scalability, higher integration density, etc. [3,5,6]. Memristor crossbars have shown great potential to realize a wide range of applications, such as neuromorphic systems [7–9], computation-in-memory processors [10–15], non-volatile memories [16–19] and logic circuits [20–27]. Memristor-based logic circuits can be divided into three major categories based on their primitive logic gates [5]: threshold/majority logic [23,26], material implication logic [21], and Boolean logic [22,24,25,28]. The reported logic styles focus only on fundamental logic gates (e.g., AND) or simple manually designed logic circuits (e.g., a 1-bit full adder). Therefore, a large gap exists between simple manually designed circuits and complex VLSI circuits.
To bridge this gap, the development of an automated design flow is a crucial step. So far, only a limited number of design flows have been reported for threshold/majority logic [27], implication logic [33], and Boolean logic [29–31]. As Boolean logic has been widely used in CMOS VLSI circuits for decades, existing IP cores and EDA tools can be conveniently reused. Therefore, this paper concentrates on developing an automated design flow for Boolean logic. Table 1 summarizes the characteristics of the state-of-the-art [29,30] in several aspects. It clearly reveals the following issues. First, the reported works do not provide a design flow that includes a programming interface. Today's digital designs and IP cores are typically described in a hardware description language such as Verilog or VHDL. Hence, it is inconvenient (or even impossible) to develop new VLSI circuits or reuse existing soft IP cores without a programming interface. Second, the reported works employ power models that calculate the power consumption for each possible input combination and then average the results. This is referred to as the exhaustive-search method [34,35]. However, the number of possible input combinations increases exponentially as the number of input bits increases linearly. Hence, it is almost impossible to estimate the power consumption when the number of input bits is large.

To address the above issues of the state-of-the-art, this paper proposes Mosys, a design flow for memristor-based VLSI circuits that covers all steps from the programming interface to performance estimation. This work is built on top of our preliminary work published in Ref. [32], which proposed Mosaic. As compared to the preliminary work, the new contributions of this paper are:

∙ An automated design flow. It supports a Verilog programming interface by reusing an open-source Verilog front end. Its input is Verilog source code; its output includes the SPICE netlists for the crossbar part as well as (synthesized) Verilog files of the CMOS controller.
∙ A power estimation model based on a probabilistic method. It significantly reduces the simulation time (up to 3000x) with a marginal error (<1%).
∙ A comprehensive performance evaluation of the Mosys design flow. A complex benchmark suite is used and the results are compared with the state-of-the-art. The results show that Mosys reduces the area (6.29x) and delay (4.68x) on average.

Table 1 Summary of Boolean logic design flows.
Design flow | Ref. | Prog. interface | Logic synthesis | Place-and-route | Perf. estimation | Power model
Du | [29] | No | Yes | No | Xbar | Search
XbarGen | [30] | No | Yes | No | Xbar | Search
Staircase | [31] | No | Yes | Yes | Xbar | No
Mosaic | [32] | No | Yes | Yes | Xbar & CMOS | Search
Mosys | This work | Verilog | Yes | Yes | Xbar & CMOS | Prob.

∗ Corresponding author. E-mail addresses: [email protected] (L. Xie), [email protected] (H. Cai), [email protected] (C. Wang), [email protected] (J. Yang).
https://doi.org/10.1016/j.vlsi.2019.09.009
Received 14 June 2019; Received in revised form 3 September 2019; Accepted 14 September 2019; Available online 19 September 2019
0167-9260/© 2019 Elsevier B.V. All rights reserved.
Fig. 1. Electronic characteristics of a memristor device.
The remainder of this paper is structured as follows. Section 2 describes the fundamentals of Boolean logic based on memristor crossbars. Section 3 proposes the Mosys design flow. Section 4 presents the guidelines for using the Verilog programming interface. Section 5 describes the proposed probabilistic power model. Section 6 evaluates Mosys and compares it with the state-of-the-art. Section 7 discusses the advantages and limitations of Mosys. Finally, the last section concludes the paper.
Fig. 2. Primitive logic gates.
2. Memristor crossbar based Boolean logic

This section presents the fundamentals of Boolean logic circuits based on memristor crossbars [24,25,28]. For brevity, this logic style is referred to as Crossbar Boolean Logic (CBL). The section first presents the electronic characteristics of a single memristor device. Subsequently, it presents the working principle of the primitive logic gates, including Copy, AND, NAND and Inverter (INV). Finally, it gives a simple example of how to map the primitive logic gates onto a crossbar array.

2.1. Electronic characteristics

Fig. 1 shows the electronic characteristics of a memristor device [36]. A memristor has a high (RH) and a low (RL) resistive state. In order to switch a memristor from one resistive state to the other, a voltage must be applied across the device and its absolute value must be greater than a threshold voltage Vth. Otherwise, the memristor keeps its current resistive state. The process in which a memristor switches from the high to the low resistive state is referred to as SET, while the opposite switching process is referred to as RESET. Note that the polarities of the applied voltage for SET and RESET are opposite (see the right part of Fig. 1). CBL requires the voltages Vw and Vh to control primitive logic gates, as shown in the next subsection.

2.2. Primitive logic gates

CBL employs four primitive logic gates to realize any Boolean function [25]: copy, inverter (INV), NAND and AND, as shown in Fig. 2. Copy is used to transfer data within or between crossbar arrays, while the other logic gates are used to process data. A logic gate typically consists of one or multiple input memristors as well as one or multiple output memristors (see Fig. 2(a)). In addition, a resistor (Rs) is required as a reference. CBL uses the high (RH) and low (RL) resistive states to represent logic 1 and 0, respectively.

The copy gate in Fig. 2(a) is used as an example to illustrate the working principle of the primitive logic gates. The other logic gates in Fig. 2(b) can be understood similarly (details can be found in Ref. [25]). A copy gate consists of an input and an output memristor. In addition, a resistor Rs (RL ≪ Rs ≪ RH) is attached to the column nanowire as a reference [25]. Before the logic state stored in the input memristor is copied to the output memristor, the output memristor is initialized to RH (i.e., logic 1). This initialization step is not shown in Fig. 2(a) for brevity. In order to perform a copy operation, a control voltage Vw > Vth is applied to the input memristor while the output memristor is grounded; the column nanowire is left floating. If the input is 1, the input memristor is in the high resistive state (RH). The output memristor has been initialized to the high resistive state. Hence, the voltage Vx of the floating column nanowire is around 0, and so is the voltage Vom across the output memristor. As a result, the output memristor stays in the high resistive state. The input-0 case can be understood in a similar way.

Note that the INV and NAND gates need another control voltage Vh in addition to Vw; its value typically satisfies the following relation with Vw and Vth: Vh = Vw/2 < Vth [25]. In addition, Vh is used to alleviate the impact of sneak path currents within the crossbar [25].
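The copy operation can be checked numerically with a simple voltage-divider sketch. The Python snippet below is only an illustration under stated assumptions: the device values (Vw = 2.1 V, Vth = 1.5 V, RL = 200 kOhm, RH/RL = 7k, Rs/RL = 10) are taken from Table 4 as read here, the reference resistor Rs is assumed to be tied to ground, and the function names `column_voltage` and `copy_gate` are hypothetical.

```python
# Copy-gate voltage divider (illustrative sketch, not the authors' model).
# Assumed values, read from Table 4: Vw, Vth, RL, RH/RL = 7k, Rs/RL = 10.
V_W = 2.1            # write voltage Vw (V)
V_TH = 1.5           # threshold voltage Vth (V)
R_L = 200e3          # low resistance RL (ohm)
R_H = 7000 * R_L     # high resistance RH (ohm)
R_S = 10 * R_L       # reference resistor Rs (ohm), assumed to be grounded


def column_voltage(r_in: float, r_out: float) -> float:
    """Voltage Vx of the floating column: input row at Vw, output row and Rs at 0 V."""
    g_in, g_out, g_s = 1.0 / r_in, 1.0 / r_out, 1.0 / R_S
    return V_W * g_in / (g_in + g_out + g_s)


def copy_gate(input_bit: int) -> int:
    """Return the logic value held by the output memristor after one copy step."""
    r_in = R_H if input_bit == 1 else R_L   # logic 1 is RH, logic 0 is RL
    r_out = R_H                             # output is initialised to RH (logic 1)
    v_om = column_voltage(r_in, r_out)      # voltage across the output memristor
    if v_om > V_TH:                         # above threshold: output switches to RL
        r_out = R_L
    return 1 if r_out == R_H else 0


if __name__ == "__main__":
    for bit in (0, 1):
        print(bit, "->", copy_gate(bit))
```

With these assumed values, an input in RH leaves Vx far below Vth, while an input in RL raises Vx above Vth, so the output memristor follows the input in both cases.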
2.3. A Design Example: carry of a one-bit full adder

CBL is able to implement any Boolean function expressed in sum-of-products (SoP) form using one or multiple computing elements (CE). Next, we use the carry c of a one-bit full adder as an example to show this method. Carry c is expressed in Eq. (1).

$$c = xy + yc_i + c_i x = \overline{\overline{xy} \cdot \overline{yc_i} \cdot \overline{c_i x}} \qquad (1)$$

Based on Eq. (1), three NAND gates are needed to implement $\overline{xy}$, $\overline{yc_i}$ and $\overline{c_i x}$, respectively. Then an AND gate is needed to realize $\overline{xy} \cdot \overline{yc_i} \cdot \overline{c_i x}$, and an INV gate is needed to calculate the final result of the carry c. These gates can be mapped onto a crossbar as shown in Fig. 3(a). The three NAND gates are mapped to rows 2, 3 and 4, respectively. For instance, $\overline{xy}$ is mapped onto row 2: two input memristors are located at the intersections of row 2 with columns x and y, and an output memristor is located at the intersection of row 2 and column $\overline{c}$. The output memristors of the three NAND gates are then used as the input memristors of the subsequent AND gate. The output memristor of the AND gate is located at the intersection of row 5 and column $\overline{c}$. Finally, the INV gate uses the output memristor of the AND gate as its input. Its output memristor is placed at the intersection of row 5 and column c.

Rows 2, 3 and 4 together are referred to as the logic body (LB). Row 1 is used to store the primary inputs of the entire circuit; it is referred to as the input latch (IL). Row 5 stores the circuit outputs; it is referred to as the output latch (OL). The resistors Rs that are required by the primitive logic gates are attached to each row and column. The remaining junctions of the crossbar array are disabled by not applying the forming process [6,25]; these disabled junctions are always in a very high resistance RD (≫ RH) [6,25].

In order to describe computing elements, a mapping matrix is introduced. It contains the elements 1 and 0 and is used to represent the LB. The element 1 indicates that a memristor is located at the intersection of the corresponding row and column, while 0 indicates a disabled junction. Fig. 3(c) shows the mapping matrix of the carry; it denotes the LB part of the crossbar array in Fig. 3(a).

A CMOS circuit is utilized to control the crossbar [25]. It consists of a controller and voltage drivers (triangles in Fig. 3(a)). A voltage driver is attached to each row or column. It is used to apply control voltages to the nanowires or to leave them floating. The behavior of the controller is described by a finite state machine (FSM). Fig. 3(b) shows such an FSM, which consists of seven states.

Fig. 3. Design example: Carry of a one-bit full adder.

Table 2 Summary of control voltages (G: grounded; F: floating).
State | Row 1 | Row 2–5 | Row 6 | Column x–ci | Column s | Column $\overline{s}$
INA | Vw | Vw | Vw | G | G | G
RIN | G | Vh | Vh | F | Vh | Vh
CFM | Vw | G | Vh | F | Vh | Vh
EVM | Vh | F | Vh | Vh | Vh | Vw
GER | Vh | Vw | G | Vh | Vh | F
INR | Vh | Vh | F | Vh | Vw | Vh
SOU | Vh | Vh | Vw | Vh | F | F

In Table 2, Vh is applied to the remaining rows and columns that are not involved in any logic gate in order to reduce sneak path currents [25].
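As an illustration of the decomposition in Eq. (1), the following Python sketch evaluates the carry through the NAND, AND and INV stages, driven by a small matrix that marks which input columns each NAND row uses. It is a behavioral sketch only: the matrix covers just the input columns, the column ordering is an assumption, and the names `MAPPING_MATRIX`, `carry_from_matrix` and `carry_reference` are hypothetical.

```python
from itertools import product

# Input columns of the logic body; this ordering is an illustrative assumption.
INPUT_COLUMNS = ["x", "y", "ci"]

# One row per NAND gate (product term): 1 = memristor present, 0 = disabled junction.
# Only the input columns are listed; output columns are omitted for brevity.
MAPPING_MATRIX = [
    [1, 1, 0],  # row 2: NAND(x, y)
    [0, 1, 1],  # row 3: NAND(y, ci)
    [1, 0, 1],  # row 4: NAND(ci, x)
]


def carry_from_matrix(x: int, y: int, ci: int) -> int:
    """Evaluate the carry as INV(AND(NAND terms)), following Eq. (1)."""
    inputs = dict(zip(INPUT_COLUMNS, (x, y, ci)))
    nand_outputs = []
    for row in MAPPING_MATRIX:
        selected = [inputs[name] for name, used in zip(INPUT_COLUMNS, row) if used]
        nand_outputs.append(1 - int(all(selected)))   # NAND of the selected inputs
    carry_bar = int(all(nand_outputs))                # AND gate produces NOT(c)
    return 1 - carry_bar                              # INV gate produces c


def carry_reference(x: int, y: int, ci: int) -> int:
    """Reference carry: majority of the three input bits."""
    return int(x + y + ci >= 2)


if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        assert carry_from_matrix(*bits) == carry_reference(*bits)
    print("carry matches the majority function for all 8 input combinations")
```

The check against the majority function covers all eight input combinations, which is feasible here only because the example has three inputs.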
The seven states of the FSM in Fig. 3(b) are:

1. INA: All the memristors are initialized to the high resistive state RH.
2. RIN: The IL in row 1 receives inputs from other CEs or is programmed by the controller.
3. CFM: The inputs stored in the IL are copied to the input memristors of all three NAND gates located in rows 2 to 4 at the same time.
4. EVM: The results of all three NAND gates are evaluated simultaneously.
5. GER: The result of the AND gate in column $\overline{c}$ is calculated.
6. INR: The result of the INV gate located in row 5 is evaluated and the final result, carry c, is obtained.
7. SOU: The outputs c and $\overline{c}$ are sent to other CEs.

To perform the required primitive logic gates in each state, the related control voltages should be applied to each row and column. Table 2 summarizes all the voltages. Rows and columns driven by the same control voltages in each state are grouped to simplify Table 2. For instance, during state CFM, the primary inputs stored in the IL are copied to all NAND gates located in rows 2 to 4. Hence, Vw is applied to row 1 while rows 2–4 are grounded (G); columns x to ci are left floating (F).

3. Mosys Design Flow

This section first briefly presents the proposed Mosys design flow. Next, it describes how Mosys is extended from our preliminary work Mosaic [32].

3.1. Design flow

Fig. 4 shows the Mosys design flow for CBL. The white boxes are reused components from today's CMOS EDA flow; the gray boxes are new components added for CBL; the white boxes with red letters are the new or modified parts for Mosys. Its input is Verilog source code; its outputs include the SPICE netlists for the crossbar part as well as (synthesized) Verilog files of the CMOS controller. The entire flow consists of the following major steps.

Verilog Parsing: The input of Mosys is a circuit description in Verilog. The Verilog code is translated into a netlist described in the Berkeley Logic Interchange Format (BLIF). This translation is performed by a Verilog front end. Currently, several open-source front ends are available, such as ODIN II [37] and Yosys [38]. Yosys is integrated into the Mosys design flow. Note that the name Mosys is a combination of 'Mosaic' and 'Yosys'.

Logic Synthesis: The BLIF netlist is optimized and rewritten into a netlist consisting of look-up tables (LUTs). This step is completed by the logic synthesis tool ABC [39].

Crossbar Mapping: Mosys first extracts the required information from the LUT netlist. This information includes the mapping matrix of each LUT and the signal names at each stage. Subsequently, Mosys maps all LUTs onto a crossbar array. The mapped crossbar array is described using a global mapping matrix. The detailed mapping scheme is described in our preliminary work [32].

Simulation Files Generation: Mosys automatically generates the simulation files used to verify the CBL circuits. They contain both the crossbar and the CMOS part. Two Verilog-A files are generated; one describes the CMOS controller, and the other describes a voltage driver and is used as a subcircuit. A SPICE netlist is generated to describe the crossbar part. It connects the crossbar and the CMOS part together. In addition, the CMOS controller has another Verilog version as the input of Synopsys Design Compiler, which is used to estimate the performance.

Performance Estimation: Mosys estimates the performance of the generated hybrid memristor/CMOS circuit in terms of area, delay and power consumption. The detailed performance estimation model and visualization are described in our preliminary work [32].
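The first two steps of the flow in Section 3.1 (Verilog parsing with Yosys and LUT synthesis with ABC) could be scripted roughly as follows. This is a hedged sketch rather than the scripts used by Mosys: the paper does not list the tool commands, so the Yosys command sequence (`read_verilog`, `synth`, `write_blif`), the ABC sequence (`read_blif`, `strash`, `if -K`, `write_blif`), the file names and the helper names are all assumptions. The snippet only builds the command lines and does not run the tools.

```python
from typing import List


def yosys_script(verilog_file: str, top: str, blif_out: str) -> str:
    """Yosys commands that synthesize a Verilog design and write it out as BLIF."""
    return "; ".join([
        f"read_verilog {verilog_file}",
        f"synth -top {top} -flatten",   # generic synthesis, hierarchy flattened
        f"write_blif {blif_out}",
    ])


def abc_script(blif_in: str, blif_out: str, lut_inputs: int = 6) -> str:
    """ABC commands that map a BLIF netlist onto K-input look-up tables."""
    return "; ".join([
        f"read_blif {blif_in}",
        "strash",                       # convert to an and-inverter graph
        f"if -K {lut_inputs}",          # LUT mapping with K inputs per LUT
        f"write_blif {blif_out}",
    ])


def flow_commands(verilog_file: str, top: str, lut_inputs: int = 6) -> List[List[str]]:
    """Command lines for the parsing and synthesis steps (file names are hypothetical)."""
    rtl_blif = f"{top}.blif"
    lut_blif = f"{top}_lut{lut_inputs}.blif"
    return [
        ["yosys", "-q", "-p", yosys_script(verilog_file, top, rtl_blif)],
        ["abc", "-c", abc_script(rtl_blif, lut_blif, lut_inputs)],
    ]


if __name__ == "__main__":
    for command in flow_commands("mult16.v", "mult16"):
        print(" ".join(command[:-1]), repr(command[-1]))
```

Each command list could be passed to `subprocess.run` on a machine where both tools are installed; the LUT size is kept as a parameter because the evaluation sweeps the number of LUT inputs.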
Fig. 4. Mosys design flow.

3.2. Extending Mosys from Mosaic

Mosys extends our previous work Mosaic [32] in two aspects; see the white boxes with red letters in Fig. 4. First, Mosys adds a Verilog front end using the Yosys RTL framework [38]. As a result, Mosys significantly facilitates VLSI circuit design and can easily reuse existing soft IP cores that were developed for today's VLSI circuits. Second, Mosys improves the power estimation model. Mosaic uses a power model that calculates the power consumption of all possible input combinations and then averages them as the power metric. As the number of input combinations grows exponentially with the number of input bits, this is extremely time-consuming and infeasible for large circuits with many input bits. Mosys adopts a probabilistic method that has been applied in CMOS power estimation [34]. Such a probabilistic model reduces the running time significantly with a marginal error. The next two sections describe these two extensions in detail.

4. Verilog programming interface

Mosys integrates a Verilog programming interface for memristor-based VLSI circuits. This section first presents the guidelines to follow when writing Verilog code that describes a memristor-based VLSI circuit. Thereafter, it uses a 16-bit integer multiplier as a case study.

4.1. Programming guidelines

Mosys integrates Yosys [38] as a Verilog RTL front end. It translates hierarchical Verilog code into a BLIF file as shown in Fig. 4. Yosys has extensive support for Verilog 2005 and is able to process almost any synthesizable Verilog code. Due to the characteristics of the CBL supported by Mosys, the Verilog code used to describe a CBL design needs to follow some special guidelines.

∙ Combinational logic only: Currently, Mosys can only implement combinational logic circuits, because the mapping scheme utilized in Mosys cannot use the output of a logic block in the current cycle as the input in the next cycle. Therefore, sequential logic cannot be implemented by Mosys yet. Supporting sequential logic is part of the future work.
∙ No registers are needed: As memristor devices can store data in a resistive state at each step, the digital circuits do not need any registers, latches or flip-flops. As a result, the non-blocking assignment (<=) is not necessary in the Verilog code.

4.2. Case study

A 16-bit integer multiplier is used as an example. The detailed code is not listed, as it is the same as normal Verilog code. The multiplier is described using an 'always' block with two inputs. This integer multiplier is mapped onto a crossbar using Mosys as shown in Fig. 5. Each black dot in the figure is a memristor device, while the white parts are disabled junctions. The mapped crossbar has 10080 rows and 782 columns.

Fig. 5. The mapped crossbar of a 16-bit integer multiplier.

5. Probabilistic power estimation model

This section presents a probabilistic method to estimate the power consumed by the crossbar part of CBL. It first presents the motivation. Next, it explains the basic idea of the probabilistic method. Note that the probabilistic method is generic and can therefore be applied to other types of memristor-based logic.

5.1. Motivation

The power consumption of a CBL implementation consists of both the crossbar and the CMOS part. The CMOS part is only used to control the circuit and hence its power consumption is independent of the input pattern. As the crossbar part is used to process input data, its power consumption is related to the pattern of the inputs. Our previous work Mosaic [32] uses a power model based on an exhaustive search scheme. It first calculates the power consumption of all possible input combinations and then averages them as the performance metric of the power consumption. However, the number of input combinations grows exponentially with the number of input bits. Consequently, this scheme is extremely time-consuming and even impossible to apply to large circuits with many input bits. In order to solve this problem, Mosys employs a
probabilistic method to estimate the power consumption of the crossbar array.

5.2. Basic idea

To show the basic idea of the probabilistic power model, a 2-input AND gate is used as an example (see Fig. 6). The power consumed by an AND gate is the sum of the power consumed by three memristors and a resistor Rs; see Fig. 6(a). To apply the probabilistic model, the following four steps should be taken. All the parameters used in this example are summarized in Table 3.

Fig. 6. An example of the probabilistic power model: AND gate.

Table 3 Parameters used in power estimation model.
Symbol | Definition
Probability
pi | The probability that an input memristor is in the high resistance state
psw | The probability that an output memristor switches to low resistance
pnsw | The probability that the memristor stays at high resistance
Resistance
Ri | Resistance value of a memristor
Rs | Resistance value of a resistor
RL | Low resistance of a memristor
RH | High resistance of a memristor
Voltage
Vx(,asw,bsw) | The voltage of the floating nanowire (before and after the output memristor switches)
Vw | A control voltage used to program a memristor
Power consumption
Pi,sw | Power consumption of a memristor when it is switching
Pi,bsw/asw | Power consumption of a memristor before or after it switches
Pgate | Power consumption of a primitive gate

Step 1: Calculate the probabilities. The AND gate has two different cases depending on the resistance of the input memristors. In the first case, all the inputs are in high resistance, as shown in Fig. 6(b). Therefore, the output memristor does not switch. This is referred to as the non-switching case; its probability is denoted by pnsw = p1 p2. In the second case, at least one of the inputs is in low resistance and the output memristor switches from RH to RL. This is referred to as the switching case; its probability is denoted by psw = 1 − p1 p2. Fig. 6(c) and (d) show the two phases before and after the output memristor switches, respectively.

Step 2: Calculate the voltage of the floating nanowire. To perform an AND gate, Vw is applied to the rows of the input memristors while the row of the output memristor is grounded; the column is left floating (see Fig. 6(a)). Therefore, the voltage across the three memristors and the resistor Rs is determined by the voltage Vx of the floating column. After applying Kirchhoff's circuit law, Vx is calculated by Eq. (2).

$$V_x = \frac{\frac{1}{R_1} + \frac{1}{R_2}}{\frac{1}{R_1} + \frac{1}{R_2} + \frac{1}{R_3} + \frac{1}{R_s}} V_w \qquad (2)$$

Vx varies with the values of R1, R2 and R3. For instance, the probabilistic average values of Vx before and after the output memristor switches are expressed by Eqs. (3) and (4), respectively.

$$V_{x,bsw} = \frac{\left(\frac{p_1}{R_H} + \frac{1-p_1}{R_L}\right) + \left(\frac{p_2}{R_H} + \frac{1-p_2}{R_L}\right)}{\left(\frac{p_1}{R_H} + \frac{1-p_1}{R_L}\right) + \left(\frac{p_2}{R_H} + \frac{1-p_2}{R_L}\right) + \frac{1}{R_H} + \frac{1}{R_s}} V_w \qquad (3)$$

$$V_{x,asw} = \frac{\left(\frac{p_1}{R_H} + \frac{1-p_1}{R_L}\right) + \left(\frac{p_2}{R_H} + \frac{1-p_2}{R_L}\right)}{\left(\frac{p_1}{R_H} + \frac{1-p_1}{R_L}\right) + \left(\frac{p_2}{R_H} + \frac{1-p_2}{R_L}\right) + \frac{1}{R_L} + \frac{1}{R_s}} V_w \qquad (4)$$

where Vx,bsw and Vx,asw represent the value of Vx before and after the output memristor switches. In Eqs. (3) and (4), the term 1/Ri (i = 1, 2) of Eq. (2) is replaced with pi · 1/RH + (1 − pi) · 1/RL, and R3 is substituted by RH in Eq. (3) and by RL in Eq. (4), i.e., in the two phases before and after the output memristor switches, respectively.

Step 3: Calculate the power consumption of each case. The power consumption of each case is the sum of the power consumed by all the memristors and the resistor Rs in that case. As a memristor can be regarded as a resistor, its power is estimated by Eq. (5) [40].

$$P_R = \frac{V^2}{R} \qquad (5)$$

Let us take the case in which the output memristor switches as an example. The power consumed by the first input memristor R1 and by the output memristor R3 is estimated by Eqs. (6) and (7), respectively [40].

$$P_{1,sw} = 0.5 \cdot (P_{1,bsw} + P_{1,asw}) = 0.5 \cdot \left[ (V_w - V_{x,bsw})^2 \left(\frac{p_1}{R_H} + \frac{1-p_1}{R_L}\right) + (V_w - V_{x,asw})^2 \left(\frac{p_1}{R_H} + \frac{1-p_1}{R_L}\right) \right] \qquad (6)$$

$$P_{3,sw} = 0.5 \cdot (P_{3,bsw} + P_{3,asw}) = 0.5 \cdot \left[ \frac{(0 - V_{x,bsw})^2}{R_H} + \frac{(0 - V_{x,asw})^2}{R_L} \right] \qquad (7)$$

The power consumption Psw of the switching case is the sum of the power consumed by all the memristors and the resistor Rs in the switching (sw) case. The power consumption Pnsw of the non-switching case is calculated in the same way.

Step 4: Calculate the overall power consumption of a gate. The overall power consumption PAND of the AND gate is expressed by Eq. (8).

$$P_{AND} = p_{nsw} \cdot P_{nsw} + p_{sw} \cdot P_{sw} \qquad (8)$$

The power consumption of the other primitive gates (i.e., NAND, INV, Copy) can be estimated using the above four-step procedure. As a crossbar array of CBL consists of many primitive logic gates, the power consumption of the entire crossbar is estimated by summing up the power consumed by each primitive gate at each execution step. The power consumption of all the execution steps is accumulated as the performance metric of power consumption.
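The four-step procedure for the 2-input AND gate can be sketched in Python as below. This is an illustrative reading of Eqs. (2) to (8), not the Matlab implementation used in Mosys. Several points are assumptions: the device values are taken from Table 4 as read here (Vw = 2.1 V, RL = 200 kOhm, RH/RL = 7k, Rs/RL = 10), the non-switching case uses both inputs fixed at RH, the average input conductance p/RH + (1 − p)/RL is used for the input memristors in the switching case, and all function names are hypothetical.

```python
import math

# Assumed device values (see Table 4); RH and Rs follow from the stated ratios.
V_W = 2.1            # control voltage Vw (V)
R_L = 200e3          # low resistance RL (ohm)
R_H = 7000 * R_L     # high resistance RH (ohm)
R_S = 10 * R_L       # reference resistor Rs (ohm)


def conductance(p_high: float) -> float:
    """Average conductance 1/Ri of an input memristor: p/RH + (1 - p)/RL."""
    return p_high / R_H + (1.0 - p_high) / R_L


def column_voltage(g_inputs: float, r_out: float) -> float:
    """Eq. (2): Vx of the floating column; g_inputs is the summed input conductance."""
    return g_inputs / (g_inputs + 1.0 / r_out + 1.0 / R_S) * V_W


def phase_power(g_inputs: float, r_out: float) -> float:
    """Eq. (5) applied to the input memristors, the output memristor and Rs."""
    v_x = column_voltage(g_inputs, r_out)
    p_inputs = (V_W - v_x) ** 2 * g_inputs   # inputs see Vw - Vx
    p_output = v_x ** 2 / r_out              # output memristor sees Vx
    p_rs = v_x ** 2 / R_S                    # reference resistor sees Vx
    return p_inputs + p_output + p_rs


def and_gate_power(p1: float, p2: float) -> float:
    """Eq. (8): probability-weighted power of a 2-input AND gate."""
    p_nsw = p1 * p2                          # Step 1: both inputs high
    p_sw = 1.0 - p_nsw
    # Non-switching case (assumption: both inputs at RH, output stays at RH).
    power_nsw = phase_power(2.0 / R_H, R_H)
    # Switching case: average of the phases before (RH) and after (RL) switching.
    g_inputs = conductance(p1) + conductance(p2)
    power_sw = 0.5 * (phase_power(g_inputs, R_H) + phase_power(g_inputs, R_L))
    return p_nsw * power_nsw + p_sw * power_sw


if __name__ == "__main__":
    for p in (0.25, 0.5, 0.75, 1.0):
        print(f"p1 = p2 = {p}: P_AND = {and_gate_power(p, p):.3e} W")
    assert math.isclose(and_gate_power(1.0, 1.0), phase_power(2.0 / R_H, R_H))
```

The cost of one gate evaluation is constant in the number of circuit inputs, which is the property that lets this style of model avoid enumerating all input combinations.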
6. Evaluation

This section comprehensively evaluates the proposed Mosys design flow by conducting the following experiments.
∙ Front End Verification: Five basic 8-bit integer operations are described in Verilog and then implemented by Mosys. These operations are add, subtract, multiply, divide and compare. SPICE simulations are run and their results are compared with the expected ones. In addition, their performance is estimated.
∙ Power Model Evaluation: Four primitive logic gates and eight ripple carry adders (i.e., with 1–8 bit inputs) are used to evaluate the proposed probabilistic power model. Their power consumption and runtime are measured. The probabilistic model is compared with the exhaustive-search based method.
∙ Design Space Exploration: The EPFL combinational benchmark suite [41] is used to further evaluate the Mosys design flow. It consists of both arithmetic and random/control circuits. The performance of the benchmark circuits is explored when the number of inputs of the look-up tables (LUTs) varies from 3 to 8 as well as 10.
∙ Comparison with State-of-the-Art: Mosys is compared with the state-of-the-art design flow, XbarGen [30]. XbarGen is reproduced using Matlab scripts. As XbarGen has no place-and-route scheme, the scheme proposed in Ref. [29] is integrated into XbarGen. In addition, we add a performance estimation model for the CMOS part, as it is missing in the original papers [29,30]. The EPFL benchmark suite is used and different numbers of LUT inputs are considered (i.e., 3 to 8 plus 10).

Table 5 EPFL combinational benchmark suite [41].
6.1. Simulation setup

Mosys is developed as an in-house tool chain. It reuses both existing open-source and commercial EDA tools, as shown in the white boxes in Fig. 4. Namely, Yosys [38] is used as the Verilog front end; ABC [39] is used for logic synthesis; Synopsys HSPICE is used as the SPICE simulator; the performance of the CMOS controller required by CBL is estimated by Synopsys Design Compiler. The crossbar mapping, simulation files generation and performance estimation model of Mosys are implemented using seven Matlab scripts in total; see the gray boxes in Fig. 4. Table 4 summarizes the simulation parameters used in our experiments. They contain the technology parameters of the memristor devices, CMOS devices and nanowires (copper) as well as the parameters of the CBL circuits. The TSMC 28 nm low-power library is used.

Fifteen of the twenty benchmarks of the EPFL benchmark suite are selected. Their basic information is summarized in Table 5, including the number of inputs, outputs and 6-input look-up tables. The numbers of LUT-6 are spread over a wide range from 29 to 12096. Hence, the selection properly illustrates the potential of the Mosys design flow. The area, delay and power consumption are used as performance metrics in the experiments. The detailed performance estimation model is described in our preliminary work [32]. The delay is further divided into the number of stages (Nstage) and the delay of each step (Dstep). The total delay is (4Nstage + 3) × Dstep [32]. In addition, the power consumed by a circuit at each execution step is accumulated as the power metric.
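The total delay expression (4Nstage + 3) × Dstep can be written as a small helper, shown here as a minimal Python sketch. The function and parameter names are hypothetical, and the step delay passed in is an arbitrary example value rather than a result from the paper.

```python
def total_steps(n_stage: int) -> int:
    """Number of execution steps for a circuit with n_stage stages: 4 * Nstage + 3."""
    if n_stage < 0:
        raise ValueError("n_stage must be non-negative")
    return 4 * n_stage + 3


def total_delay(n_stage: int, d_step: float) -> float:
    """Total delay (4 * Nstage + 3) * Dstep; d_step is in any consistent time unit."""
    if d_step < 0:
        raise ValueError("d_step must be non-negative")
    return total_steps(n_stage) * d_step


if __name__ == "__main__":
    # Example with an arbitrary step delay of 1.0 time unit.
    for stages in (1, 2, 5):
        print(stages, "stage(s):", total_steps(stages), "steps,", total_delay(stages, 1.0))
```

For a single stage the expression gives seven steps, which coincides with the seven FSM states of the single-CE example in Section 2.3; this correspondence is an observation made here, not a statement taken from the paper.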
6.2. Verilog Front End Verification

Five 8-bit integer operations are first described in Verilog. Then, they are synthesized into 6-input LUTs. Thereafter, these LUTs are mapped to a crossbar. Subsequently, both the crossbar and the CMOS part are simulated using HSPICE. Finally, their performance is estimated using Mosys. The simulation results of all possible input combinations (i.e., 65536 = 2^8 × 2^8 for each 8-bit integer operation) match the expected results calculated by Matlab scripts. This verifies the correctness of integrating the Verilog front end provided by Yosys.

Fig. 7 shows the area and power consumption. In terms of area, the CBL implementations of all five circuits need a larger area for the CMOS controller than for the crossbar part. In terms of power consumption, all circuits are dominated by the crossbar part. Fig. 8 shows the delay of the five circuits. The CMOS part contributes more to the delay of each execution step than the crossbar part. The delay of the whole circuit is proportional to the number of stages.
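The exhaustive functional check over all 2^8 × 2^8 input pairs can be illustrated with the Python sketch below. It is only an analogy to the verification described above: a software shift-and-add multiplier stands in for the simulated circuit, Python's built-in multiplication stands in for the Matlab reference, and the names `shift_add_multiply` and `verify_exhaustively` are hypothetical.

```python
from typing import Callable


def shift_add_multiply(a: int, b: int, width: int = 8) -> int:
    """Stand-in design under test: unsigned shift-and-add multiplication."""
    result = 0
    for i in range(width):
        if (b >> i) & 1:          # add the shifted multiplicand for every set bit of b
            result += a << i
    return result & ((1 << (2 * width)) - 1)


def verify_exhaustively(dut: Callable[[int, int], int], width: int = 8) -> int:
    """Compare dut against the reference product for every input pair; return the count."""
    checked = 0
    for a in range(1 << width):
        for b in range(1 << width):
            expected = a * b
            actual = dut(a, b)
            if actual != expected:
                raise AssertionError(f"mismatch for a={a}, b={b}: {actual} != {expected}")
            checked += 1
    return checked


if __name__ == "__main__":
    print("input combinations checked:", verify_exhaustively(shift_add_multiply))
```

Enumerating every pair is practical for 8-bit operands, whereas the same loop over two 16-bit operands would already require 2^32 evaluations.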
Table 4 Simulation parameters.
Parameter | Description | Value
Memristor (TaOx) [6,42]
F (nm) | Feature size | 28
Vth (V) | Threshold voltage | 1.5
RL (kOhm) | Low resistance | 200
RH/RL | – | 7k
RD/RH | – | 50
Am (nm^2) | Area of a memristor | 3136
Tsw (ns) | Switching time (max of SET and RESET) | 0.2
CMOS
Technology: TSMC 28 nm Low Power Library
Nanowire (Copper) [40]
Cnw (fF/µm) | Capacitance per unit length | 0.26
Rnw (Ohm/µm) | Resistance per unit length | 9.88
Circuit [25,43]
H | No. of rows | – ∗
W | No. of columns | – ∗
na | No. of all memristors | – ∗
Vw (V) | Program voltage | 2.1
Vh (V) | Half-select voltage | 1.05
Rs/RL | – | 10
∗ –: changing with different designs

Fig. 7. Front end verification: Area and power consumption of integer operations.

Fig. 8. Front end verification: Delay of integer operations.

6.3. Power Model Evaluation

The proposed probabilistic power model is evaluated in terms of accuracy, scalability and running time. The exhaustive-search power model used in our previous work [32] serves as the baseline, as it uses the voltage information extracted from the SPICE simulations.

Accuracy: Four primitive logic gates are used to analyze the accuracy of the probabilistic model. The fan-in of the copy and inverter gates is 1, while that of the AND and NAND gates is 2. The fan-out of all gates is 1. Fig. 9(a) shows the power consumption estimated by the baseline (Exh.) and the probabilistic model (Prob.). The probabilistic model estimates the power consumed by the copy, inverter and NAND gates with a marginal error (<1%) as compared to the baseline. For the AND gate, the error is −9%. Another experiment is conducted in which the power consumption of AND and NAND gates with different fan-ins is estimated. Fig. 9(b) shows the error of the probabilistic model. The probabilistic model accurately estimates the power consumed by NAND gates with different fan-ins. It typically estimates the power consumed by AND gates with a reasonable absolute error (<3%), except for the 2-input case (−9%). In addition, the error of the probabilistic model for both AND and NAND gates approaches zero as the fan-in increases. In real circuits, AND gates with different fan-ins are involved. Therefore, the relatively large error of the 2-input AND gate is alleviated, as shown next.

Scalability: To examine the accuracy of the probabilistic model when the circuit scales up, ripple carry adders with n-bit (i.e., 1 ≤ n ≤ 8) inputs are used. Fig. 10(a) shows the power estimated by both the
Fig. 9. Accuracy analysis: (a) Power consumption of primitive gates, (b) Error of probabilistic model.
Fig. 10. (a) Scalability analysis, (b) Runtime and speedup. 27
L. Xie et al.
Integration, the VLSI Journal 70 (2020) 21–31
runtime of probabilistic model increases very slowly and it is over 3000 times faster than the baseline for the 8-bit adder. Overall, probabilistic model provides a feasibility to estimate the power consumption for large-scale circuits with many input bits. 6.4. Design Space Exploration To further illustrate the potential of Mosys design flow, we explore the design space of the EPFL combinational benchmark suite [41] (see Table 5). As shown in our preliminary work [32], the number of LUT inputs impacts on the performance of the circuits. To achieve an optimal CBL implementation in terms of area, delay and power consumption, each benchmark is synthesized by Mosys several times using different input numbers of LUTs (from 3 to 8 as well as 10). As the performance of some circuits show a similar trend, they are divided into two groups as shown in Table 5. The circuits with gray background are referred to as large group while the ones with white background are referred to as small group. Circuit sin and ctrl are used as examples of the large and small group, respectively. Area Fig. 11 (a) and (b) show the area of sin and ctrl, respectively. The area of sin is dominated by the crossbar part. As the number of LUT inputs increases, the area is first reduced and then increases. The area-minimal implementation is achieved when the number of LUT inputs is 6. The area of the crossbar part depends on the number of rows and columns; the number of rows is the same as the number of minterms/NAND gates; the number of columns is the same as the sum of numbers of inputs, intermediate signals and outputs. Neither of the number of minterms and intermediate signals is varying in a fixed manner when the number of LUT inputs increases. Therefore, an areaminimal implementation needs to be obtained by scanning different numbers of LUT inputs. In contrast, the area of ctrl is dominated by the CMOS controller. As the number of LUT inputs increases, the area is first reduced and then
Fig. 11. Trend of Area with Different Numbers of LUT Inputs: (a) Circuit sin, (b) Circuit ctrl.
baseline and probabilistic model (the left y-axis) as well as the error of the probabilistic model (the right y-axis). The results clearly draw a conclusion that the error of probabilistic model is less than −0.7% and its absolute value is approaching to zero as the number of input bits is increasing. Running time: The running time of both power models are measured as well as the speedup of probabilistic model with respect to the baseline. The results are shown in Fig. 9 (b). The running time of baseline model increases exponentially as the possible input combinations increase exponentially with the number of input bits. In contrast, the
Fig. 12. Trend of Delay with Different Numbers of LUT Inputs: (a) Circuit sin, (b) Circuit ctrl. 28
L. Xie et al.
Integration, the VLSI Journal 70 (2020) 21–31
crossbar part is small and its nanowire delay is not significant [44]. When the number of LUT inputs increases, both the delay caused by the CMOS and crossbar part increase very slightly. As a result, the delay of a single step increases very marginally with the number of LUT inputs. The total delay is continuously reduced which is mainly caused by the reduced number of stages. The delay-minimal implementation is achieved when the number of LUT inputs greater than 7. Because the number of stages has been reduced to a minimal value; for instance its stage number is 1 in this case. Power Fig. 13 (a) and (b) show the power consumption of sin and ctrl, respectively. The power consumption of sin is dominated by the crossbar part. The power consumed by the crossbar part is first reduced and then increased when the number of LUT inputs increases. The number of memristors is positively proportional to the area of the crossbar array. Therefore, the crossbar consumed power shares the same trend with its area (see also Fig. 11 (a)). The power consumption of ctrl is dominated by the crossbar part, which is similar to sin. Different with sin, the power consumed by the crossbar part is continuously reduced when the number of LUT inputs increases. Because the crossbar area is continuously reduced (see also Fig. 11 (b)). Optimal Design Both sin and ctrl can have their optimal design which perform best in all three metrics. The optimal design of sin is obtained when the number of LUT inputs is six; the optimal ctrl is obtained when the number of LUT inputs is ten. Sometimes a circuit cannot obtain an optimal design. Therefore, engineers need to decide which performance metric is more crucial and select the best design for such a metric. Overall, all fifteen benchmarks can be optimized in terms of the area (1.53x to 4.75x), delay (1.89x to 5.92x) and power consumption (1.36x to 3.16x) by properly selecting the number of LUT inputs. 
In addition, it is possible to find an optimal design which perform best in terms of all three metrics by exploring different numbers of LUT inputs.
Fig. 13. Trend of Power Consumption with Different Numbers of LUT Inputs: (a) Circuit sin, (b) Circuit ctrl.
becomes flat. The area-minimal implementation is achieved when the number of LUT inputs 7, 8 or 10. As ctrl is a small circuit, it is possible to be implemented by several LUTs with more inputs. The number of required LUTs is fixed once the number of LUT inputs reaches a certain threshold value. Delay Fig. 12 (a) and (b) show the delay of sin and ctrl, respectively. The number of stages of sin reduces when the number of LUT inputs increases. For sin, the delay of a single execution step is dominated by the crossbar part since its long nanowire induces a long delay [44]. When the number of LUT inputs increases, more minterms are needed and hence more rows. As a result, the delay of a single step increases due to longer wires. The total delay, which is (4Nstage + 3) ×Dstep , is first reduced due to the reduced number of stages. Then, it is increased due to the increased delay of a single step. The delay-minimal implementation is achieved when the number of LUT inputs is 6. For ctrl, the number of stages reduces when the number of LUT inputs increases. The CMOS controller contributes a bit more to the delay of a single execution step than the crossbar part. Because the
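The exploration described in Section 6.4 amounts to sweeping the number of LUT inputs and keeping the best result for each metric. The following is a minimal sketch of such a sweep, not the actual Mosys implementation: the `DesignPoint` record, the two helper functions and all numeric values are hypothetical placeholders. Only the total-delay expression (4Nstage + 3) × Dstep and the idea of scanning several LUT input counts are taken from the text.

```python
from dataclasses import dataclass

METRICS = ("area", "delay", "power")


@dataclass(frozen=True)
class DesignPoint:
    """One synthesized implementation (hypothetical record, not a Mosys type)."""
    lut_inputs: int   # number of LUT inputs used for synthesis
    area: float       # total area, arbitrary units
    n_stage: int      # number of stages
    d_step: float     # delay of a single execution step, arbitrary units
    power: float      # total power, arbitrary units

    @property
    def delay(self) -> float:
        # Total delay as stated in the text: (4 * Nstage + 3) * Dstep.
        return (4 * self.n_stage + 3) * self.d_step


def best_per_metric(points):
    """Return, for each metric, the LUT input count that minimizes it."""
    return {
        m: min(points, key=lambda p: getattr(p, m)).lut_inputs
        for m in METRICS
    }


def jointly_optimal(points):
    """Return LUT input counts that are minimal in all three metrics (may be empty)."""
    mins = {m: min(getattr(p, m) for p in points) for m in METRICS}
    return [
        p.lut_inputs
        for p in points
        if all(getattr(p, m) == mins[m] for m in METRICS)
    ]


# Made-up numbers for illustration only; they are not results from the paper.
points = [
    DesignPoint(3, area=9.0, n_stage=6, d_step=1.0, power=8.0),
    DesignPoint(4, area=7.0, n_stage=4, d_step=1.2, power=6.5),
    DesignPoint(6, area=5.0, n_stage=2, d_step=1.5, power=5.0),
    DesignPoint(8, area=6.0, n_stage=2, d_step=2.5, power=5.5),
    DesignPoint(10, area=8.0, n_stage=1, d_step=4.0, power=7.0),
]

print(best_per_metric(points))
print(jointly_optimal(points))  # → [6]
```

With these placeholder values, fewer stages at larger LUT sizes are eventually outweighed by a longer single-step delay, so the total delay has an interior minimum. When `jointly_optimal` returns an empty list, no single LUT size wins on all three metrics, and the per-metric result from `best_per_metric` can guide the choice.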
6.5. Compare with state-of-the-art
Finally, Mosys is compared with XbarGen in terms of area and delay. As XbarGen uses only the count of switching times of all the memristors as its power consumption metric and estimates only the crossbar part, a complete power consumption figure covering both the crossbar and CMOS parts is not available [30]. The area reduced ratio and the delay reduced ratio are used as performance metrics. The area reduced ratio is defined as $A_{XbarGen} / A_{Mosys}$, where $A_{Mosys}$ is the total area of the whole CBL implementation generated by Mosys and $A_{XbarGen}$ is that of XbarGen. The maximum, minimum and average values of the area reduced ratio are extracted from 7 cases with different numbers of LUT inputs (i.e., 3 to 8 plus 10). Similarly, the delay reduced ratio is defined as $D_{XbarGen} / D_{Mosys}$, where $D_{Mosys}$ is the total delay of the whole CBL implementation generated by Mosys and $D_{XbarGen}$ is that of XbarGen.
Area: Fig. 14 (a) shows the area reduced ratio. The results show that Mosys reduces the area of the circuits by 1.53x to 11.54x, and by 6.29x on average. This is because Mosys uses a different mapping scheme from the one used in Ref. [29]. This scheme removes the redundant columns used by repeated intermediate signals as well as the redundant rows reserved as interconnects for data transfer.
Delay: Fig. 14 (b) shows the delay reduced ratio. The results show that Mosys reduces the delay of the circuits by 1.5x to 15.31x, and by 4.68x on average. As with the area, the significant delay improvement of Mosys is due to its advanced mapping scheme [32].
Overall, Mosys significantly outperforms XbarGen in terms of both area and delay. As shown in the previous experiments, the power consumption is proportional to the area. Therefore, it is reasonable to infer that Mosys is very likely to outperform XbarGen in terms of power consumption as well.
Fig. 14. Comparison with state-of-the-art: (a) Area reduced ratio, (b) Delay reduced ratio.

7. Discussion
This section first discusses the advantages of the proposed Mosys design flow. Subsequently, it presents the challenges that Mosys will face in future work.

7.1. Advantages
The proposed Mosys design flow provides the following advantages compared to the state-of-the-art.
∙ A Tool Chain from Program Interface to SPICE Simulation: Mosys provides a tool chain for memristor-based VLSI circuit design. It not only efficiently reuses the existing tools used in today’s CMOS logic design, but also develops the key components specialized for memristor-based logic. It supports all the key components, including a Verilog programming interface, logic synthesis, crossbar mapping, simulation file generation, and performance estimation and visualization.
∙ General Programming Interface: Mosys provides a Verilog programming interface for memristor-based VLSI circuit design. It offers a convenient way for digital engineers and researchers to design a memristor-based VLSI circuit and estimate its performance. In addition, Mosys makes it feasible to reuse today’s soft IP cores based on CMOS technology. Hence, it avoids redeveloping the same IP cores for memristor-based circuits.
∙ Quick Power Estimation Model: Mosys employs a probabilistic power model, which accelerates the power estimation process significantly. Hence, the power consumption of memristor-based VLSI circuits can be estimated in a time-efficient manner.

7.2. Challenges
To further improve Mosys, the following challenges should be addressed in our future work.
∙ Support Sequential Logic: Mosys currently supports only combinational logic, because the mapping scheme utilized in Mosys cannot use the output of the current cycle as the input of the next cycle. In the future, we can update the mapping scheme to support sequential logic.
∙ Support Other Boolean Logic Types: Mosys only supports the CBL proposed in Refs. [24,25,28]. It can be modified to support other logic styles reported in Refs. [21,22]. To support more logic styles, Mosys needs to update the components related to mapping, simulation file generation and performance estimation.
∙ Improve the Mapping Scheme: As shown in Fig. 5 (b), only a small part of the crossbar is populated with memristor devices. It is crucial to improve the utilization ratio of the whole crossbar. We plan to develop a novel mapping scheme in the future and integrate it into Mosys.
∙ Impact of Unreliable Memristor Devices: As memristor technology is still under investigation in both academia and industry, the devices suffer from limited endurance and device variation [3,6]. In the future, Mosys can support such device characteristics by updating the RRAM model it uses.
∙ Sneak Path Issues: Logic circuits based on memristor crossbars may fail due to sneak path currents, which are unexpected currents within the crossbar [5,6]. Currently, three methods have been proposed to address them: adding selector devices (e.g., CMOS transistors) [6], using complementary resistive switches [45], and applying half-select voltages [6]. Mosys currently integrates the half-select voltage method to alleviate the impact of sneak path currents. In the future, Mosys will support other methods.
∙ Verification with Physical Circuits: Currently, Mosys is verified with SPICE simulations, because major foundries do not yet provide manufacturing services for customized memristor/RRAM logic designs. In the future, we plan to further verify Mosys using physical designs. In addition, the layout of both the CMOS and crossbar parts should be generated by Mosys.
∙ Towards A Complete Design Flow: Our current design flow is not yet complete compared to the conventional CMOS design flow. Although Mosys covers the major part of a standard design flow, more components should be added in our future work, including specification, verification, physical mapping and test.

8. Conclusion
This paper proposes Mosys, an automated design flow for memristor-based VLSI circuits. Mosys provides a tool chain that consists of a Verilog programming interface, logic synthesis, crossbar mapping, simulation file generation and performance estimation. In addition, it significantly accelerates power estimation by using a probabilistic model. All in all, Mosys is a step towards the design automation of VLSI circuits based on hybrid CMOS/memristor technology.

Acknowledgment
The authors would like to sincerely acknowledge the helpful comments from the anonymous reviewers. This work is supported by the Research Funding 1106007132, 3206008801, 3206009601 from Southeast University.

References
[1] K. Bernstein, R.K. Cavin, W. Porod, A. Seabaugh, J. Welser, Device and architecture outlook for beyond cmos switches, Proc. IEEE 98 (12) (2010) 2169–2184.
[2] R. Waser, Nanoelectronics and Information Technology: Advanced Electronic Materials and Novel Devices, John Wiley & Sons, 2012.
[3] ITRS, Beyond Cmos White Paper, 2014.
[4] A. Chen, J. Hutchby, V. Zhirnov, G. Bourianoff, Emerging Nanoelectronic Devices, John Wiley & Sons, 2014.
[5] S. Hamdioui, et al., Memristor for computing: myth or reality? in: DATE, 2017, pp. 722–731.
[6] J.J. Yang, et al., Memristive Devices for Computing, 2013.
[7] A. Padovani, J. Woo, H. Hwang, L. Larcher, Understanding and optimization of pulsed set operation in HfOx-based rram devices for neuromorphic computing applications, IEEE Electron. Device Lett. 39 (5) (2018) 672–675.
[8] W. Wu, H. Wu, B. Gao, P. Yao, X. Zhang, X. Peng, S. Yu, H. Qian, A methodology to improve linearity of analog rram for neuromorphic computing, in: 2018 IEEE Symposium on VLSI Technology, IEEE, 2018, pp. 103–104.
[9] Y. Jiang, P. Huang, D. Zhu, Z. Zhou, R. Han, L. Liu, X. Liu, J. Kang, Design and hardware implementation of neuromorphic systems with rram synapses and threshold-controlled neurons for pattern recognition, IEEE Transactions on Circuits and Systems I: Regular Papers 65 (9) (2018) 2726–2738.
[10] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J.P. Strachan, M. Hu, R.S. Williams, V. Srikumar, Isaac: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars, Comput. Architect. News 44 (3) (2016) 14–26.
[11] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, Prime: a novel processing-in-memory architecture for neural network computation in reram-based main memory, in: ACM SIGARCH Computer Architecture News, vol. 44, IEEE Press, 2016, pp. 27–39.
[12] L. Song, X. Qian, H. Li, Y. Chen, Pipelayer: a pipelined reram-based accelerator for deep learning, in: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2017, pp. 541–552.
[13] B. Feinberg, U.K.R. Vengalam, N. Whitehair, S. Wang, E. Ipek, Enabling scientific computing on memristive accelerators, in: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2018, pp. 367–382.
[14] J. Yu, H.A. Du Nguyen, M.A. Lebdeh, M. Taouil, S. Hamdioui, Time-division multiplexing automata processor, in: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2019, pp. 794–799.
[15] H. Cai, Y. Wang, L.A.D.B. Naviner, W. Zhao, Robust ultra-low power non-volatile logic-in-memory circuits in fd-soi technology, IEEE Transactions on Circuits and Systems I: Regular Papers 64 (4) (2016) 847–857.
[16] R. Fackenthal, M. Kitagawa, W. Otsuka, K. Prall, D. Mills, K. Tsutsui, J. Javanifard, K. Tedrow, T. Tsushima, Y. Shibahara, et al., 19.7 a 16gb reram with 200mb/s write and 1gb/s read in 27nm technology, in: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), IEEE, 2014, pp. 338–339.
[17] M.-F. Chang, J.-J. Wu, T.-F. Chien, Y.-C. Liu, T.-C. Yang, W.-C. Shen, Y.-C. King, C.-J. Lin, K.-F. Lin, Y.-D. Chih, et al., 19.4 embedded 1mb reram in 28nm cmos with 0.27-to-1v read using swing-sample-and-couple sense amplifier and self-boost-write-termination scheme, in: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), IEEE, 2014, pp. 332–333.
[18] S.-S. Sheu, M.-F. Chang, K.-F. Lin, C.-W. Wu, Y.-S. Chen, P.-F. Chiu, C.-C. Kuo, Y.-S. Yang, P.-C. Chiang, W.-P. Lin, et al., A 4mb embedded slc resistive-ram macro with 7.2 ns read-write random-access time and 160ns mlc-access capability, in: 2011 IEEE International Solid-State Circuits Conference, IEEE, 2011, pp. 200–202.
[19] C.-C. Chou, Z.-J. Lin, P.-L. Tseng, C.-F. Li, C.-Y. Chang, W.-C. Chen, Y.-D. Chih, T.-Y.J. Chang, An n40 256k×44 embedded rram macro with sl-precharge sa and low-voltage current limiter to improve read and write performance, in: 2018 IEEE International Solid-State Circuits Conference (ISSCC), IEEE, 2018, pp. 478–480.
[20] J. Borghetti, Z. Li, J. Straznicky, X. Li, D.A. Ohlberg, W. Wu, D.R. Stewart, R.S. Williams, A hybrid nanomemristor/transistor logic circuit capable of self-programming, Proc. Natl. Acad. Sci. 106 (6) (2009) 1699–1703.
[21] J. Borghetti, G.S. Snider, P.J. Kuekes, J.J. Yang, D.R. Stewart, R.S. Williams, ‘Memristive’ switches enable ‘stateful’ logic operations via material implication, Nature 464 (7290) (2010) 873.
[22] S. Kvatinsky, N. Wald, G. Satat, A. Kolodny, U.C. Weiser, E.G. Friedman, Mrl: memristor ratioed logic, in: 2012 13th International Workshop on Cellular Nanoscale Networks and Their Applications, IEEE, 2012, pp. 1–6.
[23] L. Gao, F. Alibart, D.B. Strukov, Programmable cmos/memristor threshold logic, IEEE Trans. Nanotechnol. 12 (2) (2013) 115–119.
[24] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E.G. Friedman, A. Kolodny, U.C. Weiser, Magic: memristor-aided logic, IEEE Transactions on Circuits and Systems II: Express Briefs 61 (11) (2014) 895–899.
[25] L. Xie, et al., Fast boolean logic mapped on memristor crossbar, in: IEEE ICCD, 2015.
[26] G.S. Rose, et al., Leveraging Memristive Systems in the Construction of Digital Logic Circuits, 2012.
[27] D. Fan, M. Sharad, K. Roy, Design and synthesis of ultralow energy spin-memristor threshold logic, IEEE Trans. Nanotechnol. 13 (3) (2014) 574–583.
[28] G. Snider, Computing with hysteretic resistor crossbars, Appl. Phys. A 80 (6) (2005) 1165–1172.
[29] H. Du Nguyen, et al., Synthesizing hdl to memristor technology: a generic framework, in: NANOARCH, IEEE, 2016.
[30] M. Traiola, et al., Xbargen: a memristor based boolean logic synthesis tool, in: VLSI-SoC, IEEE, 2016.
[31] A. Zulehner, K. Datta, I. Sengupta, R. Wille, A staircase structure for scalable and efficient synthesis of memristor-aided logic, in: Proceedings of the 24th Asia and South Pacific Design Automation Conference, ACM, 2019, pp. 237–242.
[32] L. Xie, Mosaic: an automated synthesis flow for boolean logic based on memristor crossbar, in: Proceedings of the 24th Asia and South Pacific Design Automation Conference, ACM, 2019, pp. 432–437.
[33] F.S. Marranghello, et al., Sop based logic synthesis for memristive imply stateful logic, in: IEEE ICCD, 2015.
[34] F.N. Najm, A survey of power estimation techniques in vlsi circuits, IEEE Trans. Very Large Scale Integr. Syst. 2 (4) (1994) 446–455.
[35] E. Macii, M. Pedram, F. Somenzi, High-level power modeling, estimation, and optimization, IEEE Trans. Comput. Aided Des. Integr Circuits Syst. 17 (11) (1998) 1061–1079.
[36] X. Guan, et al., A Spice Compact Model of Metal Oxide Resistive Switching Memory with Variations, 2012.
[37] P. Jamieson, K.B. Kent, F. Gharibian, L. Shannon, Odin ii: an open-source verilog hdl synthesis tool for cad research, in: 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, IEEE, 2010, pp. 149–156.
[38] C. Wolf, Yosys open synthesis suite.
[39] B.L. Synthesis, V. Group, Abc: A System for Sequential Synthesis and Verification, 2017.
[40] D.B. Strukov, et al., Cmol Fpga: a Reconfigurable Architecture for Hybrid Digital Circuits with Two-Terminal Nanodevices, 2005.
[41] EPFL Integrated Systems Lab, The Epfl Combinational Benchmark Suite.
[42] F. Miao, et al., Anatomy of a nanoscale conduction channel reveals the mechanism of a high-performance memristor, in: APS Meeting, 2012.
[43] L. Xie, Towards robust implementation of memristor crossbar logic circuits, in: PRIME, 2016, pp. 1–4.
[44] W. Elmore, The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers, AIP Publishing, 1948.
[45] A. Siemon, S. Menzel, R. Waser, E. Linn, A complementary resistive switch-based crossbar array adder, IEEE Journal on Emerging and Selected Topics in Circuits and Systems 5 (1) (2015) 64–74.