A GALS design based on multi-frequency clocking for digital switching noise reduction


Integration, the VLSI Journal xxx (xxxx) xxx

Contents lists available at ScienceDirect

Integration, the VLSI Journal journal homepage: www.elsevier.com/locate/vlsi

Nguyen Van Toan, Dam Minh Tung, Jeong-Gun Lee ∗,1

1 Hallymdaehak-gil, Chuncheon, Gangwon, 24252, South Korea

ARTICLE INFO

Keywords: Supply noise; Digital switching activity; Globally asynchronous locally synchronous (GALS); Noise reduction; Multi-frequency clocking (MFC)

ABSTRACT

In this paper, we investigate two design techniques for reducing digital switching noise on the power supply lines: multi-frequency clocking (MFC) and globally asynchronous locally synchronous (GALS) design. An MFC design spreads the switching current peaks over a frequency band, which reduces the supply noise. With the GALS technique, we partition a large synchronous design into smaller locally synchronous modules (LSMs), each clocked by its own local clock. The local clock signals can differ from one another in frequency or phase, so the digital switching activity inside the circuit is also distributed over time, which further reduces the supply noise. Experimental results on a commercial Spartan-3 field-programmable gate array (FPGA) (XC3S400-TQ144) show noise reductions of up to 17.4 dB for a GALS system with 8 LSMs and 19.2 dB for the joint use of the MFC and GALS techniques with 4 LSMs. Moreover, as a real design example, a DES crypto processor using the GALS design and the multi-frequency clocking strategy is implemented and evaluated. It achieves a noise reduction of about 19.5 dB at the fundamental frequency.

1. Introduction

With semiconductor process scaling, the supply voltage of integrated circuits (ICs) has been scaled down. An instantaneous supply voltage drop in an IC, called supply noise, is a major concern in system-on-chip (SoC) design today. IR drops are caused by the parasitic resistance of the chip's power distribution network (PDN), while inductive noise is produced by switching currents flowing through parasitic inductance [1]. These supply voltage drops can cause circuit malfunctions [2]. In addition, supply noise is a main contributor to the electromagnetic interference (EMI) that adversely affects the reliable operation of sensitive mixed-signal circuits sharing the same die. Supply noise is therefore more critical than ever.

The best-known technique for suppressing supply noise is the insertion of decoupling capacitors (decaps), both on-chip and off-chip, between the supply rails [3–5]. Placing decoupling capacitors on the power lines helps reduce the PDN parasitic impedance. Furthermore, decaps act as charge reservoirs that supply charge to the load circuits while the main supply is temporarily degraded by heavy loads, which suppresses voltage fluctuations on the power rails. Many kinds of on-chip decaps can be implemented, such as MOS decaps, gated decaps, and metal decaps [6]. However, die area penalties and leakage currents are inevitable [6,7]. Gated decaps are on-chip capacitor structures that can be switched off when the load circuits are deactivated, which minimizes the leakage and working current of the capacitor. However, the resistance of the switch transistor increases the impedance of the capacitor.

Another category of switching noise optimization is clock-related optimization. The first method is clock skew optimization. By adding different skews to the clock lines of clusters of sequential elements, the digital switching activity can be distributed over time to reduce the switching noise [8–10]. However, only a limited amount of skew can be added to the clock network of a large synchronous design. Excessive added skew may cause timing violations (i.e., setup and hold violations) or performance degradation, since a large clock tree already contains skew. Moreover, adding skew to the clock network of an FPGA-based design is difficult and is not recommended in an FPGA design flow. The second method is clock tree

∗ Corresponding author. E-mail address: [email protected] (J.-G. Lee).

1 ESoC/Smart Computing, Department of Computer Engineering, Hallym University.

https://doi.org/10.1016/j.vlsi.2019.04.002 Received 9 November 2018; Received in revised form 4 February 2019; Accepted 12 April 2019 Available online XXX 0167-9260/© 2019 Elsevier B.V. All rights reserved.

Please cite this article as: N. Van Toan, et al., A GALS design based on multi-frequency clocking for digital switching noise reduction, Integration, the VLSI Journal, https://doi.org/10.1016/j.vlsi.2019.04.002


optimization, which can be implemented by sizing clock buffers [11] and by mixing clock drivers with different threshold voltages [12]. However, this method cannot be applied to FPGAs, whose clock buffers are pre-fabricated.

Clock frequency modulation, typically known as spread spectrum clock generation (SSCG) [13–18], has been proven to be an efficient solution for EMI reduction. There are two categories of SSCGs: voltage-controlled oscillator (VCO)-based SSCGs [13–15] and digitally controlled oscillator (DCO)-based SSCGs [16–18]. VCO-based SSCGs are harder to implement than fully digital designs and generally take considerable time and effort to implement and verify. All-digital DCO-based SSCGs have been proposed to reduce the design complexity [16–18], and they are viable on FPGAs. However, to achieve a good EMI reduction, the delay lines that constitute the DCO and the modulation part must be designed carefully.

In asynchronous circuits, handshake protocols rather than a global clock are used to synchronize components [19–21]. The propagation delays of the computational stages naturally differ from one another. As a result, digital switching activity in asynchronous circuits is more distributed in time than in their synchronous counterparts. J. Le et al. [19] and K. L. Chang et al. [20] reported that asynchronous circuits produce about 10 dB less noise than synchronous ones. Asynchronous circuits can also reduce dynamic power dissipation, since they consume power only when tokens need to be processed. Nevertheless, one of the most critical issues in designing asynchronous circuits is the lack of commercial computer-aided design (CAD) tools, especially for the verification and test phases.

As an alternative, the GALS design technique, which combines synchronous and asynchronous styles, has recently attracted researchers [22–25]. All LSMs in a GALS system are designed in a synchronous style, which is well supported by commercial EDA tools, while synchronization and communication between the LSMs are handled by asynchronous interfaces. Accordingly, a large synchronous design can be partitioned into several smaller modules, each with a moderate amount of digital switching. These LSMs can operate from their own clock sources with different frequencies or phases. Thus, GALS designs can achieve a good EMI reduction [26,27]. The noise reduction strongly depends on the number of LSMs in a GALS design. However, as the number of LSMs increases, the area overhead of the asynchronous interface circuits and local clock generators also increases.

In this work, we investigate two design techniques for reducing EMI: all-digital multi-frequency clocking (MFC) and GALS. The main structure of the proposed MFC scheme consists of a fixed delay line, a controllable fine-grained delay line, and a delay selection circuit (the modulation part), which together form a multi-frequency ring oscillator.

With the proposed MFC, we can finely vary the output clock frequency within an expected range. Compared to a conventional SSCG, our MFC is simple to implement (all of its components are standard cells) and to verify. Hence, it is viable on existing commercial FPGAs. Unlike in analog circuit design, hardware designers do not need to perform layout verification procedures [28] in an FPGA-based design flow. CAD tools automatically implement (synthesize, place, and route) the circuit under the user's constraints. A bitstream can then be generated for in-circuit verification [29], which helps shorten the time to market.

For the GALS system, we partition a large synchronous design into smaller modules that are clocked by different clock sources (MFCs). Such a GALS system suppresses the noise on the power/ground lines. In addition, to focus on noise-aware GALS clocking strategies, we perform an in-depth power/ground noise analysis and implement various clocking schemes for a low-EM-noise GALS design. To build a scalable synthetic GALS system for validating the analyses, we use power-hungry dummy circuits to mimic the LSMs of the GALS system. Finally, as a real design example, a DES (Data Encryption Standard) crypto processor is implemented and validated to demonstrate the effectiveness of our method.

The paper is organized as follows. Section 2 presents an overview of the MFC and GALS design techniques. The analysis models of power/ground noise and EMI reduction in a GALS system are presented in Section 3. The implementation of the MFC and the noise generator is described in Section 4. The experimental results are presented in Section 5. A real design example of a DES crypto processor and some discussion are covered in Sections 6 and 7, respectively. Finally, Section 8 concludes the paper.

2. Overview of an MFC and a GALS system

2.1. A multi-frequency clocking

In this paper, we propose an all-digital multi-frequency clocking (MFC) scheme that is easy to implement in an FPGA. The output clock frequency can be varied within a pre-defined frequency range, and the rate of clock frequency variation is determined by the modulation circuit. The conceptual MFC scheme is shown in Fig. 1 (a). It consists of a fixed delay line, a controllable delay line with fine-grained delay steps, and a modulation circuit. The fine-grained delay elements are implemented with fast carry multiplexers (MUXCYs) [30]; they are discussed in detail in Section 4. As in conventional SSCGs, spreading the clock energy over multiple frequencies reduces the spectral magnitude of the modulated clock. Consequently, the power/ground noise caused by the simultaneous switching of the circuits clocked by this MFC is reduced as well. Fig. 1 (b) illustrates the spectrum of a clock before and after the modulation.

Fig. 1. (a) A conceptual multi-frequency clocking scheme; (b) the power spectrum comparison of a clock pulse before and after applying frequency modulation.


Fig. 3. An equivalent circuit for estimating the supply noise at the package/board level using a lumped model [34,35].

Fig. 2. The block diagram of an asynchronous interface (wrapper) for a locally synchronous module.

It is worth noting that the following analysis can be applied to any GALS system.

2.2. A GALS system

3.1. The PDN transfer function

A GALS system is composed of multiple LSMs and the asynchronous interfaces between them. Each LSM can operate from its own clock source. Thanks to this design style, we can partition a large synchronous design with a large amount of simultaneous switching into smaller ones, which suppresses power and ground noise. Communication between LSMs is performed by globally asynchronous interfaces. There are different communication mechanisms between LSMs, such as two-flip-flop synchronizers, asynchronous first-in first-out (FIFO) buffers [22], and handshake protocols [31]. In this work, we do not discuss asynchronous interfaces in detail; we focus on the power and ground noise reduction obtained from design partitioning, since the global handshake interfaces cause negligible power and ground noise compared to the LSMs.

Fig. 2 depicts the block diagram of a very basic asynchronous interface (wrapper) using a handshake protocol. Input and output ports at the boundary of the interface handle the communication signals between the LSMs. There are two kinds of port controllers: demand (D) ports and polling (P) ports [31]. To obtain the circuits of these port controllers, we can derive their signal transition graphs (STGs) [31] and then synthesize them with the Petrify tool [32]. A local pausible clock generator provides the clock for the LSM. This clock generator can stop producing clock pulses (i.e., it can delay the next rising clock edge) during a data transaction. Whenever an LSM needs new data for processing or transfers processed data to another LSM, it informs the input/output port controllers by asserting the den signal. Upon receiving this notification, the input/output port controllers handshake with the far-end ports (using a two-phase or four-phase handshake protocol) and request the clock generator to stop generating the clock. After the data transaction finishes, the input/output ports de-assert the request signal to resume the clock.
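The clock-pausing behaviour described above can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions: the function name `rising_edges`, the `(start, duration)` transfer format, and the rule that an edge is simply postponed to the end of an ongoing transfer are all hypothetical and do not reproduce the port controllers of Ref. [31].

```python
def rising_edges(period, n_edges, transfers):
    """Return rising-edge times of an idealised pausible clock.

    transfers is a list of (start, duration) pairs during which a port
    controller holds the clock paused (hypothetical input format).
    """
    edges = [0.0]
    while len(edges) < n_edges:
        t = edges[-1] + period          # nominal time of the next edge
        # Postpone the edge until no transfer is in progress.
        moved = True
        while moved:
            moved = False
            for start, duration in transfers:
                if start <= t < start + duration:
                    t = start + duration
                    moved = True
        edges.append(t)
    return edges

# A transfer occupying [18, 23) delays the edge that would fall at t = 20.
edges = rising_edges(period=10.0, n_edges=4, transfers=[(18.0, 5.0)])
print(edges)
```

In this model the stretched cycle shifts all later edges, which is one way to picture why a pausible clock does not keep a strictly constant period.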

The simple equivalent PDN model at the package/board level is shown in Fig. 3 [34,35]. Let Rp, Lp, and Cp denote the resistive, inductive, and capacitive parasitic components of the power rail, and Rg, Lg, and Cg those of the ground rail. Decoupling capacitors and the parasitic capacitances of the internal logic are represented by Cd. IA is the switching current of the logic core. For a better understanding of the transfer function given in Ref. [34], we present its derivation. Special attention should be paid to the dependency of the PDN impedance on the operating frequency. We assume that Vdd is an ideal voltage source whose voltage remains constant. In transient analysis, this voltage source can be replaced by a short circuit. The frequency-domain impedance of the ground network, Zgnd, is given by Eq. (1).

$$Z_{gnd} = (j\omega L_g + R_g) \,\|\, \frac{1}{j\omega C_g} = \frac{j\omega L_g + R_g}{1 + R_g C_g (j\omega) + C_g L_g (j\omega)^2} \tag{1}$$

Suppose that the impedance of the power network (Zpow) is identical to that of the ground network (Zgnd). Since the voltage source Vdd is short-circuited, the power/ground noise is given as follows:

$$V_{gnd} = \frac{(Z_{pow} + Z_{gnd}) \,\|\, Z_{C_d}}{2} \times I_A = \frac{(j\omega L_g + R_g) \times I_A}{1 + j\omega R_g (C_g + 2C_d) - L_g (C_g + 2C_d)\,\omega^2} \tag{2}$$

The transfer function of the ground network is given by Eq. (3).

$$H(\omega) = \frac{V_{gnd}}{I_A} = \frac{j\omega L_g + R_g}{1 + j\omega R_g (C_g + 2C_d) - L_g (C_g + 2C_d)\,\omega^2} \tag{3}$$

The Bode plot of the transfer function is shown in Fig. 4. The shape of the transfer function changes with the values of the parasitic components of the PDN. For Rg = 1.0 Ω, Lg = 1.0 nH, Cg = 5.0 pF, and Cd = 1.0 nF, there is no anti-resonance on the Bode plot, whereas the two other cases show anti-resonances. The PDN impedance is thus a function of the operating frequency. From Eq. (3), we can model the power/ground noise as presented in Fig. 5: the power/ground noise is a function of the switching current (IA(ω)) and the PDN impedance (H(ω)). Because the PDN impedance varies with frequency, the power/ground noise is also a function of the operating frequency. When partitioning a design into modules working in multiple clock domains, we can approximate the transfer function as a constant if these

3. Noise reduction estimation in a GALS system

In this section, we derive the transfer function of the PDN and then analyze the noise reduction for GALS systems. We categorize GALS systems into two cases based on the pattern of data communication between their LSMs. In the first case, communication between LSMs is highly decoupled: LSMs transfer data only occasionally, or only after a certain number of clock cycles, or deep decoupling FIFOs between them buffer the data. This case was analyzed in Refs. [27,33]. As an extension, the second case is a GALS system whose LSMs transfer data at every clock cycle, such as a pipeline, which has not yet been taken into consideration.


Fig. 4. Bode plots of the transfer function with different parasitic values: (1) blue solid curve: Rg = 1.0Ω, Lg = 1.0 nH, Cg = 5.0 pF, Cd = 1.0 nF, (2) orange dotted curve: Rg = 1.0Ω, Lg = 1.0 nH, Cg = 5.0 pF, Cd = 0.5 nF, (3) yellow dashed curve: Rg = 1.0Ω, Lg = 1.0 nH, Cg = 5.0 pF, Cd = 0.2 nF. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
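As a quick numerical check of Eq. (3), the sketch below evaluates |H(ω)| for the three parameter sets listed in the caption of Fig. 4 and reports how far the largest value rises above the DC value Rg. The function names, the frequency range of 1 MHz to 10 GHz, and the grid density are assumptions made for illustration only.

```python
import math

def h_mag(freq_hz, rg, lg, cg, cd):
    """Magnitude of H(w) = (jwLg + Rg) / (1 + jwRg(Cg+2Cd) - Lg(Cg+2Cd)w^2)."""
    w = 2 * math.pi * freq_hz
    c = cg + 2 * cd
    num = complex(rg, w * lg)
    den = complex(1 - lg * c * w * w, w * rg * c)
    return abs(num / den)

def peak_over_dc_db(rg, lg, cg, cd, f_lo=1e6, f_hi=1e10, points=4000):
    """Largest |H| on a log-spaced grid, in dB relative to the DC value Rg."""
    best = 0.0
    for k in range(points):
        f = f_lo * (f_hi / f_lo) ** (k / (points - 1))
        best = max(best, h_mag(f, rg, lg, cg, cd))
    return 20 * math.log10(best / rg)

for cd in (1.0e-9, 0.5e-9, 0.2e-9):
    peak = peak_over_dc_db(rg=1.0, lg=1.0e-9, cg=5.0e-12, cd=cd)
    print(f"Cd = {cd * 1e9:.1f} nF: peak {peak:.2f} dB above |H(0)|")
```

With these values the peak grows as Cd decreases, and for Cd = 1.0 nF it stays well under 1 dB above the DC value, so that curve is essentially flat before it rolls off.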

Fig. 6. Switching currents are represented in a frequency domain: I0 , I1 , …, In are the amplitudes of the switching current Is at the fundamental and its harmonics; I′0i , I′1i , …, I′ni are the amplitude of the switching current Ii of the i-th partition Mi (1 ≤ i ≤ N).

modules work in a plesiochronous² system. When the modules in a digital system operate at largely different frequencies, however, the transfer function can no longer be treated as a constant, and the analysis and estimation of the power/ground noise become more complex.

3.2. Estimations of noise reduction in a GALS system

3.2.1. Case 1 - highly decoupled communication

Assume that we split a large design into N partitions and that each partition Mi (1 ≤ i ≤ N) operates at its own clock frequency fi. The initially synchronous design consumes a switching current Is. Depending on the amount of switching inside each partition, the partitions consume different currents Ii (1 ≤ i ≤ N). The waveforms of the switching currents can be decomposed into cosine waveforms. Fig. 6 represents the switching currents of the initial design and the partitioned design in the frequency domain. This case was analyzed by X. Fan et al. [27] and M. Babic et al. [33]. Let Sn be the reduction in the power spectrum of the supply noise at the n-th harmonic. Sn is given by Eq. (4), where αni is the ratio of Is to Ii.

$$S_n\,[\mathrm{dB}] = 20 \times \log\left(\frac{|I_s(\omega_n) \times H(\omega_n)|}{\max\{|I_i(\omega_n) \times H(\omega_n)|\}}\right) = 20 \times \log(\min\{\alpha_{ni}\}) \tag{4}$$

In the case of fair and even partitioning, each partition Mi consumes a current that can be approximated by Ii = Is/N. Note that the operating frequencies of the partitions Mi are not exactly identical, but their differences are not significant. If the frequency variations are large, we have to consider the frequency response of the PDN (H(ω)) at the different operating frequencies of the LSMs, since its impedance is a function of frequency as shown in Fig. 4 of Section 3.1. Finally, the noise reduction can be rewritten as Eq. (5) if αni = |Is(ωn)|/|Ii(ωn)| = N:

$$S_n = 20 \times \log(N) \tag{5}$$

It is worth noting that the clock stretching effect of a pausible-clock-based GALS system is not taken into account in Eq. (5). With clock stretching, the period of a clock signal is extended when the clock generator is not released from a MUTEX before its next rising edge. When this happens, a larger EMI reduction can be achieved, since the noise spectrum is spread over more frequencies; this effect was not considered in Refs. [27,33].

Fig. 5. The equivalent model of the ground noise. In the frequency domain, the ground noise is Vgnd(f) = IA(f) × H(f).

3.2.2. Case 2 - loosely decoupled communication

Some GALS systems have highly intensive data transfers between LSMs, such as streaming pipeline architectures in which data or instructions are transferred between pipeline stages at every cycle. In this type of GALS system, the handshake protocol is executed at every local clock cycle. Therefore, the local clock frequencies of the LSMs automatically adapt to each other, since the total input data throughput is exactly equal to the total output data throughput. Eventually, the local clock signals have the same frequency but different phases. Fig. 7 (a) depicts an example of this case. EMI reduction is thus achieved because the digital switching activity is spread over time. Fig. 7 (b) shows the vector representation of the amplitudes and phases of the switching currents consumed by the partitions at the fundamental frequency. Similar representations can be drawn for higher harmonics. Let Fn{Is} and Fn{Ii} be the Fourier coefficients at the n-th harmonic of the switching currents of the originally synchronous design and of each locally synchronous module Mi in a GALS system, respectively. The total switching current of the GALS system at the n-th harmonic is given by Eq. (6). N is the number of LSMs in the GALS system, and the αi are the relative

2 In a plesiochronous system, the clock frequencies of the modules differ slightly from one another, and their phases vary slowly [36].


Fig. 7. (a) Clock pulses of three locally synchronous modules in a GALS system: the same frequency, but different phases; (b) Their representation in amplitudes and phases.

phase shifts between the local clocks of the LSMs.

$$\sum_{i=1}^{N} F_n\{I_i\} = \frac{|F_n\{I_s\}|}{N} \times \left(1 + \sum_{i=2}^{N} e^{-jn\omega_n\alpha_i}\right) \tag{6}$$

The reduction in the power spectrum of the supply noise is given by Eq. (7). Note that the transfer functions Hs(ω) and Hi(ω) are constant and equal to each other, as described in Section 3.2.1. The denominator of Eq. (7) is the magnitude of the sum of N unit phasors, which never exceeds N and is strictly smaller than N whenever the phases differ. Therefore, the supply noise is suppressed.

$$S_n = 20 \times \log\left(\frac{|F_n\{I_s\}|}{\left|\sum_{i=1}^{N} F_n\{I_i\}\right|}\right) = 20 \times \log\left(\frac{N}{\left|1 + \sum_{i=2}^{N} e^{-jn\omega_n\alpha_i}\right|}\right) \tag{7}$$

Fig. 8. The block diagram of the proposed MFC.
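The phasor argument behind Eq. (7) can be checked numerically. The sketch below assumes equal current shares and takes, for each module, the phase angle of its current at the harmonic of interest relative to module 1 (the exponent of the corresponding term in Eq. (7)). The function name `phase_reduction_db` and the list-based interface are hypothetical.

```python
import cmath
import math

def phase_reduction_db(phases_rad):
    """Noise reduction (dB) for equal-current modules with given clock phases.

    phases_rad[i] is the phase of module i at the harmonic of interest,
    relative to module 1 (so phases_rad[0] is normally 0).
    """
    n = len(phases_rad)
    total = sum(cmath.exp(-1j * p) for p in phases_rad)  # sum of unit phasors
    if abs(total) < 1e-12:
        return math.inf  # perfect cancellation at this harmonic
    return 20 * math.log10(n / abs(total))

print(round(phase_reduction_db([0.0, 0.0, 0.0]), 2))      # all modules in phase
print(round(phase_reduction_db([0.0, math.pi / 2]), 2))   # two modules, 90 degrees apart
```

When all phases coincide the phasors add to N and the estimate is 0 dB; any phase difference shortens the resultant and yields a positive reduction.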

If the current consumption of the partitions differs, Eq. (7) must be modified. Let βi be the ratio of the current consumption of partition Mi to that of the originally synchronous design. The reduction in the power spectrum of the supply noise is then given by:

$$S_n = 20 \times \log\left(\frac{|F_n\{I_s\}|}{\left|\sum_{i=1}^{N} F_n\{I_i\}\right|}\right) = 20 \times \log\left|\beta_1 + \sum_{i=2}^{N} \beta_i \times e^{-jn\omega_n\alpha_i}\right|^{-1} \tag{8}$$

4. Implementation of MFC and noise generator

4.1. Design of the proposed MFC

The circuit structure of the proposed MFC consists of a fixed delay line, a controllable delay line with fine-grained delay steps, and a modulation circuit that generates the selection signals Sel, as presented in Fig. 8. The fixed delay line is built from lookup-table (LUT) elements of the FPGA. The controllable fine-grained delay line is made of fast carry multiplexers (MUXCYs), which are available in modern FPGAs and are usually used in high-speed arithmetic circuits. Typically, a fast carry multiplexer has a delay of approximately 90 ps (simulation result), which is much smaller than the delay of an LUT. The advantage of fine-grained delay cells is that a high EMI reduction can be achieved at a small cost in clock jitter. The path from the output of the fixed delay line through pins “1” of the cascaded multiplexers is the longest path (red line), while the paths through pins “0” of the multiplexers have shorter delays. To make the delay line effectively controllable, we need to minimize the skew of the interconnection path (purple line) that connects all pins “0” of the fast multiplexers. In the design phase, we constrain the maximum allowable skew of this interconnection with the “MAX_SKEW” attribute supported by the Xilinx ISE software.

The modulation circuit varies the delay of the ring oscillator by controlling the value of the selection bits (Sel). The modulation rate of the output clock frequency is determined by the configurable switching threshold. The result of comparing the switching threshold with the counter is encoded by a thermometer-code encoder [37], and the output of this encoder is used as the selection bits (Sel) that configure the fine-grained delay line.

The proposed circuit is an all-digital design described in a hardware description language (e.g., Verilog HDL or VHDL). It is portable and synthesizable, so it can migrate easily to other FPGA families and newer technology nodes without manual intervention. It can also be implemented in an ASIC using standard library cells (e.g., NAND, BUF, and INVERTER gates). It therefore helps reduce design time and cost and makes prototyping quicker and easier.

Nowadays, mixed-signal chips consist of digital and analog components (e.g., RF transceivers, RF low-noise amplifiers, and analog-to-digital converters) that are designed and placed on the same silicon substrate. The noise generated by the switching activity of the digital components can penetrate the substrate and the power/ground rails of the analog circuits, which adversely affects their performance [19]. Solutions for mitigating power supply noise are therefore essential in VLSI design. Moreover, FPGA-based digital designs are very popular, and their supply rails suffer from digital switching noise. An all-digital spread spectrum clock generator such as the MFC can therefore help reduce the power supply noise caused by digital switching activity inside the chip.

Table 1 compares our on-chip clock generator with previous works in terms of frequency, jitter, frequency deviation, and other parameters. Note that conventional SSCGs are applied only to designs that use dual-clock first-in first-out (FIFO) buffers as the synchronization mechanism between clock domains. Unlike conventional SSCGs, our MFC can be applied both to systems based on dual-clock FIFO buffers and to GALS systems based on pausible clocking. Additionally, in this work, the switching rate of the output frequency and the frequency spreading ratio (jitter) can be configured by setting the switching threshold and the counter, as illustrated in Fig. 8. The number of output frequencies can be configured from 4 to 16.

In FPGAs, the clock network is normally divided into clock regions, in which regional clock buffers deliver the clock signal to the sequential elements (e.g., flip-flops, registers, and memories). With local on-chip clock generation, the output clock is first connected to the regional clock buffer, which then delivers the clock signal to the clock network. A good-quality clock signal is therefore delivered to the sequential logic elements, and the drive strength (drivability) of the local on-chip clock generator is not an issue.

Table 1 Clock comparison with previous works.

| Works | [13] | [15] | [17] | [18] | This work |
|---|---|---|---|---|---|
| Modulation method | VCO | VCO | DCO | DCO | DCO |
| Center frequency Fc (MHz) | 162/270 | 5000 | 27/54 | 270 | 23.2/64.5 |
| Frequency deviation (MHz) | 1.35 | 25 | 1–10% × Fc | 5.4 | 0.8/5.2 |
| Jitter (peak-to-peak) | 42.8 ps | 14.4 ps | 370 ps to 3712 ps | 72.6 ps | 372 ps to 1486 ps |
| Jitter configurability | No | No | Yes | No | Yes |
| Technology | CMOS 180 nm | CMOS 65 nm | CMOS 180 nm | CMOS 90 nm | FPGA 90 nm |
| Supply voltage (V) | 1.8 | 1.2 | 1.2 | 1.0 | 1.2 |
| Power consumption (mW) | 19 (270 MHz) | 7 (5 GHz) | 1.2 (54 MHz) | 0.44 (270 MHz) | 0.54 (23.2 MHz), 1.02 (64.5 MHz) |
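To make the role of the selection bits and the fine-grained delay steps more concrete, the sketch below shows a thermometer code and an idealised mapping from delay setting to output frequency. It assumes that each additional selected stage lengthens the oscillation period by a fixed amount; the 93 ps step, the bit ordering, and both function names are hypothetical choices for illustration, not values or interfaces taken from the implementation.

```python
def thermometer_code(k, width=16):
    """Thermometer code: the k lowest bits are 1, the rest are 0."""
    if not 0 <= k <= width:
        raise ValueError("k must be between 0 and width")
    return [1] * k + [0] * (width - k)

def mfc_frequencies_mhz(center_mhz=23.2, step_ps=93.0, n_steps=16):
    """Output frequencies when each selected stage adds step_ps to the period.

    step_ps is a hypothetical period increment per stage, not a measured value.
    """
    center_period_ps = 1e6 / center_mhz          # a 1 MHz clock has a 1e6 ps period
    freqs = []
    for k in range(n_steps):
        # Centre the n_steps settings around the nominal period.
        period_ps = center_period_ps + (k - (n_steps - 1) / 2) * step_ps
        freqs.append(1e6 / period_ps)
    return freqs

freqs = mfc_frequencies_mhz()
print(thermometer_code(3, width=8))
print(f"{min(freqs):.2f} MHz to {max(freqs):.2f} MHz")
```

Under these assumptions the 16 settings span roughly 0.75 MHz around 23.2 MHz, which is of the same order as the 0.8 MHz deviation listed in Table 1; this is only a plausibility check, not a model of the actual delay line.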

4.2. Design of a noise generator

To assess the effectiveness of the GALS design technique for EMI reduction, a noise generator is designed to draw high currents at clock edges, so that noise is intentionally introduced onto the power rails by the variation in current consumption. In the experiments, the noise generator works as a locally synchronous module of a GALS system. Its schematic is shown in Fig. 9. The noise generator consists of a chain of cascaded flip-flops (FFs). The first stage of the chain consists of one FF with an inverted feedback path, which generates a continuously toggling bit pattern. On each rising clock edge, this data is shifted toward the output pins (A0 to An) through the intermediate FFs. All FFs in the noise generator toggle at every clock cycle. The noise generator consumes a high current at the rising clock edge, so it is expected to generate enough supply noise for investigating the impact of GALS clocking strategies on EM noise.

5. EMI reduction evaluation

In the first part of this section, we present test scenarios for evaluating the noise reduction in a GALS system with MFC. The measurement setup and experimental results are covered in the second part.

5.1. Measurement scenarios

This section presents the scenarios for evaluating the supply noise reduction. The first scenario examines the impact of a GALS design on the supply noise reduction. The second examines the impact of a GALS system with the MFC strategy. Finally, we investigate the impact of a GALS system implementing a pipeline with two and three LSMs on the EMI reduction.

5.1.1. Impact of a GALS design on the supply noise reduction

To evaluate the impact of a GALS design on the supply noise reduction, we intentionally partition an originally synchronous design into N smaller LSMs, with N ranging from 2 to 8. Each LSM is clocked by its own local clock generator, and the clock frequencies of the LSMs differ slightly from one another. The supply noise is then evaluated.

5.1.2. Impact of a GALS design with an MFC-equipped clock generator on the supply noise reduction

The effect of a GALS design with an MFC-equipped clock generator on the supply noise reduction is assessed in two sub-scenarios. First, to evaluate the EMI reduction of the MFC alone, we create two designs, Design-1 and Design-2, whose block diagram is depicted in Fig. 10.

Fig. 10. EMI reduction evaluation for an MFC. This circuit structure is used for evaluating Design-1 (the MFC generates a single clock frequency) and Design-2 (the MFC generates variable clock frequencies) of the first scenario.

Fig. 9. The block diagram of a noise generator.
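The toggling behaviour of the flip-flop chain in Fig. 9 can be reproduced with a short behavioural model. The sketch assumes an all-zero reset state and ideal flip-flops; the function name `toggles_per_clock` is hypothetical.

```python
def toggles_per_clock(n_ffs, n_clocks):
    """Count how many flip-flops change state on each rising clock edge."""
    state = [0] * n_ffs                      # all flip-flops reset to 0
    counts = []
    for _ in range(n_clocks):
        # FF0 loads its own inverted output; FFk loads the output of FF(k-1).
        nxt = [1 - state[0]] + state[:-1]
        counts.append(sum(a != b for a, b in zip(state, nxt)))
        state = nxt
    return counts

print(toggles_per_clock(n_ffs=4, n_clocks=6))
```

After a warm-up of n_ffs clock cycles from reset, every flip-flop in the model changes state on every edge, which is the worst-case simultaneous switching the noise generator is meant to produce.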


Fig. 13. The block diagram of a GALS system with three LSMs.

Fig. 11. EMI reduction evaluation for a GALS system. This circuit structure is used for evaluating Design-3 of the second scenario.

Fig. 14. A setup model for the supply noise measurement.

Fig. 12. The block diagram of a clock scheduler.

handshake protocols of a GALS systems. Readers can refer to the work in Ref. [31] for more details.

• Design-1: An MFC circuit is configured to generate a fixed single clock frequency (center frequency = 23.2 MHz) for four LSMs (four

5.2. Measurement results

noise generators). • Design-2: An MFC circuit is used to provide the clock source with variable frequency for four LSMs (four noise generators).

A 150-Ω method, which is standardized by IEC 61697-4 [38], is used as a means to measure the conducted noise on the power line of the Spartan-3 FPGA. The measurement setup is depicted in Fig. 14. In our experiments, the spectrum analyzer, namely Gwinstek GSP-830, is used to measure the conducted noises.

Second, to evaluate the joint impact of the GALS technique with the MF clock generator on the EMI reduction, four independent MFC circuits are utilized to provide distributed variable frequency clocks for four different noise generators which intentionally mimic LSMs. The design of this case is illustrated in Fig. 11, and it is called Design-3. Specifically, if four individual MFCs are not synchronized with any reference, their output clock frequencies can be identical at a certain time depending on their starting times and operation behavior, which makes the noise spectrum associated with their operations become higher. So, to avoid this problem, we design a scheduler to interleave their clock frequencies and to make all the local clock frequencies different from each other at anytime. An additional clock generator will be used to provide a reference clock for the scheduler circuit. This clock generator is configured to generate a single clock frequency. Additionally, the scheduler has four 3-bit registers to store the values which are used to select the clock frequencies for local clock generators. The block diagram of the scheduler is shown in Fig. 12. In this scheduler, the switching threshold is a configurable parameter that is used to determine the switching rate of the operating frequency.

5.2.1. Impact of a GALS design on the supply noise reduction

The supply noises of GALS systems are shown in Fig. 15. As can be seen, the supply noise decreases rapidly as the number of partitions increases (N = 1 to 8). The original synchronous design with one clock source causes the highest noise, 69.4 dBμV, while the GALS system with 8 LSMs introduces the lowest noise, about 52.0 dBμV.
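As a quick numerical check of these figures, the short sketch below computes the measured reduction for N = 8 and, for comparison, an idealised estimate of 20·log10(N). The 20·log10(N) form is an assumption here: it matches the expression the paper uses for N = 4 in Section 5.2.2, but Eq. (5) itself is not reproduced in this part of the text. The function name is hypothetical.

```python
import math


def ideal_partition_reduction_db(n_partitions):
    """Idealised reduction for N equal partitions, assuming a 20*log10(N) law."""
    return 20.0 * math.log10(n_partitions)


# Measured values quoted in the text (dBuV at the clock frequency).
sync_noise_dbuv = 69.4
gals8_noise_dbuv = 52.0
measured_reduction_db = sync_noise_dbuv - gals8_noise_dbuv

for n in range(1, 9):
    print(f"N = {n}: idealised reduction = {ideal_partition_reduction_db(n):.2f} dB")
print(f"measured reduction for N = 8: {measured_reduction_db:.1f} dB")
```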

5.1.3. The impact of a GALS design with loosely decoupled communications on EMI reduction

In this scenario, we implement two GALS systems: (1) a GALS system with two LSMs, and (2) a GALS system with three LSMs. The LSMs in these GALS systems communicate with each other using a two-phase asynchronous handshake protocol, and data transfers between LSMs occur at every clock cycle. Fig. 13 illustrates the block diagram of such a GALS system with three LSMs. In the figure, (Rp-1, Ap-1) and (Rp-2, Ap-2) form handshake signal pairs between control ports. ctrl-i (i = 1 to 4) are the communication signals between control ports and their local clock generators. In this paper, we do not discuss the operation principles and

Fig. 15. The supply noise spectrum of GALS systems with respect to the number of partitions.


0.8 MHz, i.e., the difference in frequency between the lower and upper bounds is 0.8 MHz. The observed power noise at the spectrum analyzer is about 59.3 dBμV as shown in Fig. 17 (b). Therefore, the noise reduction rate is 10.1 dB (= 69.4 − 59.3). The modulation frequency fm used in the MFC is about 30 kHz, and the resolution bandwidth (RBW) of the spectrum analyzer is set to 30 kHz. Our MFC exploits 16 discrete frequencies by using a variable delay line with 16 different fine-grained delay elements (MUXCYs). The supply noise is mainly distributed over these 16 frequencies.

Second, four MFCs are used to provide clock signals for four different noise generators. The observed power supply noise is about 50.2 dBμV. Thus, the EMI reduction rate is about 19.2 dB (= 69.4 − 50.2). The supply noise spectrum is shown in Fig. 17 (c). In this case, each MFC provides a variable frequency clock for one noise generator, and at any point in time the four noise generators operate with four different clock frequencies. With both the MFC design and the GALS design applied, the supply noise is therefore attenuated further. Note that the measurement result shown in Fig. 17 (c) is obtained with an RBW of 3 kHz; the RBW of the spectrum analyzer is set to 3 kHz to resolve the spectrum peaks. Assume that the transfer function of the PDN is constant within the frequency deviation of these MFCs, i.e., 23.2 ± 0.4 MHz. The noise reduction rate caused by partitioning (i.e., the GALS system with N = 4) is given by:

Fig. 16. The supply noise reduction of a GALS system with respect to the different number of partitions.

Table 2 Hardware resources for GALS systems (the first scenario).

LSMs   Registers   Slices   LUT-4s   MUXCYs
1      2132        1184     2232     16
2      2132        1224     2280     32
3      2132        1265     2328     48
4      2132        1306     2376     64
5      2132        1345     2424     80
6      2132        1386     2472     96
7      2132        1426     2520     112
8      2132        1466     2568     128

$$ S_1 = 20 \log_{10}(4) \approx 12.04\ \text{dB} $$

Finally, from the measurement result, we find that the MFC contributes about 7.2 dB (= 19.2 − 12.04) to the noise reduction rate in this GALS system.

5.2.3. GALS systems with loosely decoupled communications between LSMs

In this case, since LSMs in GALS systems communicate with each other at every clock cycle (highly intensive data communication), the local clock generators of the LSMs automatically adapt and finally run at the same frequency, but with different phases. The relative phase shifts between clock signals can be adjusted by adding delays on the handshake signals, i.e., the request or acknowledge signals. Based on the simulation results, the phase shift between clock signals for the GALS system with two LSMs is about 100°. For the GALS system with three LSMs, the clock phase of the second LSM is about 100° later than that of the first LSM, and about 90° earlier than that of the third LSM. The experimental results of the supply noise spectrum for these GALS systems are summarized in Fig. 18. For comparison, we also measure the noise spectrum of the fully synchronous single clock design as a reference. We can use the simulated phase shifts between LSMs to estimate the noise reductions based on Eq. (7).

As shown in Fig. 18 (a) and (b), at the fundamental frequency, the supply noise spectrum of the GALS design with two LSMs is 3.3 dB lower than that of the synchronous design. At the second harmonic, its supply noise spectrum is 12.8 dB lower than that of the synchronous design. The higher harmonics seem to be eliminated. For the GALS system with three LSMs, the supply noise spectrum at the fundamental frequency is about 8.7 dB lower than that of the synchronous design, while its second harmonic is attenuated by about 7.7 dB. So, when the switching activities are more distributed in time, the noise spectrum at the fundamental frequency is more suppressed. Additionally, if we want to suppress the supply noise further, spread spectrum clocking can also be applied to this circuit.
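The relation between clock phase shifts and harmonic attenuation can be illustrated with a simple phasor sum. The sketch below is not the paper's Eq. (7), which is not reproduced in this part of the text. It assumes that the N LSMs draw identical current waveforms that differ only by a clock phase offset, so that the k-th harmonic scales with the magnitude of the average of exp(j·k·φ_i). The function name is hypothetical; the phase values are the simulated shifts quoted above.

```python
import cmath
import math


def harmonic_reduction_db(phases_deg, harmonic=1):
    """Attenuation (dB) of one harmonic for equal-amplitude, phase-shifted sources."""
    phasor_sum = sum(
        cmath.exp(1j * harmonic * math.radians(p)) for p in phases_deg
    )
    ratio = abs(phasor_sum) / len(phases_deg)
    if ratio == 0.0:
        return math.inf  # perfect cancellation in this idealised model
    return -20.0 * math.log10(ratio)


two_lsm_phases = [0.0, 100.0]           # second LSM about 100 degrees later
three_lsm_phases = [0.0, 100.0, 190.0]  # third LSM a further 90 degrees later

for name, phases in (("two LSMs", two_lsm_phases), ("three LSMs", three_lsm_phases)):
    for k in (1, 2):
        print(f"{name}, harmonic {k}: {harmonic_reduction_db(phases, k):.2f} dB")
```

Under these idealised assumptions the model gives somewhat larger reductions than the measured 3.3 dB and 8.7 dB at the fundamental frequency, so it should be read as a rough illustration of the trend rather than as a prediction of the measurements.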

The estimated and actual noise reduction rates are shown in Fig. 16. The noise reduction rates are estimated based on Eq. (5). The actual noise reduction rate nearly matches the estimated one. The highest error is about 1.06 dB for the GALS system with 8 LSMs. Table 2 shows the hardware resources used for GALS systems with different numbers of LSMs. The GALS designs differ in the number of LUT-4s and MUXCYs because of the overhead of the local clock generators. Each local clock generator costs 48 LUT-4s and 16 MUXCYs.

5.2.2. Impact of a GALS design with an MFC-equipped clock generator on the supply noise reduction

Prior to showing the joint impact of the MFC and the GALS techniques on the supply noise, we summarize the hardware resources and clocking strategies of the three designs in Table 3. For the first case, we measure the power supply noise when one MFC is used to generate a single fixed frequency clock for the LSMs working as noise generators. The measured power noise for this case is 69.4 dBμV at the operating frequency of 23.2 MHz as illustrated in Fig. 17 (a). Next, the MFC is configured to provide a variable clock frequency for 4 noise generators; the center frequency of this clock source is about 23.2 MHz, and the frequency deviation is about

Table 3 Design features and hardware resources (the second scenario).

Parameters                   Sub-scenario 1             Sub-scenario 2
                             Design 1     Design 2      Design 3
No. of Clock Generators      1            1             4
No. of Registers             2132         2132          2132
No. of Occupied slices       1224         1224          1361
Clock feature                fixed        variable      variable
Clock Frequency (MHz)        23.2         22.8–23.6     22.8–23.6
Frequency deviation (MHz)    0            0.8           0.8

6. A design example of a GALS DES crypto-processor

In this section, a demonstration of the proposed technique is performed on a DES crypto processor [39]. First, we briefly describe how the


Fig. 17. Supply noise spectra: (a) when using a single clock frequency, (b) when applying spread spectrum clocking, (c) when applying variable frequencies and GALS design (4 LSMs) with a clock scheduler.

DES algorithm works, and how to partition a DES crypto processor. Second, we present measurements of the conducted noise reduction rates on the power line when applying the GALS design and the MFC technique for the local clock generator.

6.1. DES algorithm and DES crypto-processor partitioning

The DES algorithm is a symmetric cipher that is used to protect users’ data. It encrypts input data with a secret key at the sender site and decrypts the cipher-data with the same secret key at the receiver site. DES operates on 64-bit plaintext blocks and returns ciphertext blocks of the same size. Its secret key size is 56 bits. An expansion key generator produces sixteen 48-bit round keys (K1 to K16) for the DES operation. The DES algorithm is described in detail in Ref. [39]. The DES processor has two main parts: (1) round operations and (2) an expansion key generator. The DES crypto processor is normally implemented with 16 pipeline stages to improve its throughput. Thus, it can produce a 64-bit encrypted/decrypted data block at every clock cycle with an initial latency of 16 clock cycles.

For the demonstration, we divide this crypto processor into three partitions as depicted in Fig. 19. The first partition consists of the round operations from 1 to 4, the key expansions from K1 to K4, and a ROM for storing secret keys and input data. The second partition consists of the round operations from 5 to 10 and the expanded key generators from K5 to K10. The third partition includes the round operations from 11 to 16 and the expanded keys from K11 to K16. To obtain a good noise reduction, we need to balance the power dissipation between GALS blocks. Thus, power estimation of the design is necessary in the partitioning phase. The Xilinx design toolkits support power estimation with the XPower Analyzer tool [40]. One of the most important inputs of the power simulation is the switching activity of the targeted circuits. We can obtain the switching activities by simulating the gate-level netlist with back-annotated delays (generated from the place-and-route design) using random input patterns (plaintexts and secret keys). The size of each partition is adjusted based on the power dissipation report. The block diagram of the pausible clocking based GALS DES crypto processor is illustrated in Fig. 20. The logic cores of the GALS DES crypto processor are the three partitions illustrated in Fig. 19. Input data and secret keys are stored in on-chip ROMs. The input and output port


Fig. 18. Supply noise spectra: (a) when using a single clock frequency, (b) when applying a GALS system with two LSMs, (c) when applying a GALS system with 3 LSMs.

controllers (DO and DI) are of the demand port type. A two-phase handshake protocol is employed as the communication mechanism between port controllers. Since this pausible clocking based GALS system operates as a pipeline, its local clock generators always have the same frequency. If the clock generator of GALS-2 is equipped with an MFC to vary its output clock frequency, the clock frequencies of GALS-1 and GALS-3 automatically adapt to the clock frequency of GALS-2. So, we do not apply the MFC to the clock generators of GALS-1 and GALS-3.

6.2. Supply noise evaluation

In the experiments, we measure the power noise for three cases: (a) a fully synchronous DES design, (b) a GALS DES without the multi-frequency clocking strategy, and (c) a GALS DES with an MFC generator. Table 4 shows the hardware resources of these designs. The noise spectra of the three cases are summarized in Fig. 21. The measurement setup used to record the conducted noise on the power line of the FPGA is presented in Fig. 14. For the synchronous DES processor, the observed noise levels at the first four harmonics are 71.7, 64.2, 58.9 and 52.1 dBμV, respectively. The operating frequency of the synchronous DES is about 23.2 MHz. As can be seen, the supply noise at the lower-order harmonics dominates that at the higher-order harmonics.

Fig. 19. The DES crypto processor partitioning.
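The power-balancing step of the partitioning described in Section 6.1 can be sketched as a small search over contiguous splits of the 16 pipeline rounds. This is only an illustration, not the procedure used in the paper, where partition sizes are adjusted from the XPower Analyzer report. The per-round power values below are hypothetical and chosen purely for illustration (round 1 is given extra weight to stand in for the ROM), and the function name is an assumption.

```python
from itertools import combinations


def balanced_contiguous_partitions(stage_power_mw, n_partitions):
    """Split stages into contiguous groups, minimising the largest group power."""
    n = len(stage_power_mw)
    best_bounds, best_peak = None, float("inf")
    # Try every placement of the cut points between adjacent stages.
    for cuts in combinations(range(1, n), n_partitions - 1):
        bounds = (0, *cuts, n)
        peak = max(sum(stage_power_mw[a:b]) for a, b in zip(bounds, bounds[1:]))
        if peak < best_peak:
            best_bounds, best_peak = bounds, peak
    # Report 1-based round numbers for each partition.
    groups = [list(range(a + 1, b + 1)) for a, b in zip(best_bounds, best_bounds[1:])]
    return groups, best_peak


# Hypothetical per-round power estimates in mW (not measured data).
stage_power_mw = [3.0] + [1.0] * 15
partitions, peak_power = balanced_contiguous_partitions(stage_power_mw, 3)
for index, rounds in enumerate(partitions, start=1):
    print(f"partition {index}: rounds {rounds[0]} to {rounds[-1]}")
print(f"largest partition power: {peak_power:.1f} mW")
```

With real designs the per-stage values would come from the power report, and the search space is small enough (two cut points among 15 positions) that exhaustive enumeration is sufficient.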


Fig. 20. The pausible clocking based GALS DES crypto processor with three GALS blocks: DO and DI ports are input and output demand port controllers.

Table 4 Summary of DES crypto-processor designs.

Features                 Synchronous Design   GALS Design          GALS with MFC
No. of FFs               1973                 2229                 2229
No. of 4 input LUTs      3897                 3668                 3658
No. of Occupied Slices   2737                 2801                 2799
Clock generators         1 DCM                3 ring oscillators   3 ring oscillators

Fig. 21. Supply noise of DES running at 23.2 MHz: (a) a synchronous DES processor, (b) a pausible clocking based GALS DES processor with three partitions, (c) a pausible clocking based GALS DES processor with three partitions and multi-frequency clocking (clock modulation).


Table 5 Comparison with previous works.

Work                        [8]              [20]           [26]         Ours
Design Style                Synchronous      Asynchronous   GALS         GALS
Test Circuit                40 K gate chip   8051 core      FFT          DES
Technology                  CMOS@180 nm      CMOS@130 nm    IHP@130 nm   FPGA@90 nm
Center frequency (MHz)      50               50             80           23.2/64.5
Method                      Measure          Measure        Measure      Measure
Noise Reduction Rate (dB)   6                12             13           19.5/23.5

With the GALS DES processor, since the digital switching activities are spread over time, the noise spectrum is lower than that of the synchronous design. In particular, at the fundamental frequency, the noise reduction rate is about 10.6 dB compared to the synchronous design, while the reduction rate is about 10.1 dB at the second harmonic. When deploying the MFC generator in the LSMs of the GALS DES crypto processor, we can achieve a significant noise reduction at the fundamental frequency. The noise magnitude at the fundamental frequency is reduced to 52.2 dBμV, which is about 19.5 dB lower than that of the synchronous design, while the noise at the second harmonic is reduced by about 25.1 dB. The noise at the third harmonic is almost eliminated. Compared to the GALS design, the MFC technique contributes about 9 dB to the noise reduction rate for the GALS DES processor.

Table 5 shows the comparison between our work and previous works. Note that our work focuses on supply noise optimization in a commercial FPGA. We demonstrate the effectiveness of our designs for power supply noise reduction rather than focusing on performance metrics such as operating frequency, which is why we used an operating frequency of 23.2 MHz. The frequency of the on-chip oscillator (i.e., the proposed MFC) depends on the number of delay elements (both coarse-grain and fine-grain) forming the ring oscillator and on the delays of these elements. In practice, the delays of the delay elements depend on the technology node. We can therefore raise the operating frequency of our designs by reducing the number of delay elements in the ring oscillator. To demonstrate operation at a higher frequency, we also performed additional experiments with a clock frequency of 64.5 MHz. Our design targets low-cost commercial FPGA families, for example, Xilinx Spartan-3. Experimental results in Fig. 22

show that the EMI reduction rate at the fundamental frequency is about 23.5 dB. Our pausible clocking based GALS DES crypto-processor uses individual local MFCs. When this design approach is applied to its pipeline architecture, the frequencies of these local MFCs automatically adapt to each other by delaying (stretching) the next rising clock edge (the total throughputs at the input and output are equal). Eventually, they have the same frequency but different phases. Note that multi-frequency clocking in our GALS DES crypto-processor means that all locally synchronous blocks of the pipeline use a single clock frequency at any given time and periodically adjust their operating frequency after a time interval determined by the configurable clock switching rate. Additionally, our experiments showed that the supply noise generated by the GALS DES crypto processor (the whole pipeline operates with the same clock frequency, but the locally synchronous pipeline stages operate with different phases) is about 10.6 dB lower than that produced by its fully synchronous counterpart. Applying the MFC technique to the GALS DES crypto processor suppresses the supply noise further (by about 9 dB).
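The dependence of the oscillator frequency on the delay chain, noted in the comparison above, can be illustrated with the textbook ring-oscillator relation f = 1 / (2 · t_loop), where t_loop is the sum of the element delays around the loop. Treating the proposed MFC as an ideal ring oscillator is an assumption of this sketch, and the element delays below are hypothetical values, not figures from the paper.

```python
def ring_oscillator_frequency_hz(element_delays_s):
    """Ideal ring oscillator: one period is two traversals of the delay loop."""
    loop_delay_s = sum(element_delays_s)
    return 1.0 / (2.0 * loop_delay_s)


# Hypothetical delay chains: 20 versus 8 elements of 1 ns each.
long_chain = [1.0e-9] * 20
short_chain = [1.0e-9] * 8

f_long = ring_oscillator_frequency_hz(long_chain)
f_short = ring_oscillator_frequency_hz(short_chain)
print(f"20 elements: {f_long / 1e6:.1f} MHz")
print(f" 8 elements: {f_short / 1e6:.1f} MHz")
```

Shortening the chain raises the frequency, which is the mechanism the text relies on when moving from 23.2 MHz to 64.5 MHz.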

6.3. Performance overhead evaluation

Assume that the system configuration and the timing diagram of an asynchronous interface in the pausible clocking based GALS DES crypto-processor are depicted in Fig. 23(a) and (b), respectively. Let us define timing parameters as shown in Table 6. From the timing diagram, the maximum throughput of the data link depends on the frequencies of TX and RX blocks. Moreover, it is also limited by the round trip communication delay of a data link between TX and RX blocks. The

Fig. 22. Supply noise of DES running at 64.5 MHz: (a) a synchronous DES processor, (b) a pausible clocking based GALS DES processor with three partitions and multi-frequency clocking.


Fig. 23. (a) The block diagram and (b) the timing diagram of an asynchronous interface in the pausible clocking based GALS DES crypto-processor (Te and Re are trigger events for data transfer).

Table 6 Timing parameters of a GALS system.

Parameters              Meanings
$t_{CT}$                clock tree insertion delay of each LSM
$t_{tx}$ ($t_{rx}$)     required time to generate tx_ack (rx_ack) when Te (Re) is triggered
$t_{lf}$ ($t_{lb}$)     propagation delay of Req (Ack) from TX (RX) to RX (TX) blocks
$t_d$                   relative delay of rx_clk signal with respect to tx_clk signal
$T$                     a clock period of TX (RX) clock generators

delays of TX and RX are given by (9) and (10), respectively.

$$ t_{txd} = t_{CT} + t_{tx} + t_{lf} \tag{9} $$

$$ t_{rxd} = t_{CT} + t_{rx} + t_{lb} \tag{10} $$
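The timing relations of this subsection, Eqs. (9) and (10) together with the round trip bound (11) and the overhead expression (12), can be explored numerically with the sketch below. The function names and the individual delay components are hypothetical; only the total round trip delay of about 5.76 ns and the 15.5 ns clock period come from the paper. The text covers the cases where both delays are at most T/2 or both exceed T/2, so the mixed case is deliberately left unspecified.

```python
def tx_delay(t_ct, t_tx, t_lf):
    """Eq. (9): t_txd = t_CT + t_tx + t_lf."""
    return t_ct + t_tx + t_lf


def rx_delay(t_ct, t_rx, t_lb):
    """Eq. (10): t_rxd = t_CT + t_rx + t_lb."""
    return t_ct + t_rx + t_lb


def round_trip_bound(t_rxd, period):
    """Eq. (11): maximum round trip delay, valid when t_txd <= T/2."""
    return period / 2.0 + t_rxd


def performance_overhead(t_txd, t_rxd, period):
    """Overhead per transfer for the two cases described in the text."""
    half = period / 2.0
    if t_txd <= half and t_rxd <= half:
        return 0.0  # round trip fits within one clock period
    if t_txd > half and t_rxd > half:
        return t_txd + t_rxd - period  # Eq. (12)
    return None  # mixed case: not specified in the text


# Hypothetical component delays (ns) that sum to a 5.76 ns round trip.
t_txd = tx_delay(t_ct=1.2, t_tx=0.8, t_lf=0.88)
t_rxd = rx_delay(t_ct=1.2, t_rx=0.8, t_lb=0.88)

overhead_at_64mhz = performance_overhead(t_txd, t_rxd, period=15.5)
overhead_fast_clock = performance_overhead(t_txd, t_rxd, period=5.0)  # hypothetical
print(f"round trip delay: {t_txd + t_rxd:.2f} ns")
print(f"overhead at T = 15.5 ns: {overhead_at_64mhz:.2f} ns")
print(f"overhead at T = 5.0 ns:  {overhead_fast_clock:.2f} ns")
```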

If the delay $t_{txd}$ is less than or equal to $T/2$, the total maximum round trip communication delay is given by (11).

$$ t_{link} = T/2 + t_{rxd} \tag{11} $$

If the delay $t_{rxd}$ is less than or equal to $T/2$, then the maximum round trip communication delay is less than or equal to $T$. The throughput of the data link is then limited by the clock period of the TX and RX blocks.

In the worst case, if the delays $t_{txd}$ and $t_{rxd}$ are greater than $T/2$, then the maximum round trip communication delay is greater than $T$. The throughput of the data link is limited by the round trip communication delay. The performance overhead is given by (12).

$$ t_{overhead} = t_{txd} + t_{rxd} - T \tag{12} $$

In the simulation-based experiment (worst case: 1.14 V, 85 °C) shown in Fig. 24, the total round trip communication delay is about 5.76 ns, which is less than the clock period of 15.5 ns (corresponding to 64.5 MHz). So, it does not affect the total throughput of the pipeline. Our approach targets only pipelined applications whose computational delays can be comparable to the delays of the boundary interfaces and in which supply noise is a critical factor (e.g., crypto-processors).

7. Discussion

A ring oscillator based MFC suffers from process, operating voltage, and temperature (PVT) variations, since its frequency is determined by the delay of a chain of delay elements. When data-path delays change due to variations in operating conditions, the delays of the delay elements forming the ring oscillator change accordingly, and so does the operating frequency of the ring oscillator. In the proposed MFC-based GALS system, the variations of an MFC clock period and

Fig. 24. Timing simulation (the worst case 1.14 V and 85 °C) of an asynchronous interface in the pausible clocking based GALS DES crypto-processor.


delays of critical paths in each LSM are highly correlated. That is, the ring oscillator can rapidly adapt its clock period to delays of critical paths, which can save timing margins. Required timing margins for an MFC are, therefore, guard-bands to tolerate the locally differential variability with respect to critical paths [1]. In some cases, we can deploy local voltage supply sensors to detect dynamic supply noises locally, and then a controller will use this information to adjust the frequencies of the local clock generators (by dynamically configuring the number of delay elements constituting the ring oscillator) to keep the frequency deviation in a given range. For adaptive voltage frequency scaling (AVFS) applications, the sensor and controller are in charge of detecting and adapting the local clock frequency to ensure the correct operation, power/energy budget, and performance of a target system.

8. Conclusion

In this work, we developed analytical models of the power noise reductions obtained with the MFC and GALS design techniques. We then evaluated the noise reductions of the MFC and GALS techniques through experimental scenarios. For the MFC technique, we designed a clock generator that supports frequency variation by using a controllable fine-grained delay line. By deploying the clock generator with 16 different frequencies in our design, we can achieve an EMI reduction rate of 7.2 dB. For a GALS system, we evaluated the noise reduction in two scenarios: (1) a GALS system consisting of four locally synchronous modules (noise generators) which are clocked by four different MFCs (each MFC generates a single clock frequency), and (2) a GALS system consisting of four synchronous modules which are locally synchronized by four individual MFCs that produce variable clock frequencies (i.e., modulation). The experiments showed that we can achieve a noise reduction rate of 11.7 dB in the first case and 19.2 dB in the second case. Moreover, the noise reduction rate for a GALS system with highly intensive data transfers was analyzed and evaluated. In this case, the local clock signals of the GALS system have the same frequency but different phases. Digital switching activities inside the LSMs of the GALS system are more distributed in time. As a result, the spectral noise is suppressed. Finally, a real design example, a DES crypto processor, was implemented by applying the GALS technique and the proposed multi-frequency clocking strategy. Experimentally, a high noise reduction rate of at least 19.5 dB at the fundamental frequency can be achieved.

Acknowledgment

This work has been supported by the National Research Foundation of Korea through the Basic Science Research Program under Grant 2018R1D1A1B07043399.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.vlsi.2019.04.002.

References

[1] J. Cortadella, et al., Ring oscillator clocks and margins, in: Proc. 22nd IEEE Int. Symp. Asynchronous Circuits and Systems, 2016, pp. 19–26.
[2] R. Samanta, et al., Clock buffer polarity assignment for power noise reduction, IEEE Trans. Very Large Scale Integr. Syst. 17 (6) (2009) 770–780.
[3] H. Fujita, et al., Evaluation of PDN impedance and power supply noise for different on-chip decoupling structures, in: Proc. 9th Int. Workshop on Electromagnetic Compatibility of Integrated Circuits, 2013, pp. 142–146.
[4] H. Fujii, et al., Evaluation of power supply noise reduction by implementing on-chip capacitance, in: Proc. 8th Workshop on Electromagnetic Compatibility of Integrated Circuits, 2011, pp. 219–223.
[5] Y. Ogasahara, et al., Supply noise suppression by triple-well structure, IEEE Trans. Very Large Scale Integr. Syst. 21 (4) (2013) 781–785.
[6] T. Charania, et al., Analysis and design of on-chip decoupling capacitors, IEEE Trans. Very Large Scale Integr. Syst. 21 (4) (2013) 648–658.
[7] Y. Ogasahara, et al., All-digital ring-oscillator-based macro for sensing dynamic supply noise waveform, IEEE J. Solid State Circ. 44 (6) (2009) 1745–1755.
[8] M. Badaroglu, et al., Clock-skew-optimization methodology for substrate-noise reduction with supply-current folding, IEEE Trans. Comput. Aided Des. Integr Circuits Syst. 25 (6) (2006) 1146–1154.
[9] H. Seo, et al., Clock skew optimization for maximizing time margin by utilizing flexible flip-flop timing, in: Proc. 16th Int. Symp. Quality Electronic Design, 2015, pp. 35–39.
[10] A. Vijayakumar, V.C. Patil, S. Kundu, An efficient method for clock skew scheduling to reduce peak current, in: Proc. 29th Int. Conf. on VLSI Design and Embedded Systems, 2016, pp. 505–510.
[11] H. Jang, D. Joo, T. Kim, Buffer sizing and polarity assignment in clock tree synthesis for power/ground noise minimization, IEEE Trans. Comput. Aided Des. Integr Circ. Syst. 30 (1) (2011) 96–109.
[12] Y. Kaplan, Sh Wimer, Mixing drivers in clock-tree for power supply noise reduction, IEEE Trans. Circ. Syst. I, Reg. Pap. 62 (5) (2015) 1382–1391.
[13] W.-Y. Lee, L.-S. Kim, A spread spectrum clock generator for DisplayPort main link, IEEE Trans. Circ. Syst. II, Exp. Briefs 58 (6) (2011) 361–365.
[14] T. Kawamoto, M. Suzuki, T. Noto, 1.9-ps jitter, 10.0-dBm-EMI reduction spread-spectrum clock generator with autocalibration VCO technique for serial-ATA application, IEEE Trans. Very Large Scale Integr. Syst. 22 (5) (2014) 1118–1126.
[15] S.-G. Bae, et al., A 5-GHz sub-sampling PLL based spread spectrum clock generator by calibrating the frequency deviation, IEEE Trans. Circ. Syst. II, Exp. Briefs 64 (10) (2017) 1132–1136.
[16] S. Damphousse, et al., All digital spread spectrum clock generator for EMI reduction, IEEE J. Solid State Circ. 42 (1) (2007) 145–150.
[17] D. Sheng, et al., A low-power and portable spread spectrum clock generator for SoC applications, IEEE Trans. Very Large Scale Integr. Syst. 19 (6) (2011) 1113–1117.
[18] C.-C. Chung, et al., A low-cost low-power all-digital spread-spectrum clock generator, IEEE Trans. Very Large Scale Integr. Syst. 23 (5) (2015) 983–987.
[19] J. Le, Ch Hanken, M. Held, M.S. Hagedorn, K. Mayaram, T.S. Fiez, Experimental characterization and analysis of an asynchronous approach for reduction of substrate noise in digital circuitry, IEEE Trans. Very Large Scale Integr. Syst. 20 (2) (2012) 344–356.
[20] K.-L. Chang, J.S. Chang, B.-H. Gwee, K.-S. Chong, Synchronous-logic and asynchronous-logic 8051 microcontroller cores for realizing the internet of things: a comparative study on dynamic voltage scaling and variation effects, IEEE Trans. Emerg. Sel. Topics Circuits Syst. 3 (1) (2013) 23–34.
[21] J.-G. Lee, A low EMI circuit design with asynchronous multi-frequency clocking, IEICE Trans. Electron. E97-C (12) (2014) 1158–1161.
[22] M.N. Horak, S.M. Nowick, M. Carlberg, U. Vishkin, A low-overhead asynchronous interconnection network for GALS chip multiprocessors, IEEE Trans. Comput. Aided Des. Integr Circuits Syst. 30 (4) (2011) 494–507.
[23] I. Miro-Panades, et al., A fine-grain variation-aware dynamic Vdd-hopping AVFS architecture on a 32 nm GALS MPSoC, IEEE J. Solid State Circ. 49 (7) (2014) 1475–1486.
[24] M. Cannizzaro, S. Beer, J. Cortadella, R. Ginosar, L. Lavagno, SafeRazor: metastability-robust adaptive clocking in resilient circuits, IEEE Trans. Circ. Syst. I, Reg. Pap. 62 (9) (2015) 2238–2247.
[25] I. Levi, Alex Fish, O. Keren, Low-cost pseudoasynchronous circuit design style with reduced exploitable side information, IEEE Trans. Very Large Scale Integr. Syst. 26 (1) (2018) 82–95.
[26] X. Fan, et al., A GALS FFT processor with clock modulation for low-EMI applications, in: Proc. 21st IEEE Int. Conf. Application-specific Systems, Architectures and Processors, 2010, pp. 273–278.
[27] X. Fan, et al., GALS design for spectral peak attenuation of switching current, in: Proc. 19th IEEE Int. Symp. Asynchronous Circuits and Systems, 2013, pp. 83–90.
[28] Dan Clein, Layout design, in: CMOS IC Layout: Concepts, Methodologies, and Tools, Newnes, MA, 2000.
[29] Xilinx, FPGA Design Flow Overview [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx10/isehelp/ise_c_fpga_design_flow_overview.htm.
[30] Xilinx, Spartan-3 Libraries Guide for HDL Designs, 2009. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx11/spartan3_hdl.pdf.
[31] X. Fan, M. Krstic, E. Grass, Performance analysis of GALS datalink based on pausible clocking, in: Proc. 18th International Symposium on Asynchronous Circuits and Systems, 2012, pp. 126–133.
[32] J. Cortadella, et al., Petrify: a tool for manipulating concurrent specifications and synthesis of asynchronous controllers, IEICE Trans. Info Syst. E80-D (3) (1997) 315–325.
[33] M. Babic, S. Zeidler, M. Krstic, GALS partitioning methodology for substrate noise reduction in mixed-signal integrated circuits, in: Proc. 22nd IEEE International Symposium on Asynchronous Circuits and Systems, 2016, pp. 67–74.
[34] M. Babic, et al., Frequency-domain modeling of ground bounce and substrate noise for synchronous and GALS systems, in: Proc. 25th Int. Workshop Power and Timing Modeling, Optimization and Simulation (PATMOS), 2015, pp. 126–132.


[35] H. Wang, E. Salman, Closed-form expressions for I/O simultaneous switching noise revisited, IEEE Trans. Very Large Scale Integr. Syst. 25 (2) (2017) 769–773.
[36] T. Yoshimuara, A. Iwata, An analysis of interference in synchronous systems, IEICE Electron. Express 1 (15) (2014) 465–471.
[37] S. Nakaigawa, Thermometric-Binary Code Conversion Method, Conversion Circuit Therefor and Encoder Element Circuits Used Therefor, U.S. Patent 6346906B1, 2002.
[38] IEC 61967, Integrated Circuits - Measurements of Electro-Magnetic Emissions, 150 kHz to 1 GHz, 2001.
[39] William Stallings, Block ciphers and the data encryption standard, in: Cryptography and Network Security - Principles and Practices, Pearson, New York, 2011.
[40] Xilinx, Xilinx Power Tools Tutorial: Spartan-6 and Virtex-6 FPGAs, 2010. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx11/ug733.pdf.
