Microelectronics Journal 44 (2013) 258–269
Mitigating power- and timing-based side-channel attacks using dual-spacer dual-rail delay-insensitive asynchronous logic
Washington Cilio a, Michael Linder a, Chris Porter a, Jia Di a,*, Dale R. Thompson a, Scott C. Smith b
a Department of Computer Science & Computer Engineering, University of Arkansas, ENGR 311, CSCE Dept., Fayetteville, AR 72701, United States
b Department of Electrical Engineering, University of Arkansas, JBHT-CSCE 504, Fayetteville, AR 72701, United States
Article info
Article history: Received 10 February 2012; received in revised form 1 December 2012; accepted 10 December 2012; available online 9 January 2013
Keywords: Side-channel attack; Delay-insensitive asynchronous logic; Dual-rail; NULL Convention Logic; Dual-spacer; Delay element
Abstract
Side-channel attacks have become a prevalent research topic for electronic circuits in security-related applications, due to the strong correlation between data patterns and external circuit characteristics that can be easily measured. By monitoring the power/timing information of a synchronous circuit, an attacker can easily obtain the secret data stored on the device. Although dual-rail asynchronous circuits have more stable power traces, they are still vulnerable to power-based attacks because of the imbalanced loads between the two rails of each signal. Moreover, asynchronous circuits are among the most prone to timing attacks since their delays are strongly data dependent. Dual-spacer dual-rail delay-insensitive logic (D3L), presented in this paper, is able to mitigate both power- and timing-based side-channel attacks. In a D3L circuit, power consumption is decoupled from the data pattern by using a dual-spacer protocol that guarantees balanced switching activity between the two rails of each signal, while the timing-data correlation is broken by inserting random delays. Three Advanced Encryption Standard cores have been designed using synchronous logic, traditional dual-rail asynchronous logic, and D3L. Correlation Power Analysis and Timing Analysis attacks were applied, and the results show that the D3L design renders both attacks unsuccessful, while the other two circuits have vulnerabilities.
© 2012 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +1 479 575 5728; fax: +1 479 575 5339.
0026-2692/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.mejo.2012.12.001
1. Introduction
As technology advances, more and more electronic devices store secret information such as bank accounts, identification numbers, passwords, and other private data that need to be secured from unauthorized access. Although originally considered safe and secure, hardware, just like software, is prone to attacks that force the targeted system to reveal sensitive data. Cryptographic algorithms are commonly used to protect such data. However, despite the mathematical robustness of these algorithms, their physical implementations are known to be susceptible to attacks. Non-invasive attacks on such devices take advantage of side-channel information leaked from the system, instead of trying to reverse engineer it. Such side-channel information can be power, timing, electromagnetic emission, or any other quantity that can be measured from the device during computation.
Most electronic devices running cryptographic algorithms are implemented in CMOS technology, where transistors act as voltage-controlled switches. While a circuit node is switching, electrons flow across the corresponding transistors to charge/discharge its load capacitance, thereby consuming power. Because different transistors are turned on and off while processing different data, power consumption varies with the data, and power-based side-channel attacks exploit the IC's transient power data. The theory of power-based attacks, e.g., Differential Power Analysis (DPA), was introduced in [1]. In addition, Correlation Power Analysis (CPA) [2] uses the Pearson product-moment correlation coefficient to guess a key. In general, these attacks require the transient power data while the target IC performs encryption/decryption on different texts, and then use statistical algorithms to derive the key. Power-based attacks are the most powerful and most widely implemented side-channel attacks [3]. They have been used successfully to break many of the most relevant cryptographic algorithms on different platforms, including DES [4], Elliptic Curve Cryptosystems [5], RSA [6,7], AES [8,9], and all AES candidates [10], implemented on FPGAs [11] and as ASICs [12].
A number of methods have been proposed for mitigating power-based attacks by decoupling transient power consumption from the data being processed. Techniques based on balancing power fluctuation include new CMOS logic gates [13–18], which go through a full charge/discharge cycle for each data item processed. Other power balancing
methods include modifying the algorithm execution [19], compensating current at the power supply node [20], and using subthreshold operation [21]. Additionally, many techniques for randomizing power data have been proposed [22–27]. The principle of timing-based attacks [28] is very similar to power-based ones except these attacks rely on timing fluctuations of the target circuit while processing different data patterns. Depending on the load capacitance and driving strength, the charge/discharge process during the switching activities at an internal circuit node takes different amounts of time to finish, which in turn causes different timing delays. Existing countermeasures include inserting dummy operations [29], using redundant representation [30], and unifying the multiplication operands [31]. Asynchronous circuits, especially dual-rail asynchronous circuits, possess unique characteristics that could help mitigate such attacks [32–37]. Dual-rail asynchronous circuits, such as NULL Convention Logic (NCL) [38], use two wires to represent one signal. The DATA-spacer alternation protocol ensures the number of times each dual-rail signal switches is independent from the input; instead, it is only determined by the number of data processed [39], making power variation significantly smaller than synchronous designs [40]. Nonetheless, switching activity remains unbalanced between the two rails of each signal, which most likely drive different capacitive loads, as illustrated in Section 2.1; thus, DPA, High-Order DPA, or CPA can still succeed. Moreover, such dual-rail logic circuits are even more vulnerable to timing-based attacks due to their strong data-timing dependency. Other dual-rail approaches such as wave dynamic differential logic (WDDL) and masked dual-rail pre-charge logic (MDPL) have also been proposed in [41] and [42], respectively. Both WDDL and MDPL use a clocked data path to control the flow of data through the circuit. 
WDDL converts the input into a dual-rail signal and propagates it through the logic during the evaluation phase while the clock is low, and it propagates a logic 0 on both rails during the precharge phase while the clock is high. Although WDDL, like NCL, is more robust against power attacks, it is still unbalanced and requires a complex place-and-route process, as described in [41], making it more difficult to implement. The MDPL approach, on the other hand, uses masking techniques on a dual-rail logic style to prevent side-channel information leakage. However, it has been shown in [43] and [44] that an unprotected MDPL circuit is susceptible to leaking information during early propagation through the circuit, which creates a correlation between the unmasked signal and the power leakage at the time of evaluation. Additionally, a scheme similar to the dual-spacer dual-rail delay-insensitive logic (D3L) presented in this paper was introduced in [32] and [45], where a clocked data path was incorporated to control the data flow within a dual-spacer, dual-rail circuit. One drawback of a clocked data path is that attackers can isolate the start and end of an operation more precisely, guided by the timing reference generated by the clock, thus segmenting the power and energy side-channel information. The asynchronous nature of D3L, on the other hand, removes the dependency on a clock signal and implements a delay-insensitive handshake protocol to perform operations asynchronously within the circuit. This allows the designer to further mask the start and end of operations in the different sub-blocks of the circuit or instruction processing by providing flatter power traces and more constant energy consumption. Additionally, D3L circuits possess all the benefits of delay-insensitive asynchronous circuits, such as no clock tree, high energy efficiency, robust circuit operation under process/voltage/temperature variations, and low noise/emission. As predicted by the International Technology Roadmap for Semiconductors (ITRS) [46], asynchronous paradigms will become more widely used in industry to increase circuit robustness, decrease power, and alleviate many clock-related issues, and will occupy 30% of the world's total IC area by 2016.
This paper presents the detailed design, implementation, and analysis of dual-spacer dual-rail delay-insensitive logic (D3L), which is capable of mitigating both power- and timing-based side-channel attacks. The paper is organized as follows. Section 2 explains the circuit architecture that makes D3L suitable for mitigating side-channel attacks. Section 3 introduces the Advanced Encryption Standard (AES) architecture, which is implemented using D3L, NCL, and synchronous logic as test vehicles, and describes the details of the D3L AES design. Section 4 describes how the analysis is performed and presents the comparison results among the D3L, NCL, and synchronous designs. Section 5 summarizes the work done in this paper.
Table 1 NCL dual-rail encoding truth table.
State            Rail0   Rail1
NULL (spacer)    0       0
DATA0            1       0
DATA1            0       1
Invalid          1       1
2. Dual-spacer dual-rail delay-insensitive logic (D3L)
2.1. Motivation
Traditional dual-rail delay-insensitive asynchronous logic such as NCL represents a signal with three states: DATA0, DATA1, and NULL (or spacer), as shown in Table 1. These states are encoded using two rails (wires). The two rails are mutually exclusive: both cannot be asserted at the same time, which is an illegal state. Asserting Rail0 represents DATA0, while asserting Rail1 represents DATA1. Due to the return-to-spacer protocol, NCL circuits must always return to the spacer after a data cycle before accepting new data, as shown in Fig. 1. While this protocol decouples the data from the switching activity of the two-rail bundle, unbalanced switching between the rails still exists. Take for example the sequence of three consecutive DATA1's shown in Fig. 2 (the fan-shaped symbols are NCL threshold gates [38]): while Rail1 alternates and charges/discharges its load capacitance, Rail0 remains at logic 0. The opposite occurs in the case of three consecutive DATA0's. Therefore, although the total switching activity of the two-rail bundle is the same in both cases, Rail1 and Rail0 most likely drive different loads, as illustrated in Fig. 2, so the difference in switching activity between the two rails causes a difference in power consumption, which makes it possible to decode which rail is switching.
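The imbalance described above can be made concrete by counting transitions per rail. The following is a minimal behavioural sketch, assuming the Table 1 encoding and an initial NULL state; the function name `ncl_rail_toggles` is hypothetical and is not taken from the paper.

```python
# NCL single-spacer encoding from Table 1, written as (Rail0, Rail1) pairs.
NULL = (0, 0)
DATA0 = (1, 0)
DATA1 = (0, 1)


def ncl_rail_toggles(bits):
    """Count transitions on each rail when bits are sent with the return-to-NULL protocol."""
    state = NULL                      # assumption: the signal starts in the spacer state
    toggles = [0, 0]                  # [Rail0, Rail1]
    for bit in bits:
        # Every data item is followed by a return to the NULL spacer.
        for nxt in (DATA1 if bit else DATA0, NULL):
            for rail in (0, 1):
                toggles[rail] += int(state[rail] != nxt[rail])
            state = nxt
    return tuple(toggles)


print(ncl_rail_toggles([1, 1, 1]))   # three DATA1s: only Rail1 switches
print(ncl_rail_toggles([0, 0, 0]))   # three DATA0s: only Rail0 switches
```

The bundle as a whole switches six times in both cases, but all of the activity falls on a single rail, which is what an attacker can exploit when the two rails drive different loads.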
Fig. 1. NCL single-spacer protocol sequence (Spacer (NULL), Data X, Spacer (NULL), Data Y, Spacer (NULL), Data Z, Spacer (NULL)).
Fig. 2. Unbalanced rail switching activities with different load capacitances for NCL.
Fig. 3. D3L dual-spacer protocol sequence.
Table 2 D3L truth table.
State              Rail0   Rail1
All-zero spacer    0       0
DATA0              1       0
DATA1              0       1
All-one spacer     1       1
Fig. 5. D3Lmn gate (inputs IN1, IN2, ..., INn; threshold m; output OUT).
2.2. Dual-spacer dual-rail delay-insensitive logic (D3L)
D3L solves this unbalanced switching problem by adding a second spacer, which is the major difference between D3L and NCL. While asserting both rails simultaneously results in an invalid state in NCL, D3L takes advantage of this combination and uses it as a new spacer, named the all-one spacer, while keeping the previous spacer, as shown in Table 2 [52]. The previous spacer is denoted the all-zero spacer. By alternating between the all-zero spacer and the all-one spacer after every data item, as shown in Fig. 3, D3L gives both rails identical switching activity regardless of the data being processed. For example, in the same case of passing consecutive DATA1's, as shown in Fig. 4, even if the rails have unbalanced capacitive loads, the switching activities of the two rails are identical and the power-data correlation is minimized.
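The balancing effect of the dual-spacer protocol can be checked with a small behavioural model. This is a sketch under stated assumptions: the encoding follows Table 2, the signal starts in the all-zero spacer, and the spacer alternates after every data item; the name `d3l_rail_toggles` is hypothetical.

```python
# D3L dual-spacer encoding from Table 2, written as (Rail0, Rail1) pairs.
ALL_ZERO = (0, 0)
ALL_ONE = (1, 1)
DATA0 = (1, 0)
DATA1 = (0, 1)


def d3l_rail_toggles(bits):
    """Count transitions on each rail when spacers alternate between all-zero and all-one."""
    state = ALL_ZERO                  # assumption: start from the all-zero spacer
    next_spacer = ALL_ONE             # so the first data item is followed by all-one
    toggles = [0, 0]                  # [Rail0, Rail1]
    for bit in bits:
        for nxt in (DATA1 if bit else DATA0, next_spacer):
            for rail in (0, 1):
                toggles[rail] += int(state[rail] != nxt[rail])
            state = nxt
        # Alternate the spacer for the next data item.
        next_spacer = ALL_ZERO if next_spacer == ALL_ONE else ALL_ONE
    return tuple(toggles)


for bits in ([1, 1, 1], [0, 0, 0], [1, 0, 1]):
    print(bits, d3l_rail_toggles(bits))
```

In this model each spacer-to-data-to-spacer cycle toggles each rail exactly once, so the per-rail counts depend only on the number of data items and not on their values.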
Fig. 4. Balanced switching activities on both rails for D3L.
2.2.1. D3L gates
As an extension of NCL, D3L adopts NCL's threshold gate concept. The D3L logic family consists of 27 basic gates, which are sufficient to implement all logic functions of four or fewer inputs. As shown in Fig. 5, each D3L gate has n inputs and a threshold m, and is denoted D3Lmn. For example, the D3L34 gate has inputs A, B, C, and D, and asserts its output only when three or more of its inputs are asserted. Its behavior is equivalent to ABC + ABD + ACD + BCD. NCL threshold gates have hysteresis, i.e., once the output of a gate is asserted, it remains asserted until all inputs are de-asserted. D3L gates, on the other hand, do not hold the output once the number of asserted inputs falls below the threshold. Thus, input completeness, a critical requirement of delay-insensitive asynchronous circuits [38], is compromised for D3L. A method to compensate for input completeness, named NCL_X, is presented in Section 2.3. Because no hysteresis is needed, the construction of D3L gates is much simpler than that of their NCL counterparts.
2.2.2. D3L registers
2.2.2.1. Basic D3L register. D3L registers are customized NCL registers, modified so that all-one spacers can be recognized and stored. A D3L register has two basic components: a modified NCL register and a KI generator. The modified NCL register is shown in Fig. 6. Two NCL TH22 gates, or two-input Muller C-elements, are used to latch data (a TH22 gate has two inputs and a threshold of two; it asserts its output only when both inputs are asserted and de-asserts its output only when both inputs are de-asserted). Each gate has two inputs, one of which is either the A0 or the A1 rail, and the other is the KI (acknowledge in) signal from the next registration stage. A reset (RST) signal is added to reset the output to a known state. A Boolean XNOR gate is used to generate the new KO signal instead of the NOR gate used in a regular NCL register, so that all-one spacers can be correctly detected without affecting other states.
Fig. 6. Modified NCL register.
The KI generator makes sure that all-one spacers can be requested as spacers instead of being treated as corrupted data. To accomplish this, the KI signal from the next registration stage needs to be processed along with some other signals to correctly set the state of the enable signal named KI_gen, which is connected to the KI input of the modified NCL register. Four signals are needed for the KI generator to work: the KI input, a ps (previous spacer) signal that keeps track of the previous spacer, and the dual-rail output (B0 and B1) of the modified NCL register. The KI_gen signal follows the Boolean logic in Eq. 1:
\[ \mathit{KI\_gen} = \overline{KI}\,\overline{ps}\,(B0 \oplus B1) + KI\,ps\,(B0 \oplus B1) + B0\,B1\,\overline{KI} + \overline{B0}\,\overline{B1}\,KI \tag{1} \]
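The difference between the hysteresis-free D3L gates of Section 2.2.1 and the NCL TH22 gates that latch data in the modified NCL register can be illustrated with a behavioural model. This is only a sketch at the logic level, not the transistor-level construction; the names `d3l_gate` and `NclThresholdGate` are hypothetical.

```python
from itertools import product


def d3l_gate(m, inputs):
    """Hysteresis-free D3Lmn gate: the output follows the threshold condition directly."""
    return int(sum(inputs) >= m)


class NclThresholdGate:
    """NCL THmn gate: asserts at >= m asserted inputs, clears only when all are de-asserted."""

    def __init__(self, m):
        self.m = m
        self.out = 0                  # assumption: the gate starts de-asserted

    def step(self, inputs):
        if sum(inputs) >= self.m:
            self.out = 1
        elif sum(inputs) == 0:
            self.out = 0
        # Otherwise hold the previous output (hysteresis).
        return self.out


# D3L34 matches the sum-of-products form ABC + ABD + ACD + BCD.
for a, b, c, d in product((0, 1), repeat=4):
    sop = (a & b & c) | (a & b & d) | (a & c & d) | (b & c & d)
    assert d3l_gate(3, (a, b, c, d)) == sop

# Dropping one input below the threshold: D3L releases, NCL TH34 holds.
th34 = NclThresholdGate(3)
for inputs in [(1, 1, 1, 0), (1, 1, 0, 0), (0, 0, 0, 0)]:
    print(inputs, d3l_gate(3, inputs), th34.step(inputs))
```

The middle step shows why D3L gates alone are not input complete: the output falls as soon as one input leaves, without waiting for the others.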
Eq. 1 uses the previous spacer state, saved in the ps signal, together with the register's output state to pass the desired spacer or data to the next stage of the circuit. In turn, the output enables the modified NCL register to accept all-one spacers, which would otherwise be an illegal state and would drive the system into a deadlock. Table 3 shows the truth table used to derive Eq. 1.
Table 3 KI generator truth table.
Row   B0   B1   KI   ps   KI_gen
1     0    0    0    0    0
2     0    0    0    1    0
3     0    0    1    0    1
4     0    0    1    1    1
5     0    1    0    0    1
6     0    1    0    1    0
7     0    1    1    0    0
8     0    1    1    1    1
9     1    0    0    0    1
10    1    0    0    1    0
11    1    0    1    0    0
12    1    0    1    1    1
13    1    1    0    0    1
14    1    1    0    1    1
15    1    1    1    0    0
16    1    1    1    1    0
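Eq. 1 can be checked exhaustively against Table 3. The sketch below assumes one particular reading of Eq. 1: the parenthesized term is an exclusive-OR of B0 and B1 (i.e., the output holds data), the first product uses the complements of KI and ps, the third the complement of KI, and the fourth the complements of B0 and B1. That reading is an assumption chosen because it reproduces every row of Table 3; the function name `ki_gen` is hypothetical.

```python
from itertools import product

# KI_gen column of Table 3, rows 1-16, with (B0, B1, KI, ps) counting up in binary.
TABLE3_KI_GEN = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0]


def ki_gen(b0, b1, ki, ps):
    """Assumed reading of Eq. 1: ~KI~ps(B0^B1) + KI ps(B0^B1) + B0 B1 ~KI + ~B0~B1 KI."""
    def inv(x):
        return 1 - x

    is_data = b0 ^ b1                 # exactly one rail asserted
    return ((inv(ki) & inv(ps) & is_data)
            | (ki & ps & is_data)
            | (b0 & b1 & inv(ki))
            | (inv(b0) & inv(b1) & ki))


mismatches = [row + 1
              for row, (b0, b1, ki, ps) in enumerate(product((0, 1), repeat=4))
              if ki_gen(b0, b1, ki, ps) != TABLE3_KI_GEN[row]]
print("rows that disagree with Table 3:", mismatches)
```

Read this way, KI_gen passes KI through when the register holds an all-zero spacer, inverts it for an all-one spacer, and equals XNOR(KI, ps) while the register holds data.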
A complete basic D3L register is shown in Fig. 7. The ps signal is generated from the dual-rail Z output of the modified NCL register using a TH22n gate, a reset-to-zero C-element. The ps signal values for an all-zero spacer and an all-one spacer are logic 0 and logic 1, respectively. During initialization, the ps signal must be set to the spacer that the register is reset to, unless the register is reset to data, in which case the ps signal should be set to logic 0.
2.2.2.2. D3L filter register. There are certain situations in which a basic D3L register cannot handle the dual-spacer protocol, e.g., the ring registers used for storing data. Deadlock occurs due to the lack of dual spacers inside the loop, because basic D3L registers cannot generate alternating spacers on their own. Therefore, a different register must be included that can filter and alternate between the two spacers when needed. Such a register is named a filter register; its internal connections are shown in Fig. 8. The spacer filter component analyzes the dual-rail input D, the ps signal, and the KO signal to alternate the spacer while allowing data to flow through unchanged. Its outputs D0_filter and D1_filter, described in Eqs. (2) and (3), respectively, are needed not only to ensure that spacers are filtered, but also to make sure the dual-spacer protocol is enforced.
D0_filter = D0 D1 + KO ps D0 + KO ps D0 + KO ps D1 + KO ps D1 (2)
D1_filter = D0 D1 + KO ps D1 + KO ps D1 + KO ps D0 + KO ps D0 (3)
In a similar fashion to the basic D3L register and its KI generator, the D3L filter register masks its signals to incorporate the all-one spacer. As an example, the three-register D3L ring on the left of Fig. 9 has been modified by replacing register 3 with a D3L filter register, which has the functions needed to alternate spacers when required by the ring to avoid deadlocks in the sequence. The waveform in Fig. 9 also shows how the design allows register 3 to filter spacers correctly: when register 2 hands a spacer to register 3, register 3 converts it to the opposite spacer to continue the circulation of data, thus avoiding deadlock.
2.2.2.3. D3L spacer generator register. In some cases, spacer alternation is not available or does not provide enough cycles, and a local spacer generator is required to fill the gap. For example, a component X may need many cycles to output data, but input data is only provided for one cycle by the previous component.
Fig. 7. A complete basic D3L register.
Fig. 8. D3L filter register inner connections.
Fig. 9. D3L three ring register with filter register.
Fig. 10. D3L spacer generator register and waveforms.
Then, component X needs to compute the data without requesting a spacer from the previous component until the operation is done. However, component X still needs spacers to cycle data out, and these are not available at its input. In such cases, a spacer generator is added to the circuit to generate the correct spacer needed to complete the operation. Fig. 10 depicts a D3L spacer generator register. To generate the correct signals, a D3L filter register is outfitted with a ps signal delay component that provides the functionality needed to generate spacers. The ps signal delay component, shown in detail in Fig. 11, allows ps to change only when KO is logic 1, which indicates that the register is ready for data. Otherwise, the ps value is saved using feedback until the register changes state again. The spacer generator, described by Eqs. (4) and (5), controls the input to the register and switches to the correct spacer, even when the input spacer is out of sequence.
D0_gen = KO ps (D0 + D1) + KO ps D0 + D1 + KO D0 D1 (4)
D1_gen = KO ps (D0 + D1) + KO ps D0 + D1 + KO D0 D1 (5)
Fig. 11. Previous spacer (ps) signal delay component transistor-level design.
The waveform in Fig. 10 shows the behavior of a spacer generator register. After reset, KO is logic 1, and data is latched at time 1. When the KI signal goes low, close to time 2, the output switches to an all-one spacer, ps is updated, and KO goes high. At time 4, the D input becomes an all-one spacer, but the output of the register is an all-zero spacer. Although the input and output spacers differ, the downstream component receives the correct spacer in the dual-spacer protocol sequence. Therefore, there is no deadlock.
2.3. NCL_X
Input completeness requires the circuit to hold its output unchanged until all inputs have made the transition from DATA (spacer) to spacer (DATA). As discussed above, unlike NCL gates, D3L gates are not input complete, thereby compromising delay-insensitivity. Therefore, certain external circuits are needed to ensure input completeness. This need can be seen in Fig. 12, which depicts a D3L AND function. Although the function is input complete when transitioning from an all-zero spacer to data, the transition from data to either spacer is not. Initially developed to support NCL design, the NCL_X technique [47] was chosen to provide D3L with input completeness. Although NCL_X can only detect all-zero spacers, a few simple modifications make it compatible with the dual-spacer protocol. Fig. 13 shows the input complete D3L AND function with NCL_X. The major modification is to the local completion logic, in which OR gates and C-elements are replaced with XNOR gates.
Fig. 12. D3L input incomplete AND function.
Fig. 13. Input complete D3L AND function with completion logic.
2.4. Delay element
Delay elements have been adopted by a wide variety of circuits and applications. For D3L, they are used to mask the timing delay of data processing. As with NCL, a D3L circuit always takes the same amount of time to process the same data, which greatly facilitates timing-based side-channel attacks. Inserting random delays in every computation masks such timing information, thus breaking the timing-data correlation. An N-voltage-controlled delay element [48] has been chosen for D3L circuits. As shown in Fig. 14, the N-voltage-controlled delay element has a data input In, a data output Out, and an N-voltage (Vn) input. The Vn input controls the voltage applied to the n-type transistors, and varying it regulates the time it takes for the output to follow the input. In the actual implementation, Vn is selected between two different voltages, Vdd and Vlow, as shown in Fig. 15. To alternate between delays, a Ctrl signal and its complement are used to control two p-type transistors that switch between the voltage sources. The Ctrl signal can be any internal signal that changes from one data item to the next, thereby generating randomized delays for each operation. Such delays are difficult for attackers to predict since they do not know which internal signals are being used for this purpose. The selection of Vlow is not important as long as it is sufficiently different from Vdd. Note that since D3L circuits are delay-insensitive, the inserted random delays will not jeopardize the circuit's operation. Moreover, by properly sizing the transistors, these delay elements can be used as buffers for large fanout nets, thereby reducing overhead.
Fig. 14. N-voltage-controlled delay element schematic.
Fig. 15. Buffer chain with delay control.
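The effect of a two-level delay element on what an observer can infer from a single latency measurement can be illustrated with a toy model. Everything numeric here is made up for illustration and is not taken from the paper: the logic delays, the two element delays, and the mapping of Ctrl = 0 to Vdd (fast) and Ctrl = 1 to Vlow (slow) are all assumptions, and the names are hypothetical.

```python
# Hypothetical, illustrative numbers only (picoseconds).
LOGIC_DELAY_PS = {0x00: 900, 0xC9: 1200}   # data-dependent delay of the logic itself
ELEMENT_DELAY_PS = {0: 100, 1: 400}        # assumed: Ctrl=0 selects Vdd, Ctrl=1 selects Vlow


def observed_latency_ps(data, ctrl):
    """Latency seen at the output: logic delay plus the selected delay-element setting."""
    return LOGIC_DELAY_PS[data] + ELEMENT_DELAY_PS[ctrl]


def candidates(latency_ps):
    """Input values that are consistent with one observed latency."""
    return sorted({d for d in LOGIC_DELAY_PS for c in ELEMENT_DELAY_PS
                   if observed_latency_ps(d, c) == latency_ps})


# The same input no longer has a single latency ...
print([observed_latency_ps(0x00, c) for c in (0, 1)])
# ... and one latency can be explained by more than one input.
print([hex(d) for d in candidates(1300)])
```

The numbers were chosen so that the two latency ranges overlap; how well real delays overlap depends on the Vlow selection and transistor sizing discussed above.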
3. Design and implementation of D3L, NCL, and synchronous AES cores
3.1. AES algorithm overview
The AES algorithm [49] is widely used for the protection of sensitive data. AES processes input blocks of 128 bits, known as plaintext, using cipher key lengths of 128, 192, or 256 bits. The length of the cipher key determines the number of iterations, or rounds, that a plaintext undergoes in the algorithm. In every round, the State, an intermediate result viewed as a 4 × 4 two-dimensional array of bytes, is subject to the SubByte, ShiftRows, MixColumns, and AddRound transformations, except for the last round, in which the MixColumns transformation is omitted. A different sub-key is created every round using the key expansion routine. The sub-key is then added to the State with an AddRound transformation.
3.2. AES implementation using D3L
The D3L AES core performs the transformations and key scheduling in parallel. The design can easily be integrated into any asynchronous system, because externally it acts as a regular register with KI and KO signals. The core is divided into five major components: the InitialRound block, the KeyExpansion block, the AESLoop block, the AESControl block, and the LastRound block, as depicted in Fig. 16. The InitialRound block is in charge of receiving new data as well as controlling spacer generation. The KeyExpansion block receives the cipher key from the InitialRound block and generates the key schedule in the form of sub-keys during every round. The sub-keys are passed to the AESLoop block, which also receives input data from the InitialRound block and performs the transformations. The LastRound block executes the last round of transformations by combining the output of the AESLoop block with the last sub-key. Finally, the AESControl block coordinates the communication between blocks, and generates the signals that direct the data traffic as well as the Rcon values that are used by the KeyExpansion block.
Fig. 16. D3L AES core top-level diagram.
3.2.1. InitialRound block
The InitialRound block is the data gateway to the D3L AES core. Fig. 17 depicts the connections inside the block. Two 128-bit basic registers save the plaintext and cipher key inputs. The KO output prepares the core for a new input by resetting components and also acts as the primary KO output signal, which reports the status of the core.
Fig. 17. InitialRound block detailed schematic.
3.2.2. AESLoop block
The AESLoop block, depicted in Fig. 18, computes the nine rounds of the main algorithm loop that follow the InitialRound for a 128-bit cipher key. The major unit of this block, named AESTransform, consists of a ring register, which stores data from previous rounds, and logic blocks that compute the SubByte, ShiftRows, MixColumns, and AddRound transformations.
Fig. 18. AESLoop block schematic.
3.2.3. KeyExpansion block
The KeyExpansion routine generates the key schedule in the form of 128-bit sub-keys. A sub-key is generated at every round either from the original key or from a previously generated sub-key, depending on the state of the system. For computation, the key is broken down into four words, with an Rcon constant fed from the control logic. Once all words have been computed, the output is sent to the next stage in the KeyExpansion block, as shown in Fig. 19, which communicates with other blocks and distributes the right sub-key to the AESLoop and LastRound blocks.
Fig. 19. KeyExpansion block schematic.
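The division of the AES rounds among the InitialRound, AESLoop, and LastRound blocks can be summarized as a round schedule. This is a sketch only: the round counts per key length come from the AES standard, while the mapping of rounds to block names follows the description above for a 128-bit key and is merely assumed to carry over to longer keys. The names `ROUNDS_BY_KEY_BITS` and `round_schedule` are hypothetical.

```python
# Number of rounds per cipher key length, as defined by the AES standard.
ROUNDS_BY_KEY_BITS = {128: 10, 192: 12, 256: 14}


def round_schedule(key_bits=128):
    """Return (block, transformations) pairs in execution order."""
    rounds = ROUNDS_BY_KEY_BITS[key_bits]
    # The initial key addition happens before the first full round.
    schedule = [("InitialRound", ["AddRound"])]
    # All rounds except the last apply the four transformations.
    for _ in range(rounds - 1):
        schedule.append(("AESLoop", ["SubByte", "ShiftRows", "MixColumns", "AddRound"]))
    # The last round omits MixColumns.
    schedule.append(("LastRound", ["SubByte", "ShiftRows", "AddRound"]))
    return schedule


for block, transforms in round_schedule(128):
    print(block, transforms)
```

For a 128-bit key this yields nine AESLoop iterations and eleven AddRound steps, so eleven 128-bit sub-keys (including the original key) are consumed.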
3.2.4. AESControl block
The AESControl block, which is a state machine, generates the control signals for the multiplexers in the AESLoop and KeyExpansion blocks. Additionally, the block generates Rcon, a 32-bit dual-rail number needed to compute sub-keys in the KeyExpansion block. A stop signal is also generated to signal the LastRound block that the cipher computation is done. The LastRound block then releases the data to the core output. Fig. 20 depicts the diagram of the AESControl block.
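For reference, the Rcon values themselves are fixed by the AES standard: the most significant byte of each 32-bit word is a successive power of x in GF(2^8), and the other three bytes are zero. The sketch below computes them in software; it illustrates the values only and makes no claim about how the AESControl state machine produces them in hardware. The names `xtime` and `rcon_words` are hypothetical.

```python
def xtime(b):
    """Multiply a byte by x in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def rcon_words(rounds=10):
    """Return the 32-bit Rcon words used by the key expansion, one per round."""
    words, rc = [], 0x01
    for _ in range(rounds):
        words.append(rc << 24)        # Rcon word layout: [rc, 0x00, 0x00, 0x00]
        rc = xtime(rc)
    return words


print([f"{w:08X}" for w in rcon_words()])
```

In a dual-rail implementation each of these 32 bits would additionally be carried on two rails, as described above.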
3.2.5. LastRound block The LastRound block, as shown in Fig. 21, transforms the last State using the sub-key and sends the result to the register. The comp_check signal communicates with the rest of the blocks to broadcast the completion status of the transformations and allows signaling the request for new data. Although this request does raise a flag when the last round is done, time misalignment created by the inherent delay-insensitive asynchronous design
266
W. Cilio et al. / Microelectronics Journal 44 (2013) 258–269
32 To Mux select 5 mux_sel Rcon A Z Increment Logic
To KeyExpansion 5 Z
A
A
TH22d
5
Z
A
Basic D3L Register 5-bit
stop_sig
KI KI_1 RST 2
Reset Count To LastRound
KO_1
Z
KO 2
Z Reset Logic
D3L Filter Register 5-bit
KI_2
Reset Count
A
Reg 3
comp_check
KI RST
KO TH22
5
A
Z Increment Logic
Reg 2 Cipher_Done
Basic D3L Register 5-bit KO
2d
A
Reg 1
comp_check
Reset Count
5
5 A
TH22
KI KI_3 RST
Reset Count
RST Reset Count
KO_3
KO_2
Fig. 20. AESControl block schematic.
Fig. 21. LastRound block schematic.
and the randomized delays caused by the use of delay elements make the power traces difficult to analyze.
Fig. 22 input traces (hex), shown with the a2d_/ccheck signal: D1 = 00, C9, FF, 36, 00; D0 = 00, 36, FF, C9, 00.
3.3. AES implementation using NCL and synchronous logic
The NCL AES core is very similar to the D3L design. One difference between the two is the use of registers: while the D3L design uses three types of registers, the NCL core needs only one type of register to implement its single-spacer protocol. Another difference is the InitialRound block: while the D3L design uses a generator register to generate the correct spacer, the NCL counterpart needs only a reset block. The synchronous AES core is also very similar in structure to the two asynchronous counterparts. An on-the-fly SubByte computation scheme is implemented to avoid the overhead and complexity of look-up tables. An extra clock is added to facilitate the analysis of a side-channel attack by isolating the SubByte block from the others, the rationale for which is presented in the next section. Note that the synchronous core is a direct implementation of the AES algorithm without any additional protection against side-channel attacks.
4. Results and analysis
The objective of this work is to evaluate the side-channel attack mitigation effectiveness of the proposed D3L technique and to compare it with the NCL and synchronous counterparts. In order to have a fair comparison, all three AES cores are designed without any special adjustments that would affect their side-channel attack resistance. The simulations are performed at the transistor level; therefore, parasitics of interconnects and transistor layouts are excluded.
Fig. 22 output traces (hex): Z1 = 00, DD, FF, 05, 00; Z0 = 00, 22, FF, FA, 00.
Fig. 22. Simulation waveform for D3L S-box.
This is due to the fact that AES cores are relatively simple circuits with a small number of transistors and a short total interconnect length, so the impact of these parasitics on the attacks is minimal. The IBM 5AM 0.5 μm process has been used to design all three circuits. The simulation is performed with a 3.3 V VDD at 25 °C. While the full design has been simulated at the transistor level for functionality verification, these simulations take a very long time to finish. Therefore, for the attack implementations, the simulations are performed on sub-circuits instead. In an AES calculation, the original 128-bit secret key and plaintext first undergo an AddRound transformation. A SubByte transformation, which contains 16 S-Boxes that each take an 8-bit input, is then applied to each output byte of the AddRound. It is only necessary to attack one S-Box at a time, which has 2^8 or 256 possible key combinations, allowing for fast brute-force attacks. A set of sample Cadence UltraSim simulation waveforms of a D3L S-Box is shown in Fig. 22, in which Ccheck is the handshaking signal, D0 and D1 are the dual-rail inputs in hexadecimal, and Z0 and Z1 are the dual-rail outputs in hexadecimal.
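To make the dual-rail, dual-spacer reading of Fig. 22 concrete, the sketch below classifies a pair of 8-bit rail values as the all-zero spacer, the all-one spacer, or DATA. It assumes, based on the values visible in Fig. 22, that a DATA word has complementary rails (rail0 is the bitwise inverse of rail1) and that the spacers drive both rails to 00 or FF. The function names are illustrative only.

```python
AZS = (0x00, 0x00)  # all-zero spacer: both rails low
AOS = (0xFF, 0xFF)  # all-one spacer: both rails high


def encode_byte(value: int) -> tuple[int, int]:
    """Encode a byte as (rail1, rail0) with complementary rails."""
    return value & 0xFF, ~value & 0xFF


def classify(rail1: int, rail0: int) -> str:
    """Label a dual-rail byte as a spacer, DATA, or an invalid state."""
    if (rail1, rail0) == AZS:
        return "AZS"
    if (rail1, rail0) == AOS:
        return "AOS"
    if rail1 ^ rail0 == 0xFF:
        return f"DATA {rail1:02X}"
    return "INVALID"


# Input rail values as they appear in Fig. 22 (D1 and D0, in hex).
d1 = [0x00, 0xC9, 0xFF, 0x36, 0x00]
d0 = [0x00, 0x36, 0xFF, 0xC9, 0x00]
sequence = [classify(a, b) for a, b in zip(d1, d0)]
print(" -> ".join(sequence))
```

Read this way, the input trace follows the spacer, data, spacer, data, spacer pattern, with the two spacers alternating between the DATA words.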
The alternation between the various input/output DATA values and the two spacers (i.e., 00 for the all-zero spacer and FF for the all-one spacer) is clearly shown in the simulation. Correlation Power Analysis (CPA) is applied to attack all three designs using power/energy as well as timing information. Its strong statistical model makes the attack stronger than the original DPA, as described in [50]. It is assumed that after one key byte is successfully discovered, the other 15 bytes can be recovered in a similar fashion. Since both asynchronous designs lack a clock signal, the switching activities may happen at different times, creating misalignments in the side-channel signals. These misalignments can be eliminated by using the circuit energy information instead of power. As shown in [12], the attacker can break the electrical current traces into smaller pieces and compute the energy of each piece separately. This approach is adopted to attack both asynchronous designs. A timing attack is performed only against the D3L and NCL designs, to compare their performance. A timing attack on the synchronous design is not performed because memory-less sequential synchronous circuits are resilient to timing attacks: they operate in a fixed number of clock cycles, so their timing is independent of data patterns. The total time of computation is used as the timing reference to analyze the timing-data correlation. 256 vector files, one for each possible plaintext input, are generated, each using the same key and the same input timing. These simulations begin with the circuit in its reset state. Next, the plaintext and key are given. After the ciphertext is available, the appropriate spacer is given, followed by the plaintext and the next spacer. This accommodates the dual-spacer cycle of the D3L design. After each simulation, the instantaneous current drawn by the target node, in this case the power supply of the target S-Box, is recorded in 1 ps time steps.
The end result is a set of 256 power traces, one for each simulated plaintext. To solve the issue of trace alignment in the asynchronous designs, the values of each trace are integrated to generate energy data. These sets of energy data are used along with the predicted transistor models for each design as the inputs to the CPA program. The timing information is gathered by counting the number of time steps in each power trace during which current is being drawn. The assumption is that the current drawn will be 0 or close to 0 when the circuit is inactive and nonzero when it is active. The sum of active time steps represents the calculation time of each simulation. Much like the energy-based attacks, the sets of 256 time values are used as inputs to the CPA program for the time-based attacks. The energy data cover the entire data-spacer-data-spacer cycle of the operation. The time data were split over each part of the cycle. The experimental process begins by designing the three AES cores presented in Section 3 in VHDL. Once all three designs are
completed, the code is imported into Cadence to generate a transistor-level schematic/netlist of each design using the IBM 5AM 0.5 μm process. For the NCL and D3L cores, a 1-ohm resistor is connected in series between the voltage source and the voltage input of the SubByte block. For the synchronous design, however, the resistor is connected in series between the voltage source and the second register in the AESLoop block. With the resistor in place, simulations can be performed and the node between the resistor and the SubByte block or register, whichever the case may be, can be monitored for power fluctuation. However, after a preliminary simulation of each design using the Cadence simulation tools (UltraSim and Spectre), it was found that the simulation takes a very long time to finish. In order to shorten the simulation time, Synopsys NanoSim is chosen as the transistor-level simulator instead. For the synchronous design, the simulation is straightforward. With the reset signal asserted at time 0, the registers are set to logic 0, while the cipher key and the plaintext are loaded into the AddRound, which computes the result as soon as it receives the data since it is just a logic block. Then, at the first clock edge, the result of the AddRound is latched, and the SubByte calculates the result. Since the second register still holds a zero value, the collection of switching activity is very simple. Finally, on the second rising edge of the clock, the second register latches the data, and all the switching activity is recorded as current fluctuation for the second register. On the other hand, during an NCL design simulation, a full NULL-DATA-NULL cycle is performed. Even though the simulation is run as a full cycle, the information obtained is broken into two pieces to calculate the energy. The first is the switching activity from NULL to DATA, while the second is the change from DATA to NULL. A similar approach is used for the D3L design.
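The post-processing described above (integrating the supply current over each switching section to obtain energy, and counting active time steps to obtain computation time) can be sketched as follows. The 1 ps step and the 3.3 V supply come from the setup described here. The segment boundaries, the activity threshold, and the function names are hypothetical choices made for illustration, and the authors' actual tooling may differ.

```python
def segment_energies(current, boundaries, dt=1e-12, vdd=3.3):
    """Energy per segment, E = VDD * sum(i[k]) * dt, over sample index ranges.

    `current` holds supply-current samples in amperes; `boundaries` is a list
    of sample indices delimiting consecutive segments (an assumed convention).
    """
    energies = []
    for start, stop in zip(boundaries, boundaries[1:]):
        charge = sum(current[start:stop]) * dt  # rectangle-rule integral
        energies.append(vdd * charge)
    return energies


def active_time(current, dt=1e-12, threshold=1e-6):
    """Total time during which the current magnitude exceeds the threshold."""
    return sum(1 for sample in current if abs(sample) > threshold) * dt


# Toy trace: two bursts of activity, e.g. NULL-to-DATA then DATA-to-NULL.
trace = [0.0, 0.0, 1e-3, 2e-3, 0.0, 0.0, 1e-3, 0.0]
energies = segment_energies(trace, boundaries=[0, 4, 8])
busy = active_time(trace)
```

Splitting the trace at fixed boundaries yields one energy value per switching section, which is the quantity the energy-based attack correlates, while `active_time` gives the single timing value per simulation used by the timing attack.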
Samples were simulated using the AZS-DATA-AOS-DATA-AZS full cycle, where AZS and AOS stand for all-zero spacer and all-one spacer, respectively. The cycle is broken into four switching activity sections: AZS-DATA, DATA-AOS, AOS-DATA, and DATA-AZS. In order to evaluate the effectiveness of the timing attack, an attack on the original D3L circuit, which does not have any delay elements, is compared to an attack on the modified D3L circuit, in which delay elements are added to every output bit line of the first S-Box. The delays are controlled by one signal, which is generated by XORing together all rail1 signals from the previous AddRound transformation until they are merged into a single control signal. The top-level setup is shown in Fig. 23. The timing attack on the modified circuit is performed in exactly the same way as on the original circuit, as described previously.
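The single delay-control signal described above is the XOR (parity) of the rail1 bits. A minimal sketch follows. How that bit selects between delay values is not specified here, so only the parity reduction is shown, and the helper names are hypothetical.

```python
from functools import reduce
from operator import xor


def byte_to_bits(value, width=8):
    """Split an integer into a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def delay_control(rail1_bits):
    """XOR all rail1 bits together into a single control bit (their parity)."""
    return reduce(xor, rail1_bits, 0)


# Example: control bit for one rail1 byte taken from the AddRound output.
control = delay_control(byte_to_bits(0x07))
```

Because the control bit depends on every rail1 bit, flipping any single input bit flips the selected delay, which is what decouples the observed computation time from the S-Box's own data-dependent delay.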
Fig. 23. Modified D3L circuit with added delay elements and delay control for S-box 1.
Table 4 Power attack results for D3L, NCL and synchronous designs.
Design        Key guess   Max correlation coefficient   Data type collected
D3L           Failed      0.354                         Energy
NCL           Success     0.428                         Energy
Synchronous   Success     0.668                         Power
Table 5 Timing attack results for the asynchronous designs with and without randomized delay.
As mentioned before, Differential Power Analysis (DPA) is a power-based side-channel attack introduced in [1]. DPA takes advantage of information leaking from an electronic circuit and uses statistical methods to guess the cipher key value. Since the emergence of DPA, however, many improvements and alternative approaches have been proposed to attack a circuit more efficiently. One approach is Correlation Power Analysis (CPA) [2], which is based on a Hamming distance model. While many analyses are based on the Hamming weight, which only looks at the number of bits set in a byte or word, CPA uses the Hamming distance, which is the total number of bits that switch from 0 to 1 or vice versa, to estimate the power consumed by a certain State change. Once the power estimate is made using the Hamming distance, CPA uses the Pearson product-moment correlation coefficient to compare the guess with the actual reading and rate the accuracy of the guess. More details about CPA can be found in [2] and [51]. In this paper, CPA has been chosen to attack all three designs because it provides a stronger statistical model for the attacks. Although the Hamming distance model is applied without problems in the attack on the synchronous design, the asynchronous designs cannot be attacked using the same model. Due to the switching behavior of dual-rail signals, the Hamming distance is uninformative: a switch always occurs regardless of the data pattern, making the Hamming distance always the same. Instead, the attacker can estimate the number of transistors that switch, which is proportional to the energy spent, for every possible input to the SubByte function. Using this table of transistor counts, the estimated energy can be compared to the energy data collected from the simulation using the Pearson product-moment correlation coefficient to obtain the accuracy of the guess.
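The correlation step can be illustrated with a small, self-contained sketch. For each of the 256 key guesses it predicts a leakage value for every plaintext, computes the Pearson correlation coefficient against the measurements, and keeps the best guess. As a stand-in leakage model it uses the Hamming weight of the S-box output, which equals the Hamming distance from an all-zero register. For the asynchronous designs this table would be replaced by the transistor switching counts, which are not given here. All function names are assumptions, and this is not the authors' program.

```python
import math


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
# Stand-in leakage table indexed by S-box input (an assumption, see above).
LEAKAGE = [bin(SBOX[x]).count("1") for x in range(256)]


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cpa_attack(plaintexts, measurements, leakage=LEAKAGE):
    """Return (best key guess, magnitude of its correlation coefficient)."""
    best_key, best_corr = 0, -1.0
    for guess in range(256):
        predicted = [leakage[p ^ guess] for p in plaintexts]
        corr = abs(pearson(predicted, measurements))
        if corr > best_corr:
            best_key, best_corr = guess, corr
    return best_key, best_corr


# Synthetic, noise-free measurements generated from a known key byte.
secret = 0x2B
plaintexts = list(range(256))
traces = [0.5 * LEAKAGE[p ^ secret] + 1.0 for p in plaintexts]
key_guess, coefficient = cpa_attack(plaintexts, traces)
```

With noise-free synthetic data the correct key byte correlates perfectly with the model. Real measurements give lower coefficients, and the guess is trusted only when the winning coefficient stands out from the others.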
A Java program for the NCL and D3L designs is developed using the CPA model to automatically correlate the energy estimate with the simulation results. The program takes as input the 256 energy readings from the simulations and compares them to a table that contains the number of switching transistors for any given input to the SubByte block. Once the correlation is complete, the program returns the key guess along with the correlation coefficient for the guess. Additionally, another program is developed to generate the table used in the CPA program. It simulates a SubByte block for the NCL and D3L designs, counts the number of transistors that switch for each of the 256 possible inputs, and saves the counts to a table. The results presented in Table 4 [52] show that the attacks on the synchronous and NCL designs are successful. The success of the attack on the synchronous design is expected, because synchronous designs have a strong correlation between power and data pattern. The correlation coefficient indicates how strongly the measurements and the actual data are correlated: a coefficient of 1 means the two are perfectly correlated, while 0 means they are not correlated. The high coefficient for the synchronous design confirms that it is very vulnerable to power attacks and can be easily cracked. Similarly, the attack on the NCL design yields results that are also expected. Although the NCL design is more resilient to such an attack, the unbalanced load capacitances between the two rails of each signal still leak power information
Design                      Key guess   Max correlation coefficient   Data type collected
NCL                         Success     0.400                         Total computation time
D3L                         Success     0.373                         Total computation time
D3L with randomized delay   Fail        0.106                         Total computation time
Table 6 Overheads of D3L AES core.
Design        Area (mm²)   Energy (mJ)   Delay (ns)
D3L           6.27         6.012         325
NCL           3.28         2.208         462
Synchronous   1.5          1.356         153
that facilitates CPA. However, it is important to point out that the correlation coefficient is much lower than that of the synchronous design, meaning the attacker needs more samples to succeed. On the other hand, the CPA attack against the D3L design presents a couple of reassuring results. First, the attack fails to guess the correct cipher key. Second, the attack fails with the lowest correlation coefficient among all three attacks, i.e., the attack is not able to find a good correlation between energy and data. Table 5 shows the results for the timing-based CPA attack on the asynchronous designs. The successful timing attacks on the NCL and the original D3L cores are not a surprise because of the strong timing-data correlation in asynchronous circuits. Even though the correlation coefficients appear to be low (0.400 and 0.373), the vulnerability at that level is still enough for guessing the right key. On the other hand, the D3L design with randomized delay performs much better than its original counterpart. The CPA attack fails to guess the correct key, and the correlation coefficient is very low (0.106), which indicates there is very little correlation between data and timing information. Therefore, the inserted delay elements successfully break the data-timing correlation in the design. This significantly enhanced security of D3L comes with circuit overhead. As shown in Table 6, although the D3L AES core is faster than the NCL design, its area and energy are the highest among the three designs. This overhead comes from two sources. The first is the NCL_X completion checking. The second is the increased registration overhead caused by having to handle two spacers.
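The overhead in Table 6 can be expressed as ratios relative to the NCL and synchronous cores. The short calculation below uses only the values in Table 6; the dictionary layout and the names are illustrative.

```python
# Values copied from Table 6: (area, energy, delay), in the table's units.
TABLE6 = {
    "D3L": (6.27, 6.012, 325.0),
    "NCL": (3.28, 2.208, 462.0),
    "Synchronous": (1.5, 1.356, 153.0),
}
METRICS = ("area", "energy", "delay")


def overhead_ratios(design, baseline, table=TABLE6):
    """Ratio of each metric of `design` to the same metric of `baseline`."""
    return {
        name: table[design][i] / table[baseline][i]
        for i, name in enumerate(METRICS)
    }


for baseline in ("NCL", "Synchronous"):
    ratios = overhead_ratios("D3L", baseline)
    print(baseline, {name: round(value, 2) for name, value in ratios.items()})
```

Under these figures, the D3L core uses roughly 1.9 times the area and 2.7 times the energy of the NCL core while taking about 0.70 of its delay; relative to the synchronous core, the ratios are about 4.2 for area, 4.4 for energy, and 2.1 for delay.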
5. Conclusion
In this paper, a dual-spacer dual-rail delay-insensitive logic (D3L) circuit design methodology was proposed to mitigate power- and timing-based side-channel attacks. With its dual-spacer protocol, D3L not only decouples data from switching activity at the signal level, but also balances the switching activity between the rails of each dual-rail signal, making it much more difficult for an attacker to correlate data with power consumption. Additionally, D3L mitigates timing attacks by inserting delay elements to break the timing-data correlation that exists in delay-insensitive asynchronous designs. Three AES cores have been designed using D3L, NCL, and synchronous logic, and CPA attacks have been applied to these circuits. Results show that the D3L design successfully mitigates the power-based CPA attack
with the lowest correlation coefficient, while the attacks succeed against the other two designs. Moreover, the D3L design with randomized delay defeats the timing-based CPA attack with a very low correlation coefficient. Therefore, D3L is a promising logic candidate for designing tamper-resistant circuits and systems against power- and timing-based side-channel attacks.
References
[1] J. Jaffe, P. Kocher, B. Jun, Differential power analysis, in Proc. 19th International Advances in Cryptology Conference CRYPTO ’99, Santa Barbara, CA, 1999, pp. 388–397.
[2] E. Brier, C. Clavier, F. Olivier, Correlation power analysis with a leakage model, in Proc. Cryptographic Hardware and Embedded Systems CHES 2004, Cambridge, MA, Aug. 2004, pp. 16–29.
[3] I. Blake, G. Seroussi, N. Smart, J.W.S. Cassels, Advances in Elliptic Curve Cryptography, Cambridge University Press, New York, NY, USA, 2005.
[4] T. Messerges, E. Dabbish, R. Sloan, Investigations of power analysis attacks on smartcards, USENIX Workshop on Smartcard Technology (1999) 17–17.
[5] J. Coron, Resistance against differential power analysis for Elliptic Curve Cryptosystems, 1st International Workshop on Cryptographic Hardware and Embedded Systems, 1999, pp. 292–302.
[6] B. Boer, K. Lemke, G. Wicke, A DPA Attack against the Modular Reduction within a CRT Implementation of RSA, 4th International Workshop on Cryptographic Hardware and Embedded Systems, 2002, pp. 228–243.
[7] S. Serner, W. Colin, More detail for a combined timing and power attack against implementations of RSA, IMA International Conference, No. 9, 2003, pp. 245–263.
[8] S. Ors, F. Gurkaynak, E. Oswald, B. Preneel, Power-analysis attack on an ASIC AES implementation, ITCC (2004) 546–552.
[9] G. Boracchi, A Study on the Efficiency of Differential Power Analysis on AES S-Box, Technical Report 2007-17, DEI Politecnico di Milano.
[10] S. Chari, C. Jutla, J. Rao, P. Rohatgi, A Cautionary Note Regarding Evaluation of AES Candidates on Smart Cards, 2nd Advanced Encryption Standard Candidate Conference, 1999, pp. 133–147.
[11] O. Berna, O. Elisabeth, P. Bart, Power-analysis attacks on an FPGA: first experimental results, CHES (2003) 35–50.
[12] S. Ors, F. Gurkaynak, E. Oswald, B. Preneel, Power-analysis attack on an ASIC AES implementation, ITCC (2004) 546–552.
[13] F. Mace, F. Standaert, J. Quisquater, J. Legat, A design methodology for secured ICs using dynamic current mode logic, PATMOS (2005) 550–560.
[14] I. Verbauwhede, K. Tiri, D. Hwang, P. Schaumont, Circuits and design techniques for secure ICs resistant to side-channel attacks, ICICDT (2006).
[15] M. Aigner, S. Mangard, R. Menicocci, M. Olivieri, G. Scotti, A. Trifiletti, A novel CMOS logic style with data independent power consumption, ISCAS (2005) 1066–1069.
[16] K. Lin, S. Fan, S. Yang, C. Lo, Overcoming glitches and dissipation timing skews in design of DPA resistant cryptographic hardware, DATE (2007) 1265–1270.
[17] V. Sundaresan, S. Rammohan, R. Vemuri, Power invariant secure IC design methodology using reduced complementary dynamic and differential logic, IFIP International Conference on VLSI-SoC, 2007, pp. 1–6.
[18] K. Kulikowski, V. Venkataraman, Z. Wang, A. Taubin, Power balanced gates insensitive to routing capacitance mismatch, DATE (2008) 1280–1285.
[19] Y. Wang, J. Leiwo, T. Srikanthan, L. Jianwen, An efficient algorithm for DPA-resistent RSA, APCCAS (2006) 1659–1662.
[20] D. Mesquita, J. Techer, L. Torres, G. Sassatelli, G. Cambon, M. Robert, F. Moraes, Current mask generation: a transistor level security against DPA attacks, 18th Symposium on Integrated Circuits and Systems Design (2005) 115–120.
[21] S. Haider, L. Nazhandali, Utilizing Sub-threshold technology for the creation of secure circuits, ISCAS (2008) 3182–3185.
[22] M. Hasan, Power Analysis Attacks and Algorithmic Approaches to their Countermeasures for Koblitz Curve Cryptosystems, IEEE Trans. Comput. 50 (10) 1071–1083.
[23] P. Corsonello, S. Perri, M. Margala, An integrated countermeasure against differential power analysis for secure smart-cards, ISCAS (2006) 5611–5614.
[24] S. Yang, W. Wolf, N. Vijaykrishnan, D. Serpanos, Y. Xie, Power attack resistant cryptosystem design: a dynamic voltage and frequency switching approach, DATE (2005) 64–69.
[25] K. Baddam, M. Zwolinski, Evaluation of dynamic voltage and frequency scaling as a differential power analysis countermeasure, 20th International Conference on VLSI Design Held Jointly with Sixth International Conference on Embedded Systems, 2007, pp. 854–862.
[26] J. Ambrose, R. Ragel, S. Parameswaran, RIJID: random code injection to mask power analysis based side channel attacks, DAC (2007) 489–492.
[27] M. Rivain, E. Dottax, E. Prouff, Block ciphers implementations provably secure against second order side channel analysis, FSE (2008).
[28] P.C. Kocher, Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems, in Proc. 16th International Advances in Cryptology Conference CRYPTO ’96, Santa Barbara, CA, 1996, pp. 388–397.
[29] B. Chevallier-Mames, M. Ciet, M. Joye, Low-cost solutions for preventing simple side-channel analysis: side-channel atomicity, IEEE Trans. Comput. 53 (6) (2004) 760–768.
[30] D. Page, N. Smart, Parallel Cryptographic Arithmetic Using a Redundant Montgomery Representation, IEEE Trans. Comput. 53 (11) 1474–1482.
[31] A. Hodjat, D. Hwang, I. Verbauwhede, A scalable and high performance Elliptic Curve processor with resistance to timing attacks, ITCC (2005) 538–543.
[32] D. Sokolov, J. Murphy, A. Bystrov, A. Yakovlev, Design and analysis of dual-rail circuits of security applications, IEEE Trans. Comput. 54 (4) (2005) 449–460.
[33] G. Bouesse, M. Renaudin, S. Dumont, F. Germain, DPA on quasi delay insensitive asynchronous circuits: formalization and improvement, DATE (2005) 424–429.
[34] I. Verbauwhede, K. Tiri, D. Hwang, P. Schaumont, Circuits and design techniques for secure ICs resistant to side-channel attacks, ICICDT (2006).
[35] D. Shang, F. Burns, A. Bystrov, A. Koelmans, D. Sokolov, A. Yakovlev, High-security asynchronous circuit implementation of AES, IEE Proc.-Comput. Digit. Tech. 153 (2) (2006) 71–77.
[36] K. Kulikowski, V. Venkataraman, Z. Wang, A. Taubin, M. Karpovsky, Asynchronous balanced gates tolerant to interconnect variability, ISCAS (2008) 3190–3193.
[37] K. Baddam, M. Zwolinski, Path switching: a technique to tolerate dual rail routing imbalances, Des. Autom. Embed. Syst. (2008).
[38] K. Fant, S. Brandt, NULL convention logic™: a complete and consistent logic for asynchronous digital circuit synthesis, in Proc. Application Specific Systems, Architectures and Processors ASAP 96, Chicago, IL, 1996, pp. 261–273.
[39] J. Di, F. Yang, D3L: a framework on fighting against non-invasive attacks to integrated circuits for security applications, in Proc. Third IASTED International Conference Circuits, Signals, and Systems, Marina del Rey, CA, 2005, pp. 73–78.
[40] J. Wu, Y. Kim, M. Choi, Low-power side-channel attack-resistant asynchronous S-box design for AES cryptosystems, Proc. GLSVLSI (2010) 459–464.
[41] K. Tiri, I. Verbauwhede, Place and route for secure standard cell design, Proc. Smart Card Research and Advanced Application IFIP Conf. (CARDIS ’04), 2004.
[42] T. Popp, S. Mangard, Masked dual-rail pre-charge logic: DPA resistance without routing constraints, Proc. Workshop on Cryptographic Hardware and Embedded Systems (CHES ’05), Lecture Notes in Computer Science, vol. 3659, Springer-Verlag, 2005, pp. 172–186.
[43] T. Popp, M. Kirschbaum, T. Zefferer, S. Mangard, Evaluation of the masked logic style MDPL on a prototype chip, Proc. Workshop on Cryptographic Hardware and Embedded Systems (CHES ’07), Lecture Notes in Computer Science, vol. 4727, Springer-Verlag, 2007, pp. 81–94.
[44] K.J. Kulikowski, M.G. Karpovsky, A. Taubin, Power attacks on secure hardware based on early propagation of data, 12th IEEE International On-Line Testing Symposium (IOLTS 2006), 2006.
[45] D. Shang, A. Yakovlev, A. Koelmans, D. Sokolov, A. Bystrov, Dual-Rail with Alternating-Spacer Security Latch Design, NCL-EECE-MSD-TR-2005-107, Microelectronic System Design Group, School of EECE, University of Newcastle upon Tyne, September 2005.
[46] International Technology Roadmap for Semiconductors report 2009 edition, available at URL: http://www.itrs.net/Links/2009ITRS/Home2009.htm.
[47] A. Kondratyev, K. Lwin, Design of asynchronous circuits using synchronous CAD tools, J. IEEE Des. Test 19 (4) (2002) 107–117.
[48] N.R. Mahapatra, G. Tareen, Comparison and analysis of delay elements, in the IEEE 45th Midwest Symposium on Circuits and Systems (MWSCAS 2002), vol. 2, Aug. 2002, pp. II-473–II-476.
[49] National Inst. of Standards and Technology, Federal Information Processing Standard 197, The Advanced Encryption Standard (AES), 2001, URL: http://csrc.nist.gov/publications/fips/fips197/fips197.pdf.
[50] J. Jaffe, P. Kocher, B. Jun, Introduction to Differential Power Analysis and Related Attacks.
[51] S. Aumonier, Generalized Correlation Power Analysis, in the Workshop ECRYPT 2007, Krakow, Poland, Sept. 2007, URL: http://www.impan.pl/BC/Program/conferences/07Crypt-abs/Aumonier%20-%20SubmissionWorkshopSA.pdf.
[52] W. Cilio, M. Linder, C. Porter, J. Di, S.C. Smith, D.R. Thompson, Side-Channel Attack Mitigation Using Dual-Spacer Dual-Rail Delay-Insensitive Logic (D3L), 2010 IEEE SoutheastCon, March 2010.