Microprocessors and Microsystems 66 (2019) 1–9
Cache timing attacks on NoC-based MPSoCs

Cezar Reinbrecht (a,*), Bruno Forlin (a), Johanna Sepúlveda (b)

(a) Federal University of Rio Grande do Sul, Brazil
(b) Institute for Security in Information Technology, Technical University of Munich, Germany

* Corresponding author. E-mail address: [email protected]. URL: http://lattes.cnpq.br/2486661464139271
Article history: Received 7 June 2018; Revised 3 October 2018; Accepted 25 January 2019; Available online 1 February 2019.

Keywords: MPSoC; NoC; Cache timing attack; NoC timing attack; Collision attack.
Abstract

Rising demands for increased performance, lower energy consumption, connectivity and programming flexibility are nowadays driving the adoption of platforms known as Multi-Processor Systems-on-Chips (MPSoCs). These platforms are composed of several processing elements, custom IP cores, and memories, all of which are in general interconnected by a Network-on-Chip (NoC). As attractive as these characteristics are, they introduce several security concerns, especially when applications with different trust and protection levels share resources. When malicious software acquires access to an IP, it opens a door to external surveillance of the cache memory, the processing units and the NoC communication structure. The cache memory has already been exploited by several authors in ASICs and Systems-on-Chips through Side-Channel Attacks (SCAs). In this work, we expand this concept, exploring timing attacks on caches in the MPSoC scenario. We implement two well-established attacks from the literature on a real hardware platform, the MPSoC Glass. Furthermore, we present the NoC as a novel vulnerability that increases attack efficiency, resulting in the Earthquake attack. Results show that the attacks from the literature can succeed inside the MPSoC and even obtain better results. Additionally, Earthquake improves the base attack by using the NoC timing attack, reducing the remaining attack complexity from 2^36.9 to 2^32 with 2^16.6 encryptions instead of 2^27.97.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

In today's market, where both consumer and industry domains search for systems that combine connectivity and versatility, balancing performance, energy consumption and security is an everyday concern. Multi-Processor Systems-on-Chips (MPSoCs) have been established as a key platform to meet these challenges. They are composed of several processing elements, custom IP cores, and memories, all of which are interconnected by a Network-on-Chip (NoC). The versatility and connectivity introduced by these systems are sometimes overshadowed by the security concerns brought with these functionalities, especially considering the wide range of possible applications, with different origins and trust levels.

One of these concerns materializes as side-channel attacks (SCAs). This category of attacks is one of the most potent threats to MPSoCs: secret information (e.g., personal information and passwords) can be inferred from observations of the system that processes it. These observations can target the physical leakage of devices, e.g., their thermal [1], power [2–5], or electromagnetic [6] properties.
This work, in contrast, focuses on logical leakage such as timing behavior, i.e., the fact that different (secret) inputs take different amounts of time to be processed. In early literature, Kocher [7] described how the execution time of a cryptographic operation can be used to infer the secret key involved in the computations. Further works have leveraged this concept into one of the most dangerous SCAs today, the cache timing attack [8–10]. In the multi-processor domain, the widespread use of cache memories has made timing attacks an imminent threat.

In general, cache attacks can be categorized into trace-driven, access-driven, and time-driven attacks [11]. For trace-driven approaches, the adversary must observe the sequence and results of memory requests to the cache hierarchy. The trace of cache hits and misses gives an insight into the data that has been processed [12]. For access-driven attacks, the adversary actively interferes with the cache activity of the target and thereby learns, from the positions accessed, which data or code has been used during an operation [13]. Time-driven attacks, on the contrary, simply rely on the fact that cache hits and misses cause measurable differences in the overall processing time. Multiple observations of the processing time for different inputs then allow the adversary to draw statistical conclusions about the data and code used during an operation [8–10]. Time-driven attacks applied to the multi-processor environment are the object of study of the present work.
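To make the time-driven setting concrete, the following minimal Python sketch (illustrative only; the victim's encryption routine is a placeholder passed in as a callable, and a wall-clock timer stands in for a hardware cycle counter) shows the kind of measurement such attacks build on: timing many encryptions of random inputs and averaging the times per input byte value.

```python
import os
import time
from collections import defaultdict

def collect_timings(encrypt, samples=10_000):
    """Average encryption time per value of the first plaintext byte."""
    sums, counts = defaultdict(int), defaultdict(int)
    for _ in range(samples):
        pt = os.urandom(16)
        start = time.perf_counter_ns()        # stand-in for a cycle counter
        encrypt(pt)
        sums[pt[0]] += time.perf_counter_ns() - start
        counts[pt[0]] += 1
    return {v: sums[v] / counts[v] for v in sums}

# Dummy "cipher" used only to make the sketch runnable; a real attack would
# time the victim's AES service instead.
means = collect_timings(lambda pt: bytes(b ^ 0x42 for b in pt))
print(max(means.values()) - min(means.values()), "ns spread over byte-0 values")
```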
Unfortunately, design choices of cryptographic implementations have actually facilitated these kinds of attacks. One prominent and thoroughly studied cipher is the Advanced Encryption Standard (AES), whose software implementations frequently rely on large look-up tables. Although these tables significantly improve speed, the secret-dependent accesses to them have repeatedly been exploited in cache attacks. This vulnerability of AES was exploited successfully by Bernstein in 2005 [8]. His attack is based on the existence of key-dependent cache conflicts in the first encryption round. Neve et al. [10] provided implementation details regarding the attack of Bernstein and proposed an extension to the second encryption round. This proposal enabled full key recovery, something the original attack could not guarantee. Thereafter, in 2010, Bogdanov et al. [11] proposed another time-driven attack that exploits so-called cache collisions. Collisions occur when intermediate values during an operation are identical for different inputs and subsequently cause accesses to the same memory location. These accesses are then said to collide. Naturally, the first access brings the requested memory into the cache hierarchy, which speeds up the second access and consequently the overall execution. For AES software implementations, Bogdanov et al. showed that different inputs can cause up to five cache collisions during encryption or decryption. This so-called wide-collision is sufficient to infer the secret key from a set of timing observations.

In this work, we show how these cache timing attacks (Bernstein and Bogdanov et al.) can be applied to an MPSoC setting and we analyze their practicality in this novel scenario. Subsequently, we evaluated the Earthquake attack presented in [14], a combination of the collision attack with established methods from a previous NoC timing attack by Reinbrecht et al. [15], and show that this attack is significantly more powerful. Our results emphasize once more that timing attacks, in particular those exploiting the cache hierarchy, are an imminent threat on NoC-based MPSoCs and must be addressed during design time to ensure the security of the overall system. In summary, our contributions are:

• the implementation and evaluation of a cache timing attack on a real MPSoC hosted on an FPGA,
• the implementation and evaluation of a differential cache-collision attack on a real MPSoC hosted on an FPGA,
• the analysis of the practicability of the attacks in a multi-core environment,
• the evaluation of a cache-based attack, the Earthquake attack, which successfully combines cache-collisions with techniques from NoC timing attacks.

The rest of the paper is organized as follows. Section 2 provides background information about our reference MPSoC, in order to better understand the scenario where the attacks are applied. Section 3 presents previous works related to cache timing attacks. Section 4 explains the attacks from Bernstein and from Bogdanov et al. in detail, and the adaptations required to apply these attacks in our MPSoC architecture. Section 5 gives details about the Earthquake attack, a joint attack on cache and NoC. Section 6 shows the experimental results of these attacks running on an FPGA, before we finally conclude in Section 7.

2. Reference MPSoC

Multi-Processor Systems-on-Chips are complete systems containing multiple processing elements on the same integrated circuit [16].
Besides the processors, the system can comprise co-processors, hardware accelerators, memories and a Network-on-Chip (NoC). NoCs, composed of routers and links, are the typical interconnection solution for integrating several components and can vary in router architecture and topology.
Applications and fields in which MPSoCs are already used include machine learning, Internet-of-Things, high-bandwidth communication, augmented reality, and video encoding/decoding. Any MPSoC can be classified according to the following characteristics:

• the architecture model: which processors and IPs comprise the system;
• the memory model: shared, distributed or distributed-shared strategy;
• the communication model: bus-based or NoC-based topology;
• the software architecture: hardware abstraction layer (HAL), operating system (OS), application programming interfaces (APIs).

To fully understand the attacks described in Sections 4 and 5, we must first explore our chosen architecture, the MPSoC Glass, especially because moderate modifications are required in order for the reference attacks to succeed in an MPSoC environment.

2.1. Reference architecture - MPSoC Glass

The reference architecture is the MPSoC Glass [17], a heterogeneous MPSoC interconnected by a mesh NoC and based on a shared memory organization. The structure and components can be observed in Fig. 1. MPSoC Glass uses the NIOS II processor from Intel FPGAs. Similar to ARM's big.LITTLE technology [18], our architecture uses different NIOS II implementations: the economy core (NIOS II/e from [19]) and the fast core (NIOS II/f from [19]). The NIOS II fast core aims at high-performance applications, system management and primary tasks of the system. Minor tasks and parallel applications are intended to be executed by a group of NIOS II economy cores. The NIOS II economy core has low area and power consumption and is suitable to be replicated many times in an MPSoC. Consequently, there are more economy cores in the MPSoC than fast cores, i.e., four fast NIOS II and ten economy NIOS II. Other components integrated into the MPSoC Glass are a shared cache and a UART interface, resulting in a total of sixteen elements.

The memory hierarchy has two cache levels. The level 1 (L1) is a local, direct-mapped instruction and data cache. The level 2 (L2) is a shared unified cache, 16-way set-associative. The shared L2 cache has a direct link to the external main memory. The sizes of the caches and cache lines are parametrized for each design; depending on the target usage, a different configuration can be set. The NoC is composed of 5-port routers with input buffers of eight flits, each flit having 32 bits. The topology is a 4 × 4 mesh, which uses the XY routing algorithm (a short illustrative sketch of this routing is given in Section 2.1.1). Also, MPSoC Glass has a built-in security zone, defined at design time. Elements inside secure zones mutually trust each other, because their communication is exclusive [20,21]. The processors inside the secure zone are not allowed to download or run untrusted software. Processors outside the secure zone can execute any application, even external ones downloaded by the system.

The software architecture is simplified for the first version of MPSoC Glass. Each processor has a hardware abstraction layer (HAL) to initialize the necessary components, and a bare-metal code that runs the tasks directly on the HAL. A developed library set provides all communication features through the NoC and cryptographic functions as services, similar to an API solution. A custom development environment was used to automate parts of the development flow, such as compilation and upload of the binaries to the several processors of the system.

2.1.1. Applicability

The typical scenario used to evaluate all attacks in this work is shown in Fig. 1.
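To make the deterministic communication paths of this scenario concrete, the sketch below computes an XY route on the 4 × 4 mesh mentioned in Section 2.1 (a minimal Python illustration; the (column, row) coordinate convention and the Router<row><col> naming are our assumptions about Fig. 1, not taken from the MPSoC Glass sources).

```python
def xy_route(src, dst):
    """Routers visited from src to dst as (column, row) pairs: X first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    path = [(sx, sy)]
    while sx != dx:                       # travel along X until the column matches
        sx += 1 if dx > sx else -1
        path.append((sx, sy))
    while sy != dy:                       # then travel along Y to the destination row
        sy += 1 if dy > sy else -1
        path.append((sx, sy))
    return path

# Request path from the tile at column 1, row 3 to the tile at column 0, row 0.
print(xy_route((1, 3), (0, 0)))   # [(1, 3), (0, 3), (0, 2), (0, 1), (0, 0)]
```

Read with the Router<row><col> naming, this output corresponds to the sensitive path Router31, Router30, Router20, Router10, Router00 discussed in Section 5.1.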
IP0 is the shared L2 cache, available to all nodes, and it has direct access to the main memory.
Fig. 1. Reference MPSoC Glass, and the scenario established for experiments.
IP13 is the secure co-processor responsible for the security services, such as AES encryption and decryption. IP13 is inside a secure zone, a feature provided by the NoC that creates a controlled environment where only trusted applications are allowed. All processing elements have their own local L1 cache. In this scenario, IP1 of the system was infected, becoming the attacker's IP. IP1 can observe the traffic from IP0 to IP13 (shared cache responses) by exploiting communication collisions. The attacker observes only the traffic behavior, because the content of the packets is not observable. Besides, the malicious IP1 can ask for the security services provided by IP13, such as encryption or decryption with the system secret key.

3. Related work

Timing-based cache attacks were first mentioned by Kocher [7] and Kelsey et al. [22]. According to Kocher, cryptographic algorithms running on a platform always present timing leakages that can be used for performing logical side-channel attacks, also referred to in this work as Architectural Channel Attacks (ACAs). Tsunoo et al. [23] exhibited for the first time a functional timing attack on caches, breaking the DES algorithm. This work opened a new front on ACAs. Bernstein [8] adapted Tsunoo's attack to break AES. The attack exploits timing variations of an AES-128 performance-oriented implementation. It is based on the existence of key-dependent cache conflicts in the first encryption round. Neve et al. [10] provided implementation details regarding the attack of [8] and proposed an extension to the second encryption round. It enabled full key recovery, something that the original attack could not guarantee.

The timing-based attack introduced by Bernstein [8] has been applied to different scenarios. Weiss et al. [24,25] implemented the attack in a virtualized embedded environment. Spreitzer and Plos [26] and Spreitzer and Gérard [27] performed the attack on mobile phone devices. Apecechea et al. [28] and Irazoqui et al. [29] demonstrated the attack in virtualized environments used in commercial cloud computing systems.
In spite of the vast exploration of this technique in different scenarios, no cache timing attack had been reported yet on a NoC-based MPSoC.

Cache-collision attacks on table-based implementations of AES have been proposed by Acıiçmez et al. [30], Bonneau and Mironov [9], and Bogdanov et al. [11]. The attack proposed by Acıiçmez et al. [30] exploits internal collisions in the first two rounds of AES. It is a chosen-plaintext attack and allows the authors to recover a full AES key with 2^26.66 queries when timing the communication via the operating system's network stack. Bonneau and Mironov [9] describe multiple cache-collision attacks targeting the first and last round of AES. While the first-round attack only allows retrieving a limited number of key bits, the last-round attacks allow full-key recovery. The authors also explicitly leverage cache collisions caused by similar, but not identical, intermediate values. With this improvement, they are able to recover a full key with fewer than 2^19 encryptions. Spreitzer and Plos [26] investigate the applicability of Bogdanov's attack on Android-based mobile devices. The authors show that the reliable detection of wide-collisions is challenging in practice and that large, i.e. 64-byte, cache lines further decrease the success rate. The authors conclude that no practical attack is possible in the chosen attack scenario. Lauradoux [31] demonstrates that cache collisions can be detected not only through timing measurements, but also through measurements of the power consumption. The author argues that cache misses have a different power profile than cache hits and that it is therefore possible to also detect cache collisions this way. In this work, we investigate only timing-based attacks, as the power consumption can typically not be determined without additional equipment and physical access to the target device.

4. Reference timing attacks

Timing attacks are a type of side-channel attack that exploits the timing behavior of the shared resources of a system.
The following subsections present the AES cipher in order to provide the background needed to understand the studied attack strategies; then the techniques of Bernstein and of Bogdanov et al. are described, and finally the NoC timing attack is detailed.

4.1. Advanced Encryption Standard (AES)

AES is a widespread symmetric cipher that operates on inputs of 128 bits and uses keys of 128, 192 or 256 bits. Encryption and decryption are implemented in rounds, each consisting of four elementary operations. The longer the key, the more rounds are executed, e.g. 10 rounds for a 128-bit key and 14 rounds for a 256-bit key. The input to AES is organized in blocks of 16 bytes. For encryption, the plaintext is written as p_i, where 0 ≤ i ≤ 15. Each input block is arranged as a 4 × 4 state matrix. The same holds for a 128-bit key, which is written as k_i. As each round uses a distinct round key, a so-called key expansion is performed. This yields k_i^r, where 0 ≤ r ≤ 9. The previous k_i is the initial key in round 0, i.e., k_i^0. For encryption, the four elementary operations are as follows:

• SubBytes: substitute all input bytes with lookups from the so-called substitution box, or S-box.
• ShiftRows: shift each row of the state matrix cyclically to the left; the row index represents the shift value.
• MixColumns: polynomial multiplication over GF(2^8); each column represents a set of coefficients and is replaced by the multiplication result.
• AddRoundKey: XOR operation between the state matrix and the current round key.

In the last round of AES, the MixColumns operation is omitted. For decryption, the inverse round operations are used and performed in reverse order.

4.1.1. Performance-oriented implementation

To improve the performance of AES implementations, the round transformations SubBytes, ShiftRows, and MixColumns can be performed using tables of pre-computed values, so-called transformation or T-tables [32]. This introduces four 1 kB tables (T0, T1, T2, and T3) as well as one additional table T4 for the last round, which omits the MixColumns transformation. The AddRoundKey operation remains unchanged. Implementations using T-tables reduce each round to two operations. First, the state matrix, represented as x_i^r, where 0 ≤ i ≤ 15 and 0 ≤ r ≤ 9, is used as input to the T-tables. The results of the lookups are XOR'ed to form an updated state matrix. Second, the AddRoundKey operation is performed to update the state with the current round key. Eq. (1) illustrates this reduced round of AES.

(x_0^{r+1}, x_1^{r+1}, x_2^{r+1}, x_3^{r+1}) \leftarrow T_0[x_0^r] \oplus T_1[x_5^r] \oplus T_2[x_{10}^r] \oplus T_3[x_{15}^r] \oplus k_0^{r+1}
(x_4^{r+1}, x_5^{r+1}, x_6^{r+1}, x_7^{r+1}) \leftarrow T_0[x_4^r] \oplus T_1[x_9^r] \oplus T_2[x_{14}^r] \oplus T_3[x_3^r] \oplus k_1^{r+1}
(x_8^{r+1}, x_9^{r+1}, x_{10}^{r+1}, x_{11}^{r+1}) \leftarrow T_0[x_8^r] \oplus T_1[x_{13}^r] \oplus T_2[x_2^r] \oplus T_3[x_7^r] \oplus k_2^{r+1}
(x_{12}^{r+1}, x_{13}^{r+1}, x_{14}^{r+1}, x_{15}^{r+1}) \leftarrow T_0[x_{12}^r] \oplus T_1[x_1^r] \oplus T_2[x_6^r] \oplus T_3[x_{11}^r] \oplus k_3^{r+1}    (1)

The last round is computed by substituting T_0, ..., T_3 with T_4. The resulting x_i^{10} is the final ciphertext.
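The structure of such a table-driven round can be sketched as follows (illustrative Python only; the table contents below are dummies rather than the real AES T-tables, and the byte-packing order is a modeling choice):

```python
# Dummy 256-entry tables standing in for T0..T3; real T-tables hold 32-bit words
# derived from the S-box and the MixColumns coefficients.
T = [[(t * 0x01010101 + i * 31) & 0xFFFFFFFF for i in range(256)] for t in range(4)]

# Which state bytes feed each output column, following the index pattern of Eq. (1).
IDX = [(0, 5, 10, 15), (4, 9, 14, 3), (8, 13, 2, 7), (12, 1, 6, 11)]

def ttable_round(state, round_key_words):
    """One reduced AES round: 16 table lookups, XORs, and the round key."""
    out = []
    for col, (i0, i1, i2, i3) in enumerate(IDX):
        word = (T[0][state[i0]] ^ T[1][state[i1]] ^
                T[2][state[i2]] ^ T[3][state[i3]] ^ round_key_words[col])
        out += [(word >> s) & 0xFF for s in (24, 16, 8, 0)]  # unpack back to 16 bytes
    return out

print(ttable_round(list(range(16)), [0, 0, 0, 0]))
```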
4.2. Cache timing attack: Bernstein's attack

The timing-based cache attack on AES implementations relies on the transformation tables. These so-called T-tables incorporate the round transformations and reduce one round of AES to 16 T-table lookups and 16 XOR operations, as discussed in the previous subsection. When a T-table implementation is executed on a processor with a cache, there is in general no guarantee that all entries of the T-tables are permanently cached and quickly available to the processor.
As other processes may share the cache, certain entries of the T-tables will be in conflict with other data that is used concurrently. Hence, cache misses for the T-table entries may occur. Since the round keys are part of the XOR operation in each round, the lookups to the T-tables ultimately depend on the secret AES key.

Bernstein's attack is a timing-based attack composed of two steps. The first is the learning step, whose purpose is to obtain the timing patterns for a key known by the attacker. The second is the attack step, whose purpose is to extract the timing pattern of an unknown key, which is later correlated with the data obtained in the learning step; the correct hypothesis in the correlation will peak. In the learning step, the adversary queries the victim with a set of random plaintexts, which the victim encrypts using the known key, typically filled with zeros. The attacker measures the processing time of each encryption and stores it in matrix L (Eq. (2)). It has 16 rows (one for each byte of the plaintext, represented as p_i, where 0 ≤ i ≤ 15) and 256 columns (one for each value of the corresponding byte). The processing time is entered 16 times, using the byte positions and values to determine the location within the matrix. Over time, the entries in L add up, and the overall average processing time is eventually subtracted from them. As a result, L contains the differences between the mean encryption time for each plaintext position and possible value.
L = \begin{pmatrix} p_0 \\ p_1 \\ \vdots \\ p_{15} \end{pmatrix} \times
    \begin{pmatrix}
      meantime_0 & meantime_1 & meantime_2 & \cdots & meantime_{255} \\
      meantime_0 & meantime_1 & meantime_2 & \cdots & meantime_{255} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      meantime_0 & meantime_1 & meantime_2 & \cdots & meantime_{255}
    \end{pmatrix}    (2)
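A minimal Python sketch of how the learning step fills such a matrix is given below (illustrative only; measure(pt) is a placeholder that must return the encryption time of a 16-byte plaintext, e.g. obtained by timing the local AES copy):

```python
import os

def build_profile(measure, samples=10_000):
    """Return the 16x256 matrix of Eq. (2): mean-time deviation per (position, value)."""
    sums = [[0.0] * 256 for _ in range(16)]
    counts = [[0] * 256 for _ in range(16)]
    total, n = 0.0, 0
    for _ in range(samples):
        pt = os.urandom(16)
        t = measure(pt)
        total, n = total + t, n + 1
        for i, b in enumerate(pt):        # the same timing is entered 16 times
            sums[i][b] += t
            counts[i][b] += 1
    overall = total / n                   # overall average processing time
    return [[(sums[i][v] / counts[i][v] - overall) if counts[i][v] else 0.0
             for v in range(256)] for i in range(16)]
```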
In the attack step, the adversary again queries the victim with a set of random plaintexts. In this real scenario, the victim has an unknown key k̃. Analogously to the learning step, the adversary builds a second matrix L_k̃. The host computer compares both matrices row-wise. One row of L is correlated with the corresponding row of L_k̃ after it has been permuted with a key byte hypothesis; the permutation is the XOR of the index within the row (the byte value) and the key byte suggestion. The cache conflicts are assumed to be identical in the learning and attack steps, so the same T-table entries cause cache hits and misses, and eventually shorter or longer encryption times. Because of the permutation with the correct hypothesis, L_k̃ is transformed into L, which yields a high correlation coefficient, and the correlation peaks for the correct guess. The coefficients of all key byte hypotheses are stored in a new matrix C, with one row per key byte and one column per hypothesis value. All hypotheses with correlation coefficients beneath a threshold are excluded from further consideration. The most likely remaining key byte hypotheses are tested against a known plaintext-ciphertext pair.

As noted by Neve et al. [10], the original attack is limited concerning how many predictions remain after the correlation step. If multiple T-table entries fit in one cache line, their timing behavior is identical. The correlation step produces similar coefficients for all values that share one cache line. The threshold decision can consequently either exclude all or none of the corresponding key byte hypotheses. With b being the number of T-table entries per cache line, the number of key bits that remain after Bernstein's attack is log2(b) · 16. Further details about Bernstein's attack can be found in the work by Neve et al. [10].
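The correlation step can be sketched along these lines (again illustrative Python; L and L_unknown are profiles built as above for the known and the unknown key, and the threshold handling of the full attack is omitted):

```python
def rank_key_byte_hypotheses(L, L_unknown):
    """For each key byte, rank the 256 hypotheses by row-wise correlation."""
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a) ** 0.5
        vb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (va * vb) if va and vb else 0.0

    ranking = []
    for i in range(16):
        scores = []
        for k in range(256):
            # Permute the unknown-key row with hypothesis k (XOR of value index and k).
            permuted = [L_unknown[i][v ^ k] for v in range(256)]
            scores.append((corr(L[i], permuted), k))
        ranking.append(sorted(scores, reverse=True))  # best hypotheses first
    return ranking
```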
4.2.1. Bernstein at MPSoC Glass

The cache timing attack was adapted to our target NoC-based MPSoC. This on-chip scenario brings some advantages regarding the leakage collection: the lack of noise makes it possible to obtain a very robust model (matrix L and matrix L_k̃). After the infection of a processor of MPSoC Glass (IP1), the attacker starts the learning phase to generate the matrix L. Since this step requires a known key, the AES algorithm executes locally at the infected IP. Although executed locally, it uses the shared cache of the system (IP0), so its behavior is very similar to the victim's. Once matrix L is ready, the malware must perform the attack step. The victim processor (IP13), a trusted IP in MPSoC Glass, provides encryptions and decryptions with the system private key as services. Hence, the attacker asks for an encryption service with random plaintexts and evaluates the overall execution time. Even the secure processor has to access the shared cache at some level of the AES computation, which leaks information through the cache miss behavior. Therefore, our attacker succeeds in elaborating the matrix L_k̃, used to retrieve the key.

Typically, MPSoCs present severe restrictions in terms of computation and memory capabilities, since power is an important issue during design time. Consequently, the calculations required to generate matrices L and L_k̃, i.e. the variance of each mean, can be prohibitive. To overcome this drawback, it is possible to calculate the mean time for each case and send it to an external computer. This outer system can compute the global mean time and the variance for each byte possibility, which are needed to perform the correlation of both matrices. This scenario implies that the attack cannot be guaranteed to be performed entirely from within the MPSoC.

4.3. Differential cache-collision attack: Bogdanov et al. attack

The work by Bogdanov et al. [11] describes a chosen-plaintext attack that exploits wide-collisions in software implementations of AES. To start the attack, the adversary generates plaintext pairs (P1 and P2) that are consecutively encrypted by the target implementation. Each pair targets one of the four diagonals of the 4 × 4 state matrix of AES. The pairs are constructed such that the values of the bytes in the target diagonal are different between the two plaintexts, whereas all other bytes are identical. The example below shows a plaintext pair that targets the main diagonal. The diagonal elements in P1 (labeled a_i) are different from those in P2 (labeled e_i). All elements x_j are identical in both plaintexts. The concrete values of all a, e, and x are chosen randomly.
P_1 = \begin{pmatrix} a_0 & x_1 & x_2 & x_3 \\ x_4 & a_5 & x_6 & x_7 \\ x_8 & x_9 & a_{10} & x_{11} \\ x_{12} & x_{13} & x_{14} & a_{15} \end{pmatrix}
\qquad
P_2 = \begin{pmatrix} e_0 & x_1 & x_2 & x_3 \\ x_4 & e_5 & x_6 & x_7 \\ x_8 & x_9 & e_{10} & x_{11} \\ x_{12} & x_{13} & x_{14} & e_{15} \end{pmatrix}
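The plaintext pairs for one diagonal can be generated along the following lines (a Python sketch of the construction just described; the diagonal index sets are the same byte groups that appear in Eq. (1)):

```python
import os

# Byte positions of the four diagonals of the 4x4 state (column-major numbering).
DIAGONALS = [(0, 5, 10, 15), (4, 9, 14, 3), (8, 13, 2, 7), (12, 1, 6, 11)]

def make_pair(diagonal=0):
    """Return (P1, P2): identical x_j bytes, differing bytes on the chosen diagonal."""
    p1 = bytearray(os.urandom(16))                         # the a_i and x_j bytes
    p2 = bytearray(p1)
    for pos in DIAGONALS[diagonal]:
        p2[pos] = p1[pos] ^ (1 + os.urandom(1)[0] % 255)   # e_i, guaranteed != a_i
    return bytes(p1), bytes(p2)

p1, p2 = make_pair(0)
print([i for i in range(16) if p1[i] != p2[i]])            # -> [0, 5, 10, 15]
```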
After computing the first round of AES, the state matrices of both plaintexts at the beginning of the second round (P_1^2 and P_2^2) look as follows. For the main diagonal, the bytes that depend on a_i and e_i form the first column; in P_2^2 these are written with primes to distinguish them, while all other bytes are identical in both states.

P_1^2 = \begin{pmatrix} s_0 & s_4 & s_8 & s_{12} \\ s_1 & s_5 & s_9 & s_{13} \\ s_2 & s_6 & s_{10} & s_{14} \\ s_3 & s_7 & s_{11} & s_{15} \end{pmatrix}
\qquad
P_2^2 = \begin{pmatrix} s'_0 & s_4 & s_8 & s_{12} \\ s'_1 & s_5 & s_9 & s_{13} \\ s'_2 & s_6 & s_{10} & s_{14} \\ s'_3 & s_7 & s_{11} & s_{15} \end{pmatrix}
The original plaintext pair causes a wide-collision if at least one collision happens at the input of the S-box (T-table) lookups in the second round, i.e., if s_i = s'_i for at least one of the first-column bytes of P_1^2 and P_2^2. This identical byte then causes four more collisions in the lookups of the third round. Assuming that s_0 = s'_0 in the example, the following matrices illustrate the inputs of the third round, P_1^3 and P_2^3 (primes again mark the bytes that differ between the two states):

P_1^3 = \begin{pmatrix} t_0 & t_4 & t_8 & t_{12} \\ t_1 & t_5 & t_9 & t_{13} \\ t_2 & t_6 & t_{10} & t_{14} \\ t_3 & t_7 & t_{11} & t_{15} \end{pmatrix}
\qquad
P_2^3 = \begin{pmatrix} t_0 & t'_4 & t'_8 & t'_{12} \\ t_1 & t'_5 & t'_9 & t'_{13} \\ t_2 & t'_6 & t'_{10} & t'_{14} \\ t_3 & t'_7 & t'_{11} & t'_{15} \end{pmatrix}
As shown, the first columns of both state matrices are identical, which causes four more identical S-box lookups. In total, this yields five internal collisions, or a wide-collision. The following paragraphs outline how wide-collisions are obtained and leveraged to recover the secret AES key from timing measurements.

Online stage. The main goal of the online stage is to generate plaintext pairs and measure the encryption time of all P2 plaintexts. For this, N unique plaintext pairs with different diagonal bytes (a_i, e_i) are chosen. For each of these pairs, the other bytes (x_j) are changed I times. These N · I plaintext pairs are then encrypted R times each. The parameter N thereby defines the number of candidates that might cause a wide-collision. The parameter I improves the detection of a wide-collision and rules out particularly unfortunate choices of x_j. The parameter R simply reduces the noise of the timing measurements. The output of the online stage is the set of encryption times of all processed P2 plaintexts.

Detection stage. In the detection stage, each of the N unique pairs is assigned a final timing value, which is the minimum encryption time over all corresponding I executions. The 4 pairs with the lowest timings are then classified as wide-collision candidates. In order to recover the entire key with only 2^32 search complexity, there must be at least 4 true wide-collisions among the chosen candidates. In general, if m candidates are chosen, the final search complexity increases to \binom{4+m}{4} \cdot 2^{32}.

Key recovery stage. During the key recovery stage, the previously chosen plaintext pairs are used to prune the list of possible key choices. Since the candidates of each diagonal contain information about 4 bytes of the full key, every candidate list is used to prune the corresponding 2^32 choices. Each sub-key choice is thereby tested against the candidate list. If a wide-collision would occur for the given list, the sub-key choice is kept. All stages must be performed for each of the four diagonals. The remaining sub-key choices are then combined and exhaustively searched for the correct key.

4.3.1. Bogdanov et al. at MPSoC Glass

Applying Bogdanov's attack in an MPSoC that uses a NoC can yield interesting results. First, since we have considerably less noise in the communication between attacker and victim (when compared to remote access through the internet), we can avoid using the R parameter. Moreover, this clean timing sampling allows the attacker to obtain a better success rate in the candidate selection. Consequently, if the number of false-positives is low, the computational effort to retrieve the key decreases considerably. During the execution of this attack on the target MPSoC, the malware was optimized to send to the external agent only the best candidates (in our experiments the ten fastest). This strategy reduces the amount of data that has to be extracted from the attacked system, making the attack more practical for any device, especially mobile ones.

4.4. NoC timing attack

NoC attacks require the control of at least one component in the MPSoC. There are techniques that an attacker can use to tamper with the software and infect an IP [33].
By using malicious software (malware) that performs read and write transactions in forbidden memory areas, an attacker may change the behavior of an IP (victim IP) and turn it into an infected IP. Moreover, buffer overflows and other similar techniques that target software weaknesses can be exploited for such a purpose. An infected IP may try to extract or infer data, modify the system behavior (by infecting other IPs) or deny the MPSoC service by employing malicious transactions.

NoC timing attacks elicit channel leakage by evaluating the attacker's throughput. As routers are shared, the communication collisions between malicious and sensitive traffic may reveal the sensitive behavior (e.g. mapping, topology, routing, transmission pattern and volume of communication). The collisions are detected through the reduction of the attacker's throughput. Several authors have explained the concept behind NoC timing attacks. Yao and Suh [34] and Sepúlveda et al. [35] explored the NoC timing leakage to understand the relation between the MPSoC traffic and the secret key. In [36], the authors proposed the Firecracker attack: by using the NoC timing attack, the interaction between the shared cache and the victim IP, which performs the cryptographic task, is monitored. When AES is executed in a non-pre-loaded memory environment, the T-table content must be retrieved after the first round (as shown in Eq. (1)). Usually, T-tables are initially stored in a higher level of the memory hierarchy (L2, DRAM), thus forcing the communication of the T-tables through the NoC. The high amount of communicated data degrades the attacker's data injection rate, thus revealing to the attacker the possible optimal time to trigger a cache attack.

5. Earthquake attack

The differential cache-collision attack by Bogdanov et al. [11] targets internal collisions that occur during the first three rounds of AES. These collisions influence the overall execution time, but they are not the only source of timing variations. The remaining rounds add noise to the timings and thereby increase the number of measurements required for a successful attack. Consequently, if an adversary is able to measure the execution time of only the first three rounds, the attack is expected to succeed with significantly fewer observations. This is the key improvement of the Earthquake attack compared to the original approach. The following paragraphs discuss how this improvement can be achieved on NoC-based MPSoCs and how the Earthquake attack can be launched in such a setting.

Attack scenario. The target platform of the Earthquake attack is any MPSoC that implements a network-on-chip. As modern MPSoCs can host numerous components that are constantly growing in complexity, it is reasonable to assume that one of them can eventually be compromised by an adversary [33]. To start the Earthquake attack, this compromised component must be able to trigger AES encryptions with chosen inputs, either directly or indirectly. The AES functionality might for instance be offered by a cryptography API or by a centralized service manager running on one of the MPSoC components. Given that the AES implementation uses T-tables, the adversary component must also be located on the communication path between the AES service and the shared cache level of the MPSoC. Adversaries located on the sensitive path are able to sense the communication pattern and volume. The goal is to annotate the execution time at the end of the third round (after 48 T-table accesses).

The cache hierarchy and configuration affect the attack. The L1 cache is typically a local memory located in the processor tile.
Communication between the processor and the L1 cache will bypass the NoC and therefore cannot be observed by the attacker. This makes detecting the end of the third AES round more difficult and introduces noise in the timing measurements. In contrast, the
communication between the processor and the L2 cache is observable by the attacker. Although the local L1 cache has an adverse effect on the timing measurements, its impact was limited in our experiments and did not require further action.

Attack steps. The Earthquake attack consists of nine steps, which are illustrated in Fig. 2. The first step of the attack is the infection of MPSoC components. The more components are compromised, the more likely it is for an attacker to be located on the sensitive path between the L2 cache and the AES component. The next three steps include choosing a target diagonal, generating plaintext pairs, and choosing multiple sets of x_j per plaintext pair. The subsequent step performs the measurement phase, which in turn consists of four sub-steps; they are highlighted in Fig. 2 in a separate box with dashed lines. In the prime step, the shared cache is filled with random data from the adversary. This ensures that no T-table entries reside in the cache before the first encryption. In the subsequent step, the adversary triggers the encryption of one plaintext pair. During the second encryption, the adversary uses the approach proposed by Reinbrecht et al. [15] to observe the end of the third round. In that work, the authors show that an adversary can observe accesses to a shared cache by saturating the communication path to the cache with its own traffic. Communicating through shared paths affects the throughput at which packets can be injected; this can be detected and allows the adversary to infer other cache activity. In the Earthquake attack, this technique is used to measure the time until the AES service has made 48 accesses to the cache, which is the number of T-table accesses during the first three rounds. In this way, the adversary can accurately detect the end of the third round and thereby significantly improve the detection of wide-collisions. After the measurement step, the detection and key recovery steps are performed as proposed in the original attack. In general, these final steps need not be performed on the MPSoC. Instead, an adversary can transmit the measurement results to an external device that offers more computational power, which speeds up the key recovery phase.

5.1. NoC timing attack in Earthquake

Fig. 1 shows the attack scenario used in our experiments. A sensitive process (AES) on IP13 offers a performance-oriented AES encryption function (T-table implementation). The process uses four lookup tables (T0, T1, T2, and T3) to perform the SubBytes, ShiftRows, and MixColumns transformations. When an encryption is requested, the AES implementation accesses the T-tables, thus issuing data requests to the private L1 cache of IP13. Next, the availability of the data in this L1 cache is checked. If the data is not in L1, a miss occurs and an access to the shared L2 cache at IP0 is performed. This generates sensitive communication through the NoC from IP13 to IP0. The traffic must follow the deterministic path (sensitive path) determined by the XY routing algorithm, which includes five routers (Router31, Router30, Router20, Router10 and Router00). The response message from IP0 to IP13 uses another five routers (Router00, Router01, Router11, Router21 and Router31). Simultaneously, a malicious process (the attacker) is running on the infected IP1, which is located on the sensitive path between the L2 cache and the AES process (Router01). The infected IP1 injects frequent dummy transactions to saturate a particular output port (the North port) of Router01.
Due to the shared nature of the router resources, this causes communication collisions in the North output port between the malicious and the sensitive traffic. The attacker annotates its own throughput by recording the time of each successful packet injection into the router. The attacker's throughput then reveals the access pattern and communication volume of the sensitive flow, as low throughput indicates the existence of collisions inside the router.
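A compact sketch of this measurement loop is shown below (pseudocode-style Python; prime_cache, request_encryption, inject_dummy_packet and cycle_counter are hypothetical platform services standing in for the malware's primitives, and 48 is the number of T-table accesses in the first three rounds):

```python
T_TABLE_ACCESSES_3_ROUNDS = 48   # 16 T-table lookups per round x 3 rounds

def measure_pair(p1, p2, prime_cache, request_encryption,
                 inject_dummy_packet, cycle_counter):
    """Time until the second encryption of a pair has caused 48 shared-cache accesses."""
    prime_cache()                 # fill L2 with attacker data, evicting T-table entries
    request_encryption(p1)        # first encryption of the pair
    request_encryption(p2)        # second encryption (assumed to run concurrently)
    start, observed = cycle_counter(), 0
    while observed < T_TABLE_ACCESSES_3_ROUNDS:
        # Keep the router output port saturated; a stalled injection signals a
        # collision with the victim's cache traffic, i.e. one T-table access.
        if not inject_dummy_packet():
            observed += 1
    return cycle_counter() - start
```

The minimum of these timings over the I variations of each pair then feeds the detection and key recovery stages exactly as in the original attack.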
Fig. 2. Earthquake attack steps. The measurement parameter R is set to 1 and is therefore not considered further in the attack steps.
Fig. 3. Relation between the amount of encryptions and the revealed bytes of the key.
The security zone marked in Fig. 1 does not prevent the attack in this scenario, as the communication is observed outside of it. Placing the L2 cache within the security zone reduces the attack surface, but limits accessibility to the shared cache for non-secure components.

6. Experimental results and analysis

This section describes the experiments carried out to evaluate the main contributions of the present work: i) the efficiency of Bernstein's attack in a NoC-based MPSoC; ii) the efficiency of the Bogdanov et al. attack in a NoC-based MPSoC; and iii) the increase in efficiency achieved by Earthquake. All experiments run on the MPSoC Glass in an FPGA, following the configuration described in Section 2.

6.1. Bernstein's attack evaluation

Bernstein's attack was implemented and evaluated on our target platform. Since our system is fully integrated inside an MPSoC, fewer samples than in Bernstein's paper were required. The learning phase used 10,000 encryptions to create the signatures of each plaintext byte. The signature is obtained by calculating the variance of the mean time to encrypt the bytes with a fixed known key. The average of the variance over all possibilities is used as a reference to normalize the results. During the attack, the same calculation is used to create a database with an unknown key. All these data were sent to an external computer for post-processing the correlation between the signatures. Different experiments were executed, varying the number of encryptions from 10,000 up to 100,000,000 samples. The results in Fig. 3 show that fewer than 10,000 samples do not provide the statistical properties required for the correlation, while with about 100,000,000 samples the key can be retrieved.
Besides, the hardware required in an MPSoC to execute such an attack is more complex than the hardware required for typical applications. Since all timing measurements are stored and manipulated in the attacker's IP, the processing element needs to support 64-bit variables stored and managed in several arrays. As a consequence, the attack imposes more demanding requirements on computational power and local memory sizes. As a final result, the total number of encryptions required for the attack inside the MPSoC was about 2^26.57, while Bernstein required 2^27.

6.2. Bogdanov et al. attack evaluation

The original work attacked through the internet, often presenting errors in the time measurements due to communication noise. Inside our embedded environment, however, the communication noise added is of only a few cycles, which is not an issue, since cache misses are of greater magnitude. In our experiments, N is the number of different trials with the same target diagonal, and I is the number of changes of the pairwise-equal elements, used to explore the best choice for a given N. Both can improve the search for wide-collisions. To test how the variation of these parameters impacts the results, we set alternate values for N and I, as follows: N1 = 50, N2 = 100, N3 = 150 and N4 = 250; I1 = 200, I2 = 400 and I3 = 600. These values were defined based on Bogdanov's original experiments; we chose lower values because our environment presents less interference. For these experiments, the results in terms of false-positives found among the ten best wide-collision candidates are presented in Table 1, column Bogdanov. The optimum result of 0.5 false wide-collisions per diagonal can be achieved with about 100,000 encryptions when running this attack in an MPSoC scenario, while the original Bogdanov et al. attack required about 32,768,000 encryptions (N = 1024, I = 800, R = 40) to surpass this result.
Table 1
False-positive wide-collisions per diagonal among the ten fastest plaintext pairs.

  N     I     False-positives             Earthquake
              Bogdanov    Earthquake      improvement
  50    200   1.25        0.75            40%
  50    400   1.25        0.25            80%
  50    600   1.00        0.50            50%
  100   200   1.00        0.75            25%
  100   400   1.00        0.25            75%
  100   600   1.00        0.25            75%
  150   200   0.75        0.75            0%
  150   400   0.75        0.00            100%
  150   600   0.50        0.25            50%
  250   200   1.00        0.25            75%
  250   400   0.50        0.00            100%
  250   600   0.50        0.00            100%
6.3. Earthquake evaluation

In order to fairly evaluate and compare both attacks, the measurement parameters N, representing the number of different values for the target diagonal, and I, representing the number of different pairs for every diagonal value, are varied as follows: N1 = 50, N2 = 100, N3 = 150 and N4 = 250; I1 = 200, I2 = 400 and I3 = 600. The parameter R, which is used to reduce measurement noise, is set to 1 in this set of experiments, since both attacks run in the MPSoC scenario. To evaluate the effectiveness of both attacks, measurements are taken for all combinations of the measurement parameters. From the obtained timings, the ten most likely wide-collision candidates are selected and tested for correctness. The corresponding numbers of false-positives are shown in Table 1 for both attacks. False-positives significantly prolong or break the exhaustive key search phase. Therefore, the lower the number of false-positives, the higher the success probability of the attack.

The results in Table 1 show that increasing N or I generally improves the attacks. Larger values of I improve the detection of wide-collisions caused by a given diagonal. However, increasing I also prolongs the execution time of the attack. In our experiments, the MPSoC operates at 50 MHz. With I = 200, the attack takes roughly eight hours to finish; by increasing I to 600, the attack took approximately twenty-five hours. Compared to Bogdanov et al.'s attack, the Earthquake attack is successful with fewer measurements (lower N and I). This renders the attack more practical. Table 1 shows that for N = 250 and I = 200, Earthquake produced only 0.25 false-positives per diagonal, whereas the original attack produced one false-positive per diagonal.

Compared to the results provided in the original work [11], the Earthquake attack performs significantly better. Bogdanov et al. achieved a remaining attack complexity of 2^36.9 with 2^27.97 encryptions (N = 1024, I = 800, R = 40). In contrast, Earthquake achieves a remaining attack complexity of 2^32.0 with 2^20.19 encryptions (N = 250, I = 600, R = 1). The number of encryptions is also lower than the 2^26.66 queries consisting of 1024 encryptions each that Acıiçmez et al. [30] required for full key recovery. These results clearly illustrate that cache-collision attacks, and in particular the Earthquake attack, are a considerable threat in MPSoC settings.
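As a sanity check of the reported figures (our reading of how encryptions are counted, not a formula stated by the authors), counting two encryptions per pair over all four diagonals gives:

4 \cdot N \cdot I \cdot R \cdot 2 = 4 \cdot 250 \cdot 600 \cdot 1 \cdot 2 = 1{,}200{,}000 \approx 2^{20.19} \quad \text{(Earthquake)}
4 \cdot 1024 \cdot 800 \cdot 40 \cdot 2 = 262{,}144{,}000 \approx 2^{27.97} \quad \text{(Bogdanov et al.)}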
7. Conclusion

The present work has investigated the applicability of cache timing attacks on a promising platform paradigm, the Multi-Processor System-on-Chip. MPSoCs are intended to integrate most devices that aim at performance and energy efficiency, making them suitable for the new computation era driven by Internet-of-Things, deep learning and servers-on-chip. From the analysis of established side-channel attacks, we demonstrated that this novel architecture can be compromised. The cache timing attack of Bernstein and the differential cache-collision attack of Bogdanov et al. were studied in depth in order to adapt these techniques to a multi-processing scenario. The results demonstrate that the same vulnerabilities can be exploited and that the efficiency of such attacks is even improved. Running the different applications, the victims and the attackers, in the same on-chip environment provides a low-noise condition for leakage extraction. Bernstein's attack on our MPSoC reduced the number of encryptions required to break AES from 2^27 to 2^26.57, while the Bogdanov et al. attack was reduced from 2^27.97 to 2^16.60.

Furthermore, we extended the work in [17], performing experiments with the Earthquake attack and comparing it with Bernstein's timing attack under the MPSoC scenario. Earthquake allows measuring only the first three rounds of AES, which significantly improves the detection of wide-collisions. It also reduces the number of measurements and therefore the overall attack time. This is possible by jointly exploiting the shared cache level and the resources of the NoC communication infrastructure. These improvements turn the attack by Bogdanov et al. into a practical threat on modern MPSoC platforms. Securing MPSoCs therefore calls for the co-design of hardware and software to provide complete security. This work also demonstrates the vulnerability of MPSoCs to timing attacks; with it we hope to bring attention to the danger these attacks present, as well as to the importance of defense strategies focused on securing the cache hierarchy.

Acknowledgments

This work was partly funded by the Bundesministerium für Bildung und Forschung (BMBF), grant number 01IS160253 (ARAMiS II).

References

[1] M. Hutter, J.-M. Schmidt, The temperature side channel and heating fault attacks, in: CARDIS 2013, Berlin, Germany, 2014, pp. 219–235.
[2] P.C. Kocher, J. Jaffe, B. Jun, Differential power analysis, in: CRYPTO '99, London, UK, 1999, pp. 388–397.
[3] C.H. Gebotys, R.J. Gebotys, Secure elliptic curve implementations: an analysis of resistance to power-attacks in a DSP processor, in: CHES 02, London, UK, 2003, pp. 114–128.
[4] M. Masoomi, M. Masoumi, M. Ahmadian, A practical differential power analysis attack against an FPGA implementation of AES cryptosystem, in: Information Society (i-Society), 2010 International Conference on, London, UK, 2010, pp. 308–312.
[5] S.B. Ors, F. Gurkaynak, E. Oswald, B. Preneel, Power-analysis attack on an ASIC AES implementation, in: International Conference on Information Technology: Coding and Computing (ITCC 2004), Las Vegas, NV, USA, 2004, pp. 546–552.
[6] K. Gandolfi, C. Mourtel, F. Olivier, Electromagnetic analysis: concrete results, in: CHES 2001, Berlin, Heidelberg, 2001, pp. 251–261.
[7] P.C. Kocher, Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems, in: CRYPTO 96, London, UK, 1996, pp. 104–113.
[8] D.J. Bernstein, Cache timing attacks on AES, 2005. Available at: https://cr.yp.to/antiforgery/cachetiming-20050414.pdf. Accessed 2016-12-31.
[9] J. Bonneau, I. Mironov, Cache-collision timing attacks against AES, in: Cryptographic Hardware and Embedded Systems - CHES 2006, Springer Berlin Heidelberg, Yokohama, Japan, 2006, pp. 201–215.
[10] M. Neve, J.-P. Seifert, Z. Wang, A refined look at Bernstein's AES side-channel analysis, in: Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, New York, NY, USA, 2006, pp. 369–369.
[11] A. Bogdanov, T. Eisenbarth, C. Paar, M. Wienecke, Differential cache-collision timing attacks on AES with applications to embedded CPUs, in: CT-RSA 2010, Berlin, Heidelberg, 2010, pp. 235–251.
[12] O. Acıiçmez, Çetin K. Koç, Trace-driven cache attacks on AES, in: Proceedings of the 8th International Conference on Information and Communications Security (ICICS'06), Springer-Verlag, 2006, pp. 112–121.
[13] D.A. Osvik, A. Shamir, E. Tromer, Cache Attacks and Countermeasures: The Case of AES, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 1–20.
[14] C. Reinbrecht, B. Forlin, A. Zankl, J. Sepúlveda, Earthquake - a NoC-based optimized differential cache-collision attack for MPSoCs, in: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 648–653, doi:10.23919/DATE.2018.8342090.
[15] C. Reinbrecht, A. Susin, L. Bossuet, J. Sepúlveda, Gossip NoC - avoiding timing side-channel attacks through traffic management, in: ISVLSI 16, Pittsburgh, USA, 2016, pp. 601–606.
[16] W. Wolf, A.A. Jerraya, G. Martin, Multiprocessor system-on-chip (MPSoC) technology, IEEE TCAD 27 (10) (2008) 1701–1713.
[17] C. Reinbrecht, B. Forlin, A. Zankl, J. Sepúlveda, Earthquake - a NoC-based optimized differential cache-collision attack for MPSoCs, in: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 648–653, doi:10.23919/DATE.2018.8342090.
[18] ARM, White paper - big.LITTLE technology: the future of mobile, 2016. Available at: https://www.arm.com. Accessed 2017-10-07.
[19] Intel FPGA (Altera), Nios II classic processor reference guide, 2015. Available at: https://www.altera.com/. Accessed 2017-10-07.
[20] J. Sepúlveda, D. Flórez, G. Gogniat, Efficient and flexible NoC-based group communication for secure MPSoCs, in: 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Mexico City, Mexico, 2015, pp. 1–6.
[21] J. Sepúlveda, Elastic security zones for NoC-based 3D-MPSoCs, in: ICECS 2014, Marseille, France, 2014, pp. 506–509.
[22] J. Kelsey, B. Schneier, D. Wagner, C. Hall, Side channel cryptanalysis of product ciphers, in: Computer Security — ESORICS 98: 5th European Symposium on Research in Computer Security, Springer Berlin Heidelberg, Louvain-la-Neuve, Belgium, 1998, pp. 97–110.
[23] Y. Tsunoo, T. Saito, T. Suzaki, M. Shigeri, H. Miyauchi, Cryptanalysis of DES implemented on computers with cache, in: CHES 2003: 5th International Workshop on Cryptographic Hardware and Embedded Systems, 2779, Springer Berlin Heidelberg, Cologne, Germany, 2003, pp. 62–76.
[24] M. Weiss, B. Heinz, F. Stumpf, A cache timing attack on AES in virtualization environments, in: Financial Cryptography and Data Security, Lecture Notes in Computer Science, 7397, Springer, Berlin, Heidelberg, 2012, pp. 314–328.
[25] M. Weiss, B. Weggenmann, M. August, G. Sigl, On cache timing attacks considering multi-core aspects in virtualized embedded systems, in: The 6th International Conference on Trustworthy Systems (InTrust 2014), Springer International Publishing, Beijing, China, 2014, pp. 151–167.
[26] R. Spreitzer, T. Plos, On the applicability of time-driven cache attacks on mobile devices (extended version), 2013.
[27] R. Spreitzer, B. Gérard, Towards more practical time-driven cache attacks, in: D. Naccache, D. Sauveron (Eds.), Information Security Theory and Practice. Securing the Internet of Things, Lecture Notes in Computer Science, 8501, Springer Berlin Heidelberg, 2014, pp. 24–39.
[28] G.I. Apecechea, M.S. Inci, T. Eisenbarth, B. Sunar, Fine grain cross-VM attacks on Xen and VMware are possible!, 2014. Available at: http://eprint.iacr.org/. Accessed 2017-10-07.
[29] G. Irazoqui, M.S. Inci, T. Eisenbarth, B. Sunar, Fine grain cross-VM attacks on Xen and VMware are possible!, IACR Cryptol. ePrint Arch. (2014) 248. Accessed 2017-10-07.
[30] O. Acıiçmez, W. Schindler, Çetin K. Koç, Cache based remote timing attack on the AES, in: CT-RSA 2007, 2006, pp. 271–286.
[31] C. Lauradoux, Collision attacks on processors with cache and countermeasures, in: WEWoRC'05, 2005, pp. 76–85.
[32] J. Daemen, V. Rijmen, The Design of Rijndael, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2002.
[33] L. Fiorin, G. Palermo, C. Silvano, A security monitoring service for NoCs, in: Proceedings of the 6th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '08), New York, NY, USA, 2008, pp. 197–202.
[34] W. Yao, E. Suh, Efficient timing channel protection for on-chip networks, in: NOCS '12: Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, IEEE, Lyngby, Denmark, 2012, pp. 142–151.
[35] J. Sepúlveda, J. Diguet, M. Strum, G. Gogniat, NoC-based protection for SoC time-driven attacks, Embedded Syst. Lett. IEEE 7 (1) (2015) 7–10.
[36] C. Reinbrecht, A. Susin, L. Bossuet, G. Sigl, J. Sepúlveda, Timing attack on NoC-based systems: Prime+Probe attack and NoC-based protection, Microprocess. Microsyst. 51 (2017).

Cezar Reinbrecht holds a Ph.D. in Hardware Security (Federal University of Rio Grande do Sul - UFRGS, BR), an M.Sc. in Computer Science (Federal University of Rio Grande do Sul - UFRGS, BR, 2012) and a B.Sc. in Computer Engineering (Catholic University of Rio Grande do Sul - PUCRS, BR, 2009). He joined the Laboratoire Hubert Curien (France) in 2015, working as a researcher in the field of MPSoC security. He worked at NSCAD Microeletronics from 2010 to 2014, being project manager from 2013 to 2014 and instructor/IC designer from 2010 to 2012. As project manager, his main projects were an ARM-M3 SoC with support for IEEE 802.15.4 and the start of the project of a transponder for the Brazilian Cubesat (satellite) targeted at the SBCDA (INPE). As IC designer, he worked on four main front-end designs, notably a fault-tolerant SoC (containing two MIPS) in partnership with the Brazilian Space Agency and a digital baseband for IEEE 802.15.4. He also worked at Uniritter University as a professor in the Informatics department from 2013 to 2015. His research interests include MPSoC architectures, security of Networks-on-Chip and silicon photonics.

Bruno Forlin is an Electrical Engineering undergraduate student at Universidade Federal do Rio Grande do Sul, in South Brazil. Since entering the university he has taken part in projects to modify the FFMPEG tool for the Brazilian standard, while also working on image processing projects and embedded software. In the last few years he took part in projects involving hardware security, his current research field.
Johanna Sepúlveda received the M.Sc. and Ph.D. degrees in Electrical Engineering - Microelectronics from the University of São Paulo, Brazil, in 2006 and 2011, respectively. She was a Postdoctoral fellow with the Integrated Systems and Embedded Software group at that university and with the Embedded Security group of the University of South Brittany, France. Moreover, Dr. Sepúlveda was a Visiting Researcher with the Computer Architecture group at the University of Bremen, Germany. In 2014, she worked as a Senior INRIA Postdoctoral researcher with the Heterogeneous Systems group at the University of Lyon, France. Since 2015, she has held a Senior Researcher Assistant position at the Technical University of Munich, Germany. She has been working in the field of embedded security design for more than 10 years. Her research interests also include high-performance SoC design and the design of new technologies.