Design of an Area Efficient Crypto Processor for 3GPP-LTE NB-IoT Devices
Journal Pre-proof
Luis Cavo, Sébastien Fuhrmann, Liang Liu
PII: S0141-9331(19)30001-8
DOI: https://doi.org/10.1016/j.micpro.2019.102899
Reference: MICPRO 102899
To appear in: Microprocessors and Microsystems
Received date: 18 February 2019
Revised date: 20 September 2019
Accepted date: 26 September 2019
Please cite this article as: Luis Cavo, Sébastien Fuhrmann, Liang Liu, Design of an Area Efficient Crypto Processor for 3GPP-LTE NB-IoT Devices, Microprocessors and Microsystems (2019), doi: https://doi.org/10.1016/j.micpro.2019.102899
© 2019 Published by Elsevier B.V.
Luis Cavo, Dept. of EIT, Lund University, Sweden
Sébastien Fuhrmann, Dept. of EIT, Lund University, Sweden
Liang Liu, Dept. of EIT, Lund University, Sweden
September 27, 2019

Abstract
Providing information security is crucial for the Internet of Things (IoT) devices, platforms in which the available power budget is very limited. This paper tackles this challenge and presents a cryptographic processor compliant with the security algorithms specified by the 3rd Generation Partnership Project (3GPP) Long Term Evolution (LTE) NarrowBand IoT (NB-IoT) standard. The proposed processor has been optimized to the needs of the low end portfolio technologies that compose the IoT market, which addresses low-area, low-cost and low-data rate applications. Operation analysis at the algorithm-level and hardware sharing at the architecture-level have enabled extensive area reduction. The cryptographic processor has been described using the High-Level Synthesis (HLS) design flow and integrated with a general purpose processor in a cycle accurate virtual platform. The design achieves a reduction of area ranging from 5% to 42% in comparison to similar work. Synthesis results using a 65-nm CMOS technology show that the processor has a hardware cost of 53.6 kGE, and is capable of performing at 52.4 Mbps for the block cipher and 800 Mbps for the stream cipher algorithms at a 100 MHz clock.
1 Introduction
The Internet of Things (IoT) is enabling the deployment of a massive number of devices which are interconnected to communicate and exchange data. It is projected that by 2022 there will be around 22 billion connected devices, of which 18 billion will be related to IoT, falling in both wide-area and short-range categories [1]. With this increasing number of connected IoT devices, the techniques for providing information security are a key challenge that must be addressed during the design of such devices. More connected devices imply more vulnerabilities and more access points to be exploited by hackers. Power consumption is another challenge in the field of IoT. These devices must be power efficient in order to rely on their battery and last for 10 years or more.

Successive releases of Long Term Evolution (LTE) have optimized Machine-Type Communications (MTC) with improved support for low power wide-area connectivity. In LTE Release 13, enhanced MTC (eMTC) and NarrowBand IoT (NB-IoT) have been introduced, which provide further improvements such as device cost and complexity reduction, extended battery lifetime and enhanced coverage [2]. NB-IoT also provides end-to-end security, which entails trusted security and authentication features.

Within the framework of IoT, the cryptographic algorithms used to secure sensitive data must be adapted to the needs of embedded devices with limited resources, and therefore must meet limited area and power demands. These algorithms can be implemented in hardware or in software. The latter allows the reuse of resources and enables support of different cellular standards. However, software implementations are found to be slower and less secure than their hardware counterparts [3]. Furthermore, the processor subsystem in an NB-IoT node would become overloaded with security processing if these algorithms were implemented as software functions. Consequently, this paper presents an all-hardware accelerator solution capable of performing the security functions needed to secure data in an NB-IoT device. This hardware module handles the security algorithms SNOW 3G, AES and ZUC.

A unified solution in which the algorithms share the maximum amount of hardware resources, including a common datapath and common control logic, has been identified as the best way to fulfill the low-power, low-area requirements.
2 Theory and calculation
The user plane confidentiality and integrity protection for LTE systems are defined in Technical Specification System Architecture Evolution (SAE) of 3GPP under the Security Architecture specification [4]. Three Evolved Packet System (EPS) Encryption Algorithms (EEAs) and Integrity Algorithms (EIAs) are defined for the protection of data against malicious attacks: the 128-EEA1 based on SNOW 3G, the 128-EEA2 based on Advanced Encryption Standard (AES) and finally 128-EEA3 based on ZUC ciphering algorithm. Document [4] provides references to each of the individual algorithms described in the following sections.
2.1 Confidentiality Algorithms
The confidentiality algorithms guarantee that data is inaccessible to unauthorized entities.
2.1.1 128-EEA1 SNOW 3G algorithm
The keystream generator for 128-EEA1 is based on SNOW 3G, composed of three major modules shown in Fig. 1. The first building block is a Linear Feedback Shift Register (LFSR) consisting of 16 registers of 32 bits. The LFSR feeds a Finite State Machine (FSM) with its input values. The keystream produced by the keystream generator is obtained by XORing the data from register $s_0$ and the output of the FSM. The keystream is then used to mask the plaintext data using an exclusive OR operation. The LFSR initialization as well as the structure of the FSM are specified in [5]. The feedback path performs a series of operations in a Galois Field (GF), $GF(2^m)$.

Figure 1: Stream cipher algorithms structure. [Diagram: LFSR stages s15 to s0 with a feedback path through Mulα and Divα; FSM with registers R1, R2, R3 and S-boxes S1, S2; keystream output z_t.]

2.1.2 128-EEA2 AES algorithm
AES is a symmetric block cipher algorithm specified by the National Institute of Standards and Technology (NIST) [6] that processes the input data arranged in fixed-size blocks of 128 bits. The principle of operation of the AES cipher is to pass a block of data to encrypt, defined as the state, through a network of substitutions and permutations. The AES encryption flow is shown in Fig. 2. The 128-EEA2 algorithm uses AES in the Counter (CTR) mode of operation, with a key size of 128 bits. The cipher block generates a series of output blocks that are XOR-ed with the input plaintext data. AES encryption is performed over a set of input blocks, called counters, which are obtained as specified in Annex B of [4]. To recover a plaintext block of data, the same XOR operation is performed between the ciphertext data and the output of the AES cipher, which is obtained using the same counter generated during encryption as an input block.

Figure 2: AES ciphering.

2.1.3 128-EEA3 ZUC algorithm
The keystream generator for the 128-EEA3 algorithm has the same general structure as the SNOW 3G algorithm and is shown in Fig. 3, but with different feedback and FSM logic. The LFSR consists of 16 registers of 31 bits each, and the feedback path is constructed from a primitive polynomial over $GF(2^{31}-1)$. This is a major difference with respect to SNOW 3G, since ZUC produces sequences over the prime field $GF(2^{31}-1)$ instead of $GF(2^m)$. This methodology for generating sequences contributes to the algorithm's resistance to bit-oriented cryptographic attacks, fast correlation attacks, linear distinguishing attacks and algebraic attacks [9].
2.2 Integrity Algorithms
The purpose of these algorithms is to protect the integrity of the information, i.e. to authenticate, by means of a Message Authentication Code (MAC), that the data comes from the original source and has not been tampered with. The keystream generators described in the previous section are used to produce a 32-bit MAC.
2.2.1 128-EIA1 SNOW 3G based algorithm
3GPP’s 128-EIA1 algorithm is implemented the same way as ETSI/SAGE UIA2 Integrity algorithm, specified in [7].
Figure 3: ZUC ciphering algorithm during keystream mode. [Diagram: LFSR of 16 stages of 31 bits, s15 to s0, with a feedback path built from cyclic left shifts by 15, 17, 21, 20 and 8 bits and addition mod 2^31 - 1; a bit reorganization (BR) layer producing the 32-bit words X0 to X3; FSM with registers R1 and R2 and the S·L1 and S·L2 transforms, producing W and the keystream output z_t.]

2.2.2 128-EIA2 AES based algorithm
The 128-EIA2 algorithm also uses the AES cipher to perform integrity protection. This mode of operation for integrity protection is called Cipher-based MAC (CMAC) mode. The message data is divided into blocks of 128 bits, which is the input block size of the AES cipher. The resulting data from the cipher is XOR-ed with the next block of the message and encrypted in another pass of the AES cipher. This is repeated until the whole message has been processed.
2.2.3 128-EIA3 ZUC based algorithm
This algorithm will process the keystreams produced by ZUC stream cipher as specified in [8] to produce an output word MAC.
2.3 Architecture exploration for Area Reduction
The aim of this section is to provide an analysis of the operations performed by the security algorithms and how these can be mapped to an architecture that meets the area requirements of NB-IoT devices. The High Level Synthesis (HLS) design flow is employed for exploring the design trade-offs of different architectural solutions. In the previous section, we observed how NB-IoT security algorithms rely on certain finite field operations. AES uses a Rijndael substitution box $S_R$ in the SubBytes operation; the original SNOW 3G algorithm uses four different lookup tables: Mulα, Divα, $S_1$ and $S_2$; and ZUC uses two non-Rijndael substitution boxes, $S_0$ and $S_1$.
2.3.1 S-box $S_R$ and $S_Q$
S-box $S_R$ is used in both AES and SNOW 3G algorithms. SNOW 3G uses this substitution box internally in the FSM to compute $S_1$. Four instances of the S-box $S_R$ are needed to produce one result of $S_1$ in one clock cycle. On the other hand, AES must perform the SubBytes operation on each byte of the state matrix, resulting in a total of 16 calculations. Thus, 16 instances of this S-box would be needed to perform the SubBytes operation in one clock cycle. This substitution box is computed in two main steps [6]:
1. Determining the multiplicative inverse of the input number in Rijndael's field. The element zero ({00}) has no inverse, and is thus mapped to itself.
2. Applying an affine transformation over the Galois Field, given by:

$$b'_i = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i$$

The affine transformation can be expressed in matrix form as in equation (1), where $c = \{1, 1, 0, 0, 0, 1, 1, 0\}$.

$$
\begin{bmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix}
+
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}
\tag{1}
$$
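The two-step construction of S-box $S_R$ can be sketched in a few lines of Python. This is a minimal illustration rather than the hardware description used in this work: the function names are hypothetical, the field modulus $x^8 + x^4 + x^3 + x + 1$ (0x11B) is taken from the AES specification [6], and the inverse is computed by brute-force exponentiation purely for readability.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply two bytes in Rijndael's field GF(2^8), modulus 0x11B (from FIPS-197)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # reduce when the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def gf256_inverse(a: int) -> int:
    """Step 1: multiplicative inverse, with {00} mapped to itself."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):       # a^254 = a^(-1) in GF(2^8)
        result = gf256_mul(result, a)
    return result


def affine(b: int, c: int = 0x63) -> int:
    """Step 2: b'_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i (indices mod 8)."""
    out = 0
    for i in range(8):
        bit = (b >> i) & 1
        for offset in (4, 5, 6, 7):
            bit ^= (b >> ((i + offset) % 8)) & 1
        bit ^= (c >> i) & 1    # c = {1,1,0,0,0,1,1,0} read as c_0..c_7, i.e. 0x63
        out |= bit << i
    return out


def sbox_sr(x: int) -> int:
    """Rijndael S-box computed on the fly instead of read from a table."""
    return affine(gf256_inverse(x))
```

For example, `sbox_sr(0x53)` evaluates to `0xED`, the substitution example given in the AES specification.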
S-box $S_Q$ is used in the SNOW 3G FSM in a similar way to S-box $S_R$, to compute $S_2$. Likewise, four different iterations are needed to compute the result. The S-box $S_Q$ is constructed using the Dickson polynomial:

$$g_{49}(x) = x \oplus x^{9} \oplus x^{13} \oplus x^{15} \oplus x^{33} \oplus x^{41} \oplus x^{45} \oplus x^{47} \oplus x^{49}$$

The output is then given by:

$$S_Q(x) = g_{49}(x) \oplus 25_{\text{hex}}$$
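As an illustration, the Dickson polynomial construction of $S_Q$ can be evaluated directly. The sketch below makes one assumption that is not stated in the text above: the powers of $x$ are computed in $GF(2^8)$ with the modulus $x^8 + x^6 + x^5 + x^3 + 1$ (0x169), which is the field used for this S-box in the SNOW 3G specification [5]. The function names are hypothetical.

```python
SQ_FIELD_POLY = 0x169   # assumed modulus x^8 + x^6 + x^5 + x^3 + 1, per the SNOW 3G spec
DICKSON_EXPONENTS = (1, 9, 13, 15, 33, 41, 45, 47, 49)


def gf_mul(a: int, b: int, poly: int = SQ_FIELD_POLY) -> int:
    """Multiply two bytes in GF(2^8) modulo the given degree-8 polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(x: int, exponent: int) -> int:
    """Raise x to a positive power by repeated multiplication."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, x)
    return result


def sbox_sq(x: int) -> int:
    """S_Q(x) = g49(x) XOR 0x25, with g49 the sum of the listed powers of x."""
    g49 = 0
    for exponent in DICKSON_EXPONENTS:
        g49 ^= gf_pow(x, exponent)
    return g49 ^ 0x25
```

Two inputs can be checked by hand: every power of 0 is 0, so `sbox_sq(0)` is `0x25`, and the nine powers of 1 XOR to 1, so `sbox_sq(1)` is `0x24`.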
2.3.2 S-box $S_0$ and $S_1$
These substitution boxes are used in the ZUC cipher algorithm. S-box $S_1$ used in ZUC is very similar in construction to S-box $S_R$. ZUC S-box $S_0$ can be calculated by implementing the operations shown in Fig. 4.

Figure 4: On-the-fly implementation of ZUC S-box $S_0$. [Diagram: the 8-bit input is split into data[7:4] and data[3:0], processed through the stages P1, P2 and P3, concatenated, and cyclically shifted left by 5 bits.]
2.3.3 SNOW 3G feedback field operations analysis
The feedback polynomial is given by $f(x) = \alpha x^{16} + x^{14} + \alpha^{-1} x^{15} + 1$, defined over $GF(2^{32})$. SNOW 3G uses the field operators $\alpha$ and $\alpha^{-1}$, along with 32-bit XOR operations, to produce its output. These operations are commonly referred to as the Mulα and Divα operators. Each of these functions maps an 8-bit input to a 32-bit output. They can be implemented as precomputed tables and are used to initialize and feed the LFSR correctly. The operation of Mulα is governed by

$$\mathrm{Mul}\alpha(x) = x^{4} + \beta^{23} x^{3} + \beta^{245} x^{2} + \beta^{48} x + \beta^{239}$$

where $x$ takes a value between 0 and 255 and $\beta$ is a root of the polynomial $x^{8} + x^{7} + x^{5} + x^{3} + 1$. Divα is governed by

$$\mathrm{Div}\alpha(x) = x^{4} + \beta^{16} x^{3} + \beta^{39} x^{2} + \beta^{6} x + \beta^{64}$$

2.3.4 ZUC feedback loop analysis
The feedback path in the ZUC ciphering algorithm performs the function:

$$v = 2^{15} s_{15} + 2^{17} s_{13} + 2^{21} s_{10} + 2^{20} s_{4} + (1 + 2^{8}) s_{0} \bmod (2^{31} - 1)$$

The operations implemented in the feedback path are additions mod $(2^{31} - 1)$, which can be implemented as described in the specification [10]. The number of adders increases by one for each addition in the previous equation due to the modulo implementation. Addition in $GF(2^{31} - 1)$:

    v = a + b
    if carry bit is 1 then
        v = v + 1
    end

Furthermore, as described in section 10.1.5 of [9], the operation $a \cdot x$ over $GF(p)$, where $a = 2^{i} + 2^{j} + 2^{k}$, can be implemented with cyclic shifts and a modulo $p$ addition as $ax \equiv (x \lll_{31} i) + (x \lll_{31} j) + (x \lll_{31} k) \bmod p$. A total of ten 31-bit adders are needed in order to calculate the feedback result which is fed to the ZUC LFSR.
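Both feedback computations analysed in Sections 2.3.3 and 2.3.4 reduce to a handful of shifts, XORs and additions. The Python sketch below is illustrative only and all function names are hypothetical. For ZUC it follows the carry-fold addition and the cyclic-shift multiplication described above. For SNOW 3G it assumes the byte packing used in the SNOW 3G specification [5], in which the four $\beta$ powers of Mulα and Divα, each scaled by the input byte, are concatenated from most to least significant byte.

```python
MASK31 = 0x7FFFFFFF  # p = 2^31 - 1


def add_mod_p(a: int, b: int) -> int:
    """Add two 31-bit values mod 2^31 - 1 by folding the carry bit back in."""
    c = a + b
    return (c & MASK31) + (c >> 31)


def rotl31(x: int, k: int) -> int:
    """Cyclic left shift of a 31-bit value, i.e. multiplication by 2^k mod p."""
    return ((x << k) | (x >> (31 - k))) & MASK31


def zuc_feedback(s: list) -> int:
    """v = 2^15 s15 + 2^17 s13 + 2^21 s10 + 2^20 s4 + (1 + 2^8) s0 mod p."""
    terms = (rotl31(s[15], 15), rotl31(s[13], 17), rotl31(s[10], 21),
             rotl31(s[4], 20), rotl31(s[0], 8), s[0])
    v = terms[0]
    for term in terms[1:]:        # five modular additions in total
        v = add_mod_p(v, term)
    return v


BETA_POLY = 0xA9  # low byte of x^8 + x^7 + x^5 + x^3 + 1


def mul_beta_pow(value: int, power: int) -> int:
    """Multiply a byte by beta^power in GF(2^8), one multiplication by beta per step."""
    for _ in range(power):
        value = ((value << 1) ^ BETA_POLY) & 0xFF if value & 0x80 else value << 1
    return value


def _pack(x: int, powers: tuple) -> int:
    """Concatenate x * beta^power for each power into one 32-bit word (assumed packing)."""
    word = 0
    for power in powers:
        word = (word << 8) | mul_beta_pow(x, power)
    return word


def mul_alpha(x: int) -> int:
    return _pack(x, (23, 245, 48, 239))


def div_alpha(x: int) -> int:
    return _pack(x, (16, 39, 6, 64))
```

Each of the five calls to `add_mod_p` stands for an addition plus a carry increment, which is where the count of ten adders in a fully parallel implementation comes from. Note that `add_mod_p` may return $2^{31} - 1$, which represents zero in this field.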
3 Results
The designed processor is reconfigurable to support all the aforementioned algorithms and modes, and includes the interfacing logic to act as a slave on a system bus to receive and send the necessary data and parameters. The top level architecture is illustrated in Fig. 5. The input parameters and control data are received and stored in a register space in the bus interface, and are used in the confidentiality and integrity blocks. The control data, such as the algorithm selector, is used to realize the necessary connections in the datapath.
3.1 Cipher Block
The cipher block operates the algorithms described in the previous section, i.e., the SNOW 3G, AES, and ZUC ciphers. These are used as keystream generation cores that feed a keystream to the confidentiality or integrity block, depending on which one is selected. A single cipher algorithm is enabled at a time.
3.1.1 AES Cipher Block
This processing core implements the AES encryption algorithm. The module is built around an algorithmic state machine that follows the encryption flow detailed for AES ciphering. When the key expansion is realized as an initialization step, the resulting expanded key must be stored, occupying a total of 1408 bits. This represents a significant storage area, whether implemented as registers or as a memory. Moreover, a single-port memory would constrain the throughput of this operation due to its recursiveness (a computed key word needs previous results from the same RAM to which the result is written). In order to reduce the area of the implementation, instead of storing the expanded key, it is computed on the fly by expanding the 128-bit key into 44 words of 32 bits, named $w_n$ with $n \in \{0, ..., 43\}$, as shown in Fig. 6.

Figure 5: Block diagram of crypto processor.

Figure 6: Key Expansion diagram. [Diagram constant: Rcon = 01 02 04 08 10 20 40 80 1B 36.]

The RotWord() operation realizes a cyclic byte shift to the left. The word gets all its bytes substituted using the S-box $S_R$ during the SubWord() operation. The word derived from Rcon, depicted as $WordRcon_i$ in the figure, uses one byte from the constant vector as its most significant byte, while the remaining least significant bytes are set to 0. For the expanded key word $w_4$, the operation is: $w_4 = SubWord(RotWord(w_3)) \oplus WordRcon_0 \oplus w_0$, with $WordRcon_0 = 01000000_{\text{hex}}$.
3.1.2 Stream Cipher Block
The combined stream cipher block is capable of generating output keystreams for both SNOW 3G and ZUC algorithms. The mode of operation must be selected before executing an EEA or EIA algorithm with this block. With the algorithm selected, the LFSR is initialized accordingly, and the appropriate feedback and FSM blocks are used.
3.2 Confidentiality Block
The confidentiality block contains the necessary logic to implement the 128-EEA confidentiality algorithms using the ciphering cores as keystream generators. The confidentiality block is composed of a 32-bit XOR operation between two data streams, the plaintext/ciphertext on one side, and the keystream data on the other side. The logic is different for the AES algorithm, which uses a counter output concatenated with the confidentiality input parameters as input to AES encryption. The output is then used as a keystream. This module also synchronizes the plaintext data from the TX FIFO and the keystream data from the cipher to produce an output data stream going into the RX FIFO.
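The word-wise masking performed by the confidentiality block can be sketched as follows. This is a minimal Python illustration assuming the keystream has already been produced by one of the cipher cores; the function name and the list-of-words interface are hypothetical and do not model the FIFOs or the handshake logic.

```python
def apply_keystream(data_words: list, keystream_words: list) -> list:
    """XOR each 32-bit data word with the matching 32-bit keystream word."""
    if len(keystream_words) < len(data_words):
        raise ValueError("not enough keystream words for the given data")
    return [(d ^ k) & 0xFFFFFFFF for d, k in zip(data_words, keystream_words)]


# Illustrative values only, not reference test vectors.
plaintext = [0x12345678, 0x9ABCDEF0]
keystream = [0xFFFFFFFF, 0x0F0F0F0F]
ciphertext = apply_keystream(plaintext, keystream)
recovered = apply_keystream(ciphertext, keystream)
```

Because XOR is its own inverse, the same datapath serves for both ciphering and deciphering, so `recovered` equals `plaintext`.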
3.3 Integrity Block
The three integrity algorithms all produce a 32-bit MAC data word, and they are designed as independent modules, as the cost of merging operators has been shown to be greater than the possible gain.
3.3.1 128-EIA1 Integrity Block
This integrity block implements the 128-EIA1 SNOW 3G based algorithm. The operation is based on iteratively updating a value, EVAL, from the received message M and five keystream words produced by the SNOW 3G cipher ($z_1$ to $z_5$). The detailed hardware logic of this operation is illustrated in Fig. 7. In the diagram, P is constructed from the concatenation of keystream words $z_1$ and $z_2$, and Q from the concatenation of $z_3$ and $z_4$. The operation EVAL_M, detailed in [5], consumes the message in blocks of 64 bits to update the value EVAL. This computation is performed iteratively until all input message blocks of 64 bits except the last are consumed. The result is then XOR-ed with the last block of the message. This block, $M_{D-1}$, is padded with zeros if the message length is not a multiple of 64 bits. The MUL $GF(2^{64})$ operation performs a multiplication in $GF(2^{64})$ between $EVAL\_M(M_i, P) \oplus M_{D-1}$ and the concatenated keystream words Q.

Figure 7: Diagram of 128-EIA1. [Diagram: message blocks M0 to MD-2 and P feed EVAL_M; the result is XOR-ed with MD-1 and multiplied by Q in MUL GF(2^64); of the output bits e0 to e63, the upper half e0 to e31 is XOR-ed with z5 to give MAC-I and the lower half e32 to e63 is discarded.]

Finally, the last 32-bit keystream word from the cipher is XOR-ed with the most significant 32 bits of the final value of EVAL. This result is the 32-bit MAC word.
3.3.2 128-EIA2 Integrity Block
The 128-EIA2 integrity algorithm is based on the CMAC mode of the AES algorithm. This block is mainly composed of XOR operations and an FSM that sequences the data communication with the AES cipher.
3.3.3 128-EIA3 Integrity Block
This circuit recursively XORs words from the generated ZUC keystream into an accumulator T whenever the message bit at the iteration index has a value of '1'. The architecture implemented for this block is shown in Fig. 8. An FSM controls the reception of the next message and keystream words, and the process is repeated with the new data. Once the last message word has been received and used, T is XOR-ed two more times with keystream words, unconditionally. The last value of T is the MAC data that will be sent.
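The accumulation described above can be sketched in Python as follows. The sketch follows the 128-EIA3 definition in [8] as we understand it: the keystream is treated as a bit string, the word XOR-ed at message bit i is the 32-bit window starting at keystream bit i, and the two final unconditional XORs use the window at the message length and the last keystream word. The function names and the byte-oriented message interface are hypothetical, and the keystream is assumed to have been generated beforehand by the ZUC core.

```python
def keystream_window(words: list, bit_index: int) -> int:
    """Return the 32-bit word starting at the given bit of the keystream (MSB first)."""
    q, r = divmod(bit_index, 32)
    if r == 0:
        return words[q]
    return ((words[q] << r) | (words[q + 1] >> (32 - r))) & 0xFFFFFFFF


def eia3_mac(message: bytes, length_bits: int, keystream_words: list) -> int:
    """Accumulate keystream windows for every message bit equal to 1, then finalize."""
    n_words = -(-length_bits // 32) + 2          # ceil(length / 32) + 2 keystream words
    if len(keystream_words) < n_words:
        raise ValueError("not enough keystream words")
    t = 0
    for i in range(length_bits):
        message_bit = (message[i // 8] >> (7 - i % 8)) & 1
        if message_bit:
            t ^= keystream_window(keystream_words, i)
    t ^= keystream_window(keystream_words, length_bits)   # first unconditional XOR
    return t ^ keystream_words[n_words - 1]               # second unconditional XOR
```

With an all-zero message only the two unconditional XORs contribute, which gives a quick sanity check of the control flow independent of the cipher itself.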
3.4 Techniques for Power and Area Reduction
This section discusses different area reduction techniques, which are crucial when targeting NB-IoT applications. The High Level Synthesis (HLS) design flow is employed for exploring the design trade-offs of different architectural solutions.
3.4.1 Exploring algorithm similarity for hardware reuse
The previous sections show that the two stream cipher algorithms implemented are very similar in structure, which points to potential hardware reuse. The similarities in operation include:
1) The control logic used in SNOW 3G and ZUC ciphering, as well as the handshake protocol with external modules, is the same.
2) A shift register of length 16 is used in both algorithms. If a 32-bit shift register is used, it can be shared by both algorithms.
3) The finite-state machines used in both algorithms share common operations.
Due to these similarities, a combined architecture is proposed in Fig. 9.

Figure 8: Diagram of UPDATE T. [Diagram: 32-bit keystream words w_j and w_j+1, the message bit M, a selector with inputs 1 and 0, and the 32-bit accumulator T.]

Figure 9: Combined architecture for SNOW 3G and ZUC.

This stream cipher core shares a common 32-bit shift register and control logic between the SNOW 3G and ZUC ciphering algorithms. The size of the FSM is mainly affected by the Look-Up Tables (LUTs) used to perform the non-linear substitutions. These resources are independent for each algorithm and, therefore, a common FSM does not lead to improved results due to the multiplexing overhead of sharing the hardware resources. Finally, the feedback loops perform different operations and are thus kept independent. The proposed combined architecture exhibits an improvement of 22% in the total area with respect to having two separate processing cores.
3.4.2 Architectural optimizations for S-box implementation
Rijndael S-box $S_R$ is used in both AES and SNOW 3G algorithms. It was seen in Section 2.3.1 how SNOW 3G needs four iterations of this substitution box internally in the FSM to compute $S_1$. The AES cipher must perform the SubBytes operation, consisting of an S-box $S_R$ substitution, on each byte of the state matrix, resulting in a total of 16 calculations. In AES, for all substitutions to be done in parallel, 16 instances of S-box $S_R$ are needed, leading to 75% of the core being occupied by the substitution boxes. Using a single substitution box instead reduces the gate count and area footprint of the design, at the cost of more clock cycles to perform the substitutions.

The FSMs of SNOW 3G and ZUC both contain 32 × 32 substitution boxes (S-boxes), which contribute a large area overhead in these blocks. The S-boxes can be implemented following different approaches, generally using a memory, a Look-Up Table (LUT) or combinational logic. In the proposed design, SNOW 3G uses one instance of Rijndael S-box $S_R$ and one instance of S-box $S_Q$, which leads to the generation of one output every four clock cycles. Similarly, one instance each of S-box $S_0$ and S-box $S_1$ is used in ZUC's FSM in order to reduce the area footprint of this module. An on-the-fly implementation of ZUC S-box $S_0$, presenting a 42% area improvement with respect to a LUT approach, is used to reduce hardware resources and was presented in Fig. 4.

It follows from the previous optimization that the throughput of the stream cipher block is divided by 4 in comparison to the maximum throughput achievable. By extension, four clock cycles are now available to compute the feedback operations in the SNOW 3G and ZUC cipher algorithms.
3.4.3 Implementation of SNOW 3G feedback field operations
The LUT implementation is efficient in terms of processing time, as it depends only on the latency to obtain a value from a memory cell. The drawback is the area utilization, which is particularly sensitive to the input and output data size. In contrast, the combinational logic does not scale as much as the LUT implementation with the data size, but it introduces an increased processing time that mostly depends on the computation complexity. Based on the analysis above and the 4 clock cycle budget available for these operations, we have applied different realization schemes for the SNOW 3G feedback operators. It was seen in Section 2.3.3 that the SNOW 3G feedback presents two field operators, Mulα and Divα, usually implemented as large LUTs (8-bit input, 32-bit output) due to computation time constraints.

The HLS flow has enabled the rapid exploration of different architectures. The results of this exploration are shown in Table 1. We observe how the proposed implementation, marked as Solution 5, presents the smallest area footprint. This solution implements the Mulα and Divα computations in a time multiplexed architecture, and presents a 41% improvement with respect to a generic LUT approach, listed as Solution 6 in the table.

Table 1: Different implementations of Mulα and Divα

Sol.  Implementation          Area (normalized)  Latency (cycles)
1     Fully unrolled          2.1                1
2     Pipelined (I stage)     1.0                123
3     Pipelined (II stages)   1.0                62
4     Pipelined (IV stages)   1.6                4
5     Time multiplexed        1.0                4
6     LUT                     1.7                1

3.4.4 ZUC feedback loop implementation
With the design choices made for the FSM of the SNOW 3G and ZUC ciphering algorithms, the LFSR must be clocked once every four clock cycles in order to be synchronized with the FSM processing latency. Thus, the feedback of the LFSR can be optimized in area at the expense of a higher latency in the feedback. A time multiplexed architecture consisting of two 32-bit adders and two 32-bit registers is proposed to implement the chain of additions. This solution has been shown to lead to a 52% improvement in area with respect to an adder tree based approach, where 10 adders are necessary to compute the feedback function.
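As a quick consistency check of the four-cycle schedule, the stream cipher throughput can be worked out under two assumptions: each LFSR clock yields one 32-bit keystream word, as is the case for both SNOW 3G and ZUC, and the clock frequency is the 100 MHz quoted in the abstract.

```latex
\[
T_{\text{stream}}
  = \frac{32\ \text{bit}}{4\ \text{cycles}} \times 100 \times 10^{6}\ \frac{\text{cycles}}{\text{s}}
  = 800\ \text{Mbit/s}
\]
```

This agrees with the 800 Mbps stream cipher figure reported in the abstract.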
3.5 Power Savings
The architectural changes covered in the previous section, which have led to a decreased area footprint in the design, also have an impact on the overall power consumption. The results presented in this section have been obtained using PowerPro RTL Low-Power Platform power estimation, which is integrated in Catapult HLS.

The LFSR of SNOW 3G is composed of a total of 512 bits, whereas the ZUC LFSR consists of a total of 496 bits. This means that a total of 1008 bits are being shifted in the stream ciphers alone, every clock cycle. It is therefore expected that a combined shift register clocked at one fourth of the original rate will lead to significant power savings. Table 2 presents the results of this analysis. In the table, the LFSR block's power consumption is compared for different architectures (non combined vs. combined) and update rates (every clock cycle vs. every four clock cycles).

Table 2: LFSR power consumption comparison

Block                                  Latency (cycles)  Total Power (µW)
Independent LFSR (SNOW 3G and ZUC)     1                 307.7
Combined LFSR                          4                 94.5 (-69.3%)

On the other hand, time multiplexing the operations in the FSM and feedback loops of the SNOW 3G and ZUC ciphering algorithms is also expected to lower the power consumption, since it decreases the number of hardware resources needed to perform the various arithmetic operations in these sub-blocks.

Table 3 shows the power consumption analysis of the various components of the SNOW 3G cipher. The results shown are for a base implementation where the SNOW 3G stream cipher generates one output per clock cycle. To achieve this performance, the Mulα and Divα operators are implemented as LUTs with 8-bit input and 32-bit output, and the SNOW 3G FSM in Table 3 is composed of 4 instances of S-box $S_R$ and 4 instances of S-box $S_Q$.

Table 3: Estimated power consumption of different blocks in SNOW 3G cipher (Latency = 1)

Block              Latency (cycles)  Total Power (µW)
SNOW 3G LFSR       1                 155.8
SNOW 3G feedback   1                 252.6
SNOW 3G FSM        1                 1158.9

Table 4 shows the power consumption analysis for the implementation proposed in this work. We observe how clocking the LFSR at one fourth of the original rate leads to large savings in power consumption. The SNOW 3G feedback is time multiplexed, which reduces the number of hardware resources needed to implement the feedback operation. This also translates into 20.6% power savings. Finally, Section 3.4.2 showed how the 4 clock cycle budget to compute the FSM output enabled us to reduce the number of substitution boxes used in the design. The power consumption of this implementation is also shown in Table 4.

Table 4: Estimated power consumption of different blocks in SNOW 3G cipher (Latency = 4)

Block              Latency (cycles)  Total Power (µW)
SNOW 3G LFSR       4                 92.5 (-40.6%)
SNOW 3G feedback   4                 200.6 (-20.6%)
SNOW 3G FSM        4                 517.5 (-55.3%)

Combining the results in the table, the overall power consumption of the SNOW 3G cipher block proposed in this work is estimated to be 811 µW. The base design, shown in Table 3, where the LFSR is clocked every cycle, has a total power consumption estimated at 1567 µW. The power consumption is therefore decreased by 48% in the SNOW 3G cipher with the proposed architecture. Similar results are observed for the ZUC algorithm, where both the FSM and the LFSR present similar power saving figures to their SNOW 3G counterparts, due to the architectural similarities between both ciphers shown in Section 3.4.1. On the other hand, the implementation proposed in Section 3.4.4 for the ZUC cipher feedback loop reduces the power consumption from 527 µW to 280 µW (46.8% savings) using the proposed time multiplexing techniques applied to the adder tree.
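The totals and percentages quoted for the SNOW 3G cipher can be reproduced directly from the per-block figures in Tables 3 and 4. The short Python sketch below only restates that arithmetic; the dictionary and function names are illustrative.

```python
# Per-block power in microwatts, copied from Table 3 (latency 1) and Table 4 (latency 4).
base_uw = {"LFSR": 155.8, "feedback": 252.6, "FSM": 1158.9}
proposed_uw = {"LFSR": 92.5, "feedback": 200.6, "FSM": 517.5}


def saving_percent(before: float, after: float) -> float:
    """Relative power reduction, in percent."""
    return 100.0 * (before - after) / before


base_total = sum(base_uw.values())          # about 1567 uW
proposed_total = sum(proposed_uw.values())  # about 811 uW

per_block = {name: saving_percent(base_uw[name], proposed_uw[name]) for name in base_uw}
overall = saving_percent(base_total, proposed_total)
```

Rounded to one decimal place, the per-block savings come out as 40.6%, 20.6% and 55.3%, and the overall saving rounds to 48%, matching the values reported above.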
4
Discussion
The crypto processor has been implemented in a 65-nm CMOS technology. The total gate count of the design is 53.6 kGE. The proposed solution is compared with previous work on security cores in Table 5. The results presented in the table are in kilo gate equivalent units (kGE), and therefore makes these comparisons independent of the manufacturing technology. It is noticed that we obtained an area reduction of 5.3% for the AES cipher module versus the solution presented in [11] and 5.1% for the Stream Cipher module versus the work [12]. In [15] different resource sharing techniques are analyzed for the stream ciphers SNOW 3G and ZUC. In the analyzed paper, the structural similarities identified are in line with the ones identified in our work. A different architecture is proposed for the HiPAcc-LTE, where the resulting design area for the combined ZUC and SNOW 3G accelerator is similar to the area obtained in our work. The power estimation for the HiPAcc-LTE is 17.32mW for the SNOW 3G cipher, selecting the best case scenario for the technology used (1.32 V and -40◦ C, minimal power consumption). Our proposed solution consumes a total estimate of 0.81 mW. However, the HiPAcc-LTE targets a high performance implementation, with a target frequency of 900 MHz, and therefore runs at a higher clock frequency. After applying a correction factor for the frequency scaling (P ∝ 1/f ), the HiPAcc-LTE will consume an estimate of 1.92 mW at 16
100 MHz, more than double compared to the solution proposed in this work. The SNOW 3G hardware implementation proposed in [16] will achieve a throughput of 7968 Mbps and a total area of 25016 kGE. Once again, the main design goal of the work analyzed is performance optimization, leading to higher throughput at the expense of 50% area increase in comparison to the combined stream cipher (SNOW 3G and ZUC cipher) proposed in our solution. For a complete Crypto Processor, very few existing works implementing all the 3GPP’s security algorithms for LTE were found in the literature. A similar IP from Synopsys is available [13]. When we compare their design with our solution, we notice that our proposal is in the lower range of their Crypto Processor. However, they do not clearly define what is included in their smaller area result, while our solution includes the processing cores as well the FIFOs and bus interface. Our proposed implementation is close to 42% smaller than solution [14]. In their work they have developed a high throughput ApplicationSpecific Instruction-set Processor (ASIP). The approach of this work differs from the one proposed in this document, given that they use software instructions to control the ciphering while our design is purely a hardware module requiring few parameters for operating. Table 5: Comparison with other work
Block type   Implementation                 Area (kGE)   Technology
Sub block    AES (this work)                7.1          65 nm
Sub block    AES [11]                       7.5          90 nm
Sub block    Stream Cipher (this work)      16.6         65 nm
Sub block    Stream Cipher [12]             17.5         90 nm
Sub block    Stream Cipher [15]             16-18        65 nm
Full block   Crypto Processor (this work)   53.6         65 nm
Full block   Crypto Processor [13]          40–150       –
Full block   Crypto Processor [14]          128          28 nm
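The sub-block area savings and the frequency-scaled power figure quoted in this comparison can be reproduced from the reported numbers. The following Python sketch is only an illustration: the helper names are hypothetical, and it assumes that power scales linearly with clock frequency, ignoring leakage and any difference in supply voltage or technology.

```python
def area_reduction_pct(ours_kge: float, other_kge: float) -> float:
    """Relative area saving of our block versus another design, in percent."""
    return 100.0 * (other_kge - ours_kge) / other_kge


def scale_power_mw(power_mw: float, f_from_mhz: float, f_to_mhz: float) -> float:
    """Rescale a power figure to another clock, assuming P is proportional to f."""
    return power_mw * f_to_mhz / f_from_mhz


# Area figures (kGE) taken from Table 5.
aes_saving = area_reduction_pct(7.1, 7.5)        # AES: this work vs. [11]
stream_saving = area_reduction_pct(16.6, 17.5)   # stream cipher: this work vs. [12]

# HiPAcc-LTE SNOW 3G estimate: 17.32 mW at 900 MHz, rescaled to 100 MHz.
hipacc_at_100mhz = scale_power_mw(17.32, 900.0, 100.0)

# Ratio against the 0.81 mW estimated for this work.
ratio_vs_this_work = hipacc_at_100mhz / 0.81

print(f"AES area saving:            {aes_saving:.1f} %")
print(f"Stream cipher area saving:  {stream_saving:.1f} %")
print(f"HiPAcc-LTE power @ 100 MHz: {hipacc_at_100mhz:.2f} mW")
print(f"Ratio vs. this work:        {ratio_vs_this_work:.1f}x")
```

Under these assumptions the sketch yields roughly 5.3% and 5.1% for the two area savings and about 1.92 mW for the rescaled HiPAcc-LTE power, matching the values in the text.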
5 High-Level Synthesis Flow
The design has been realised using a High-Level Synthesis (HLS) flow. The hardware has been written in C++/SystemC using the Algorithmic C libraries, which provide bit-accurate data types. These sources are processed by Catapult HLS from Mentor [17] to generate the RTL description in either Verilog or VHDL. This design flow has allowed a wide design space exploration, with easy comparison of area and performance between different architectures. The goal was to optimize the area while keeping the performance compliant with the NB-IoT specifications.
Having the sources in C++ and SystemC made it possible to verify the design at three different levels. The block was first tested using a software testbench. The SystemC model can then be imported into a virtual modeling environment for verification. We have used SoC Designer [18] to model a system comprising a CPU, memories, DMA and bus interfaces. Thanks to its cycle-accurate modeling, this environment allowed us to verify the functionality as well as the exact system performance, both at block and system level. This design flow allows faster development as well as faster simulation run-time compared to the classical RTL flow. Finally, an RTL testbench can also be used to verify the equivalence of the generated RTL with the SystemC model. The functionality of the design was tested at all levels against reference data provided by the 3GPP specifications.
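Verifying at several levels against shared reference data amounts to running the same vectors through each model of the design and checking that every model reproduces the expected output. The Python sketch below illustrates only that idea and is not the authors' testbench, which is built around C++/SystemC and RTL. Every name in it (`RefVector`, `check_models`, `toy_cipher`, `buggy_cipher`) is a hypothetical placeholder, the XOR "cipher" merely stands in for the real algorithms, and the vector is made up rather than taken from the 3GPP reference data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A model maps (key, plaintext) to ciphertext bytes.
Model = Callable[[bytes, bytes], bytes]


@dataclass(frozen=True)
class RefVector:
    name: str
    key: bytes
    plaintext: bytes
    expected: bytes


def toy_cipher(key: bytes, plaintext: bytes) -> bytes:
    """Placeholder model: XOR with a repeating key (NOT a 3GPP algorithm)."""
    return bytes(p ^ key[i % len(key)] for i, p in enumerate(plaintext))


def buggy_cipher(key: bytes, plaintext: bytes) -> bytes:
    """Placeholder for a faulty model: uses only the first key byte."""
    return bytes(p ^ key[0] for p in plaintext)


def check_models(models: Dict[str, Model], vectors: List[RefVector]) -> List[str]:
    """Run every vector through every model and describe each mismatch."""
    failures = []
    for level, model in models.items():
        for vec in vectors:
            got = model(vec.key, vec.plaintext)
            if got != vec.expected:
                failures.append(
                    f"{level}/{vec.name}: got {got.hex()}, expected {vec.expected.hex()}"
                )
    return failures


# Made-up vector, for illustration only.
vectors = [
    RefVector(
        name="v0",
        key=bytes.fromhex("0ff0"),
        plaintext=bytes.fromhex("12345678"),
        expected=bytes.fromhex("1dc45988"),
    )
]

# In a real flow each entry would wrap a different simulator; here all three
# levels reuse the same placeholder function.
models = {
    "software testbench": toy_cipher,
    "virtual platform": toy_cipher,
    "RTL simulation": toy_cipher,
}

failures = check_models(models, vectors)
print("all levels match" if not failures else "\n".join(failures))
```

Keeping the vectors in one place and passing each level in as a callable means a mismatch is reported together with the level at which it appears, which is the practical benefit of sharing reference data across the flow.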
6 Conclusion
This paper presents a crypto processor implementing the 3GPP security algorithms and targeting NB-IoT applications. The design employs different techniques for both architecture- and circuit-level optimization in order to minimize area. The focus was placed on this criterion because it is critical in the IoT field, which requires low cost and low power consumption while the performance requirement remains low. A smaller area means both a lower chip cost and lower power consumption, since most of the power comes from leakage during idle times, which make up the majority of the operating time for this application. The two stream ciphers, SNOW 3G and ZUC, are combined by exploiting algorithm-level similarities, resulting in a significant reduction in hardware cost. Moreover, the processor integrates reconfigurability and a bus interface so that it can be used in an IoT SoC design.
References
[1] Ericsson, “Ericsson Mobility Report,” www.ericsson.com/mobility-report, 2016.
[2] R. Ratasuk, N. Mangalvedhe, Y. Zhang, M. Robert, and J. P. Koskinen, “Overview of narrowband IoT in LTE Rel-13,” 2016 IEEE Conference on Standards for Communications and Networking (CSCN), pp. 1-7, Oct. 2016.
[3] A. Gielata, P. Russek, and K. Wiatr, “AES hardware implementation in FPGA for algorithm acceleration purpose,” 2008 International Conference on Signals and Electronic Systems, pp. 137-140, Sep. 2008.
[4] 3GPP, “Technical Specification TS 33.401: Security Architecture,” www.3gpp.org/DynaReport/33401.htm.
[5] ETSI/SAGE, “Document 1: UEA2 and UIA2 Specification,” www.gsma.com/aboutus/wp-content/uploads/2014/12/uea2uia2d1v21.pdf
[6] NIST, “Advanced Encryption Standard (AES),” http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf, 2001.
[7] 3GPP, “Technical Specification TS 35.215: Document 1: UEA2 and UIA2 specifications,” www.3gpp.org/DynaReport/35215.htm.
[8] 3GPP, “Technical Specification TS 35.221: Document 1: EEA3 and EIA3 specifications,” www.3gpp.org/DynaReport/35221.htm.
[9] 3GPP, “Technical Report TR 35.924: Document 4: Design and Evaluation Report,” www.3gpp.org/DynaReport/35924.htm.
[10] 3GPP, “Technical Specification TS 35.222: Document 2: ZUC specification,” www.3gpp.org/DynaReport/35222.htm.
[11] S. Hessel, D. Szczesny, N. Lohmann, A. Bilgic, and J. Hausner, “Implementation and Benchmarking of Hardware Accelerators for Ciphering in LTE Terminals,” GLOBECOM 2009 - 2009 IEEE Global Telecommunications Conference, pp. 1-7, Nov. 2009.
[12] S. Traboulsi, V. Frascolla, N. Pohl, J. Hausner, and A. Bilgic, “A versatile low-power ciphering and integrity protection unit for LTE-advanced mobile devices,” 10th IEEE International NEWCAS Conference, vol. 46, no. 7, pp. 317-320, Jun. 2012.
[13] Synopsys, “DesignWare LTE Security Protocol Accelerator,” https://www.synopsys.com/dw/ipdir.php?ds=security-protocol-acceleratorlte.
[14] Y. Huo and D. Liu, “High-throughput area-efficient processor for 3GPP LTE cryptographic core algorithms,” 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 210-210, 2017.
[15] Sourav Sen Gupta, Anupam Chattopadhyay, and Ayesha Khalid, “Designing Integrated Accelerator for Stream Ciphers with Structural Similarities,” 2012.
[16] Paris Kitsos, George Selimis, and Odysseas Koufopavlou, “High performance ASIC implementation of the SNOW 3G stream cipher,” 2008.
[17] Mentor, A Siemens Business, “Catapult HLS,” https://www.mentor.com/hls-lp/catapult-high-level-synthesis/.
[18] Arm, “Arm SoC Designer,” https://developer.arm.com/products/systemdesign/cycle-models/arm-soc-designer.
Liang Liu
Liang Liu is an Associate Professor and a Docent at the Department of Electrical and Information Technology (EIT), Lund University. He received his B.S. degree from the Department of Electronics Engineering (2005) and his Ph.D. degree from the Department of Micro-electronics (2010) at Fudan University (Shanghai, P.R. China). From Jan. 2010 to April 2010, he was a visiting scholar at the Electrical, Computer and Systems Engineering Department, Rensselaer Polytechnic Institute (New York, USA). He joined Lund University as a post-doc in 2010 and was an Assistant Professor in 2014-2015. In 2015, he received the title of Docent. His research interests include wireless communication systems and digital integrated circuit design. Liang is active in several EU and Swedish projects, including FP7 MAMMOET, VINNOVA SoS, SSF HiPEC, and SSF DARE.
Luis Cavo
Luis Cavo holds a B.S. in Telecommunications Engineering (2015) from Universidad Politécnica de Madrid (UPM). In 2018, he received his M.Sc. degree in System on Chip from the Department of Electrical and Information Technology (LTH) of Lund University. His research interests include integrated circuit design, with a special interest in architecture optimization and hardware-software co-design. He has worked on numerous projects as a digital IC designer and embedded hardware developer in the fields of communication systems and GNSS technology.
Sébastien Fuhrmann
Sébastien Fuhrmann is a Hardware Engineer. He received his M.Sc. in Architect and Integrator of Electronics Systems (2009) from École Supérieure d’Ingénieurs en Électrotechnique et Électronique (ESIEE Paris). He joined the Department of Electrical and Information Technology (LTH) at Lund University to study the System on Chip Master's program (2015). He has worked on several projects in industry and research, including at ST-Ericsson, CNRS and Sagem. His interests are in integrated circuit design, hardware performance optimization and complex systems architecture.
Conflict of Interest
No conflict of interest exists in this manuscript.